Meta allegedly used 'crude tactics' to close in on OpenAI’s 2-year lead building AI uncontested — Sam Altman admitted creating ChatGPT without copyrighted data is impossible

Text of LlaMA by Meta on a phone's display resting on a laptop's keyboard.

Meta reportedly resorted to "crude tactics" to narrow OpenAI’s 2-year AI lead. (Image credit: Getty Images | SOPA)

Meta's lagging AI efforts are making news again. Microsoft CEO Satya Nadella recently admitted that OpenAI had a 2-year runway in the AI race to work uncontested and build ChatGPT. While other top AI labs, such as Anthropic and Google, are swiftly closing the gap, Meta is seemingly struggling to keep up.

According to internal Meta communications disclosed during a major copyright lawsuit, the company allegedly used copyrighted content to train its AI models and seemingly tried to cover its tracks to avoid copyright infringement claims (via The Verge).

The alleged tactics were meant to help the company catch up with OpenAI's rapid progress in the AI landscape. An email sent to Meta AI researcher Hugo Touvron by the company's VP of gen AI said the company's goal “needs to be GPT4,” which would involve learning “how to build frontier and win this race.”

However, the finer details of the Facebook maker's plans to achieve these goals reportedly involved the book piracy site Library Genesis (LibGen), whose data would be used to train its models.

The Verge's damning report further revealed another email from Meta's Director of Product, Sony Theakanath, to Joelle Pineau, VP of AI Research, seeking clarity on whether to use LibGen's data only internally, for benchmarks included in a blog post, or to train a model on it. In the email, Theakanath indicated that Gen AI had been approved to use LibGen for Llama 3 with several mitigations, including removing data labeled as pirated or stolen and not disclosing that the model was trained using data from the site.

According to Theakanath, “Libgen is essential to meet SOTA [state-of-the-art] numbers.” He further indicated that “it is known that OpenAI and Mistral are using the library for their models (through word of mouth).” The email said the issue had been escalated to an executive identified as MZ, presumably Meta CEO Mark Zuckerberg.

The email also highlighted the policy risks of training AI models on copyrighted content, including a regulatory response and intervention following media coverage of Meta's alleged copyright infringement. “This may undermine our negotiating position with regulators on these issues,” added Theakanath.

Meta reportedly took steps to cover its tracks after using LibGen's data to train its AI models, including removing copyright headers and document identifiers such as the copyright symbol. The document also disclosed employee comments about further obscuring the data's origin, including removing metadata “to avoid potential legal complications.”

Copyright infringement is seemingly crucial for AI model training

A photo taken on February 26, 2024 shows the logo of the ChatGPT application developed by US artificial intelligence research organization OpenAI on a smartphone screen.

(Image credit: Getty Images | KIRILL KUDRYAVTSEV)

Microsoft and OpenAI have been caught up in numerous copyright infringement lawsuits. While some of these cases are still in court, OpenAI CEO Sam Altman has admitted that training AI models without copyrighted content is virtually impossible. He further indicated that almost everything on the internet is copyrighted and characterized the use of copyrighted content to train AI models as fair use. He argued that copyright law doesn't categorically prohibit training AI models on copyrighted content.

More recently, reports indicated that top AI labs, including OpenAI and Anthropic, are struggling to develop advanced AI systems due to a lack of high-quality content. However, leaders in the AI landscape, including Sam Altman and former Google CEO Eric Schmidt, have disputed the claims, saying there is no evidence that scaling laws have begun to stall; "there's no wall."

Kevin Okemwa is a seasoned tech journalist based in Nairobi, Kenya with lots of experience covering the latest trends and developments in the industry at Windows Central. With a passion for innovation and a keen eye for detail, he has written for leading publications such as OnMSFT, MakeUseOf, and Windows Report, providing insightful analysis and breaking news on everything revolving around the Microsoft ecosystem. You'll also catch him occasionally contributing at iMore about Apple and AI. While AFK and not busy following the ever-emerging trends in tech, you can find him exploring the world or listening to music.