Is DeepSeek's AI a brand-new secondhand ChatGPT? A "unanimous jury" rules its AI-generated text matches OpenAI models by 74%

The logos of OpenAI and DeepSeek artificial intelligence apps on mobile phones.
DeepSeek and ChatGPT are far more similar than first thought. (Image credit: Getty Images | Bloomberg)

Chinese AI startup DeepSeek burst onto the AI scene earlier this year with its ultra-cost-effective, V3-powered R1 model. It raised concerns among investors, especially after the model surpassed OpenAI's o1 reasoning model across a wide range of benchmarks, including math, science, and coding, at a fraction of the cost.

While DeepSeek researchers claimed the company spent approximately $6 million to train its cost-effective model, multiple reports suggest that it cut corners by using Microsoft and OpenAI's copyrighted content to train its model.

Another report claimed that the Chinese AI startup spent up to $1.6 billion on hardware, including 50,000 NVIDIA Hopper GPUs. OpenAI also lodged a complaint, alleging that DeepSeek used OpenAI's models to train its cost-effective AI model.

The ChatGPT maker claimed DeepSeek used "distillation" to train its R1 model. For context, distillation is the process whereby a company (in this case, DeepSeek) leverages the output of a preexisting model (in this case, OpenAI's) to train a new model.

As such, the company reduces the exorbitant amount of money required to develop and train an AI model. And as it now turns out, OpenAI's accusations may hold some water.
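To make the idea of distillation concrete, here is a deliberately tiny sketch in Python. It is not how DeepSeek, OpenAI, or any real lab trains a language model. The `teacher` function is a hypothetical stand-in for an expensive existing model, and the "student" is a two-parameter linear model that learns only from the teacher's outputs, never from human-labeled data.

```python
import random


def teacher(x: float) -> float:
    """Hypothetical stand-in for an expensive, already-trained model."""
    return 3.0 * x + 1.0


def distill(num_samples: int = 200, epochs: int = 500,
            lr: float = 0.1, seed: int = 0) -> tuple[float, float]:
    """Fit a small student (y = w*x + b) to the teacher's outputs."""
    rng = random.Random(seed)
    inputs = [rng.uniform(-1.0, 1.0) for _ in range(num_samples)]
    # The training targets come from querying the teacher, not from labels.
    targets = [teacher(x) for x in inputs]

    w, b = 0.0, 0.0
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for x, y in zip(inputs, targets):
            err = (w * x + b) - y      # student prediction minus teacher output
            grad_w += 2.0 * err * x    # gradient of squared error w.r.t. w
            grad_b += 2.0 * err        # gradient of squared error w.r.t. b
        w -= lr * grad_w / num_samples
        b -= lr * grad_b / num_samples
    return w, b


if __name__ == "__main__":
    w, b = distill()
    print(f"student learned w={w:.3f}, b={b:.3f}")
```

The student ends up mimicking the teacher closely without anyone paying to build the teacher again, which is the cost saving the accusation hinges on.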

A new study by AI detection firm Copyleaks finds that DeepSeek's AI-generated outputs are reminiscent of OpenAI's ChatGPT. Perhaps more concerning, the study's findings point to a 74.2% resemblance (via Forbes).

Did DeepSeek train its AI model using OpenAI's copyrighted content? The tell-tale signs suggest as much

Did DeepSeek copy OpenAI's homework? (Image credit: Getty Images | Bloomberg)

Copyleaks uses screening tech and algorithmic classifiers to identify text generated by AI models. For this specific study, the classifiers unanimously voted that DeepSeek's outputs were generated using OpenAI's models.

Interestingly, the AI detection firm has used this approach to identify text generated by other AI models, including OpenAI, Claude, Gemini, and Llama, each of which it found to have a distinct style. Requiring a unanimous vote among classifiers is standard practice for reducing false positives.

Shai Nisan, head of data science at Copyleaks, indicated:
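As a rough illustration of what a "unanimous jury" means in practice, the Python sketch below attributes a text to a model only when every classifier agrees, then reports the share of texts attributed to one model. This is not Copyleaks' implementation: the three toy classifiers, the labels `model_a` and `model_b`, and the assumption that a headline percentage is simply the share of unanimously attributed texts are all hypothetical.

```python
from typing import Callable, Optional, Sequence

# A classifier takes a text and returns the name of the model it suspects.
Classifier = Callable[[str], str]


def unanimous_vote(text: str, classifiers: Sequence[Classifier]) -> Optional[str]:
    """Return a label only if every classifier agrees, otherwise None."""
    labels = {clf(text) for clf in classifiers}
    return labels.pop() if len(labels) == 1 else None


def match_rate(texts: Sequence[str], classifiers: Sequence[Classifier],
               target_label: str) -> float:
    """Share of texts that the whole jury attributes to target_label."""
    if not texts:
        return 0.0
    hits = sum(1 for t in texts if unanimous_vote(t, classifiers) == target_label)
    return hits / len(texts)


# Toy, hypothetical classifiers; real ones would be trained models.
def by_length(text: str) -> str:
    return "model_a" if len(text) > 20 else "model_b"


def by_keyword(text: str) -> str:
    return "model_a" if "delve" in text else "model_b"


def by_punctuation(text: str) -> str:
    return "model_a" if text.endswith(".") else "model_b"


JURY = [by_length, by_keyword, by_punctuation]

if __name__ == "__main__":
    samples = ["Let us delve into the details here.", "short", "delve"]
    print(match_rate(samples, JURY, "model_a"))
```

Because a single dissenting classifier blocks the attribution, the jury stays silent on borderline texts, which is how the approach trades some coverage for fewer false positives.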

“Our research utilized a ‘unanimous jury’ approach and identified a strong stylistic similarity between DeepSeek and OpenAI’s models, which wasn’t found with other inspected models.”

While investors had already begun raising concerns about the large sums invested in developing and training AI models, the study's findings raise questions about DeepSeek's model training and development, and whether its approach was truly cost-effective.

As highlighted by Nisan:

“While this similarity doesn’t definitively prove or declare DeepSeek as a derivative, it does raise questions about its development. Our research specifically focuses on writing style; within that domain, the similarity to OpenAI is significant. Considering OpenAI’s market lead, our findings suggest that further investigation into DeepSeek’s architecture, training data and development process is necessary.”

So what comes next if DeepSeek is found to have copied? (Image credit: Getty Images | Anadolu)

While the study's findings suggest DeepSeek's AI-generated texts resemble OpenAI's ChatGPT by 74.2%, that doesn't necessarily mean the AI model is a carbon copy. However, it could brew more trouble for the AI startup, saddling it with IP rights and copyright infringement issues.

The fact that DeepSeek never categorically stated that it used OpenAI's models to train its own makes the situation worse, and could lead to significant legal and financial setbacks.

According to Copyleaks' Head of Data Science:

“The research strongly suggests that transparency and strong IP protections are paramount in the future of AI development and regulation. Regulators are likely to consider requiring companies to disclose detailed information about the datasets and model outputs used in training their models.”

OpenAI is no stranger to copyright infringement issues. (Image credit: Getty Images | NurPhoto)

As you may know, OpenAI and Microsoft are no strangers to the corridors of justice, especially when it comes to copyright infringement issues tied to their AI efforts. For instance, eight news publishers filed copyright infringement lawsuits against Microsoft and OpenAI in May 2024.

OpenAI CEO Sam Altman argued that copyright law doesn't categorically prohibit the use of copyrighted content for training AI models. However, the executive admitted that developing ChatGPT-like tools without copyrighted content is virtually impossible.

To that end, with the rapid emergence of AI-powered tools, copyright infringement is seemingly trapped in a grey area, making it difficult to establish where the line falls between legitimate use and AI firms outright stealing content from publishers and other internet sources.

Kevin Okemwa
Contributor

Kevin Okemwa is a seasoned tech journalist based in Nairobi, Kenya with lots of experience covering the latest trends and developments in the industry at Windows Central. With a passion for innovation and a keen eye for detail, he has written for leading publications such as OnMSFT, MakeUseOf, and Windows Report, providing insightful analysis and breaking news on everything revolving around the Microsoft ecosystem. You'll also catch him occasionally contributing at iMore about Apple and AI. While AFK and not busy following the ever-emerging trends in tech, you can find him exploring the world or listening to music.
