Sam Altman indicated it's impossible to create ChatGPT without copyrighted material, but a new study claims 57% of the content on the internet is AI-generated and is subtly killing quality search results

ChatGPT on a Google Pixel 7 Pro (Image credit: Ben Wilson | Windows Central)

What you need to know

  • A new study suggests that more than 57% of the content available on the internet is AI-generated or AI-translated.
  • AI tools like Copilot and ChatGPT depend on information from the internet for training, but the growing share of AI-generated content online limits their scope, leading to inaccurate responses and misinformation.
  • If copyright law ends up prohibiting the training of AI models on copyrighted content, chatbot responses will likely become even less accurate.

With the rapid adoption of generative AI, it's becoming increasingly difficult to tell what's real. From images and videos to text, AI tools are arguably at their peak and can generate sophisticated outputs from simple prompts. 

There's been a constant battle between publishers and the companies behind these AI tools over copyright infringement. While OpenAI CEO Sam Altman admits it's impossible to create tools like ChatGPT without copyrighted content, copyright law doesn't prohibit the use of such content to train AI models.

A study by Amazon Web Services (AWS) researchers suggests 57% of content published online is AI-generated or translated using an AI algorithm (via Forbes). Researchers from Cambridge and Oxford claim that the growing volume of AI-generated content, combined with AI tools' overreliance on that same content for training, can only lead to one result: low-quality responses to queries.

Per the study, AI-generated responses to queries degraded in value and accuracy with each successive round of training. According to Dr. Ilia Shumailov from the University of Oxford:

“It is surprising how fast model collapse kicks in and how elusive it can be. At first, it affects minority data—data that is badly represented. It then affects diversity of the outputs and the variance reduces. Sometimes, you observe small improvement for the majority data, that hides away the degradation in performance on minority data. Model collapse can have serious consequences.”

According to the researchers, the degradation in the quality of chatbot responses stems from a cyclical overdose of AI-generated content. As you may know, AI models depend on information from the internet for training. If that information is itself AI-generated and inaccurate, training becomes ineffective, and the models churn out wrong answers and misinformation.

AI chatbots are lying to themselves

ChatGPT and Microsoft Logo (Image credit: Daniel Rubino)

The researchers dug deeper in an attempt to uncover the root cause of the issue. Right off the bat, it can be attributed to the rise of AI-generated articles being published online without fact-checking. To test the effect, the team took a pre-trained AI-powered wiki and retrained it on its own outputs, and immediately noticed a decline in the quality of the information the tool generated.

The study further highlights that, despite being trained on a wide library of information about dog breeds from the get-go, the tool dropped rare dog breeds from its knowledge scope after repeated rounds of retraining on its own output.
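The pattern the researchers describe can be pictured with a very small simulation. The sketch below is a hypothetical illustration, not the study's actual methodology: a "model" that merely estimates how often each dog breed appears is repeatedly retrained on text it generated itself, and the rare breeds are the first casualties. The breed names, probabilities, and corpus size are made up for the example.

```python
# Hypothetical sketch of model collapse on categorical data: a frequency
# "model" is repeatedly retrained on its own generated samples. Rare
# categories (think rare dog breeds) tend to vanish first, and once their
# estimated probability hits zero they can never come back.
import numpy as np

rng = np.random.default_rng(42)

breeds = ["labrador", "poodle", "beagle", "otterhound", "mudi"]
probs = np.array([0.45, 0.30, 0.20, 0.03, 0.02])  # last two are "minority data"

for generation in range(50):
    # "Generate content" by sampling a small corpus from the current model...
    corpus = rng.choice(len(breeds), size=100, p=probs)
    # ...then "retrain" by re-estimating breed frequencies from that corpus.
    counts = np.bincount(corpus, minlength=len(breeds))
    probs = counts / counts.sum()

print("Estimated breed frequencies after 50 generations of self-training:")
for breed, p in zip(breeds, probs):
    print(f"  {breed:>10s}: {p:.2f}")
```

Run for enough generations, the rare breeds' estimated frequencies typically drift to zero while the common breeds absorb the remaining probability mass, mirroring the "minority data first" degradation Shumailov describes before overall quality collapses.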

As a result, the quality of search results will likely worsen as AI-generated content becomes more prevalent online. 


Kevin Okemwa
Contributor

Kevin Okemwa is a seasoned tech journalist based in Nairobi, Kenya with lots of experience covering the latest trends and developments in the industry at Windows Central. With a passion for innovation and a keen eye for detail, he has written for leading publications such as OnMSFT, MakeUseOf, and Windows Report, providing insightful analysis and breaking news on everything revolving around the Microsoft ecosystem. You'll also catch him occasionally contributing at iMore about Apple and AI. While AFK and not busy following the ever-emerging trends in tech, you can find him exploring the world or listening to music.

  • Yisroel Tech
    This shock-factor headline is misleading and almost intentionally deceptive 😢.

    57% of content published online is AI-generated
    Not even close to that!

    The AWS study is only about "machine translated" content and has nothing to do with generative AI and/or model collapse.

    Studies researching AI-generated content have found a mere 10-15% of content to be AI-generated. While that's still a valid concern going forward, the misleading 57.1% number has no relation to it.

    And don't tell me that this is only the headline... and that inside the article it does say 57.1% is AI-generated "or translated", because that doesn't help at all. First of all, then put that in the headline. Furthermore, it's not "or"; machine translation is the only thing this 57% refers to.

    I think this article should be clarified, and the AWS study removed.