Research: AI models fed AI-generated data quickly spew nonsense (AI inbreeding)

https://www.nature.com/articles/d41586-024-02420-7

EXCERPTS: Training artificial intelligence (AI) models on AI-generated text quickly leads to the models churning out nonsense, a study has found. This cannibalistic phenomenon, termed model collapse, could halt the improvement of large language models (LLMs) as they run out of human-derived training data and as increasing amounts of AI-generated text pervade the Internet.

“The message is, we have to be very careful about what ends up in our training data,” says co-author Zakhar Shumaylov, an AI researcher at the University of Cambridge, UK. Otherwise, “things will always, provably, go wrong,” he says. The team used a mathematical analysis to show that the problem of model collapse is likely to be universal, affecting all sizes of language model that use uncurated data, as well as simple image generators and other types of AI.

[...] Language models work by building up associations between tokens — words or word parts — in huge swathes of text, often scraped from the Internet. They generate text by spitting out the statistically most probable next word, based on these learned patterns.
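As a toy illustration of that build-associations-then-emit-the-likeliest-next-word loop (my own sketch, not code from the article; the tiny training text is invented), a bigram model does the same thing in miniature:

# Toy next-token prediction (hypothetical illustration): count word-pair
# frequencies, then generate by repeatedly emitting the statistically most
# probable next word.
from collections import Counter, defaultdict

text = "the cat sat on the mat and the cat slept on the mat".split()

# Build bigram counts: how often each word follows each other word.
following = defaultdict(Counter)
for prev, nxt in zip(text, text[1:]):
    following[prev][nxt] += 1

def generate(start, length=6):
    words = [start]
    for _ in range(length):
        counts = following.get(words[-1])
        if not counts:
            break
        # Greedy decoding: always pick the most probable next word.
        words.append(counts.most_common(1)[0][0])
    return " ".join(words)

print(generate("the"))  # greedy decoding loops quickly on data this small

Real LLMs learn these statistics with neural networks over subword tokens rather than a count table, but the generation loop has the same shape.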

[...] Collapse happens because each model necessarily samples only from the data it is trained on. This means that words that were infrequent in the original data are less likely to be reproduced, and the probability of common ones being regurgitated is boosted. Complete collapse eventually occurs because each model learns not from reality, but from the previous model’s prediction of reality, with errors getting amplified in each iteration. “Over time, those errors end up stacking up on top of each other, to the point where the model basically only learns errors and nothing else,” says Shumaylov.
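A rough numerical sketch of that stacking effect (again my own illustration, not the study's method, under the simplifying assumption that each generation just re-estimates a word distribution from a finite sample of the previous one):

# Hypothetical simulation of the resampling effect described above: each
# "generation" is trained only on a finite sample drawn from the previous
# generation's distribution, so rare words drop out and never come back.
import random
from collections import Counter

random.seed(0)

# Generation 0: a "human" vocabulary with a long tail of rare words.
vocab = [f"w{i}" for i in range(100)]
weights = [1.0 / (i + 1) for i in range(100)]  # Zipf-like: w0 common, w99 rare

for gen in range(6):
    # "Train" on a finite sample from the current distribution...
    sample = random.choices(vocab, weights=weights, k=500)
    counts = Counter(sample)
    # ...and let the next generation's distribution be those empirical counts.
    weights = [counts.get(w, 0) for w in vocab]
    alive = sum(1 for w in weights if w > 0)
    print(f"generation {gen + 1}: {alive} of {len(vocab)} words survive")

Run it and the surviving-word count falls generation by generation; once a word's probability hits zero it can never be sampled again, which is the one-way ratchet behind the collapse.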

The problem is analogous to inbreeding in a species, says Hany Farid, a computer scientist at the University of California, Berkeley. “If a species inbreeds with their own offspring and doesn’t diversify their gene pool, it can lead to a collapse of the species,” says Farid, whose work has demonstrated the same effect in image models, producing eerie distortions of reality [2].

[...] As synthetic data build up in the web, the scaling laws that state that models should get better the more data they train on are likely to break — because training data will lose the richness and variety that comes with human-generated content, says Kempe. (MORE - missing details)
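For context (my addition, not from the article): the scaling laws Kempe refers to are empirical power laws. In the Chinchilla formulation (Hoffmann et al., 2022), for example, loss is modeled as L(N, D) ≈ E + A/N^α + B/D^β, where N is the parameter count and D the number of training tokens. The worry above is that the D term stops buying improvement once a large share of those tokens is synthetic.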