
https://www.nature.com/articles/d41586-024-02420-7
EXCERPTS: Training artificial intelligence (AI) models on AI-generated text quickly leads to the models churning out nonsense, a study has found. This cannibalistic phenomenon, termed model collapse, could halt the improvement of large language models (LLMs) as they run out of human-derived training data and as increasing amounts of AI-generated text pervade the Internet.
“The message is, we have to be very careful about what ends up in our training data,” says co-author Zakhar Shumaylov, an AI researcher at the University of Cambridge, UK. Otherwise, “things will always, provably, go wrong,” he says. The team used a mathematical analysis to show that the problem of model collapse is likely to be universal, affecting all sizes of language model that use uncurated data, as well as simple image generators and other types of AI.
[...] Language models work by building up associations between tokens — words or word parts — in huge swathes of text, often scraped from the Internet. They generate text by spitting out the statistically most probable next word, based on these learned patterns.
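To make the next-token mechanism concrete, here is a minimal, hypothetical sketch using a toy bigram model; the corpus, the `next_token` helper and the greedy-decoding choice are all illustrative assumptions for the example, not anything from the study:

```python
# Illustrative sketch only: a toy bigram "language model" showing next-token
# prediction. The corpus, helper names and greedy decoding are assumptions
# made for this example, not taken from the study described above.
import random
from collections import Counter, defaultdict

corpus = "the cat sat on the mat the dog sat on the rug".split()

# Build associations between each token and the tokens that follow it.
bigrams = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigrams[prev][nxt] += 1

def next_token(prev, greedy=True):
    """Return the most probable next token, or sample from the learned distribution."""
    counts = bigrams[prev]
    if greedy:
        return counts.most_common(1)[0][0]
    tokens, weights = zip(*counts.items())
    return random.choices(tokens, weights=weights)[0]

# Generate text by repeatedly emitting the statistically most probable next word.
token, output = "the", ["the"]
for _ in range(5):
    token = next_token(token)
    output.append(token)
print(" ".join(output))  # e.g. "the cat sat on the cat"
```

A real LLM learns these conditional probabilities with a neural network over an enormous vocabulary rather than a count table, but the generation step is the same idea: pick the next token from a learned distribution.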
[...] Collapse happens because each model necessarily samples only from the data it is trained on. This means that words that were infrequent in the original data are less likely to be reproduced, and the probability of common ones being regurgitated is boosted. Complete collapse eventually occurs because each model learns not from reality, but from the previous model’s prediction of reality, with errors getting amplified in each iteration. “Over time, those errors end up stacking up on top of each other, to the point where the model basically only learns errors and nothing else,” says Shumaylov.
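The amplification mechanism described in that passage can be illustrated with a toy simulation (an illustrative sketch, not the paper's experimental setup; the vocabulary size, Zipf-like weights, sample size and the `fit_and_resample` helper are assumptions): each “generation” fits token frequencies to a finite sample drawn from the previous generation's model, so rare tokens that fail to appear in a sample get probability zero and never come back.

```python
# Illustrative toy simulation of recursive training on model output, not the
# paper's experiment: each "generation" refits token frequencies from a finite
# sample of the previous model. Vocabulary size, Zipf-like weights and sample
# size are arbitrary assumptions chosen for the demo.
import random
from collections import Counter

vocab = [f"w{i}" for i in range(50)]
true_probs = [1.0 / (i + 1) for i in range(50)]   # a few common words, a long rare tail
total = sum(true_probs)
true_probs = [p / total for p in true_probs]

def fit_and_resample(probs, n_samples=500, seed=0):
    """One generation: sample a finite corpus from the current model, then refit frequencies."""
    rng = random.Random(seed)
    corpus = rng.choices(vocab, weights=probs, k=n_samples)
    counts = Counter(corpus)
    return [counts[w] / n_samples for w in vocab]

probs = true_probs
for generation in range(10):
    probs = fit_and_resample(probs, seed=generation)
    surviving = sum(1 for p in probs if p > 0)
    print(f"generation {generation}: {surviving}/{len(vocab)} tokens still have nonzero probability")
```

Because a token that receives zero probability can never be sampled again, the count of surviving tokens can only shrink from one generation to the next: the rare tail disappears first and probability mass piles onto the common tokens, which is the error-compounding step the quote describes.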
The problem is analogous to inbreeding in a species, says Hany Farid, a computer scientist at the University of California, Berkeley. “If a species inbreeds with their own offspring and doesn’t diversify their gene pool, it can lead to a collapse of the species,” says Farid, whose work has demonstrated the same effect in image models, producing eerie distortions of reality2.
[...] As synthetic data build up on the web, the scaling laws that state that models should get better the more data they train on are likely to break — because training data will lose the richness and variety that comes with human-generated content, says Kempe. (MORE - missing details)
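For context, the data-scaling laws mentioned here are usually stated as an empirical power law in model size and training-set size; the form below is the widely used Chinchilla-style fit of Hoffmann et al., quoted as background rather than taken from this article:

```latex
% Empirical scaling law (Chinchilla-style fit): L is the model's loss, N the
% number of parameters, D the number of training tokens, and E, A, B, \alpha,
% \beta constants fitted to observed training runs.
L(N, D) \approx E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}
```

The worry voiced by Kempe is that when the D training tokens are increasingly synthetic, adding more of them no longer supplies the diversity that makes the data term fall, so the empirical fit stops holding.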