
https://www.nature.com/articles/d41586-024-02420-7
EXCERPTS: Training artificial intelligence (AI) models on AI-generated text quickly leads to the models churning out nonsense, a study has found. This cannibalistic phenomenon, termed model collapse, could halt the improvement of large language models (LLMs) as they run out of human-derived training data and as increasing amounts of AI-generated text pervade the Internet.
“The message is, we have to be very careful about what ends up in our training data,” says co-author Zakhar Shumaylov, an AI researcher at the University of Cambridge, UK. Otherwise, “things will always, provably, go wrong,” he says. The team used a mathematical analysis to show that the problem of model collapse is likely to be universal, affecting all sizes of language model that use uncurated data, as well as simple image generators and other types of AI.
[...] Language models work by building up associations between tokens — words or word parts — in huge swathes of text, often scraped from the Internet. They generate text by spitting out the statistically most probable next word, based on these learned patterns.
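To make the next-token mechanism concrete, here is a minimal, hypothetical sketch using a toy bigram model; the corpus, the `next_token` helper and the greedy-decoding choice are all illustrative assumptions for the example, not anything from the study:

```python
# Illustrative sketch only: a toy bigram "language model" showing next-token
# prediction. The corpus, helper names and greedy decoding are assumptions
# made for this example, not taken from the study described above.
import random
from collections import Counter, defaultdict

corpus = "the cat sat on the mat the dog sat on the rug".split()

# Build associations between each token and the tokens that follow it.
bigrams = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigrams[prev][nxt] += 1

def next_token(prev, greedy=True):
    """Return the most probable next token, or sample from the learned distribution."""
    counts = bigrams[prev]
    if greedy:
        return counts.most_common(1)[0][0]
    tokens, weights = zip(*counts.items())
    return random.choices(tokens, weights=weights)[0]

# Generate text by repeatedly emitting the statistically most probable next word.
token, output = "the", ["the"]
for _ in range(5):
    token = next_token(token)
    output.append(token)
print(" ".join(output))  # e.g. "the cat sat on the cat"
```

A real LLM learns these conditional probabilities with a neural network over an enormous vocabulary rather than a count table, but the generation step is the same idea: pick the next token from a learned distribution.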
[...] Collapse happens because each model necessarily samples only from the data it is trained on. This means that words that were infrequent in the original data are less likely to be reproduced, and the probability of common ones being regurgitated is boosted. Complete collapse eventually occurs because each model learns not from reality, but from the previous model’s prediction of reality, with errors getting amplified in each iteration. “Over time, those errors end up stacking up on top of each other, to the point where the model basically only learns errors and nothing else,” says Shumaylov.
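The amplification mechanism described in that passage can be illustrated with a toy simulation (an illustrative sketch, not the paper's experimental setup; the vocabulary size, Zipf-like weights, sample size and the `fit_and_resample` helper are assumptions): each “generation” fits token frequencies to a finite sample drawn from the previous generation's model, so rare tokens that fail to appear in a sample get probability zero and never come back.

```python
# Illustrative toy simulation of recursive training on model output, not the
# paper's experiment: each "generation" refits token frequencies from a finite
# sample of the previous model. Vocabulary size, Zipf-like weights and sample
# size are arbitrary assumptions chosen for the demo.
import random
from collections import Counter

vocab = [f"w{i}" for i in range(50)]
true_probs = [1.0 / (i + 1) for i in range(50)]   # a few common words, a long rare tail
total = sum(true_probs)
true_probs = [p / total for p in true_probs]

def fit_and_resample(probs, n_samples=500, seed=0):
    """One generation: sample a finite corpus from the current model, then refit frequencies."""
    rng = random.Random(seed)
    corpus = rng.choices(vocab, weights=probs, k=n_samples)
    counts = Counter(corpus)
    return [counts[w] / n_samples for w in vocab]

probs = true_probs
for generation in range(10):
    probs = fit_and_resample(probs, seed=generation)
    surviving = sum(1 for p in probs if p > 0)
    print(f"generation {generation}: {surviving}/{len(vocab)} tokens still have nonzero probability")
```

Because a token that receives zero probability can never be sampled again, the count of surviving tokens can only shrink from one generation to the next: the rare tail disappears first and probability mass piles onto the common tokens, which is the error-compounding step the quote describes.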
The problem is analogous to inbreeding in a species, says Hany Farid, a computer scientist at the University of California, Berkeley. “If a species inbreeds with their own offspring and doesn’t diversify their gene pool, it can lead to a collapse of the species,” says Farid, whose work has demonstrated the same effect in image models, producing eerie distortions of reality2.
[...] As synthetic data build up on the web, the scaling laws that state that models should get better the more data they train on are likely to break — because training data will lose the richness and variety that comes with human-generated content, says Kempe. (MORE - missing details)
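For context, the data-scaling laws mentioned here are usually stated as an empirical power law in model size and training-set size; the form below is the widely used Chinchilla-style fit of Hoffmann et al., quoted as background rather than taken from this article:

```latex
% Empirical scaling law (Chinchilla-style fit): L is the model's loss, N the
% number of parameters, D the number of training tokens, and E, A, B, \alpha,
% \beta constants fitted to observed training runs.
L(N, D) \approx E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}
```

The worry voiced by Kempe is that when the D training tokens are increasingly synthetic, adding more of them no longer supplies the diversity that makes the data term fall, so the empirical fit stops holding.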