Can AI Models Collapse?
Wait... what... really? Can an AI language model collapse? Yes, it can [1].
A recent research paper, "AI models collapse when trained on recursively generated data", published by Ilia Shumailov, Zakhar Shumaylov, Yiren Zhao, Nicolas Papernot, Ross Anderson, and Yarin Gal on 24 July 2024, makes exactly this point.
And it kind of makes sense, too.
What is model collapse?
According to the paper, model collapse is a degenerative process affecting generations of learned generative models, in which the data they generate end up polluting the training set of the next generation. Trained on that polluted data, each new generation then misinterprets reality.
Basics. How AI models are trained
As most of you know, there are tons of human-generated data all around us, and there have been for ages: books, papers, news, paintings. In the last 2-3 decades, much of that physical data has been digitized and made available on the web. Then there's more: blogs, news sites, videos, movies, pictures, drawings, social media, and the list goes on.
Imagine if there were a way to accumulate all of this data into a software system or a database. That is exactly what has happened, and that data is being used to train Large Language Models (LLMs).
A few popular datasets used to train LLMs: HelpSteer, No Robots, Anthropic HH Golden, and several more at Hugging Face Datasets.
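If you want to poke at this kind of training data yourself, a minimal sketch using the Hugging Face `datasets` library might look like the following. The hub IDs are my assumptions about where these sets live, so check the dataset cards on Hugging Face before running.

```python
# A minimal sketch of pulling some of these datasets with the Hugging Face
# `datasets` library. The exact hub IDs below are assumptions; verify them
# on https://huggingface.co/datasets before running.
from datasets import load_dataset

helpsteer = load_dataset("nvidia/HelpSteer", split="train")          # assumed hub ID
no_robots = load_dataset("HuggingFaceH4/no_robots", split="train")   # assumed hub ID

print(helpsteer[0].keys())                   # inspect the fields of one training example
print(len(no_robots), "human-written examples")
```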
The Problem
These models keep growing bigger with every generation, as model builders keep adding more data. What happens when these builders eventually exhaust all of this human-generated data?
The LLMs themselves will be used to generate their own training data, known as synthetic data [2].
There is another wrinkle: while training on raw data, an LLM processes it into a refined set, and as each new generation of LLMs is trained on more and newer datasets, it piles those on top of the previously refined ones. This is how the training data behind LLMs accumulates, as the toy sketch below illustrates.
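Here is a toy sketch of that pile-up, under the assumption that each generation's training set is simply the previous refined set plus whatever the previous model generated. It is not any vendor's actual pipeline; the `train` and `generate_synthetic` functions are stand-ins.

```python
# Toy illustration: synthetic data accumulating across model generations.

def train(dataset):
    """Stand-in for real training; returns a 'model' that just remembers its data."""
    return {"training_data": list(dataset)}

def generate_synthetic(model, n):
    """Stand-in for sampling from the model: here we simply echo seen examples."""
    return [f"synthetic({x})" for x in model["training_data"][:n]]

training_set = ["human text 1", "human text 2", "human text 3"]  # original human data

for generation in range(3):
    model = train(training_set)
    synthetic = generate_synthetic(model, n=2)
    # The next generation inherits the refined set *plus* model-generated data.
    training_set = training_set + synthetic
    print(f"gen {generation}: {len(training_set)} examples, "
          f"{sum('synthetic' in x for x in training_set)} synthetic")
```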
So far, so good. Now comes the bad part.
If some part of this cycle is corrupted, that corrupted data keeps piling up into newer models unless someone removes it, and this can lead to an AI model collapse.
The research paper demonstrates that training on samples from another generative model induces a distribution shift which, over time, causes model collapse. This in turn causes the model to misinterpret the underlying learning task.
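To get a feel for this, here is a tiny self-contained simulation (my own toy illustration, not the paper's code) of the recursive loop: fit a simple Gaussian model to some data, sample the next generation's "training data" from that fit, and repeat. Watching the fitted mean and standard deviation drift over generations gives a flavour of the distribution shift the paper describes.

```python
# Toy illustration of recursive training: each generation is fit only to
# samples drawn from the previous generation's fitted distribution, so
# estimation error compounds and the distribution drifts over time.
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=0.0, scale=1.0, size=200)   # generation 0: "human" data

for generation in range(1, 11):
    mu, sigma = data.mean(), data.std()           # "train" a model on current data
    data = rng.normal(mu, sigma, size=200)        # next generation trains on samples
    print(f"gen {generation:2d}: mean={mu:+.3f}, std={sigma:.3f}")
```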
The Solution, if there is any?
The paper notes a "first mover advantage" in training models such as LLMs: earlier models were trained on predominantly human-generated data. However, to sustain learning over time, AI vendors must ensure:
Preserved access to original data sources
Continuous availability of new, human-generated data
Community-Wide Coordination: different parties involved in LLM creation and deployment must share information to resolve provenance questions. Without this coordination, training newer LLM versions may become increasingly difficult due to:
Limited access to pre-mass adoption internet-crawled data
Insufficient direct access to human-generated data at scale
A key challenge is distinguishing LLM-generated data from everything else, which raises questions about the provenance of content crawled from the internet. Tracking LLM-generated content at scale remains a significant concern.
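To make that concrete, here is a hypothetical sketch of what record-level provenance tracking could look like. The `CrawledDocument` type, its field names, and the pre-2023 cutoff are all my own illustrative assumptions, not an existing standard or anything proposed in the paper.

```python
# Hypothetical sketch: tag every crawled document with its suspected origin so
# later training runs can filter or down-weight model-generated content.
from dataclasses import dataclass

@dataclass
class CrawledDocument:
    url: str
    text: str
    source: str                # "human", "llm_generated", or "unknown"
    crawled_before_2023: bool  # rough proxy for pre-mass-adoption internet data

def human_training_pool(docs):
    """Keep only documents we have some reason to believe are human-written."""
    return [d for d in docs if d.source == "human" or d.crawled_before_2023]

docs = [
    CrawledDocument("https://example.org/a", "old forum post", "unknown", True),
    CrawledDocument("https://example.org/b", "AI-written listicle", "llm_generated", False),
]
print(len(human_training_pool(docs)), "documents kept for training")
```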