Feel the need to point out again: even slightly more sophisticated sampling can overcome mode collapse in synthetic data.
x.com/sarahookr/status/18420…
🚨 "AI models collapse when trained on recursively generated data" was among the most influential AI papers of 2024 - don't miss it! Bookmark & download it below. Interesting quotes:
"The development of LLMs is very involved and requires large quantities of training data. Yet, although current LLMs [2,4–6], including GPT-3, were trained on predominantly human-generated text, this may change. If the training data of most future models are also scraped from the web, then they will inevitably train on data produced by their predecessors. In this paper, we investigate what happens when text produced by, for example, a version of GPT forms most of the training dataset of following models. What happens to GPT generations GPT-{n} as n increases? We discover that indiscriminately learning from data produced by other models causes 'model collapse', a degenerative process whereby, over time, models forget the true underlying data distribution, even in the absence of a shift in the distribution over time"
-
"Our evaluation suggests a 'first mover advantage' when it comes to training models such as LLMs. In our work, we demonstrate that training on samples from another generative model can induce a distribution shift, which, over time, causes model collapse. This in turn causes the model to misperceive the underlying learning task. To sustain learning over a long period of time, we need to make sure that access to the original data source is preserved and that further data not generated by LLMs remain available over time. The need to distinguish data generated by LLMs from other data raises questions about the provenance of content that is crawled from the Internet: it is unclear how content generated by LLMs can be tracked at scale. (...)"
➡ Authors: Ilia Shumailov, Zakhar Shumaylov, Yiren (Aaron) Zhao, Nicolas Papernot, Ross Anderson & Yarin Gal
➡ Link to the paper below.
🔥 To stay up to date with the latest developments in AI policy, compliance & regulation, including excellent research, join 44,400 people who subscribe to my AI newsletter (link below).