Grgur Kovač
@kovacgrgur.bsky.social
PhD student at INRIA in the Flowers team. https://grgkovac.github.io
Twitter: @KovacGrgur
December 18, 2025 at 2:38 PM
P.S. This project wraps up my PhD research exploring how to leverage human sciences (psychology, cultural evolution) to better evaluate and understand LLMs.
I am now on the job market for EU-based remote roles in industry (LLM Researcher/Engineer). I’d love to connect! 👋
December 18, 2025 at 2:38 PM
This was done with:
@kovacgrgur.bsky.social *,Jérémy Perez *, Remy Portelas, Peter Ford Dominey, @pyoudeyer.bsky.social
(*equal contribution)
In the FlowersTeam, INRIA
December 18, 2025 at 2:38 PM
Caveat: Model collapse is a nascent field, and current studies make many assumptions about real-world dynamics. Here we explore one assumption - the homogeneity of data - but many more remain to be explored!
December 18, 2025 at 2:38 PM
Implication: Together, these two takeaways imply that different internet domains could exhibit different collapse dynamics, depending on that domain's data properties.
December 18, 2025 at 2:38 PM
Finding 2: The effects are within-domain. For LLMs trained on multiple domains, drops in one domain (e.g. Reddit) are influenced by that domain's own properties (Reddit's, not Twitter/X's or Wikipedia's); i.e., effects do not spill over to other domains.
December 18, 2025 at 2:38 PM
Finding 1: Human data properties influence collapse dynamics. Some human data properties (lexical diversity, Gaussianity) are associated with larger drops in both the quality and the semantic diversity of generated text, and some (quality, semantic diversity) with smaller drops.
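To make one of these properties concrete: lexical diversity can be measured, for instance, as the type-token ratio (unique words over total words). This is only an illustrative sketch; the metric and texts below are hypothetical, not necessarily the ones used in the paper.

```python
# Hypothetical sketch: lexical diversity as the type-token ratio.
# Higher values mean the text repeats itself less.

def type_token_ratio(text: str) -> float:
    """Unique words divided by total words (0.0 for empty text)."""
    tokens = text.lower().split()
    return len(set(tokens)) / len(tokens) if tokens else 0.0

diverse = "every sentence introduces fresh vocabulary and novel ideas"
repetitive = "the cat saw the cat and the cat saw the dog"

# The repetitive text scores lower than the all-unique-words text.
print(type_token_ratio(diverse) > type_token_ratio(repetitive))  # True
```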
December 18, 2025 at 2:38 PM
We use an iterative chain design: we iteratively fine-tune base LLMs on data generated by the previously fine-tuned models.

We then use regression analysis to find associations between human data properties and relative drops in the quality and semantic diversity of LLM-generated data.
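The regression step can be sketched as follows - a minimal illustration with one predictor and made-up numbers (the property values, drops, and variable names are hypothetical, not the paper's actual data or pipeline):

```python
# Sketch: associate a human-data property (here a hypothetical
# "lexical_diversity" score per domain) with the relative drop in
# generation quality observed after iterative fine-tuning.

def simple_regression(xs, ys):
    """Ordinary least squares with one predictor: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var
    return slope, mean_y - slope * mean_x

# Illustrative per-domain values: property score -> relative quality drop.
lexical_diversity = [0.2, 0.4, 0.6, 0.8]
relative_drop = [0.05, 0.12, 0.21, 0.30]

slope, intercept = simple_regression(lexical_diversity, relative_drop)
# A positive slope would indicate: higher lexical diversity, larger drop.
print(f"slope={slope:.3f}")  # slope=0.420
```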
December 18, 2025 at 2:38 PM
#LLMs are trained on internet data, which contains an increasing share of synthetic data. These LLMs then generate new online data, which will be used to train future LLMs.

Will this closed loop result in future models generating data of lower quality and diversity (i.e. collapse)?
December 18, 2025 at 2:38 PM
The leaderboard is explained in our previous tweet (we haven't transferred it to Bluesky yet) 😐:
x.com/KovacGrgur/s...
December 10, 2024 at 2:15 PM