Grgur Kovač
@kovacgrgur.bsky.social
PhD student at INRIA in the Flowers team. https://grgkovac.github.io
Twitter: @KovacGrgur
December 18, 2025 at 2:38 PM
P.S. This project wraps up my PhD research exploring how to leverage human sciences (psychology, cultural evolution) to better evaluate and understand LLMs.
I am now on the job market for EU-based remote roles in industry (LLM Researcher/Engineer). I’d love to connect! 👋
December 18, 2025 at 2:38 PM
This was done with:
@kovacgrgur.bsky.social *,Jérémy Perez *, Remy Portelas, Peter Ford Dominey, @pyoudeyer.bsky.social
(*equal contribution)
In the FlowersTeam, INRIA
December 18, 2025 at 2:38 PM
Caveat: Model collapse is a nascent field, and current studies make many assumptions about real-world dynamics. Here we explore one assumption - the homogeneity of data - but many more remain to be explored!
December 18, 2025 at 2:38 PM
Implication: Together, these two takeaways imply that different internet domains could exhibit different collapse dynamics, depending on that domain's data properties.
December 18, 2025 at 2:38 PM
Finding 2: The effects are within-domain. For LLMs trained on multiple domains, drops in one domain (e.g. Reddit) are influenced by that domain's own properties (Reddit's, not Twitter/X's or Wikipedia's); i.e., effects do not spill over to other domains.
December 18, 2025 at 2:38 PM
Finding 1: Human data properties influence collapse dynamics. Some human data properties (lexical diversity, Gaussianity) are associated with larger drops in both the quality and the semantic diversity of generated text, and some (quality, semantic diversity) with smaller drops.
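To make one of these properties concrete: lexical diversity can be measured, for instance, as the type-token ratio (unique words over total words). This is only an illustrative sketch; the metric and texts below are hypothetical, not necessarily the ones used in the paper.

```python
# Hypothetical sketch: lexical diversity as the type-token ratio.
# Higher values mean the text repeats itself less.

def type_token_ratio(text: str) -> float:
    """Unique words divided by total words (0.0 for empty text)."""
    tokens = text.lower().split()
    return len(set(tokens)) / len(tokens) if tokens else 0.0

diverse = "every sentence introduces fresh vocabulary and novel ideas"
repetitive = "the cat saw the cat and the cat saw the dog"

# The repetitive text scores lower than the all-unique-words text.
print(type_token_ratio(diverse) > type_token_ratio(repetitive))  # True
```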
December 18, 2025 at 2:38 PM
We use an iterative chain design: we iteratively fine-tune base LLMs on data generated by the previously fine-tuned models.

We then use regression analysis to find associations between human data properties and relative drops in the quality and semantic diversity of LLM-generated data.
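The regression step can be sketched as follows - a minimal illustration with one predictor and made-up numbers (the property values, drops, and variable names are hypothetical, not the paper's actual data or pipeline):

```python
# Sketch: associate a human-data property (here a hypothetical
# "lexical_diversity" score per domain) with the relative drop in
# generation quality observed after iterative fine-tuning.

def simple_regression(xs, ys):
    """Ordinary least squares with one predictor: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var
    return slope, mean_y - slope * mean_x

# Illustrative per-domain values: property score -> relative quality drop.
lexical_diversity = [0.2, 0.4, 0.6, 0.8]
relative_drop = [0.05, 0.12, 0.21, 0.30]

slope, intercept = simple_regression(lexical_diversity, relative_drop)
# A positive slope would indicate: higher lexical diversity, larger drop.
print(f"slope={slope:.3f}")  # slope=0.420
```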
December 18, 2025 at 2:38 PM
#LLMs are trained on internet data, which contains an increasing share of synthetic data. These LLMs then generate new online data, which will be used to train future LLMs.

Will this closed loop result in future models generating data of lower quality and diversity (i.e. collapse)?
December 18, 2025 at 2:38 PM
The leaderboard is explained in our previous tweet (we haven't transferred it to Bluesky yet) 😐:
x.com/KovacGrgur/s...
December 10, 2024 at 2:15 PM