Mete
@mismayil.bsky.social
PhD in AI @EPFL
mismayil.bsky.social
🙏 Huge thanks to my collaborators Antonio Laverghetta, Simone Luchini, Reet Patel, @abosselut.bsky.social Lonneke van der Plas and @roger-beaty.bsky.social
for making this possible! We also publicly release the code, models, and data!
www.mete.is/creative-pre...
Creative Preference Optimization
Novel preference optimization method and a multi-task creativity dataset promoting LLM output creativity
www.mete.is
mismayil.bsky.social
📊 Results
We fine-tuned small LLMs like Llama-3.1-8B-Instruct on MuCE using CrPO and achieved significant improvements in output creativity across all dimensions while maintaining high output quality.
CrPO models beat SFT, vanilla DPO, and large models such as GPT-4o 🔥
mismayil.bsky.social
📃Multi-task Creativity Evaluation (MuCE)
To apply CrPO, we also collect a large-scale preference dataset consisting of more than 200K human responses and ratings for more than 30 creativity assessments, and use a subset of it to train and evaluate our models.
mismayil.bsky.social
🧠 How do we compute creativity scores?
Instead of treating creativity as a single concept, we break it down into its major dimensions and employ, for each, metrics that provide measurable signals aligned with key cognitive theories and enable practical optimization within LLMs.
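As a rough illustration of what per-dimension metrics can look like, here is a minimal sketch using simple lexical proxies: distinct-n for diversity and one-minus-max-Jaccard-overlap for novelty. These are illustrative stand-ins, not the paper's exact formulas (which may use embeddings or human ratings):

```python
def distinct_n(text: str, n: int = 2) -> float:
    """Diversity proxy: fraction of unique n-grams among all n-grams."""
    tokens = text.lower().split()
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0

def novelty(text: str, corpus: list[str]) -> float:
    """Novelty proxy: 1 minus the max Jaccard word-overlap with any reference response."""
    words = set(text.lower().split())
    if not words or not corpus:
        return 1.0
    overlaps = []
    for ref in corpus:
        ref_words = set(ref.lower().split())
        union = words | ref_words
        overlaps.append(len(words & ref_words) / len(union) if union else 0.0)
    return 1.0 - max(overlaps)

# Toy "Alternate Uses"-style responses for a brick (hypothetical examples)
corpus = ["a brick can hold a door open", "use a brick as a paperweight"]
print(distinct_n("grind the brick into pigment for fresco paint"))  # → 1.0
print(novelty("grind the brick into pigment for fresco paint", corpus))  # ≈ 0.92
```

Each dimension yielding its own scalar is what makes the later weighted combination possible.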
mismayil.bsky.social
🔧 How does it work?
CrPO augments Direct Preference Optimization (DPO) with a weighted mix of creativity scores (novelty, surprise, diversity, quality).
This modular objective enables us to optimize LLMs for different dimensions of creativity tailored to a given domain.
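A rough sketch of how such a modular objective might assemble (chosen, rejected) pairs for DPO training. The function names, weights, and example scores are all illustrative assumptions, not the paper's implementation:

```python
def creativity_score(metrics: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted sum over per-dimension scores (novelty, surprise, diversity, quality)."""
    return sum(weights[dim] * metrics.get(dim, 0.0) for dim in weights)

def to_dpo_pair(prompt: str, candidates: list[dict], weights: dict[str, float]) -> dict:
    """Rank candidate responses by creativity score; pair the best against the worst."""
    ranked = sorted(candidates,
                    key=lambda c: creativity_score(c["metrics"], weights),
                    reverse=True)
    return {"prompt": prompt, "chosen": ranked[0]["text"], "rejected": ranked[-1]["text"]}

# Hypothetical weights: emphasizing novelty for this domain
weights = {"novelty": 0.4, "surprise": 0.2, "diversity": 0.2, "quality": 0.2}
candidates = [
    {"text": "a doorstop",
     "metrics": {"novelty": 0.1, "surprise": 0.1, "diversity": 0.3, "quality": 0.9}},
    {"text": "grind it into pigment",
     "metrics": {"novelty": 0.9, "surprise": 0.8, "diversity": 0.7, "quality": 0.8}},
]
pair = to_dpo_pair("Unusual uses for a brick?", candidates, weights)
print(pair["chosen"])  # → grind it into pigment
```

Changing the weight vector is what tailors the objective to a given domain: the same pipeline can prefer quality-heavy or novelty-heavy outputs.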
mismayil.bsky.social
💡Can we optimize LLMs to be more creative?
Introducing Creative Preference Optimization (CrPO) and MuCE (Multi-task Creativity Evaluation Dataset).
Result: More novel, diverse, surprising text—without losing quality!
📝 Appearing at #EMNLP2025
mismayil.bsky.social
I am attending @naaclmeeting.bsky.social this week to present our paper. Come check out our poster at 14:00, Apr 30 in Hall 3. @defnecirci.bsky.social and Hale Sirin will also be there to answer your questions!
mismayil.bsky.social
Are LLMs as linguistically productive and systematic as humans in morphologically-rich languages?
No 🤨 Our new NAACL 2025 paper (arxiv.org/abs/2410.12656) reveals a significant performance gap between LLMs and humans in linguistic creativity and morphological generalization.
Evaluating Morphological Compositional Generalization in Large Language Models
Large language models (LLMs) have demonstrated significant progress in various natural language generation and understanding tasks. However, their linguistic generalization capabilities remain questio...
arxiv.org
Reposted by Mete
abosselut.bsky.social
Lots of great news out of the EPFL NLP lab these last few weeks. We'll be at @iclr-conf.bsky.social and @naaclmeeting.bsky.social in April / May to present some of our work in training dynamics, model representations, reasoning, and AI democratization. Come chat with us during the conference!
mismayil.bsky.social
Check out our paper [https://arxiv.org/abs/2410.12656] for more details. Huge shoutouts to my amazing collaborators
@defnecirci.bsky.social, @jonnesaleva.bsky.social, Hale Sirin, Abdullatif Koksal, Bhuwan Dhingra, @abosselut.bsky.social, Duygu Ataman, Lonneke van der Plas.
mismayil.bsky.social
Our further analysis shows that tokenization is unlikely to be the issue, adding context does not help, models are sensitive to the order of morphemes in the prompt, and removing language-specific shortcuts from the data lowers their performance even further.
mismayil.bsky.social
Our analysis shows that model performance is negatively correlated with the morphological complexity of the words (i.e. number of morphemes), while human performance is not systematically affected (results shown for GPT-4 below).
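The kind of trend described above can be checked with a plain Pearson correlation between morpheme count and accuracy. The accuracy numbers below are invented for demonstration only, not the paper's data:

```python
def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

morpheme_counts = [1, 2, 3, 4, 5]                  # word complexity
model_accuracy = [0.95, 0.85, 0.70, 0.55, 0.40]    # hypothetical LLM scores
human_accuracy = [0.96, 0.97, 0.95, 0.97, 0.96]    # hypothetical human scores

print(pearson(morpheme_counts, model_accuracy))    # strongly negative
print(pearson(morpheme_counts, human_accuracy))    # near zero
```

A strongly negative coefficient for the model against a near-zero one for humans is the signature of the gap the thread describes.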
mismayil.bsky.social
We find that all models struggle to compose new words and fail to consistently recognize the validity of all compositions, especially when applied to novel (i.e. out-of-distribution) word roots. Humans, on the other hand, ace both tasks and easily generalize to novel words.
mismayil.bsky.social
...and evaluate several multilingual LLMs (GPT-4, Gemini-1.5, Aya-23, Qwen2.5) on these tasks in two typologically similar (both agglutinative) but genetically unrelated languages: Turkish and Finnish. We also evaluate native human speakers of these languages.
mismayil.bsky.social
We design two novel compositional probing tasks to measure morphological productivity (i.e. ability to produce novel well-formed combinations of morphemes) and systematicity (i.e. ability to systematically understand novel combinations)...
mismayil.bsky.social