Mete
@mismayil.bsky.social
PhD in AI @EPFL
mismayil.bsky.social
🙏 Huge thanks to my collaborators Antonio Laverghetta, Simone Luchini, Reet Patel, @abosselut.bsky.social Lonneke van der Plas and @roger-beaty.bsky.social
for making this possible! We also publicly release the code, models, and data!
www.mete.is/creative-pre...
Creative Preference Optimization
Novel preference optimization method and a multi-task creativity dataset promoting LLM output creativity
www.mete.is
mismayil.bsky.social
📊 Results
We fine-tuned small LLMs like Llama-3.1-8B-Instruct on MuCE using CrPO and achieved significant improvements in output creativity across all dimensions while maintaining high output quality.
CrPO models beat SFT, vanilla DPO, and large models such as GPT-4o 🔥
mismayil.bsky.social
📃Multi-task Creativity Evaluation (MuCE)
To apply CrPO, we also collect a large-scale preference dataset consisting of more than 200K human responses and ratings for more than 30 creativity assessments, and use a subset of it to train and evaluate our models.
mismayil.bsky.social
🧠 How do we compute creativity scores?
Instead of treating creativity as a single concept, we break it down into its major dimensions and employ, for each, metrics that provide measurable signals aligned with key cognitive theories and enable practical optimization within LLMs.
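As a rough illustration of what per-dimension metrics can look like, here is a minimal sketch using simple lexical proxies: distinct-n for diversity and one-minus-max-Jaccard-overlap for novelty. These are illustrative stand-ins, not the paper's exact formulas (which may use embeddings or human ratings):

```python
def distinct_n(text: str, n: int = 2) -> float:
    """Diversity proxy: fraction of unique n-grams among all n-grams."""
    tokens = text.lower().split()
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0

def novelty(text: str, corpus: list[str]) -> float:
    """Novelty proxy: 1 minus the max Jaccard word-overlap with any reference response."""
    words = set(text.lower().split())
    if not words or not corpus:
        return 1.0
    overlaps = []
    for ref in corpus:
        ref_words = set(ref.lower().split())
        union = words | ref_words
        overlaps.append(len(words & ref_words) / len(union) if union else 0.0)
    return 1.0 - max(overlaps)

# Toy "Alternate Uses"-style responses for a brick (hypothetical examples)
corpus = ["a brick can hold a door open", "use a brick as a paperweight"]
print(distinct_n("grind the brick into pigment for fresco paint"))  # → 1.0
print(novelty("grind the brick into pigment for fresco paint", corpus))  # ≈ 0.92
```

Each dimension yielding its own scalar is what makes the later weighted combination possible.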
mismayil.bsky.social
🔧 How does it work?
CrPO augments Direct Preference Optimization (DPO) with a weighted mix of creativity scores (novelty, surprise, diversity, quality).
This modular objective enables us to optimize LLMs for different dimensions of creativity tailored to a given domain.
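A rough sketch of how such a modular objective might assemble (chosen, rejected) pairs for DPO training. The function names, weights, and example scores are all illustrative assumptions, not the paper's implementation:

```python
def creativity_score(metrics: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted sum over per-dimension scores (novelty, surprise, diversity, quality)."""
    return sum(weights[dim] * metrics.get(dim, 0.0) for dim in weights)

def to_dpo_pair(prompt: str, candidates: list[dict], weights: dict[str, float]) -> dict:
    """Rank candidate responses by creativity score; pair the best against the worst."""
    ranked = sorted(candidates,
                    key=lambda c: creativity_score(c["metrics"], weights),
                    reverse=True)
    return {"prompt": prompt, "chosen": ranked[0]["text"], "rejected": ranked[-1]["text"]}

# Hypothetical weights: emphasizing novelty for this domain
weights = {"novelty": 0.4, "surprise": 0.2, "diversity": 0.2, "quality": 0.2}
candidates = [
    {"text": "a doorstop",
     "metrics": {"novelty": 0.1, "surprise": 0.1, "diversity": 0.3, "quality": 0.9}},
    {"text": "grind it into pigment",
     "metrics": {"novelty": 0.9, "surprise": 0.8, "diversity": 0.7, "quality": 0.8}},
]
pair = to_dpo_pair("Unusual uses for a brick?", candidates, weights)
print(pair["chosen"])  # → grind it into pigment
```

Changing the weight vector is what tailors the objective to a given domain: the same pipeline can prefer quality-heavy or novelty-heavy outputs.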
mismayil.bsky.social
💡Can we optimize LLMs to be more creative?
Introducing Creative Preference Optimization (CrPO) and MuCE (Multi-task Creativity Evaluation Dataset).
Result: More novel, diverse, surprising text—without losing quality!
📝 Appearing at #EMNLP2025
mismayil.bsky.social
I am attending @naaclmeeting.bsky.social this week to present our paper. Come check out our poster at 14:00, Apr 30 in Hall 3. @defnecirci.bsky.social and Hale Sirin will also be there to answer your questions!
mismayil.bsky.social
Are LLMs as linguistically productive and systematic as humans in morphologically-rich languages?
No 🤨 Our new NAACL 2025 paper (arxiv.org/abs/2410.12656) reveals a significant performance gap between LLMs and humans in linguistic creativity and morphological generalization.
Evaluating Morphological Compositional Generalization in Large Language Models
Large language models (LLMs) have demonstrated significant progress in various natural language generation and understanding tasks. However, their linguistic generalization capabilities remain questio...
arxiv.org
Reposted by Mete
abosselut.bsky.social
Lots of great news out of the EPFL NLP lab these last few weeks. We'll be at @iclr-conf.bsky.social and @naaclmeeting.bsky.social in April / May to present some of our work in training dynamics, model representations, reasoning, and AI democratization. Come chat with us during the conference!
mismayil.bsky.social
Check out our paper [https://arxiv.org/abs/2410.12656] for more details. Huge shoutouts to my amazing collaborators
@defnecirci.bsky.social, @jonnesaleva.bsky.social, Hale Sirin, Abdullatif Koksal, Bhuwan Dhingra, @abosselut.bsky.social, Duygu Ataman, Lonneke van der Plas.
mismayil.bsky.social
Our further analysis shows that tokenization is unlikely to be the issue, adding context does not help, models are sensitive to the order of morphemes in the prompt, and removing language-specific shortcuts from the data lowers their performance even further.
mismayil.bsky.social
Our analysis shows that model performance is negatively correlated with the morphological complexity of the words (i.e. number of morphemes), while human performance is not systematically affected (results shown for GPT-4 below).
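The kind of trend described above can be checked with a plain Pearson correlation between morpheme count and accuracy. The accuracy numbers below are invented for demonstration only, not the paper's data:

```python
def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

morpheme_counts = [1, 2, 3, 4, 5]                  # word complexity
model_accuracy = [0.95, 0.85, 0.70, 0.55, 0.40]    # hypothetical LLM scores
human_accuracy = [0.96, 0.97, 0.95, 0.97, 0.96]    # hypothetical human scores

print(pearson(morpheme_counts, model_accuracy))    # strongly negative
print(pearson(morpheme_counts, human_accuracy))    # near zero
```

A strongly negative coefficient for the model against a near-zero one for humans is the signature of the gap the thread describes.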
mismayil.bsky.social
We find that all models struggle to compose new words and fail to consistently recognize the validity of all compositions, especially when applied to novel (i.e. out-of-distribution) word roots. Humans, on the other hand, ace both tasks and easily generalize to novel words.
mismayil.bsky.social
...and evaluate several multilingual LLMs (GPT-4, Gemini-1.5, Aya-23, Qwen2.5) on these tasks in two typologically similar (both agglutinative) but genetically unrelated languages: Turkish and Finnish. We also evaluate native human speakers of these languages.
mismayil.bsky.social
We design two novel compositional probing tasks to measure morphological productivity (i.e. ability to produce novel well-formed combinations of morphemes) and systematicity (i.e. ability to systematically understand novel combinations)...
mismayil.bsky.social