Lightnews — Scholar-powered news

Reposted by Mingxuan (Aldous) Li

chenhaotan.bsky.social @chenhaotan.bsky.social · 13d

🚀 We’re thrilled to announce the upcoming AI & Scientific Discovery online seminar! We have an amazing lineup of speakers.

This series will dive into how AI is accelerating research, enabling breakthroughs, and shaping the future of research across disciplines.

ai-scientific-discovery.github.io

1 15 23

Reposted by Mingxuan (Aldous) Li

chenhaotan.bsky.social @chenhaotan.bsky.social · 22d

As AI becomes increasingly capable of conducting analyses and following instructions, my prediction is that the role of scientists will increasingly focus on identifying and selecting important problems to work on ("selector"), and effectively evaluating analyses performed by AI ("evaluator").

2 8 10

Reposted by Mingxuan (Aldous) Li

chenhaotan.bsky.social @chenhaotan.bsky.social · Aug 29

We are proposing the second workshop on AI & Scientific Discovery at EACL/ACL. The workshop will explore how AI can advance scientific discovery. Please use this Google form to indicate your interest (corrected link):

forms.gle/MFcdKYnckNno...

More in the 🧵! Please share! #MLSky 🧠

Program Committee Interest for the Second Workshop on AI & Scientific Discovery

We are proposing the second workshop on AI & Scientific Discovery at EACL/ACL (Annual meetings of The Association for Computational Linguistics, the European Language Resource Association and Internat...

forms.gle

1 8 14

Reposted by Mingxuan (Aldous) Li

Xiaoyan Bai @elenal3ai.bsky.social · Jul 31

⚡️Ever asked an LLM-as-Marilyn Monroe about the 2020 election? Our paper calls this concept incongruence, common in both AI and how humans create and reason.
🧠Read my blog to learn what we found, why it matters for AI safety and creativity, and what's next: cichicago.substack.com/p/concept-in...

1 5 9

Mingxuan (Aldous) Li @itea1001.bsky.social · Jul 27

#ACL2025 Poster Session 1 tomorrow 11:00-12:30 Hall 4/5!

1 3

Mingxuan (Aldous) Li @itea1001.bsky.social · Jul 27

Excited to present our work at #ACL2025!
Come by Poster Session 1 tomorrow, 11:00–12:30 in Hall X4/X5 — would love to chat!

Haokun Liu @haokunliu.bsky.social · Nov 14

1/ 🚀 New Paper Alert!
Excited to share: Literature Meets Data: A Synergistic Approach to Hypothesis Generation 📚📊!
We propose a novel framework combining literature insights & observational data with LLMs for hypothesis generation. Here’s how and why it matters.

2 4

Reposted by Mingxuan (Aldous) Li

chenhaotan.bsky.social @chenhaotan.bsky.social · Jul 9

Prompting is our most successful tool for exploring LLMs, but the term evokes eye-rolls and grimaces from scientists. Why? Because prompting as scientific inquiry has become conflated with prompt engineering.

This is holding us back. 🧵 and new paper with @ari-holtzman.bsky.social.

2 15 37

Reposted by Mingxuan (Aldous) Li

chenhaotan.bsky.social @chenhaotan.bsky.social · Jul 2

When you walk into the ER, you could get a doc:
1. Fresh from a week of not working
2. Tired from working too many shifts

@oziadias.bsky.social has been both and thinks that they're different! But can you tell from their notes? Yes we can! Paper @natcomms.nature.com www.nature.com/articles/s41...

1 11 26

Reposted by Mingxuan (Aldous) Li

Xiaoyan Bai @elenal3ai.bsky.social · May 27

🚨 New paper alert 🚨

Ever asked an LLM-as-Marilyn Monroe who the US president was in 2000? 🤔 Should the LLM answer at all? We call these clashes Concept Incongruence. Read on! ⬇️

1/n 🧵

1 17 28

Mingxuan (Aldous) Li @itea1001.bsky.social · May 21

HypoEval evaluators (github.com/ChicagoHAI/H...) are now incorporated into judges from QuotientAI — check it out at github.com/quotient-ai/...!

2 2

Mingxuan (Aldous) Li @itea1001.bsky.social · May 12

12/n Acknowledgments:
Many thanks to my wonderful collaborator Hanchen Li and my advisor @chenhaotan.bsky.social!
Check out the full paper at arxiv.org/abs/2504.07174

HypoEval: Hypothesis-Guided Evaluation for Natural Language Generation

Large language models (LLMs) have demonstrated great potential for automating the evaluation of natural language generation. Previous frameworks of LLM-as-a-judge fall short in two ways: they either u...

arxiv.org

1

Mingxuan (Aldous) Li @itea1001.bsky.social · May 12

11/n Closing thoughts:
This is a sample-efficient method for LLM-as-a-judge, grounded in human judgments, paving the way for personalized evaluators and alignment!

1

Mingxuan (Aldous) Li @itea1001.bsky.social · May 12

10/n Code:
We have released two repositories for HypoEval:
For replicating results/building upon: github.com/ChicagoHAI/H...
For off-the-shelf 0-shot evaluators for summaries and stories🚀: github.com/ChicagoHAI/H...

GitHub - ChicagoHAI/HypoEval-Gen: Repository for HypoEval paper (Hypothesis-Guided Evaluation for Natural Language Generation)

github.com

1 1

Mingxuan (Aldous) Li @itea1001.bsky.social · May 12

9/n Why HypoEval matters:
We push forward LLM-as-a-judge research by showing you can get:
Sample efficiency
Interpretable automated evaluation
Strong human alignment
…without massive fine-tuning.

1

Mingxuan (Aldous) Li @itea1001.bsky.social · May 12

8/n 🔬 Ablation insights:
Dropping hypothesis generation → performance drops ~7%
Combining all hypotheses into one criterion → performance drops ~8% (Better to let LLMs rate one sub-dimension at a time!)

1 1

Mingxuan (Aldous) Li @itea1001.bsky.social · May 12

7/n 💪 What’s robust?
✅ Works across out-of-distribution (OOD) tasks
✅ Generated hypotheses can be transferred to different LLMs (e.g., GPT-4o-mini ↔ LLAMA-3.3-70B)
✅ Reduces sensitivity to prompt variations compared to direct scoring

1 1

Mingxuan (Aldous) Li @itea1001.bsky.social · May 12

6/n 🏆 Where did we test it?
Across summarization (SummEval, NewsRoom) and story generation (HANNA, WritingPrompt)
We show state-of-the-art correlations with human judgments, for both rankings (Spearman correlation) and scores (Pearson correlation)! 📈
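
For readers unfamiliar with the two metrics named above, here is a minimal pure-Python sketch of how Pearson (score agreement) and Spearman (ranking agreement) correlations can be computed between human scores and judge scores. The function names and the toy data are illustrative assumptions, not code from the HypoEval repositories.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation: linear agreement between raw scores."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation computed on ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Toy data (hypothetical): the judge orders texts exactly as humans do,
# but on a non-linear scale.
human_scores = [1, 2, 3, 4, 5]
judge_scores = [1, 4, 9, 16, 25]
print(round(spearman(human_scores, judge_scores), 3))
print(round(pearson(human_scores, judge_scores), 3))
```

On this toy data the rankings agree perfectly while the raw scores do not, which is why reporting both metrics is informative.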

1 1

Mingxuan (Aldous) Li @itea1001.bsky.social · May 12

5/n Why is this better?
By combining small-scale human data + literature + non-binary checklists, HypoEval:
🔹 Outperforms G-Eval by ~12%
🔹 Beats fine-tuned models trained on 3x more human labels
🔹 Adds interpretable evaluation

1 1

Mingxuan (Aldous) Li @itea1001.bsky.social · May 12

4/n These hypotheses break down a complex evaluation rubric (e.g., “Is this summary comprehensive?”) into sub-dimensions an LLM can score clearly. ✅✅✅
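
To make the decomposition idea concrete, here is a minimal sketch of scoring one sub-dimension at a time and then combining the results. The sub-dimension names, the 1-5 scale, and the plain average are all assumptions made for illustration; the thread does not say how HypoEval combines sub-dimension scores, so this should not be read as the authors' method.

```python
from statistics import mean


def overall_score(sub_scores):
    """Combine per-sub-dimension scores into one overall score.

    Assumption: a plain average. This is an illustrative choice,
    not necessarily the combination rule used by HypoEval.
    """
    if not sub_scores:
        raise ValueError("need at least one sub-dimension score")
    return mean(sub_scores.values())


# Hypothetical sub-dimensions for "Is this summary comprehensive?",
# each assumed to be rated separately by an LLM on a 1-5 scale.
comprehensiveness = {
    "covers_main_events": 4,
    "includes_key_entities": 5,
    "keeps_important_details": 3,
}
print(overall_score(comprehensiveness))
```

Keeping each sub-dimension as its own entry also preserves the per-criterion scores, so a reader can see which aspect pulled the overall score down.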

1 1

Mingxuan (Aldous) Li @itea1001.bsky.social · May 12

3/n 🌟 Our solution: HypoEval
Building upon SOTA hypothesis generation methods, we generate hypotheses — decomposed rubrics (similar to checklists, but more systematic and explainable) — from existing literature and just 30 human annotations (scores) of texts.

1 2

Mingxuan (Aldous) Li @itea1001.bsky.social · May 12

2/n What’s the problem?
Most LLM-as-a-judge studies either:
❌ Achieve lower alignment with humans
⚙️ Require extensive fine-tuning → expensive data and compute
❓ Lack interpretability

1 3

Mingxuan (Aldous) Li @itea1001.bsky.social · May 12

1/n 🚀🚀🚀 Thrilled to share our latest work🔥: HypoEval - Hypothesis-Guided Evaluation for Natural Language Generation! 🧠💬📊
There’s a lot of excitement around using LLMs for automated evaluation, but many methods fall short on alignment or explainability — let’s dive in! 🌊

1 7 22

Reposted by Mingxuan (Aldous) Li

Mourad Heddaya @mheddaya.bsky.social · May 1

🧑‍⚖️How well can LLMs summarize complex legal documents? And can we use LLMs to evaluate?

Excited to be in Albuquerque presenting our paper this afternoon at @naaclmeeting 2025!

2 13 23

Reposted by Mingxuan (Aldous) Li

Haokun Liu @haokunliu.bsky.social · Apr 28

🚀🚀🚀Excited to share our latest work: HypoBench, a systematic benchmark for evaluating LLM-based hypothesis generation methods!

There is much excitement about leveraging LLMs for scientific hypothesis generation, but principled evaluations are missing - let’s dive into HypoBench together.

1 9 11