Monica M Reddy
@monicamreddy.bsky.social
PhD student at @KhouryCollege. Working in Machine Learning for Healthcare. Previously: @StanfordMed, @allen_ai, @UmassAmherst https://monicamunnangi.github.io/
monicamreddy.bsky.social
Excited to present our work at MLHC 2025, held at Mayo Clinic, on Saturday, Aug 16! 🏥
Thank you to my collaborators Akshay Swaminathan, @jason-fries.bsky.social, Jenelle Jindal, Sanjana Narayanan, Ivan Lopez, Lucia Tu, Philip Chung, Jesutofunmi A. Omiye, Mehr Kashyap, Nigam Shah
monicamreddy.bsky.social
📢 How factual are LLMs in healthcare?
We’re excited to release FactEHR — a new benchmark to evaluate factuality in clinical notes. As generative AI enters the clinic, we need rigorous, source-grounded tools to measure what these models get right — and what they don’t. 🏥 🤖
monicamreddy.bsky.social
FactEHR is both a benchmark and a training resource for improving clinical LLMs in key tasks like summarization, electronic phenotyping, and QA.
📂 Code & Data: github.com/som-shahlab/...
📄 Paper: arxiv.org/abs/2412.124...
monicamreddy.bsky.social
We observe wide variation across models — in both fact decomposition and entailment judgment. Some LLMs generate accurate, grounded outputs; others miss or misstate key facts.
FactEHR highlights these gaps and guides improvement.
monicamreddy.bsky.social
🧠 FactEHR is a large NLI dataset for evaluating entailment-based LLM-as-a-judge methods in clinical text
📄 2,168 notes | 🏥 4 note types, 3 health systems
🔗 987K entailment pairs + 3.4K expert labels
🤖 Full fact decompositions from GPT-4o, Gemini 1.5, LLaMA3 8B, and o1-mini
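The entailment-pair setup above can be sketched as a simple scoring loop: decompose a model-generated note into atomic facts, judge whether each fact is entailed by the source note, and aggregate into a precision-style factuality score. The function names and the toy decomposer/judge below are illustrative stand-ins for the LLM calls, not the actual FactEHR implementation:

```python
# Sketch of entailment-based factuality scoring (illustrative only).
# In FactEHR-style evaluation, both decompose() and is_entailed()
# would be LLM calls; here they are trivial stand-ins.

def decompose(note: str) -> list[str]:
    # Stand-in for an LLM fact decomposer: split on sentence boundaries.
    return [s.strip() for s in note.split(".") if s.strip()]

def is_entailed(fact: str, source: str) -> bool:
    # Stand-in for an LLM-as-a-judge entailment check: naive substring match.
    return fact.lower() in source.lower()

def fact_precision(generated: str, source: str) -> float:
    # Fraction of generated atomic facts supported by the source note.
    facts = decompose(generated)
    if not facts:
        return 0.0
    entailed = sum(is_entailed(f, source) for f in facts)
    return entailed / len(facts)

source = "Patient presents with chest pain. ECG shows ST elevation."
generated = "Patient presents with chest pain. Patient has a history of diabetes."
print(fact_precision(generated, source))  # 1 of 2 facts entailed -> 0.5
```

The same loop run in the other direction (facts from the source, checked against the generated note) would give a recall-style measure of what the model omitted.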
monicamreddy.bsky.social
Why is this so hard?
Clinical notes are long, messy, and inconsistent. Evaluating fine-grained factuality across diverse note types (e.g., discharge vs. radiology) is a major challenge — but essential for safe, trustworthy LLMs. ⚠️
Reposted by Monica M Reddy
strubell.bsky.social
Was AC for one of the papers. Went to metareview and noticed that two reviews were basically paraphrases of each other (down to ordering of weaknesses) and LLM generated. Noticed the paper was also weirdly well written garbage. Then I investigated the deadbeat reviewers, realized they don't exist.
monicamreddy.bsky.social
Really nice work! Earlier this year we found similar results, and additionally that clinical LLMs are more sensitive to changes in instruction phrasing than their general-domain counterparts.

arxiv.org/abs/2407.09429