Monica M Reddy
@monicamreddy.bsky.social
PhD student at @KhouryCollege. Working in Machine Learning for Healthcare. Previously: @StanfordMed, @allen_ai, @UmassAmherst https://monicamunnangi.github.io/
monicamreddy.bsky.social
Excited to present our work at MLHC 2025, held at Mayo Clinic, on Saturday, Aug 16! 🏥
Thank you to my collaborators Akshay Swaminathan, @jason-fries.bsky.social, Jenelle Jindal, Sanjana Narayanan, Ivan Lopez, Lucia Tu, Philip Chung, Jesutofunmi A. Omiye, Mehr Kashyap, Nigam Shah
monicamreddy.bsky.social
📢 How factual are LLMs in healthcare?
We’re excited to release FactEHR — a new benchmark to evaluate factuality in clinical notes. As generative AI enters the clinic, we need rigorous, source-grounded tools to measure what these models get right — and what they don’t. 🏥 🤖
monicamreddy.bsky.social
FactEHR is both a benchmark and a training resource for improving clinical LLMs in key tasks like summarization, electronic phenotyping, and QA.
📂 Code & Data: github.com/som-shahlab/...
📄 Paper: arxiv.org/abs/2412.124...
monicamreddy.bsky.social
We observe wide variation across models — in both fact decomposition and entailment judgment. Some LLMs generate accurate, grounded outputs; others miss or misstate key facts.
FactEHR highlights these gaps and guides improvement.
monicamreddy.bsky.social
🧠 FactEHR is a large NLI dataset for evaluating entailment-based LLM-as-a-judge methods in clinical text
📄 2,168 notes | 🏥 4 note types, 3 health systems
🔗 987K entailment pairs + 3.4K expert labels
🤖 Full fact decompositions from GPT-4o, Gemini 1.5, LLaMA3 8B, and o1-mini
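The entailment-pair setup above can be sketched as a simple scoring loop: decompose a model-generated note into atomic facts, judge whether each fact is entailed by the source note, and aggregate into a precision-style factuality score. The function names and the toy decomposer/judge below are illustrative stand-ins for the LLM calls, not the actual FactEHR implementation:

```python
# Sketch of entailment-based factuality scoring (illustrative only).
# In FactEHR-style evaluation, both decompose() and is_entailed()
# would be LLM calls; here they are trivial stand-ins.

def decompose(note: str) -> list[str]:
    # Stand-in for an LLM fact decomposer: split on sentence boundaries.
    return [s.strip() for s in note.split(".") if s.strip()]

def is_entailed(fact: str, source: str) -> bool:
    # Stand-in for an LLM-as-a-judge entailment check: naive substring match.
    return fact.lower() in source.lower()

def fact_precision(generated: str, source: str) -> float:
    # Fraction of generated atomic facts supported by the source note.
    facts = decompose(generated)
    if not facts:
        return 0.0
    entailed = sum(is_entailed(f, source) for f in facts)
    return entailed / len(facts)

source = "Patient presents with chest pain. ECG shows ST elevation."
generated = "Patient presents with chest pain. Patient has a history of diabetes."
print(fact_precision(generated, source))  # 1 of 2 facts entailed -> 0.5
```

The same loop run in the other direction (facts from the source, checked against the generated note) would give a recall-style measure of what the model omitted.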
monicamreddy.bsky.social
Why is this so hard?
Clinical notes are long, messy, and inconsistent. Evaluating fine-grained factuality across diverse note types (e.g., discharge vs. radiology) is a major challenge — but essential for safe, trustworthy LLMs. ⚠️
Reposted by Monica M Reddy
strubell.bsky.social
Was AC for one of the papers. Went to metareview and noticed that two reviews were basically paraphrases of each other (down to ordering of weaknesses) and LLM generated. Noticed the paper was also weirdly well written garbage. Then I investigated the deadbeat reviewers, realized they don't exist.
monicamreddy.bsky.social
Really nice work! Earlier this year we found similar results, and additionally that clinical LLMs are more sensitive to changes in instruction phrasing than their general-domain counterparts.

arxiv.org/abs/2407.09429