Lightnews — Scholar-powered news

Wang Bill Zhu

@billzhu.bsky.social

At @naaclmeeting.bsky.social this week! I’ll be presenting our work on LLM domain induction with @thomason.bsky.social on Thu (5/1) at 4pm in Hall 3, Section I.

Would love to connect and chat about LLM planning, reasoning, AI4Science, multimodal stuff, or anything else. Feel free to DM!

April 30, 2025 at 6:38 PM

Wang Bill Zhu

@billzhu.bsky.social

Common failure types:
❌ “Late-stage means no treatment”
❌ “You’ll always need a colostomy bag after rectal cancer treatment”
Models do slightly better on myths like “no symptoms = no cancer” or causal misattribution.
[7/n]

April 16, 2025 at 5:07 PM

Wang Bill Zhu

@billzhu.bsky.social

We also analyze adversarial transfer:
Questions generated from Gemini-1.5-Pro are the hardest across all models.
GPT-4o’s adversarial questions are much less effective. [6/n]

April 16, 2025 at 5:07 PM

Wang Bill Zhu

@billzhu.bsky.social

Results? No model corrects more than 30% of questions. Even advanced prompting + multi-agent setups (e.g., MDAgents) don’t fix this.
Metrics:
✅ PCR – % of responses that fully correct the false belief
🧠 PCS – average correction score.
[5/n]

April 16, 2025 at 5:07 PM
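The two metrics in [5/n] can be illustrated with a minimal sketch. The thread does not state the scoring scale, so this assumes each response receives a judged correction score where 1.0 means the false belief was fully corrected; the `JudgedResponse` type, the `FULL_CORRECTION` threshold, and the example scores are hypothetical and not taken from the paper.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class JudgedResponse:
    """One model response with a judged correction score (hypothetical schema)."""
    question_id: str
    score: float  # assumed scale: 1.0 = false belief fully corrected


# Assumption: a response counts as "fully correct" only at the top score.
FULL_CORRECTION = 1.0


def pcr(responses: Sequence[JudgedResponse]) -> float:
    """Fraction of responses that fully correct the false belief."""
    if not responses:
        raise ValueError("need at least one judged response")
    full = sum(1 for r in responses if r.score >= FULL_CORRECTION)
    return full / len(responses)


def pcs(responses: Sequence[JudgedResponse]) -> float:
    """Average correction score over all responses."""
    if not responses:
        raise ValueError("need at least one judged response")
    return sum(r.score for r in responses) / len(responses)


# Made-up scores, for illustration only.
judged = [
    JudgedResponse("q1", 1.0),
    JudgedResponse("q2", 0.0),
    JudgedResponse("q3", -1.0),
    JudgedResponse("q4", 1.0),
]
print(f"PCR = {pcr(judged):.2f}, PCS = {pcs(judged):.2f}")
```

Under these assumptions PCR only rewards complete corrections, while PCS also reflects partial ones, which is why the two are reported side by side.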

Wang Bill Zhu

@billzhu.bsky.social

To test this, we collect 994 common cancer myths and develop Cancer-Myth, an adversarial dataset of 585 examples. We perform three separate runs over the entire set of myths, each targeting one of GPT-4o, Gemini-1.5-Pro, and Claude-3.5-Sonnet. All questions are vetted by physicians.
[4/n]

April 16, 2025 at 5:07 PM

Wang Bill Zhu

@billzhu.bsky.social

Initially, we evaluated GPT-4, Gemini-1.5-Pro, and Claude-3.5-Sonnet on CancerCare questions.
✅ Answers were rated helpful by oncologists.
🙎‍♂️ Outperformed human social workers on average. Sounds good… but there’s a catch.
LLMs answered correctly but often left patient misconceptions untouched.
[3/n]

April 16, 2025 at 5:07 PM

Wang Bill Zhu

@billzhu.bsky.social

🏥 Why does this matter for clinical safety?
Patients increasingly turn to LLMs for medical advice. But real questions often contain hidden false assumptions. LLMs that ignore false assumptions can reinforce harmful beliefs.
⚠️ Safety = not just answering correctly, but correcting the question.
[2/n]

April 16, 2025 at 5:07 PM

Wang Bill Zhu

@billzhu.bsky.social

We obtain supervision for sub-questions from human-annotated question decomposition meaning representation (QDMR). We treat sub-answers as latent variables and infer them with a dynamic mixture of Hard-EM+RL.

October 10, 2023 at 10:59 PM

Wang Bill Zhu

@billzhu.bsky.social

✨ Excited to share our Chain-of-Questions paper #EMNLP2023: we develop a framework that trains *one T5 model* to robustly answer multistep questions by generating and answering sub-questions. Outperforms ChatGPT on DROP, HotpotQA and their contrast/adversarial sets.

October 10, 2023 at 10:57 PM
