Wang Bill Zhu
@billzhu.bsky.social
93 followers 70 following 13 posts
CS Ph.D. candidate @ USC, https://billzhu.me
billzhu.bsky.social
At @naaclmeeting.bsky.social this week! I’ll be presenting our work on LLM domain induction with @thomason.bsky.social on Thu (5/1) at 4pm in Hall 3, Section I.

Would love to connect and chat about LLM planning, reasoning, AI4Science, multimodal stuff, or anything else. Feel free to DM!
billzhu.bsky.social
Huge thanks to my co-first-author Tianqi (just graduated as USC MS and actively searching for MLE jobs now), and my other amazing collaborators @robinomial, Ruishan, Roman, Jade, Mazen and Jorge, who helped shape this project.
We hope Cancer-Myth moves us closer to safer, medically grounded AI.
billzhu.bsky.social
Common failure types:
❌ “Late-stage means no treatment”
❌ “You’ll always need a colostomy bag after rectal cancer treatment”
Models do slightly better on myths like “no symptoms = no cancer” or on causal misattributions.
[7/n]
billzhu.bsky.social
We also analyze adversarial transfer:
Questions from the run targeting Gemini-1.5-Pro are the hardest across all models.
GPT-4o’s adversarial questions transfer much less effectively. [6/n]
billzhu.bsky.social
Results? No model corrects the false belief in more than 30% of questions. Even advanced prompting + multi-agent setups (e.g., MDAgents) don’t fix this.
Metrics (sketch below):
✅ PCR – % of responses that fully correct the false belief
🧠 PCS – average correction score across responses
[5/n]
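For concreteness, a minimal Python sketch (not the paper’s code) of how PCR and PCS could be computed, assuming a hypothetical per-response correction score on a {0, 0.5, 1} scale (0 = ignores the false belief, 0.5 = partial correction, 1 = full correction):

```python
def pcr(scores):
    """Proportion of responses that fully correct the false belief."""
    return sum(s == 1 for s in scores) / len(scores)

def pcs(scores):
    """Average correction score over all responses."""
    return sum(scores) / len(scores)

example_scores = [1, 0.5, 0, 1, 0]  # toy numbers, not real results
print(f"PCR = {pcr(example_scores):.2f}, PCS = {pcs(example_scores):.2f}")
```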
billzhu.bsky.social
To test this, we collect 994 common cancer myths and develop Cancer-Myth, an adversarial benchmark of 585 examples. We perform three separate runs over the full set of myths, targeting GPT-4o, Gemini-1.5-Pro, and Claude-3.5-Sonnet respectively. All questions are vetted by physicians.
[4/n]
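A hypothetical sketch of the kind of adversarial collection loop described above; the function names, prompts, and judging step are illustrative placeholders, not the actual Cancer-Myth pipeline:

```python
TARGET_MODELS = ["gpt-4o", "gemini-1.5-pro", "claude-3.5-sonnet"]

def generate_question(myth: str) -> str:
    """Ask a generator LLM to write a patient-style question that implicitly
    assumes the myth is true (placeholder for an actual API call)."""
    raise NotImplementedError

def answer(model: str, question: str) -> str:
    """Query the target model (placeholder for an actual API call)."""
    raise NotImplementedError

def corrects_myth(answer_text: str, myth: str) -> bool:
    """Judge whether the answer corrects the false belief, e.g. with a grader
    model followed by physician review (placeholder)."""
    raise NotImplementedError

def collect_adversarial(myths):
    kept = []
    for target in TARGET_MODELS:  # one run per target model
        for myth in myths:
            q = generate_question(myth)
            if not corrects_myth(answer(target, q), myth):
                kept.append({"target": target, "myth": myth, "question": q})
    return kept  # candidates are then vetted by physicians
```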
billzhu.bsky.social
Initially, we evaluated GPT-4, Gemini-1.5-Pro, and Claude-3.5-Sonnet on CancerCare questions.
✅ Answers were rated helpful by oncologists.
🙎‍♂️ Outperformed human social workers on average. Sounds good… but there’s a catch.
LLMs answered correctly but often left patient misconceptions untouched.
[3/n]
billzhu.bsky.social
🏥 Why does this matter for clinical safety?
Patients increasingly turn to LLMs for medical advice. But real questions often contain hidden false assumptions. LLMs that ignore false assumptions can reinforce harmful beliefs.
⚠️ Safety = not just answering correctly, but correcting the question.
[2/n]
Reposted by Wang Bill Zhu
robinjia.bsky.social
I'll be at #NeurIPS2024! My group has papers analyzing how LLMs use Fourier Features for arithmetic and how TFs learn higher-order optimization for ICL (led by @deqing.bsky.social), plus workshop papers on backdoor detection and LLMs + PDDL (led by @billzhu.bsky.social)
billzhu.bsky.social
We obtain supervision for sub-questions from human-annotated Question Decomposition Meaning Representation (QDMR). We treat sub-answers as latent variables and infer them with a dynamic mixture of Hard-EM and RL.
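A rough, illustrative sketch of what a Hard-EM / RL mixture over latent sub-answers can look like; `logp`, `reward`, and `alpha` are assumed placeholders here, not the released Chain-of-Questions code:

```python
import random

def hard_em_loss(candidates, logp):
    # Hard-EM: treat the highest-likelihood sub-answer sequence as if it were
    # the true latent and minimize its negative log-likelihood.
    best = max(candidates, key=logp)
    return -logp(best)

def rl_loss(candidates, logp, reward):
    # REINFORCE-style term: sample a latent sequence and weight its negative
    # log-likelihood by an answer-level reward (e.g., F1).
    z = random.choice(candidates)  # stand-in for sampling from the model
    return -reward(z) * logp(z)

def mixed_loss(candidates, logp, reward, alpha):
    # "Dynamic mixture": alpha can be scheduled over training steps.
    return alpha * hard_em_loss(candidates, logp) + (1 - alpha) * rl_loss(candidates, logp, reward)
```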
billzhu.bsky.social
✨ Excited to share our Chain-of-Questions paper #EMNLP2023: we develop a framework that trains *one T5 model* to robustly answer multistep questions by generating and answering sub-questions. Outperforms ChatGPT on DROP, HotpotQA and their contrast/adversarial sets.