Shang Qu
@lindsayttsq.bsky.social
AI4Biomed & LLMs @ Tsinghua University
Check out the details!
📒Preprint: arxiv.org/pdf/2501.18362
🗃️Data files will be released shortly at: github.com/TsinghuaC3I/...
February 4, 2025 at 1:33 PM
We also find that reasoning process errors and, in MM, perceptual errors account for a large share of model errors. Error cases offer further insight into the challenges models still face in clinical reasoning:
February 4, 2025 at 1:33 PM
💡Clinical reasoning enables evaluating model reasoning beyond math & code. We annotate each MedXpertQA question as Reasoning or Understanding based on the reasoning complexity it requires.
Comparing 3 inference-time scaled models against their backbones, we find distinct improvements on the Reasoning subset (see the sketch after this post):
February 4, 2025 at 1:32 PM
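For anyone reproducing this kind of comparison, here is a minimal Python sketch of per-subset accuracy deltas. It is not the authors' released evaluation code, and the record fields ("subset", "answer", "model_answer") are assumed placeholders rather than the actual data schema:

```python
# Minimal sketch (not the released evaluation code): compare accuracy on the
# Reasoning vs. Understanding subsets for a scaled model and its backbone.
# Field names ("subset", "answer", "model_answer") are assumptions.
from collections import defaultdict

def subset_accuracy(records):
    """records: iterable of dicts with 'subset', 'answer', 'model_answer'."""
    correct, total = defaultdict(int), defaultdict(int)
    for r in records:
        total[r["subset"]] += 1
        correct[r["subset"]] += int(r["model_answer"] == r["answer"])
    return {s: correct[s] / total[s] for s in total}

# Toy usage: report the per-subset accuracy delta of the scaled model.
backbone = [{"subset": "Reasoning", "answer": "B", "model_answer": "C"},
            {"subset": "Understanding", "answer": "A", "model_answer": "A"}]
scaled   = [{"subset": "Reasoning", "answer": "B", "model_answer": "B"},
            {"subset": "Understanding", "answer": "A", "model_answer": "A"}]
acc_b, acc_s = subset_accuracy(backbone), subset_accuracy(scaled)
for s in acc_b:
    print(s, f"Δ = {acc_s[s] - acc_b[s]:+.2f}")
```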
Benchmark construction process: 38k original ➡️ 4k+ final questions
- Filtering for difficulty and diversity using responses from humans + 8 AI experts (sketched below)
- Question rewriting & option set expansion to lower data leakage risk
- Human expert proofreading & error correction
February 4, 2025 at 1:31 PM
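A hedged sketch of the difficulty-filtering idea, assuming the simple rule "keep a question only if few of the expert models answer it correctly"; the exact criterion, data layout, and threshold here are illustrative, not the paper's:

```python
# Illustrative difficulty filter: keep questions that at most `max_correct_frac`
# of a panel of models (the thread mentions 8 "AI experts") answer correctly.
# Data layout and threshold are assumptions, not the paper's exact rule.
def filter_hard_questions(questions, model_answers, max_correct_frac=0.25):
    """
    questions: list of dicts with 'id' and 'answer' (gold option label).
    model_answers: dict mapping model name -> {question_id: predicted label}.
    """
    kept = []
    for q in questions:
        n_correct = sum(preds.get(q["id"]) == q["answer"]
                        for preds in model_answers.values())
        if n_correct / len(model_answers) <= max_correct_frac:
            kept.append(q)
    return kept

# Toy usage: only q2 (solved by 0 of 2 models) survives the filter.
qs = [{"id": "q1", "answer": "A"}, {"id": "q2", "answer": "D"}]
preds = {"model_x": {"q1": "A", "q2": "B"}, "model_y": {"q1": "A", "q2": "C"}}
print([q["id"] for q in filter_hard_questions(qs, preds)])  # ['q2']
```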
We improve clinical relevance through:
⭐️Medical specialty coverage: MedXpertQA includes questions from 20+ exams at the medical licensing level or above
⭐️Realistic context: MM is the first multimodal medical benchmark to introduce rich clinical information with diverse image types
February 4, 2025 at 1:31 PM
Compared with rapidly saturating benchmarks like MedQA, we raise the bar with harder questions and a sharper focus on medical reasoning.
Full results evaluating 17 LLMs, LMMs, and inference-time scaled models:
February 4, 2025 at 1:30 PM
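A rough sketch of how one might score a model on multiple-choice items like these, by extracting the first standalone option letter from the model's free-text response. The response format, regex, and A–J label range are assumptions pending the data release:

```python
# Illustrative multiple-choice scoring: pull the first standalone option letter
# out of a model's response and compare it with the gold label. The A-J range
# and answer format are assumptions, not the released evaluation protocol.
import re

def extract_choice(response: str) -> str | None:
    m = re.search(r"\b([A-J])\b", response)
    return m.group(1) if m else None

def score(items):
    """items: iterable of dicts with 'answer' (gold letter) and 'response' (model text)."""
    hits = sum(extract_choice(it["response"]) == it["answer"] for it in items)
    return hits / len(items)

# Toy usage: one correct, one incorrect answer -> accuracy 0.5.
print(score([{"answer": "C", "response": "The best next step is C."},
             {"answer": "A", "response": "Answer: B"}]))
```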