With EgoNormia, a 1.8k ego-centric video 🥽 QA benchmark, we show that this is surprisingly challenging!
What is the false belief task? ->
Very incomplete; please comment with suggestions (or if you're missing and want to be added!)
1. Tools Fail: Detecting Silent Errors in Faulty Tools
Are you using tools with your LLMs? Are you assuming your tools are perfect? Assuming the LLM can just handle any errors for you? 😬
Danger… 🚨 Models trust tools over their own “knowledge”, even for simple and well-trained cases.