Mark Ibrahim
markibrahim.bsky.social
Researching the dark arts of deep learning at Meta's FAIR (Fundamental AI Research) Lab https://markibrahim.me/
Want to teach AI agents to use apps like humans? Get started with digital agents research using OpenApps, our new Python-based environment.
December 10, 2025 at 3:44 PM
Although leading models saturate single-image perception, Common-O establishes a challenging new multimodal benchmark. The best-performing model achieves only 35% on Common-O, and only 1% on Common-O Complex, which consists of more complex scenes.

🧵2/3
November 7, 2025 at 8:55 PM
We introduce Common-O, a new multimodal benchmark for hallucination when reasoning across scenes.

We find leading multimodal LLMs can reliably identify objects, yet hallucinate when reasoning across scenes.

🧵1/3
November 7, 2025 at 8:55 PM
One can manipulate LLM rankings to put any model in the lead, merely by modifying the single character separating demonstration examples. Learn more in our new paper arxiv.org/abs/2510.05152
w/ Jingtong Su, Jianyu Zhang, @karen-ullrich.bsky.social , and Léon Bottou.
🧵
October 9, 2025 at 2:32 PM
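To make the claim above concrete, here is a minimal sketch of what "the single character separating demonstration examples" means in few-shot prompting. The `build_prompt` helper and the example demos are hypothetical illustrations, not code from the paper: two prompts built this way are identical except for the separator character, yet the paper reports that evaluations run on such prompts can reorder model rankings.

```python
def build_prompt(demos, query, sep="\n"):
    """Join few-shot demonstration examples with `sep`, then append the query.

    The separator is the only degree of freedom varied here; everything
    else about the prompt is held fixed.
    """
    return sep.join(demos) + sep + query


# Hypothetical few-shot demonstrations and query.
demos = ["Q: 2+2? A: 4", "Q: 3+3? A: 6"]
query = "Q: 5+5? A:"

# Two prompts that differ only in the separator character.
prompt_newline = build_prompt(demos, query, sep="\n")
prompt_space = build_prompt(demos, query, sep=" ")

# Mapping one separator onto the other makes the prompts identical,
# confirming the separator is the sole difference between them.
assert prompt_newline.replace("\n", " ") == prompt_space
```

An evaluation harness that scores many models on prompts like `prompt_newline` versus `prompt_space` is the setting where, per the paper, leaderboard order can flip.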
A good language model should say “I don’t know” by reasoning about the limits of its knowledge. Our new work AbstentionBench carefully measures this overlooked skill in an open codebase others can build on!

We find that frontier reasoning degrades models’ ability to know when NOT to answer.

🧵1/2
June 17, 2025 at 6:32 PM
Recently, we also applied the same MLM-U objective to maze navigation. We find that, when training parameter-matched transformers on identical data, MLM-U without any tweaks outperforms standard next-token training across all maze grid sizes (up to 30×30).
December 11, 2024 at 6:42 PM
Can we boost transformers’ ability to retrieve knowledge and plan in maze navigation by only tweaking the learning objective?

We emphatically say YES in our #NeurIPS 2024 study! 🧵

w/ Ouail Kitouni, Niklas Nolte, Diane Bouchacourt, Adina Williams, and Mike Rabbat

Paper arxiv.org/abs/2406.05183
December 11, 2024 at 6:32 PM