🎒 @mcgill-nlp.bsky.social
w/ Michelle Yang, @sivareddyg.bsky.social , @msonderegger.bsky.social and @dallascard.bsky.social👇(1/12)
w/ Michelle Yang, @sivareddyg.bsky.social , @msonderegger.bsky.social and @dallascard.bsky.social👇(1/12)
Paper: arxiv.org/abs/2506.09301 to appear @ #ACL2025 (Main)
Paper: arxiv.org/abs/2506.09301 to appear @ #ACL2025 (Main)
This position paper argues that rather than forcing web agents to adapt to UIs designed for humans, we should develop a new interface optimized for web agents, which we call Agentic Web Interface (AWI).
arxiv.org/abs/2506.10953
This position paper argues that rather than forcing web agents to adapt to UIs designed for humans, we should develop a new interface optimized for web agents, which we call Agentic Web Interface (AWI).
arxiv.org/abs/2506.10953
We ask 🤔
What subtle shortcuts are VideoLLMs taking on spatio-temporal questions?
And how can we instead curate shortcut-robust examples at a large-scale?
We release: MVPBench
Details 👇🔬
We ask 🤔
What subtle shortcuts are VideoLLMs taking on spatio-temporal questions?
And how can we instead curate shortcut-robust examples at a large-scale?
We release: MVPBench
Details 👇🔬
Our #ACL2025 (Main) paper shows that hallucinations under irrelevant contexts follow a systematic failure mode — revealing how LLMs generalize using abstract classes + context cues, albeit unreliably.
📎 Paper: arxiv.org/abs/2505.22630 1/n
Our #ACL2025 (Main) paper shows that hallucinations under irrelevant contexts follow a systematic failure mode — revealing how LLMs generalize using abstract classes + context cues, albeit unreliably.
📎 Paper: arxiv.org/abs/2505.22630 1/n
We are releasing the first benchmark to evaluate how well automatic evaluators, such as LLM judges, can evaluate web agent trajectories.
We are releasing the first benchmark to evaluate how well automatic evaluators, such as LLM judges, can evaluate web agent trajectories.
We are releasing the first benchmark to evaluate how well automatic evaluators, such as LLM judges, can evaluate web agent trajectories.
🔗 arxiv.org/abs/2504.07128
🔗 arxiv.org/abs/2504.07128
142-page report diving into the reasoning chains of R1. It spans 9 unique axes: safety, world modeling, faithfulness, long context, etc.
Now on arxiv: arxiv.org/abs/2504.07128
142-page report diving into the reasoning chains of R1. It spans 9 unique axes: safety, world modeling, faithfulness, long context, etc.
Now on arxiv: arxiv.org/abs/2504.07128
🔗: mcgill-nlp.github.io/thoughtology/
🔗: mcgill-nlp.github.io/thoughtology/
🔗: mcgill-nlp.github.io/thoughtology/
> Follow @actinterp.bsky.social
> Website actionable-interpretability.github.io
@talhaklay.bsky.social @anja.re @mariusmosbach.bsky.social @sarah-nlp.bsky.social @iftenney.bsky.social
Paper submission deadline: May 9th!
🌐 sites.google.com/view/vlms4all
🌐 sites.google.com/view/vlms4all
Retrievers need to be aligned too! 🚨🚨🚨
Work done with the wonderful Nick and @sivareddyg.bsky.social
🔗 mcgill-nlp.github.io/malicious-ir/
Thread: 🧵👇
Retrievers need to be aligned too! 🚨🚨🚨
Work done with the wonderful Nick and @sivareddyg.bsky.social
🔗 mcgill-nlp.github.io/malicious-ir/
Thread: 🧵👇
See the thread below for more details on our new web agent safety benchmark: SafeArena and Agent Risk Assessment framework (ARIA).
To find out, we introduce SafeArena (safearena.github.io), a benchmark to assess the capabilities of web agents to complete harmful web tasks. A thread 👇
See the thread below for more details on our new web agent safety benchmark: SafeArena and Agent Risk Assessment framework (ARIA).
That's why we created SafeArena, a safety benchmark for web agents. See the thread and our paper for details: arxiv.org/abs/2503.04957 👇
That's why we created SafeArena, a safety benchmark for web agents. See the thread and our paper for details: arxiv.org/abs/2503.04957 👇
Check out our new Web Agents ∩ Safety benchmark: SafeArena!
Paper: arxiv.org/abs/2503.04957
Check out our new Web Agents ∩ Safety benchmark: SafeArena!
Paper: arxiv.org/abs/2503.04957
To find out, we introduce SafeArena (safearena.github.io), a benchmark to assess the capabilities of web agents to complete harmful web tasks. A thread 👇
To find out, we introduce SafeArena (safearena.github.io), a benchmark to assess the capabilities of web agents to complete harmful web tasks. A thread 👇
Human alignment balances social expectations, economic incentives, and legal frameworks. What if LLM alignment worked the same way?🤔
Our latest work explores how social, economic, and contractual alignment can address incomplete contracts in LLM alignment🧵
Human alignment balances social expectations, economic incentives, and legal frameworks. What if LLM alignment worked the same way?🤔
Our latest work explores how social, economic, and contractual alignment can address incomplete contracts in LLM alignment🧵
This work implements >500 evaluation tasks across >1000 languages and covers a wide range of use cases and domains🩺👩💻⚖️
Among the tasks i contributed, i'm most excited about the contextual web element retrieval task derived from weblinx, which i think is a crucial component for building web agents!
This work implements >500 evaluation tasks across >1000 languages and covers a wide range of use cases and domains🩺👩💻⚖️
Among the tasks i contributed, i'm most excited about the contextual web element retrieval task derived from weblinx, which i think is a crucial component for building web agents!
Work w/ fantastic advisors Dima Bahdanau and @sivareddyg.bsky.social
Thread 🧵:
Work w/ fantastic advisors Dima Bahdanau and @sivareddyg.bsky.social
Thread 🧵:
This work implements >500 evaluation tasks across >1000 languages and covers a wide range of use cases and domains🩺👩💻⚖️
This work implements >500 evaluation tasks across >1000 languages and covers a wide range of use cases and domains🩺👩💻⚖️
📢We're thrilled to announce REALM: The first Workshop for Research on Agent Language Models 🤖 #ACL2025NLP in Vienna 🎻
We have an exciting lineup of speakers
🗓️ Submit your work by *March 1st*
@aclmeeting.bsky.social
📢We're thrilled to announce REALM: The first Workshop for Research on Agent Language Models 🤖 #ACL2025NLP in Vienna 🎻
We have an exciting lineup of speakers
🗓️ Submit your work by *March 1st*
@aclmeeting.bsky.social