Xing Han Lu
@xhluca.bsky.social
680 followers 160 following 57 posts
👨‍🍳 Web Agents @mila-quebec.bsky.social 🎒 @mcgill-nlp.bsky.social
Reposted by Xing Han Lu
grvkamath.bsky.social
Our new paper in #PNAS (bit.ly/4fcWfma) presents a surprising finding—when words change meaning, older speakers rapidly adopt the new usage; inter-generational differences are often minor.

w/ Michelle Yang, @sivareddyg.bsky.social, @msonderegger.bsky.social, and @dallascard.bsky.social 👇 (1/12)
Reposted by Xing Han Lu
cesare-spinoso.bsky.social
A blizzard is raging through Montreal when your friend says “Looks like Florida out there!” Humans easily interpret irony, while LLMs struggle with it. We propose a 𝘳𝘩𝘦𝘵𝘰𝘳𝘪𝘤𝘢𝘭-𝘴𝘵𝘳𝘢𝘵𝘦𝘨𝘺-𝘢𝘸𝘢𝘳𝘦 probabilistic framework as a solution.
Paper: arxiv.org/abs/2506.09301 to appear @ #ACL2025 (Main)
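The post doesn't spell out the paper's model; purely as a rough illustration of what a rhetorical-strategy-aware probabilistic interpretation could look like, the sketch below marginalizes over latent strategies (literal vs. ironic). The strategy set, the probability functions, and the example numbers are all assumptions, not the paper's implementation.

```python
# Minimal sketch: interpret an utterance by marginalizing over latent
# rhetorical strategies. Everything here is illustrative, not the paper's model.

STRATEGIES = ["literal", "ironic"]

def interpret(utterance, context, p_strategy, p_meaning_given_strategy):
    """P(meaning | utterance, context) =
       sum_s P(meaning | utterance, s, context) * P(s | utterance, context)"""
    posterior = {}
    for s in STRATEGIES:
        weight = p_strategy(utterance, context, s)
        for meaning, p in p_meaning_given_strategy(utterance, context, s).items():
            posterior[meaning] = posterior.get(meaning, 0.0) + weight * p
    return posterior

# Toy example: a blizzard in Montreal plus "Looks like Florida out there!"
def p_strategy(utterance, context, s):
    # In a blizzard, the ironic reading should dominate (numbers are made up).
    return {"literal": 0.1, "ironic": 0.9}[s]

def p_meaning_given_strategy(utterance, context, s):
    if s == "ironic":
        return {"the weather is awful": 0.95, "the weather is nice": 0.05}
    return {"the weather is nice": 0.9, "the weather is awful": 0.1}

print(interpret("Looks like Florida out there!", "blizzard in Montreal",
                p_strategy, p_meaning_given_strategy))
```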
xhluca.bsky.social
"Build the web for agents, not agents for the web"

This position paper argues that rather than forcing web agents to adapt to UIs designed for humans, we should develop a new interface optimized for web agents, which we call the Agentic Web Interface (AWI).

arxiv.org/abs/2506.10953
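The post doesn't describe what an AWI would contain; purely as a hypothetical illustration of the position, the sketch below imagines an agent-facing interface that exposes structured state and explicit affordances instead of human-oriented HTML. All class names and fields are invented for this example.

```python
# Hypothetical illustration only: a machine-readable observation that a site
# could expose to agents, instead of rendered markup meant for humans.

from dataclasses import dataclass, field

@dataclass
class Affordance:
    action_id: str          # e.g. "add_to_cart"
    description: str        # natural-language description of the action
    params: dict = field(default_factory=dict)

@dataclass
class AgentObservation:
    page_state: dict        # structured state instead of raw HTML
    affordances: list       # actions the site explicitly exposes to agents

obs = AgentObservation(
    page_state={"product": "laptop", "price_usd": 999, "in_stock": True},
    affordances=[Affordance("add_to_cart", "Add this product to the cart",
                            {"quantity": "int"})],
)
print(obs)
```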
Reposted by Xing Han Lu
bennokrojer.bsky.social
Excited to share the results of my recent internship!

We ask 🤔
What subtle shortcuts are VideoLLMs taking on spatio-temporal questions?

And how can we instead curate shortcut-robust examples at a large scale?

We release: MVPBench

Details 👇🔬
Reposted by Xing Han Lu
ziling-cheng.bsky.social
Do LLMs hallucinate randomly? Not quite.

Our #ACL2025 (Main) paper shows that hallucinations under irrelevant contexts follow a systematic failure mode — revealing how LLMs generalize using abstract classes + context cues, albeit unreliably.

📎 Paper: arxiv.org/abs/2505.22630 1/n
xhluca.bsky.social
Without 🐦 and 🦋, are we left with LinkedIn?
Reposted by Xing Han Lu
mila-quebec.bsky.social
Congratulations to Mila members @adadtur.bsky.social , Gaurav Kamath and @sivareddyg.bsky.social for their SAC award at NAACL! Check out Ada's talk in Session I: Oral/Poster 6. Paper: arxiv.org/abs/2502.05670
Reposted by Xing Han Lu
karstanczak.bsky.social
Exciting release! AgentRewardBench offers that much-needed closer look at evaluating agent capabilities: automatic vs. human eval. Important findings here, especially on the popular LLM judges. Amazing work by @xhluca.bsky.social & team!
xhluca.bsky.social
AgentRewardBench: Evaluating Automatic Evaluations of Web Agent Trajectories

We are releasing the first benchmark to evaluate how well automatic evaluators, such as LLM judges, can evaluate web agent trajectories.
xhluca.bsky.social
We find that rule-based evals underreport success rates, and no single LLM judge excels across all benchmarks.
We collect trajectories from web agents built on four LLMs (Claude 3.7, GPT-4o, Llama 3.3, Qwen2.5-VL) across popular web benchmarks (AssistantBench, WebArena, VWA, WorkArena, WorkArena++).
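As a rough sketch of the kind of comparison such a benchmark enables (not the paper's actual pipeline or metric choices), one can score an automatic evaluator's success verdicts against human annotations of the same trajectories:

```python
# Illustrative only: compare an LLM judge's verdicts to human labels.
# Field names and the precision/recall summary are assumptions.

def judge_agreement(trajectories):
    """trajectories: list of dicts with 'human_success' and 'judge_success' booleans."""
    tp = sum(t["judge_success"] and t["human_success"] for t in trajectories)
    fp = sum(t["judge_success"] and not t["human_success"] for t in trajectories)
    fn = sum(not t["judge_success"] and t["human_success"] for t in trajectories)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return {"precision": precision, "recall": recall}

print(judge_agreement([
    {"human_success": True, "judge_success": True},
    {"human_success": False, "judge_success": True},
    {"human_success": True, "judge_success": False},
]))
```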
Reposted by Xing Han Lu
saravera.bsky.social
And thoughtology is now on Arxiv! Read more about R1 reasoning 🐋💭 across visual, cultural and psycholinguistic tasks at the link below:

🔗 arxiv.org/abs/2504.07128
xhluca.bsky.social
saravera.bsky.social
Models like DeepSeek-R1 🐋 mark a fundamental shift in how LLMs approach complex problems. In our preprint on R1 Thoughtology, we study R1’s reasoning chains across a variety of tasks; investigating its capabilities, limitations, and behaviour.
🔗: mcgill-nlp.github.io/thoughtology/
A circular diagram with a blue whale icon at the center. The diagram shows 8 interconnected research areas around LLM reasoning represented as colored rectangular boxes arranged in a circular pattern. The areas include: §3 Analysis of Reasoning Chains (central cloud), §4 Scaling of Thoughts (discussing thought length and performance metrics), §5 Long Context Evaluation (focusing on information recall), §6 Faithfulness to Context (examining question answering accuracy), §7 Safety Evaluation (assessing harmful content generation and jailbreak resistance), §8 Language & Culture (exploring moral reasoning and language effects), §9 Relation to Human Processing (comparing cognitive processes), §10 Visual Reasoning (covering ASCII generation capabilities), and §11 Following Token Budget (investigating direct prompting techniques). Arrows connect the sections in a clockwise flow, suggesting an iterative research methodology.
xhluca.bsky.social
DeepSeek-R1 Thoughtology: Let's <think> about LLM reasoning

142-page report diving into the reasoning chains of R1. It spans 9 unique axes: safety, world modeling, faithfulness, long context, etc.

Now on arxiv: arxiv.org/abs/2504.07128
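One elementary step in this kind of analysis, shown below as a hedged sketch rather than the report's methodology, is isolating the reasoning chain that R1 wraps in <think> tags and measuring its length; the parsing rule and the word-count metric are assumptions.

```python
# Sketch: split an R1-style output into its reasoning chain and final answer,
# then report a rough thought length. Not the report's actual analysis code.

import re

def split_reasoning(output: str):
    """Separate the <think>...</think> chain from the final answer."""
    match = re.search(r"<think>(.*?)</think>", output, flags=re.DOTALL)
    thought = match.group(1).strip() if match else ""
    answer = output[match.end():].strip() if match else output.strip()
    return thought, answer

thought, answer = split_reasoning(
    "<think>2 + 2 is 4, so the answer is 4.</think> The answer is 4."
)
print(len(thought.split()), "thought words |", answer)
```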
Reposted by Xing Han Lu
sivareddyg.bsky.social
Introducing the DeepSeek-R1 Thoughtology -- the most comprehensive study of R1 reasoning chains/thoughts ✨. Probably everything you need to know about R1 thoughts. If we missed something, please let us know.
Reposted by Xing Han Lu
mariusmosbach.bsky.social
Check out our new workshop on Actionable Interpretability @ ICML 2025. We are also looking forward to submissions that take a position on the future of interpretability research more broadly. 👇
megamor2.bsky.social
🎉 Our Actionable Interpretability workshop has been accepted to #ICML2025! 🎉
> Follow @actinterp.bsky.social
> Website actionable-interpretability.github.io

@talhaklay.bsky.social @anja.re @mariusmosbach.bsky.social @sarah-nlp.bsky.social @iftenney.bsky.social

Paper submission deadline: May 9th!
Reposted by Xing Han Lu
vlms4all.bsky.social
📢Excited to announce our upcoming workshop - Vision Language Models For All: Building Geo-Diverse and Culturally Aware Vision-Language Models (VLMs-4-All) @CVPR 2025!
🌐 sites.google.com/view/vlms4all
Reposted by Xing Han Lu
parishadbehnam.bsky.social
Instruction-following retrievers can efficiently and accurately search for harmful and sensitive information on the internet! 🌐💣

Retrievers need to be aligned too! 🚨🚨🚨

Work done with the wonderful Nick and @sivareddyg.bsky.social

🔗 mcgill-nlp.github.io/malicious-ir/
Thread: 🧵👇
Exploiting Instruction-Following Retrievers for Malicious Information Retrieval
Parishad BehnamGhader, Nicholas Meade, Siva Reddy
mcgill-nlp.github.io
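To illustrate the general setup of instruction-following retrieval (not the paper's models or its attack method), the sketch below scores documents against an instruction-prefixed query using a placeholder embedding function; every name and the scoring choice are hypothetical.

```python
# Sketch of instruction-following retrieval: the query is paired with a
# free-form instruction that steers what gets retrieved. `embed` is a stand-in
# for any instruction-following embedding model.

import numpy as np

def embed(text: str) -> np.ndarray:
    # Placeholder embedding: deterministic per-run random unit vector.
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.standard_normal(8)
    return v / np.linalg.norm(v)

def retrieve(instruction: str, query: str, corpus: list[str], k: int = 3):
    q = embed(instruction + " " + query)
    scored = sorted(corpus, key=lambda d: float(q @ embed(d)), reverse=True)
    return scored[:k]

docs = ["benign tutorial", "sensitive document", "unrelated news"]
print(retrieve("Retrieve passages that answer the request.", "example query", docs, k=2))
```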
Reposted by Xing Han Lu
spandanagella.bsky.social
Web agents powered by LLMs can solve complex tasks, but our analysis shows that they can also be easily misused to automate harmful tasks.

See the thread below for more details on our new web agent safety benchmark: SafeArena and Agent Risk Assessment framework (ARIA).
xhluca.bsky.social
Agents like OpenAI Operator can solve complex computer tasks, but what happens when users direct them to cause harm, e.g., to spread misinformation?

To find out, we introduce SafeArena (safearena.github.io), a benchmark to assess the capabilities of web agents to complete harmful web tasks. A thread 👇
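As a loose illustration of how results from such a benchmark might be summarized (this is not the ARIA framework, whose details aren't in the post), one could track how often an agent completes, refuses, or otherwise fails harmful tasks:

```python
# Illustrative only: aggregate outcomes of harmful-task attempts.
# The outcome labels and the simple rate summary are assumptions.

from collections import Counter

def summarize(outcomes):
    """outcomes: list of 'completed', 'refused', or 'failed' for harmful tasks."""
    counts = Counter(outcomes)
    total = len(outcomes)
    return {label: counts[label] / total for label in ("completed", "refused", "failed")}

print(summarize(["refused", "completed", "refused", "failed"]))
```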
Reposted by Xing Han Lu
karstanczak.bsky.social
The potential for malicious misuse of LLM agents is a serious threat.

That's why we created SafeArena, a safety benchmark for web agents. See the thread and our paper for details: arxiv.org/abs/2503.04957 👇
Reposted by Xing Han Lu
arkil.bsky.social
Llamas browsing the web look cute, but they are capable of causing a lot of harm!

Check out our new Web Agents ∩ Safety benchmark: SafeArena!

Paper: arxiv.org/abs/2503.04957
xhluca.bsky.social
WebArena by Zhou et al., as well as AgentLab and BrowserGym by @servicenow.bsky.social, allowed us to explore the latest agents; @gradio-hf.bsky.social enabled us to design UIs for implementing our ARIA framework; and @hf.co provided a hosting platform for 100GB+ of artifacts.

bsky.app/profile/xhlu...
xhluca.bsky.social
This work was done by an awesome team of authors: @adadtur.bsky.social, Nick, @arkil.bsky.social, @karstanczak.bsky.social, Esin, @spandanagella.bsky.social, and @sivareddyg.bsky.social.

It's also important to recognize the incredible works that helped us build SafeArena: