Lightnews — Scholar-powered news

Reposted by Hao Zhu 朱昊

Dirk Hovy @dirkhovy.bsky.social · May 3

We (w/ @diyiyang.bsky.social, @zhuhao.me, & Bodhisattwa Prasad Majumder) are excited to present our #NAACL25 tutorial on Social Intelligence in the Age of LLMs!
It will highlight long-standing and emerging challenges of AI interacting w/ humans, society & the world.
⏰ May 3, 2:00pm-5:30pm Room Pecos

Reposted by Hao Zhu 朱昊

Tomer Ullman @tomerullman.bsky.social · Mar 13

woooooo!

Out in Child Development:

"Learning Loopholes: The Development of Intentional
Misunderstandings in Children"

paper: srcd.onlinelibrary.wiley.com/doi/10.1111/...

preprint-pdf: www.tomerullman.org/papers/kids_...

Hao Zhu 朱昊 @zhuhao.me · Mar 11

This works like magic!

Nikhil Garg @nkgarg.bsky.social · Mar 10

*Please repost* @sjgreenwood.bsky.social and I just launched a new personalized feed (*please pin*) that we hope will become a "must use" for #academicsky. The feed shows posts about papers filtered by *your* follower network. It's become my default Bluesky experience bsky.app/profile/pape...

Hao Zhu 朱昊 @zhuhao.me · Mar 7

I have similar observations. But as a reviewer, I have to be honest that I cannot check each claim about previous papers, and these kinds of false references are often considered minor issues (they are not, really) compared to novelty or empirical results.

Reposted by Hao Zhu 朱昊

Chris Paxton @cpaxton.bsky.social · Mar 5

New personal project with my friend Michael Cho: RoboPapers, a podcast where we chat with authors of cool robotics papers and post the discussion on YouTube and Spotify. First one was with Duan Jiafei, who did the very cool paper SAM2Act, and it goes up Friday.

Reposted by Hao Zhu 朱昊

Danny To Eun Kim @teknology.bsky.social · Mar 5

🚨New Breakthrough in Tip-of-the-Tongue (TOT) Retrieval Research!

We address data limitations and offer a fresh evaluation method for these complex queries.

Curious how TREC TOT track test queries are created? Check out this thread 🧵 and our paper 📄: arxiv.org/abs/2502.17776

Tip of the Tongue Query Elicitation for Simulated Evaluation

Tip-of-the-tongue (TOT) search occurs when a user struggles to recall a specific identifier, such as a document title. While common, existing search systems often fail to effectively support TOT scena...

arxiv.org

Reposted by Hao Zhu 朱昊

Caleb Ziems @calebziems.com · Mar 4

EgoNormia (egonormia.org) exposes a major gap in Vision-Language Models' understanding of the social world: they don't know how to behave when norms about the physical world *conflict* ⚔️ (<45% acc.)

But humans are naturally quite good at this (>90% acc.)

Check it out!

➡️ arxiv.org/abs/2502.20490

Hao Zhu 朱昊 @zhuhao.me · Mar 4

thanks to Leena Mathur and Su Li for helping with collecting robotics videos.
thanks @michaelryan207.bsky.social @williamheld.com @echoshao8899.bsky.social @jyangballin.bsky.social @ellaminzhili.bsky.social @juliakruk.bsky.social @rewang.bsky.social @vidhijain.bsky.social @ybisk.me Dorsa Sadigh

Hao Zhu 朱昊 @zhuhao.me · Mar 4

Incredible collab with MohammadHossein Rezaei* (U of A) Yicheng Fu* Phil Cuvin* (U of T) @calebziems.com @yanzhe.bsky.social @diyiyang.bsky.social

Upvote our paper at huggingface.co/papers/2502....
arxiv.org/abs/2502.20490

Paper page - EgoNormia: Benchmarking Physical Social Norm Understanding

Join the discussion on this paper page

huggingface.co

Hao Zhu 朱昊 @zhuhao.me · Mar 4

As always, we open source everything. Even our nicely made website: egonormia.org Please check out the leaderboard, the blog (w/Bibtex support), the code, data, as well as a data viewer.

EgoNormia: A Benchmark for Visual Frontier Models' Normative Reasoning

A large scale video dataset and a benchmark for evaluating frontier models' understanding of physical social norms through videos.

egonormia.org

Hao Zhu 朱昊 @zhuhao.me · Mar 4

We are getting closer to having agents operating in the real physical world. However, can we trust frontier models to make embodied decisions 🎮 aligned with human norms 👩‍⚖️ ?

With EgoNormia, a 1.8k ego-centric video 🥽 QA benchmark, we show that this is surprisingly challenging!

Reposted by Hao Zhu 朱昊

Shikhar Murty @shikharmurty.bsky.social · Feb 6

Want to make a browser agent for *any* domain like banking or healthcare?
We propose methods for training LLMs with open-ended, unsupervised interaction on live websites:
✅ OSS SoTA on WebVoyager
✅ world's smallest high-performing web-agent
Try it here: nnetnav.dev

Hao Zhu 朱昊 @zhuhao.me · Feb 6

Visit nnetnav.dev for more examples, code, and data.

NNetNav - Unsupervised Browser Agents

Unsupervised learning for browser automation.

nnetnav.dev

Hao Zhu 朱昊 @zhuhao.me · Feb 6

The key insight is that LLMs are good at understanding whether a trajectory is doing something reasonable, which guides efficient exploration and gives accurate labels. Be warned that deploying exploration algorithms in the real world has consequences -- monitor your agents closely.

Hao Zhu 朱昊 @zhuhao.me · Feb 6

Ever dreamed of AI agents learning from unsupervised interaction with the open world? Our latest preprint introduces NNetNav-Live, which collects training data through exploration on real websites and hindsight labeling, producing a SOTA OSS agent.

Hao Zhu 朱昊 @zhuhao.me · Dec 10

Our awesome team: @ellaminzhili.bsky.social @williamheld.com @michaelryan207.bsky.social Kunat Pipatanakul, Potsawee Manakul @diyiyang.bsky.social

Hao Zhu 朱昊 @zhuhao.me · Dec 10

My first Bluesky post will be for my first project as a postdoc at Stanford.

Talk Arena is our first step towards building audio LMs into interactive agents. Try it out and let me know what you think. talkarena.org

Talk Arena

Interactive evaluation for audio models

talkarena.org

Reposted by Hao Zhu 朱昊

Will Held @williamheld.com · Dec 10

With an increasing number of Large *Audio* Models 🔊, which one do users like the most?

Introducing talkarena.org — an open platform where users speak to LAMs and receive text responses. Through open interaction, we focus on rankings based on user preferences rather than static benchmarks.
🧵 (1/5)

Talk Arena: Interactive Evaluation of Large Audio Models

Hao Zhu 朱昊 @zhuhao.me · Nov 22

matplotlib with customization. I can share the code with you

Hao Zhu 朱昊 @zhuhao.me · Nov 21

Would really appreciate it if I could be included. I build social intelligence models/agents that can cooperate with humans.

Hao Zhu 朱昊 @zhuhao.me · Nov 19

🙋‍♂️
