Hao Zhu 朱昊
@zhuhao.me
AI researcher. Postdoc at Stanford NLP. Prev: PhD CMU LTI. Visit https://zhuhao.me Raising agents in the Opensocial.world
Pinned
zhuhao.me
We are getting closer to having agents operating in the real physical world. However, can we trust frontier models to make embodied decisions 🎮 aligned with human norms 👩‍⚖️?

With EgoNormia, a 1.8k ego-centric video 🥽 QA benchmark, we show that this is surprisingly challenging!
Reposted by Hao Zhu 朱昊
dirkhovy.bsky.social
We (w/ @diyiyang.bsky.social, @zhuhao.me, & Bodhisattwa Prasad Majumder) are excited to present our #NAACL25 tutorial on Social Intelligence in the Age of LLMs!
It will highlight long-standing and emerging challenges of AI interacting w humans, society & the world.
⏰ May 3, 2:00pm-5:30pm Room Pecos
Reposted by Hao Zhu 朱昊
tomerullman.bsky.social
woooooo!

Out in Child Development:

"Learning Loopholes: The Development of Intentional Misunderstandings in Children"

paper: srcd.onlinelibrary.wiley.com/doi/10.1111/...

preprint-pdf: www.tomerullman.org/papers/kids_...
zhuhao.me
This works like magic!
nkgarg.bsky.social
*Please repost* @sjgreenwood.bsky.social and I just launched a new personalized feed (*please pin*) that we hope will become a "must use" for #academicsky. The feed shows posts about papers filtered by *your* follower network. It's become my default Bluesky experience bsky.app/profile/pape...
zhuhao.me
I have similar observations. But as a reviewer, I have to be honest that I cannot check every claim about previous papers, and these kinds of false references are often treated as minor issues (they are not, really) compared to novelty or empirical results.
Reposted by Hao Zhu 朱昊
cpaxton.bsky.social
New personal project with my friend Michael Cho: RoboPapers, a podcast where we chat with authors of cool robotics papers and post the discussion on YouTube and Spotify. First one was with Duan Jiafei, who did the very cool paper SAM2Act, and it goes up Friday.
Reposted by Hao Zhu 朱昊
teknology.bsky.social
🚨New Breakthrough in Tip-of-the-Tongue (TOT) Retrieval Research!

We address data limitations and offer a fresh evaluation method for these complex queries.

Curious how TREC TOT track test queries are created? Check out this thread 🧵 and our paper 📄: arxiv.org/abs/2502.17776
Tip of the Tongue Query Elicitation for Simulated Evaluation
Tip-of-the-tongue (TOT) search occurs when a user struggles to recall a specific identifier, such as a document title. While common, existing search systems often fail to effectively support TOT scena...
arxiv.org
Reposted by Hao Zhu 朱昊
calebziems.com
EgoNormia (egonormia.org) exposes a major gap in Vision-Language Models' understanding of the social world: they don't know how to behave when norms about the physical world *conflict* ⚔️ (<45% acc.)

But humans are naturally quite good at this (>90% acc.)

Check it out!

➡️ arxiv.org/abs/2502.20490
zhuhao.me
As always, we open source everything, even our nicely made website: egonormia.org. Please check out the leaderboard, the blog (w/ BibTeX support), the code, the data, and the data viewer.
EgoNormia: A Benchmark for Visual Frontier Models' Normative Reasoning
A large scale video dataset and a benchmark for evaluating frontier models' understanding of physical social norms through videos.
egonormia.org
Reposted by Hao Zhu 朱昊
shikharmurty.bsky.social
Want to make a browser agent for *any* domain like banking or healthcare?
We propose methods for training LLMs with open-ended, unsupervised interaction on live websites:
✅ OSS SoTA on WebVoyager
✅ world's smallest high-performing web-agent
Try it here: nnetnav.dev
zhuhao.me
The key insight is that LLMs are good at judging whether a trajectory is doing something reasonable, which guides efficient exploration and gives accurate labels. Be warned that deploying exploration algorithms in the real world has consequences -- monitor your agents closely.
zhuhao.me
Ever dreamed of AI agents learning by interacting with the open world without supervision? Our latest preprint introduces NNetNav-Live, which collects training data through exploration on real websites and hindsight labeling, and produces a SOTA OSS agent.
zhuhao.me
My first Bluesky post will be for my first project as a postdoc at Stanford.

Talk Arena is our first step towards building audio LMs into interactive agents. Try it out and let me know what you think. talkarena.org
Talk Arena
Interactive evaluation for audio models
talkarena.org
Reposted by Hao Zhu 朱昊
williamheld.com
With an increasing number of Large *Audio* Models 🔊, which one do users like the most?

Introducing talkarena.org — an open platform where users speak to LAMs and receive text responses. Through open interaction, we focus on rankings based on user preferences rather than static benchmarks.
🧵 (1/5)
Talk Arena: Interactive Evaluation of Large Audio Models
zhuhao.me
matplotlib with customization. I can share the code with you
zhuhao.me
Would really appreciate it if I could be included. I build social intelligence models/agents that can cooperate with humans.
zhuhao.me
🙋‍♂️