Peng Qi
@qi2peng2.bsky.social
290 followers 43 following 79 posts
Multimodal Agents Research @ Orby AI. Ex-AWS AI, JD AI. PhD from @stanfordnlp.bsky.social, UG Tsinghua U. He/him. Opinions my own.
qi2peng2.bsky.social
How do we prove that #AI can't do #maths?

Real Mathematics (yes, "real" is a pun here):

a+b+c = (a+b)+c = a+(b+c)

AI Mathematics (well, floating point maths, really):

>>> 0.1+0.2+0.3
0.6000000000000001
>>> 0.1+(0.2+0.3)
0.6

QED.
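
A minimal sketch that reproduces the REPL session above as a script, assuming standard IEEE 754 double-precision floats (which CPython uses on common platforms). The variable names are illustrative only. It also shows two common ways to cope with the rounding, both from the Python standard library: comparing with a tolerance, and summing with `math.fsum`. Exact rationals are included to show that associativity holds once rounding is out of the picture.

```python
import math
from fractions import Fraction

# Same three numbers, grouped two different ways.
left = (0.1 + 0.2) + 0.3    # 0.1 + 0.2 rounds to 0.30000000000000004 first
right = 0.1 + (0.2 + 0.3)   # 0.2 + 0.3 happens to round to exactly 0.5

print(left)           # 0.6000000000000001
print(right)          # 0.6
print(left == right)  # False: float addition is not associative

# Compare with a tolerance instead of exact equality.
print(math.isclose(left, right))  # True

# math.fsum tracks the intermediate rounding error and rounds once at the end.
accurate = math.fsum([0.1, 0.2, 0.3])
print(accurate)       # 0.6

# With exact rational arithmetic, the grouping no longer matters.
a, b, c = Fraction(1, 10), Fraction(2, 10), Fraction(3, 10)
exact_left = (a + b) + c
exact_right = a + (b + c)
print(exact_left == exact_right)  # True
```

The rounding happens at every intermediate addition, so the grouping decides which intermediate result gets rounded, and that is where the two answers diverge.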
qi2peng2.bsky.social

This project was joint work with my Amazon colleagues (led by Yumo Xu), and it's great to see it finally published. Hope this helps motivate more careful eval work in the near future!

#AI #agent #evaluation #RAG #NLP
qi2peng2.bsky.social
b) as builders, we evaluate the technology soberly and help users navigate these risks in product design.

Want to learn more? Check out
Our paper: arxiv.org/pdf/2506.01829
Open-source code: github.com/amazon-scien...
qi2peng2.bsky.social
Why should you care? As businesses / individuals leverage AI more and more to speed up research and decision-making, it is important that, a) as users, we examine the tools we are using to understand their limitations and avoid pitfalls with significant potential downsides, and
qi2peng2.bsky.social
With a new, carefully annotated dataset and an automated evaluation metric we designed, we find that although LLMs are reasonably good at citing accurate sources most of the time, SOTA LLMs still cite incorrectly 5-28% of the time, and miss citations anywhere from 16% to an alarming 95% of the time.
qi2peng2.bsky.social
"𝐂𝐢𝐭𝐞𝐄𝐯𝐚𝐥: 𝐏𝐫𝐢𝐧𝐜𝐢𝐩𝐥𝐞-𝐃𝐫𝐢𝐯𝐞𝐧 𝐂𝐢𝐭𝐚𝐭𝐢𝐨𝐧 𝐄𝐯𝐚𝐥𝐮𝐚𝐭𝐢𝐨𝐧 𝐟𝐨𝐫 𝐒𝐨𝐮𝐫𝐜𝐞 𝐀𝐭𝐭𝐫𝐢𝐛𝐮𝐭𝐢𝐨𝐧", we propose a framework to systematically study citation accuracy by considering previously neglected contexts such as user-provided information and LLMs' parametric knowledge.
qi2peng2.bsky.social
As 🔎 AI deep research agents 🔎 become an essential part of many people's day-to-day work, it is more essential than ever before that we can trust what they produce.

When these agents cite sources they claim the report is based on, how much can we actually trust them? In our new #ACL2025 paper, ...
qi2peng2.bsky.social
a good problem in its historical context, some of my own research attempts at solving this problem that I believe are on the critical path to autonomous agents, and what's changed today to make the dataset less relevant in its original form. I also reflect on the possible paths forward for ...
Why You Should Stop Using HotpotQA for AI Agents Evaluation in 2025 | Peng Qi
We published HotpotQA, a groundbreaking multi-step question answering dataset in 2018, which has since motivated and facilitated numerous AI agent research works. But you should probably reconsider…
qipeng.me
qi2peng2.bsky.social
Seven years ago, I co-led a paper called 𝗛𝗼𝘁𝗽𝗼𝘁𝗤𝗔 that has motivated and facilitated many #AI #Agents research works since. Today, I'm asking that you stop using HotpotQA blindly for agents research in 2025 and beyond.

In my new blog post, I revisit the brief history of 𝗛𝗼𝘁𝗽𝗼𝘁𝗤𝗔, why it defined ...
qi2peng2.bsky.social
The longer employers don’t acknowledge and embrace this discrepancy, the faster they lose the top candidates they spent enormous efforts to hire and retain, leaving the organization in self-fulfilling mediocrity.
qi2peng2.bsky.social
but not perfectly aligned with those of the employer. They ask: Will I get the opportunity to build a career beyond what is immediately required of me? Will I learn and grow, be part of a great team and culture? Will I make a name for myself while doing great work? Will I remain competitive?
3/
qi2peng2.bsky.social
We are never settling for a candidate that does exactly the thing that needs to be done right now, since that thing itself can change before you know it.

But too often employers and managers forget that highly motivated and capable candidates also hold expectations parallel to these, ...
2/
qi2peng2.bsky.social
When making great #hiring decisions, we often look for growth potential in a candidate. Will they rise to the occasion when unforeseen challenges arise? Will they grow in the role, and lift up others in the team? Will they still be able to contribute if business direction changes?
1/
qi2peng2.bsky.social
While many aspects of our work (especially in the digital world) can be amenable to #AI #automation, it is also through automation that we continuously rediscover the true meaning of our work and our unique humanness.

#MondayReflection /fin
qi2peng2.bsky.social
this coding phase alone, and I ended up delivering something slightly better than I would've done without it.

As with any technological evolution, tools themselves never fully replace the humans doing the work, but greatly enhance the ones that embrace them and adapt to working with them. 6/
qi2peng2.bsky.social
But, of course, this has its implications. I did save a lot of time looking up programming resources on things I have a vague understanding of and wasn't very familiar with, and didn't have to type all those many characters. By my estimate, the AI assistant did save me 50-80% of the effort of 5/
qi2peng2.bsky.social
(especially to the general public) the fact that the act of putting code down is typically the *least* mentally effortful part of the work. It's as if saying "my 3D printer made 100% of my new shiny collections" -- true in the narrow sense of the printing effort, but it's missing the point. 4/
qi2peng2.bsky.social
how to fix things that aren't working (the AI helped a bit with this too at times), and how to keep things future-proof. In this regard, I still did >90% of the most important *work* in this project. Saying AI wrote 99% of the code, while factually correct in one particular sense (line count), obscures 3/
qi2peng2.bsky.social
AI with no manual edits from me, or that the AI assistant was the last to "touch" those lines of code.

What this 99% number oversimplifies is the amount of time my colleague and I spent in numerous offline discussions, the times I had to stop and think about what to ask the AI to code next, 2/
qi2peng2.bsky.social
In one of my recent projects, AI code assistants actually DID write 99% of my code, and the project was reasonably complex starting from scratch. Does this mean I'm obsolete now? Here's the catch: when I say AI wrote 99% of the code, I was counting roughly how many lines were directly generated by 1/
qi2peng2.bsky.social
#AI 𝘄𝗿𝗼𝘁𝗲 𝟵𝟵% 𝗼𝗳 𝗺𝘆 𝗰𝗼𝗱𝗲, 𝗻𝗼𝘄 𝘄𝗵𝗮𝘁?

Big tech executives and business analysts are racing to share eye-catching statements like "AI will write XX% of the code at MetaCorp by 20YY." How much truth is there to these, and what implications might this have?

🧵
qi2peng2.bsky.social
Non-native speakers sometimes have a unique advantage in language-based humor stemming from their unfamiliarity with idiomatic expressions. I saw an “assembly of god” on the road and thought to myself, “wait, they have a factory to build gods here?”
Reposted by Peng Qi
qi2peng2.bsky.social
Is #AI the new #RocketScience? In my new blog post, I explore the similarities and connections between the two seemingly distant relatives, and reflect on what today's AI scientists can learn from their rocket cousins, plus what makes AI science unique:
AI is the New Rocket Science | Peng Qi
AI science of today has astonishing similarities to rocket science in its prime days, if one pays close attention to history. What are some of these, and what can the history of rocket science tell…
qipeng.me