
Siqi Liu (刘思奇)

@liusiqi.bsky.social

63 followers 150 following 3 posts

Staff Research Engineer @ DeepMind


Siqi Liu (刘思奇) @liusiqi.bsky.social · 11d

We've got exciting (and unconventional) stuff cooking, and we are hiring a strong research engineer for the GDM Game Theory team in London.

Consider applying if you are interested in the intersection of game theory, multiagent systems and LLMs!
job-boards.greenhouse.io/deepmind/job...

Research Engineer, Game Theory & Multi-Agent Systems

London, UK


Siqi Liu (刘思奇) @liusiqi.bsky.social · Apr 18

Joint work with @drimgemp.bsky.social, @lukemarris.bsky.social, Georgios Piliouras, Nicolas Heess and @sharky6000.bsky.social.

Siqi Liu (刘思奇) @liusiqi.bsky.social · Apr 18

Frontier models are often compared on crowdsourced user prompts, but those prompts can be low-quality, biased and redundant, making "performance on average" hard to trust.

Come find us at #ICLR2025 to discuss game-theoretic evaluation (shorturl.at/0QtBj)! See you in Singapore!

Re-evaluating Open-Ended Evaluation of Large Language Models

A case study using the livebench.ai leaderboard.


Reposted by Siqi Liu (刘思奇)

Luke Marris @lukemarris.bsky.social · Apr 17

[🧵1/N] Thrilled to share our work "Re-evaluating Open-Ended Evaluation of Large Language Models"! 🚀 Popular LLM leaderboards (think Elo/Chatbot Arena) are useful, but are they telling the whole story? We find issues w/ redundancy & bias. 🤔
Paper @ ICLR 2025: arxiv.org/abs/2502.20170 #LLM #ICLR2025


Reposted by Siqi Liu (刘思奇)

Jeff Dean @jeffdean.bsky.social · Mar 25

🥁Introducing Gemini 2.5, our most intelligent model with impressive capabilities in advanced reasoning and coding.

Now integrating thinking capabilities, 2.5 Pro Experimental is our most performant Gemini model yet. It’s #1 on the LM Arena leaderboard. 🥇


Reposted by Siqi Liu (刘思奇)

Luke Marris @lukemarris.bsky.social · Feb 18

[🧵1/N] Please check out our new paper (arxiv.org/abs/2502.11645) on game-theoretic evaluation. It is the first method that results in clone-invariant ratings in N-player, general-sum interactions. Co-authors: @liusiqi.bsky.social, Ian Gemp, Georgios Piliouras, @sharky6000.bsky.social 🎉

Deviation Ratings: A General, Clone-Invariant Rating Method

Many real-world multi-agent or multi-task evaluation scenarios can be naturally modelled as normal-form games due to inherent strategic (adversarial, cooperative, and mixed motive) interactions. These...
