Wayne
@waynechi.bsky.social
CS Ph.D. at CMU. Building Copilot Arena. Editor at http://blog.ml.cmu.edu
Reposted by Wayne
chrisdonahue.com
Inaugurating new acct to share work from my PhD student!

Wayne et al. have been running a live eval platform, Copilot Arena: a VSCode extension serving code completions from AI systems to real developers. See 🧵 for findings and preprint

Excited to be evaluating human-AI *workflows* holistically!
waynechi.bsky.social
What do developers 𝘳𝘦𝘢𝘭𝘭𝘺 think of AI coding assistants?

In October, we launched Copilot Arena to collect user preferences on real dev workflows. After months of live service, we’re here to share our findings in our recent preprint.

Here's what we have learned /🧵
waynechi.bsky.social
Full Paper with additional analyses: arxiv.org/abs/2502.09328
Code: github.com/lmarena/copi...

w/ Valerie Chen, Anastasios Nikolas Angelopoulos, Wei-Lin Chiang, Aditya Mittal, Naman Jain, Tianjun Zhang, Ion Stoica, @chrisdonahue.com, @atalwalkar.bsky.social
waynechi.bsky.social
Our paper analyzes human preferences across 10 SOTA coding models, but we continue to add more models to the live Copilot Arena leaderboard on lmarena.ai!
waynechi.bsky.social
Different data slices affect user preferences disproportionately. Relative model performance differs drastically between real-world tasks, such as frontend or backend development, and LeetCode-style coding challenges, but differs little across programming languages.
waynechi.bsky.social
We attribute these differences to a significant shift in our data distribution. Compared to previous benchmarks, Copilot Arena observes more programming languages (PL), natural languages (NL), longer context lengths, multiple task types, and various code structures.
waynechi.bsky.social
Our leaderboard differs from existing evaluations. In particular, smaller models overperform on static benchmarks compared to real development workflows.
waynechi.bsky.social
We evaluate models in a developer's IDE by presenting pairs of code completions generated by two different models. This workflow evaluates human preferences across models with real users and tasks in their native environment.
waynechi.bsky.social
Got to test out InceptionAILabs' newest model, Mercury Coder Mini, on Copilot Arena!

Mercury Coder Mini is blazing fast and overtakes Codestral as the fastest coding model out there (0.24s end-to-end latency) while boasting similar performance (joint #2).

Congrats to InceptionAILabs! 📸
waynechi.bsky.social
I had the same problem. I only use cursor for newer, small projects. I use Copilot Arena's edit feature for projects in VSCode (but obviously I'm biased)
kylelo.bsky.social
tried switching to cursor and having extreme difficulty getting all my vscode extensions to work properly ☹️ doesn’t seem worth
waynechi.bsky.social
Deepseek v3 (FiM) is now available in Copilot Arena for free!

Download at lmarena.ai/copilot
waynechi.bsky.social
These lists are better than most "2024's best games" lists
hdkirin.bsky.social
This week's Famitsu had a lot of Japanese gaming industry folks give their personal Game of the Year lists. I'll update this thread periodically since there's a lot of them.
waynechi.bsky.social
Copilot Arena's leaderboard is now live on lmarena.ai/leaderboard!

We've collected over 15k votes on 11 models (2 new models since our last blog post release). Congrats @deepseek.bsky.social 🥇 and @anthropic.com 🥇!
waynechi.bsky.social
I'm not physically at NeurIPS, but my good friend @naveenraman.bsky.social will be presenting in my stead.

In this work, we found that UI element ordering significantly affected GUI agent performance. Come check out the poster (and quiz Naveen) at the Workshop on Open-World Agents (OWA-2024)!
waynechi.bsky.social
Bruh what... 💀
waynechi.bsky.social
We've open-sourced Copilot Arena’s server code!

Check out how we handle code completions and share your ideas for new system prompts!

GitHub: github.com/lmarena/copi...
Technical details in the blog: blog.lmarena.ai/blog/2024/co...

Download Copilot Arena now at: lmarena.ai/copilot
waynechi.bsky.social
Trying out Bluesky. Will mostly be posting about Copilot Arena!