Kenneth Marino
@kennethmarino.bsky.social
2.9K followers 740 following 75 posts
Assistant Prof at University of Utah Fall 2025. NLP+CV+RL. RS at Google DeepMind. PhD from CMU MLD, undergrad Georgia Tech. Sometimes researcher, frequent shitposter.
kennethmarino.bsky.social
We hope this survey is useful and fun for the community! We couldn’t include everything, but tried to at least give a good overview of the field. Happy to hear feedback and if you think we messed something up, feel free to DM or email me.
kennethmarino.bsky.social
There’s a lot of great stuff in here we think! We cite over 100 papers and websites. One thing I am very happy about is how easy it is to follow links in our survey to the bibliography which then links to the papers directly.
kennethmarino.bsky.social
Then we talk about the LLM-Agent approaches and try to explain and make some sense of the many components that make up an LLM-based Computer Use Agent.
kennethmarino.bsky.social
We then spend a lot of time looking at the different earlier (pre-LLM) approaches to the problem, including the RL-from-scratch period and even the very earliest planning-based approaches.
kennethmarino.bsky.social
We try to categorize all the environments and datasets in common use and let users click/filter and browse through each of the datasets.
kennethmarino.bsky.social
First, we try to ground our survey, say what we even mean by “Computer Use” and define some key terms, grounded in the classical agent-environment framework.
kennethmarino.bsky.social
You can view the survey here: kennethmarino.com/computeruse/...
We tried to make it as interactive and fun as possible, including a retro DOS theme to go along with the subject.
Credit to Claude for helping me create the website :)
kennethmarino.bsky.social
Super excited that the Computer Use survey I've been working on w/ @anamarasovic.bsky.social for a while now is ready! Originally we were planning on a more traditional survey paper but as more surveys came out we decided on an interactive website survey.
Reposted by Kenneth Marino
anamarasovic.bsky.social
Arriving to #ACL2025 #ACL2025NLP in a few hours!

See you at the welcome reception & catch me at the poster session on 𝐓𝐮𝐞𝐬𝐝𝐚𝐲 (𝐉𝐮𝐥𝐲 𝟐𝟗) 𝐚𝐭 𝟏𝟎:𝟑𝟎𝐚𝐦, where Jesse will present our work introducing new tasks for supporting legal brief writing: arxiv.org/abs/2506.06619
kennethmarino.bsky.social
I can’t find it but my favorite was when someone asked ChatGPT to set an alarm for them and it pretended to set one and the person missed their important meeting
kennethmarino.bsky.social
Also, this is my first paper (hopefully of many) with my
@utah.edu colleagues! Feel very welcomed so far and really excited about the things we'll be able to do together. And we just had another great hiring year with several new colleagues, so expect lots of exciting stuff soon!
kennethmarino.bsky.social
Read Fateme's full thread, but what I find interesting about the paper is that LLMs are already pretty good at summarization, but are still quite bad at finding relevant cases. With many retrieval benchmarks becoming saturated, I think this is an exciting place for new work!
kennethmarino.bsky.social
Really excited about this!

As backstory, Jesse Woo started this project when I taught a ML Datasets class at Columbia.

Then we joined up with @anamarasovic.bsky.social and @fatemehc.bsky.social and really kicked it into high gear. Would not have happened without the full team!
fatemehc.bsky.social
1/ 🚨NEW PAPER: "BriefMe: A Legal NLP Benchmark for Assisting with Legal Briefs", accepted to ACL Findings 2025!
We introduce the first benchmark specifically designed to help LLMs assist lawyers in writing legal briefs 🧑‍⚖️

📄 arxiv.org/abs/2506.06619
🗂️ huggingface.co/datasets/jw4...
Reposted by Kenneth Marino
fgvcworkshop.bsky.social
Join us on June 11, 9am to discuss all things fine-grained!
We are looking forward to a series of talks on semantic granularity, covering topics such as machine teaching, interpretability and much more!
Room 104 E
Schedule & details: sites.google.com/view/fgvc12
@cvprconference.bsky.social #CVPR25
Reposted by Kenneth Marino
fgvcworkshop.bsky.social
We are so excited to have this amazing line-up of speakers!!
Randall Balestriero, Kai Han, Mia Chiquier, Kenneth Marino (@kennethmarino.bsky.social), Elisa Ricci, Thomas Fel (@thomasfel.bsky.social)
kennethmarino.bsky.social
We just dropped a new paper on studying LLMs on the “Blicket Test” to ask the question: do language models explore like adults or like children? We also show how to get them to act more like children (i.e. more like scientists). All credit to Anthony and team, this came together super well!
agx-chen.bsky.social
Language model (LM) agents are all the rage now—but they may exhibit cognitive biases when inferring causal relationships!

We evaluate LMs on a cognitive task to find:
- LMs struggle with certain simple causal relationships
- They show biases similar to human adults (but not children)

🧵⬇️
Example of the Blicket Test experiment. A subset of objects activate the machine following an unobserved rule ("disjunctive" / "conjunctive"). The agent needs to interact with the environment by placing objects on/off the machine to figure out the rule.
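To make the setup above concrete, here is a minimal sketch of the Blicket Test loop. Everything here is hypothetical illustration, not the paper's actual code: object names, the `machine_activates` function, and the rule encodings are assumptions based only on the description in the post ("disjunctive" = any blicket activates the machine, "conjunctive" = all blickets must be present).

```python
import itertools

def machine_activates(on_machine, blickets, rule):
    """Hypothetical sketch of the Blicket machine's unobserved rule.
    on_machine / blickets are sets of object names."""
    if rule == "disjunctive":
        # Any single blicket on the machine is enough to activate it.
        return len(on_machine & blickets) >= 1
    if rule == "conjunctive":
        # Every blicket must be on the machine at once.
        return blickets <= on_machine
    raise ValueError(f"unknown rule: {rule}")

# The agent explores by trying subsets of objects and observing activations.
objects = {"A", "B", "C"}
hidden_blickets = {"A", "B"}          # unknown to the agent
for r in range(len(objects) + 1):
    for subset in itertools.combinations(sorted(objects), r):
        lit = machine_activates(set(subset), hidden_blickets, "conjunctive")
        print(sorted(subset), "->", lit)
```

The interesting behavioral question in the paper is which subsets an agent chooses to try: exhaustive or targeted exploration distinguishes rules quickly, while a biased explorer may settle on a disjunctive hypothesis too early.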
kennethmarino.bsky.social
Really glad you like the paper! Anthony and team did a great job on this.
kennethmarino.bsky.social
Are you tired of your static, fixed benchmarks? Feel like your data is in a rut? You want to change something but you just feel stuck? Try ReCogLab!

Really proud of this work and of my fantastic colleagues at Google DeepMind who put in so much hard work.

See you all in Singapore!
neurokim.bsky.social
Want to procedurally generate large-scale relational reasoning experiments in natural language, to study human psychology 🧠 or eval LLMs 🤖?

We have a tool for you! Our latest #ICLR work on long-context/relational reasoning evaluation for LLMs ReCogLab!
github.com/google-deepm...

Thread ⬇️
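As a flavor of what "procedurally generating relational reasoning experiments in natural language" can look like, here is a small illustrative sketch. This is my own toy example, not ReCogLab's API: the function name, the "taller than" relation, and the premise templates are all assumptions; see the linked GitHub repo for the real tool.

```python
import random

def make_chain(names, seed=0):
    """Toy sketch: sample a hidden linear order over entities, emit
    natural-language premises along the chain, and ask one comparison
    question whose answer follows transitively from the premises."""
    rng = random.Random(seed)          # seeded for reproducible generation
    order = list(names)
    rng.shuffle(order)                 # hidden ground-truth ordering
    premises = [f"{a} is taller than {b}." for a, b in zip(order, order[1:])]
    a, b = rng.sample(order, 2)        # pick two distinct entities to compare
    answer = "yes" if order.index(a) < order.index(b) else "no"
    question = f"Is {a} taller than {b}?"
    return premises, question, answer

premises, question, answer = make_chain(["Ann", "Bob", "Cara", "Dev"], seed=7)
print("\n".join(premises))
print(question, "->", answer)
```

Scaling the number of entities stretches the context length, and varying which pair is queried controls how many transitive hops the answer requires, which is the kind of knob a procedural generator exposes for both LLM evals and human experiments.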
kennethmarino.bsky.social
You don’t know me man. Get off your high horse. Blocking you now
kennethmarino.bsky.social
I literally do none of those things. I don’t work in any of these areas. I think you need to step back and ask why you’re fighting random researchers who don’t decide these things instead of the people you actually seem mad at
kennethmarino.bsky.social
?????
I post about AI papers, what on Earth are you talking about?
kennethmarino.bsky.social
People who actually believe in the promise of AI should be the most upset about the over-claiming, over-hyping, overt secrecy, and unwillingness to expose work to scrutiny that have come to characterize much of the “feel the AGI” crowd.
kennethmarino.bsky.social
This field is literally so old that there was famously a report calling it overhyped, the 1973 Lighthill Report, which caused funding to plummet. We’ve literally already gone through at least a few hype cycles.
kennethmarino.bsky.social
This is why open source and publishing is important. Maybe OpenAI didn’t do anything sus with held out splits. But if code and models are never released and the experiments and methods are not published or described in sufficient detail, we can’t reproduce it or scrutinize any of these decisions.
sungkim.bsky.social
This is just a reminder that training on test data is all you need to achieve SOTA perf

OpenAI had access to all of the FrontierMath data from the beginning, but they verbally agreed that the data would not be used in model training. There was, however, a legal agreement not to disclose the partnership.
kennethmarino.bsky.social
Just read a fantastic web agent paper. Game changer!

* Treats it as an RL problem
* Trains rather than just prompting
* Beats closed models
* Releases code and model so other people can build off of their work

Many great ideas in this paper too, definitely read

arxiv.org/pdf/2411.02337