Guy Davidson
@guydav.bsky.social
960 followers 680 following 130 posts
@guyd33 on the X-bird site. PhD student at NYU, broadly cognitive science x machine learning, specifically richer representations for tasks and cognitive goals. Otherwise found cooking, playing ultimate frisbee, and making hot sauces.
guydav.bsky.social
Belated update #2: my year at Meta FAIR through the AIM program was so nice that I’m sticking around for the long haul.

I’m excited to stay at FAIR and work with @asli-celikyilmaz.bsky.social and friends on fun LLM questions; I’ll be working from the New York office, so we’re staying put.
guydav.bsky.social
Tune in tomorrow for belated update #2, on post-PhD plans!
guydav.bsky.social
I owe tremendous thanks to many other people, all (or, hopefully, at least most) of whom I mentioned in my acknowledgments. I’m also so grateful that my dad could represent my family, and to my wife, Sarah, for, well, everything.
guydav.bsky.social
Much, much larger thanks to my advisors, @brendenlake.bsky.social and @toddgureckis.bsky.social, for your guidance and mentorship over the last several years. I appreciate you so much, and this wouldn’t have looked the same without you!
guydav.bsky.social
Belated update #1: I defended my PhD about a month ago! I appreciate the warm reception from everyone who made it, in person and virtually. Thanks to my committee, @lerrelpinto.com, @togelius.bsky.social, and @markkho.bsky.social, for your feedback and fun questions.
guydav.bsky.social
Friends and virtual acquaintances! I’m defending my PhD tomorrow morning at 11:30 AM ET. If anyone would like to watch, let me know and I’ll send you the Zoom link (and if you’re in NYC and feel compelled to join in person, that works, too!)
guydav.bsky.social
Wherever good coffee is to be found, the rest of the time. Don't hesitate to reach out!

(also happy to talk about job search in industry and what that looks and feels like these days)
guydav.bsky.social
Today's Minds in the Making: Design Thinking and Cognitive Science Workshop (Pacific E):

minds-making.github.io
guydav.bsky.social
#CogSci2025 friends! I'm here all week and would love to chat. I'd particularly love to talk to anyone thinking about Theory of Mind and how to evaluate it better (in both minds and machines, in different settings and contexts), and about goals and their representations. Find me at:
guydav.bsky.social
Cool new work on localizing and removing concepts using attention heads from colleagues at NYU and Meta!
karen-ullrich.bsky.social
How would you make an LLM "forget" the concept of dog — or any other arbitrary concept? 🐶❓

We introduce SAMD & SAMI — a novel, concept-agnostic approach to identify and manipulate attention modules in transformers.
guydav.bsky.social
You (yes, you!) should work with Sydney! Either short-term this summer, or longer term at her nascent lab at NYU!
sydneylevine.bsky.social
🔆 I'm hiring! 🔆

There are two open positions:

1. Summer research position (best for master's or graduate student); focus on computational social cognition.
2. Postdoc (currently interviewing!); focus on computational social cognition and AI safety.

sites.google.com/corp/site/sy...
Sydney Levine - Open Positions
Summer Research Position I am seeking a part-time or full-time researcher for the summer (starting asap) to bring a project to completion. The project asks the question: do people around the world u...
sites.google.com
guydav.bsky.social
Fantastic new work by @johnchen6.bsky.social (with @brendenlake.bsky.social and me trying not to cause too much trouble).

We study systematic generalization in a safety setting and find LLMs struggle to consistently respond safely when we vary how we ask naive questions. More analyses in the paper!
johnchen6.bsky.social
Do LLMs show systematic generalization of safety facts to novel scenarios?

Introducing our work SAGE-Eval, a benchmark consisting of 100+ safety facts and 10k+ scenarios to test this!

- Claude-3.7-Sonnet passes only 57% of facts evaluated
- o1 and o3-mini pass <45%! 🧵
guydav.bsky.social
Finally, if this work makes you think "I'd like to work with this person," please reach out; I'm on the job market for industry post-PhD roles (keywords: language models, interpretability, open-endedness, user intent understanding, alignment).
See more: guydavidson.me
Guy Davidson
Guy Davidson's academic website
guydavidson.me
guydav.bsky.social
As with pretty much everything else I've worked on in grad school, this work would have looked different (and almost certainly worse) without the guidance of my advisors, @brendenlake.bsky.social and @toddgureckis.bsky.social. I continue to appreciate your thoughtful engagement with my work! 16/N
guydav.bsky.social
This work would also have been impossible without @adinawilliams.bsky.social's guidance, the freedom she gave me in picking a problem to study, and her belief that I could tackle it despite it being my first foray into (mechanistic) interpretability work. 15/N
guydav.bsky.social
We owe a great deal of gratitude to @ericwtodd.bsky.social, not only for open-sourcing their code, but also for answering our numerous questions over the last few months. If you find this interesting, you should also read their paper introducing function vectors. 14/N
guydav.bsky.social
See the paper for a description of the methods, the many different controls we ran, our discussion and limitations, examples of our instructions and baselines, and other odd findings (applying an FV twice can be beneficial! Some attention heads have negative causal effects!) 13/N
guydav.bsky.social
Finding 5 bonus: Which post-training steps facilitate this? Using the OLMo-2 model family, we find that the SFT and DPO stages each bring a jump in performance, but the final RLVR step doesn't make a difference for the ability to extract instruction FVs. 12/N
guydav.bsky.social
Finding 5: We can steer base models with instruction FVs extracted from their post-trained versions. We didn't expect this to work! It's less effective for the smaller, distilled Llama-3.2 models. We're also excited to dig into this and see how far we can push it. 11/N
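For a concrete picture of what this kind of FV steering looks like, here is a minimal sketch assuming a HuggingFace transformers causal LM. The model name, injection layer, and vector file below are placeholder assumptions for illustration, not the paper's actual setup.

```python
# Hedged sketch: steer a base model by adding a precomputed instruction
# function vector (FV) to the residual stream during generation.
# Assumptions: a Llama-style model and an FV saved as a (hidden_size,) tensor.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "meta-llama/Llama-3.2-1B"         # hypothetical base model
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

fv = torch.load("instruction_fv.pt")     # placeholder: FV extracted elsewhere
layer = 8                                # injection layer, a free parameter

def inject(module, args, output):
    hidden = output[0]                   # decoder layers return a tuple
    hidden[:, -1, :] += fv               # add the FV to the last token's residual stream
    return (hidden,) + output[1:]

handle = model.model.layers[layer].register_forward_hook(inject)
inputs = tok("cheese ->", return_tensors="pt")
print(tok.decode(model.generate(**inputs, max_new_tokens=5)[0]))
handle.remove()
```

In the function-vector framing of Todd et al., the steering vector comes from the mean outputs of the top causally implicated attention heads; the surprise in Finding 5 is that a vector extracted from the post-trained model still steers the base model.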
guydav.bsky.social
Finding 4: The relationship between demonstrations and instructions is asymmetric. Especially in post-trained models, the top attention heads for instructions appear peripherally useful for demonstrations, more so than the reverse (see paper for details). 10/N
guydav.bsky.social
We (preliminarily) interpret this as evidence that the effect of post-training is _not_ to adapt the model to represent instructions with the mechanism used for demonstrations, but to develop a mostly complementary mechanism. We're excited to dig into this further. 9/N
guydav.bsky.social
Finding 3 bonus: examining activations in the shared attention heads, we see (a) generally increased similarity with increasing model depth, and (b) no difference in similarity between base and post-trained models (circles and squares). 8/N
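To make that analysis concrete, here is a hypothetical sketch of the comparison behind this plot; the data structures (mean per-head activations keyed by (layer, head)) are assumptions for illustration, not our actual pipeline.

```python
# Hedged sketch: cosine similarity between instruction and demonstration
# activations in the shared attention heads, averaged per layer to show the
# trend across model depth. acts_inst / acts_demo: dict[(layer, head)] -> 1-D tensor.
from collections import defaultdict
import torch.nn.functional as F

def similarity_by_layer(acts_inst, acts_demo, shared_heads):
    by_layer = defaultdict(list)
    for layer, head in shared_heads:
        sim = F.cosine_similarity(acts_inst[(layer, head)],
                                  acts_demo[(layer, head)], dim=0)
        by_layer[layer].append(sim.item())
    # Average within each layer; plotting these against depth would give the
    # qualitative trend described above (similarity rising with depth).
    return {layer: sum(s) / len(s) for layer, s in sorted(by_layer.items())}
```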