Taylor Sorensen
@taylor-sorensen.bsky.social
230 followers 250 following 33 posts
NLP PhD Candidate at UW
taylor-sorensen.bsky.social
Also check out these very cool related papers exploring diversity and mode collapse!
arxiv.org/pdf/2505.00047 from
Peter West et al.

arxiv.org/abs/2504.05228, arxiv.org/abs/2404.10859 by Yiming Zhang +
Daphne Ippolito et al.

arxiv.org/abs/2510.01171 by
Jiayi Zhang et al.
taylor-sorensen.bsky.social
In new work, we introduce a simple post-training method and large-scale resource for maximizing diversity and coverage! We call it Spectrum Tuning.
More on this in the coming days - but I'm really excited about this work, and am so happy that it's now public
taylor-sorensen.bsky.social
Pretrained models are better at this - they actually give you substantively different outputs when you sample. BUT, they are unable to reliably follow instructions.
How can we train models to follow instructions AND to span the space of possible outputs?
taylor-sorensen.bsky.social
This may seem like a silly toy example - shouldn’t we just use np.random.randint()?
Fair - but this simple case is illustrative of a broader weakness. What about creative writing? Or hypothesis generation? Or diverse data generation?
We need models that SPAN the entire output space.
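As an illustration (not from the paper), here is a minimal sketch of how one might quantify whether a model spans the output space on the random-number toy task: sample the model many times, then measure how much of the support it covers and how much entropy its outputs carry. The `samples` list below is a hypothetical stand-in for parsed model outputs.

```python
import math
from collections import Counter

def coverage_and_entropy(samples, support_size):
    """Fraction of possible outputs actually produced, and entropy (bits) of the samples."""
    counts = Counter(samples)
    total = len(samples)
    probs = [c / total for c in counts.values()]
    coverage = len(counts) / support_size
    entropy = -sum(p * math.log2(p) for p in probs)
    return coverage, entropy

# Hypothetical stand-in for 100 parsed responses to
# "Pick a random number between 1 and 10" from a mode-collapsed model.
samples = [7] * 92 + [3, 4, 7, 7, 8, 7, 7, 9]
print(coverage_and_entropy(samples, support_size=10))
# -> (0.5, ~0.32 bits); a uniform sampler would approach 1.0 coverage and ~3.32 bits.
```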
taylor-sorensen.bsky.social
Current post-training teaches a model to output the highest-reward answer, even if there are other good answers. E.g., when picking random numbers, 7 seems like the most “random” number to annotators - so models ALWAYS pick 7!
arxiv.org/pdf/2505.00047
arxiv.org/pdf/2203.02155
arxiv.org/pdf/2510.01171
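A toy simulation of this collapse dynamic (my own illustration, not from the posts or the papers linked above): start from a roughly uniform pretrained-style distribution over the digits 0-9, give 7 a marginally higher annotator reward, and repeatedly tilt the policy toward higher reward, as reward-maximizing post-training effectively does. Even a tiny reward gap compounds until almost all probability mass lands on 7.

```python
import numpy as np

# Pretrained-style policy: roughly uniform over the digits 0-9.
policy = np.full(10, 0.1)

# Annotator reward: every digit is acceptable, but 7 "feels" slightly more random.
reward = np.full(10, 1.0)
reward[7] = 1.1

# Repeated exponentiated-reward tilts (a crude stand-in for reward-maximizing updates).
beta = 2.0
for _ in range(50):
    policy = policy * np.exp(beta * reward)
    policy /= policy.sum()

print(np.round(policy, 4))  # nearly all probability mass is now on index 7
```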
taylor-sorensen.bsky.social
Did you know that LLMs suffer from serious mode collapse?

For example, if you ask models to tell you a joke, they almost always tell you the same joke. This is true across samples and even across model families!

Why does this happen? Can we improve it?
taylor-sorensen.bsky.social
Others can disagree, but in my view it's important to understand the science behind diverse viewpoint modeling, both for understanding potential risks and for building prosocial systems. AI alignment in particular is a case where, if we aren't careful, it will be easy for people to be left behind.
taylor-sorensen.bsky.social
That is a risk to personalization technologies like these 😬 However, I'm excited about AI's potential to _reduce_ polarization ( www.pnas.org/doi/full/10....), find common ground (www.science.org/doi/10.1126/...), and help people gain sympathy by exploring others' perspectives (ongoing work)
taylor-sorensen.bsky.social
That being said, I'm in total agreement that there are absolutely risks to the technology as well, and figuring out which technologies to deploy where, and in what way, will be very important.
taylor-sorensen.bsky.social
Additionally, I think steerability to diverse perspectives becomes even more important as AI systems start having more autonomy (e.g. AI agents). I want an AI agent that knows _my_ perspective, not just the average one!
taylor-sorensen.bsky.social
That being said, it's my belief that all model responses already have _a_ worldview associated with them - so personally, I think it's important to have systems where a) we can measure/explicitly see what perspectives they are being aligned to, so that b) many people's perspectives can be included.
taylor-sorensen.bsky.social
Apologies if this wasn't clear! They're provided textually in an in-context prompt, which the model then tries to steer towards.

And yes, you are absolutely right - that's one of the risks of personalization in general (see a great paper here: arxiv.org/pdf/2303.05453)
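For context on the mechanics (a sketch of my own, not the paper's exact prompt or API): a value profile can be included textually in the prompt, and the model asked to judge an item from that rater's perspective. The profile text, template, and `query_model` call below are all hypothetical.

```python
# Hypothetical value profile and prompt template for steering a judgment in-context.
VALUE_PROFILE = (
    "This rater places high weight on personal autonomy and free expression, "
    "and comparatively low weight on deference to authority."
)

def build_prompt(value_profile: str, statement: str) -> str:
    return (
        "Rate the statement below from the perspective of a rater with this value profile:\n\n"
        f"{value_profile}\n\n"
        f"Statement: {statement}\n"
        "How much would this rater agree, from 1 (strongly disagree) to 5 (strongly agree)? "
        "Answer with a single number."
    )

prompt = build_prompt(VALUE_PROFILE, "Schools should require uniforms.")
# rating = query_model(prompt)  # hypothetical LLM call
print(prompt)
```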
Reposted by Taylor Sorensen
lasha.bsky.social
Want to know what training data has been memorized by models like GPT-4?

We propose information-guided probes, a method to uncover memorization evidence in *completely black-box* models,

without requiring access to
🙅‍♀️ Model weights
🙅‍♀️ Training data
🙅‍♀️ Token probabilities 🧵 (1/5)
Information-Guided Identification of Training Data Imprint in (Proprietary) Large Language Models
taylor-sorensen.bsky.social
This was my Google DeepMind internship work with amazing collaborators Pushkar Mishra, @romapatel.bsky.social, Michael Henry Tessler, @mbakker.bsky.social, Georgina Evans, Iason Gabriel, @noahdgoodman.bsky.social, @verenarieser.bsky.social
They are a really amazing team!
taylor-sorensen.bsky.social
Value profiles enable new ways to model variation and support representation at the individual level. We hope that our work helps enable systems that better model diverse perspectives and that work for everyone (yay for #pluralisticalignment #nlpforsocialgood #compsocialscience #compdemocracy!)
taylor-sorensen.bsky.social
There are benefits though!
✅ Value profiles may enhance user agency, as a person could change their own value profile
✅ They enable value reflection via bottom-up discovery and top-down editing
✅ Unlike sociodemographics, which are often unchosen, people can choose values for themselves
(16/?)
taylor-sorensen.bsky.social
Our goal in this work is to improve AI systems' ability to model diverse perspectives and better serve more people! However, risks remain:
❌ Privacy risks: people may not wish to have their values inferred
❌ Systems may fail to generalize to less common values, and we only test on English-language data
(15/?)
taylor-sorensen.bsky.social
As a last experiment, we simulate an annotator population ("jury learning", @mitchellgordon) with our trained models and value profiles.

We find that the instance-level interannotator agreement (IAA) predicted by our simulated population correlates with the observed IAA.
(14/?)
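For readers who want the shape of this check (a minimal sketch of my own, not the paper's evaluation code): compute per-instance pairwise agreement for both the observed annotators and the simulated population, then correlate the two. The label matrices below are hypothetical.

```python
import numpy as np
from itertools import combinations

def pairwise_agreement(labels):
    """Fraction of annotator pairs that gave the same label to a single instance."""
    pairs = list(combinations(labels, 2))
    return sum(a == b for a, b in pairs) / len(pairs)

def instance_iaa(label_matrix):
    """Per-instance agreement for a (num_instances x num_annotators) label matrix."""
    return np.array([pairwise_agreement(row) for row in label_matrix])

# Hypothetical labels: rows are instances, columns are (simulated) annotators.
observed  = np.array([[1, 1, 1, 2], [1, 2, 3, 2], [2, 2, 2, 2]])
simulated = np.array([[1, 1, 2, 1], [3, 2, 1, 1], [2, 2, 2, 1]])

obs_iaa = instance_iaa(observed)
sim_iaa = instance_iaa(simulated)
print(np.corrcoef(obs_iaa, sim_iaa)[0, 1])  # correlation between predicted and observed IAA
```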
taylor-sorensen.bsky.social
We also find that our value profile system is very well-calibrated.

This calibration is important for trusting the model's confidence and for disentangling value-related epistemic uncertainty from aleatoric uncertainty in rater variation.
(13/?)
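For readers unfamiliar with calibration, here is a generic sketch (not the paper's evaluation code) of one standard way to measure it, expected calibration error: bin predictions by confidence and compare each bin's average confidence to its empirical accuracy. The confidences and correctness flags below are hypothetical.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: weighted average gap between mean confidence and accuracy per confidence bin."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            ece += mask.mean() * abs(confidences[mask].mean() - correct[mask].mean())
    return ece

# Hypothetical predicted confidences and whether each prediction matched the rater's label.
confs = np.array([0.9, 0.8, 0.65, 0.95, 0.55, 0.7])
hits  = np.array([1,   1,   0,    1,    1,    1])
print(expected_calibration_error(confs, hits))
```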
taylor-sorensen.bsky.social
Value profiles are written in natural language. But are they actually semantically interpretable? Does the system change its judgments with wording changes in common-sense ways?

Yes, we find that semantic changes in a value profile lead to expected changes in the output.
(12/?)
taylor-sorensen.bsky.social
Clustering with value profiles also enables dataset-level qualitative analysis.

For example, for OQA/DIC, even restricting to just 2 clusters explains the majority of rater variation, suggesting a bimodal distribution.

Additionally, the profile descriptions suggest why people may disagree.
(11/?)
taylor-sorensen.bsky.social
Our algorithm is effective at uncovering useful rater groupings, with the resulting value profile clusters outperforming the most performant demographic grouping!

Additionally, on the dataset where demographics helped most, the clusters partially recover ideological trends.
(10/?)
taylor-sorensen.bsky.social
To characterize common modes of (dis)agreement, we introduce a value-based clustering algorithm.

Unlike traditional methods, ours: 1) does not require that raters label overlapping instances, 2) leverages semantic instance information, and 3) returns cluster descriptions.
(9/?)
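To make the flavor of this concrete, here is a rough sketch of one possible shape such a clustering could take (my own simplification, not the paper's algorithm): represent each rater by the text of the instances they labeled together with their labels, embed those representations, cluster the rater embeddings, and summarize each cluster. Note that none of the raters below share instances, and the final LLM summarization step is only indicated as a comment.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Hypothetical annotations: {rater_id: [(instance_text, label), ...]};
# raters do not need to label any overlapping instances.
annotations = {
    "r1": [("Schools should require uniforms.", "agree")],
    "r2": [("Dress codes limit self-expression.", "disagree")],
    "r3": [("Schools should require uniforms.", "disagree")],
    "r4": [("Dress codes limit self-expression.", "agree")],
}

# One document per rater: instance text fused with the rater's judgment,
# so the representation leverages semantic instance information.
rater_docs = {
    rid: " ".join(f"{text} -> {label}" for text, label in items)
    for rid, items in annotations.items()
}

X = TfidfVectorizer().fit_transform(rater_docs.values())
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

for rid, cluster in zip(rater_docs, clusters):
    print(rid, "-> cluster", cluster)
# A final step (omitted here) would prompt an LLM with each cluster's annotations
# to produce a natural-language description of the cluster's values.
```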