Lightnews — Scholar-powered news

Lihao Sun

@1e0sun.bsky.social

26 followers 39 following 11 posts

Working on LLM interpretability; recent graduate from uchicago. slhleosun.github.io

Posts Media Videos Starter Packs

Lihao Sun @1e0sun.bsky.social · Jun 10

7/
📢 Accepted to #ACL2025 Main Conference! See you in Vienna.
Work done by @1e0sun.bsky.social‬, Chengzhi Mao, @valentinhofmann.bsky.social‬, Xuechunzi Bai.

Paper: arxiv.org/abs/2506.00253
Project page: slhleosun.github.io/aligned_but_...
Code & Data: github.com/slhleosun/al...

1 4

Lihao Sun @1e0sun.bsky.social · Jun 10

6/
We call this failure mode "blindness"—when alignment makes certain concepts less salient. This may reflect a broader class of alignment issues.

Similar methods can be extended to other forms of social bias or to study how models resolve polysemy under ambiguity.

1 2

Lihao Sun @1e0sun.bsky.social · Jun 10

5/
This challenges a common belief:
unlearning ≠ debiasing

When debiasing strategies suppress sensitive concepts, they can unintentionally reduce a model’s ability to detect bias.

🧠 Instead, we may achieve deeper alignment effects with strategies that make models aware of them.

1 2

Lihao Sun @1e0sun.bsky.social · Jun 10

4/
Inspired by these results, we tested the opposite of “machine unlearning” for debiasing.

What if we reinforced race concepts in models?
- Injecting race-laden activations cut implicit bias by 54.9%.
- LoRA fine-tuning brought it down from 97.3% → 42.4%.

Bonus: also lowered explicit bias.

1 2

Lihao Sun @1e0sun.bsky.social · Jun 10

3/
We mechanistically tested this using activation patching and embedding interpretation.

Aligned models were 52.2% less likely to represent “black” as race in ambiguous contexts compared to unaligned models.

🧠 LMs trained for harmlessness may avoid racial representations—amplifying stereotypes.

1 2

Lihao Sun @1e0sun.bsky.social · Jun 10

This resembles race blindness in humans; ignoring race makes stereotypes more likely to slip through, and the LMs’ safety guardrails aren't triggered.

1 2

Lihao Sun @1e0sun.bsky.social · Jun 10

2/
So why does alignment increase implicit bias?

Our analyses showed that aligned LMs are more likely to treat “black” and “white” as pure color, not race, when the context is ambiguous.

1 2

Lihao Sun @1e0sun.bsky.social · Jun 10

Aligned models passed explicit tests—but were more biased in implicit settings.
📉 Explicit bias: near 0%
📈 Implicit bias: 91.4%

1 2

Lihao Sun @1e0sun.bsky.social · Jun 10

- Explicit: Likert scale, asking whether the model agrees with a given association such as “black” is related to negative, “white” is related to positive.
- Implicit: Word association, let the model freely pair “black”/”white” with positive/negative words.

1 2

Lihao Sun @1e0sun.bsky.social · Jun 10

1/
We curated pairs of prompts testing for implicit and explicit racial bias and used them to evaluate Llama 3 models.

1 2

Lihao Sun @1e0sun.bsky.social · Jun 10

🚨New #ACL2025 paper!

Today’s “safe” language models can look unbiased—but alignment can actually make them more biased implicitly by reducing their sensitivity to race-related associations.

🧵Find out more below!

1 2 12