May be of interest to @paul-rottger.bsky.social @monadiab77.bsky.social @vinodkpg.bsky.social @dbamman.bsky.social @davidjurgens.bsky.social and you
This was an interdisciplinary effort across computer science (@diyiyang.bsky.social, @williamheld.com, Jane Yu) and sociology (David Grusky and Amir Goldberg), and the research process taught me so much!
We observe positive transfer performance from Cartography to two leading benchmarks: BLEnD (Myung et al., 2024) and CulturalBench (Chiu et al., 2024).
We evaluate GPT-4o with and without search and find no significant difference in their recall on Cartography data.
Culture Cartography is "Google proof" since search doesn't help.
Qwen-2 72B recalls 21% less Cartography data than traditional data (p < .0001).
Even a strong reasoning model (R1) is challenged more by our data.
To find culturally-representative knowledge, we let the human steer towards what they find most salient.
And to find challenging questions, we let the LLM steer towards topics it has low confidence in.
Still, this is a single-initiative process.
Researchers can’t steer the distribution towards questions of interest (i.e., those that challenge LLMs).
In traditional annotation, the researcher picks some questions and the annotator passively provides ground truth answers.
This is single-initiative.
Annotators don't steer the process, so their interests and culture may not be represented.