May be of interest to @paul-rottger.bsky.social @monadiab77.bsky.social @vinodkpg.bsky.social @dbamman.bsky.social @davidjurgens.bsky.social and you
This was an interdisciplinary effort across computer science (@diyiyang.bsky.social, @williamheld.com, Jane Yu) and sociology (David Grusky and Amir Goldberg), and the research process taught me so much!
We observe positive transfer performance from Cartography to two leading benchmarks: BLEnD (Myung et al., 2024) and CulturalBench (Chiu et al., 2024).
We evaluate GPT-4o with and without search and find no significant difference in their recall on Cartography data.
Culture Cartography is "Google proof" since search doesn't help.
Qwen-2 72B recalls 21% less Cartography data than traditional data (p < .0001).
Even a strong reasoning model (R1) is challenged more by our data.
To find culturally-representative knowledge, we let the human steer towards what they find most salient.
And to find challenging questions, we let the LLM steer towards topics it has low confidence in.
Still, this is a single-initiative process.
Researchers can’t steer the distribution towards questions of interest (i.e., those that challenge LLMs).
In traditional annotation, the researcher picks some questions and the annotator passively provides ground truth answers.
This is single-initiative.
Annotators don't steer the process, so their interests and culture may not be represented.