jouisseuse.github.io
Really appreciate @elisakreiss.bsky.social's kind guidance and encouragement throughout this work 🙏
📄 Paper: arxiv.org/abs/2509.04373
💻 Code: github.com/jouisseuse/B...
This suggests that LLM benchmark behavior may generalize less and less to non-benchmark settings, raising new concerns about ecological validity.
Do LLMs exhibit distinct behavior when the prompt looks similar to common evaluation prompts? 🔍
We show that prompts that signal bias evaluation can flip the measured bias. See below ⬇️