Zorik Gekhman
@zorikgekhman.bsky.social
https://zorikg.github.io/
This work is the result of a collaboration with a great team. Thanks to my co-authors:
Eyal Ben David, Hadas Orgad, Eran Ofek, Yonatan Belinkov, Idan Szpektor, Jonathan Herzig, and Roi Reichart.

Paper: Inside-Out: Hidden Factual Knowledge in LLMs (arxiv.org/abs/2503.15299)

17/🧵 (end)
We hope our framework facilitates future research on hidden knowledge, ultimately leading to more transparent and reliable LLMs.
16/🧵
Yet the fact that models fail to generate known answers puts a practical constraint on scaling test-time compute via repeated sampling in closed-book QA. Significant gains remain inaccessible because we fail to sample answers that the probe would otherwise rank first.
15/🧵
We also leverage our setup to enhance performance in a challenging closed-book QA setting, achieving a 12% average relative improvement over greedy decoding by increasing test-time compute: sampling a large set of answers and selecting the top one using our probe.
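As a rough sketch of this best-of-N selection (our illustration, not the paper's code; sample_fn and probe_fn are hypothetical stand-ins for the model's sampler and the linear probe):

```python
def rerank_with_probe(question: str, sample_fn, probe_fn, n_samples: int = 1000) -> str:
    """Best-of-N answering: sample many candidates, return the probe's top pick.

    sample_fn(question, n) -> list[str]: temperature sampling from the LLM.
    probe_fn(question, answer) -> float: internal, probe-based answer score.
    Both are hypothetical stand-ins, not APIs from the paper.
    """
    candidates = sample_fn(question, n_samples)
    # Greedy decoding would commit to a single answer; here the probe
    # selects the best answer from the sampled pool instead.
    return max(candidates, key=lambda a: probe_fn(question, a))
```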
14/🧵
For example, in one case the correct answer “Volvo Buses” gets a very low P(a | q) score, meaning it is unlikely to be generated; accordingly, it wasn’t sampled even once in 1,000 attempts in our study. Yet the probe scores it higher than all the alternatives.
13/🧵
This shows that LLMs can know the answer but have practically zero chance of generating it even once, despite large-scale repeated sampling.

This highlights limitations in the generation process and opens interesting directions for future research on decoding mechanisms.

12/🧵
We also discover an extreme case of hidden knowledge. When the ground-truth answer isn’t sampled after 1,000 attempts, manually adding it to the set of candidate answers leads to a substantial increase in knowledge scores.
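A sketch of this check (our illustration; knowledge_score is a hypothetical function implementing the pairwise measure from 5/🧵):

```python
def knowledge_with_gold_injected(question, sampled_answers, gold_answer, knowledge_score):
    """Re-score knowledge after manually injecting a never-sampled gold answer."""
    candidates = list(sampled_answers)
    if gold_answer not in candidates:
        # 1,000 samples missed the gold answer entirely; add it by hand
        # before computing the pairwise knowledge score.
        candidates.append(gold_answer)
    # knowledge_score(question, candidates, gold_answer) -> float is a
    # hypothetical helper, not an API from the paper.
    return knowledge_score(question, candidates, gold_answer)
```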
11/🧵
Our results indicate that LLMs consistently exhibit hidden knowledge, with an average relative gap of 40%.

This highlights the need to understand these differences and build models that better use their knowledge, for which our framework serves as a foundation.

10/🧵
We then compare internal and external knowledge.

Internal knowledge is measured using a linear probing classifier to score candidate answers, while external knowledge is measured using standard methods that rely on the model’s observable token-level probabilities.
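A minimal sketch of the two kinds of scorers, assuming a HuggingFace causal LM (GPT-2 as a placeholder) and an already-trained probe vector probe_w with bias probe_b; the model choice, probe layer, and boundary tokenization are our simplifying assumptions, not the paper's setup:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2", output_hidden_states=True)
model.eval()

def external_score(question: str, answer: str) -> float:
    """External signal: log P(answer | question) from observable token probabilities."""
    q_len = tok(question, return_tensors="pt").input_ids.shape[1]
    full_ids = tok(question + " " + answer, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits[0]
    log_probs = torch.log_softmax(logits[:-1], dim=-1)
    # Sum log-probabilities of the answer tokens given their left context.
    return sum(
        log_probs[pos, full_ids[0, pos + 1]].item()
        for pos in range(q_len - 1, full_ids.shape[1] - 1)
    )

def internal_score(question: str, answer: str,
                   probe_w: torch.Tensor, probe_b: float, layer: int = -1) -> float:
    """Internal signal: a linear probe over an intermediate hidden state."""
    ids = tok(question + " " + answer, return_tensors="pt").input_ids
    with torch.no_grad():
        hidden = model(ids).hidden_states[layer][0, -1]  # last-token state
    return float(probe_w @ hidden + probe_b)
```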

9/🧵
In our study, we estimate the set of (correct, incorrect) answer pairs per question using 1,000 model-generated answers, labeled for correctness by an LLM judge that compares each answer to the ground truth.
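Schematically (our sketch; judge_is_correct is a hypothetical wrapper around the LLM judge):

```python
from itertools import product

def build_answer_pairs(question, sampled_answers, ground_truth, judge_is_correct):
    """Turn 1,000 samples into (correct, incorrect) answer pairs for scoring."""
    unique = set(sampled_answers)
    # The LLM judge labels each unique answer by comparing it to the ground truth.
    correct = {a for a in unique if judge_is_correct(question, a, ground_truth)}
    incorrect = unique - correct
    # Every (correct, incorrect) pair becomes a test case for a scoring function.
    return list(product(correct, incorrect))
```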
8/🧵
We define hidden knowledge as the condition in which internal knowledge exceeds external knowledge.
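In symbols (our notation, using the knowledge measure K defined in 5/🧵):

```latex
% Hidden knowledge for a question q: the internal scorer reveals strictly
% more knowledge than the external scorer does.
\[
\mathrm{hidden}(q) \;\iff\; K_{\mathrm{int}}(q) > K_{\mathrm{ext}}(q)
\]
```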
7/🧵
This allows us to measure internal and external knowledge with the same definition; we just swap the scoring method: external scorers rely only on the model's observable token-level probabilities, while internal scorers can also use intermediate computations.
6/🧵
We propose to measure knowledge relative to a function that scores answer candidates using signals from the model, and we formalize knowledge of a question as the fraction of correct-incorrect answer pairs where the correct one is scored higher.
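In our notation: for a question q with correct answers A_c, incorrect answers A_w, and a scoring function f,

```latex
% Knowledge of q relative to scorer f: the fraction of (correct, incorrect)
% answer pairs in which f ranks the correct answer higher.
\[
K_f(q) \;=\; \frac{1}{|A_c|\,|A_w|}
\sum_{a \in A_c} \sum_{a' \in A_w} \mathbf{1}\!\left[\, f(q, a) > f(q, a') \,\right]
\]
```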
5/🧵
To define hidden knowledge, we first need a definition of “knowledge” itself, a notion that is not yet well defined for LLMs, as shown by @constanzafierro.bsky.social.
4/🧵
Yet, despite its importance from both practical and interpretability perspectives, hidden knowledge hasn't been clearly defined or measured.

We propose such a definition, laying the foundations for studying this concept, and use it in a controlled study to demonstrate hidden knowledge.
3/🧵
🚨 It's often claimed that LLMs know more facts than they show in their outputs, but what does this actually mean, and how can we measure this “hidden knowledge”?

In our new paper, we clearly define this concept and design controlled experiments to test it.
1/🧵