Michael Kirchhof (ICML)
@mkirchhof.bsky.social
340 followers 190 following 65 posts
Research Scientist at Apple for uncertainty quantification.
Pinned
mkirchhof.bsky.social
Can LLMs access and describe their own internal distributions? With my colleagues at Apple, I invite you to take a leap forward and make LLM uncertainty quantification what it can be.
📄 arxiv.org/abs/2505.20295
💻 github.com/apple/ml-sel...
🧵1/9
mkirchhof.bsky.social
Memories complement RAG, and the two can be combined for even better results. Post-hoc memory learning is possible (for Qwen, Gemma, etc.), with more ablations in the paper.

This was spearheaded by Hadi Pouransari, with David Grangier, C Thomas, me, and Oncel Tuzel at the Apple Machine Learning Research team :)
mkirchhof.bsky.social
🚀 Consider hypothetical hardware storing a memory bank with three levels:
Anchor model: 0.8GB @ RAM
Level 1: 39GB @ Flash
Level 2: 155GB @ External Disk
Level 3: 618GB @ Cloud

Total fetch time: 38ms (vs. 198ms for a single-level flat memory bank). [9/10]
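A rough back-of-the-envelope model of where such numbers come from (the fetch sizes and bandwidths below are illustrative assumptions, not the figures behind the 38ms/198ms measurement): only a small slice of each bank is fetched per query, so the slow tiers add little latency.

```python
# Sketch of a hierarchical vs. flat fetch-time estimate. All values are made up
# for illustration; they do not reproduce the post's 38ms/198ms numbers.
levels = [
    # (name, bytes fetched per query, read bandwidth in bytes/s)
    ("Flash (39GB bank)",  20e6, 2.0e9),
    ("Disk (155GB bank)",  10e6, 0.5e9),
    ("Cloud (618GB bank)",  5e6, 0.2e9),
]

# Hierarchical: each level is read from its own medium.
hierarchical_ms = sum(nbytes / bw for _, nbytes, bw in levels) * 1e3

# Flat (assumption): the same total bytes live in one bank on the slowest medium.
flat_ms = sum(nbytes for _, nbytes, _ in levels) / 0.2e9 * 1e3

print(f"hierarchical ~ {hierarchical_ms:.0f} ms, flat ~ {flat_ms:.0f} ms")
```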
mkirchhof.bsky.social
💡With hierarchical memories, deeper memories (capturing details) need a larger bank size but require fetching only a few parameters during inference, a great fit for the von Neumann architecture's small-fast to large-slow storage hierarchy. See👇. [8/10]
mkirchhof.bsky.social
💡Information access is controllable with memories.

Unlike typical architectures, the proposed memory bank setup enables controlled parametric knowledge access (e.g., for training data privacy). See the impact of memory bank blocking on performance here: [7/10]
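A minimal sketch of what such blocking could look like at inference time (the interfaces here are hypothetical placeholders, not the paper's code): memory blocks of blocked clusters are simply never fetched.

```python
def generate_with_blocking(anchor_model, memory_bank, matched_clusters, blocked, prompt):
    # Hypothetical sketch: fetch only the memory blocks for clusters the query matches,
    # skipping any cluster on the block list (e.g. for training-data privacy).
    active = [c for c in matched_clusters if c not in blocked]
    memories = [memory_bank[c] for c in active]
    return anchor_model.generate(prompt, memories=memories)
```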
mkirchhof.bsky.social
💡Memories capture long-tail knowledge.

For the text completion task "Atomic number of [element-name] is...", the baseline model (purple) has 17% accuracy for the least frequent elements in DCLM (last bucket). With only 10% added memory, accuracy improves to 83%. [6/10]
mkirchhof.bsky.social
🤔 Which tasks benefit more from memory?

💡 Tasks requiring specific knowledge, like ARC and TriviaQA. Below are categorizations of common pretraining benchmarks based on their knowledge specificity and accuracy improvement when a 410M model is augmented with 10% memory. [5/10]
mkirchhof.bsky.social
💡Accuracy improves with larger fetched memory and total memory bank sizes.

👇A 160M anchor model, augmented with memories from 1M to 300M parameters, gains over 10 points in accuracy. Two curves show memory bank sizes of 4.6B and 18.7B parameters. [4/10]
mkirchhof.bsky.social
🤔 Which parametric memories work best?

💡 We evaluate 1) FFN-memories (extending SwiGLU's internal dimension), 2) LoRA applied to various layers, and 3) Learnable KV. Larger memories perform better, with FFN-memories significantly outperforming others of the same size. [3/10]
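For concreteness, here is one way an FFN-memory could extend SwiGLU's internal dimension, as the post describes; the layer names and exact wiring are my assumptions, not the paper's implementation.

```python
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUWithMemory(nn.Module):
    """FFN-memory sketch: widen the SwiGLU hidden dimension with extra,
    cluster-specific units whose weights live in the memory bank."""
    def __init__(self, d_model: int, d_hidden: int, d_memory: int):
        super().__init__()
        # Anchor (always-resident) parameters.
        self.gate = nn.Linear(d_model, d_hidden, bias=False)
        self.up   = nn.Linear(d_model, d_hidden, bias=False)
        self.down = nn.Linear(d_hidden, d_model, bias=False)
        # Memory block parameters, fetched from the bank for the matched cluster.
        self.mem_gate = nn.Linear(d_model, d_memory, bias=False)
        self.mem_up   = nn.Linear(d_model, d_memory, bias=False)
        self.mem_down = nn.Linear(d_memory, d_model, bias=False)

    def forward(self, x):
        h = F.silu(self.gate(x)) * self.up(x)          # standard SwiGLU path
        m = F.silu(self.mem_gate(x)) * self.mem_up(x)  # memory path, same form
        return self.down(h) + self.mem_down(m)
```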
mkirchhof.bsky.social
🤔 How to learn memories?

💡 We cluster the pretraining dataset into thousands of nested clusters and assign each cluster a memory block. During training, each document updates the anchor model parameters together with the memory blocks of its matched clusters. [2/10]
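Roughly, a training step could look like the sketch below; the interfaces are my assumptions about how the described procedure might be wired up, not the actual implementation.

```python
def training_step(anchor_model, memory_bank, optimizer, doc_tokens, cluster_path):
    # cluster_path: the document's matched nested clusters, e.g. [coarse, mid, fine].
    memories = [memory_bank[c] for c in cluster_path]          # only these blocks are used
    loss = anchor_model(doc_tokens, memories=memories).loss    # next-token prediction loss
    optimizer.zero_grad()
    loss.backward()   # gradients reach the anchor params and the matched blocks only
    optimizer.step()
```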
mkirchhof.bsky.social
LLMs are currently this one big parameter block that stores all sorts of facts. In our new preprint, we add context-specific memory parameters to the model, and pretrain the model along with a big bank of memories.

📑 arxiv.org/abs/2510.02375

[1/10]🧵
Reposted by Michael Kirchhof (ICML)
marcocuturi.bsky.social
Our two phenomenal interns, Alireza Mousavi-Hosseini and Stephen Zhang @syz.bsky.social have been cooking some really cool work with Michal Klein and me over the summer.

Relying on optimal transport couplings (to pick noise and data pairs) should, in principle, be helpful to guide flow matching

🧵
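For readers unfamiliar with the idea, minibatch OT coupling for flow matching is often sketched like this (a generic illustration, not necessarily the method in the linked work):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def ot_paired_batch(noise, data):
    """Pair each noise sample with a data sample so the total squared distance is
    minimal; flow matching then regresses the velocity along the straight path
    x_t = (1 - t) * x0 + t * x1 for these pairs instead of random pairings."""
    cost = np.sum((noise[:, None, :] - data[None, :, :]) ** 2, axis=-1)  # pairwise sq. distances
    rows, cols = linear_sum_assignment(cost)                             # optimal assignment
    return noise[rows], data[cols]
```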
mkirchhof.bsky.social
But it does not seem impossible. Releasing this benchmark (+ code) to let you take a shot at this new avenue for uncertainty communication. This is a missing building block for agentic reasoning in uncertain environments, user trust, and conformal calibration. Let's solve it :)
mkirchhof.bsky.social
Second, we attempted hill-climbing on the benchmark. We already knew Reasoning and CoT can't do it; now we've tried explicit SFT/DPO. Result: LLMs can get the format right, but what they output is not what they are actually uncertain about, information-theoretically.
mkirchhof.bsky.social
Since its initial release, we didn't stop cooking: First, we continued validating whether the scores that the SelfReflect benchmark assigns are robust signals of quality. Across more LLMs and datasets, it works. I have more confidence in the benchmark than ever.
mkirchhof.bsky.social
Many treat uncertainty = a number. At Apple, we're rethinking this: LLMs should output strings that reveal all the information in their internal distributions. We find that Reasoning, SFT, CoT can't do it - yet. To get there, we introduce the SelfReflect benchmark.

arxiv.org/pdf/2505.20295
mkirchhof.bsky.social
I'll present my view on the future of uncertainties in LLMs and vision models at @icmlconf.bsky.social, in panel discussions, posters, and workshops. Reach out if you wanna chat :)
Here's everything from me and other folks at Apple: machinelearning.apple.com/updates/appl...
mkirchhof.bsky.social
This is my first larger project at Apple MLR. 🍏 My collaborators did great things here: Luca Füger, @adamgol.bsky.social , Eeshan Gunesh Dhekane, Arno Blaas, and @sineadwilliamson.bsky.social PS: If you wanna learn more, just let me know, happy to give presentations and chat :)
mkirchhof.bsky.social
If the LLM could do that, it would give users absolute honesty about subjective uncertainties. It could even use such a description itself to gain clarity about which follow-up questions to ask. We invite you to develop such strategies. Here's a repo to get started: github.com/apple/ml-sel... 🧵9/9
GitHub - apple/ml-selfreflect
mkirchhof.bsky.social
But it's not impossible per se. Sampling i.i.d. responses and asking the LLM to summarize them consistently produces self-reflective strings. This technique is expensive and not elegant. Ideally, an LLM should have a mechanism to do this self-reflection inherently. 🧵8/9
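The sample-then-summarize baseline fits in a few lines (`llm.sample` and `llm.complete` are placeholder interfaces for illustration, not a specific API):

```python
def summarize_own_distribution(llm, question, n=20):
    # Sample i.i.d. answers from the model's own distribution, then ask it to
    # condense them into one string that conveys the options and their likelihoods.
    answers = [llm.sample(question, temperature=1.0) for _ in range(n)]
    prompt = (
        "You gave the following answers to the same question:\n"
        + "\n".join(f"- {a}" for a in answers)
        + "\nWrite a single answer that states which options you consider "
          "plausible and roughly how likely each is."
    )
    return llm.complete(prompt)
```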
mkirchhof.bsky.social
RQ2: Can LLMs produce such strings that describe their own distributions? We test the most recent LLMs of various sizes, with and without reasoning, with different prompts and CoT. None of them is able to honestly reveal the LLM’s internal distribution. 🧵7/9
mkirchhof.bsky.social
We put SelfReflect to the test (plus a whole landscape of other metrics we’ve explored). Our final metric is able to distinguish even almost-good from good summaries and pick up implicit distributional statements. We find it aligns with human judgements. So let’s use it: 🧵6/9
mkirchhof.bsky.social
RQ1: How can we measure whether a string does that? The idea of our SelfReflect metric is that the summary string should imply the same follow-up responses as the original distribution; theoretically speaking, it should be a predictively sufficient statistic of that distribution. 🧵5/9
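In loose notation (mine, not necessarily the paper's): letting p(y | x) be the LLM's answer distribution for query x and s the summary string, predictive sufficiency asks that a follow-up response a is predicted equally well from s as from the distribution itself:

```latex
p(a \mid s, x) \;=\; \mathbb{E}_{y \sim p(\cdot \mid x)}\big[\, p(a \mid y, x) \,\big]
```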
mkirchhof.bsky.social
So we want a string that summarizes the LLM’s internal distribution. Both what options it deems possible, but also how likely each is. In other words, we want the string to contain the same information as the LLM’s internal distribution over strings. 🧵4/9
mkirchhof.bsky.social
LLMs output strings, and strings are incredibly expressive. In fact, they are so expressive that a single string can describe _a whole distribution over_ strings. Such a string would give a perfect insight into what the LLM is certain and uncertain about. 🧵3/9