Michael Kirchhof (ICML)
@mkirchhof.bsky.social
340 followers 190 following 65 posts
Research Scientist at Apple for uncertainty quantification.
Pinned
mkirchhof.bsky.social
Can LLMs access and describe their own internal distributions? With my colleagues at Apple, I invite you to take a leap forward and make LLM uncertainty quantification what it can be.
📄 arxiv.org/abs/2505.20295
💻 github.com/apple/ml-sel...
🧵1/9
mkirchhof.bsky.social
Memories complement RAG, and the two can be combined for even better results. Post-hoc memory learning is possible (for Qwen, Gemma, etc.), with more ablations in the paper.

This was spearheaded by Hadi Pouransari, with David Grangier, C Thomas, me, and Oncel Tuzel at the Apple Machine Learning Research team :)
mkirchhof.bsky.social
🚀 Consider hypothetical hardware storing a memory bank with three levels:
Anchor model: 0.8GB @ RAM
Level 1: 39GB @ Flash
Level 2: 155GB @ External Disk
Level 3: 618GB @ Cloud

Total fetch time: 38ms (vs. 198ms for a single-level flat memory bank). [9/10]
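A rough back-of-the-envelope model of where such numbers come from (the fetch sizes and bandwidths below are illustrative assumptions, not the figures behind the 38ms/198ms measurement): only a small slice of each bank is fetched per query, so the slow tiers add little latency.

```python
# Sketch of a hierarchical vs. flat fetch-time estimate. All values are made up
# for illustration; they do not reproduce the post's 38ms/198ms numbers.
levels = [
    # (name, bytes fetched per query, read bandwidth in bytes/s)
    ("Flash (39GB bank)",  20e6, 2.0e9),
    ("Disk (155GB bank)",  10e6, 0.5e9),
    ("Cloud (618GB bank)",  5e6, 0.2e9),
]

# Hierarchical: each level is read from its own medium.
hierarchical_ms = sum(nbytes / bw for _, nbytes, bw in levels) * 1e3

# Flat (assumption): the same total bytes live in one bank on the slowest medium.
flat_ms = sum(nbytes for _, nbytes, _ in levels) / 0.2e9 * 1e3

print(f"hierarchical ~ {hierarchical_ms:.0f} ms, flat ~ {flat_ms:.0f} ms")
```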
mkirchhof.bsky.social
💡With hierarchical memories, deeper memories (capturing details) need a larger bank size but require fetching only a few parameters during inference, a great fit for the von Neumann architecture's small-fast to large-slow storage hierarchy. See👇. [8/10]
mkirchhof.bsky.social
💡Information access is controllable with memories.

Unlike typical architectures, the proposed memory bank setup enables controlled parametric knowledge access (e.g., for training data privacy). See the impact of memory bank blocking on performance here: [7/10]
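A minimal sketch of what such blocking could look like at inference time (the interfaces here are hypothetical placeholders, not the paper's code): memory blocks of blocked clusters are simply never fetched.

```python
def generate_with_blocking(anchor_model, memory_bank, matched_clusters, blocked, prompt):
    # Hypothetical sketch: fetch only the memory blocks for clusters the query matches,
    # skipping any cluster on the block list (e.g. for training-data privacy).
    active = [c for c in matched_clusters if c not in blocked]
    memories = [memory_bank[c] for c in active]
    return anchor_model.generate(prompt, memories=memories)
```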
mkirchhof.bsky.social
💡Memories capture long-tail knowledge.

For the text completion task "Atomic number of [element-name] is...", the baseline model (purple) has 17% accuracy for the least frequent elements in DCLM (last bucket). With only 10% added memory, accuracy improves to 83%. [6/10]
mkirchhof.bsky.social
🤔 Which tasks benefit more from memory?

💡 Tasks requiring specific knowledge, like ARC and TriviaQA. Below are categorizations of common pretraining benchmarks based on their knowledge specificity and accuracy improvement when a 410M model is augmented with 10% memory. [5/10]
mkirchhof.bsky.social
💡Accuracy improves with larger fetched memory and total memory bank sizes.

👇A 160M anchor model, augmented with memories from 1M to 300M parameters, gains over 10 points in accuracy. Two curves show memory bank sizes of 4.6B and 18.7B parameters. [4/10]
mkirchhof.bsky.social
🤔 Which parametric memories work best?

💡 We evaluate 1) FFN-memories (extending SwiGLU's internal dimension), 2) LoRA applied to various layers, and 3) Learnable KV. Larger memories perform better, with FFN-memories significantly outperforming others of the same size. [3/10]
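For concreteness, here is one way an FFN-memory could extend SwiGLU's internal dimension, as the post describes; the layer names and exact wiring are my assumptions, not the paper's implementation.

```python
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUWithMemory(nn.Module):
    """FFN-memory sketch: widen the SwiGLU hidden dimension with extra,
    cluster-specific units whose weights live in the memory bank."""
    def __init__(self, d_model: int, d_hidden: int, d_memory: int):
        super().__init__()
        # Anchor (always-resident) parameters.
        self.gate = nn.Linear(d_model, d_hidden, bias=False)
        self.up   = nn.Linear(d_model, d_hidden, bias=False)
        self.down = nn.Linear(d_hidden, d_model, bias=False)
        # Memory block parameters, fetched from the bank for the matched cluster.
        self.mem_gate = nn.Linear(d_model, d_memory, bias=False)
        self.mem_up   = nn.Linear(d_model, d_memory, bias=False)
        self.mem_down = nn.Linear(d_memory, d_model, bias=False)

    def forward(self, x):
        h = F.silu(self.gate(x)) * self.up(x)          # standard SwiGLU path
        m = F.silu(self.mem_gate(x)) * self.mem_up(x)  # memory path, same form
        return self.down(h) + self.mem_down(m)
```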
mkirchhof.bsky.social
🤔 How to learn memories?

💡 We cluster the pretraining dataset into thousands of nested clusters and assign each cluster a memory block. During training, each document updates the anchor model parameters together with the memory blocks of its matched clusters. [2/10]
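Roughly, a training step could look like the sketch below; the interfaces are my assumptions about how the described procedure might be wired up, not the actual implementation.

```python
def training_step(anchor_model, memory_bank, optimizer, doc_tokens, cluster_path):
    # cluster_path: the document's matched nested clusters, e.g. [coarse, mid, fine].
    memories = [memory_bank[c] for c in cluster_path]          # only these blocks are used
    loss = anchor_model(doc_tokens, memories=memories).loss    # next-token prediction loss
    optimizer.zero_grad()
    loss.backward()   # gradients reach the anchor params and the matched blocks only
    optimizer.step()
```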
mkirchhof.bsky.social
LLMs are currently this one big parameter block that stores all sorts of facts. In our new preprint, we add context-specific memory parameters to the model, and pretrain the model along with a big bank of memories.

📑 arxiv.org/abs/2510.02375

[1/10]🧵
Reposted by Michael Kirchhof (ICML)
marcocuturi.bsky.social
Our two phenomenal interns, Alireza Mousavi-Hosseini and Stephen Zhang @syz.bsky.social have been cooking some really cool work with Michal Klein and me over the summer.

Relying on optimal transport couplings (to pick noise and data pairs) should, in principle, be helpful to guide flow matching

🧵
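For readers unfamiliar with the idea, minibatch OT coupling for flow matching is often sketched like this (a generic illustration, not necessarily the method in the linked work):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def ot_paired_batch(noise, data):
    """Pair each noise sample with a data sample so the total squared distance is
    minimal; flow matching then regresses the velocity along the straight path
    x_t = (1 - t) * x0 + t * x1 for these pairs instead of random pairings."""
    cost = np.sum((noise[:, None, :] - data[None, :, :]) ** 2, axis=-1)  # pairwise sq. distances
    rows, cols = linear_sum_assignment(cost)                             # optimal assignment
    return noise[rows], data[cols]
```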
mkirchhof.bsky.social
But it does not seem impossible. Releasing this benchmark (+ code) to let you take a shot at this new avenue for uncertainty communication. This is a missing building block for agentic reasoning in uncertain environments, user trust, and conformal calibration. Let's solve it :)
mkirchhof.bsky.social
Second, we attempted hill-climbing on the benchmark. We already knew Reasoning and CoT can't do it; now we've tried explicit SFT/DPO. Result: LLMs can get the format right, but what they output is not what they are actually uncertain about, information-theoretically.
mkirchhof.bsky.social
Since its initial release, we didn't stop cooking: First, we continued validating whether the scores that the SelfReflect benchmark assigns are robust signals of quality. Across more LLMs and datasets, it works. I have more confidence in the benchmark than ever.
mkirchhof.bsky.social
Many treat uncertainty = a number. At Apple, we're rethinking this: LLMs should output strings that reveal all the information in their internal distributions. We find that Reasoning, SFT, CoT can't do it - yet. To get there, we introduce the SelfReflect benchmark.

arxiv.org/pdf/2505.20295
mkirchhof.bsky.social
I'll present my view on the future of uncertainties in LLMs and vision models at @icmlconf.bsky.social, in panel discussions, posters, and workshops. Reach out if you wanna chat :)
Here's everything from me and other folks at Apple: machinelearning.apple.com/updates/appl...
mkirchhof.bsky.social
This is my first larger project at Apple MLR. 🍏 My collaborators did great things here: Luca Füger, @adamgol.bsky.social , Eeshan Gunesh Dhekane, Arno Blaas, and @sineadwilliamson.bsky.social PS: If you wanna learn more, just let me know, happy to give presentations and chat :)
mkirchhof.bsky.social
If the LLM could do that, it would give users absolute honesty about subjective uncertainties. It could even use such a description itself to gain clarity about which follow-up questions to ask. We invite you to develop such strategies. Here's a repo to get started: github.com/apple/ml-sel... 🧵9/9
GitHub - apple/ml-selfreflect
mkirchhof.bsky.social
But it's not impossible per se. Sampling i.i.d. responses and asking the LLM to summarize them consistently produces self-reflective strings. This technique is expensive and not elegant. Ideally, an LLM should have a mechanism to do this self-reflection inherently. 🧵8/9
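The sample-then-summarize baseline fits in a few lines (`llm.sample` and `llm.complete` are placeholder interfaces for illustration, not a specific API):

```python
def summarize_own_distribution(llm, question, n=20):
    # Sample i.i.d. answers from the model's own distribution, then ask it to
    # condense them into one string that conveys the options and their likelihoods.
    answers = [llm.sample(question, temperature=1.0) for _ in range(n)]
    prompt = (
        "You gave the following answers to the same question:\n"
        + "\n".join(f"- {a}" for a in answers)
        + "\nWrite a single answer that states which options you consider "
          "plausible and roughly how likely each is."
    )
    return llm.complete(prompt)
```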
mkirchhof.bsky.social
RQ2: Can LLMs produce such strings that describe their own distributions? We test the most recent LLMs of various sizes, with and without reasoning, with different prompts and CoT. None of them is able to honestly reveal the LLM’s internal distribution. 🧵7/9
mkirchhof.bsky.social
We put SelfReflect to the test (plus a whole landscape of other metrics we’ve explored). Our final metric is able to distinguish even almost-good from good summaries and pick up implicit distributional statements. We find it aligns with human judgements. So let’s use it: 🧵6/9
mkirchhof.bsky.social
RQ1: How can we measure whether a string does that? The idea of our SelfReflect metric is that the summary string should imply the same follow-up responses as the original distribution; theoretically speaking, it should be a predictively sufficient statistic of that distribution. 🧵5/9
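In loose notation (mine, not necessarily the paper's): letting p(y | x) be the LLM's answer distribution for query x and s the summary string, predictive sufficiency asks that a follow-up response a is predicted equally well from s as from the distribution itself:

```latex
p(a \mid s, x) \;=\; \mathbb{E}_{y \sim p(\cdot \mid x)}\big[\, p(a \mid y, x) \,\big]
```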
mkirchhof.bsky.social
So we want a string that summarizes the LLM’s internal distribution. Both what options it deems possible, but also how likely each is. In other words, we want the string to contain the same information as the LLM’s internal distribution over strings. 🧵4/9
mkirchhof.bsky.social
LLMs output strings, and strings are incredibly expressive. In fact, they are so expressive that a single string can describe _a whole distribution over_ strings. Such a string would give a perfect insight into what the LLM is certain and uncertain about. 🧵3/9