Andrew Lampinen
@lampinen.bsky.social
7.4K followers 640 following 210 posts
Interested in cognition and artificial intelligence. Research Scientist at Google DeepMind. Previously cognitive science at Stanford. Posts are mine. lampinen.github.io
Pinned
lampinen.bsky.social
Why does AI sometimes fail to generalize, and what might help? In a new paper (arxiv.org/abs/2509.16189), we highlight the latent learning gap — which unifies findings from language modeling to agent navigation — and suggest that episodic memory complements parametric learning to bridge it. Thread:
Latent learning: episodic memory complements parametric learning by enabling flexible reuse of experiences
When do machine learning systems fail to generalize, and what mechanisms could improve their generalization? Here, we draw inspiration from cognitive science to argue that one weakness of machine lear...
arxiv.org
lampinen.bsky.social
I'm not sure I fully understand this point; part of our argument here (as well as in some of our past work: arxiv.org/abs/2505.00661) is that models *can* readily produce the reversals when the information is in context; they just *don't* unless there is some problem to solve or other cue to do so.
On the generalization of language models from in-context learning and finetuning: a controlled study
Large language models exhibit exciting capabilities, yet can show surprisingly narrow generalization from finetuning. E.g. they can fail to generalize to simple reversals of relations they are trained...
arxiv.org
lampinen.bsky.social
Hahaha much appreciated
lampinen.bsky.social
Even comparing my own work in different areas, it's harder to be both timely and as thorough with LM work, especially given the scale of the experiments
lampinen.bsky.social
I was gonna say, I feel attacked by this tweet 😅
lampinen.bsky.social
We think this work sheds light on why retrieval offers distinct benefits beyond just training models more, and provides a different perspective on why episodic memory and parametric learning are complementary, which we hope will be of interest for both AI and cognitive science. 8/
lampinen.bsky.social
In the paper, we explore many more settings & nuances — including RL and BC versions of maze navigation experiments based on the original experiments on latent learning in rats, the effects of associative cues, the importance of within-episode ICL, and ablations. 7/
lampinen.bsky.social
We show that even when models generalize well from parametric learning in standard (nontrivial) evaluations, there are selective, consistent failures of latent learning. Only models with retrieval generalize well on the key tests of latent learning. 6/
The benefits of oracle retrieval on the (a) Codebooks and (b) simple reversals benchmarks. Both baseline and retrieval models perform well on component tasks like recalling definitions, or encoding new sequences involving indices used in encoding during training (a, center). However, performance differs dramatically on the latent encoding test (right bars on both plots), where only the model with retrieval achieves above-chance performance.
lampinen.bsky.social
To illustrate this point, we explore latent learning across a wide range of benchmarks (from codebook translation to BC and RL navigation) — and compare baseline language models or agents to those equipped with oracle retrieval. 5/
The benchmarks we use and the key types of latent generalization that they test. (a) The codebooks benchmark tests the ability to use latent indices (highlighted in red) for which only the definitions have been seen in training to complete test encoding sequences. (b) The simple reversals benchmark tests the ability of models to reverse relations seen in training, and which models have learned to reverse in-context. (c) The semantic structure benchmark uses training embedded in more naturalistic text to test latent generalization types ranging from reversals to syllogisms, or more challenging category-inclusion-only holdouts. (d) The latent gridworld—with both its pixel-based RL and ASCII-based BC instantiations—tests the ability to navigate to objects that have never been a navigation goal in training for a particular maze, but have been frequently seen.
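(For intuition, here is a toy, hypothetical version of codebooks-style data; the names and details are made up and simplified relative to the actual benchmark.)

```python
import random

# Toy codebooks-style data (hypothetical; simplified relative to the benchmark).
codebook = {i: f"tok{i}" for i in range(8)}  # index -> code token

def definition_seq(i):
    # "Definition" sequences reveal what each index maps to.
    return f"define {i} = {codebook[i]}"

def encoding_seq(indices):
    # "Encoding" sequences apply several definitions at once.
    body = " ".join(codebook[i] for i in indices)
    return f"encode {indices} -> {body}"

latent_index = 7  # defined in training, but never used in any training encoding
train = [definition_seq(i) for i in codebook] + [
    encoding_seq(random.sample([i for i in codebook if i != latent_index], 3))
    for _ in range(100)
]

# The latent test asks for an encoding that uses latent_index: its definition
# was seen in training, but it never appeared in a training encoding sequence.
latent_test = encoding_seq([latent_index, 0, 1])
```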
lampinen.bsky.social
But models can readily use latent information in their context. We therefore suggest that natural intelligence solves the latent learning problem via the complementary strengths of episodic memory: reinstating experiences into context makes latent information accessible. 4/
Explicit retrieval of learning experiences from nonparametric learning systems complements the broader knowledge of parametric learning—by making select, relevant experiences available in context where they can be more flexibly used in ways different from the original task setting in which they were encountered.
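(For intuition, a minimal sketch of what oracle retrieval could look like; this is not the paper's implementation, and the model, generate, and relevance_fn interfaces are made up.)

```python
# Minimal sketch of oracle retrieval (illustrative only; hypothetical interfaces).
# The "oracle" knows which training episodes contain the latent information
# needed for the current query, and reinstates them into the model's context.

def oracle_retrieve(query, training_episodes, relevance_fn):
    """Return the training episodes the oracle deems relevant to this query."""
    return [ep for ep in training_episodes if relevance_fn(query, ep)]

def answer_with_retrieval(model, query, training_episodes, relevance_fn):
    retrieved = oracle_retrieve(query, training_episodes, relevance_fn)
    # Reinstating the raw experiences in context lets the model reuse their
    # latent content flexibly, rather than relying only on whatever parametric
    # learning happened to encode about them.
    context = "\n".join(retrieved) + "\n" + query
    return model.generate(context)
```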
lampinen.bsky.social
we argue that parametric learning methods are too tied to the explicit training task and fail to effectively encode latent information relevant to possible future tasks. We suggest this explains a wide range of findings, from navigation to the reversal curse. 3/
While a model may be trained on some explicit information (e.g. "X is Y's teacher") or goals (e.g. navigate to Z), there may be other information latent in it (such as the reversal "Y's teacher is X").
Challenges of reversal are one instance of the much broader phenomenon that what is explicitly learned may also latently convey information relevant to other tasks—e.g., multi-hop reasoning, alternative goals, or answering questions in other languages. Like the reversal curse, learning on such sequences may primarily improve performance on the explicit information or goals; however, if the sequence were in context, models would readily be able to make inferences about the latent information.
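(A toy illustration of this setup, with made-up names rather than the paper's actual data: training states each relation in one direction only, while evaluation queries the latent reverse.)

```python
# Toy reversal setup (hypothetical names; not the paper's data).
people = [("Alice", "Bob"), ("Carol", "Dave")]

# Training sequences state the relation in one direction only.
train_set = [f"{teacher} is {student}'s teacher." for teacher, student in people]

# Evaluation queries the latent reverse relation, which never appears in training.
eval_set = [
    (f"Who is {teacher}'s student?", student) for teacher, student in people
]

# Finetuning on train_set typically improves the forward direction; the reverse
# queries in eval_set are where parametric learning tends to fail, even though
# a model could readily answer them if train_set were placed in its context.
```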
lampinen.bsky.social
We take inspiration from classic experiments on latent learning in animals, where the animals learn about information that is not useful at present, but that might be useful later — for example, learning the location of useful resources in passing. By contrast, 2/
Reposted by Andrew Lampinen
nsaphra.bsky.social
How can an imitative model like an LLM outperform the experts it is trained on? Our new COLM paper outlines three types of transcendence and shows that each one relies on a different aspect of data diversity. arxiv.org/abs/2508.17669
lampinen.bsky.social
Thanks! Yes, I'm interested in which constraints most strongly push against this: 1) efficiency of acting (current FHE is slow), 2) efficiency of learning (simplicity bias), 3) maybe relatedly, probability of learning a la arxiv.org/abs/1805.08522, or 4) some combination thereof
Deep learning generalizes because the parameter-function map is biased towards simple functions
Deep neural networks (DNNs) generalize remarkably well without explicit regularization even in the strongly over-parametrized regime where classical learning theory would instead predict that they wou...
arxiv.org
lampinen.bsky.social
When we've compared these in past work e.g. Supplement fig. A.13 here proceedings.neurips.cc/paper/2020/h... we've seen pretty similar results between the two. Haven't run it in exactly this setting though. There are also some arguments that 1/2
lampinen.bsky.social
even though both are linearly decodable and equally predictive. Katherine's paper studies some instances more thoroughly in simple settings. My sense though is that the magnitude of these effects is quite a bit smaller than the base bias, so probably not a huge issue if datasets aren't tiny. 2/2
lampinen.bsky.social
I don't know of any reviews unfortunately! Fig. 16 in our TMLR paper (openreview.net/forum?id=aY2...) shows an instance — training classifiers on the penultimate reps to decode a label predicted by both easy and hard features; at high predictivity the classifier prefers the easy feature, even 1/2
lampinen.bsky.social
Thanks, glad you like it!
lampinen.bsky.social
just by dimensionality arguments (input dim 64 << first rep 256), even before training, *any* function of the inputs will likely be computable from that rep with a sufficiently complex nonlinear decoder — even features like XOR that the model is *incapable* of computing at the first layer. 2/2
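(A quick toy check of this dimensionality argument; the dimensions echo the post, and an off-the-shelf sklearn MLP stands in for "a sufficiently complex nonlinear decoder". This is an illustrative sketch, not the actual setup.)

```python
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split

# Toy check: can a nonlinear decoder read out XOR from an *untrained* random
# first-layer representation? (Illustrative; dims echo the post.)
rng = np.random.default_rng(0)
n, d_in, d_rep = 10000, 64, 256

X = rng.integers(0, 2, size=(n, d_in)).astype(float)      # binary inputs
y = X[:, 0].astype(int) ^ X[:, 1].astype(int)              # XOR of two input bits

W = rng.normal(size=(d_in, d_rep)) / np.sqrt(d_in)          # untrained first layer
reps = np.maximum(X @ W, 0.0)                                # random ReLU representation

reps_tr, reps_te, y_tr, y_te = train_test_split(reps, y, test_size=0.2, random_state=0)

# A sufficiently expressive nonlinear decoder can often recover XOR from the
# random representation, because the random projection (almost surely injective)
# preserves the input information, even though the first layer never computed XOR.
probe = MLPClassifier(hidden_layer_sizes=(256,), max_iter=300, random_state=0)
probe.fit(reps_tr, y_tr)
print("nonlinear decoder accuracy on XOR:", probe.score(reps_te, y_te))
```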
lampinen.bsky.social
Good Q: it clearly helps with that concern! But 1) variance biases still affect what nonlinear decoders will learn from finite data (cf. availability effects here arxiv.org/abs/2310.16228). 2) there's also a concern of "overestimating" what is represented. E.g. in our models, 1/2
On the Foundations of Shortcut Learning
Deep-learning models can extract a rich assortment of features from data. Which features a model uses depends not only on \emph{predictivity} -- how reliably a feature indicates training-set labels --...
arxiv.org
lampinen.bsky.social
Thoughts and feedback are very welcome btw — there are lots of subtle issues in this space that I probably haven't addressed perfectly, and probably prior works that I've missed.
lampinen.bsky.social
Thanks to my co-authors @scychan.bsky.social, Effie Li & Katherine Hermann, and the (many) others I've discussed these issues with recently and over the past few years!
lampinen.bsky.social
These kinds of cases definitely don’t mean studying representations is useless! But they do suggest we may achieve incomplete understanding if we’re not careful. See the paper (arxiv.org/abs/2507.22216) and our prior work (bsky.app/profile/lamp...) for further discussion, caveats, etc.
lampinen.bsky.social
How well can we understand an LLM by interpreting its representations? What can we learn by comparing brain and model representations? Our new paper highlights intriguing biases in learned feature representations that make interpreting them more challenging! 1/
Clear clusters in model representations driven by some features (plot colors) but neglecting other more complex ones (plotted as shapes) which are mixed within the color clusters.
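(A rough sketch of how a plot like the one described above could be produced; the data are synthetic and the "easy"/"hard" features are made up for illustration.)

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Synthetic stand-in for model representations, with structure injected for
# one "easy" feature (colors) but not a second "hard" feature (shapes).
rng = np.random.default_rng(0)
reps = rng.normal(size=(300, 128))
easy = rng.integers(0, 3, size=300)                          # drives the clusters (colors)
hard = rng.integers(0, 2, size=300)                          # mixed within clusters (shapes)
reps += 3.0 * np.eye(3)[easy] @ rng.normal(size=(3, 128))    # inject easy-feature structure

xy = PCA(n_components=2).fit_transform(reps)
for h, marker in zip([0, 1], ["o", "^"]):
    m = hard == h
    plt.scatter(xy[m, 0], xy[m, 1], c=easy[m], marker=marker, cmap="viridis")
plt.title("Colors (easy feature) cluster; shapes (hard feature) mix within clusters")
plt.show()
```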