Alexi Gladstone
@alexiglad.bsky.social
PhD @ UIUC advised by Heng Ji. RS Intern @ Meta, previously @ Palantir, UVA. Working on SSL, world models, multimodal learning.

https://alexiglad.github.io/
[10/N] We also compare EBTs to diffusion models on relatively toy image denoising tasks, where we observe that EBTs outperform diffusion models while using 99% fewer forward passes.

EBTs also learn better representations of images than diffusion models, achieving a ~10x higher ImageNet accuracy.
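(For intuition on the forward-pass gap: a diffusion model runs one forward pass per timestep of its reverse process, commonly hundreds to ~1000 of them, while an EBT can refine the image directly with a handful of energy-gradient steps. Below is a toy sketch of the two loops; `energy_model`, `denoise_step`, and the step counts are placeholders chosen so the arithmetic roughly matches the ~99% figure, not the paper's exact setup.)

```python
import torch

def ebt_denoise(energy_model, noisy_img, num_steps=10, alpha=0.1):
    """Denoise by descending the image's energy: ~10 forward passes (each with a backward)."""
    img = noisy_img.clone()
    for _ in range(num_steps):
        img = img.detach().requires_grad_(True)
        energy = energy_model(img).sum()            # one forward pass per step
        (grad,) = torch.autograd.grad(energy, img)  # one backward pass per step
        img = img - alpha * grad
    return img.detach()

def diffusion_denoise(denoise_step, noisy_img, num_timesteps=1000):
    """Typical reverse diffusion: one forward pass per timestep, ~1000 in total."""
    img = noisy_img.clone()
    for t in reversed(range(num_timesteps)):
        img = denoise_step(img, t)  # hypothetical per-timestep denoiser call
    return img
```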
[9/N] The finding that EBTs out-scale the Transformer++ also holds across modalities! We test this on video.📹

We think this performance improvement occurs because verification is often easier than generation and because EBTs can learn to express uncertainty in continuous spaces.
[7/N] 🧠We can also investigate the thinking capabilities of EBTs compared to the Transformer++ by increasing the amount of compute at inference time.

We find that EBTs can out-generalize the Transformer++ on out-of-distribution data by thinking longer, and that thinking also improves with scale.
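(Concretely, "thinking longer" here means taking more energy-minimization steps on the prediction at inference time, so compute can be scaled up without any retraining or rewards. A minimal sketch of that knob follows; the `energy_model` argument, step size, and step counts are placeholders rather than the paper's exact settings.)

```python
import torch

def refine(energy_model, x, y_init, num_steps, alpha=0.1):
    """Refine a prediction with `num_steps` energy-gradient steps.

    `num_steps` is the inference-time compute knob: more steps = more "thinking".
    """
    y = y_init.clone()
    energies = []
    for _ in range(num_steps):
        y = y.detach().requires_grad_(True)
        energy = energy_model(x, y).sum()
        (grad,) = torch.autograd.grad(energy, y)
        y = y - alpha * grad              # step toward lower energy
        energies.append(energy.item())    # a falling energy trace = useful "thought"
    return y.detach(), energies

# Same trained model, two thinking budgets; no retraining, no rewards:
# y_fast, _ = refine(energy_model, x, y_init, num_steps=2)
# y_slow, _ = refine(energy_model, x, y_init, num_steps=32)
```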
[6/N] Of particular note is the data scaling, where we consistently observe EBTs being over 30% more data-efficient than the Transformer++. This is especially important because frontier labs say we are now data-constrained, making more data-efficient algorithms essential for further progress.
[5/N] We compare autoregressive EBTs against the SOTA recipe (Transformer++) in language modeling. We observe that EBTs consistently scale at a higher rate than the Transformer++ with respect to data, batch size, depth, FLOPs, and parameters.📈
[4/N] So if EBMs are so promising, why are they uncommon, and why haven’t they been used at scale?

EBMs have struggled to scale due to issues with stability and parallelization. Therefore, we design Transformers specifically to address these issues, which we call Energy-Based Transformers (EBTs).
[3/N] So what are EBMs?💭

EBMs learn to assign a scalar energy value denoting the compatibility between inputs (e.g., a context and a candidate prediction).

Then, EBMs learn to optimize predictions to minimize this energy.

This allows EBMs to know when a problem is difficult (high energy) and to allocate more compute until a good solution is found.
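(A minimal PyTorch-style sketch of this prediction-as-energy-minimization loop; the `EnergyModel` architecture, step size, and step count are illustrative placeholders, not the actual EBT implementation.)

```python
import torch
import torch.nn as nn

# Hypothetical energy model: scores how compatible a candidate prediction y is
# with a context x. Lower energy = more compatible. (EBTs use a Transformer
# here; a small MLP keeps the sketch short.)
class EnergyModel(nn.Module):
    def __init__(self, x_dim, y_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(x_dim + y_dim, hidden),
            nn.SiLU(),
            nn.Linear(hidden, 1),  # scalar energy per (x, y) pair
        )

    def forward(self, x, y):
        return self.net(torch.cat([x, y], dim=-1)).squeeze(-1)

def predict(energy_model, x, y_dim, num_steps=10, alpha=0.1):
    """Prediction = start from a guess, then descend the energy landscape."""
    y = torch.randn(x.shape[0], y_dim)  # initial random guess
    for _ in range(num_steps):
        y = y.detach().requires_grad_(True)
        energy = energy_model(x, y).sum()
        (grad,) = torch.autograd.grad(energy, y)
        y = y - alpha * grad  # move the prediction toward lower energy
    return y.detach()

# Usage: y_hat = predict(EnergyModel(x_dim=32, y_dim=8), torch.randn(4, 32), y_dim=8)
# If the final energy is still high, the model "knows" the problem is hard.
```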
[2/N] 🤔So how can models learn to think from unsupervised learning?

It turns out that there’s an elegant solution:💡
Learn to verify predictions
Optimize predictions with respect to this verifier

This is exactly what Energy-Based Models (EBMs) are! EBMs enable models to think longer and to self-verify.
How can we unlock generalized reasoning?

⚡️Introducing Energy-Based Transformers (EBTs), an approach that out-scales (feed-forward) transformers and unlocks generalized reasoning/thinking on any modality/problem without rewards.

🧵Thread:
July 7, 2025 at 8:33 PM