Hafez Ghaemi
@hafezghm.bsky.social
60 followers 44 following 11 posts
Ph.D. Student @mila-quebec.bsky.social and @umontreal.ca, AI Researcher
Reposted by Hafez Ghaemi
shahabbakht.bsky.social
Thrilled to see this work accepted at NeurIPS!

Kudos to @hafezghm.bsky.social for the heroic effort in demonstrating the efficacy of seq-JEPA in representation learning from multiple angles.

#MLSky 🧠🤖
hafezghm.bsky.social
Excited to share that seq-JEPA has been accepted to NeurIPS 2025!
hafezghm.bsky.social
Preprint Alert 🚀

Can we simultaneously learn transformation-invariant and transformation-equivariant representations with self-supervised learning?

TL;DR Yes! This is possible via simple predictive learning & architectural inductive biases – without extra loss terms and predictors!

🧵 (1/10)
hafezghm.bsky.social
Interestingly, seq-JEPA shows path integration capabilities – an important research problem in neuroscience. By observing a sequence of views and their corresponding actions, it can integrate the path connecting the initial view to the final view.

(9/10)
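[For intuition, a minimal sketch of a path-integration probe like the one described above, assuming random placeholder tensors in place of seq-JEPA's aggregate representations and the per-step relative actions; all names and dimensions here are hypothetical, not the authors' code.]

```python
import torch
import torch.nn as nn

# Placeholders standing in for seq-JEPA outputs: `agg_reps` would be the aggregate
# representation of each observation sequence, `actions` the per-step relative
# actions (e.g., 2-D displacements between consecutive views).
B, T, D_rep, D_act = 256, 5, 256, 2
agg_reps = torch.randn(B, D_rep)
actions = torch.randn(B, T, D_act)
cumulative = actions.sum(dim=1)  # ground-truth integrated path, initial -> final view

# Linear probe: can the cumulative transformation be read out from the aggregate rep?
probe = nn.Linear(D_rep, D_act)
opt = torch.optim.Adam(probe.parameters(), lr=1e-3)
for _ in range(200):
    opt.zero_grad()
    loss = nn.functional.mse_loss(probe(agg_reps), cumulative)
    loss.backward()
    opt.step()
```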
hafezghm.bsky.social
Thanks to action conditioning, the visual backbone encodes rotation information that can be decoded from its representations, while the transformer encoder aggregates the different rotated views, reducing intra-class variation (caused by rotations) and producing a semantic object representation.

(8/10)
hafezghm.bsky.social
On the 3D Invariant-Equivariant Benchmark (3DIEBench), where each object view has a different rotation, seq-JEPA achieves top performance on both invariance-related object categorization and equivariance-related rotation prediction, w/o sacrificing one for the other.

(7/10)
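[A rough sketch of the kind of linear-probe evaluation suggested by the two posts above, under the assumption that the aggregate representation is probed for object category and the per-view (backbone) representations for rotation; the tensors, class count, and dimensions are placeholders, not the actual 3DIEBench pipeline.]

```python
import torch
import torch.nn as nn

# Placeholders for frozen seq-JEPA features on a 3DIEBench-like setup:
# `agg_reps`  - aggregate representations (one per view sequence)
# `view_reps` - per-view backbone representations
# `labels`    - object classes; `rotations` - rotation parameters of each view (e.g., quaternions)
N, D, n_classes, rot_dim = 1024, 256, 10, 4
agg_reps = torch.randn(N, D)
view_reps = torch.randn(N, D)
labels = torch.randint(0, n_classes, (N,))
rotations = torch.randn(N, rot_dim)

# Probe 1: invariance-related task - classify objects from the aggregate representation.
clf = nn.Linear(D, n_classes)
# Probe 2: equivariance-related task - regress rotation from the per-view representations.
reg = nn.Linear(D, rot_dim)

opt = torch.optim.Adam(list(clf.parameters()) + list(reg.parameters()), lr=1e-3)
for _ in range(200):
    opt.zero_grad()
    loss = (nn.functional.cross_entropy(clf(agg_reps), labels)
            + nn.functional.mse_loss(reg(view_reps), rotations))
    loss.backward()
    opt.step()
```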
hafezghm.bsky.social
Seq-JEPA learns invariant-equivariant representations for tasks involving sequential observations and transformations; e.g., it can learn semantic image representations by seeing a sequence of small image patches across simulated eye movements, w/o hand-crafted augmentations or masking.

(6/10)
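[As an illustration of the eye-movement setting, a sketch of sampling a sequence of small patches ("glimpses") from an image, with relative fixation offsets serving as the actions; the helper function and its defaults are hypothetical.]

```python
import torch

def sample_glimpses(image, num_glimpses=5, glimpse_size=32):
    """Sample a sequence of small crops at random fixation points.

    Returns the crops plus the relative saccade vectors between consecutive
    fixations, which act as the actions conditioning each view. image: (C, H, W).
    """
    C, H, W = image.shape
    ys = torch.randint(0, H - glimpse_size + 1, (num_glimpses,))
    xs = torch.randint(0, W - glimpse_size + 1, (num_glimpses,))
    crops = torch.stack([image[:, y:y + glimpse_size, x:x + glimpse_size]
                         for y, x in zip(ys, xs)])
    fixations = torch.stack([ys, xs], dim=1).float()
    saccades = fixations[1:] - fixations[:-1]  # relative eye-movement "actions"
    return crops, saccades

# Example: 5 glimpses from a random 3x224x224 image
crops, saccades = sample_glimpses(torch.randn(3, 224, 224))
print(crops.shape, saccades.shape)  # torch.Size([5, 3, 32, 32]) torch.Size([4, 2])
```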
hafezghm.bsky.social
Post-training, the model has learned two segregated representations:

An action-invariant aggregate representation
Action-equivariant individual-view representations

💡No explicit equivariance loss or dual predictor required!

(5/10)
hafezghm.bsky.social
Inspired by this, we designed seq-JEPA, which processes sequences of views and their relative transformations (actions).

➡️ A transformer encoder aggregates these action-conditioned view representations to predict a yet unseen view.

(4/10)
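[A minimal, hypothetical PyTorch sketch of this kind of architecture (not the authors' implementation): a view encoder plus an action embedding feed a transformer aggregator, and a predictor maps the aggregate representation and the next action to the representation of the unseen view. Mean pooling for the aggregate and the stop-gradient target are assumptions here.]

```python
import torch
import torch.nn as nn

class SeqJEPASketch(nn.Module):
    """Hypothetical seq-JEPA-style predictive module."""

    def __init__(self, view_dim=512, action_dim=4, embed_dim=256, n_heads=4, depth=2):
        super().__init__()
        # Stand-in for a convolutional view encoder (backbone).
        self.view_encoder = nn.Sequential(
            nn.Linear(view_dim, embed_dim), nn.ReLU(), nn.Linear(embed_dim, embed_dim))
        self.action_embed = nn.Linear(action_dim, embed_dim)
        layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=n_heads, batch_first=True)
        self.aggregator = nn.TransformerEncoder(layer, num_layers=depth)
        # Predictor maps (aggregate rep, next action) -> predicted rep of the unseen view.
        self.predictor = nn.Sequential(
            nn.Linear(2 * embed_dim, embed_dim), nn.ReLU(), nn.Linear(embed_dim, embed_dim))

    def forward(self, views, actions, target_view, target_action):
        # views: (B, T, view_dim), actions: (B, T, action_dim)
        z = self.view_encoder(views) + self.action_embed(actions)  # action-conditioned view reps
        agg = self.aggregator(z).mean(dim=1)                       # aggregate representation (mean pooling assumed)
        with torch.no_grad():                                      # stop-gradient target (assumed)
            target = self.view_encoder(target_view)
        pred = self.predictor(torch.cat([agg, self.action_embed(target_action)], dim=-1))
        return pred, target

# Toy usage: 8 sequences of 4 views each, predicting a held-out 5th view.
model = SeqJEPASketch()
views, actions = torch.randn(8, 4, 512), torch.randn(8, 4, 4)
target_view, target_action = torch.randn(8, 512), torch.randn(8, 4)
pred, target = model(views, actions, target_view, target_action)
loss = nn.functional.mse_loss(pred, target)
```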
hafezghm.bsky.social
🧠 Humans learn to recognize new objects by moving around them, manipulating them, and probing them via eye movements. Different views of a novel object are generated through actions (manipulations & eye movements) that are then integrated to form new concepts in the brain.

(3/10)
hafezghm.bsky.social
Current SSL methods face a trade-off: optimizing for transformation invariance in representation space (useful for high-level classification) often reduces equivariance (needed for detail-sensitive tasks like predicting object rotation & movement). Our world model, seq-JEPA, resolves this trade-off.

(2/10)
hafezghm.bsky.social
Preprint Alert 🚀

Can we simultaneously learn transformation-invariant and transformation-equivariant representations with self-supervised learning?

TL;DR Yes! This is possible via simple predictive learning & architectural inductive biases – without extra loss terms and predictors!

🧵 (1/10)