Hafez Ghaemi
@hafezghm.bsky.social
60 followers 44 following 11 posts
Ph.D. Student @mila-quebec.bsky.social and @umontreal.ca, AI Researcher
Reposted by Hafez Ghaemi
shahabbakht.bsky.social
Thrilled to see this work accepted at NeurIPS!

Kudos to @hafezghm.bsky.social for the heroic effort in demonstrating the efficacy of seq-JEPA in representation learning from multiple angles.

#MLSky 🧠🤖
hafezghm.bsky.social
Excited to share that seq-JEPA has been accepted to NeurIPS 2025!
hafezghm.bsky.social
Preprint Alert 🚀

Can we simultaneously learn transformation-invariant and transformation-equivariant representations with self-supervised learning?

TL;DR Yes! This is possible via simple predictive learning & architectural inductive biases – without extra loss terms and predictors!

🧵 (1/10)
hafezghm.bsky.social
Interestingly, seq-JEPA shows path integration capabilities – an important research problem in neuroscience. By observing a sequence of views and their corresponding actions, it can integrate the path connecting the initial view to the final view.

(9/10)
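[For intuition, a minimal sketch of a path-integration probe like the one described above, assuming random placeholder tensors in place of seq-JEPA's aggregate representations and the per-step relative actions; all names and dimensions here are hypothetical, not the authors' code.]

```python
import torch
import torch.nn as nn

# Placeholders standing in for seq-JEPA outputs: `agg_reps` would be the aggregate
# representation of each observation sequence, `actions` the per-step relative
# actions (e.g., 2-D displacements between consecutive views).
B, T, D_rep, D_act = 256, 5, 256, 2
agg_reps = torch.randn(B, D_rep)
actions = torch.randn(B, T, D_act)
cumulative = actions.sum(dim=1)  # ground-truth integrated path, initial -> final view

# Linear probe: can the cumulative transformation be read out from the aggregate rep?
probe = nn.Linear(D_rep, D_act)
opt = torch.optim.Adam(probe.parameters(), lr=1e-3)
for _ in range(200):
    opt.zero_grad()
    loss = nn.functional.mse_loss(probe(agg_reps), cumulative)
    loss.backward()
    opt.step()
```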
hafezghm.bsky.social
Thanks to action conditioning, the visual backbone encodes rotation information that can be decoded from its representations, while the transformer encoder aggregates the different rotated views, reducing intra-class variation (caused by rotations) and producing a semantic object representation.

(8/10)
hafezghm.bsky.social
On the 3D Invariant-Equivariant Benchmark (3DIEBench), where each object view has a different rotation, seq-JEPA achieves top performance on both invariance-related object categorization and equivariance-related rotation prediction, w/o sacrificing one for the other.

(7/10)
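[A rough sketch of the kind of linear-probe evaluation suggested by the two posts above, under the assumption that the aggregate representation is probed for object category and the per-view (backbone) representations for rotation; the tensors, class count, and dimensions are placeholders, not the actual 3DIEBench pipeline.]

```python
import torch
import torch.nn as nn

# Placeholders for frozen seq-JEPA features on a 3DIEBench-like setup:
# `agg_reps`  - aggregate representations (one per view sequence)
# `view_reps` - per-view backbone representations
# `labels`    - object classes; `rotations` - rotation parameters of each view (e.g., quaternions)
N, D, n_classes, rot_dim = 1024, 256, 10, 4
agg_reps = torch.randn(N, D)
view_reps = torch.randn(N, D)
labels = torch.randint(0, n_classes, (N,))
rotations = torch.randn(N, rot_dim)

# Probe 1: invariance-related task - classify objects from the aggregate representation.
clf = nn.Linear(D, n_classes)
# Probe 2: equivariance-related task - regress rotation from the per-view representations.
reg = nn.Linear(D, rot_dim)

opt = torch.optim.Adam(list(clf.parameters()) + list(reg.parameters()), lr=1e-3)
for _ in range(200):
    opt.zero_grad()
    loss = (nn.functional.cross_entropy(clf(agg_reps), labels)
            + nn.functional.mse_loss(reg(view_reps), rotations))
    loss.backward()
    opt.step()
```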
hafezghm.bsky.social
Seq-JEPA learns invariant-equivariant representations for tasks involving sequential observations and transformations; e.g., it can learn semantic image representations by seeing a sequence of small image patches across simulated eye movements, w/o hand-crafted augmentations or masking.

(6/10)
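[As an illustration of the eye-movement setting, a sketch of sampling a sequence of small patches ("glimpses") from an image, with relative fixation offsets serving as the actions; the helper function and its defaults are hypothetical.]

```python
import torch

def sample_glimpses(image, num_glimpses=5, glimpse_size=32):
    """Sample a sequence of small crops at random fixation points.

    Returns the crops plus the relative saccade vectors between consecutive
    fixations, which act as the actions conditioning each view. image: (C, H, W).
    """
    C, H, W = image.shape
    ys = torch.randint(0, H - glimpse_size + 1, (num_glimpses,))
    xs = torch.randint(0, W - glimpse_size + 1, (num_glimpses,))
    crops = torch.stack([image[:, y:y + glimpse_size, x:x + glimpse_size]
                         for y, x in zip(ys, xs)])
    fixations = torch.stack([ys, xs], dim=1).float()
    saccades = fixations[1:] - fixations[:-1]  # relative eye-movement "actions"
    return crops, saccades

# Example: 5 glimpses from a random 3x224x224 image
crops, saccades = sample_glimpses(torch.randn(3, 224, 224))
print(crops.shape, saccades.shape)  # torch.Size([5, 3, 32, 32]) torch.Size([4, 2])
```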
hafezghm.bsky.social
Post-training, the model has learned two segregated representations:

An action-invariant aggregate representation
Action-equivariant individual-view representations

💡No explicit equivariance loss or dual predictor required!

(5/10)
hafezghm.bsky.social
Inspired by this, we designed seq-JEPA, which processes sequences of views and their relative transformations (actions).

➡️ A transformer encoder aggregates these action-conditioned view representations to predict a yet unseen view.

(4/10)
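[A minimal, hypothetical PyTorch sketch of this kind of architecture (not the authors' implementation): a view encoder plus an action embedding feed a transformer aggregator, and a predictor maps the aggregate representation and the next action to the representation of the unseen view. Mean pooling for the aggregate and the stop-gradient target are assumptions here.]

```python
import torch
import torch.nn as nn

class SeqJEPASketch(nn.Module):
    """Hypothetical seq-JEPA-style predictive module."""

    def __init__(self, view_dim=512, action_dim=4, embed_dim=256, n_heads=4, depth=2):
        super().__init__()
        # Stand-in for a convolutional view encoder (backbone).
        self.view_encoder = nn.Sequential(
            nn.Linear(view_dim, embed_dim), nn.ReLU(), nn.Linear(embed_dim, embed_dim))
        self.action_embed = nn.Linear(action_dim, embed_dim)
        layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=n_heads, batch_first=True)
        self.aggregator = nn.TransformerEncoder(layer, num_layers=depth)
        # Predictor maps (aggregate rep, next action) -> predicted rep of the unseen view.
        self.predictor = nn.Sequential(
            nn.Linear(2 * embed_dim, embed_dim), nn.ReLU(), nn.Linear(embed_dim, embed_dim))

    def forward(self, views, actions, target_view, target_action):
        # views: (B, T, view_dim), actions: (B, T, action_dim)
        z = self.view_encoder(views) + self.action_embed(actions)  # action-conditioned view reps
        agg = self.aggregator(z).mean(dim=1)                       # aggregate representation (mean pooling assumed)
        with torch.no_grad():                                      # stop-gradient target (assumed)
            target = self.view_encoder(target_view)
        pred = self.predictor(torch.cat([agg, self.action_embed(target_action)], dim=-1))
        return pred, target

# Toy usage: 8 sequences of 4 views each, predicting a held-out 5th view.
model = SeqJEPASketch()
views, actions = torch.randn(8, 4, 512), torch.randn(8, 4, 4)
target_view, target_action = torch.randn(8, 512), torch.randn(8, 4)
pred, target = model(views, actions, target_view, target_action)
loss = nn.functional.mse_loss(pred, target)
```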
hafezghm.bsky.social
🧠 Humans learn to recognize new objects by moving around them, manipulating them, and probing them via eye movements. Different views of a novel object are generated through actions (manipulations & eye movements) that are then integrated to form new concepts in the brain.

(3/10)
hafezghm.bsky.social
Current SSL methods face a trade-off: optimizing for transformation invariance in representation space (useful for high-level classification) often reduces equivariance (needed for detail-sensitive tasks like predicting object rotation & movement). Our world model, seq-JEPA, resolves this trade-off.

(2/10)
hafezghm.bsky.social
Preprint Alert 🚀

Can we simultaneously learn transformation-invariant and transformation-equivariant representations with self-supervised learning?

TL;DR Yes! This is possible via simple predictive learning & architectural inductive biases – without extra loss terms and predictors!

🧵 (1/10)