SajayR
sajayr.bsky.social
CV/NLP Undergrad Researcher
I post about cool papers in CV/NLP or about my own fun experiments :D
ツ(psst also I am just a solo undergrad working on this, my fault if I get some stuff wrong)
March 2, 2025 at 12:21 PM
Trained on a subset of AudioSet and CC3M for now, with lots of open questions to explore around architecture design and scaling before the full paper.
More experiments, improvements and 𝚙̶𝚛̶𝚘̶𝚙̶𝚎̶𝚛̶ evals coming soon.
HuggingFace: huggingface.co/SajayR/Triad
Github: github.com/SajayR/TRIAD
March 2, 2025 at 12:21 PM
Current approaches learn global alignment between modalities (think ImageBind or CLIP); Triad instead learns to map precise image regions to corresponding audio segments and text spans. So each patch/segment/span carries semantic concept information that is embedded in the same shared embedding space.
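To make the patch/segment-level idea concrete, here's a minimal sketch of a dense contrastive objective where each image patch matches against audio segments instead of comparing single global embeddings. The function name, shapes, and the max-over-segments aggregation are my own illustrative assumptions, not TRIAD's actual loss:

```python
import torch
import torch.nn.functional as F

def dense_contrastive_loss(img_patches, audio_segments, temperature=0.07):
    """Dense (patch/segment-level) contrastive loss sketch.

    img_patches:    (B, P, D) image patch embeddings
    audio_segments: (B, S, D) audio segment embeddings
    """
    img = F.normalize(img_patches, dim=-1)
    aud = F.normalize(audio_segments, dim=-1)
    # Patch-to-segment similarity for every (image, audio) pair in the batch
    sim = torch.einsum('ipd,jsd->ijps', img, aud)  # (B, B, P, S)
    # Each patch attends to its best-matching segment, then average over
    # patches -> one alignment score per (image, audio) pair
    scores = sim.max(dim=-1).values.mean(dim=-1) / temperature  # (B, B)
    labels = torch.arange(img.size(0), device=img.device)
    # Symmetric InfoNCE: matched pairs sit on the diagonal
    return (F.cross_entropy(scores, labels) +
            F.cross_entropy(scores.T, labels)) / 2
```

The key difference from a CLIP-style loss is that similarity is computed between local units first and only pooled afterwards, so gradients flow to individual patches and segments.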
March 2, 2025 at 12:21 PM
Nah, I've been following you for a decent while on there and just thought I'd say hi.
November 25, 2024 at 3:44 PM
heyo :D
November 25, 2024 at 3:39 PM
oh yeah you *totally* follow me on twitter
(but I will be just posting about cool papers I read in CV/NLP soooo :) )
November 24, 2024 at 4:38 PM
Building off of DINO's teacher-student architecture (I should probably cover that too), they create multiple views of each frame by masking all but one tracked object. The student then has to match the teacher's predictions while only seeing these single-object views.
November 23, 2024 at 4:07 PM
This tracking builds natural views that capture how objects actually transform in the real world, which is probably one of the major reasons for the SOTA evals.
November 23, 2024 at 4:07 PM
They use the emergent tendency of transformer attention heads to focus on different parts of an image (as seen in DINO's paper), clean up these attention maps using Sinkhorn-Knopp to get non-overlapping object regions, and then track these objects temporally with cross-attention between frames.
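The Sinkhorn-Knopp step can be sketched in a few lines: alternately normalize rows and columns of the head-by-patch attention matrix so that each head ends up claiming a roughly disjoint set of patches, then take a hard argmax to carve out non-overlapping regions. This is a generic Sinkhorn-Knopp sketch under my own assumed shapes, not the paper's exact procedure:

```python
import torch

def sinkhorn_regions(attn, n_iters=3, eps=1e-8):
    """Turn a (heads, patches) attention map into non-overlapping regions.

    Alternating row/column normalization (Sinkhorn-Knopp) spreads mass so
    heads compete for patches; argmax then gives one head index per patch.
    """
    q = attn.clone()
    for _ in range(n_iters):
        q = q / (q.sum(dim=1, keepdim=True) + eps)  # each head sums to 1
        q = q / (q.sum(dim=0, keepdim=True) + eps)  # each patch sums to 1
    # Hard assignment: each patch belongs to exactly one head's region,
    # so the resulting object masks cannot overlap
    return q.argmax(dim=0)  # (patches,) head index per patch
```

The non-overlap comes for free from the argmax; the Sinkhorn iterations just keep one dominant head from swallowing every patch before that assignment is made.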
November 23, 2024 at 4:07 PM
Ah, was hoping for some deeper mech-interp insight but I guess it does make sense for the general user/devs.
November 21, 2024 at 10:16 PM