SajayR
sajayr.bsky.social
CV/NLP Undergrad Researcher
I post about cool papers in CV/NLP or about my own fun experiments :D
ツ(psst also I am just a solo undergrad working on this, my fault if I get some stuff wrong)
March 2, 2025 at 12:21 PM
Trained on a subset of AudioSet and CC3M for now, with lots of open questions to explore around architecture design and scaling before the full paper.
More experiments, improvements and 𝚙̶𝚛̶𝚘̶𝚙̶𝚎̶𝚛̶ evals coming soon.
HuggingFace: huggingface.co/SajayR/Triad
Github: github.com/SajayR/TRIAD
March 2, 2025 at 12:21 PM
Current approaches learn global alignment between modalities (think ImageBind or CLIP); Triad instead learns to map precise image regions to corresponding audio segments and text spans. So each patch/segment/span carries semantic concept information that is embedded in the same shared embedding space.
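To make the patch/segment-level idea concrete, here's a minimal sketch of a dense contrastive objective where each image patch matches against audio segments instead of comparing single global embeddings. The function name, shapes, and the max-over-segments aggregation are my own illustrative assumptions, not TRIAD's actual loss:

```python
import torch
import torch.nn.functional as F

def dense_contrastive_loss(img_patches, audio_segments, temperature=0.07):
    """Dense (patch/segment-level) contrastive loss sketch.

    img_patches:    (B, P, D) image patch embeddings
    audio_segments: (B, S, D) audio segment embeddings
    """
    img = F.normalize(img_patches, dim=-1)
    aud = F.normalize(audio_segments, dim=-1)
    # Patch-to-segment similarity for every (image, audio) pair in the batch
    sim = torch.einsum('ipd,jsd->ijps', img, aud)  # (B, B, P, S)
    # Each patch attends to its best-matching segment, then average over
    # patches -> one alignment score per (image, audio) pair
    scores = sim.max(dim=-1).values.mean(dim=-1) / temperature  # (B, B)
    labels = torch.arange(img.size(0), device=img.device)
    # Symmetric InfoNCE: matched pairs sit on the diagonal
    return (F.cross_entropy(scores, labels) +
            F.cross_entropy(scores.T, labels)) / 2
```

The key difference from a CLIP-style loss is that similarity is computed between local units first and only pooled afterwards, so gradients flow to individual patches and segments.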
March 2, 2025 at 12:21 PM
Nah, I've been following you for a decent while on there and just thought I'd say hi.
November 25, 2024 at 3:44 PM
heyo :D
November 25, 2024 at 3:39 PM
oh yeah you *totally* follow me on twitter
(but I will be just posting about cool papers I read in CV/NLP soooo :) )
November 24, 2024 at 4:38 PM
Building off of DINO's teacher-student architecture (I should probably cover that too), they create multiple views of each frame by masking all but one tracked object. The student then has to match the teacher's predictions while only seeing these single-object views.
November 23, 2024 at 4:07 PM
This tracking builds natural views that capture how objects actually transform in the real world, which is probably one of the major reasons for the SOTA evals.
November 23, 2024 at 4:07 PM
They use the emergent tendency of transformer attention heads to focus on different parts of an image (as seen in DINO's paper), clean up these attention maps using Sinkhorn-Knopp to get non-overlapping object regions, and then track these objects temporally with cross-attention between frames.
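The Sinkhorn-Knopp step can be sketched in a few lines: alternately normalize rows and columns of the head-by-patch attention matrix so that each head ends up claiming a roughly disjoint set of patches, then take a hard argmax to carve out non-overlapping regions. This is a generic Sinkhorn-Knopp sketch under my own assumed shapes, not the paper's exact procedure:

```python
import torch

def sinkhorn_regions(attn, n_iters=3, eps=1e-8):
    """Turn a (heads, patches) attention map into non-overlapping regions.

    Alternating row/column normalization (Sinkhorn-Knopp) spreads mass so
    heads compete for patches; argmax then gives one head index per patch.
    """
    q = attn.clone()
    for _ in range(n_iters):
        q = q / (q.sum(dim=1, keepdim=True) + eps)  # each head sums to 1
        q = q / (q.sum(dim=0, keepdim=True) + eps)  # each patch sums to 1
    # Hard assignment: each patch belongs to exactly one head's region,
    # so the resulting object masks cannot overlap
    return q.argmax(dim=0)  # (patches,) head index per patch
```

The non-overlap comes for free from the argmax; the Sinkhorn iterations just keep one dominant head from swallowing every patch before that assignment is made.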
November 23, 2024 at 4:07 PM
Ah, was hoping for some deeper mech-interp insight but I guess it does make sense for the general user/devs.
November 21, 2024 at 10:16 PM