jpoberhauser.bsky.social
@jpoberhauser.bsky.social
Computer vision | multi modal learning | machine learning
Kalman*
November 29, 2024 at 9:59 PM
TL;DR this approach is explained by “we replace the Kalman filter used in ByteTrack for motion estimation with the learned predicted motion”. Code: github.com/cvlab-epfl/n...
GitHub - cvlab-epfl/noid-nopb
November 29, 2024 at 9:56 PM
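A rough sketch of that idea (not the repo's actual code; the MotionNet model and the track helpers here are hypothetical stand-ins): instead of propagating each track with the Kalman filter's predict step, a learned network regresses each box's displacement to the next frame.

```python
import torch

# Hypothetical learned motion model: given a short history of box states,
# predict the per-box displacement (dx, dy, dw, dh) to the next frame.
class MotionNet(torch.nn.Module):
    def __init__(self, history=5):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(history * 4, 64),
            torch.nn.ReLU(),
            torch.nn.Linear(64, 4),
        )

    def forward(self, box_history):        # (N, history * 4)
        return self.net(box_history)       # (N, 4) displacements

def predict_next_boxes(tracks, motion_net):
    """Replace the Kalman predict step with a learned motion prediction.
    `tracks` is assumed to expose recent_boxes()/last_box() -- placeholder API."""
    hist = torch.stack([t.recent_boxes() for t in tracks])    # (N, history * 4)
    with torch.no_grad():
        disp = motion_net(hist)                                # (N, 4)
    last = torch.stack([t.last_box() for t in tracks])         # (N, 4) in xywh
    return last + disp   # predicted boxes at time t, to be matched against detections
```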
“During training, we provide supervision for the detections and enforce consistency between the detections and displacements, which provides an additional supervisory signal without any additional annotations and increases performance”
November 29, 2024 at 9:56 PM
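How that extra supervisory signal might look, as a loose sketch of the idea rather than the paper's exact loss: only the detections are annotated, and the predicted displacements are trained to be consistent with them by warping the boxes at t-1 forward and comparing them to the detections at t.

```python
import torch.nn.functional as F

def consistency_loss(boxes_prev, boxes_curr, pred_disp):
    """Loose sketch: displacements predicted at t-1 should carry the boxes at t-1
    onto the detected boxes at t. Assumes boxes_prev[i] and boxes_curr[i]
    are already matched to the same object."""
    warped = boxes_prev + pred_disp            # move the t-1 boxes forward in time
    return F.smooth_l1_loss(warped, boxes_curr)

# total_loss = detection_loss + lambda_consistency * consistency_loss(...)
```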
The paper above explains “In this paper, we propose exploiting motion clues while providing supervision only for the detections, which is much easier to do.” And
November 29, 2024 at 9:56 PM
For example, if you want to track animals instead of pedestrians, they might display very different motion patterns. A KF might be good at state prediction, but maybe there is a more accurate approach? Especially if we have data we can model the motion from.
November 29, 2024 at 9:56 PM
Kalman filters provide good state estimators that combine measurements and state predictions in a Bayesian update step. But some papers aim at explicitly regressing motion or state predictions given a dataset to model motion from…
November 29, 2024 at 9:56 PM
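For reference, this is the predict/update cycle being referred to, written out for a toy one-dimensional constant-velocity state; the noise covariances are made-up placeholders.

```python
import numpy as np

# State x = [position, velocity]; constant-velocity motion, position-only measurement.
F = np.array([[1.0, 1.0],
              [0.0, 1.0]])        # state transition (dt = 1)
H = np.array([[1.0, 0.0]])        # we only observe position
Q = 0.01 * np.eye(2)              # process noise (placeholder value)
R = np.array([[1.0]])             # measurement noise (placeholder value)

def predict(x, P):
    x = F @ x
    P = F @ P @ F.T + Q
    return x, P                   # prior for time t, made at time t-1

def update(x, P, z):
    y = z - H @ x                 # innovation: detection vs. predicted state
    S = H @ P @ H.T + R
    K = P @ H.T @ np.linalg.inv(S)
    x = x + K @ y                 # Bayesian combination of prediction and measurement
    P = (np.eye(2) - K @ H) @ P
    return x, P
```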
There are ways to explicitly model motion from data, which could provide a more accurate state representation. In other words, when we receive detections at time t, they are compared to the state prediction for time t made at time t-1…
November 29, 2024 at 9:56 PM
“In this paper, we propose exploiting motion clues while providing supervision only for the detections, which is much easier to do.” Why would this be necessary? Well, trackers rely on Kalman filters to predict the state of an object at the next step. That state prediction is compared to the detections…
November 29, 2024 at 9:56 PM
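Concretely, that "compared to the detections" step is usually an IoU-based assignment between the predicted boxes and the new detections; a generic sketch (not ByteTrack's exact implementation):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def associate(predicted_boxes, detections):
    """Match state predictions (made at t-1 for time t) to detections at time t."""
    cost = np.array([[1.0 - iou(p, d) for d in detections] for p in predicted_boxes])
    rows, cols = linear_sum_assignment(cost)   # Hungarian assignment on 1 - IoU
    return list(zip(rows, cols))
```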
*Decreased* train times
November 26, 2024 at 4:28 PM
2) a high masking ratio is essential. And 3) an asymmetric encoder-decoder architecture, where the encoder only works on the unmasked patches, dramatically decreases training time!
November 26, 2024 at 4:24 PM
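A toy version of that asymmetric setup (a sketch only; the patch size, dimensions, and the tiny Linear "encoder"/"decoder" are placeholders, not the paper's ViT): mask ~75% of patches, run the encoder only on the visible ones, and let a light decoder reconstruct the rest.

```python
import torch
import torch.nn as nn

patch_dim, mask_ratio = 16 * 16 * 3, 0.75   # placeholder sizes
encoder = nn.Linear(patch_dim, 256)         # stand-in for the ViT encoder
decoder = nn.Linear(256, patch_dim)         # stand-in for the lightweight decoder

def mae_step(patches):                      # patches: (num_patches, patch_dim)
    n = patches.shape[0]
    n_keep = int(n * (1 - mask_ratio))
    perm = torch.randperm(n)
    visible_idx = perm[:n_keep]             # only ~25% of patches survive

    # The encoder sees only the visible patches -- this is what cuts training cost.
    latent = encoder(patches[visible_idx])

    # The decoder reconstructs every patch from the latent codes plus mask tokens
    # (crudely approximated here by zero vectors in place of the masked latents).
    full_latent = torch.zeros(n, latent.shape[1])
    full_latent[visible_idx] = latent
    recon = decoder(full_latent)

    # The loss is computed on the masked patches only.
    masked_idx = perm[n_keep:]
    return ((recon[masked_idx] - patches[masked_idx]) ** 2).mean()
```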
Challenges in making this work for vision were mostly around scalability and designing a reconstruction scheme that avoids models just learning easy shortcuts. Some important findings: 1) per-patch normalized pixel targets work better than raw pixel targets (and are on par with quantized tokens, so tokenization isn't strictly necessary).
November 26, 2024 at 4:24 PM
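The normalized-target trick from finding 1, roughly (a sketch; the epsilon and shapes are arbitrary): each patch is standardized with its own mean and standard deviation before being used as the reconstruction target.

```python
import torch

def normalized_targets(patches, eps=1e-6):
    """Per-patch normalization of pixel targets; patches: (num_patches, patch_dim)."""
    mean = patches.mean(dim=-1, keepdim=True)
    std = patches.std(dim=-1, keepdim=True)
    return (patches - mean) / (std + eps)
```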
MAEs do exactly the kind of training that BERT does for language. Remove some data and learn to predict said data.
November 26, 2024 at 4:24 PM
So with the massive success of self-supervised pretraining in language, could masked autoencoders be its vision counterpart?
November 26, 2024 at 4:24 PM
There has always been unsupervised learning in vision, but it mostly relied on pretext tasks like predicting the rotation of an image. That showed some success, but it was harder to show that those learned representations were in fact useful and generalizable.
November 26, 2024 at 4:24 PM
The main idea borrows from language: if you train a large model on a reconstruction task, you force the model (if done right) to learn useful semantics and representations. In the case of language it's masked- or next-token prediction. In the case of vision… it hadn't been clear yet.
November 26, 2024 at 4:24 PM
The original paper is this: arxiv.org/abs/2111.06377. They point out a couple of intuitive parallels between vision and language and explore whether what has been done with so much success in language can now be applied to vision.
Masked Autoencoders Are Scalable Vision Learners
This paper shows that masked autoencoders (MAE) are scalable self-supervised learners for computer vision. Our MAE approach is simple: we mask random patches of the input image and reconstruct the missing pixels.
November 26, 2024 at 4:24 PM
- RLT outperforms existing token reduction methods like Token Merging in terms of speed and performance.
November 25, 2024 at 2:08 PM
- Using FSQ-MagViT tokens as reconstruction targets improves performance compared to raw pixels.
- The model achieves impressive performance on long videos, demonstrating its potential for real-world applications.
November 25, 2024 at 2:07 PM
Could we, in theory, combine both? This integration would require _some_ consideration of how RLT's token removal affects LVMAE's adaptive decoder masking strategy.
November 25, 2024 at 1:57 PM
What Run-Length Tokenization (RLT) does to increase the speed and efficiency of video transformer training is identify and remove redundant visual tokens before they are processed by the model.
November 25, 2024 at 1:57 PM
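The core idea in rough form (a sketch of the concept, not the paper's implementation; the threshold is arbitrary): compare each patch to the same patch in the previous frame and drop the ones that barely changed, before the transformer ever sees them.

```python
import torch

def drop_static_tokens(patch_tokens, threshold=0.1):
    """patch_tokens: (frames, num_patches, dim) of patch embeddings or raw patches.
    Keeps every token of frame 0, then only tokens that changed vs. the previous frame."""
    kept = [patch_tokens[0]]                              # first frame kept in full
    for t in range(1, patch_tokens.shape[0]):
        diff = (patch_tokens[t] - patch_tokens[t - 1]).abs().mean(dim=-1)
        changed = diff > threshold                        # static (redundant) patches are dropped
        kept.append(patch_tokens[t][changed])
    return torch.cat(kept, dim=0)                         # variable-length token sequence
```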
LVMAE -- what LVMAE utilizes to make video transformers faster and more efficient to train is an adaptive decoder masking strategy.
November 25, 2024 at 1:57 PM
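Adaptive decoder masking could look roughly like this (purely illustrative; the importance scores and keep ratio here are invented, not LVMAE's actual rule): rather than reconstructing every masked token, the decoder is handed only the subset judged most informative, which shrinks the decoder's sequence length.

```python
import torch

def select_decoder_targets(masked_tokens, scores, keep_ratio=0.5):
    """masked_tokens: (num_masked, dim); scores: (num_masked,) importance scores
    (e.g. from temporal variance -- an assumption here, not the paper's criterion).
    Only the top-scoring masked tokens are passed to the decoder as reconstruction targets."""
    k = max(1, int(masked_tokens.shape[0] * keep_ratio))
    top = torch.topk(scores, k).indices
    return masked_tokens[top], top
```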