Gabriele Berton
@berton-gabri.bsky.social
650 followers 470 following 140 posts
Postdoc at Amazon on MLLM - ex CMU, PoliTo, IIT https://gmberton.github.io/
berton-gabri.bsky.social
data quality.

The main difference is that DataDecide splits the data according to its data source (usually training datasets are a collection of multiple datasets), while CLIMB creates clusters from each document's embeddings (meaning documents from ...
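A minimal sketch of the two splitting strategies, assuming a hypothetical embed_fn and documents stored as {"source": ..., "text": ...} dicts; KMeans here just stands in for whatever clustering CLIMB actually uses:

```python
# Minimal sketch: source-based splitting (DataDecide-style) vs. embedding-based
# clustering (CLIMB-style). embed_fn and the document format are assumptions.
import numpy as np
from collections import defaultdict
from sklearn.cluster import KMeans

def split_by_source(documents):
    # DataDecide-style: group documents by the dataset they come from.
    groups = defaultdict(list)
    for doc in documents:
        groups[doc["source"]].append(doc["text"])
    return groups

def split_by_embedding(documents, embed_fn, n_clusters=8):
    # CLIMB-style: cluster documents by their embeddings, so one cluster can mix
    # semantically similar documents coming from different sources.
    embeddings = np.stack([embed_fn(doc["text"]) for doc in documents])
    labels = KMeans(n_clusters=n_clusters, n_init="auto").fit_predict(embeddings)
    groups = defaultdict(list)
    for doc, label in zip(documents, labels):
        groups[int(label)].append(doc["text"])
    return groups
```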
berton-gabri.bsky.social
large LLM on many subsets would be prohibitively expensive).

Here are some similarities and differences between the two papers:

Both papers split all the available training data into subsets, train a small LLM on each subset, and see how it performs: its performance is used as a proxy for ...
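A minimal sketch of that shared recipe; train_small_llm and evaluate are hypothetical placeholders for the actual training and benchmark code:

```python
# Minimal sketch of the shared recipe: train a small proxy LLM on each subset and
# use its score as a proxy for data quality. train_small_llm and evaluate are
# hypothetical placeholders, not real functions from either paper.
def rank_subsets(subsets, train_small_llm, evaluate):
    scores = {}
    for name, subset in subsets.items():
        proxy_model = train_small_llm(subset)   # cheap, small-scale training run
        scores[name] = evaluate(proxy_model)    # e.g. average benchmark accuracy
    # Higher proxy score -> the subset is assumed to be better data for a large LLM too.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```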
berton-gabri.bsky.social
How to select pre-training data for LLMs?

Two papers came out last week from AllenAI and Nvidia that do it in a similar way, building on the intuition that good data is good regardless of the size of the LLM.

This intuition can be used to select good data cheaply (training a ...
Reposted by Gabriele Berton
ericzzj.bsky.social
To Match or Not to Match: Revisiting Image Matching for Reliable Visual Place Recognition

Davide Sferrazza, @berton-gabri.bsky.social, @gabtriv.bsky.social, Carlo Masone

tl;dr: VPR datasets saturate; re-ranking not good; image matching -> uncertainty -> inlier counts -> confidence

arxiv.org/abs/2504.06116
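A minimal sketch of the inlier-count-as-confidence idea from the tl;dr; match_images is a hypothetical placeholder for a real matcher, and the threshold is an assumption, not a value from the paper:

```python
# Minimal sketch of the tl;dr pipeline: match the query against the top retrieved
# image and use the inlier count as a confidence score for the VPR prediction.
# match_images is a hypothetical placeholder for a real matcher (e.g. local
# features + RANSAC); min_inliers is an assumed threshold.
def vpr_with_confidence(query_img, retrieved_imgs, match_images, min_inliers=50):
    best = retrieved_imgs[0]                     # top-1 candidate from retrieval
    num_inliers = match_images(query_img, best)  # geometric verification
    confident = num_inliers >= min_inliers       # more inliers -> more confidence
    return best, num_inliers, confident
```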
berton-gabri.bsky.social
When I read a paper, the only way I have to remember something about it six months from now is to use Anki
berton-gabri.bsky.social
Probably nobody knows how to pronounce his name and so they avoid talking about him
berton-gabri.bsky.social
And it gets better... for MCoT (Multimodal Chain-of-Thought) they should say "in recent weeks" 😂
berton-gabri.bsky.social
I find it mind-blowing that LLM papers should start saying "in recent months" instead of "in recent years". OpenAI o1 and DeepSeek R1 are literally a few months old
berton-gabri.bsky.social
The FastAPLoss gave us worse results than average, but again, those were preliminary results with batch size 32.

The SmoothAP and Recall@k losses are not in PML, so we didn't even consider them (we already had over 30 losses to try). It might be helpful to add your Recall@k to PML :)
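For reference, a minimal sketch of how a pytorch-metric-learning (PML) loss such as FastAPLoss plugs into a training step; the embedding size, number of classes, and random batch below are made-up placeholders, not the actual experimental setup:

```python
# Minimal sketch of using a PML loss; any of the library's losses can be swapped in.
import torch
from pytorch_metric_learning import losses

loss_func = losses.FastAPLoss(num_bins=10)

embeddings = torch.randn(32, 128, requires_grad=True)  # batch size 32, 128-dim embeddings
labels = torch.randint(0, 8, (32,))                    # class label per embedding
loss = loss_func(embeddings, labels)
loss.backward()
```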
berton-gabri.bsky.social
Yeah, intuitively it makes sense to perturb the student's images; not sure why it doesn't work in the 2021 distillation paper.
Someone should make a benchmark for distillation across tasks...
berton-gabri.bsky.social
I believe the Beyer et al. 2021 distillation paper says the images should be the same for teacher and student
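A minimal sketch of that "same images for teacher and student" (consistent teaching) setup; all names here are placeholders rather than code from the paper, and the temperature is an assumption:

```python
# Minimal sketch of consistent teaching: teacher and student see the SAME augmented
# view, and the student matches the teacher's soft predictions via KL divergence.
import torch
import torch.nn.functional as F

def distill_step(teacher, student, images, augment, optimizer, T=2.0):
    x = augment(images)              # one augmentation, shared by teacher and student
    with torch.no_grad():
        t_logits = teacher(x)        # teacher predictions on the same view
    s_logits = student(x)
    loss = F.kl_div(
        F.log_softmax(s_logits / T, dim=-1),
        F.softmax(t_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```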
berton-gabri.bsky.social
🚀 Big news! Just got my O-1 visa, booked my flight to San Francisco, and I’m really happy to join Amazon in Palo Alto! Ready for this exciting new chapter 🚀

I'll be doing a PostDoc on Vision-Language Models!
berton-gabri.bsky.social
The line is so blurry...

Two images of the same car are the same instance? (yes)

If it's the same car but re-painted?

If it's the same car but re-made?

If it's two different cars, same model with same color?

If same model, different color?

Same brand, different model?
berton-gabri.bsky.social
Interesting work, happy to see people working in the field!

Also a bit disappointed not to see them compare with methods that we found to be SOTA on the task, like RoMa and SIFT+LightGlue
berton-gabri.bsky.social
I won't have time to run new experiments (I'm starting a new job on Monday), but if anyone wants to add results with other losses or anything else, I'm happy to update the paper :)
berton-gabri.bsky.social
Interesting point, are you referring to e.g. the FastAPLoss?

To be fair, our preliminary results, which were used to select the shortlist of 12 losses (out of 34, all those in the pytorch-metric-learning library), were run with a batch size of 32, so there's a chance we missed out on good losses
berton-gabri.bsky.social
I think I see your point: for you, image retrieval is about retrieving an image of exactly the same object (e.g. exactly that one car, not a car of the same model)?

Then isn't that instance retrieval?

But anyway, naming conventions are very blurry in our field
berton-gabri.bsky.social
Also, the paper is only on arXiv, we have no plans to submit it, and the code is super simple

If anyone wants to add results we're pretty flexible with it, and we can add new authors

My main goal is to have a good reference paper for anyone doing retrieval, so I'm happy to update the paper as needed
berton-gabri.bsky.social
And I'd call GLD, Oxford, etc. "landmark retrieval" 😆
To be fair, they're all image retrieval datasets, but GLD-Oxford and CUB-Cars are just different subcategories of it

The nice thing about the datasets we used is that the train-test splits are well defined, whereas e.g. Oxford and Paris have no train sets
berton-gabri.bsky.social
I'll have to pay a visit 🪴
berton-gabri.bsky.social
The one and only fern! Where is it?

While writing this I've realized that fern is an anagram of NeRF, definitely not a coincidence