Haoli Yin
@haoliyin.bsky.social
300 followers 250 following 28 posts
multimodal data curation @datologyai.com. https://haoliyin.me
Pinned
haoliyin.bsky.social
Web-Scale Data Curation is a frontier challenge - I'm excited to show the progress we've made in just 6 months @datologyai

tl;dr: we've pretrained the most data-efficient and best-in-class CLIP models!

Read on to see how our product powers multimodal data curation
1/n 🧵
leavittron.bsky.social
🧵We’ve spent the last few months at @datologyai.bsky.social
building a state-of-the-art data curation pipeline and I’m SO excited to share our first results: we curated image-text pretraining data and massively improved CLIP model quality, training speed, and inference efficiency 🔥🔥🔥
haoliyin.bsky.social
For more details about where I'll be, come visit me at
@datologyai.com's booth 303

Times I’ll be there (in local time):
- Tuesday Dec 10th, 12pm-4pm
- Wednesday Dec 11th, 1pm-5pm
- Thursday Dec 12th, 9am-12:30pm

#neurips
haoliyin.bsky.social
ensembling logits (e.g. averaging modality embeddings) from contrastively trained models can actually achieve this in certain settings. Previous work on a specific downstream task: arxiv.org/abs/2310.18812

To truly do this in early-fusion models you'd have to capture synergy (arxiv.org/abs/2306.04539)
UniCat: Crafting a Stronger Fusion Baseline for Multimodal Re-Identification. J. Crawford, H. Yin, L. McDermott, D. Cummings. NeurIPS 2023 UniReps Workshop.
scholar.google.com
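For concreteness, here's a minimal sketch of the "avg modality embeddings" idea: encode each modality with a contrastively trained model, L2-normalize, and average into a fused representation. This assumes open_clip and a placeholder image/caption (example.jpg and the text are hypothetical), not the setup from the paper:

import torch
import open_clip
from PIL import Image

# Load a contrastively trained (CLIP-style) model and its preprocessing
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k")
tokenizer = open_clip.get_tokenizer("ViT-B-32")

image = preprocess(Image.open("example.jpg")).unsqueeze(0)  # hypothetical image
text = tokenizer(["a photo of a dog"])                      # hypothetical caption

with torch.no_grad():
    img_emb = model.encode_image(image)
    txt_emb = model.encode_text(text)

# Normalize each modality, then "late-fuse" by averaging the two embeddings;
# the fused vector can be compared against a gallery for a downstream task.
img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
fused = (img_emb + txt_emb) / 2
fused = fused / fused.norm(dim=-1, keepdim=True)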
haoliyin.bsky.social
looks like my reach on Twitter is low 😅
haoliyin.bsky.social
I'll be at NeurIPS next week starting Tuesday! Please reach out if you want to talk about anything multimodal: data curation, synthetic data, or inference optimization.

I'd love to learn more about your research area as well :))
Reposted by Haoli Yin
danielvanstrien.bsky.social
I'm re-sharing some recent blog posts on using VLMs for synthetic data generation since there are no link penalties here!

How to generate a dataset of queries for training and fine-tuning domain-specific ColPali models using a VLM.

🔗 danielvanstrien.xyz/posts/post-w...
Generating a dataset of queries for training and fine-tuning ColPali models on a UFO dataset – Daniel van Strien
Learn how to generate a custom ColPali dataset using an open VLM for multimodal retrieval model training and fine-tuning.
danielvanstrien.xyz
haoliyin.bsky.social
Working on making data curation dirt cheap btw

If you're a cracked engineer we'd love to have you :))
DM me if you have any questions!

jobs.ashbyhq.com/DatologyAI

(also looking for enthusiastic research interns)
DatologyAI Jobs
jobs.ashbyhq.com
haoliyin.bsky.social
The text team cooked so much 🧑‍🍳 it might be better than your Thanksgiving meal

Check out this super thorough thread on what we achieved and how we built the best curated text dataset using public data
leavittron.bsky.social
Tired: Bringing up politics at Thanksgiving

Wired: Bringing up @datologyai.com’s new text curation results at Thanksgiving

That’s right, we applied our data curation pipeline to text pretraining data and the results are hot enough to roast a 🦃
🧵
haoliyin.bsky.social
Ah so some details I left out:

- I set the first n tokens to be generated by the target model, with n=3 here.
- I'm using Qwen2-VL family here
- Prompt is "Describe this image" so the first three tokens are always the same

This was just to establish a baseline; next up is experimenting with various tasks and configs
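For anyone trying to read the colors, here's a toy sketch of one common (greedy) speculative-decoding verification loop under the setup above. draft_step and target_logits are hypothetical stand-ins for the small and large VLMs (both conditioned on the image), not the actual Qwen2-VL code:

import numpy as np

rng = np.random.default_rng(0)
VOCAB = 50  # toy vocabulary size

def draft_step(ctx):
    # stand-in for one greedy step of the small draft VLM
    return int(rng.integers(VOCAB))

def target_logits(ctx, proposed):
    # stand-in for a single target-VLM forward pass over ctx + proposed,
    # returning logits for every proposed position plus one extra
    return rng.normal(size=(len(proposed) + 1, VOCAB))

def speculative_decode(prompt, n_prefix=3, k=4, max_new=24):
    out = list(prompt)
    # the first n_prefix tokens come straight from the target model (n=3 above)
    for _ in range(n_prefix):
        out.append(int(target_logits(out, [])[0].argmax()))
    while len(out) - len(prompt) < max_new:
        # 1) draft model proposes k tokens autoregressively (blue)
        proposed = []
        for _ in range(k):
            proposed.append(draft_step(out + proposed))
        # 2) target model scores all k positions (+1) in one forward pass
        choice = target_logits(out, proposed).argmax(axis=-1)
        # 3) accept draft tokens up to the first disagreement
        n_accept = 0
        while n_accept < k and proposed[n_accept] == int(choice[n_accept]):
            n_accept += 1
        out += proposed[:n_accept]
        if n_accept < k:
            out.append(int(choice[n_accept]))  # red: target's correction
        else:
            out.append(int(choice[k]))         # yellow: bonus target token
    return out

print(speculative_decode([1, 2, 3]))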
haoliyin.bsky.social
hopefully this side project will get to a point where there's something novel to write up 😅
haoliyin.bsky.social
Was working on some model inference optimization research (speculative decoding) but in the multimodal setting with vision-language models (i.e. conditioned on images)

blue = draft model tokens
red = target model tokens
yellow = bonus target model tokens

#dataviz am I doing this right?
haoliyin.bsky.social
🙋🏻‍♂️
haoliyin.bsky.social
now using uv for any new project and trying to migrate existing projects to uv

Starting a new project:
uv init
uv venv --python 3.xx
source .venv/bin/activate
uv add <dependencies> or uv pip install -r requirements.txt

❤️
- installing torch in like 10 seconds
- uv sync for fast startup
crmarsh.com
Looks like uv is the #1 trending Rust repo over the last month 🚀🚀🚀
Reposted by Haoli Yin
ethanrosenthal.com
Massive, impressive post on data curation strategies for producing better models with less data and compute. The best part of data curation is that it's a (relatively small) one-time cost that gets amortized over all future models.

Link to the technical write-up: www.datologyai.com/post/product...
haoliyin.bsky.social
If you've made it this far, you clearly recognize the immense potential of data curation and our team.

For researchers/engineers/anons: Excited about multimodal data? Have innovative ideas? Join us!

(also recruiting cracked interns)
jobs.ashbyhq.com/DatologyAI
14/n
DatologyAI Jobs
jobs.ashbyhq.com
haoliyin.bsky.social
Final Note: this is the worst we’ll ever be.

And it’s also not the only thing we’ve been working on. The rest of the team has been cooking on text curation since the beginning, so stay tuned for our text curation results coming soon for LLM pretraining!

13/n
haoliyin.bsky.social
(Bonus!) Pretrain best-in-class models

While not the primary target, our data curation work resulted in competitive or superior CLIP models. We've extensively benchmarked them against external models: >10x data efficiency and better performance. See the blog post for more!

12/n
haoliyin.bsky.social
We train ViT-S/32 (63M param) on curated data and compare against baselines trained with ViT-B/32 (151M param)

Even with a ~2.4x FLOPs reduction, we attain an absolute 13% improvement for retrieval and 9.2% for classification

11/n
haoliyin.bsky.social
What if you could have a smaller, domain-specific model in the first place?

Specialized pretraining is the future, powered by curation at scale.
The cost of training the smaller model also quickly amortizes over time (think millions of API calls/day)

10/n
haoliyin.bsky.social
Claim #3: train models **smaller**

Productionizing models requires inference optimizations that trade quality for speed, and that doesn't work well for overtrained generalist models.

9/n
haoliyin.bsky.social
Claim #2: train models **better**

Improve model quality by up to ~13% absolute (22% relative) at the same training cost.
Curation means training on in-domain data for the end tasks you care about!

8/n
haoliyin.bsky.social
Claim #1: train models **faster**

Retrieval: 28.8x-43x training speedup vs baseline
Classification: 2.88x-13x vs baseline

We filter out redundant & harmful data to achieve equal performance much faster

7/n
haoliyin.bsky.social
What we did:

Data: image-text data from DataComp’s CommonPool up to 1B samples

Curate: separate strategies for retrieval and classification tasks

Train: CLIP ViT-B/16 & 32, 8k batch

Evals: DataComp+SugarCrepe

6/n
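For readers less familiar with CLIP pretraining, here's a minimal sketch of the standard contrastive objective used at the training step above (toy shapes and names; not our actual training code):

import torch
import torch.nn.functional as F

def clip_loss(img_emb, txt_emb, temperature=0.07):
    # img_emb, txt_emb: (batch, dim) embeddings from the image and text encoders
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature   # (batch, batch) similarity matrix
    labels = torch.arange(img_emb.size(0))         # matching pairs sit on the diagonal
    # symmetric cross-entropy over image->text and text->image directions
    return (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels)) / 2

# toy usage; with an 8k batch, `logits` would be an 8192x8192 matrix
loss = clip_loss(torch.randn(8, 512), torch.randn(8, 512))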
haoliyin.bsky.social
What’s the product?

It’s a data pipeline composed of cutting-edge research, put into production and battle-tested at scale. Shown below are a few themes that the individual algorithms fall under; more details can be found in the blog post!

5/n
haoliyin.bsky.social
Solution: @datologyai.bsky.social

It’s a pivotal time in AI to unlock tremendous societal value & train domain-specific foundation models outside of large labs

With scalable data curation, we can:
1) train models **faster**
2) train models **better**
3) train models **smaller**
4/n