Haoli Yin
@haoliyin.bsky.social
300 followers 250 following 28 posts
multimodal data curation @datologyai.com. https://haoliyin.me
Pinned
haoliyin.bsky.social
Web-Scale Data Curation is a frontier challenge - I'm excited to show the progress we've made in just 6 months @datologyai

tl;dr: we've pretrained the most data-efficient and best-in-class CLIP models!

Read on to see how our product powers multimodal data curation
1/n 🧵
leavittron.bsky.social
🧵We’ve spent the last few months at @datologyai.bsky.social
building a state-of-the-art data curation pipeline and I’m SO excited to share our first results: we curated image-text pretraining data and massively improved CLIP model quality, training speed, and inference efficiency 🔥🔥🔥
haoliyin.bsky.social
For more details about where I'll be, come visit me at
@datologyai.com's booth 303

Times I’ll be there (in local time):
- Tuesday Dec 10th, 12pm-4pm
- Wednesday Dec 11th, 1pm-5pm
- Thursday Dec 12th, 9am-12:30pm

#neurips
haoliyin.bsky.social
ensembling logits (e.g. averaging modality embeddings) from contrastively trained models can actually achieve this in certain settings. Previous work on a specific downstream task: arxiv.org/abs/2310.18812

To truly do this in early-fusion models you'd have to capture synergy (arxiv.org/abs/2306.04539)
UniCat: Crafting a Stronger Fusion Baseline for Multimodal Re-Identification. J. Crawford, H. Yin, L. McDermott, D. Cummings. NeurIPS 2023 UniReps Workshop.
scholar.google.com
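For concreteness, here's a minimal sketch of the "avg modality embeddings" idea: encode each modality with a contrastively trained model, L2-normalize, and average into a fused representation. This assumes open_clip and a placeholder image/caption (example.jpg and the text are hypothetical), not the setup from the paper:

import torch
import open_clip
from PIL import Image

# Load a contrastively trained (CLIP-style) model and its preprocessing
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k")
tokenizer = open_clip.get_tokenizer("ViT-B-32")

image = preprocess(Image.open("example.jpg")).unsqueeze(0)  # hypothetical image
text = tokenizer(["a photo of a dog"])                      # hypothetical caption

with torch.no_grad():
    img_emb = model.encode_image(image)
    txt_emb = model.encode_text(text)

# Normalize each modality, then "late-fuse" by averaging the two embeddings;
# the fused vector can be compared against a gallery for a downstream task.
img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
fused = (img_emb + txt_emb) / 2
fused = fused / fused.norm(dim=-1, keepdim=True)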
haoliyin.bsky.social
looks like my reach on Twitter is low 😅
haoliyin.bsky.social
I'll be at NeurIPS next week starting Tuesday! Please reach out if you want to talk about anything multimodal: data curation, synthetic data, or inference optimization.

I'd love to learn more about your research area as well :))
Reposted by Haoli Yin
danielvanstrien.bsky.social
I'm re-sharing some recent blog posts on using VLMs for synthetic data generation since there are no link penalties here!

How to generate a dataset of queries for training and fine-tuning domain-specific ColPali models using a VLM.

🔗 danielvanstrien.xyz/posts/post-w...
Generating a dataset of queries for training and fine-tuning ColPali models on a UFO dataset – Daniel van Strien
Learn how to generate a custom ColPali dataset using an open VLM for multimodal retrieval model training and fine-tuning.
danielvanstrien.xyz
haoliyin.bsky.social
Working on making data curation dirt cheap btw

If you're a cracked engineer we'd love to have you :))
DM me if you have any questions!

jobs.ashbyhq.com/DatologyAI

(also looking for enthusiastic research interns)
DatologyAI Jobs
jobs.ashbyhq.com
haoliyin.bsky.social
The text team cooked so much 🧑‍🍳 it might be better than your Thanksgiving meal

Check out this super thorough thread on what we achieved and how we built the best curated text dataset using public data
leavittron.bsky.social
Tired: Bringing up politics at Thanksgiving

Wired: Bringing up @datologyai.com’s new text curation results at Thanksgiving

That’s right, we applied our data curation pipeline to text pretraining data and the results are hot enough to roast a 🦃
🧵
haoliyin.bsky.social
Ah so some details I left out:

- I set the first n tokens to be generated by the target model, with n=3 here.
- I'm using Qwen2-VL family here
- Prompt is "Describe this image" so the first three tokens are always the same

This was just to establish a baseline; next up is experimenting with various tasks and configs
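For anyone trying to read the colors, here's a toy sketch of one common (greedy) speculative-decoding verification loop under the setup above. draft_step and target_logits are hypothetical stand-ins for the small and large VLMs (both conditioned on the image), not the actual Qwen2-VL code:

import numpy as np

rng = np.random.default_rng(0)
VOCAB = 50  # toy vocabulary size

def draft_step(ctx):
    # stand-in for one greedy step of the small draft VLM
    return int(rng.integers(VOCAB))

def target_logits(ctx, proposed):
    # stand-in for a single target-VLM forward pass over ctx + proposed,
    # returning logits for every proposed position plus one extra
    return rng.normal(size=(len(proposed) + 1, VOCAB))

def speculative_decode(prompt, n_prefix=3, k=4, max_new=24):
    out = list(prompt)
    # the first n_prefix tokens come straight from the target model (n=3 above)
    for _ in range(n_prefix):
        out.append(int(target_logits(out, [])[0].argmax()))
    while len(out) - len(prompt) < max_new:
        # 1) draft model proposes k tokens autoregressively (blue)
        proposed = []
        for _ in range(k):
            proposed.append(draft_step(out + proposed))
        # 2) target model scores all k positions (+1) in one forward pass
        choice = target_logits(out, proposed).argmax(axis=-1)
        # 3) accept draft tokens up to the first disagreement
        n_accept = 0
        while n_accept < k and proposed[n_accept] == int(choice[n_accept]):
            n_accept += 1
        out += proposed[:n_accept]
        if n_accept < k:
            out.append(int(choice[n_accept]))  # red: target's correction
        else:
            out.append(int(choice[k]))         # yellow: bonus target token
    return out

print(speculative_decode([1, 2, 3]))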
haoliyin.bsky.social
hopefully this side project will get to a point where there's something novel to write up 😅
haoliyin.bsky.social
Was working on some model inference optimization research (speculative decoding) but in the multimodal setting with vision-language models (i.e. conditioned on images)

blue = draft model tokens
red = target model tokens
yellow = bonus target model tokens

#dataviz am I doing this right?
haoliyin.bsky.social
🙋🏻‍♂️
haoliyin.bsky.social
now using uv for any new project and trying to migrate existing projects to uv

Starting a new project:
uv init
uv venv --python 3.xx
source .venv/bin/activate
uv add <dependencies> or uv pip install -r requirements.txt

❤️
- installing torch in like 10 seconds
- uv sync for fast startup
crmarsh.com
Looks like uv is the #1 trending Rust repo over the last month 🚀🚀🚀
Reposted by Haoli Yin
ethanrosenthal.com
Massive, impressive post on data curation strategies for producing better models with less data and compute. The best part of data curation is that it's a (relatively small) one-time cost that gets amortized over all future models.

Link to the technical write-up: www.datologyai.com/post/product...
haoliyin.bsky.social
If you've made it this far, you clearly recognize the immense potential of data curation and our team.

For researchers/engineers/anons: Excited about multimodal data? Have innovative ideas? Join us!

(also recruiting cracked interns)
jobs.ashbyhq.com/DatologyAI
14/n
DatologyAI Jobs
jobs.ashbyhq.com
haoliyin.bsky.social
Final Note: this is the worst we’ll ever be.

And it’s also not the only thing we’ve been working on. The rest of the team has been cooking on text curation since the beginning, so stay tuned for our text curation results coming soon for LLM pretraining!

13/n
haoliyin.bsky.social
(Bonus!) Pretrain best-in-class models

While not the primary target, our data curation work resulted in competitive or superior CLIP models. We've extensively benchmarked them against external models: >10x data efficiency and better performance. See the blog post for more!

12/n
haoliyin.bsky.social
We train ViT-S/32 (63M param) on curated data and compare against baselines trained with ViT-B/32 (151M param)

Even with a ~2.4x FLOPs reduction, we attain an absolute 13% improvement for retrieval and 9.2% for classification

11/n
haoliyin.bsky.social
What if you could have a smaller, domain-specific model in the first place?

Specialized pretraining is the future, powered by curation at scale.
The cost of training the smaller model also quickly amortizes over time (think millions of API calls/day)

10/n
haoliyin.bsky.social
Claim #3: train models **smaller**

Productionizing models requires inference optimizations that trade quality for speed, and that doesn't work well for overtrained generalist models.

9/n
haoliyin.bsky.social
Claim #2: train models **better**

Improve model quality by up to ~13% absolute (22% relative) at the same training cost.
Curation means training on in-domain data for the end tasks you care about!

8/n
haoliyin.bsky.social
Claim #1: train models **faster**

Retrieval: 28.8x-43x training speedup vs baseline
Classification: 2.88x-13x vs baseline

We filter out redundant & harmful data to achieve equal performance much faster

7/n
haoliyin.bsky.social
What we did:

Data: image-text data from DataComp’s CommonPool up to 1B samples

Curate: separate strategies for retrieval and classification tasks

Train: CLIP ViT-B/16 & 32, 8k batch

Evals: DataComp+SugarCrepe

6/n
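For readers less familiar with CLIP pretraining, here's a minimal sketch of the standard contrastive objective used at the training step above (toy shapes and names; not our actual training code):

import torch
import torch.nn.functional as F

def clip_loss(img_emb, txt_emb, temperature=0.07):
    # img_emb, txt_emb: (batch, dim) embeddings from the image and text encoders
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature   # (batch, batch) similarity matrix
    labels = torch.arange(img_emb.size(0))         # matching pairs sit on the diagonal
    # symmetric cross-entropy over image->text and text->image directions
    return (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels)) / 2

# toy usage; with an 8k batch, `logits` would be an 8192x8192 matrix
loss = clip_loss(torch.randn(8, 512), torch.randn(8, 512))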
haoliyin.bsky.social
What’s the product?

It’s a data pipeline composed of cutting-edge research, put into production and battle-tested at scale. Shown below are a few themes that the individual algorithms fall under; more details can be found in the blog post!

5/n
haoliyin.bsky.social
Solution: @datologyai.bsky.social

It’s a pivotal time in AI to unlock tremendous societal value & train domain-specific foundation models outside of large labs

With scalable data curation, we can:
1) train models **faster**
2) train models **better**
3) train models **smaller**
4/n