Pratyush Maini
@pratyushmaini.bsky.social
290 followers 210 following 24 posts
Data Quality x Privacy PhD student @ CMU with Zico Kolter and Zack Lipton | Founding Member @datologyai.com | Prev. Comp Sc @iitdelhi http://pratyushmaini.github.io/
pratyushmaini.bsky.social
I have been thinking about data privacy, data curation for video models, finetuning vs. pretraining, how alignment data interacts with LLM safety, and how that relates to unlearning. Also very curious to hear about some of the most exciting problems folks in India are working on!
pratyushmaini.bsky.social
I’ll also be spending time at the @datologyai.com booth to talk about how we curated our way to the best LLM training dataset! Please DM if you would like to chat. The best part about being a researcher is getting to share the excitement of what we have been working on with each other.
pratyushmaini.bsky.social
Came to #NeurIPS2024 for the research news, but staying for these incredible views. I am presenting some recent works that (I think) significantly advance the discourse on LLM memorization and training data detection, plus a study on hallucinations x model collapse in diffusion models.
Reposted by Pratyush Maini
vaishnavh.bsky.social
if you're a PhD student at CMU doing AI/ML, lmk if you want to be added to this starter pack.

(I don't belong in this list, but I don't know how to remove myself from this pack 😂)

go.bsky.app/9APVxQQ
Reposted by Pratyush Maini
vishaalurao.bsky.social
🚀New Paper: Active Data Curation Effectively Distills Multimodal Models
arxiv.org/abs/2411.18674

Smol models are all the rage these days & knowledge distillation (KD) is key for model compression!

We show how data curation can act as an effective distillation method, yielding SoTA FLOP-efficient {C/Sig}LIPs!!
🧵👇
Reposted by Pratyush Maini
jbhuang0604.bsky.social
How to drive your research forward?

“I tested the idea we discussed last time. Here are some results. It does not work. (… awkward silence)”

Such conversations happen so many times in meetings with students. How do we move forward?

You need …
pratyushmaini.bsky.social
4/We ended up simulating the bias as a company that "acts in good faith", & found that even in such a case, merely sharing an annotator pool (b/w curators and evaluators) can give the company's customers a 44-point Elo boost... massive bragging rights in today's LLM landscape.
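(Editor's note: for context on what a 44-point Elo gap means, here is a minimal sketch of the standard Elo update rule; the simulation from the post itself is not reproduced here.)

```python
def elo_update(rating_a, rating_b, score_a, k=32):
    """Standard Elo update.

    score_a is 1.0 if A wins, 0.5 for a draw, 0.0 if A loses.
    k controls how fast ratings move after each comparison.
    """
    # Expected score for A under the Elo logistic model.
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))
    new_a = rating_a + k * (score_a - expected_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return new_a, new_b
```

Under this model, a 44-point rating gap corresponds to a head-to-head win probability of 1 / (1 + 10^(-44/400)) ≈ 56.3% for the higher-rated model, which is a substantial edge on a crowded leaderboard.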
pratyushmaini.bsky.social
3/(Risk 2): The mere commonality in infra b/w data curators & evaluators can cause significant eval bias, even when they do not have ill-founded financial motives.

"common infra" includes question templates, topics, styles, annotators, etc.
> with common annotators being the least-privileged form of shared access.
pratyushmaini.bsky.social
2/Taking a closer look at SEAL: ScaleAI specializes in data curation for LLM trainers and has now begun establishing its own private evaluations. Two major concerns:

(Risk 1): There is a massive financial incentive for such companies to design evals that even marginally favor their own customers.
pratyushmaini.bsky.social
1/Open LLM evals often face data contamination concerns. Private curators (like ScaleAI) have addressed this with private + expert evaluations.

We argue that this shift poses new risks including financial incentives & eval bias.
w/ @hbxnov.bsky.social

📝: pratyushmaini.github.io/blog/2024/ri... 🧵
Reposted by Pratyush Maini
zacharylipton.bsky.social
Medically adapted foundation models (think Med-*) turn out to be more hot air than hot stuff. Correcting for fatal flaws in evaluation, the current crop are no better on balance than generic foundation models, even on the very tasks for which benefits are claimed.
arxiv.org/abs/2411.04118
Medical Adaptation of Large Language and Vision-Language Models: Are We Making Progress?
Several recent works seek to develop foundation models specifically for medical applications, adapting general-purpose large language models (LLMs) and vision-language models (VLMs) via continued pret...
Reposted by Pratyush Maini
iamgroot42.bsky.social
Temporally shifted data splits in membership inference can be misleading ⚠️ Be cautious when interpreting these benchmarks!
pratyushmaini.bsky.social
1/6 A lot of us are grappling with peer review these days, but its worst manifestation is when prestigious conference awards overlook critical flaws.

Case in point: #EMNLP2024 ’s Best Paper Award.

@iamgroot42.bsky.social and I wrote a blog on what went wrong: www.anshumansuri.com/blog/2024/ca... 🧵
pratyushmaini.bsky.social
5/6 This isn’t just a one-off issue with awards in ML. We are repeatedly seeing this concerning trend. It misguides researchers, misrepresents progress & harms trust in our field. Remember the ICML awards fiasco from a few years ago? www.reddit.com/r/MachineLea...
From the MachineLearning community on Reddit: [D] ICML 2022 Outstanding Paper Awards 🔥
pratyushmaini.bsky.social
4/6 We re-implemented the method and tested it on corrected setups; the results are suggestive of a temporal shift, via both false positives & false negatives.

Even more unfortunately, this paper cites Duan et al. (who point out the flaws in this setup), yet creates a new temporally shifted MIA benchmark.
pratyushmaini.bsky.social
2/6 One of the Best Paper Awards at EMNLP went to a paper claiming successful MIAs for LLMs.

Unfortunately, the benchmarks studied are all "temporally shifted". At this point, we know very well that these benchmarks give a false sense of membership success by detecting distributional differences.
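(Editor's note: a toy illustration of the failure mode described above, not taken from the paper. Here an "attack" that only sees a time-correlated score, with no memorization signal at all, still separates members from non-members.)

```python
import random

random.seed(0)

# "Members" are pre-cutoff documents; "non-members" are post-cutoff ones.
# Each document contributes only a time-correlated score (e.g. vocabulary
# drift) -- there is NO signal about whether the model actually saw it.
members = [random.gauss(0.0, 1.0) for _ in range(1000)]     # older distribution
nonmembers = [random.gauss(1.0, 1.0) for _ in range(1000)]  # newer, drifted

# Thresholding the drifted score alone already separates the two sets:
# apparent "membership inference" success that is really just
# distribution-shift detection.
threshold = 0.5
acc = (sum(m < threshold for m in members) +
       sum(n >= threshold for n in nonmembers)) / 2000
print(f"apparent MIA accuracy: {acc:.2f}")  # well above the 0.50 chance level
```

An IID split of members and non-members removes this confounder, which is why temporally shifted benchmarks overstate attack success.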
pratyushmaini.bsky.social
5/5 Check out @leavittron.bsky.social's detailed bsky thread below:
bsky.app/profile/leav...

And join us (@arimorcos.bsky.social @agcrnz.bsky.social @alvin-d.bsky.social and many more who shaped this work)!

We are only getting started: jobs.ashbyhq.com/DatologyAI
leavittron.bsky.social
Tired: Bringing up politics at Thanksgiving

Wired: Bringing up @datologyai.com’s new text curation results at Thanksgiving

That’s right, we applied our data curation pipeline to text pretraining data and the results are hot enough to roast a 🦃
🧵
pratyushmaini.bsky.social
4/5 This was no small feat.
A small team, punching far above its weight, took on giants in an extremely competitive space and delivered kick-ass results. Huge shoutout to my amazing teammates, especially Jack Urbanek & @leavittron.bsky.social: absolute legends. 🙌
Let’s keep pushing 👊
pratyushmaini.bsky.social
3/5 How did we do it?
🎯 Carefully designed quality filters.
🔍 Deep understanding of synthetic data.
📐 Analyzing geometric properties of unsupervised data.
👀 Constantly looking at data!
It’s all in our deep dive: tinyurl.com/best-llm-data
Technical Deep-Dive: Curating Our Way to a State-of-the-Art Text Dataset
Our data curation pipeline to obtain substantial improvements in LLM quality, training speed, and inference efficiency.
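(Editor's note: a minimal sketch of what a filter-based curation pipeline like the one listed above can look like. All function names and heuristics here are illustrative assumptions, not DatologyAI's actual filters.)

```python
def dedupe(docs):
    # Illustrative: drop exact duplicates by whitespace-normalized text.
    seen, out = set(), []
    for d in docs:
        key = " ".join(d.split()).lower()
        if key not in seen:
            seen.add(key)
            out.append(d)
    return out

def quality_filter(docs, min_words=5, min_alpha=0.6):
    # Illustrative heuristic: drop very short or low-alphabetic documents.
    def ok(d):
        alpha = sum(c.isalpha() for c in d) / max(len(d), 1)
        return len(d.split()) >= min_words and alpha > min_alpha
    return [d for d in docs if ok(d)]

def curate(docs):
    # Compose the stages; a production pipeline would add model-based
    # quality scoring, synthetic-data handling, and embedding-geometry
    # analyses on top of simple heuristics like these.
    return quality_filter(dedupe(docs))
```

Usage: `curate(["hello world this is a clean doc", "hello world this is a clean doc", "123 456 789 000 111", "short"])` keeps only the first document, since the duplicate, the all-numeric line, and the too-short line are each filtered out.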
pratyushmaini.bsky.social
2/5 🥁Results🥁 We smashed past results, beating both DCLM and FW-Edu by significant margins. 🚀
Our models trained on curated data:
• scored 4.4% better than DCLM
• trained 2x faster than FW-Edu
• at 1.3B params, outperform 2.7B models trained on DCLM & FW-Edu
pratyushmaini.bsky.social
1/5 Earlier this year, I joined @datologyai.com to give wings to the data research I had been doing in academia. Today, I am absolutely thrilled to share what we’ve been working on!

Techvember Ep 2: How we made the #1 LLM Pre-training Data Recipe.

Blog: 👉 tinyurl.com/best-llm-data 🧵