Rachel Hong
@rachelhong.bsky.social
8 followers 15 following 15 posts
PhD student at University of Washington. Machine learning fairness, algorithmic bias, dataset audits, data privacy, tech policy. she/her
Pinned
rachelhong.bsky.social
New paper alert! In a collaboration between computer scientists and legal scholars, we find a significant amount of PII in a common AI training dataset and conduct a legal analysis showing that these issues put web-scale datasets in tension with existing privacy law. [🧵1/N] arxiv.org/abs/2506.17185
A Common Pool of Privacy Problems: Legal and Technical Lessons from a Large-Scale Web-Scraped Machine Learning Dataset
We investigate the contents of web-scraped data for training AI systems, at sizes where human dataset curators and compilers no longer manually annotate every sample. Building off of prior privacy con...
arxiv.org
rachelhong.bsky.social
Super excited and thankful to have Tech Review feature our work!
Millions of images of passports, credit cards, birth certificates, and other documents containing personally identifiable information are likely included in one of the biggest open-source AI training sets, new research has found.
A major AI training data set contains millions of examples of personal data
Personally identifiable information has been found in DataComp CommonPool, one of the largest open-source data sets used to train image generation models.
www.technologyreview.com
rachelhong.bsky.social
💡 What can ML researchers do instead? Prior work has explored various alternatives for web datasets, including restricting license terms to prevent commercial use, evaluating automated sanitization tools, attributing training data, and creating explicit consent mechanisms [13/N]
rachelhong.bsky.social
While privacy laws carve out exceptions for publicly available data, being web-accessible isn’t the same as being legally “public.” We call for enforcing reasonable-basis standards for web-scraped data and modernizing the “publicly available” exception in consumer privacy and data protection laws [12/N]
rachelhong.bsky.social
Legal findings: Web scraping doesn’t consider the context or intent behind personal information and instead vacuums up the entire web. We show that certain requirements of data protection laws are not met: there’s a lack of reasonable basis, purpose specification, and data minimization [11/N]
rachelhong.bsky.social
With web-scale data, it’s hard for people to be aware of, find, and take down their images, as data replicates across sites even if the original is taken down. Opt-out doesn’t address dataset monoculture, as many models may have already been trained on central datasets like CommonPool [10/N]
rachelhong.bsky.social
Using the Wayback Machine, we track the earliest recorded timestamp for a subset of images with non-blurred faces. We find a significant portion existed before 2020, raising the question of how anyone could have consented to the use of their personal data before the rise of large AI systems [9/N]
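[Not from the paper’s code release: a minimal sketch of how one might look up the earliest archived capture of an image URL via the Wayback Machine’s CDX API. The endpoint and query fields are real; the sample URL is a placeholder.]

```python
import requests

CDX_API = "https://web.archive.org/cdx/search/cdx"

def earliest_capture(url: str) -> str | None:
    """Return the earliest Wayback Machine capture timestamp (YYYYMMDDhhmmss) for a URL, or None."""
    params = {
        "url": url,
        "output": "json",
        "limit": 1,         # CDX results are sorted oldest-first, so the first row is the earliest capture
        "fl": "timestamp",  # only return the timestamp field
    }
    rows = requests.get(CDX_API, params=params, timeout=30).json()
    # rows[0] is the header row; rows[1] (if present) is the earliest capture
    return rows[1][0] if len(rows) > 1 else None

# Example: flag images first archived before 2020, mirroring the pre-2020 analysis above
ts = earliest_capture("example.com/some-image.jpg")  # placeholder URL
if ts and ts < "2020":
    print(f"first archived before 2020 (capture: {ts})")
```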
rachelhong.bsky.social
DataComp (like other datasets) optionally includes automatic face blurring as a way to preserve privacy. However, the face-blurring algorithm fails to catch an estimated 102 million samples containing real human faces, some of which reveal children or people’s names [8/N]
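[A hedged illustration of this kind of audit, not the paper’s pipeline: run an off-the-shelf face detector over supposedly blurred samples and count survivors. OpenCV’s Haar cascade is a generic stand-in; the audit’s actual detector likely differs, and `blurred_sample_paths` is a placeholder.]

```python
import cv2

# Generic off-the-shelf detector; a stand-in, not the paper's model
detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
)

def contains_face(image_path: str) -> bool:
    """Return True if the detector finds at least one face the blurring step missed."""
    img = cv2.imread(image_path)
    if img is None:
        return False  # unreadable or missing image
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    faces = detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    return len(faces) > 0

# Tally misses over a sample of face-blurred images (placeholder variable):
# missed = sum(contains_face(p) for p in blurred_sample_paths)
```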
rachelhong.bsky.social
We link resumes to online profiles (like LinkedIn) and estimate that at least 142K samples (out of 12.8B) depict resumes of individuals with a public online presence. We annotate the presence of personal data in these resumes, split by geographic region below [7/N]
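[An illustrative annotation helper, not the paper’s scheme: simple regexes can flag common personal-data fields in text extracted from resume images. A real audit would use OCR plus stronger PII detectors; the patterns and example below are placeholders.]

```python
import re

# Rough regexes for two common resume fields; illustrative only
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "phone": re.compile(r"(?:\+?\d{1,3}[\s.-]?)?(?:\(?\d{3}\)?[\s.-]?)\d{3}[\s.-]?\d{4}"),
}

def annotate_pii(text: str) -> dict[str, bool]:
    """Mark which personal-data categories appear in OCR'd resume text."""
    return {field: bool(pat.search(text)) for field, pat in PII_PATTERNS.items()}

print(annotate_pii("Jane Doe, jane@example.com, (206) 555-0100"))
# {'email': True, 'phone': True}
```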
rachelhong.bsky.social
Several common websites in DataComp no longer have images available to download, even though the images existed at the time of curation. Upon inspecting the download errors, we find that some are “Forbidden” errors due to a lack of permission to access the image [6/N]
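[A minimal sketch, not the paper’s tooling, of how such errors surface: re-request a sample of image URLs and tally HTTP status codes, where 403 is the “Forbidden” case described above. `sampled_image_urls` is a placeholder.]

```python
from collections import Counter
import requests

def tally_download_errors(urls: list[str]) -> Counter:
    """Tally HTTP status codes (403 = Forbidden, 404 = gone, ...) for a sample of image URLs."""
    counts = Counter()
    for url in urls:
        try:
            resp = requests.head(url, timeout=10, allow_redirects=True)
            counts[resp.status_code] += 1
        except requests.RequestException:
            counts["connection_error"] += 1
    return counts

# e.g. tally_download_errors(sampled_image_urls)  # placeholder list of URLs
```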
rachelhong.bsky.social
Some samples reveal names and faces linked to demographic and children’s information (see paraphrased examples below). Many come from news sites, where someone may have disclosed the information for an article rather than consented to their data being used to train a model [5/N]
rachelhong.bsky.social
🌳 DataComp CommonPool is an image dataset crawled from the web, following LAION-5B (taken down in Dec 2023 for containing illegal material). DataComp has been downloaded ≥2M times (!), with a huge number of downstream dataset and model users (i.e., the leaves) relying on one source [4/N]
rachelhong.bsky.social
🚀 Empirically, we find:
1. Examples of credit card numbers, passport/ID numbers, resumes, faces, and children’s data (see the detection sketch after this post)
2. Attempts at data sanitization (such as face blurring) aren’t perfect
3. Data on the web isn’t always “publicly available” according to legal frameworks
[3/N]
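[For finding 1 above, a hedged sketch of one generic way candidate credit card numbers can be flagged in text: match 13-16 digit runs and apply the standard Luhn checksum. This is a common technique, not necessarily the paper’s detector.]

```python
import re

def luhn_valid(digits: str) -> bool:
    """Standard Luhn checksum used by real card numbers; filters out random digit runs."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:  # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

def candidate_card_numbers(text: str) -> list[str]:
    """Flag 13-16 digit runs (allowing spaces/dashes) that pass the Luhn check."""
    candidates = []
    for match in re.finditer(r"(?:\d[ -]?){13,16}", text):
        digits = re.sub(r"[ -]", "", match.group())
        if 13 <= len(digits) <= 16 and luhn_valid(digits):
            candidates.append(digits)
    return candidates

print(candidate_card_numbers("card: 4111 1111 1111 1111"))
# ['4111111111111111'] (a standard test number, not real PII)
```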
rachelhong.bsky.social
🔒 Our dataset audit findings inform our legal analysis with regard to existing consumer privacy and data protection laws, like the CCPA and GDPR. We surface various privacy risks of current data curation practices built upon indiscriminate scraping of the web. [2/N]