Rachel Hong
@rachelhong.bsky.social
8 followers 15 following 15 posts
PhD student at University of Washington. Machine learning fairness, algorithmic bias, dataset audits, data privacy, tech policy. she/her
Pinned
rachelhong.bsky.social
New paper alert! In a collaboration between computer scientists and legal scholars, we find a significant amount of PII in a common AI training dataset and conduct a legal analysis showing that these issues put web-scale datasets in tension with existing privacy law. [🧵1/N] arxiv.org/abs/2506.17185
A Common Pool of Privacy Problems: Legal and Technical Lessons from a Large-Scale Web-Scraped Machine Learning Dataset
We investigate the contents of web-scraped data for training AI systems, at sizes where human dataset curators and compilers no longer manually annotate every sample. Building off of prior privacy con...
arxiv.org
rachelhong.bsky.social
Super excited and thankful to have Tech Review feature our work!
Millions of images of passports, credit cards, birth certificates, and other documents containing personally identifiable information are likely included in one of the biggest open-source AI training sets, new research has found.
A major AI training data set contains millions of examples of personal data
Personally identifiable information has been found in DataComp CommonPool, one of the largest open-source data sets used to train image generation models.
www.technologyreview.com
rachelhong.bsky.social
💡 What can ML researchers do instead? Prior work has explored various alternatives for web datasets, including restricting license terms to prevent commercial use, evaluating automated sanitization tools, attributing training data, and creating explicit consent mechanisms [13/N]
rachelhong.bsky.social
While privacy laws carve out exceptions for publicly available data, being web-accessible isn’t the same as being legally “public.” We call for enforcing reasonable-basis standards for web-scraped data and modernizing the “publicly available” exception in consumer privacy and data protection laws [12/N]
rachelhong.bsky.social
Legal findings: Web scraping doesn’t consider the context or intent behind personal information and instead vacuums up the entire web. We show that certain requirements of data protection laws are not met: there’s a lack of reasonable basis, purpose specification, and data minimization [11/N]
rachelhong.bsky.social
With web-scale data, it’s hard for people to be aware of, find, and take down their images, as data replicates across sites even if the original is taken down. Opt-out doesn’t address dataset monoculture, as many models may have already been trained on central datasets like CommonPool [10/N]
rachelhong.bsky.social
Using the Wayback Machine, we track the earliest recorded timestamp for a subset of images with non-blurred faces. We find a significant portion existed before 2020, raising the question of how anyone could have consented to the use of their personal data before the rise of large AI systems [9/N]
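[Not from the paper’s code release: a minimal sketch of how one might look up the earliest archived capture of an image URL via the Wayback Machine’s CDX API. The endpoint and query fields are real; the sample URL is a placeholder.]

```python
import requests

CDX_API = "https://web.archive.org/cdx/search/cdx"

def earliest_capture(url: str) -> str | None:
    """Return the earliest Wayback Machine capture timestamp (YYYYMMDDhhmmss) for a URL, or None."""
    params = {
        "url": url,
        "output": "json",
        "limit": 1,         # CDX results are sorted oldest-first, so the first row is the earliest capture
        "fl": "timestamp",  # only return the timestamp field
    }
    rows = requests.get(CDX_API, params=params, timeout=30).json()
    # rows[0] is the header row; rows[1] (if present) is the earliest capture
    return rows[1][0] if len(rows) > 1 else None

# Example: flag images first archived before 2020, mirroring the pre-2020 analysis above
ts = earliest_capture("example.com/some-image.jpg")  # placeholder URL
if ts and ts < "2020":
    print(f"first archived before 2020 (capture: {ts})")
```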
rachelhong.bsky.social
DataComp (like other datasets) optionally includes automatic face blurring as a way to preserve privacy. However, the face-blurring algorithm fails to catch an estimated 102 million samples containing real human faces, some of which reveal children or people’s names [8/N]
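[A hedged illustration of this kind of audit, not the paper’s pipeline: run an off-the-shelf face detector over supposedly blurred samples and count survivors. OpenCV’s Haar cascade is a generic stand-in; the audit’s actual detector likely differs, and `blurred_sample_paths` is a placeholder.]

```python
import cv2

# Generic off-the-shelf detector; a stand-in, not the paper's model
detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
)

def contains_face(image_path: str) -> bool:
    """Return True if the detector finds at least one face the blurring step missed."""
    img = cv2.imread(image_path)
    if img is None:
        return False  # unreadable or missing image
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    faces = detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    return len(faces) > 0

# Tally misses over a sample of face-blurred images (placeholder variable):
# missed = sum(contains_face(p) for p in blurred_sample_paths)
```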
rachelhong.bsky.social
We link resumes to online profiles (like LinkedIn) and estimate that at least 142K samples (out of 12.8B) depict resumes of individuals with a public online presence. We annotate the presence of personal data in these resumes, split by geographic region below [7/N]
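[An illustrative annotation helper, not the paper’s scheme: simple regexes can flag common personal-data fields in text extracted from resume images. A real audit would use OCR plus stronger PII detectors; the patterns and example below are placeholders.]

```python
import re

# Rough regexes for two common resume fields; illustrative only
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "phone": re.compile(r"(?:\+?\d{1,3}[\s.-]?)?(?:\(?\d{3}\)?[\s.-]?)\d{3}[\s.-]?\d{4}"),
}

def annotate_pii(text: str) -> dict[str, bool]:
    """Mark which personal-data categories appear in OCR'd resume text."""
    return {field: bool(pat.search(text)) for field, pat in PII_PATTERNS.items()}

print(annotate_pii("Jane Doe, jane@example.com, (206) 555-0100"))
# {'email': True, 'phone': True}
```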
rachelhong.bsky.social
Several common websites in DataComp no longer have images available to download, even though the images existed at the time of curation. Upon inspecting the download errors, we find that some are “Forbidden” errors due to a lack of permission to access the image [6/N]
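[A minimal sketch, not the paper’s tooling, of how such errors surface: re-request a sample of image URLs and tally HTTP status codes, where 403 is the “Forbidden” case described above. `sampled_image_urls` is a placeholder.]

```python
from collections import Counter
import requests

def tally_download_errors(urls: list[str]) -> Counter:
    """Tally HTTP status codes (403 = Forbidden, 404 = gone, ...) for a sample of image URLs."""
    counts = Counter()
    for url in urls:
        try:
            resp = requests.head(url, timeout=10, allow_redirects=True)
            counts[resp.status_code] += 1
        except requests.RequestException:
            counts["connection_error"] += 1
    return counts

# e.g. tally_download_errors(sampled_image_urls)  # placeholder list of URLs
```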
rachelhong.bsky.social
Some samples reveal names and faces linked to demographic and children’s information (see paraphrased examples below). Many come from news sites, where someone may have disclosed the information for an article rather than consented to their data being used to train a model [5/N]
rachelhong.bsky.social
🌳 DataComp CommonPool is an image dataset crawled from the web, following LAION-5B (taken down in Dec 2023 for containing illegal material). DataComp has been downloaded ≥2M times (!), with a huge number of downstream dataset and model users (i.e., the leaves) relying on one source [4/N]
rachelhong.bsky.social
🚀 Empirically, we find:
1. Examples of credit card numbers, passport/ID numbers, resumes, faces, and children’s data (see the detection sketch after this post)
2. Attempts at data sanitization (such as face blurring) aren’t perfect
3. Data on the web isn’t always “publicly available” according to legal frameworks
[3/N]
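[For finding 1 above, a hedged sketch of one generic way candidate credit card numbers can be flagged in text: match 13-16 digit runs and apply the standard Luhn checksum. This is a common technique, not necessarily the paper’s detector.]

```python
import re

def luhn_valid(digits: str) -> bool:
    """Standard Luhn checksum used by real card numbers; filters out random digit runs."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:  # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

def candidate_card_numbers(text: str) -> list[str]:
    """Flag 13-16 digit runs (allowing spaces/dashes) that pass the Luhn check."""
    candidates = []
    for match in re.finditer(r"(?:\d[ -]?){13,16}", text):
        digits = re.sub(r"[ -]", "", match.group())
        if 13 <= len(digits) <= 16 and luhn_valid(digits):
            candidates.append(digits)
    return candidates

print(candidate_card_numbers("card: 4111 1111 1111 1111"))
# ['4111111111111111'] (a standard test number, not real PII)
```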
rachelhong.bsky.social
🔒 Our dataset audit findings inform our legal analysis with regard to existing consumer privacy and data protection laws, like the CCPA and GDPR. We surface various privacy risks of current data curation practices built upon indiscriminate scraping of the web. [2/N]