#CommonCrawl
Common Crawl Criticized for 'Quietly Funneling Paywalled Articles to AI Developers'
www.msn.com/en-us/money/...
November 9, 2025 at 2:40 AM
Common Crawl - Setting the Record Straight: Common Crawl’s Commitment to Transparency, Fair Use, and the Public Good commoncrawl.org/blog/setting-t… #AI #CommonCrawl #data #WebArchiving (wow, that Atlantic piece was bad, needing this rebuttal)
November 7, 2025 at 11:27 PM
This contamination is not intentional: we identified websites that reframed splits of MMLU as user-friendly quizzes
These websites can then be found in CommonCrawl dumps that are generally used for pretraining data curation...
November 7, 2025 at 9:11 PM
The Nonprofit Doing the AI Industry’s Dirty Work -- The web archive Common Crawl has been quietly funneling paywalled articles to AI companies—and lying to publishers about it. #AI #CommonCrawl #TheAtlantic

www.theatlantic.com/technology/2...
November 5, 2025 at 3:22 PM
A non-profit has built a massive #internet database—and served training data to #AI firms despite pleas from publishers to stop, Alex Reisner reports.
Generative AI in its current form would probably not be possible without #CommonCrawl

www.theatlantic.com/technology/2...
The Company Quietly Funneling Paywalled Articles to AI Developers
“You shouldn’t have put your content on the internet if you didn’t want it to be on the internet,” Common Crawl’s executive director says.
www.theatlantic.com
November 5, 2025 at 10:51 AM
Common Crawl supplies paywalled content to AI companies despite publisher objections #AI #Journalism #DataEthics #Paywall #CommonCrawl
Common Crawl supplies paywalled content to AI companies despite publisher objections
Nonprofit organization Common Crawl provides major AI companies access to millions of paywalled news articles while claiming compliance with publisher removal requests, investigation reveals.
ppc.land
November 5, 2025 at 10:50 AM
NEW: Common Crawl, the massive archiver of the web, has gotten cozy with AI companies and is providing paywalled articles for training data. They’re also lying to publishers who have asked for material to be removed. “The robots are people too,” CC’s exec director told us when we asked about this.
The Nonprofit Feeding the Entire Internet to AI Companies
Common Crawl claims to provide a public benefit, but it lies to publishers about its activities.
www.theatlantic.com
November 4, 2025 at 1:12 PM
Upcoming Event (October 22nd) Hosted by @stanfordhai.bsky.social: Common Crawl Foundation: Preserving Humanity's Knowledge and Making it Accessible: Addressing Challenges of Public Web Data
hai.stanford.edu/events/commo... @commoncrawl.bsky.social #AI #commoncrawl #datasets #data
Common Crawl Foundation | Preserving Humanity's Knowledge and Making it Accessible: Addressing Challenges of Public Web Data | Stanford HAI
Learn about Common Crawl's insights from a recent data product and informed solutions for the future of public web data.
hai.stanford.edu
October 19, 2025 at 3:24 PM
CommonForms is cool. Guy filtered CommonCrawl down to PDFs and trained a 3-class YOLO model on it. Performs better than Acrobat.

arxiv.org/pdf/2509.16506
arxiv.org
October 11, 2025 at 11:11 PM
Two professional bodies representing 800 French press titles have obtained the removal of their press content from #CommonCrawl, which is used notably to feed AI systems. A piece of news worth remembering.
www.lalettre.fr/fr/medias_pr...
APIG and SEPM obtain the removal from Common Crawl of press content plundered by AI - LA LETTRE
The first offensive by daily newspaper and magazine publishers to protect themselves against the plundering of their content by the AI giants has paid off.
www.lalettre.fr
October 3, 2025 at 11:37 AM
Without being able to verify, it's just going to have to stay in the category of humorous meme lol. I did do a little half-assed footwork by checking ye olde waybacke, and I queried CommonCrawl, but no dice. This one will remain a mystery.
September 22, 2025 at 5:24 PM
Several French #media companies are protesting against the unauthorized use of their content by #AI systems.

The focus is particularly on freely accessible databases such as #commoncrawl, whose content is used to train #LanguageModels.

The #publishers are demanding the removal […]
Original post on mastodon.online
mastodon.online
September 14, 2025 at 8:11 AM
Why am I not surprised?
Because I already knew about the dialogue from US films.
And about CommonCrawl.
And about the press.
And about my blog.
The list (of plunder) is endless.
And the law protects creators only very poorly:
www.precisement.org/blog/Les-IA-...
September 8, 2025 at 11:56 AM
pretty interesting!

bootstrapping a large *and* quality URL crawl list is hard, and a barrier to entry. you can spider top stuff from, eg, Alexa top million, wikidata, and commoncrawl. but true long tail is important and hard: "not linked-to but good"

sometimes via old tweet links, reddit, etc
September 3, 2025 at 12:29 AM
Yet as far as CommonCrawl is concerned, it seems to comply with simple removal requests sent by email (cf. the email above). One piece of indirect evidence could be the marked drop in the size of their dataset since Dec. 2023:
en.wikipedia.org/wiki/Common_...
Much ado about nothing?
5/5
September 2, 2025 at 9:42 AM
Hello French press, you're only waking up in 2025?
CommonCrawl, this "siphoner", was already feeding search engines (GG, BG ...) in the 2000s.
As a private individual, I had myself removed from it 2 years ago.
2/
September 2, 2025 at 9:25 AM
The press is finally attacking at the root the abuses of "borrowing" intellectual property by generative AI.
In doing so, it is attacking an institution: CommonCrawl and its derivatives have existed for ... 18 years!
1/
September 2, 2025 at 9:23 AM
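Thank you.
However, if laion analyzed commonCrawl's data (a large volume of HTML files, presumably), extracted the image links and the contents of the alt tags, and additionally generated new annotations by classifying the images with clip, then that seems inconsistent with what the court said in the case in question, which was roughly: "A crawler can recognize terms such as 'no scraping by bots' stated in natural language in a stock image site's terms of use, so even without robots.txt this sufficiently meets the requirements for an opt-out."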
August 11, 2025 at 2:56 PM
LAION-5B is reportedly a dataset built by taking the list of image URLs and ALT attribute values collected by CommonCrawl and filtering it with the CLIP model to improve its accuracy.
laion.ai/faq/

gigazine.net/news/2022121...

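If a crawler can extract the ALT attribute values from a page's HTML, it can presumably retrieve other text as well.
And if it is "using the CLIP model to filter the association between images and ALT attribute values, i.e. natural language," then it should also be able to understand an explicit refusal of AI training (a TDM rights reservation) stated in natural-language text within the image or in the ALT attribute value.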
August 11, 2025 at 2:17 PM
has anyone ever tried to visualize the entirety of commoncrawl somehow
August 9, 2025 at 10:43 AM