Lightnews — Scholar-powered news

mattotcha.bsky.social

@mattotcha.bsky.social

Common Crawl Criticized for 'Quietly Funneling Paywalled Articles to AI Developers'
www.msn.com/en-us/money/...

CommonCrawl BlackHole The Company Quietly Funneling Paywalled Articles to AI Developers
© Illustration by Matteo Giuseppe Pani / The Atlantic

November 9, 2025 at 2:40 AM

Nicole Hennig

@nic221.bsky.social

Common Crawl - Setting the Record Straight: Common Crawl’s Commitment to Transparency, Fair Use, and the Public Good commoncrawl.org/blog/setting-t… #AI #CommonCrawl #data #WebArchiving (wow, that Atlantic piece was bad, needing this rebuttal)

Text Shot: A recent article in The Atlantic (“The Nonprofit Doing the AI Industry’s Dirty Work,” November 4, 2025) makes several false and misleading claims about the Common Crawl Foundation, including the accusation that our organization has “lied to publishers” about our activities.

This allegation is untrue. It misrepresents both how Common Crawl operates and the values that guide our work.

November 7, 2025 at 11:27 PM

Nathan Godey

@nthngdy.bsky.social

This contamination is not intentional: we identified websites that reframed splits of MMLU as user-friendly quizzes
These websites can then be found in CommonCrawl dumps that are generally used for pretraining data curation...

November 7, 2025 at 9:11 PM

Marketing News

@marketingnews.bsky.social

Common Crawl defends archive practices amid deletion claims #CommonCrawl #DataArchive #Nonprofit #DigitalPreservation #DataCollection

Common Crawl defends archive practices amid deletion claims

Nonprofit Common Crawl issued November 4 statement defending data collection methods, citing technical constraints preventing content deletion.

ppc.land

November 6, 2025 at 1:23 PM

PPC Land

@ppc.land

Common Crawl defends archive practices amid deletion claims #CommonCrawl #DataArchive #Nonprofit #DigitalPreservation #DataCollection

Common Crawl defends archive practices amid deletion claims

Nonprofit Common Crawl issued November 4 statement defending data collection methods, citing technical constraints preventing content deletion.

ppc.land

November 6, 2025 at 1:23 PM

Human Creator - Human.global

@humancreator.bsky.social

The Nonprofit Doing the AI Industry’s Dirty Work -- The web archive Common Crawl has been quietly funneling paywalled articles to AI companies—and lying to publishers about it. #AI #CommonCrawl #TheAtlantic

www.theatlantic.com/technology/2...

November 5, 2025 at 3:22 PM

Davide Galati

@davidegalati.bsky.social

A non-profit has built a massive #internet database—and served training data to #AI firms despite pleas from publishers to stop, Alex Reisner reports.
Generative AI in its current form would probably not be possible without #CommonCrawl

www.theatlantic.com/technology/2...

The Company Quietly Funneling Paywalled Articles to AI Developers

“You shouldn’t have put your content on the internet if you didn’t want it to be on the internet,” Common Crawl’s executive director says.

www.theatlantic.com

November 5, 2025 at 10:51 AM

PPC Land

@ppc.land

Common Crawl supplies paywalled content to AI companies despite publisher objections #AI #Journalism #DataEthics #Paywall #CommonCrawl

Common Crawl supplies paywalled content to AI companies despite publisher objections

Nonprofit organization Common Crawl provides major AI companies access to millions of paywalled news articles while claiming compliance with publisher removal requests, investigation reveals.

ppc.land

November 5, 2025 at 10:50 AM

Marketing News

@marketingnews.bsky.social

Common Crawl supplies paywalled content to AI companies despite publisher objections #AI #Journalism #DataEthics #Paywall #CommonCrawl

Common Crawl supplies paywalled content to AI companies despite publisher objections

Nonprofit organization Common Crawl provides major AI companies access to millions of paywalled news articles while claiming compliance with publisher removal requests, investigation reveals.

ppc.land

November 5, 2025 at 10:50 AM

killerrabbit90.bsky.social

@killerrabbit90.bsky.social

The Nonprofit Doing the AI Industry’s Dirty Work www.theatlantic.com/technology/2... #tech #AI #CommonCrawl #PrivacyRights #TechRegulation #SiliconValley #BigBrother

The Company Quietly Funneling Paywalled Articles to AI Developers

“You shouldn’t have put your content on the internet if you didn’t want it to be on the internet,” Common Crawl’s executive director says.

www.theatlantic.com

November 5, 2025 at 12:58 AM

kathykadane.bsky.social

@kathykadane.bsky.social

#AI
#THEFT
#COPYRIGHT
#COMMONCRAWL

Damon Beres @damonberes.com · 8d

NEW: Common Crawl, the massive archiver of the web, has gotten cozy with AI companies and is providing paywalled articles for training data. They’re also lying to publishers who have asked for material to be removed. “The robots are people too,” CC’s exec director told us when we asked about this.

The Nonprofit Feeding the Entire Internet to AI Companies

Common Crawl claims to provide a public benefit, but it lies to publishers about its activities.

www.theatlantic.com

November 4, 2025 at 1:12 PM

infoDOCKET

@infodocket.bsky.social

Upcoming Event (October 22nd) Hosted by @stanfordhai.bsky.social: Common Crawl Foundation: Preserving Humanity's Knowledge and Making it Accessible: Addressing Challenges of Public Web Data
hai.stanford.edu/events/commo... @commoncrawl.bsky.social #AI #commoncrawl #datasets #data

Common Crawl Foundation | Preserving Humanity's Knowledge and Making it Accessible: Addressing Challenges of Public Web Data | Stanford HAI

Learn about Common Crawl's insights from a recent data product and informed solutions for the future of public web data.

hai.stanford.edu

October 19, 2025 at 3:24 PM

Andrew Duffy

@a10y.dev

CommonForms is cool. Guy filtered CommonCrawl down to PDFs and trained a 3-class YOLO model on it. Performs better than Acrobat.

arxiv.org/pdf/2509.16506

arxiv.org

October 11, 2025 at 11:11 PM

Baptiste Hugot

@baptistehugot.com

Two trade bodies representing 800 French press titles have obtained the removal of their press content from #CommonCrawl, which is used in particular to feed AI systems. A piece of news worth remembering.
www.lalettre.fr/fr/medias_pr...

APIG and SEPM obtain the removal of press content plundered by AI from Common Crawl - LA LETTRE

The first offensive by newspaper and magazine publishers to protect their content from being plundered by the AI giants has paid off.

www.lalettre.fr

October 3, 2025 at 11:37 AM

Ungovernable Human🏳️‍🌈🏳️‍⚧️✊🏿

@najakaouthia.bsky.social

Without being able to verify, it's just going to have to stay in the category of humorous meme lol. I did do a little half assed footwork by checking ye olde waybacke and I queried CommonCrawl but no dice. This one will remain a mystery.

September 22, 2025 at 5:24 PM

Tino Eberl

@tinoeberl.mastodon.online.ap.brid.gy

Several French #media companies are protesting the unauthorized use of their content by #AI systems.

The focus is particularly on freely accessible databases such as #commoncrawl, whose content is used to train #languagemodels.

The #publishers are demanding the removal […]

Original post on mastodon.online

mastodon.online

September 14, 2025 at 8:11 AM

precisement.org

@precisement.bsky.social

Why am I not surprised?
Because I already knew about the dialogue from US films.
And about CommonCrawl.
And about the press.
And about my blog.
The list (of plundering) is endless.
And the law protects creators very poorly:
www.precisement.org/blog/Les-IA-...

September 8, 2025 at 11:56 AM

bryan newbold

@bnewbold.net

pretty interesting!

bootstrapping a large *and* quality URL crawl list is hard, and a barrier to entry. you can spider top stuff from, eg, Alexa top million, wikidata, and commoncrawl. but true long tail is important and hard: "not linked-to but good"

sometimes via old tweet links, reddit, etc

September 3, 2025 at 12:29 AM

precisement.org

@precisement.bsky.social

Yet as far as CommonCrawl is concerned, it seems to comply with simple removal requests sent by email (cf. the email above). Indirect evidence might be the marked drop in the size of their dataset since Dec. 2023:
en.wikipedia.org/wiki/Common_...
Much ado about nothing?
5/5

September 2, 2025 at 9:42 AM

precisement.org

@precisement.bsky.social

Hello French press, so we are only waking up in 2025?
CommonCrawl, that "siphon", was already feeding search engines (GG, BG ...) in the 2000s.
As a mere private individual, I had myself removed from it 2 years ago.
2/

September 2, 2025 at 9:25 AM

precisement.org

@precisement.bsky.social

The press is finally attacking at the root the abusive "borrowing" of intellectual property by generative AI.
In doing so, it is taking on an institution: CommonCrawl and its derivatives have existed for ... 18 years!
1/

September 2, 2025 at 9:23 AM

モデラーDYR

@dyrthought.bsky.social

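Thank you.
However, if LAION analyzed CommonCrawl's data (presumably a large volume of HTML files), extracted the image links and the contents of the alt tags, and additionally generated new annotations by classifying the images with CLIP, then that seems inconsistent with what the court said in the case in question, which was roughly: "Crawlers can recognize terms such as 'no scraping by bots' stated in natural language in an image stock site's terms of use, so the opt-out requirement is sufficiently met even without robots.txt."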

August 11, 2025 at 2:56 PM

N

@beech7245.bsky.social

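As I understand it, LAION-5B is a dataset built by taking the list of image URLs and ALT attribute values collected by CommonCrawl and filtering it with the CLIP model to improve its accuracy.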
laion.ai/faq/

gigazine.net/news/2022121...

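If a crawler can extract the ALT attribute values from a page's HTML, it can presumably retrieve other text as well.
And if it is "using the CLIP model to filter the association between images and ALT attribute values, i.e. natural language", then it should presumably also be able to understand an explicit refusal of AI training (a TDM rights reservation) expressed as natural-language text inside the image or in the ALT attribute value.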

August 11, 2025 at 2:17 PM

alto! (Ugly Sweater Edition)

@probablytoo.online

has anyone ever tried to visualize the entirety of commoncrawl somehow

August 9, 2025 at 10:43 AM
