Jörg Lehmann
@jrglmn.bsky.social
Digital humanism | machine learning | digital cultural heritage | Berlin State Library | "Name a bias – we have it!"
December 26, 2025 at 6:20 PM
Feedback on these publications is most welcome!

Find all project results here:
mmk.sbb.berlin/deliverables...

#bigdata #ML #culturalheritage #ELSI #digitalculturalheritage
Deliverables – Mensch.Maschine.Kultur
mmk.sbb.berlin
December 17, 2025 at 5:17 PM
Conclusion by Bloomberg staff writers Parmy Olson and Carolyn Silverman in an article reviewing the financial tally of generative AI impacts

#cost-benefitratio
December 12, 2025 at 9:26 AM
Bloomberg: The data raise “an uncomfortable prospect: that this supposedly revolutionary technology might never deliver on its promise of broad economic transformation, but instead just concentrate more wealth at the top.”

bloomberg.com/opinion/arti...
Bloomberg - Business News, Stock Markets, Finance, Breaking & World News
Bloomberg delivers business and markets news, data, analysis, and video to the world, featuring stories from Businessweek and Bloomberg News
bloomberg.com
December 12, 2025 at 9:26 AM
See the interesting and literally playful study by Tarek Saier here (in German only):

illdepence.github.io/slf-origins/

doi.org/10.5281/zeno...
Since when has Stadt, Land, Fluss existed?
Research into the origins of Stadt, Land, Fluss
illdepence.github.io
December 12, 2025 at 9:26 AM
Well, we Germans had that back in the 1920s: a typographic culture war. 'Modernists' used Antiqua, nationalists used Fraktur. The lingering effect is that you open a book and think you know the mindset of the person who wrote it, which is complete nonsense…
December 10, 2025 at 9:57 PM
Happy to be part of the proceedings with our paper "How Scalable is Quality Assessment of Text Recognition?"
anthology.ach.org/volumes/vol0...
Our colleague Michał Bubula will present a related poster at #CHR2025 and will be available for questions ... Enjoy!
How Scalable is Quality Assessment of Text Recognition? A Combination of Ground Truth and Confidence Scores
anthology.ach.org
December 10, 2025 at 9:35 AM
Yes, and I saw yours. Coming around the corner soon with some 4,500 maps from @stabiberlin.bsky.social
December 9, 2025 at 1:31 PM
And on another note: how come these stickers can be found next to each other on the same door?
December 9, 2025 at 1:21 PM
Thank you so much for these slides, Daniel! They add much to the discussion on "Fifty shades of openness" …
December 9, 2025 at 1:12 PM
#FF2025 pickings:
This year has been extremely productive with regard to the AI & commons debate, as well as the publication of open, public-domain datasets.

Paul Keller & Europeana Foundation: Publishing cultural heritage data in the age of AI, Dec 2025
openfuture.eu/publication/...
Impulse paper: Publishing cultural heritage data in the age of AI – Open Future
This paper proposes a framework to help cultural heritage institutions decide when and how to share collection data for AI training, balancing open access with managing large-scale AI reuse aligned wi...
openfuture.eu
December 9, 2025 at 1:07 PM
Nikhil Kandpal et al.: The Common Pile v0.1: An 8TB Dataset of Public Domain and Openly Licensed Text, June 2025
doi.org/10.48550/arX...

Stefan Baack et al.: Towards Best Practices for Open Datasets for LLM Training, Jan 2025
doi.org/10.48550/arX...
The Common Pile v0.1: An 8TB Dataset of Public Domain and Openly Licensed Text
Large language models (LLMs) are typically trained on enormous quantities of unlicensed text, a practice that has led to scrutiny due to possible intellectual property infringement and ethical concern...
doi.org
December 8, 2025 at 10:48 AM
Lukas Gienapp et al.: The German Commons – 154 Billion Tokens of Openly Licensed Text for German Language Models, Oct 2025
doi.org/10.48550/arX...

Pierre-Carl Langlais et al.: Common Corpus: The Largest Collection of Ethical Data for LLM Pre-Training, June 2025
doi.org/10.48550/arX...
The German Commons - 154 Billion Tokens of Openly Licensed Text for German Language Models
Large language model development relies on large-scale training corpora, yet most contain data of unclear licensing status, limiting the development of truly open models. This problem is exacerbated f...
doi.org
December 8, 2025 at 10:48 AM
Thomas Padilla et al.: Public Interest Corpus Principles and Goals, Dec 2025
www.authorsalliance.org/2025/12/03/r...

Paul Keller & Europeana Foundation: Outline for a European Books Data Commons, Nov 2025
openfuture.eu/publication/...
Releasing The Public Interest Corpus Principles and Goals
Today, we are pleased to release The Public Interest Corpus Principles and Goals. This release builds on the recap of our final planning workshop and anticipates release of our final deliverable la…
www.authorsalliance.org
December 8, 2025 at 10:48 AM