Jörg Lehmann
@jrglmn.bsky.social
Digital humanism | machine learning | digital cultural heritage | Berlin State Library | "Name a bias – we have it!"
December 26, 2025 at 6:20 PM
Feedback on these publications is most welcome!

Find all project results here:
mmk.sbb.berlin/deliverables...

#bigdata #ML #culturalheritage #ELSI #digitalculturalheritage
Deliverables – Mensch.Maschine.Kultur
mmk.sbb.berlin
December 17, 2025 at 5:17 PM
Conclusion by Bloomberg staff writers Parmy Olson and Carolyn Silverman in an article reviewing the financial tally of generative AI impacts

#cost-benefitratio
December 12, 2025 at 9:26 AM
Bloomberg: The data raise “an uncomfortable prospect: that this supposedly revolutionary technology might never deliver on its promise of broad economic transformation, but instead just concentrate more wealth at the top.”

bloomberg.com/opinion/arti...
Bloomberg - Business News, Stock Markets, Finance, Breaking & World News
Bloomberg delivers business and markets news, data, analysis, and video to the world, featuring stories from Businessweek and Bloomberg News
bloomberg.com
December 12, 2025 at 9:26 AM
See the interesting and literally playful study by Tarek Saier here (in German only):

illdepence.github.io/slf-origins/

doi.org/10.5281/zeno...
Since when has Stadt, Land, Fluss existed?
Research into the origins of Stadt, Land, Fluss
illdepence.github.io
December 12, 2025 at 9:26 AM
Well, we Germans had that back in the 1920s: a typographic culture war. 'Modernists' used Antiqua, nationalists used Fraktur. The lingering effect is that you open a book and think you know the mindset of the person who wrote it, which is complete nonsense…
December 10, 2025 at 9:57 PM
Happy to be part of the proceedings with our paper "How Scalable is Quality Assessment of Text Recognition?"
anthology.ach.org/volumes/vol0...
Our colleague Michał Bubula will present a related poster at #CHR2025 and will be available for questions ... Enjoy!
How Scalable is Quality Assessment of Text Recognition? A Combination of Ground Truth and Confidence Scores
anthology.ach.org
December 10, 2025 at 9:35 AM
Yes, and I saw yours. Coming around the corner soon with some 4,500 maps from @stabiberlin.bsky.social
December 9, 2025 at 1:31 PM
And on another note: how come these stickers can be found next to each other on the same door?
December 9, 2025 at 1:21 PM
Thank you so much for these slides, Daniel! They add much to the discussion on "Fifty shades of openness" …
December 9, 2025 at 1:12 PM
#FF2025 pickings:
This year has been extremely productive with regard to the AI & commons debate, as well as the publication of open, public-domain datasets.

Paul Keller & Europeana Foundation: Publishing cultural heritage data in the age of AI, Dec 2025
openfuture.eu/publication/...
Impulse paper: Publishing cultural heritage data in the age of AI – Open Future
This paper proposes a framework to help cultural heritage institutions decide when and how to share collection data for AI training, balancing open access with managing large-scale AI reuse aligned wi...
openfuture.eu
December 9, 2025 at 1:07 PM
Nikhil Kandpal et al.: The Common Pile v0.1: An 8TB Dataset of Public Domain and Openly Licensed Text, June 2025
doi.org/10.48550/arX...

Stefan Baack et al.: Towards Best Practices for Open Datasets for LLM Training, Jan 2025
doi.org/10.48550/arX...
The Common Pile v0.1: An 8TB Dataset of Public Domain and Openly Licensed Text
Large language models (LLMs) are typically trained on enormous quantities of unlicensed text, a practice that has led to scrutiny due to possible intellectual property infringement and ethical concern...
doi.org
December 8, 2025 at 10:48 AM
Lukas Gienapp et al.: The German Commons – 154 Billion Tokens of Openly Licensed Text for German Language Models, Oct 2025
doi.org/10.48550/arX...

Pierre-Carl Langlais et al.: Common Corpus: The Largest Collection of Ethical Data for LLM Pre-Training, June 2025
doi.org/10.48550/arX...
The German Commons - 154 Billion Tokens of Openly Licensed Text for German Language Models
Large language model development relies on large-scale training corpora, yet most contain data of unclear licensing status, limiting the development of truly open models. This problem is exacerbated f...
doi.org
December 8, 2025 at 10:48 AM
Thomas Padilla et al.: Public Interest Corpus Principles and Goals, Dec 2025
www.authorsalliance.org/2025/12/03/r...

Paul Keller & Europeana Foundation: Outline for a European Books Data Commons, Nov 2025
openfuture.eu/publication/...
Releasing The Public Interest Corpus Principles and Goals
Today, we are pleased to release The Public Interest Corpus Principles and Goals. This release builds on the recap of our final planning workshop and anticipates release of our final deliverable la…
www.authorsalliance.org
December 8, 2025 at 10:48 AM