IDI
@institutional.org
220 followers 2 following 10 posts
A research center at Harvard working to strengthen society’s connection to knowledge by advancing our access to and understanding of the data that shapes AI.
institutional.org
Join us tomorrow at 10AM EST:
tinyurl.com/y3ye6cz6
institutional.org
Can a small visual language model read documents as effectively as models 27 times its size?

Next Friday, IDI will host Michele Dolfi and Peter Staar from IBM Research Zurich to discuss their work on SmolDocling, an “ultra-compact” model for diverse OCR tasks.
Reposted by IDI
leppert.me
This Monday, @institutionaldatainitiative.org will host Petr Knoth to share his experience leading CORE ("The world’s largest collection of open access research papers") as the rise of AI brings new meaning, and challenges, to stewarding knowledge repositories. Join us virtually via the link below.
institutional.org
We hope Institutional Books will be the beginning of a process that makes millions more books accessible to the public for a variety of uses.

We welcome feedback as we continue to expand this dataset, refine its contents, and sharpen our process.
www.institutionaldatainitiative.org/institutiona...
Institutional Books | Institutional Data Initiative
Institutional Books 1.0 is our first release of public domain books. This set was originally digitized through Harvard Library’s participation in the Google Books project.
www.institutionaldatainitiative.org
institutional.org
We look forward to growing Institutional Books through community. We welcome collaboration from researchers and model makers as we:
- Evaluate the dataset’s impact on model outputs
- Continue to refine our OCR pipelines

View the dataset on Hugging Face: huggingface.co/datasets/ins...
institutional/institutional-books-1.0 · Datasets at Hugging Face
huggingface.co
institutional.org
As part of our refinement work, we supplemented the original OCR-extracted text with a post-processed version that utilizes line detection to reassemble the text according to the line type.
institutional.org
We included extensive volume-level metadata with both original and generated components, such as results from text-level language detection.
institutional.org
We analyzed the dataset’s coverage across time, topic, and language and found:
- 40% English text + a long tail of 254 languages
- 20 clear topical tranches
- Largely published in the 19th and 20th centuries

Technical report here: arxiv.org/abs/2506.08300
institutional.org
Today we released Institutional Books 1.0, a 242B token dataset from Harvard Library's collections, refined for accuracy and usability. 🧵
Reposted by IDI
leppert.me
The @institutionaldatainitiative.org is proud to support The New Commons challenge. $100k grants along with mentorship. Let's get impactful data into the AI ecosystem.
thegovlab.org
(1/4) CALL FOR APPLICATIONS FOR DATA COMMONS FOR AI

🏆 Today, The Open Data Policy Lab (a collaboration btwn The GovLab & @microsoft.com) launched The New Commons Challenge, an innovation challenge to foster the creation of data commons that can support generative AI developed in the public interest.