Library Innovation Lab
@harvardlil.bsky.social
2.3K followers 210 following 27 posts
A crowd of coders, lawyers, librarians, designers, & tinkerers building tools like Perma.cc & Caselaw Access Project at the Harvard Law School Library. Where @institutionaldatainitiative.org got started. 🌐 https://lil.law.harvard.edu
harvardlil.bsky.social
Join our team! LIL is looking for a Product and Research Manager to help create, shape, and execute on our portfolio of open knowledge projects. PRMs work across every piece of the LIL ecosystem, from software experimentation to convening of events. Learn more at careers.harvard.edu/job/product-...
Product and Research Manager
careers.harvard.edu
Reposted by Library Innovation Lab
harvardlil.bsky.social
Ed Summers at Stanford wrote this great deep dive into how and why we designed our data.gov archiver the way we did. Thanks for digging in, Ed, this is excellent. inkdroid.org/2025/02/17/n...
Bagging data.gov
inkdroid.org
harvardlil.bsky.social
We just launched a 16TB archive of every dataset that has been available on data.gov since November. This will be updated day by day as new datasets appear. It can be freely copied, and we're sharing the code behind it to help others make their own archives of data they depend on.
Announcing the Data.gov Archive | Library Innovation Lab
Today we released our archive of data.gov on Source Cooperative. The 16TB collection includes over 311,000 datasets harvested during 2024 and 2025, a complet...
lil.law.harvard.edu
Reposted by Library Innovation Lab
lyndamk.bsky.social
Penn is getting a lot of questions about Data Refuge. That effort no longer exists, but several efforts are currently active. I've created a doc from what I & others have suggested. I'll update as I hear more. Feel free to share or suggest: docs.google.com/document/d/1...
Data Rescue Efforts
Data / Website Rescue Efforts End of Term Crawl - The main coordinated effort to archive websites, but datasets have been more of a challenge. EDGI - They have been focused on environmental data. A ...
docs.google.com
harvardlil.bsky.social
What are we all missing? Anything you can't get by clicking from link to link like EOT, or by downloading datasets directly from data.gov. If there are things you care about preserving that fit that description, that's where to focus.
harvardlil.bsky.social
Our collection from data.gov is limited: if an entry points directly to the data, such as a csv, we have the data. If it points to an html landing page, we just have the landing page. This means many, many datasets are not included. What we have from data.gov adds up to 15 or 20TB.
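The distinction in the post above can be sketched in a few lines of Python. This is an illustration only, not LIL's archiver: the function name `classify_resource` and the rule of deciding from the HTTP Content-Type header are assumptions made for the example.

```python
def classify_resource(content_type: str) -> str:
    """Guess whether a fetched dataset URL returned data or a landing page.

    Hypothetical helper: decides only from the Content-Type header value.
    """
    # Drop parameters such as "; charset=utf-8" and normalize case.
    media_type = content_type.split(";")[0].strip().lower()
    if media_type in {"text/html", "application/xhtml+xml"}:
        return "landing_page"  # a shallow crawl stops here; the data sits behind the page
    return "data"  # e.g. text/csv or application/zip: the file itself was captured
```

A shallow crawl that saves only what each URL returns would end up with the dataset in the second case and just the HTML page in the first.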
Data.gov Home - Data.gov
data.gov
harvardlil.bsky.social
Speaking of telling someone, here’s our update: we have copies of all metadata from data.gov, and all of the dataset URLs it points to (shallow crawl); all federal Github repositories with issues, comments, etc.; and articles from PubMed.
Data.gov Home - Data.gov
data.gov
harvardlil.bsky.social
Third, tell someone. Archive.org is one good place to store public data for discovery, and we at LIL will consider storing and signing data in some cases as well. Just posting data somewhere search engines can find is good too.
Internet Archive: Digital Library of Free & Borrowable Texts, Movies, Music & Wayback Machine
Archive.org
harvardlil.bsky.social
FOIA requests are another great way to scale up — check out @muckrock.com to get started.
harvardlil.bsky.social
Next, scale up. If you’re a programmer (or can team up with one), write a python script to download a full collection — say, everything from the data portal of a given government website. Run it yourself, and share it so we libraries can use it too.
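A minimal sketch of what such a script might look like, using only the Python standard library. It assumes you already have a list of file URLs for the collection. The names `download_collection`, `fetch_bytes`, and `local_name`, and the manifest layout, are hypothetical choices for this example rather than an existing tool.

```python
import hashlib
import json
import urllib.request
from pathlib import Path
from urllib.parse import urlparse


def fetch_bytes(url: str) -> bytes:
    """Download one URL and return its raw bytes."""
    with urllib.request.urlopen(url, timeout=60) as response:
        return response.read()


def local_name(url: str, index: int) -> str:
    """Build a file name from the URL path, prefixed with an index to avoid collisions."""
    name = Path(urlparse(url).path).name or "index.html"
    return f"{index:05d}_{name}"


def download_collection(urls, dest_dir, fetch=fetch_bytes):
    """Save every URL into dest_dir and write a manifest of what was saved."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    manifest = []
    for index, url in enumerate(urls):
        try:
            data = fetch(url)
        except Exception as error:
            # Record the failure and keep going so one bad link does not stop the run.
            manifest.append({"url": url, "error": str(error)})
            continue
        name = local_name(url, index)
        (dest / name).write_bytes(data)
        manifest.append({
            "url": url,
            "file": name,
            "bytes": len(data),
            # The checksum lets anyone holding a copy verify it later.
            "sha256": hashlib.sha256(data).hexdigest(),
        })
    (dest / "manifest.json").write_text(json.dumps(manifest, indent=2))
    return manifest
```

Calling `download_collection(list_of_urls, "my_archive")` would write each file plus a `manifest.json` recording the source URL, size, and SHA-256 checksum. Sharing the manifest alongside the script makes it easier for others to check and reuse the copy.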
harvardlil.bsky.social
If you’re a data scientist, good news — your work isn't just downloading data and publishing about it, but also keeping safe copies!
harvardlil.bsky.social
To keep access to stuff you care about: first just make a copy. Use ArchiveWeb.page to click around and download all the parts of a website you’re interested in. We like the desktop version to avoid capturing login cookies or extensions, but the browser extension is good too.
ArchiveWeb.page
harvardlil.bsky.social
Public data gets taken down all the time. Everyone needs to understand that there are no complete copies — it belongs to all of us, and we paid for it, but it is too large for anyone to copy. This is why libraries work together to preserve stuff.
harvardlil.bsky.social
Why trust us? We’re the web archiving folks behind Perma.cc and the open data folks behind Case.law. Starting late last year we moved into collecting online government datasets, since it can be hard to get the datasets researchers need from web archives.
Websites change. Perma Links don't.
Perma.cc helps scholars, journals, courts, and others create permanent records of the web sources they cite.
Perma.cc
harvardlil.bsky.social
Last year we started a project to download and preserve public data. lil.law.harvard.edu/blog/2025/01... Since saving public data is in the news today — but is always needed — let’s talk about what you can do to help.
Preserving Public U.S. Federal Data | Library Innovation Lab
lil.law.harvard.edu
harvardlil.bsky.social
Could a tool like this help you overcome both language and knowledge barriers when exploring large collections of information? How might LLMs help people access and understand legal information that is either in a foreign language or requires specialized knowledge?
harvardlil.bsky.social
In this case study, @matteocargnelutti.dev and @kristimukk.bsky.social (in collaboration with Betty Queffelec, University of Western Brittany) investigate how such a tool might help non-French speakers of varying expertise ask questions in English to explore French law.
harvardlil.bsky.social
What insights emerge when a librarian, a software engineer, and a legal scholar come together to experiment with Retrieval Augmented Generation (RAG) to explore over 800,000 French legal articles 🇫🇷?

Blog post: lil.law.harvard.edu/blog/2025/01...
Case study: lil.law.harvard.edu/open-french-...
Image: the title "Open French Law RAG" overlaid on a llama dressed in a French lawyer's robe in front of an abstract background.
Reposted by Library Innovation Lab
ktmac.bsky.social
I’ve been thinking so much lately about data storage, formats, inscription, archives & deep time for the #dataloss project.

So cool to see this piece come out from @harvardlil.bsky.social on CENTURY-SCALE STORAGE by @maxy.bsky.social
(ty @louravn.bsky.social!)

lil.law.harvard.edu/century-scal...
Century-Scale Storage
If you had to store something for 100 years, how would you do it?
lil.law.harvard.edu
harvardlil.bsky.social
"We picked a century scale because most physical objects can survive 100 years in good care. It is attainable, and yet we selected it because the design of mainstream digital storage mediums are nowhere close to even considering this mark."

lil.law.harvard.edu/century-scale-storage