💾💜 Digital ⚓️ Vagabond 💜💾
@b33tk33p3r.bsky.social
Digital preservation consultant. Systems analyst, software architect, domain specialist. Leipzig.

https://linktr.ee/ross.spencer
Pinned
Hello world! My colleague Andrea and I have been working on a mission statement for the Digital Preservation is People initiative. Read more about why #digipres is all about people, about the wider net #digipres can cast, and how we hope everyone can be part of this journey.

write.as/dpip/hello-w...
Hello world!
This is the first blog post of the new Digital Preservation is People initiative (DPIP) and the announcement of its 2024 Mission Statemen...
write.as
Reposted by 💾💜 Digital ⚓️ Vagabond 💜💾
Hero Mark Kelly on Kimmel Live

youtu.be/oZMqvm_owGU?...
Senator Mark Kelly on Trump Suggesting He Be Executed & Hegseth Opening an Investigation into Him
YouTube video by Jimmy Kimmel Live
youtu.be
November 26, 2025 at 7:45 AM
Is your scientist conscious? Are they muttering incomprehensible sentences? Check the thermometers in their lab and make sure they aren't suffering from nearly fatal mercury poisoning...

#NoAI #AI
can’t fucking catch a breath

make it stop
November 25, 2025 at 10:27 PM
Reposted by 💾💜 Digital ⚓️ Vagabond 💜💾
Hark! There's a planet! We could sail there through the void. There's an animal there like a prairie dog with insufficient hair. Do you want to watch the sunset there?
November 25, 2025 at 8:03 AM
Reposted by 💾💜 Digital ⚓️ Vagabond 💜💾
maybe i am going insane
November 24, 2025 at 6:49 PM
Reposted by 💾💜 Digital ⚓️ Vagabond 💜💾
"there are no third spaces anymore" wrong. blast furnace
November 25, 2025 at 12:59 AM
Reposted by 💾💜 Digital ⚓️ Vagabond 💜💾
Hey #digipres folks, over on the #ApplesauceFDC discord, someone's been working through how to archive punch cards (since they've got a large stack of them), and put together a documented #format for #punchcards […]
Original post on infosec.exchange
infosec.exchange
November 16, 2025 at 1:11 PM
Reposted by 💾💜 Digital ⚓️ Vagabond 💜💾
#FFmpeg version 8.0.1 (“Huffman”) – “a complete, cross-platform solution to record, convert and stream audio and video” – has been released early this morning.
#DigiPres #AVpres
FFmpeg
ffmpeg.org
November 20, 2025 at 6:15 AM
Reposted by 💾💜 Digital ⚓️ Vagabond 💜💾
ICYMI, our #OhioDIG Insta this week was a #KentStateLibraries take over!
Check it out: www.instagram.com/ohiodig/
And Ohio #digitization / #digipres folks, watch for a call soon to do your own take over!
#OhioArchivists #DigitalArchives
November 21, 2025 at 2:17 PM
November 24, 2025 at 8:38 AM
Reposted by 💾💜 Digital ⚓️ Vagabond 💜💾
UK Budget – is this the end of democracy? youtu.be/g0lEbH2kEw8
UK Budget – The End of Democracy?
YouTube video by Garys Economics
youtu.be
November 23, 2025 at 11:06 AM
Reposted by 💾💜 Digital ⚓️ Vagabond 💜💾
📣 Really proud to announce the publication of Reframing Failure in Digital Scholarship, an #OpenAccess collection of essays co-edited with @amsichani.bsky.social and published by @uolpress.bsky.social that examines the role of failure in #DH and research more broadly

@sas-news.bsky.social
Reframing Failure in Digital Scholarship - University of London Press
Failure is ordinary. From technological failures and computational obsolescence to rejected applications and challenging collaborations, failure is an unavoidable part of any scholarly endeavour. This...
uolpress.co.uk
November 20, 2025 at 9:26 AM
Reposted by 💾💜 Digital ⚓️ Vagabond 💜💾
Two weeks ago I gave a talk at Australian National Uni that included a list of things I would do with a Sands & Mac volume (1910) and .... THIS WAS ONE OF THEM
Love this so much
Good to hear today that my new Sands & Mac is already being used by front-of-house librarians at the SLV to help people with their family history queries. https://updates.timsherratt.org/2025/11/12/a-new-way-of-searching.html
In the fortnight I spent onsite at the State Library of Victoria, ‘Sands & Mac’ was mentioned many times. And no wonder. The Sands & McDougall’s directories are a goldmine for anyone researching family, local, or social history. They list thousands of names and addresses, enabling you to find individuals, and explore changing land use over time. When people ask the SLV’s librarians, ‘What can you tell me about the history of my house?’, Sands & Mac is one of the first resources consulted.

The SLV has digitised 24 volumes of Sands & Mac, one every five years from 1860 to 1974. You can browse the contents of each volume in the SLV image viewer, using the partial contents listing to help you find your way to sections of interest. To search the full text content you need to use the PDF version, either in the built-in viewer, or by downloading the PDF. There’s a handy guide to using Sands & Mac that explains the options.

**However, there’s currently no way of searching across all 24 volumes, so as part of my residency at the SLV LAB, I thought I’d make one!**

**Try it now!**

My new Sands & Mac database follows the pattern I’ve used previously to create fully-searchable versions of the NSW Post Office directories, Sydney telephone directories, and Tasmanian Post Office directories. Every line of text is saved to a database, so a single query searches for entries across all volumes. You can also use advanced search features like wildcards and boolean operators.

Search across all 24 volumes!

Once you’ve found a relevant entry you can view it in context, alongside a zoomable image of the page. You can even use Zotero to save individual entries to your own research database. This blog post from the Everyday Heritage project describes how the Tasmanian directories have been used to map Tasmania’s Chinese population.

View each entry in context! (Here's my Dad building his first house in Beaumaris in the 1950s.)
There’s still a few things I’d like to try, such as making use of the table of contents information for each volume. I’d also like to create some additional entry points to take users directly to listings for individual suburbs (maybe even streets!). Each volume has a directory of suburbs, so it would be a matter of extracting and cleaning the data and linking the entries to digitised pages. Certainly possible, but I don’t think I’ll have time to get it all done before the end of my residency. Perhaps I’ll try to get at least one volume done to demonstrate how it might work, and the value it would add. As I was writing this blog post I also realised there’s a dataset of businesses extracted from the Sands & Mac, so I need to think about how I can use that as well!

## Technical information follows…

I’ve documented the process I used to create fully-searchable versions of the Tasmanian and NSW directories in the GLAM Workbench. I followed a similar method for Sands and Mac, though with a few dead-ends and discoveries along the way.

### Downloading the PDFs

I assumed that it would be easiest to work from the PDF versions of each volume, as I’d done for Tasmania. So I set about finding a way to download them all. There’s only 24 volumes, so I _could_ have downloaded them manually, but where’s the fun in that?

I started with a CSV file listing the Sands & Mac volumes that I downloaded from the catalogue. This gave me the Alma identifiers for each volume. To download the PDFs I needed two more identifiers, the `IE` identifier assigned to each digitised item, and a file identifier that points to the PDF version of the item. The `IE` identifier can be extracted from the item’s MARC record, as I described in my post on exploring urls. The PDF file identifier was a bit more difficult to track down. The PDF links in the image viewer are generated dynamically, so the data had to be coming from somewhere.
Eventually I found that the viewer loaded a JSON file with all sorts of useful metadata in it! The url to download the JSON file is: `https://viewerapi.slv.vic.gov.au/?entity=[IE identifier]&dc_arrays=1`. In the `summary` section I found identifiers for `small_pdf` and `master_pdf`. I could then use these identifiers to construct urls to download the PDFs themselves: `https://rosetta.slv.vic.gov.au/delivery/DeliveryManagerServlet?dps_func=stream&dps_pid=[PDF id]`

Once I had the PDFs I used PyMuPDF to extract all the text and images. As I suspected the text wasn’t really fit for purpose. The OCR was ok, but the column structures were a mess. Because I wanted to index each entry individually, it was important to try and get the columns represented as accurately as possible. The images in the small PDFs were already bitonal, so I started feeding them to Tesseract to see if I could get better results. After a bit of tweaking, things were looking pretty good.

But when I came to compile all the data, I realised there was a potential problem matching the PDF pages to the images available through IIIF. I found one case where some pages were missing from the PDF, and another couple where the page order was different. As I was looking around for a solution, I realised that those JSON files I downloaded to get the PDF identifiers also included links to ALTO XML files that contain all the original OCR data (before it got mangled by the PDF formatting). There was one ALTO file for every page. Even better, the JSON linked the identifiers for the text and the image together – no more page mismatches!

### Downloading the ALTO files

Let’s start this again shall we. After wasting several days futzing about with the PDFs, I decided to download all the ALTO files and extract the text from them. As I downloaded each XML file, I also grabbed the corresponding image identifier from the JSON and included both identifiers in the file name for safe keeping.
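The two url patterns quoted above can be sketched in a few lines of Python. This is only an illustration, not the code from the notebooks: the function names are made up here, the identifiers in the usage note are placeholders, and the assumption that `summary` is a plain JSON object holding `small_pdf` and `master_pdf` keys is inferred from the description rather than checked against the live API.

```python
from urllib.parse import urlencode

# Base urls as quoted in the post.
VIEWER_API = "https://viewerapi.slv.vic.gov.au/"
DELIVERY = "https://rosetta.slv.vic.gov.au/delivery/DeliveryManagerServlet"


def viewer_json_url(ie_id):
    """Build the url of the viewer's JSON metadata for one IE identifier."""
    return f"{VIEWER_API}?{urlencode({'entity': ie_id, 'dc_arrays': 1})}"


def pdf_url(pdf_id):
    """Build the download url for a PDF file identifier."""
    return f"{DELIVERY}?{urlencode({'dps_func': 'stream', 'dps_pid': pdf_id})}"


def pdf_ids(viewer_json):
    """Pull the two PDF identifiers out of already-parsed viewer JSON.

    Assumption: 'summary' is a dict with 'small_pdf' and 'master_pdf'
    keys; a missing key comes back as None rather than raising.
    """
    summary = viewer_json.get("summary", {})
    return {key: summary.get(key) for key in ("small_pdf", "master_pdf")}
```

With a placeholder identifier such as `"IE0000000"`, `viewer_json_url("IE0000000")` gives the JSON url; feeding the parsed response to `pdf_ids()` and then `pdf_url()` gives the PDF link. Fetching is left out so the sketch makes no network calls.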
The ALTO files break the text down by block, line, and word. To extract the text, I just looped through every line, joining the words back together as a string, and writing the result to a new text file – one for each page. It’s worth noting that the ALTO files include _all_ the positional data generated by the OCR process, so you have the size and position of every word on every page. I just pulled out the text, but there are many more interesting things you could do…

### Assembling and publishing the database

From here on everything pretty much followed the pattern of the NSW and Tasmanian directories. I looped through each volume, page, and line of text, adding the text and metadata to a SQLite database using sqlite_utils. I then indexed the text for full-text searching. At the same time I populated a metadata file with titles, urls, and a few configuration details. The metadata file is used by Datasette to fill in parts of the interface.

I made some minor changes to the Datasette template I used for the other directories. In particular, I had to update the urls that loaded the IIIF images into the OpenSeadragon viewer. But it mostly just worked. It’s so nice to be able to reuse existing patterns!

Finally, I used Datasette’s `publish` command to push everything to Google Cloud Run. The final database contains details of more than 50,000 pages, and over 19 million lines of text! It weighs in at about 1.7 GB. The Cloud Run service will ‘scale to zero’ when not in use. This saves some money and resources, but means it can take a little while to spin up. Once it’s loaded, it’s very fast. My original post on the Tasmanian directories included a little note on costs, if you’re interested.
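The extraction and loading steps above can be sketched roughly as follows. This is not the code from the notebooks: it swaps in Python's built-in `sqlite3` for sqlite_utils, leaves out the full-text index, and the table layout, function names, and sample ALTO fragment are all invented for illustration. It relies only on the standard ALTO convention that each `TextLine` holds `String` elements with the word in a `CONTENT` attribute.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Made-up ALTO fragment for illustration only. Real files also carry the
# positional attributes (HPOS, VPOS, WIDTH, HEIGHT) for every word.
SAMPLE_ALTO = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout><Page><PrintSpace><TextBlock>
    <TextLine>
      <String CONTENT="Smith"/><SP/><String CONTENT="John,"/><SP/><String CONTENT="baker"/>
    </TextLine>
    <TextLine>
      <String CONTENT="Smith"/><SP/><String CONTENT="Mary,"/><SP/><String CONTENT="12"/><SP/><String CONTENT="High"/><SP/><String CONTENT="st"/>
    </TextLine>
  </TextBlock></PrintSpace></Page></Layout>
</alto>"""


def local_name(tag):
    """Strip the XML namespace so any ALTO schema version matches."""
    return tag.rsplit("}", 1)[-1]


def alto_lines(alto_xml):
    """Return one string per TextLine, joining its words with spaces.

    Pass bytes when the file starts with an XML encoding declaration.
    """
    root = ET.fromstring(alto_xml)
    lines = []
    for element in root.iter():
        if local_name(element.tag) != "TextLine":
            continue
        words = [
            child.attrib.get("CONTENT", "")
            for child in element
            if local_name(child.tag) == "String"
        ]
        text = " ".join(word for word in words if word)
        if text:
            lines.append(text)
    return lines


def load_page(db, volume, page, lines):
    """Store each line of a page as its own row (hypothetical schema)."""
    db.execute(
        "CREATE TABLE IF NOT EXISTS lines "
        "(volume TEXT, page INTEGER, line INTEGER, text TEXT)"
    )
    db.executemany(
        "INSERT INTO lines VALUES (?, ?, ?, ?)",
        [(volume, page, number, text) for number, text in enumerate(lines, start=1)],
    )


if __name__ == "__main__":
    db = sqlite3.connect(":memory:")
    load_page(db, "1910", 1, alto_lines(SAMPLE_ALTO))
    for row in db.execute("SELECT volume, page, line, text FROM lines"):
        print(row)
```

Keeping one row per line is what lets a single query search every volume at once; the full-text indexing and the Datasette metadata file described above would sit on top of a table like this.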
## More information

The notebooks I used are on GitHub:

* Download Sands and Mac PDFs and OCR text
* Load data from the Sands and Mac directories into an SQLite database (for use with Datasette)

Here are some posts about the NSW and Tasmanian directories:

* Making NSW Postal Directories (and other digitised directories) easier to search with the GLAM Workbench and Datasette (September 2022)
* From 48 PDFs to one searchable database – opening up the Tasmanian Post Office Directories with the GLAM Workbench (September 2022)
* Where’s 1920? Missing volume added to Tasmanian Post Office Directories! (September 2024)
* Six more volumes added to the searchable database of Tasmanian Post Office Directories! (November 2024)
updates.timsherratt.org
November 20, 2025 at 2:22 AM
Absolutely. The rhetoric that "these are violent criminal illegal aliens" is likely 1000% resulting in increased *something* on ICE officers.

SO STOP THE RHETORIC -- SEND ICE BACK HOME
ICE spox Tricia McLaughlin responds to a request for comment, 24 hours later, saying, “Brad Lander’s obsession with attacking the brave men and women of law enforcement, physically and rhetorically, must stop NOW.”
November 19, 2025 at 9:49 PM
Reposted by 💾💜 Digital ⚓️ Vagabond 💜💾
Few things bug me more than higher ed leaders saying that we lost our mission and lost the trust of the public, when we have actually been the target of a decades-long smear campaign by the right wing that worked. The moment we’re losing our mission is right now, in capitulation.
November 19, 2025 at 1:03 PM
GitHub being down is getting the same traction as Cloudflare being down... one is backed by definition by a distributed source control system, and I don't deny it *may* have been painful to lose the "service", but one of those systems gives YOU the tools to avoid it in future.

#GitHub #Cloudflare #Git
November 19, 2025 at 7:52 AM
Reposted by 💾💜 Digital ⚓️ Vagabond 💜💾
Jaw dropped. Trump Administration Removes Report on Missing and Murdered Native Americans, Calling It DEI Content
Trump Administration Removes Report on Missing and Murdered Native Americans, Calling It DEI Content - Oklahoma Watch
The Trump administration removed a congressionally mandated report on missing and murdered Native Americans from the DOJ website, citing compliance with an executive order against DEI. Senators who ch...
oklahomawatch.org
November 17, 2025 at 3:22 PM
Reposted by 💾💜 Digital ⚓️ Vagabond 💜💾
Wikimedia Commons provides the internet with 117 million freely usable images — and your photos could be there too.

If you're near Ashburton, join our Wikipedian at Large Dr Mike Dickison (@adzebill.bsky.social) for a free workshop.

10am–4pm, Sun 23 Nov @ Ashburton Art Gallery

Please register ⬇️
Wikipedia Photo Day with Dr. Mike Dickison
Join us for a day with Dr. Mike Dickison to learn all about Wikipedia photography, including how to take and upload your own photographs.
events.humanitix.com
November 17, 2025 at 9:09 PM
But it's your creepy orange skinned premier tearing down the East Wing of the building and blowing up people in #Venezuela while sitting on the #Epstein files because he knows they will extinguish the embers of his "legitimacy" once and for all...
'A House of Dynamite' but it's your uncles trying to deep fry a turkey.
November 17, 2025 at 3:46 PM
Reposted by 💾💜 Digital ⚓️ Vagabond 💜💾
Vote Labour
November 17, 2025 at 8:48 AM
Reposted by 💾💜 Digital ⚓️ Vagabond 💜💾
Jarvis Cocker croons, coos and dances his way through this career-spanning Tiny Desk with Pulp. n.pr/4pbIgkn
Pulp: Tiny Desk Concert
Jarvis Cocker croons, coos and dances his way through this career-spanning Tiny Desk with Pulp.
n.pr
November 13, 2025 at 1:59 PM
Reposted by 💾💜 Digital ⚓️ Vagabond 💜💾
epstein emails released and everyone coming out as a pedophile to defend donald trump
November 16, 2025 at 11:22 PM
Reposted by 💾💜 Digital ⚓️ Vagabond 💜💾
I feel like the pope being an American cinephile is God finally rewarding Scorsese for a lifetime of devotion
Pope Leo absolutely cooking
November 17, 2025 at 4:35 AM
Reposted by 💾💜 Digital ⚓️ Vagabond 💜💾
A few demos running on day one of the Retro Computer Fest in Cambridge. v5 of the Core War simulated annealing script on a Milk-V MARS, Core War on the FUSE #zxspectrum emulator on HaikuOS and Core War genetic algorithms on a Raspberry Pi 3B #retrofest #corewar
November 16, 2025 at 8:00 AM
Watching a gameshow in the #UK

HOST: what will you do with the money?

GUEST: I can make social welfare a reality and take sensible maternity leave and support my parents in their healthcare.

It’s not a labor/con thing. Neither will provide this.

We need a new left.

It’s grim out there folks.
November 15, 2025 at 9:02 PM
Trump: there is no list
Also Trump: here’s part of the list.

Sorry y’all are going through this America but they can’t sustain this. Hopefully the course will be corrected soon.

Release the list!

#Epstein
November 15, 2025 at 7:23 AM