Daniel van Strien
@danielvanstrien.bsky.social
4.3K followers 2.4K following 270 posts
Machine Learning Librarian at @hf.co
Reposted by Daniel van Strien
danielvanstrien.bsky.social
DoTS.ocr just got native vLLM support!

I built a UV script so you can run SOTA multilingual OCR in seconds with zero setup using @hf.co Jobs

Tested on 1800s library cards - works great ✨
Screenshot of an index card with annotated bounding box predictions from the OCR model, and a screenshot of a code command.
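For local experimentation outside of Jobs, a minimal vLLM sketch along these lines should work; the repo id, image URL, and prompt below are illustrative assumptions, not taken from the post or the script.

from vllm import LLM, SamplingParams

# Assumed dots.ocr repo id; swap in the checkpoint you actually use.
llm = LLM(model="rednote-hilab/dots.ocr")

# OpenAI-style chat messages; vLLM applies the model's chat template.
messages = [{
    "role": "user",
    "content": [
        {"type": "image_url", "image_url": {"url": "https://example.org/library-card.jpg"}},
        {"type": "text", "text": "Extract the text and layout from this document."},
    ],
}]

outputs = llm.chat(messages, SamplingParams(temperature=0.0, max_tokens=2048))
print(outputs[0].outputs[0].text)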
danielvanstrien.bsky.social
Also uploaded related datasets for index cards bsky.app/profile/dani...
danielvanstrien.bsky.social
Card catalogues aren't just a relic of the past - many institutions still rely on them because full migration is too expensive. VLMs could help change that.

I uploaded two new @hf.co datasets (~470K cards) for training/evaluating models to extract structured metadata from catalogue cards.
Picture of a digitised index card.
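To poke at the cards, a load_dataset sketch like this should work; the repo id below is a placeholder since the post doesn't give the exact dataset names, and the column names may differ from the published schema.

from datasets import load_dataset

# Placeholder repo id; substitute one of the two catalogue-card datasets.
cards = load_dataset("biglam/catalogue-cards", split="train", streaming=True)

# Inspect the first record; expect a card image plus any existing metadata fields.
first = next(iter(cards))
print(first.keys())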
Reposted by Daniel van Strien
jay.bsky.team
We’re hiring for two machine learning roles. A chance to do cutting edge things with ML to make this place a lot more personalized.

jobs.gem.com/bluesky/am9i...
Bluesky Jobs
jobs.gem.com
danielvanstrien.bsky.social
Let me know if you think it would be worth adding more context about that in the dataset card!
danielvanstrien.bsky.social
New @hf.co BigLAM dataset: 9,363 OA books with page images + rich MARC metadata for evaluating (and training) VLMs on metadata extraction.

Libraries are starting to explore AI-assisted cataloguing, but we lack public evaluation data. Hoping this helps fill that gap.

huggingface.co/datasets/big...
Screenshot of the dataset viewer showing a column of marc data + the first few pages of an open access monograph
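A quick way to peek at how the MARC records pair with page images; the full repo id is truncated above, so the name here is a placeholder and the column names depend on the published schema.

from datasets import load_dataset

# Placeholder repo id; see the truncated huggingface.co/datasets/big... link for the real one.
books = load_dataset("biglam/oa-books-marc", split="train", streaming=True)

record = next(iter(books))
print(record.keys())  # e.g. MARC metadata fields plus page images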
Reposted by Daniel van Strien
danielvanstrien.bsky.social
Blogged: Fine-tuning a VLM for art history in hours, not weeks

iconclass-vlm generates museum catalog codes (fun fact: "71H7131" = "Bathsheba with David's letter"!)

@hf.co TRL + Jobs = magic ✨

Guide here: danielvanstrien.xyz/posts/2025/i...
danielvanstrien.xyz
danielvanstrien.bsky.social
I fine-tuned a smol VLM to generate specialized art history metadata!

iconclass-vlm: Qwen2.5-VL-3B trained using SFT to generate ICONCLASS codes (think Dewey Decimal for art!)

Trained with @hf.co TRL + Jobs - single UV script, no GPU needed!

Blog soon!
Screenshot of the iconclass-vlm model demo showing predictions for a 17th century portrait painting of a standing woman in black dress with white ruff collar. The interface displays the model's raw JSON prediction with ICONCLASS codes, then compares predictions against ground truth labels in two columns. Model correctly identifies "31A231 standing figure" and "61B(+55) historical persons (portraits and scenes from the life) (+ full length portrait)" among others, achieving 3 out of 6 matches. Some predictions marked as "Not a valid iconclass label" showing areas where the model needs improvement.
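A minimal inference sketch using the transformers image-text-to-text pipeline; the repo id, image URL, and prompt here are assumptions for illustration, so check the model card for the actual usage.

from transformers import pipeline

# Assumed repo id for the fine-tuned model.
pipe = pipeline("image-text-to-text", model="davanstrien/iconclass-vlm")

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "url": "https://example.org/portrait.jpg"},
        {"type": "text", "text": "Generate ICONCLASS codes for this image."},
    ],
}]

result = pipe(text=messages, max_new_tokens=128)
print(result[0]["generated_text"])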
danielvanstrien.bsky.social
Try it with one line of code via Jobs!

It processes images from any dataset and outputs a new dataset with extracted markdown - all using HF GPUs.

See the full OCR uv scripts collection: huggingface.co/datasets/uv-...
Screenshot of a hf jobs uv run command with some flags and a URL pointing to a script.
danielvanstrien.bsky.social
What if OCR models could show you their thought process?

NuMarkdown-8B-Thinking from NuMind (YC S22) doesn't just extract text - it reasons through documents first.

Could be pretty valuable for weird historical documents?

Example here: davanstrien-ocr-time-capsule.static.hf.space/index.html?d...
Screenshot of an app showing an image from a page + model reasoning showing how the model is parsing the text and layout.
danielvanstrien.bsky.social
You can now generate synthetic data using OpenAI's GPT OSS models on @hf.co Jobs!

One command, no setup:

hf jobs uv run --flavor l4x4 [script-url] \
--input-dataset your/dataset \
--output-dataset your/output

Works on L4 GPUs ⚡

huggingface.co/datasets/uv-...
uv-scripts/openai-oss · Datasets at Hugging Face
huggingface.co
danielvanstrien.bsky.social
I’m continuing my experiments with VLM-based OCR…

How well do these models handle Victorian theatre playbills from @bldigischol.bsky.social?

RolmOCR vs traditional OCR on tricky playbills (ornate fonts, faded ink, DRAMATIC ALL CAPS!)

@hf.co Demo: huggingface.co/spaces/davan...
Screenshot of a playbill with some OCR results on the right.
danielvanstrien.bsky.social
It's often not documented, but "traditional" OCR in this case is whatever libraries and archives used in the past to generate some OCR. My goal with this work is mainly to see how much better VLMs might be (and in which situations), to get a better sense of when redoing OCR might be worth it.
Reposted by Daniel van Strien
danielvanstrien.bsky.social
Many VLM-based OCR models have been released recently. Are they useful for libraries and archives?

I made a quick Space to compare VLM OCR with "traditional" OCR using 11k Scottish exam papers from @natlibscot.bsky.social

huggingface.co/spaces/davanstrien/ocr-time-capsule
Screenshot of the app showing a page from a book + different views of existing and new ocr.
danielvanstrien.bsky.social
I'm planning to add more example datasets & OCR models using HF Jobs. Feel free to suggest collections to test with: I need image + existing OCR!

Even better: upload your GLAM datasets to @hf.co! 🤗
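If you're wondering what shape works for the comparison Space, here's a hedged sketch of a minimal image + existing-OCR dataset push; the column names and repo id are illustrative, not a required schema.

from datasets import Dataset, Image

# Two columns: the page image and whatever OCR text already exists for it.
ds = Dataset.from_dict({
    "image": ["scans/page_001.jpg", "scans/page_002.jpg"],
    "ocr_text": ["existing OCR for page 1...", "existing OCR for page 2..."],
}).cast_column("image", Image())

ds.push_to_hub("your-org/your-glam-dataset")  # illustrative repo id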