William J.B. Mattingly
@wjbmattingly.bsky.social
470 followers · 100 following · 300 posts
Digital Nomad · Historian · Data Scientist · NLP · Machine Learning · Cultural Heritage Data Scientist at Yale · Former Postdoc at the Smithsonian · Maintainer of Python Tutorials for Digital Humanities · https://linktr.ee/wjbmattingly
wjbmattingly.bsky.social
🚨Job ALERT🚨! My old postdoc is available!

I cannot emphasize enough how life-altering this position was for me. It gave me the experience that I needed for my current role. As a postdoc, I was able to define my projects and acquire a lot of new skills as well as refine some I already had.
Reposted by William J.B. Mattingly
bcgl.bsky.social
Excited to be co-editing a special issue of @dhquarterly.bsky.social on Artificial Intelligence for Digital Humanities: Research problems and critical approaches
dhq.digitalhumanities.org/news/news.html

We're inviting abstracts now - please feel free to reach out with any questions!
wjbmattingly.bsky.social
Ahh no worries!! Thanks! I hope you had a nice vacation
wjbmattingly.bsky.social
Something I've realized over the last couple weeks with finetuning various VLMs is that we just need more data. Unfortunately, that takes a lot of time. That's why I'm returning to my synthetic HTR workflow. This will be packaged now and expanded to work with other low-resource languages. Stay tuned
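A synthetic HTR workflow of this kind generally pairs rendered line images with their ground-truth text. The sketch below is a generic illustration, not the actual package mentioned above; the font path and example lines are placeholders.

```python
# Generic sketch of synthetic HTR data generation: render ground-truth text
# onto line images so (image, transcription) pairs can be used for finetuning.
# Font path and example lines are placeholders.
from pathlib import Path
from PIL import Image, ImageDraw, ImageFont

def render_line(text: str, font_path: str, font_size: int = 48) -> Image.Image:
    """Render a single line of text onto a white canvas with a small margin."""
    font = ImageFont.truetype(font_path, font_size)
    _, _, right, bottom = font.getbbox(text)
    img = Image.new("RGB", (right + 40, bottom + 40), "white")
    ImageDraw.Draw(img).text((20, 20), text, font=font, fill="black")
    return img

Path("synthetic").mkdir(exist_ok=True)
lines = ["in principio erat verbum", "et verbum erat apud deum"]  # placeholder text
for i, line in enumerate(lines):
    render_line(line, "fonts/medieval.ttf").save(f"synthetic/{i:05d}.png")
```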
wjbmattingly.bsky.social
No problem! It's hard to fit a good answer in 300 characters =) Feel free to DM me any time.
wjbmattingly.bsky.social
Also, whether you are doing a full finetune vs. LoRA adapters is another thing to consider. It also depends on the model arch.
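For context on the full-finetune-vs-LoRA distinction: a LoRA run only trains small adapter matrices injected into selected layers. A minimal, generic sketch with Hugging Face peft follows; the base model id and target_modules are assumptions and, as noted in the post, architecture-dependent.

```python
# Generic sketch of the LoRA-adapter alternative to a full finetune, using
# Hugging Face peft. Base model id and target_modules are placeholders.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("some/base-model")  # placeholder id
lora_cfg = LoraConfig(
    r=16,                                  # adapter rank
    lora_alpha=32,                         # scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # architecture-dependent choice
)
model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()  # only a small fraction of weights train
```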
wjbmattingly.bsky.social
I hate saying this, but it's true: it depends. For line-level medieval Latin (out of scope, but a small problem size), 1-3k examples seem to be fine. For page-level, out-of-scope problems, it becomes much more challenging and very model-dependent: 1-10k in my experience.
wjbmattingly.bsky.social
I've been getting asked for training scripts whenever a new VLM drops. Instead of scripts, I'm going to start updating this new Python package. It's not fancy. It's for full finetunes. This was how I first trained Qwen 2 VL last year.
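The package itself isn't named in the post, but a bare-bones full finetune of a VLM like Qwen2-VL with transformers roughly takes this shape. This is a sketch only: the checkpoint id, prompt, and collator details are assumptions, and dataset loading is omitted.

```python
# Bare-bones sketch of a full VLM finetune (every parameter trainable).
# Checkpoint id, prompt, and dataset handling are placeholders; train_dataset
# is assumed to yield {"image": PIL.Image, "text": str} items.
import torch
from transformers import (AutoProcessor, Qwen2VLForConditionalGeneration,
                          Trainer, TrainingArguments)

model_id = "Qwen/Qwen2-VL-2B-Instruct"  # placeholder checkpoint
processor = AutoProcessor.from_pretrained(model_id)
model = Qwen2VLForConditionalGeneration.from_pretrained(model_id, torch_dtype=torch.bfloat16)

def collate(batch):
    """Build chat-formatted (image, transcription) training examples."""
    texts, images = [], []
    for ex in batch:
        messages = [
            {"role": "user", "content": [
                {"type": "image"},
                {"type": "text", "text": "Transcribe this line."}]},
            {"role": "assistant", "content": [
                {"type": "text", "text": ex["text"]}]},
        ]
        texts.append(processor.apply_chat_template(messages, tokenize=False))
        images.append(ex["image"])
    inputs = processor(text=texts, images=images, return_tensors="pt", padding=True)
    inputs["labels"] = inputs["input_ids"].clone()  # simple full-sequence loss
    return inputs

args = TrainingArguments(
    output_dir="vlm-full-finetune",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    learning_rate=1e-5,
    num_train_epochs=1,
    bf16=True,
)
# train_dataset: any sequence of {"image": ..., "text": ...} dicts (not shown here)
trainer = Trainer(model=model, args=args, data_collator=collate, train_dataset=train_dataset)
trainer.train()
```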
wjbmattingly.bsky.social
Let's go! Training LFM2-VL 1.6B on the Catmus dataset on @hf.co now. Will start posting some benchmarks on this model soon.
wjbmattingly.bsky.social
Training on the full Catmus now, and the results after the first checkpoint are very promising. Character-level and massive word-level improvements.
wjbmattingly.bsky.social
LiquidAI cooked with LFM2-VL. At the risk of sounding like an X AI influencer, don't sleep on this model. I'm finetuning right now on Catmus. A small overnight test on only 3k examples is showing remarkable improvement. Training now on 150k samples. I see this as potentially replacing TrOCR.
wjbmattingly.bsky.social
New super lightweight VLM just dropped from Liquid AI in two flavors: 450M and 1.6B. Both models can work out-of-the-box with medieval Latin at the line level. I'm fine-tuning on Catmus/medieval right now on an h200.
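For anyone wanting a look at the data referenced in these posts, the line-level dataset can be inspected on the Hub before committing to a run. The split name is an assumption, and the column names are printed rather than guessed.

```python
# Quick inspection of the line-level dataset mentioned in the post.
# Dataset id as referenced in the post; check the dataset card for details.
from datasets import load_dataset

ds = load_dataset("CATMuS/medieval", split="train")
print(len(ds), "line images")
print(ds.column_names)  # which fields hold the image and the transcription
print(ds[0])
```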
Reposted by William J.B. Mattingly
aboutgeo.bsky.social
With #IMMARKUS, you can already use popular AI services for image transcription. Now, you can also use them for translation! Transcribe a historic source, select the annotation—and translate it with a click.
wjbmattingly.bsky.social
GLM-4.5V with line-level transcription of medieval Latin in Caroline minuscule. Inference was run through @hf.co Inference via Novita.
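For reference, calling a hosted VLM through Hugging Face Inference Providers with Novita as the provider looks roughly like this. The model repo id, image URL, and prompt below are assumptions, not the exact call used for the post.

```python
# Sketch of remote inference through Hugging Face Inference Providers with
# Novita as the provider. Model id, image URL, and prompt are placeholders.
from huggingface_hub import InferenceClient

client = InferenceClient(provider="novita")  # reads HF_TOKEN from the environment
response = client.chat_completion(
    model="zai-org/GLM-4.5V",  # assumed repo id for GLM-4.5V
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url",
             "image_url": {"url": "https://example.com/caroline-minuscule-line.jpg"}},
            {"type": "text",
             "text": "Transcribe this line of medieval Latin exactly as written."},
        ],
    }],
    max_tokens=256,
)
print(response.choices[0].message.content)
```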
wjbmattingly.bsky.social
Qwen3-4B Thinking finetune nearly ready to share. It can convert unstructured natural language, non-LinkedArt JSON, and HTML into LinkedArt JSON.
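For readers who haven't met the target format: LinkedArt (Linked Art) is a JSON-LD profile for cultural heritage data. A minimal hand-written record, shown here as a Python dict purely for illustration (not model output):

```python
# Illustrative minimal Linked Art-style JSON-LD record, hand-written for
# context on the target format; not output from the finetuned model.
import json

record = {
    "@context": "https://linked.art/ns/v1/linked-art.json",
    "type": "HumanMadeObject",
    "_label": "Illuminated manuscript leaf",
    "identified_by": [{"type": "Name", "content": "Illuminated manuscript leaf"}],
}
print(json.dumps(record, indent=2))
```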
wjbmattingly.bsky.social
I need to get back to my Voynich work soon! I will finally have time in a couple months I think.
Reposted by William J.B. Mattingly
wjbmattingly.bsky.social
Hmm, I think in those scenarios it may default to character parsing, but it wouldn't leverage the language model component very well. If you have some examples, I can test them out and see what happens.
wjbmattingly.bsky.social
Good question! I think it could handle the layout parsing aspect with enough training data (maybe 2k pages?). The problem is where to put the HTR/OCR output for reading order. Also, the quality of the HTR/OCR will depend on the language. Is this for medieval Latin?
wjbmattingly.bsky.social
5. Overall, this is a great model and at 1.7B I am seriously amazed at how well it handles two complex tasks (layout parsing and OCR/HTR) in tandem.
wjbmattingly.bsky.social
4. Getting Dots.OCR to learn the features and syntax of a new language is a daunting task. I have 2.3k pages (some bi-paginal) of Old Church Slavonic. This is an entirely unsupported language. It started to learn some of the new characters (ligatures) but struggled with the syntax.