Andi
@andimara.bsky.social
790 followers 130 following 49 posts
Multimodal research @huggingface
andimara.bsky.social
New Blog📖✨:
nanoVLM: The simplest way to train your own Vision-Language Model in pure PyTorch explained step-by-step!
Easy to read, even easier to use. Train your first VLM today!
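If you're curious what the "pure PyTorch" core of such a model looks like, here is a minimal structural sketch of the usual recipe (vision encoder -> projector -> decoder-only LM). The class and submodule interfaces are illustrative, not nanoVLM's actual API:

```python
import torch
import torch.nn as nn

class TinyVLM(nn.Module):
    """Illustrative VLM skeleton: a vision encoder feeds patch embeddings through a
    linear projector into the embedding space of a decoder-only language model.
    In nanoVLM the real pieces are a SigLIP-style encoder and a SmolLM2-style decoder;
    here they are injected as generic modules."""

    def __init__(self, vision_encoder: nn.Module, decoder: nn.Module,
                 token_embedding: nn.Embedding, vision_dim: int, text_dim: int):
        super().__init__()
        self.vision_encoder = vision_encoder              # image -> (B, n_patches, vision_dim)
        self.projector = nn.Linear(vision_dim, text_dim)  # patch features -> LM embedding space
        self.token_embedding = token_embedding            # the LM's input embedding table
        self.decoder = decoder                            # (B, seq_len, text_dim) -> (B, seq_len, vocab) logits

    def forward(self, pixel_values: torch.Tensor, input_ids: torch.Tensor) -> torch.Tensor:
        patches = self.vision_encoder(pixel_values)    # encode the image into patch features
        image_embeds = self.projector(patches)         # map them into the text embedding space
        text_embeds = self.token_embedding(input_ids)  # embed the prompt tokens
        # The image tokens are simply prepended to the text tokens; the decoder then
        # models the combined sequence and predicts the answer autoregressively.
        inputs_embeds = torch.cat([image_embeds, text_embeds], dim=1)
        return self.decoder(inputs_embeds)
```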
andimara.bsky.social
Real-time SmolVLM in a web browser with transformers.js.

All running locally with no installs. Just open the website.
andimara.bsky.social
📱 Real-world Efficiency: We've created an app using SmolVLM on an iPhone 15 and got real-time inference directly from its camera!
🌐 Browser-based Inference? Yep! We get lightning-fast inference speeds of 40-80 tokens per second directly in a web browser. No tricks, just compact, efficient models!
andimara.bsky.social
🌟 State-of-the-Art Performance: SmolVLM comes in three powerful yet compact sizes (256M, 500M, and 2.2B parameters), each setting a new SOTA for its hardware budget in image and video understanding.
andimara.bsky.social
✨ Less CoT, more efficiency: Turns out, too much Chain-of-Thought (CoT) data actually hurts performance in small models. It just dumbs them down.
✨ Longer videos, better results: Increasing video length during training enhanced performance on both video and image tasks.
andimara.bsky.social
✨ System prompts and special tokens are key: Introducing system prompts and dedicated media intro/outro tokens significantly boosted our compact VLM’s performance—especially for video tasks.
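To make the intro/outro idea concrete, here is a rough sketch of how such a prompt could be assembled. The system prompt text and the token strings below are placeholders, not SmolVLM's actual special tokens:

```python
# Placeholder token strings; SmolVLM's actual special tokens differ.
SYSTEM_PROMPT = "You are a helpful assistant that answers questions about images and videos."
MEDIA_INTRO = "<media_intro>"   # marks where the visual content starts
MEDIA_OUTRO = "<media_outro>"   # marks where it ends
IMAGE_TOKEN = "<image>"         # expanded by the processor into the visual token slots

def build_prompt(question: str, num_frames: int = 1) -> str:
    """Assemble a prompt: system prompt, then an intro token, the image/frame
    placeholders, an outro token, and finally the user's question."""
    media = " ".join([IMAGE_TOKEN] * num_frames)
    return (
        f"{SYSTEM_PROMPT}\n"
        f"User: {MEDIA_INTRO} {media} {MEDIA_OUTRO} {question}\n"
        f"Assistant:"
    )

print(build_prompt("What is happening in this video?", num_frames=8))
```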
andimara.bsky.social
✨ Pixel shuffling magic: Aggressive pixel shuffling helped our compact VLMs "see" better, giving the same performance with sequences 16x shorter (see the sketch after this list)!
✨ Learned positional tokens FTW: For compact models, learned positional tokens significantly outperform raw text tokens, enhancing efficiency and accuracy.
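The pixel shuffle mentioned above is a space-to-depth rearrangement: it trades spatial resolution of the visual token grid for channel depth, so the LM sees far fewer (but richer) image tokens. A minimal PyTorch sketch of the idea; the ratio of 4 is chosen only to match the 16x figure, and SmolVLM's actual tensor layout may differ:

```python
import torch

def pixel_shuffle(x: torch.Tensor, ratio: int) -> torch.Tensor:
    """Space-to-depth over a grid of visual tokens.

    x: (batch, height, width, dim) patch embeddings from the vision encoder.
    Returns (batch, height//ratio, width//ratio, dim * ratio**2): ratio**2 fewer
    tokens, each carrying ratio**2 times more channels. A linear projector then
    maps these fatter tokens back to the language model's embedding size.
    """
    b, h, w, d = x.shape
    x = x.reshape(b, h // ratio, ratio, w // ratio, ratio, d)
    x = x.permute(0, 1, 3, 2, 4, 5)        # group each ratio x ratio block of neighbours together
    return x.reshape(b, h // ratio, w // ratio, d * ratio * ratio)

x = torch.randn(1, 32, 32, 768)            # 1024 visual tokens of width 768
y = pixel_shuffle(x, ratio=4)
print(y.shape)                             # torch.Size([1, 8, 8, 12288]) -> only 64 tokens, 16x fewer
```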
andimara.bsky.social
✨ Smaller is smarter with SigLIP: Surprise! Smaller LLMs didn't benefit from the usual large SigLIP (400M). Instead, we use the 80M SigLIP base, which performs equally well at just 20% of the original size!
andimara.bsky.social
Here are the coolest insights from our experiments:
✨ Longer context = Big wins: Increasing the context length from 2K to 16K gave our tiny VLMs a 60% performance boost!
andimara.bsky.social
Today, we share the tech report for SmolVLM: Redefining small and efficient multimodal models.
🔥 Explaining how to create a tiny 256M VLM that uses less than 1GB of RAM and outperforms our 80B models from 18 months ago!
huggingface.co/papers/2504....
Paper page - SmolVLM: Redefining small and efficient multimodal models
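For reference, running one of the small SmolVLM checkpoints locally through transformers looks roughly like this; the checkpoint name and prompt are assumptions on my part, so check the model card for the exact snippet:

```python
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq

# Model id is an assumption; see the SmolVLM collection on the Hub for the exact names.
model_id = "HuggingFaceTB/SmolVLM-256M-Instruct"

processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForVision2Seq.from_pretrained(model_id, torch_dtype=torch.bfloat16)

image = Image.open("example.jpg")
messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "Describe this image."},
    ]},
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt")

generated = model.generate(**inputs, max_new_tokens=128)
print(processor.batch_decode(generated, skip_special_tokens=True)[0])
```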
andimara.bsky.social
At only 256M parameters, SmolDocling outperforms much larger models on key document conversion tasks:
🖋️ Full-page transcription: Beats models 27× bigger!
📑 Equations: Matches or beats leading models like GOT
💻 Code recognition: We introduce the first benchmark for code OCR
andimara.bsky.social
What makes it unique?
📌 Handles everything a document has: tables, charts, code, equations, lists, and more
📌 Works beyond scientific papers—supports business docs, patents, and forms
📌 It runs with less than 1GB of RAM, so running at large batch sizes is super cheap!
andimara.bsky.social
How does SmolDocling beat models 27× bigger? SmolDocling transforms any document into structured metadata with DocTags and is SOTA in:

✅ Full-page conversion
✅ Layout identification
✅ Equations, tables, charts, plots, code OCR
andimara.bsky.social
🚀 We just dropped SmolDocling: a 256M open-source vision LM for complete document OCR! 📄✨
Lightning fast: it processes a page in 0.35 sec on a consumer GPU using < 500 MB of VRAM ⚡
SOTA in document conversion, beating every competing model we tested (including models with 27x more parameters) 🤯
But how? 🧶⬇️
andimara.bsky.social
Extremely bullish on @CohereForAI's Aya Vision (8B & 32B) - new SOTA open-weight VLMs

- 8B wins up to 81% of the time in its class, better than Gemini Flash
- 32B beats Llama 3.2 90B!
- Integrated on @hf.co from Day 0!

Check out their blog! huggingface.co/blog/aya-vis...
A Deepdive into Aya Vision: Advancing the Frontier of Multilingual Multimodality
andimara.bsky.social
Me too! Highlight of my career so far :)
andimara.bsky.social
And that was why we didn't release this before. It's live research code: most of it gets rewritten fairly often, while some parts have been the same for years.
It works and manages to produce SOTA results at 256M and 80B scales, but it's not production code.
Go check it out:
github.com/huggingface/...
smollm/vision at main · huggingface/smollm
Everything about the SmolLM2 and SmolVLM family of models - huggingface/smollm
andimara.bsky.social
And it also has a bunch of bugs like this one in our modeling_vllama3.py file. We start from a pretrained LLM, but for some reason the weights of the head are not loaded from the model. I still don't know why :(
andimara.bsky.social
The codebase is full of interesting insights like this one in our dataset.py file: How do you get reproducible randomness in different processes across different machines?
Start different random number generators based on a tuple (seed, rank)!
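A minimal sketch of that trick; the function below is illustrative, while the real dataset.py wires these generators into the dataloader:

```python
import random
import numpy as np
import torch

def make_rank_generators(seed: int, rank: int):
    """Build per-process RNGs deterministically derived from (seed, rank).

    Every rank gets a different but reproducible stream: re-running with the same
    seed reproduces the exact same shuffling/augmentation decisions on every
    machine, without all ranks drawing identical numbers.
    """
    mixed = seed * 100_003 + rank                      # simple deterministic mix of the (seed, rank) tuple
    py_rng = random.Random(mixed)                      # Python-level randomness (e.g. list shuffling)
    np_rng = np.random.default_rng(mixed)              # NumPy randomness (e.g. index sampling)
    torch_rng = torch.Generator().manual_seed(mixed)   # torch ops that accept a generator argument
    return py_rng, np_rng, torch_rng

# rank would come from torch.distributed.get_rank() in a real multi-GPU job
py_rng, np_rng, torch_rng = make_rank_generators(seed=42, rank=0)
print(py_rng.randint(0, 9),
      np_rng.integers(0, 10),
      torch.randint(0, 10, (1,), generator=torch_rng).item())
```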
andimara.bsky.social
After training, you can run the evaluation on all of these tasks with:
sbatch vision/experiments/evaluation/vloom/async_evals_tr_346/run_evals_0_shots_val_2048.slurm