The goal of an effective document parser SHOULD be to help with the hardest problems. A low-quality scan of a 1926 Norwegian document full of statistical tables is the perfect test doc 🔥
✓ Full methodology
✓ Corrected OCRBench v2 ground truth
✓ Comparative analysis across all major providers
Read the full benchmark: tlake.link/benchmarks
Stop benchmarking vanity metrics. Start measuring what breaks.
Tensorlake: 86.8% TEDS, 91.7% F1
AWS Textract: 80.7% TEDS, 88.4% F1
Azure: 78.1% TEDS, 88.1% F1
Docling: 63.8% TEDS, 68.9% F1
The gap? 670 fewer manual reviews per 10k documents.
TEDS (Tree Edit Distance Similarity): Measures if tables stay tables
JSON F1: Measures if downstream systems can use the output
Not "is the text similar?" but "can automation actually work?"
But your production failures come from:
- Collapsed tables
- Jumbled reading order
- Missing visual content
- Hallucinated extractions
None of this shows up in your scores.
tlake.link/notebooks/vl...
Shows how to extract cryptocurrency metrics from 10-Ks and 10-Qs using page classification
Full changelog: tlake.link/changelog/vlm
What would you build with this?
📄 Page Classification: Large docs, specific sections needed
📊 Table/Figure Summarization: Visual data in reports
⚡ skip_ocr=True: For complex reading order, diagrams, and scanned docs
Text extraction still uses OCR for best quality
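A minimal sketch of how that decision might be wired into a parse call. Only the skip_ocr flag comes from this post; the client and method names are assumptions, so treat this as pseudocode for the routing logic rather than the real SDK surface:

```python
# Hypothetical sketch -- the client and method names are assumptions;
# only the skip_ocr flag comes from the post above.
def parse_document(client, path: str, *, scanned: bool, complex_layout: bool):
    if scanned or complex_layout:
        # Diagrams, scans, tangled reading order: let the VLM read pages directly.
        return client.parse(path, skip_ocr=True)
    # Default path: OCR-backed text extraction for best quality.
    return client.parse(path)
```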
- 1,500+ total pages
- 427 relevant pages identified by VLM
- Processing time: 5 minutes → 45 seconds per document
All without sacrificing accuracy
Example: Extracting crypto holdings from SEC filings
1. VLM classifies which pages contain financial data (~50 out of 200 pages)
2. Extract only from relevant pages
3. Skip ~75% of the processing
Result: 80-90% faster ⚡
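The same flow as code. classify_pages and extract here are hypothetical stand-ins, not confirmed API calls:

```python
# Hypothetical sketch of the classify-then-extract flow above.
# classify_pages / extract are stand-ins, not confirmed API calls.
def extract_crypto_holdings(client, filing_pdf: str):
    # 1. VLM tags each page, e.g. {"page": 12, "label": "financial_data"}.
    labels = client.classify_pages(filing_pdf, classes=["financial_data", "other"])

    # 2. Keep only the pages that matter (~50 of 200 in the example above).
    relevant = [p["page"] for p in labels if p["label"] == "financial_data"]

    # 3. Extract from those pages alone; the other ~75% are never parsed.
    return client.extract(filing_pdf, pages=relevant)
```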
Traditional approach:
OCR everything → Convert to text → Search → Extract
This wastes time processing irrelevant pages
Live now in our API, SDK, and Cloud.
- <del> tags for deletions
- <ins> tags for insertions
- <span class="comment"> for reviewer notes
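Because those tags are plain HTML, standard tooling can split accepted text from redlines downstream. A sketch with Python's built-in parser (the sample markup is invented; only the three tags come from above):

```python
# Sketch: separating insertions, deletions, and comments from parsed output.
# The sample markup is invented; the three tags come from the post above.
from html.parser import HTMLParser

class RedlineParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.mode = "text"
        self.buckets = {"text": [], "del": [], "ins": [], "comment": []}

    def handle_starttag(self, tag, attrs):
        if tag in ("del", "ins"):
            self.mode = tag
        elif tag == "span" and ("class", "comment") in attrs:
            self.mode = "comment"

    def handle_endtag(self, tag):
        self.mode = "text"

    def handle_data(self, data):
        self.buckets[self.mode].append(data)

p = RedlineParser()
p.feed('Fee is <del>5%</del><ins>4%</ins> <span class="comment">confirm with legal</span>')
print(p.buckets["ins"])  # ['4%']
```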
→ RAG pipelines (better chunking)
→ Knowledge graphs (accurate trees)
→ Document navigation
→ Table of contents generation
Changelog: tlake.link/changelog/he...
Try it: tlake.link/notebooks/he...
- level: 0 for #, 1 for ##, 2 for ###, etc.
- content: clean text
- proper nesting for up to 6 levels
Enable with:
cross_page_header_detection=True
That's it.
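With level and content on every header, a table of contents is a small fold over the list. A sketch (the fragment shape comes from the bullets above; the rest is illustrative):

```python
# Sketch: nest a flat list of {level, content} header fragments into a tree.
def build_toc(headers: list[dict]) -> list[dict]:
    root, stack = [], []  # stack holds (level, children_list) of open sections
    for h in headers:
        node = {"title": h["content"], "children": []}
        while stack and stack[-1][0] >= h["level"]:
            stack.pop()  # close sections at the same or deeper level
        (stack[-1][1] if stack else root).append(node)
        stack.append((h["level"], node["children"]))
    return root

toc = build_toc([
    {"level": 0, "content": "Results"},        # "#"
    {"level": 1, "content": "Tables"},         # "##"
    {"level": 2, "content": "Regional data"},  # "###"
    {"level": 1, "content": "Figures"},        # "##"
])
# -> Results > [Tables > [Regional data], Figures]
```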
Then corrects misidentified header levels automatically.
Works even when headers span page breaks.
Dive deeper and try out the Colab notebook linked at the bottom of the blog
Once your chunks carry anchors, retrieval doesn’t change. You can use the dense, hybrid, or reranker setup you already have. Consider hiding the anchors from the prose you embed, while keeping them in the output and making the IDs clickable.
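One way to do that hide-for-embedding, keep-for-output split; the [ref:p12-f3] anchor syntax is made up for the example:

```python
import re

def split_anchors(chunk_text: str):
    """Embed the clean prose; keep anchors in metadata so answers can cite them."""
    anchors = re.findall(r"\[ref:[^\]]+\]", chunk_text)  # invented anchor syntax
    clean = re.sub(r"\s*\[ref:[^\]]+\]", "", chunk_text)
    return clean, anchors

clean, anchors = split_anchors("Net revenue rose 12% [ref:p12-f3] year over year.")
# embed(clean); store anchors alongside the chunk and render them as clickable links
```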
Iterate through the page fragment objects and combine them into appropriately sized chunks. As you build each chunk, attach contextual metadata to help during retrieval.
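Roughly like this; the fragment fields (fragment_type, content, page_number) are assumptions about the parsed-output shape:

```python
# Sketch: combine page fragments into size-budgeted chunks with context metadata.
# Fragment fields here (content, fragment_type, page_number) are assumed shapes.
def chunk_fragments(fragments, max_chars: int = 2000):
    chunks, buf, meta = [], [], {"pages": set(), "section": None}
    for frag in fragments:
        if frag["fragment_type"] == "section_header":
            meta["section"] = frag["content"]  # carry the current section as context
        buf.append(frag["content"])
        meta["pages"].add(frag["page_number"])
        if sum(len(t) for t in buf) >= max_chars:
            chunks.append({"text": "\n".join(buf),
                           "section": meta["section"],
                           "pages": sorted(meta["pages"])})
            buf, meta = [], {"pages": set(), "section": meta["section"]}
    if buf:  # flush the final partial chunk
        chunks.append({"text": "\n".join(buf),
                       "section": meta["section"],
                       "pages": sorted(meta["pages"])})
    return chunks
```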