The goal of an effective document parser SHOULD be to help with the hardest problems. A low-quality scan of a 1926 Norwegian document full of statistical tables is the perfect test doc 🔥
✓ Full methodology
✓ Corrected OCRBench v2 ground truth
✓ Comparative analysis across all major providers
Read the full benchmark: tlake.link/benchmarks
Stop benchmarking vanity metrics. Start measuring what breaks.
Tensorlake: 86.8% TEDS, 91.7% F1
AWS Textract: 80.7% TEDS, 88.4% F1
Azure: 78.1% TEDS, 88.1% F1
Docling: 63.8% TEDS, 68.9% F1
The gap? 670 fewer manual reviews per 10k documents.
TEDS (Tree Edit Distance Similarity): Measures if tables stay tables
JSON F1: Measures if downstream systems can use the output
Not "is the text similar?" but "can automation actually work?"
But your production failures come from:
- Collapsed tables
- Jumbled reading order
- Missing visual content
- Hallucinated extractions
None of this shows up in your scores.
tlake.link/notebooks/vl...
Shows how to extract cryptocurrency metrics from 10-Ks and 10-Qs using page classification
Full changelog: tlake.link/changelog/vlm
What would you build with this?
📄 Page Classification: Large docs, specific sections needed
📊 Table/Figure Summarization: Visual data in reports
⚡ skip_ocr=True: For complex reading order, diagrams, and scanned docs
Text extraction still uses OCR for best quality
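A minimal sketch of how that decision might be wired into a parse call. Only the skip_ocr flag comes from this post; the client and method names are assumptions, so treat this as pseudocode for the routing logic rather than the real SDK surface:

```python
# Hypothetical sketch -- the client and method names are assumptions;
# only the skip_ocr flag comes from the post above.
def parse_document(client, path: str, *, scanned: bool, complex_layout: bool):
    if scanned or complex_layout:
        # Diagrams, scans, tangled reading order: let the VLM read pages directly.
        return client.parse(path, skip_ocr=True)
    # Default path: OCR-backed text extraction for best quality.
    return client.parse(path)
```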
- 1,500+ total pages
- 427 relevant pages identified by VLM
- Processing time: 5 minutes → 45 seconds per document
All without sacrificing accuracy
Example: Extracting crypto holdings from SEC filings
1. VLM classifies which pages contain financial data (~50 out of 200 pages)
2. Extract only from relevant pages
3. Skip ~75% of the processing
Result: 80-90% faster ⚡
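The same flow as code. classify_pages and extract here are hypothetical stand-ins, not confirmed API calls:

```python
# Hypothetical sketch of the classify-then-extract flow above.
# classify_pages / extract are stand-ins, not confirmed API calls.
def extract_crypto_holdings(client, filing_pdf: str):
    # 1. VLM tags each page, e.g. {"page": 12, "label": "financial_data"}.
    labels = client.classify_pages(filing_pdf, classes=["financial_data", "other"])

    # 2. Keep only the pages that matter (~50 of 200 in the example above).
    relevant = [p["page"] for p in labels if p["label"] == "financial_data"]

    # 3. Extract from those pages alone; the other ~75% are never parsed.
    return client.extract(filing_pdf, pages=relevant)
```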
Traditional approach:
OCR everything → Convert to text → Search → Extract
This wastes time processing irrelevant pages
Live now in our API, SDK, and Cloud.
- <del> tags for deletions
- <ins> tags for insertions
- <span class="comment"> for reviewer notes
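Because those tags are plain HTML, standard tooling can split accepted text from redlines downstream. A sketch with Python's built-in parser (the sample markup is invented; only the three tags come from above):

```python
# Sketch: separating insertions, deletions, and comments from parsed output.
# The sample markup is invented; the three tags come from the post above.
from html.parser import HTMLParser

class RedlineParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.mode = "text"
        self.buckets = {"text": [], "del": [], "ins": [], "comment": []}

    def handle_starttag(self, tag, attrs):
        if tag in ("del", "ins"):
            self.mode = tag
        elif tag == "span" and ("class", "comment") in attrs:
            self.mode = "comment"

    def handle_endtag(self, tag):
        self.mode = "text"

    def handle_data(self, data):
        self.buckets[self.mode].append(data)

p = RedlineParser()
p.feed('Fee is <del>5%</del><ins>4%</ins> <span class="comment">confirm with legal</span>')
print(p.buckets["ins"])  # ['4%']
```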
→ RAG pipelines (better chunking)
→ Knowledge graphs (accurate trees)
→ Document navigation
→ Table of contents generation
Changelog: tlake.link/changelog/he...
Try it: tlake.link/notebooks/he...
- level: 0 for #, 1 for ##, 2 for ###, etc.
- content: clean text
- proper nesting for up to 6 levels
Enable with:
cross_page_header_detection=True
That's it.
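With level and content on every header, a table of contents is a small fold over the list. A sketch (the fragment shape comes from the bullets above; the rest is illustrative):

```python
# Sketch: nest a flat list of {level, content} header fragments into a tree.
def build_toc(headers: list[dict]) -> list[dict]:
    root, stack = [], []  # stack holds (level, children_list) of open sections
    for h in headers:
        node = {"title": h["content"], "children": []}
        while stack and stack[-1][0] >= h["level"]:
            stack.pop()  # close sections at the same or deeper level
        (stack[-1][1] if stack else root).append(node)
        stack.append((h["level"], node["children"]))
    return root

toc = build_toc([
    {"level": 0, "content": "Results"},        # "#"
    {"level": 1, "content": "Tables"},         # "##"
    {"level": 2, "content": "Regional data"},  # "###"
    {"level": 1, "content": "Figures"},        # "##"
])
# -> Results > [Tables > [Regional data], Figures]
```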
Then corrects misidentified header levels automatically.
Works even when headers span page breaks.
Dive deeper and try out the Colab notebook linked at the bottom of the blog
Once your chunks carry anchors, retrieval doesn’t change. You can use the dense, hybrid, or reranker setup you already have. Consider hiding the anchors from the prose you embed, while keeping them in the output and making the IDs clickable.
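One way to do that hide-for-embedding, keep-for-output split; the [ref:p12-f3] anchor syntax is made up for the example:

```python
import re

def split_anchors(chunk_text: str):
    """Embed the clean prose; keep anchors in metadata so answers can cite them."""
    anchors = re.findall(r"\[ref:[^\]]+\]", chunk_text)  # invented anchor syntax
    clean = re.sub(r"\s*\[ref:[^\]]+\]", "", chunk_text)
    return clean, anchors

clean, anchors = split_anchors("Net revenue rose 12% [ref:p12-f3] year over year.")
# embed(clean); store anchors alongside the chunk and render them as clickable links
```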
Iterate through the page fragment objects and combine them into appropriately sized chunks. As you build each chunk, attach contextual metadata to help during retrieval.
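Roughly like this; the fragment fields (fragment_type, content, page_number) are assumptions about the parsed-output shape:

```python
# Sketch: combine page fragments into size-budgeted chunks with context metadata.
# Fragment fields here (content, fragment_type, page_number) are assumed shapes.
def chunk_fragments(fragments, max_chars: int = 2000):
    chunks, buf, meta = [], [], {"pages": set(), "section": None}
    for frag in fragments:
        if frag["fragment_type"] == "section_header":
            meta["section"] = frag["content"]  # carry the current section as context
        buf.append(frag["content"])
        meta["pages"].add(frag["page_number"])
        if sum(len(t) for t in buf) >= max_chars:
            chunks.append({"text": "\n".join(buf),
                           "section": meta["section"],
                           "pages": sorted(meta["pages"])})
            buf, meta = [], {"pages": set(), "section": meta["section"]}
    if buf:  # flush the final partial chunk
        chunks.append({"text": "\n".join(buf),
                       "section": meta["section"],
                       "pages": sorted(meta["pages"])})
    return chunks
```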