“No accomplishment by any individual justifies receiving the same compensation as 3M people working full-time.”
Not that 1926 Norwegian statistical tables are a generally useful benchmark…
We tested every major parser on real enterprise documents.
The results will change how you think about OCR accuracy 🧵
Learn how to extract data from unstructured documents with @tensorlake.ai, store it in @qdrant.bsky.social, and then use @langchain.bsky.social for natural language querying.
Check out our lesson 👇
In the free Qdrant Essentials Course, learn how to:
- Architect vector-powered data lakes
- Optimize ETL pipelines
- Create knowledge graphs
- Integrate @langchain.bsky.social agents for natural language queries
t.co/OoPZswrL7z
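A rough sketch of that extract → store → query flow, assuming a local Qdrant instance: the Tensorlake parse is reduced to a placeholder function, and the LangChain step is reduced to "embed the question, search, hand the hits to an LLM". Collection and model names are arbitrary.

```python
# Sketch only: placeholder parse step, real Qdrant calls.
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PointStruct
from sentence_transformers import SentenceTransformer

def parse_with_tensorlake(path: str) -> list[str]:
    # Placeholder: swap in your actual Tensorlake parse call here.
    return ["Example chunk of extracted text.", "Another chunk."]

encoder = SentenceTransformer("all-MiniLM-L6-v2")
client = QdrantClient(url="http://localhost:6333")

client.create_collection(
    collection_name="documents",
    vectors_config=VectorParams(
        size=encoder.get_sentence_embedding_dimension(),
        distance=Distance.COSINE,
    ),
)

chunks = parse_with_tensorlake("contract.pdf")

client.upsert(
    collection_name="documents",
    points=[
        PointStruct(
            id=i,
            vector=encoder.encode(chunk).tolist(),
            payload={"text": chunk, "source": "contract.pdf"},
        )
        for i, chunk in enumerate(chunks)
    ],
)

# Retrieval side: embed the question, pull the closest chunks,
# then pass them to whatever LLM chain you're using for the answer.
hits = client.search(
    collection_name="documents",
    query_vector=encoder.encode("What are the termination terms?").tolist(),
    limit=3,
)
```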
We're using VLMs for:
- Page classification in large documents
- Table/figure summarization
- Fast structured extraction (skip_ocr mode)
Here's what this means for document processing 🧵
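Page classification is the easiest of the three to sketch: render the page to an image and ask a general-purpose VLM for a label. This is an illustration of the pattern only, not our actual pipeline; the model name and label set are placeholders.

```python
# Illustrative VLM page classification: PDF page -> PNG -> label.
import base64
from io import BytesIO
from pdf2image import convert_from_path   # renders PDF pages to PIL images
from openai import OpenAI

client = OpenAI()
LABELS = ["invoice", "contract", "financial_table", "correspondence", "other"]

def classify_page(pdf_path: str, page_number: int) -> str:
    page = convert_from_path(pdf_path, first_page=page_number, last_page=page_number)[0]
    buf = BytesIO()
    page.save(buf, format="PNG")
    image_b64 = base64.b64encode(buf.getvalue()).decode()

    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": f"Classify this page as one of: {', '.join(LABELS)}. "
                         "Reply with the label only."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content.strip()
```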
That means:
❌ Lost audit trails
❌ Manual review of revision history
❌ No programmatic access to reviewer comments
❌ Workflows that can't route based on specific edits
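The frustrating part is that the edits and comments are sitting right there in the file. A .docx is a zip of OOXML parts: tracked changes live in word/document.xml as w:ins / w:del elements, and reviewer comments in word/comments.xml. A rough sketch of reading them directly (error handling omitted):

```python
# Pull tracked changes and reviewer comments straight out of a .docx.
import zipfile
from lxml import etree

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}

def revisions_and_comments(path: str):
    with zipfile.ZipFile(path) as z:
        doc = etree.fromstring(z.read("word/document.xml"))
        revisions = [
            {
                "type": "insert" if el.tag == f"{{{W}}}ins" else "delete",
                "author": el.get(f"{{{W}}}author"),
                "date": el.get(f"{{{W}}}date"),
                "text": "".join(el.itertext()),
            }
            for el in doc.iter(f"{{{W}}}ins", f"{{{W}}}del")
        ]
        comments = []
        if "word/comments.xml" in z.namelist():
            croot = etree.fromstring(z.read("word/comments.xml"))
            comments = [
                {"author": c.get(f"{{{W}}}author"), "text": "".join(c.itertext())}
                for c in croot.findall("w:comment", NS)
            ]
    return revisions, comments
```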
When you need an accurate representation of the document (e.g. header levels), you need more than OCR.
Tensorlake fixes OCR results, detecting and correcting header levels when parsing. 👇
Section 2.2 becomes a second-level header (##) instead of an over-nested one (###).
We just shipped automatic header correction.
🧵 How it works:
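A toy version of the idea, not the shipped implementation: when a heading carries its own numbering, the numbering is a more reliable depth signal than OCR font-size guesses.

```python
# Toy heuristic: derive markdown header depth from section numbering.
# "2" -> "#", "2.2" -> "##", "2.2.1" -> "###", and so on.
import re

NUMBERED = re.compile(r"^(\d+(?:\.\d+)*)\.?\s+(.+)$")

def correct_header(line: str) -> str:
    match = NUMBERED.match(line.strip())
    if not match:
        return line                      # not a numbered heading, leave it alone
    numbering, title = match.groups()
    depth = numbering.count(".") + 1     # "2.2" has two components -> "##"
    return f"{'#' * depth} {numbering} {title}"

print(correct_header("2.2 Payment Terms"))   # -> "## 2.2 Payment Terms"
```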
(She’s the Dalmatian)
Or just my kid?
Being able to add spatial information from the original parse API call makes it super easy, but I'm curious how others are handling it.
When users ask "where did this come from?" your system should point to the exact page fragment...not just "file_name.pdf".
Building citation-aware RAG with spatial metadata means:
→ Parse docs with bounding boxes
→ Embed citation anchors in chunks
→ Return page numbers + coordinates
A 🧵
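Here's the minimal shape I mean by "citation anchors in chunks"; field names are illustrative, not any particular parser's schema.

```python
# Carry the citation anchor (page + bounding box) with every chunk so the
# retriever can hand it straight back with the answer.
from dataclasses import dataclass

@dataclass
class Chunk:
    text: str
    source: str
    page: int
    bbox: tuple[float, float, float, float]  # (x0, y0, x1, y1) in page coordinates

    def citation(self) -> str:
        x0, y0, x1, y1 = self.bbox
        return f"{self.source}, p.{self.page} @ ({x0:.0f},{y0:.0f})-({x1:.0f},{y1:.0f})"

# Whatever you store alongside the embedding (Qdrant payload, pgvector column, ...)
# should include these fields, so a retrieved hit can answer
# "where did this come from?" with more than a filename.
chunk = Chunk(
    text="Termination requires 30 days written notice.",
    source="contract.pdf", page=4, bbox=(72.0, 310.0, 540.0, 362.0),
)
print(chunk.citation())   # contract.pdf, p.4 @ (72,310)-(540,362)
```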
Those of us working in the space know that the bar is set *much* higher than that. AI is a tool. You wouldn't hammer everything like it's a nail and then shrug when a window breaks.
Use tools intelligently.
Every answer should come with receipts (citations + context).
Learn how to make your AI correct and verifiable in this month’s Document Digest newsletter 👇
How are smaller companies deciding between what to build, what infra to pay directly for, and what tools to just leverage?
This is the beginning of better integration between Microsoft Azure and Tensorlake.
If you are using Azure and need better document ingestion and ETL for unstructured data, reach out to us!
Get citations for every field extracted with Tensorlake.
Read the blog and try our citations with the example notebooks: tlake.link/blog/citations
Great article by @guthals.com.
dev.to/drguthals/th...
It's why I love this industry: it's really all about learning 🤓
Regardless of my title or how tech evolves
It was fun coming up with the Colab notebooks for this one: compare the claims made in news articles about Tesla with actual Tesla SEC filings 👀
Check out the thread and blog (with notebooks) 👇
What’s dead is cosine top‑N retrieval without a retrieval plan.
We ship advanced RAG...out of the box:
• Classify pages → target sections
• Extract structured fields → filter by form_type, fiscal_period
• Verify data; cite page/bbox
Want to know how? 🧵👇
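The "retrieval plan" part, sketched with Qdrant as one example of a store with payload filters: filter on the extracted form_type / fiscal_period fields first, then run vector similarity. Collection, model, and values here are made up.

```python
# Metadata-filtered retrieval: structured fields narrow the search space
# before cosine similarity ever runs.
from qdrant_client import QdrantClient
from qdrant_client.models import Filter, FieldCondition, MatchValue
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")
client = QdrantClient(url="http://localhost:6333")

claim = "Tesla reported record automotive gross margin this quarter."

hits = client.search(
    collection_name="sec_filings",
    query_vector=encoder.encode(claim).tolist(),
    query_filter=Filter(must=[
        FieldCondition(key="form_type", match=MatchValue(value="10-Q")),
        FieldCondition(key="fiscal_period", match=MatchValue(value="2024-Q2")),
    ]),
    limit=5,
)

# Each hit's payload would carry page/bbox for the citation step.
for h in hits:
    print(h.score, h.payload.get("page"), h.payload.get("text", "")[:80])
```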
I’m so grateful he ever came into my life.
I am forever changed by him in the best ways, and would have never survived the last three years without him.
If you need to rest, then fine, but I’d much prefer if you haunt me