Dipankar Mazumdar
@dipankartnt.bsky.social
Director (Data/AI) @Cloudera
Contributor: Apache Iceberg | Apache Hudi
Distributed Computing | Technical Author (O’Reilly, Packt)

Prev📍: DevRel @Onehouse.ai, Dremio, Engineering @Qlik, @OTIS Elevators

Book 📕: https://a.co/d/fUDs7G6
[NEW BLOG]: What is Apache Arrow Flight / Flight SQL / ADBC? 🎉

We need to ask: why don’t ODBC & JDBC fit in today’s analytical world?

These protocols were designed primarily for row-based workloads.

What about columnar “Arrow” based data?
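
To make the row-vs-columnar mismatch concrete, here is a minimal sketch. It uses Python's built-in sqlite3 cursor purely as a stand-in for a row-oriented client API; the `trips` table and the `rows_to_columns` helper are hypothetical names for illustration and are not taken from the blog.

```python
import sqlite3

# Hypothetical table, used only for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trips (city TEXT, fare REAL)")
conn.executemany(
    "INSERT INTO trips VALUES (?, ?)",
    [("NYC", 12.5), ("SF", 20.0), ("NYC", 7.5)],
)

# A row-oriented API hands results back as one tuple per row.
rows = conn.execute("SELECT city, fare FROM trips ORDER BY rowid").fetchall()


def rows_to_columns(rows, names):
    """Pivot row tuples into one list per column (hypothetical helper)."""
    columns = {name: [] for name in names}
    for row in rows:
        for name, value in zip(names, row):
            columns[name].append(value)
    return columns


# A columnar consumer has to do this pivot before it can work on the data.
columns = rows_to_columns(rows, ["city", "fare"])
total_fare = sum(columns["fare"])
```

The per-row pivot is the extra work a columnar consumer pays when the transport is row-based. A transport that ships columnar batches end to end would not need it.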

dipankar-tnt.medium.com/what-is-apac...
What is Apache Arrow Flight, Flight SQL & ADBC?
ODBC (Open Database Connectivity) and JDBC (Java Database Connectivity) have long been the industry standard for connecting databases with…
dipankar-tnt.medium.com
March 18, 2025 at 3:39 PM
Blogged: ACID in a Lakehouse - how do Apache Hudi, Iceberg & Delta Lake implement it?

www.onehouse.ai/blog/acid-tr...
ACID Transactions in an Open Data Lakehouse
Ensuring Atomicity, Consistency, Isolation, and Durability (ACID) in data systems is crucial for maintaining data integrity, especially in environments with concurrent operations. By the end of this b...
www.onehouse.ai
February 22, 2025 at 5:28 PM
Reposted by Dipankar Mazumdar
📅 Save the Date 📅

Community Over Code North America 2025 has been announced!

Where: Minneapolis, MN (USA)
When: September 11-14, 2025

Read more about #CommunityOverCode --> https://buff.ly/4jQx36S
February 10, 2025 at 11:34 PM
Blogged: Concurrency control methods in a Lakehouse with Apache Hudi, Iceberg & Delta Lake.

In this blog I go into the fundamentals of concurrency control, explain why it is essential for lakehouses, and cover OCC, MVCC & non-blocking concurrency control.

hudi.apache.org/blog/2025/01...
Concurrency Control in Open Data Lakehouse | Apache Hudi
hudi.apache.org
January 30, 2025 at 5:20 PM
Blogged: Clustering Algorithms in Open Lakehouse formats such as Apache Hudi, Apache Iceberg & Delta Lake.

Querying huge volumes of data from storage demands an optimized data layout.

Your queries are fast today, but they might not be over time!

Read: www.onehouse.ai/blog/what-is...
January 24, 2025 at 4:00 PM
“Bringing the database kernel to data lakes” - this is what Apache Hudi started with before the world heard of something called “Lakehouse”.

Lakehouse means only one thing: data lakes needed a “transactional layer” on top of Parquet to run db-style workloads (both transactional & analytical).
January 13, 2025 at 9:10 PM
Reposted by Dipankar Mazumdar
Now seems like a good time to remind people of my starter pack.

go.bsky.app/T1SxhAe
January 7, 2025 at 7:42 PM
Reposted by Dipankar Mazumdar
The 200th edition of Data Engineering Weekly is out. Thank you all for your kind support
www.dataengineeringw...
Data Engineering Weekly #200
The Weekly Data Engineering Newsletter
www.dataengineeringweekly.com
December 9, 2024 at 3:14 PM
Reposted by Dipankar Mazumdar
Apache XTable at Scale in Production!

This is a solid example to show the metadata translation capability for open table formats like Iceberg, Hudi & Delta.

Fabric users can work with Iceberg tables written by Snowflake without any rewrites/stuff.
Link: blog.fabric.microsoft.com/en-us/blog/s...
November 21, 2024 at 11:46 PM
Reposted by Dipankar Mazumdar
New blog post on the fun new hardware advancements which databases can leverage for great gains, and why the cloud means it doesn't matter that they exist. 🫠

transactional.blog/b...
November 20, 2024 at 12:13 AM
Reposted by Dipankar Mazumdar
This might be the first time an open source app is at the top of the app store. Definitely the first open source social app.
Here’s the source code by the way github.com/bluesky-soci...
November 13, 2024 at 6:50 PM
5 techniques to optimize performance for your Open Lakehouse.

No matter how performant your compute engine is, your storage needs data to be optimally organized.

Here’s a new blog I published!

www.onehouse.ai/blog/how-to-...
How to Optimize Performance for Your Open Data Lakehouse
How can you speed up queries, and deal with new workloads and shifting query patterns? Learn about the tools you have to optimize your lakehouse for performance.
www.onehouse.ai
November 9, 2024 at 4:51 AM
Fast "Copy-On-Write" on Apache Parquet.

Upserts are crucial for use cases like CDC.

There are two ways to handle record-level updates in data lakes:
Copy-On-Write (CoW)
Merge-On-Read (MoR)

A 🧵on Uber's new CoW optimization for Parquet.
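
A toy sketch of the two update strategies, assuming a "file" is just a dict of record key to value. This is not Uber's optimization and not how any table format implements it; `cow_upsert`, `mor_upsert` and `mor_read` are hypothetical names for illustration.

```python
def cow_upsert(base_file, updates):
    """Copy-On-Write: rewrite the file at write time with updates applied."""
    new_file = dict(base_file)  # copy every existing record
    new_file.update(updates)    # apply updates and inserts
    return new_file


def mor_upsert(log, updates):
    """Merge-On-Read: append updates to a log and leave the base file alone."""
    return log + [dict(updates)]


def mor_read(base_file, log):
    """Merge-On-Read: merge the base file with logged updates at read time."""
    merged = dict(base_file)
    for entry in log:
        merged.update(entry)
    return merged


base = {"id1": "a", "id2": "b"}
changes = {"id2": "B", "id3": "c"}

cow_v2 = cow_upsert(base, changes)       # cost paid by the writer
log = mor_upsert([], changes)            # cheap write
mor_view = mor_read(base, log)           # cost paid by the reader
```

Both paths end with the same view of the data. The difference is where the merge cost lands: on the write for CoW, on the read for MoR.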
November 4, 2024 at 8:40 PM
Announcing my new book - "Engineering Lakehouses with Open Table Formats" 🎉

TBH, I have been thinking about this for quite some time.
Most of the time, in conversations with engineers exploring these formats, so many questions have come up.
November 1, 2024 at 7:53 PM
Reposted by Dipankar Mazumdar
I started this new repo to serve as a central hub for content that I create & insights that I share from different sources about the Lakehouse architecture.

Here’s what you will find in the repo:
⭐️ Blogs
⭐️ Research Papers summary
⭐️ Code
⭐️ Crisp Social posts

Link: github.com/dipankarmazu...
October 29, 2024 at 3:46 AM
Hey 👋 folks! It’s very nice to see a lot of data community people here.

I am Dipankar & I have been involved in the open source space for some time. Currently I focus on Data Infra with projects such as Apache Hudi, Iceberg, Arrow & XTable. In my prev gigs, I worked across the data spectrum.
October 28, 2024 at 8:43 PM
What is Incremental Processing & how do orgs like Uber benefit from it in a Lakehouse?

Incremental processing is a technique that processes data in small increments rather than in one large batch.
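
A minimal sketch of the idea, assuming each record carries a commit timestamp and the job remembers the last one it processed. The `commit_ts` and `amount` fields and the `process_incrementally` function are hypothetical, chosen only for illustration.

```python
def process_incrementally(table, last_checkpoint, running_total):
    """Fold only records committed after last_checkpoint into running_total."""
    new_records = [r for r in table if r["commit_ts"] > last_checkpoint]
    for record in new_records:
        running_total += record["amount"]
    # Advance the checkpoint only if something new was seen.
    new_checkpoint = max(
        (r["commit_ts"] for r in new_records), default=last_checkpoint
    )
    return running_total, new_checkpoint, len(new_records)


table = [
    {"commit_ts": 1, "amount": 10},
    {"commit_ts": 2, "amount": 5},
]

# First run starts from scratch and sees both records.
first_run = process_incrementally(table, last_checkpoint=0, running_total=0)

# A new record arrives; the second run touches only that record.
table.append({"commit_ts": 3, "amount": 7})
second_run = process_incrementally(table, first_run[1], first_run[0])
```

The second run reads one record instead of three, and the gap widens as the table grows while each increment stays small.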
October 28, 2024 at 4:16 AM
Reposted by Dipankar Mazumdar
DataFusion implements one of the most advanced Parquet readers🚀, check out how:
blog.haoxp.xyz/posts/parque...
October 25, 2024 at 1:24 AM
Reposted by Dipankar Mazumdar
A list of Data Engineers gathered from my Twitter following bsky.app/profile/ssp....
October 21, 2024 at 12:53 PM