Lightnews — Scholar-powered news

Dipankar Mazumdar

@dipankartnt.bsky.social

[NEW BLOG]: What is Apache Arrow Flight / Flight SQL / ADBC? 🎉

We need to ask: why don’t ODBC & JDBC fit in today’s analytical world?

These protocols were designed primarily for row-based workloads.

What about columnar “Arrow” based data?

dipankar-tnt.medium.com/what-is-apac...

What is Apache Arrow Flight, Flight SQL & ADBC?

ODBC (Open Database Connectivity) and JDBC (Java Database Connectivity) have long been the industry standard for connecting databases with…

dipankar-tnt.medium.com

March 18, 2025 at 3:39 PM

Dipankar Mazumdar

@dipankartnt.bsky.social

Blogged: ACID in a Lakehouse - How do Apache Hudi, Iceberg & Delta Lake implement it?

www.onehouse.ai/blog/acid-tr...

ACID Transactions in an Open Data Lakehouse

Ensuring Atomicity, Consistency, Isolation, and Durability (ACID) in data systems is crucial for maintaining data integrity, especially in environments with concurrent operations. By the end of this b...

www.onehouse.ai

February 22, 2025 at 5:28 PM

Reposted by Dipankar Mazumdar

Apache Software Foundation (The ASF)

@apache.org

📅 Save the Date 📅

Community Over Code North America 2025 has been announced!

Where: Minneapolis, MN (USA)
When: September 11-14, 2025

Read more about #CommunityOverCode --> https://buff.ly/4jQx36S

February 10, 2025 at 11:34 PM

Dipankar Mazumdar

@dipankartnt.bsky.social

Blogged: Concurrency control methods in a Lakehouse with Apache Hudi, Iceberg & Delta Lake.

In this blog, I go into the fundamentals of concurrency control and explore why it is essential for lakehouses, covering OCC, MVCC & non-blocking concurrency control.

hudi.apache.org/blog/2025/01...

Concurrency Control in Open Data Lakehouse | Apache Hudi

Introduction

hudi.apache.org

January 30, 2025 at 5:20 PM

Dipankar Mazumdar

@dipankartnt.bsky.social

Blogged: Clustering Algorithms in Open Lakehouse formats such as Apache Hudi, Apache Iceberg & Delta Lake.

Querying huge volumes of data from storage demands optimized query speed.

Your queries are fast today, but they might not be over time!

Read: www.onehouse.ai/blog/what-is...

January 24, 2025 at 4:00 PM

Dipankar Mazumdar

@dipankartnt.bsky.social

“Bringing the database kernel to data lakes” - this is what Apache Hudi started with before the world heard of something called “Lakehouse”.

Lakehouse means only one thing: data lakes needed a “transactional layer” on top of Parquet to run db-style workloads (both transactional & analytical).

January 13, 2025 at 9:10 PM

Reposted by Dipankar Mazumdar

Arynnpost

@arynn.bsky.social

Now seems like a good time to remind people of my starter pack.

go.bsky.app/T1SxhAe

January 7, 2025 at 7:42 PM

Reposted by Dipankar Mazumdar

Ananth Packkildurai

@ananthdurai.bsky.social

The 200th edition of Data Engineering Weekly is out. Thank you all for your kind support
www.dataengineeringw...

Data Engineering Weekly #200

The Weekly Data Engineering Newsletter

www.dataengineeringweekly.com

December 9, 2024 at 3:14 PM

Reposted by Dipankar Mazumdar

Dipankar Mazumdar

@dipankartnt.bsky.social

Apache XTable at Scale in Production!

This is a solid example to show the metadata translation capability for open table formats like Iceberg, Hudi & Delta.

Fabric users can work with Iceberg tables written by Snowflake without any rewrites/stuff.
Link: blog.fabric.microsoft.com/en-us/blog/s...

November 21, 2024 at 11:46 PM

Reposted by Dipankar Mazumdar

Alex Miller

@alexmillerdb.bsky.social

New blog post on the fun new hardware advancements which databases can leverage for great gains, and why the cloud means it doesn't matter that they exist. 🫠

transactional.blog/b...

November 20, 2024 at 12:13 AM

Reposted by Dipankar Mazumdar

Jay 🦋

@jay.bsky.team

This might be the first time an open source app is at the top of the app store. Definitely the first open source social app.

Paul Frazee @pfrazee.com · Nov 13

Here’s the source code by the way github.com/bluesky-soci...

November 13, 2024 at 6:50 PM

Dipankar Mazumdar

@dipankartnt.bsky.social

5 techniques to optimize performance for your Open Lakehouse.

No matter how performant your compute engine is, your storage needs data to be optimally organized.

Here’s a new blog I published!

www.onehouse.ai/blog/how-to-...

How to Optimize Performance for Your Open Data Lakehouse

How can you speed up queries, and deal with new workloads and shifting query patterns? Learn about the tools you have to optimize your lakehouse for performance.

www.onehouse.ai

November 9, 2024 at 4:51 AM

Dipankar Mazumdar

@dipankartnt.bsky.social

Fast "Copy-On-Write" on Apache Parquet.

Upserts are crucial for use cases like CDC.

There are two ways to handle record-level updates in data lakes:
Copy-On-Write (CoW)
Merge-On-Read (MoR)

A 🧵 on Uber's new CoW optimization for Parquet.
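
To make the CoW vs. MoR distinction concrete, here is a minimal in-memory sketch. It is a toy model only, not how Apache Hudi or Uber's Parquet optimization works: the class names `CopyOnWriteTable` and `MergeOnReadTable`, the `rewrites` counter, and the use of a Python dict as a stand-in for a base file are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class CopyOnWriteTable:
    """Toy CoW table (hypothetical): each upsert rewrites the whole base 'file'."""
    base: dict = field(default_factory=dict)  # key -> record, stands in for a base file
    rewrites: int = 0                         # how many full rewrites have happened

    def upsert(self, key, record):
        new_base = dict(self.base)  # copy every existing record
        new_base[key] = record      # apply the single change
        self.base = new_base        # swap in the new file version
        self.rewrites += 1

    def read(self):
        return dict(self.base)      # readers see merged data with no extra work


@dataclass
class MergeOnReadTable:
    """Toy MoR table (hypothetical): upserts go to a log and are merged at read time."""
    base: dict = field(default_factory=dict)  # key -> record, stands in for a base file
    log: list = field(default_factory=list)   # appended (key, record) changes

    def upsert(self, key, record):
        self.log.append((key, record))  # cheap write: append only

    def read(self):
        merged = dict(self.base)
        for key, record in self.log:    # replay the log on top of the base
            merged[key] = record
        return merged

    def compact(self):
        self.base = self.read()         # fold the log into a new base file
        self.log = []


cow = CopyOnWriteTable()
mor = MergeOnReadTable()
for key, record in [("a", 1), ("b", 2), ("a", 3)]:
    cow.upsert(key, record)
    mor.upsert(key, record)
```

In this sketch both tables return the same data on read. The difference is where the cost lands: the CoW table pays a full copy on every write, while the MoR table keeps writes cheap and pays at read time until `compact()` folds the log into the base.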

November 4, 2024 at 8:40 PM

Dipankar Mazumdar

@dipankartnt.bsky.social

Announcing my new book - "Engineering Lakehouses with Open Table Formats" 🎉

TBH, I have been thinking about this for quite some time.
Most of the time, in conversations with engineers exploring these formats, so many questions come up.

November 1, 2024 at 7:53 PM

Reposted by Dipankar Mazumdar

Dipankar Mazumdar

@dipankartnt.bsky.social

I started this new repo to serve as a central hub for content that I create & insights that I share from different sources about the Lakehouse architecture.

Here’s what you will find in the repo:
⭐️ Blogs
⭐️ Research paper summaries
⭐️ Code
⭐️ Crisp Social posts

Link: github.com/dipankarmazu...

October 29, 2024 at 3:46 AM

Dipankar Mazumdar

@dipankartnt.bsky.social

Hey 👋 folks! It’s very nice to see a lot of data community people here.

I am Dipankar & I have been involved in the open source space for some time. Currently, I focus on data infra with projects such as Apache Hudi, Iceberg, Arrow & XTable. In my previous gigs, I worked across different parts of the data spectrum.

October 28, 2024 at 8:43 PM

Dipankar Mazumdar

@dipankartnt.bsky.social

What is Incremental Processing & how do orgs like Uber benefit from it in a Lakehouse?

Incremental processing is a technique that processes data in small increments rather than in one large batch.
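
As a rough illustration of the idea, the sketch below keeps a checkpoint and, on each run, folds in only the records committed after it instead of recomputing over the full dataset. Everything here is an assumption for the example: the function name `incremental_sum`, the `commit_time` and `amount` fields, and the dict-based `state` are hypothetical and do not reflect the API of Hudi or of Uber's pipelines.

```python
def incremental_sum(records, state):
    """Fold only records newer than the checkpoint into the running total.

    `state` is a hypothetical dict holding the last processed commit time
    ("checkpoint") and the aggregate so far ("total").
    Returns the number of records processed in this run.
    """
    checkpoint = state["checkpoint"]
    new_records = [r for r in records if r["commit_time"] > checkpoint]
    for r in new_records:
        state["total"] += r["amount"]
    if new_records:
        # Advance the checkpoint so the next run skips these records.
        state["checkpoint"] = max(r["commit_time"] for r in new_records)
    return len(new_records)


records = [
    {"commit_time": 1, "amount": 10},
    {"commit_time": 2, "amount": 5},
]
state = {"checkpoint": 0, "total": 0}

first_run = incremental_sum(records, state)   # processes both records

records.append({"commit_time": 3, "amount": 7})
second_run = incremental_sum(records, state)  # processes only the new record
```

The second run touches one record rather than three, which is the saving incremental processing is after: work scales with the size of the change, not the size of the table.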