Lightnews — Scholar-powered news

Dipankar Mazumdar

@dipankartnt.bsky.social

780 followers 140 following 53 posts

Director (Data/AI) @Cloudera
Contributor: Apache Iceberg | Apache Hudi
Distributed Computing | Technical Author (O’Reilly, Packt)

Prev📍: DevRel @Onehouse.ai, Dremio, Engineering @Qlik, @OTIS Elevators

Book 📕: https://a.co/d/fUDs7G6


Dipankar Mazumdar

@dipankartnt.bsky.social

Blogged: Clustering Algorithms in Open Lakehouse formats such as Apache Hudi, Apache Iceberg & Delta Lake.

Querying huge volumes of data from storage demands optimized query speed.

Your queries are fast today, but they might not be over time!

Read: www.onehouse.ai/blog/what-is...

January 24, 2025 at 4:00 PM

Dipankar Mazumdar

@dipankartnt.bsky.social

“Bringing the database kernel to data lakes” - this is what Apache Hudi started with before the world heard of something called “Lakehouse”.

Lakehouse means only one thing: data lakes needed a “transactional layer” on top of Parquet to run db-style workloads (both transactional & analytical).

January 13, 2025 at 9:10 PM

Dipankar Mazumdar

@dipankartnt.bsky.social

Apache XTable at Scale in Production!

This is a solid example to show the metadata translation capability for open table formats like Iceberg, Hudi & Delta.

Fabric users can work with Iceberg tables written by Snowflake without any data rewrites.
Link: blog.fabric.microsoft.com/en-us/blog/s...

November 21, 2024 at 11:46 PM

Dipankar Mazumdar

@dipankartnt.bsky.social

3. Clustering: this is a technique used to reorganize & group data within files. The core problem that clustering addresses is the misalignment between how data is written & how it is queried. Often, data is written based on arrival time, which doesn’t necessarily align with the event time.

November 9, 2024 at 4:59 AM
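
The clustering idea in the post above can be illustrated with a small sketch. It is not taken from Hudi, Iceberg, or Delta Lake; the `Record` type, the helper names, and the sample data are all assumptions made for illustration. It writes the same records once in arrival order and once sorted by event day, then counts how many files a min/max-based reader would have to open for a single-day query.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    arrival_seq: int  # order in which the record reached storage
    event_day: int    # day the event actually happened (the query column)


def write_files(records, rows_per_file):
    """Split records into fixed-size 'files', preserving the given order."""
    return [records[i:i + rows_per_file]
            for i in range(0, len(records), rows_per_file)]


def files_to_scan(files, day):
    """Count files whose [min, max] event_day range could contain `day`."""
    return sum(
        1 for f in files
        if min(r.event_day for r in f) <= day <= max(r.event_day for r in f)
    )


# Hypothetical data: events arrive out of order relative to event time.
event_days = [3, 1, 4, 1, 5, 2, 6, 2, 3, 5, 4, 6]
records = [Record(i, d) for i, d in enumerate(event_days)]

# Written as data arrives: every file spans a wide range of event days.
unclustered = write_files(records, rows_per_file=4)

# Clustered: reorganized by the column that queries filter on.
clustered = write_files(sorted(records, key=lambda r: r.event_day),
                        rows_per_file=4)

print(files_to_scan(unclustered, day=3), files_to_scan(clustered, day=3))
```

In this toy data set, sorting by `event_day` makes the per-file value ranges disjoint, so a reader that checks file-level min/max statistics can skip most files. Real table formats offer more elaborate layouts (for example multi-column sort or space-filling curves), which this sketch does not attempt to model.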

Dipankar Mazumdar

@dipankartnt.bsky.social

2. Compaction: as data is ingested into storage, numerous small files are generated, leading to what is known as the "small file problem." These small files cause query engines to process many files, increasing I/O overhead & slowing down query performance.

You need to compact these files.

November 9, 2024 at 4:56 AM
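
A minimal sketch of how compaction might be planned for the small file problem described above. The greedy grouping strategy, the function name `plan_compaction`, the file sizes, and the 64 MB target are assumptions chosen for illustration; actual table formats use their own planners and configuration.

```python
def plan_compaction(file_sizes_mb, target_mb):
    """Greedily group small files so each group stays within target_mb.

    Each returned group represents one larger output file to be written.
    """
    groups, current, current_size = [], [], 0
    for size in sorted(file_sizes_mb):
        # Close the current group if adding this file would exceed the target.
        if current and current_size + size > target_mb:
            groups.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        groups.append(current)
    return groups


# Hypothetical small files (sizes in MB) produced by frequent ingestion.
small_files = [8, 12, 5, 40, 30, 20, 10, 3]
plan = plan_compaction(small_files, target_mb=64)

print(f"{len(small_files)} input files -> {len(plan)} output files")
for group in plan:
    print(group, "=", sum(group), "MB")
```

The point of the sketch is only that a query engine would open fewer, larger files after the rewrite; it ignores details such as concurrent writers and commit handling that a real compaction service has to deal with.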

Dipankar Mazumdar

@dipankartnt.bsky.social

1. Partitioning: A known method of course - dividing your data into smaller, more manageable chunks, or partitions, based on specific columns. Be aware of the partitioning vices though - www.onehouse.ai/blog/knowing...

November 9, 2024 at 4:54 AM
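
To make the partitioning idea above concrete, here is a small sketch in plain Python. The column name `country`, the sample rows, and both helper functions are hypothetical; the sketch only shows how splitting data by a column lets a reader touch just the partition a filter asks for.

```python
from collections import defaultdict


def partition_by(rows, column):
    """Group rows into partitions keyed by the value of `column`."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[column]].append(row)
    return dict(partitions)


def read_with_pruning(partitions, wanted_value):
    """Read only the partition matching the filter; skip all others."""
    return partitions.get(wanted_value, [])


# Hypothetical rows; "country" is the assumed partition column.
rows = [
    {"country": "US", "amount": 10},
    {"country": "IN", "amount": 7},
    {"country": "US", "amount": 3},
    {"country": "DE", "amount": 5},
]

partitions = partition_by(rows, "country")
us_rows = read_with_pruning(partitions, "US")

print(sorted(partitions))                  # partition keys
print(sum(r["amount"] for r in us_rows))   # total for the US partition only
```

A column with very many distinct values would create many tiny partitions in this scheme, which is one way partitioning can work against query performance.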

Dipankar Mazumdar

@dipankartnt.bsky.social

Announcing my new book - "Engineering Lakehouses with Open Table Formats" 🎉

TBH, I have been thinking about this for quite some time.
Most of the time, in conversations with engineers exploring these formats, so many questions have come up.

November 1, 2024 at 7:53 PM

Dipankar Mazumdar

@dipankartnt.bsky.social

I started this new repo to serve as a central hub for content that I create & insights that I share from different sources about the Lakehouse architecture.

Here’s what you will find in the repo:
⭐️ Blogs
⭐️ Research Papers summary
⭐️ Code
⭐️ Crisp Social posts

Link: github.com/dipankarmazu...

October 29, 2024 at 3:46 AM

Dipankar Mazumdar

@dipankartnt.bsky.social

Hey 👋 folks! It’s very nice to see a lot of data community people here.

I am Dipankar & I have been involved in the open source space for some time. Currently I focus on Data Infra with projects such as Apache Hudi, Iceberg, Arrow & XTable. In my prev gigs, I worked across different parts of the data spectrum.

October 28, 2024 at 8:43 PM

Dipankar Mazumdar

@dipankartnt.bsky.social

What is Incremental Processing & how orgs like Uber benefit from it in a Lakehouse?

Incremental processing is a technique that processes data in small increments rather than in one large batch.

October 28, 2024 at 4:16 AM
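
The definition of incremental processing above can be sketched with a simple checkpoint. This is an illustration under assumptions, not Hudi's incremental query API or any specific production pipeline: the `commit_ts` field, the `process_incrementally` function, and the sample commits are all hypothetical.

```python
def process_incrementally(commits, last_checkpoint, running_total):
    """Process only commits newer than the checkpoint and advance it.

    Returns the updated total, the new checkpoint, and how many commits
    were actually read in this run.
    """
    new_commits = [c for c in commits if c["commit_ts"] > last_checkpoint]
    for commit in new_commits:
        running_total += commit["amount"]
    new_checkpoint = max((c["commit_ts"] for c in new_commits),
                         default=last_checkpoint)
    return running_total, new_checkpoint, len(new_commits)


# Hypothetical table history: each commit carries a monotonically
# increasing commit timestamp.
table = [
    {"commit_ts": 1, "amount": 10},
    {"commit_ts": 2, "amount": 20},
]

# First run: nothing processed yet, so both commits are read.
total, checkpoint, read_1 = process_incrementally(table, 0, 0)

# New data lands; the second run reads only what changed since the checkpoint.
table.append({"commit_ts": 3, "amount": 5})
total, checkpoint, read_2 = process_incrementally(table, checkpoint, total)

print(total, checkpoint, read_1, read_2)
```

The second run touches one commit instead of rescanning the whole table, which is the cost saving that incremental processing aims for compared with recomputing everything in one large batch.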
