maybe35xxxv.bsky.social
@maybe35xxxv.bsky.social
Reposted
Vol:18 No:12 → Ursa: A Lakehouse-Native Data Streaming Engine for Kafka
👥 Authors: Sijie Guo, Matteo Merli, Hang Chen, Neng Lu, Penghui Li
📄 PDF: https://www.vldb.org/pvldb/vol18/p5184-guo.pdf
September 2, 2025 at 2:00 AM
Reposted
Published a new post: "PSA: SQLite WAL checksums fail silently and may lose data"

This is a follow-up to my previous posts. When SQLite encounters checksum failures in the WAL, instead of raising an error, it drops all subsequent frames, even if they are not corrupt. It's not a bug
July 24, 2025 at 3:11 PM
Reposted
I went through DuckDB's WAL, and it does everything I was asking for in my blog post:

1. Per record checksum
2. Explicit error on checksum failure
3. Configurable behavior
4. Partial recovery
5. Safe truncation of the WAL only when WAL contents are checkpointed
avi.im @avi.im · Jul 24
Published a new post: "PSA: SQLite WAL checksums fail silently and may lose data"

This is a follow-up to my previous posts. When SQLite encounters checksum failures in the WAL, instead of raising an error, it drops all subsequent frames, even if they are not corrupt. It's not a bug
August 10, 2025 at 11:30 AM
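To make the contrast in the two WAL posts above concrete, here is a minimal Python sketch of a per-record checksum with a configurable failure policy. Everything in it is an assumption for illustration: the record layout, the names `encode_record`, `read_wal`, and `ChecksumError`, and the three policy strings are hypothetical and do not reflect the actual SQLite or DuckDB on-disk formats or APIs.

```python
import struct
import zlib

# Hypothetical toy layout: 4-byte payload length, 4-byte CRC32, then the payload.
HEADER = struct.Struct("<II")


class ChecksumError(Exception):
    """Raised when a record's stored checksum does not match its payload."""


def encode_record(payload: bytes) -> bytes:
    """Frame one payload with its own length and checksum."""
    return HEADER.pack(len(payload), zlib.crc32(payload)) + payload


def read_wal(buf: bytes, on_error: str = "raise") -> list:
    """Replay records from buf and return the payloads that were accepted.

    on_error="raise": explicit error on the first checksum mismatch.
    on_error="stop":  silently drop the bad record and everything after it.
    on_error="skip":  drop only the bad record and keep later valid ones
                      (this assumes the length field itself is intact).
    """
    records = []
    offset = 0
    while offset + HEADER.size <= len(buf):
        length, crc = HEADER.unpack_from(buf, offset)
        start = offset + HEADER.size
        payload = buf[start:start + length]
        if len(payload) < length:
            break  # incomplete trailing record, e.g. a torn write
        if zlib.crc32(payload) != crc:
            if on_error == "raise":
                raise ChecksumError(f"bad checksum at offset {offset}")
            if on_error == "stop":
                break
            # on_error == "skip": fall through without keeping this record
        else:
            records.append(payload)
        offset = start + length
    return records
```

With three records written and one byte of the second payload flipped, the "stop" policy returns only the first record, "skip" returns the first and third, and "raise" surfaces a ChecksumError instead of losing data quietly.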
Reposted
SlateDB now has snapshot support! So the next step will soon be (possibly SSI?) transactions 😁

github.com/slatedb/slat...
feat: add DbSnapshot by flaneur2020 · Pull Request #688 · slatedb/slatedb
this pr aims to add the API for creating snapshot. a DbSnapshot contains all the read-only operations in a Db object. it records the seqnum at the moment when the DbSnapshot is created, and always ...
github.com
August 12, 2025 at 2:30 AM
Reposted
Woah, I didn't know this was happening. Open source Kafka now works with Iceberg!
August 18, 2025 at 5:16 PM
Reposted
"When not to use Tokio"--Loving this section in the docs of Tokio (tokio.rs/tokio/tutorial), a runtime for building async applications in Rust. There are no silver bullets, and it's vital to understand when a given library or tool adds value, and when it does not.
August 5, 2025 at 8:22 PM
Reposted
Joy wrote a good summary of the state (haha) of sync engines.

"Sync engines introduce familiar concepts (reactive programming and stream processing) to the web, with the added benefit of enabling local-first software."
April 1, 2025 at 5:40 PM
Reposted
[arXiv] Are Joins over LSM-trees Ready: Take RocksDB as an Example
arxiv.org/abs/2501.1...

> We implement all 29 join methods within our configuration space [...] to derive guidelines for join method selection in LSM-based stores.
February 6, 2025 at 11:50 PM
Reposted
This is a really cool idea. Seems like a nice clean way to implement range deletions.
Proposal: Introduce deletion vector file to reduce write amplification
Motivation: Deletion vector is a commonly used technique to reduce write amplification in columnar storage. This idea is quite sim...
docs.google.com
January 25, 2025 at 5:46 PM
Reposted
While zero dependencies is practically impossible, everyone I've spoken to agrees that minimizing dependencies is ideal. Rust and JavaScript work against this ideal. But they could change at any time. And Bun and Deno are already examples of this.

notes.eatonphil.com/2025-01-25-a...
January 25, 2025 at 2:37 PM
Reposted
I think @thegeeknarrator.bsky.social's podcast might be my favorite tech podcast right now. Signal to noise is off the charts.
I had a great time talking to Kaivalya Apte about Aurora DSQL on the Geek Narrator podcast. Watch our episode here: www.youtube.com/watch?v=ONkf...

We cover the design of Aurora DSQL, interesting aspects of the design, PostgreSQL, Firecracker, and trade-offs of database systems. A lot in an hour!
AWS Aurora Distributed SQL internals with Marc Brooker - @amazonwebservices
YouTube video by The Geek Narrator
www.youtube.com
January 25, 2025 at 5:35 PM
Reposted
Lakehouse brought many advances in data management at scale. Is it good enough for the next generation of data management? I took a look at Google's Napa system and envisioned what next-gen Lakehouse could look like from data lake -> lakehouse -> LakeDB
www.dataengineeringw...
Envisioning LakeDB: The Next Evolution of the Lakehouse Architecture
A Conceptual Framework for Next-Generation Data Platforms
www.dataengineeringweekly.com
January 24, 2025 at 8:42 PM
Reposted
I spent some time checking Fluss, the new streaming platform from Alibaba. Here are some findings: www.streamingdata.tech/p/fluss-firs...
Fluss: First Impression
Table is a new stream.
www.streamingdata.tech
December 5, 2024 at 9:42 PM
Reposted
The Flink community has compiled a growing list of Flink CDC articles on this Wiki page. Check it out! github.com/apache/flink...
Flink CDC Blog
Flink CDC is a streaming data integration tool. Contribute to apache/flink-cdc development by creating an account on GitHub.
github.com
December 16, 2024 at 11:49 AM
Reposted
I truly believe Flink CDC is the future of change data capture solutions and we are doing a lot of work to adopt it. If you want to know why, @rmoff.net has published a wonderful article showing all the benefits with real examples #dataBS www.decodable.co/blog/explori...
Exploring Flink CDC
Explore the functionality of Flink CDC, which provides Flink source connectors that can be configured using YAML, and where it can help simplify your pipelines compared to using Flink SQL.
www.decodable.co
December 11, 2024 at 6:16 PM
Reposted
The Warpstream team are flying.
November 21, 2024 at 1:40 PM
Reposted
Apache DataFusion Comet 0.4.0 has been released! See the blog post for details.

datafusion.apache.org/blog/2024/11...
Apache DataFusion Comet 0.4.0 Release
datafusion.apache.org
November 20, 2024 at 9:39 PM
Reposted
This is nice: producer-driven changes.

Allowing producers to change schemas without breaking consumers, e.g., producers can add new fields and delete optional fields.

speakerdeck.com/gunnarmorlin...
Data Contracts In Practice With Debezium and Apache Flink
Log-based change data capture (CDC) is an invaluable part of the data engineering toolbox: it enables a variety of use cases such as real-time analytics...
speakerdeck.com
November 7, 2024 at 12:16 PM
Reposted
Got some burning #Debezium / CDC question you never dared to ask? Then join me for a live AMA session next week, and I'll try my best to get it answered! Sign up at dcdbl.co/3YCYx5Z, would love to see you there 🤓.

🗓️ Nov 14
⏰ 9-10 am PST, 5-6 pm UTC
Join Gunnar Morling, ex-Debezium lead, for a live Q&A on CDC.
Join Gunnar Morling, ex-Debezium lead, for a live Q&A on CDC. Learn CDC basics, implementation tips, and compare Debezium with other approaches. Don't miss out!
dcdbl.co
November 7, 2024 at 6:48 PM
Reposted
Data lakehouse in a box!

Hits a lot of my favorites!

✅ S3
✅ Iceberg
✅ DuckDB
✅ Postgres

github.com/BemiHQ/BemiDB
GitHub - BemiHQ/BemiDB: Postgres read replica optimized for analytics
Postgres read replica optimized for analytics. Contribute to BemiHQ/BemiDB development by creating an account on GitHub.
github.com
November 7, 2024 at 7:07 PM
Reposted
I've seen many infrastructures standardize on having a Kafka Proxy before writing to Kafka. Is there any implementation that supports an ordering guarantee (writes to Kafka in the same order it receives them)?
November 7, 2024 at 4:49 PM
Reposted
This is a great write-up on data contracts by @gunnar.bsky.social. I particularly like the focus on data on the inside vs. data on the outside. Discovered via @boyney123.bsky.social

www.decodable.co/blog/change-...
“Change Data Capture Breaks Encapsulation”. Does it, though?
When—and when not—CDC can break encapsulation, whether it matters, and strategies for avoiding these problems when it does happen.
www.decodable.co
November 7, 2024 at 8:13 PM
Reposted
Looking for a distraction? Try this great interview between @hannes.muehleisen.org and @medriscoll.bsky.social covering all things @duckdb.org. I especially enjoyed the philosophy around improving SQL usability. www.youtube.com/watch?v=a-Rm... #databs
Data Talks on the Rocks 5 - Hannes Mühleisen, DuckDB
YouTube video by Rill Data
www.youtube.com
November 7, 2024 at 11:16 PM