Lightnews — Scholar-powered news

Reposted by Andy Pavlo

CMU Database Group @db.cs.cmu.edu · 2d

Today's Future Data Systems Seminar Speaker: Jordan Tigani (@jrdntgn.bsky.social) will present how @motherduck.com supports modern workloads with DuckLake. Zoom talk open to public at 4:30pm ET. YouTube video available after: db.cs.cmu.edu/events/futur...

[Future Data] DuckLake: Learning from Cloud Data Warehouses to Build a Robust "Lakehouse" - Carnegie Mellon Database Group

When building scalable data systems, it is easy to focus on the...

db.cs.cmu.edu

Andy Pavlo @andypavlo.bsky.social · 5d

Good. You need to build your strength back up to prep for your next challenge as CS dept chair!

Andy Pavlo @andypavlo.bsky.social · 6d

What are you talking about? MapReduce is the opposite of "moving compute to the data". It was all about moving/pulling the data to compute in a shared-disk architecture. See this old paper: dl.acm.org/doi/10.1145/...

A comparison of approaches to large-scale data analysis | Proceedings of the 2009 ACM SIGMOD International Conference on Management of data

dl.acm.org

Andy Pavlo @andypavlo.bsky.social · 7d

There was a collaboration attempt between CMU, Tsinghua, Meta, CWI, Nvidia, Voltron, & SpiralDB. But then lawyers got involved and it fell apart. Everyone released their own format:
→ Meta Nimble: github.com/facebookincu...
→ CWI FastLanes: github.com/cwida/FastLa...
→ SpiralDB Vortex: vortex.dev

GitHub - facebookincubator/nimble: New file format for storage of large columnar datasets.

github.com

Andy Pavlo @andypavlo.bsky.social · 7d

Our F3 files embed small WASM programs to decode data. If somebody creates a new encoding and the DBMS does not have a native impl, it can still read the data using WASM, passing Arrow buffers. Our experiments show WASM is 15-20% slower than native. We use @spiraldb.com's Vortex encoding impls.

Overview of F3's decoding pipeline with WASM support.
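
The fallback described in the post above (use a native decoder when the reader has one, otherwise run the decoder shipped with the file) can be sketched roughly as follows. This is only an illustration under stated assumptions: the encoding names, `NATIVE_DECODERS`, `pick_decoder`, and `read_column` are all hypothetical, and a plain Python function stands in for the embedded WASM program. It is not F3's actual API.

```python
from typing import Callable, Dict, List

# Hypothetical decoder signature: encoded bytes in, decoded integers out.
Decoder = Callable[[bytes], List[int]]


def decode_plain(buf: bytes) -> List[int]:
    """Stand-in for a natively implemented decoder: one byte per value."""
    return list(buf)


def decode_delta_embedded(buf: bytes) -> List[int]:
    """Stand-in for a decoder shipped inside the file (WASM in the post).

    Each byte is a delta that is added to a running total.
    """
    out: List[int] = []
    running = 0
    for delta in buf:
        running += delta
        out.append(running)
    return out


# Encodings this hypothetical reader implements natively.
NATIVE_DECODERS: Dict[str, Decoder] = {"plain": decode_plain}


def pick_decoder(encoding: str, embedded: Dict[str, Decoder]) -> Decoder:
    """Prefer the native decoder; fall back to the one embedded in the file."""
    if encoding in NATIVE_DECODERS:
        return NATIVE_DECODERS[encoding]
    if encoding in embedded:
        return embedded[encoding]
    raise ValueError(f"no decoder available for encoding {encoding!r}")


def read_column(encoding: str, buf: bytes, embedded: Dict[str, Decoder]) -> List[int]:
    """Decode one column chunk with whichever decoder is available."""
    return pick_decoder(encoding, embedded)(buf)
```

In this sketch, a reader that has never heard of the `"delta"` encoding can still decode it as long as the file carries a decoder for it, at the cost of running the slower embedded path.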

Andy Pavlo @andypavlo.bsky.social · 7d

One problem with Parquet is many implementations are not updated when the official spec improves. Everyone just uses the lowest version feature set. That means if Parquet adds a better data encoding scheme and a file uses it, many common reader libraries won't be able to retrieve the data.

Survey of the features used in public Parquet files.

Andy Pavlo @andypavlo.bsky.social · 7d

Our SIGMOD paper with our friends at Tsinghua + @wesmckinney.com + @pateljm.bsky.social on creating a next generation open-source data file format is out. F3 is a future-proof file format that avoids the mistakes of Parquet.
📄 Paper: db.cs.cmu.edu/papers/2025/...
📁 Code: github.com/future-file-...

F3: The Open-Source Data File Format for the Future
SIGMOD 2025

Reposted by Andy Pavlo

CMU Database Group @db.cs.cmu.edu · 9d

Today's Future Data Systems Seminar Speaker: Vinoth Chandar will present the internals of Apache Hudi and his work at Onehouse. Zoom talk open to public at 4:30pm ET. YouTube video available after: db.cs.cmu.edu/events/futur...

[Future Data] Apache Hudi: A Database Layer over Cloud Storage for Fast Mutations and Efficient Queries - Carnegie Mellon Database Group

Data lakes emerged as a way to store vast amounts of data...

db.cs.cmu.edu

Reposted by Andy Pavlo

CMU Database Group @db.cs.cmu.edu · 16d

Today's Future Data Systems Seminar Speaker: Russell Spitzer will present the internals of Apache Iceberg's query planner and execution engine. Zoom talk open to public at 4:30pm ET. YouTube video available after: db.cs.cmu.edu/events/futur...

[Future Data] An Extremely Technical Overview of how the Apache Iceberg™ Planning Implementation Actually Works - Carnegie Mellon Database Group

What are you trying to tell me? That I can read data...

db.cs.cmu.edu

Andy Pavlo @andypavlo.bsky.social · 20d

Shoot I don't know how I missed that when I was copy-pasting. It wasn't intentional. Sorry :-(

Andy Pavlo @andypavlo.bsky.social · 21d

Fall 2025 Seminar Schedule:
Sep 22: Apache Iceberg
Sep 29: Apache Hudi
Oct 06: @motherduck.com
Oct 13: SpiralDB Vortex
Oct 27: @singlestore.com
Nov 03: @deltalakeoss.bsky.social
Nov 10: Mooncake
Nov 17: @firebolthq.bsky.social
Nov 24: @xtdb.com
Dec 01: Apache Polaris

Future Data Systems Seminar Schedule (Fall 2025)

Andy Pavlo @andypavlo.bsky.social · 21d

Next week is the start of @db.cs.cmu.edu's latest seminar series: Future Data Systems
@samarchdb.bsky.social and I are hosting speakers from leading systems in the datalake / lakehouse space.
Mondays @ 4:30pm ET via Zoom. Open to the public. Videos posted to YouTube: db.cs.cmu.edu/seminars/fal...

Carnegie Mellon University
Future Data Systems
Fall 2025 Seminar Series
Mondays @ 4:30pm ET

Andy Pavlo @andypavlo.bsky.social · 28d

I don't know what to say. You dream about it for so long and then when it finally happens you're in shock. I'm so proud of you Larry. www.theguardian.com/technology/2...

Larry Ellison overtakes Elon Musk as world’s richest person

Oracle co-founder’s shares rose by 40% in early trading, valuing his fortune at $393bn, just ahead of Musk’s $384bn

www.theguardian.com

Reposted by Andy Pavlo

CedarDB @cedardb.com · 29d

What if a database could be your game engine?

During parental leave, @lukasvogel.bsky.social built DOOMQL: A multiplayer DOOM-like where everything (rendering, game loop, state) runs in pure SQL on CedarDB.
It's fast, ridiculous, and surprisingly elegant.

Full write-up: cedardb.com/blog/doomql

Andy Pavlo @andypavlo.bsky.social · Aug 25

Thank you to our @db.cs.cmu.edu Affiliate companies for their support this academic year:
• @clickhouse.com
• @datastax.com
• @getdbt.com
• @firebolthq.bsky.social
• @motherduck.com
• RelationalAI
• @singlestore.com
• @spiraldb.com
• PingCAP / TiDB
• Yellowbrick
• @yugabytedb.bsky.social

CMU-DB 2025 Industry Affiliate Program Members
https://db.cs.cmu.edu/affiliates/

Andy Pavlo @andypavlo.bsky.social · Aug 25

Everything is available for free to non-CMU students:
• Lectures on YouTube: www.youtube.com/playlist?lis...
• Slides + Notes + Homeworks on course website.
• Project source code on GitHub: github.com/cmu-db/bustub
• Grading with Gradescope (see FAQ ➡️ 15445.courses.cs.cmu.edu/fall2025/faq...)

How can people not enrolled in the class test their projects?
All of the source code for the projects is available on GitHub. There is a Gradescope submission site available to non-CMU students.

Andy Pavlo @andypavlo.bsky.social · Aug 25

Today is the start of the new semester for @db.cs.cmu.edu's Intro to Database Systems! We're going harder into the material than before. More challenging projects but you can use LLMs to help. We also have 10min talks each Wed from leading DB companies: 15445.courses.cs.cmu.edu/fall2025

CMU 15-445/645 :: Intro to Database Systems (Fall 2025)

You want to know whether this is the premier course at Carnegie Mellon University on the design and implementation of database management systems? Well, it is. This course rips through data models (re...

15445.courses.cs.cmu.edu

Reposted by Andy Pavlo

Jonathan Aldrich @jonathanaldrich.bsky.social · Aug 6

Launching my Programming Language Pragmatics talks! These short, accessible talks cover the material in the textbook, the 5th edition of which I wrote with Michael L. Scott. The first one introduces the topic and talks about why we study programming languages!

www.youtube.com/watch?v=hwL0...

PLP 1.1: Introduction to Programming Languages

YouTube video by Jonathan Aldrich

www.youtube.com

Andy Pavlo @andypavlo.bsky.social · Aug 4

The report of my death was an exaggeration. I am still alive and will be in SFO this week to speak about using LLMs to automatically tune databases. Wed Aug 6th @ 5:30pm at Databricks MTV: lu.ma/ha0dc4nj

Reposted by Andy Pavlo

Alex Miller @alexmillerdb.bsky.social · Jul 23

Attention, South Bay folk! We have The Databaseologist, @andypavlo.bsky.social, giving a talk in the bay on August 6th. Come join us for a great time in hearing:

ChatGPT Ain’t Got $%@& On Me! The Future of Automated Database Tuning

Register now! https://lu.ma/ha0dc4nj

South Bay Systems: ChatGPT Ain’t Got $%@& On Me! The Future of Automated Database Tuning · Luma

We're excited to feature Andy Pavlo, illustrious database professor at CMU, to talk about database tuning. This meetup's venue, food and drinks, are generously…

lu.ma

Andy Pavlo @andypavlo.bsky.social · Jul 25

This is the successor to BtrBlocks: github.com/AnyBlox/vldb...

GitHub - AnyBlox/vldb-2025: Reproducibility package for the VLDB 2025 submission

github.com

Andy Pavlo @andypavlo.bsky.social · Jul 3

No system hits the sweet spot of allowing for extensibility while maintaining systems safety. It would be nice if there was a standard plugin API (think POSIX) that allows compatibility across systems.

Thanks to @marcoslot.com + @daveandersen.bsky.social for their collaboration on this project

Safety vs. Flexibility quadchart for different DBMSs. VLDB 2025
https://doi.org/10.14778/3725688.3725719

Andy Pavlo @andypavlo.bsky.social · Jul 3

About 16% of PostgreSQL extns are incompatible with at least one other extn. Common problems include not enforcing APIs, undefined behaviors, and memory errors. Heavyweight extensions like Citus + @timescaledb.bsky.social have the most issues because they touch more DBMS internal parts.

Success/error scores of compatibility test matrix for PostgreSQL extensions. VLDB 2025
https://doi.org/10.14778/3725688.3725719

Andy Pavlo @andypavlo.bsky.social · Jul 3

Abi created a torture chamber that downloads every extension we could find and automatically installs them in different combinations to see what breaks. We expanded our analysis to include other popular open-source DBMSs but could not break them.
Code: github.com/cmu-db/ext-a...

Table of the different types of extensions supported in open-source DBMSs. VLDB 2025
https://doi.org/10.14778/3725688.3725719
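
The approach described in the post above (install extensions together in different combinations and record what breaks) can be sketched roughly as follows. This is a minimal illustration, not the project's actual tooling: `pairwise_matrix`, `incompatible_fraction`, `fake_try_install`, and the extension names are hypothetical, and the install check is a stub that consults a made-up conflict list where a real harness would install into a fresh DBMS instance and run tests.

```python
from itertools import combinations
from typing import Callable, Dict, Iterable, Set, Tuple

Pair = Tuple[str, str]


def pairwise_matrix(
    extensions: Iterable[str],
    try_install: Callable[[Pair], bool],
) -> Dict[Pair, bool]:
    """Try every unordered pair of extensions and record success or failure."""
    return {pair: try_install(pair) for pair in combinations(sorted(extensions), 2)}


def incompatible_fraction(results: Dict[Pair, bool]) -> float:
    """Fraction of tested extensions that failed with at least one partner."""
    seen: Set[str] = set()
    broken: Set[str] = set()
    for pair, ok in results.items():
        seen.update(pair)
        if not ok:
            broken.update(pair)
    return len(broken) / len(seen) if seen else 0.0


# Hypothetical conflict list used only to make the sketch runnable.
KNOWN_CONFLICTS = {frozenset({"ext_a", "ext_b"})}


def fake_try_install(pair: Pair) -> bool:
    """Stub: report failure only for pairs in the made-up conflict list."""
    return frozenset(pair) not in KNOWN_CONFLICTS


if __name__ == "__main__":
    results = pairwise_matrix(["ext_a", "ext_b", "ext_c", "ext_d"], fake_try_install)
    for pair, ok in results.items():
        print(pair, "ok" if ok else "FAILED")
    print("incompatible fraction:", incompatible_fraction(results))
```

Unordered pairs ignore load order. Swapping `combinations` for `itertools.permutations` would also test each pair in both orders.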

Andy Pavlo @andypavlo.bsky.social · Jul 3

The motivation for this paper began in 2021 when we switched our research to target Postgres via extensions. We found many extensions copied PostgreSQL code into their own code. We also hit problems if we loaded extensions in the wrong order.
