Andy Pavlo
@andypavlo.bsky.social
4.8K followers 52 following 68 posts
Associate Prof. of Databases @ Carnegie Mellon.
Reposted by Andy Pavlo
andypavlo.bsky.social
Good. You need to build your strength back up to prep for your next challenge as CS dept chair!
andypavlo.bsky.social
What are you talking about? MapReduce is the opposite of "moving compute to the data". It was all about moving/pulling the data to compute in a shared-disk architecture. See this old paper: dl.acm.org/doi/10.1145/...
A comparison of approaches to large-scale data analysis | Proceedings of the 2009 ACM SIGMOD International Conference on Management of data
andypavlo.bsky.social
There was a collaboration attempt between CMU, Tsinghua, Meta, CWI, Nvidia, Voltron, & SpiralDB. But then lawyers got involved and it fell apart. Everyone released their own format:
→ Meta Nimble: github.com/facebookincu...
→ CWI FastLanes: github.com/cwida/FastLa...
→ SpiralDB Vortex: vortex.dev
GitHub - facebookincubator/nimble: New file format for storage of large columnar datasets.
andypavlo.bsky.social
Our F3 files embed small WASM programs to decode data. If somebody creates a new encoding and the DBMS does not have a native impl, it can still read the data using WASM, passing Arrow buffers. Our experiments show WASM is 15-20% slower than native. We use @spiraldb.com's Vortex encoding impls.
Overview of F3's decoding pipeline with WASM support.
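A minimal sketch of the fallback idea in the post above, assuming a reader that prefers a native decoder and otherwise uses the decoder shipped inside the file. Every name here (`NATIVE_DECODERS`, `decode_column`, the toy encodings) is hypothetical and not taken from the F3 code; a plain Python function stands in for the embedded WASM program, and lists of ints stand in for Arrow buffers.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical type: a decoder turns raw bytes into a list of integer values.
Decoder = Callable[[bytes], List[int]]


def decode_plain(buf: bytes) -> List[int]:
    """Toy native decoder: one unsigned byte per value."""
    return list(buf)


# Decoders this imaginary DBMS ships natively, keyed by encoding name.
NATIVE_DECODERS: Dict[str, Decoder] = {"plain": decode_plain}


def decode_rle_embedded(buf: bytes) -> List[int]:
    """Stand-in for a decoder embedded in the file (WASM in F3's case).

    Input is a sequence of (run_length, value) byte pairs.
    """
    out: List[int] = []
    for i in range(0, len(buf), 2):
        out.extend([buf[i + 1]] * buf[i])
    return out


def decode_column(
    encoding: str, buf: bytes, embedded: Optional[Decoder] = None
) -> List[int]:
    """Use the native decoder if one exists, else the file's embedded one."""
    native = NATIVE_DECODERS.get(encoding)
    if native is not None:
        return native(buf)
    if embedded is not None:
        return embedded(buf)
    raise ValueError(f"no decoder available for encoding {encoding!r}")
```

With this shape, a file written with an encoding the reader has never seen stays readable as long as it carries its own decoder, at the cost of the slower embedded path.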
andypavlo.bsky.social
One problem with Parquet is that many implementations are not updated when the official spec improves. Everyone just uses the lowest version feature set. That means if Parquet adds a better data encoding scheme and a file uses it, many common reader libraries won't be able to retrieve the data.
Survey of the features used in public Parquet files.
andypavlo.bsky.social
Our SIGMOD paper with our friends at Tsinghua + @wesmckinney.com + @pateljm.bsky.social on creating a next generation open-source data file format is out. F3 is a future-proof file format that avoids the mistakes of Parquet.
📄 Paper: db.cs.cmu.edu/papers/2025/...
📁 Code: github.com/future-file-...
F3: The Open-Source Data File Format for the Future
SIGMOD 2025
Reposted by Andy Pavlo
db.cs.cmu.edu
Today's Future Data Systems Seminar Speaker: Vinoth Chandar will present the internals of Apache Hudi and his work at Onehouse. Zoom talk open to public at 4:30pm ET. YouTube video available after: db.cs.cmu.edu/events/futur...
[Future Data] Apache Hudi: A Database Layer over Cloud Storage for Fast Mutations and Efficient Queries - Carnegie Mellon Database Group
Data lakes emerged as a way to store vast amounts of data...
Reposted by Andy Pavlo
db.cs.cmu.edu
Today's Future Data Systems Seminar Speaker: Russell Spitzer will present the internals of Apache Iceberg's query planner and execution engine. Zoom talk open to public at 4:30pm ET. YouTube video available after: db.cs.cmu.edu/events/futur...
[Future Data] An Extremely Technical Overview of how the Apache Iceberg™ Planning Implementation Actually Works - Carnegie Mellon Database Group
What are you trying to tell me? That I can read data...
andypavlo.bsky.social
Shoot, I don't know how I missed that when I was copy-pasting. It wasn't intentional. Sorry :-(
andypavlo.bsky.social
Fall 2025 Seminar Schedule:
Sep 22: Apache Iceberg
Sep 29: Apache Hudi
Oct 06: @motherduck.com
Oct 13: SpiralDB Vortex
Oct 27: @singlestore.com
Nov 03: @deltalakeoss.bsky.social
Nov 10: Mooncake
Nov 17: @firebolthq.bsky.social
Nov 24: @xtdb.com
Dec 01: Apache Polaris
Future Data Systems Seminar Schedule (Fall 2025)
andypavlo.bsky.social
Next week is the start of @db.cs.cmu.edu's latest seminar series: Future Data Systems
@samarchdb.bsky.social and I are hosting speakers from leading systems in the datalake / lakehouse space.
Mondays @ 4:30pm ET via Zoom. Open to the public. Videos posted to YouTube: db.cs.cmu.edu/seminars/fal...
Carnegie Mellon University, Future Data Systems: Fall 2025 Seminar Series, Mondays @ 4:30pm ET
Reposted by Andy Pavlo
cedardb.com
What if a database could be your game engine?

During parental leave @lukasvogel.bsky.social built DOOMQL: A multiplayer DOOM-like where everything (rendering, game loop, state) runs in pure SQL on CedarDB.
It's fast, ridiculous, and surprisingly elegant.

Full write-up: cedardb.com/blog/doomql
andypavlo.bsky.social
Thank you to our @db.cs.cmu.edu Affiliate companies for their support this academic year:
• @clickhouse.com
• @datastax.com
• @getdbt.com
• @firebolthq.bsky.social
• @motherduck.com
• RelationalAI
• @singlestore.com
• @spiraldb.com
• PingCAP / TiDB
• Yellowbrick
• @yugabytedb.bsky.social
CMU-DB 2025 Industry Affiliate Program Members
https://db.cs.cmu.edu/affiliates/
andypavlo.bsky.social
Everything is available for free to non-CMU students:
• Lectures on YouTube: www.youtube.com/playlist?lis...
• Slides + Notes + Homeworks on course website.
• Project source code on GitHub: github.com/cmu-db/bustub
• Grading with Gradescope (see FAQ ➡️ 15445.courses.cs.cmu.edu/fall2025/faq...)
How can people not enrolled in the class test their projects?
All of the source code for the projects is available on GitHub. There is a Gradescope submission site available to non-CMU students.
Reposted by Andy Pavlo
jonathanaldrich.bsky.social
Launching my Programming Language Pragmatics talks! These short, accessible talks cover the material in the textbook, the 5th edition of which I wrote with Michael L. Scott. The first one introduces the topic and talks about why we study programming languages!

www.youtube.com/watch?v=hwL0...
PLP 1.1: Introduction to Programming Languages
YouTube video by Jonathan Aldrich
andypavlo.bsky.social
The report of my death was an exaggeration. I am still alive and will be in SFO this week to speak about using LLMs to automatically tune databases. Wed Aug 6th @ 5:30pm at Databricks MTV: lu.ma/ha0dc4nj
Reposted by Andy Pavlo
andypavlo.bsky.social
No system hits the sweet spot of allowing for extensibility while maintaining system safety. It would be nice if there were a standard plugin API (think POSIX) that allows compatibility across systems.

Thanks to @marcoslot.com + @daveandersen.bsky.social for their collaboration on this project
Safety vs. Flexibility quadchart for different DBMSs. VLDB 2025
https://doi.org/10.14778/3725688.3725719
andypavlo.bsky.social
About 16% of PostgreSQL extns are incompatible with at least one other extn. Common problems include not enforcing APIs, undefined behaviors, and memory errors. Heavyweight extensions like Citus + @timescaledb.bsky.social have the most issues because they touch more of the DBMS internals.
Success/error scores of compatibility test matrix for PostgreSQL extensions. VLDB 2025
https://doi.org/10.14778/3725688.3725719
andypavlo.bsky.social
Abi created a torture chamber that downloads every extension we could find and automatically installs them in different combinations to see what breaks. We expanded our analysis to include other popular open-source DBMSs but could not break them.
Code: github.com/cmu-db/ext-a...
Table of the different types of extensions supported in open-source DBMSs. VLDB 2025
https://doi.org/10.14778/3725688.3725719
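A rough sketch of what a pairwise compatibility matrix like the one described above could look like. This is not the code from the linked repo: `pairwise_failures`, `incompatible_fraction`, and the toy `fake_install` rule are hypothetical names, and a real harness would install extensions into an actual DBMS instead of calling a stub. It only tries unordered pairs; catching load-order problems would need ordered pairs as well.

```python
from itertools import combinations
from typing import Callable, Dict, Iterable, Tuple


def pairwise_failures(
    extensions: Iterable[str],
    install_together: Callable[[str, str], None],
) -> Dict[Tuple[str, str], str]:
    """Try every unordered pair and record the ones that raise an error."""
    failures: Dict[Tuple[str, str], str] = {}
    for a, b in combinations(sorted(extensions), 2):
        try:
            install_together(a, b)
        except Exception as exc:  # any failure counts as incompatible
            failures[(a, b)] = str(exc)
    return failures


def incompatible_fraction(
    extensions: Iterable[str],
    failures: Dict[Tuple[str, str], str],
) -> float:
    """Fraction of extensions that fail with at least one other extension."""
    names = list(extensions)
    broken = {name for pair in failures for name in pair}
    return len(broken) / len(names) if names else 0.0


def fake_install(a: str, b: str) -> None:
    """Toy installer: two made-up extensions claim the same hook."""
    if {a, b} == {"ext_a", "ext_c"}:
        raise RuntimeError("hook conflict")


EXTS = ["ext_a", "ext_b", "ext_c", "ext_d"]
FAILURES = pairwise_failures(EXTS, fake_install)
```

Here `FAILURES` holds the single conflicting pair, and `incompatible_fraction(EXTS, FAILURES)` counts an extension as incompatible if it appears in any failing pair, which matches the "incompatible with at least one other extn" framing.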
andypavlo.bsky.social
The motivation for this paper dates to 2021, when we switched our research to target Postgres via extensions. We found many extensions copied PostgreSQL code into their own code. We also hit problems if we loaded extensions in the wrong order.