Lightnews — Scholar-powered news

Andrew Lamb @andrewlamb1111.bsky.social · 1d

I 100% agree that including a WASM-based encoder/decoder will be a barrier to implementation for any file format (including Parquet).

My broader point was that there is no technical reason it could not be added to Parquet, not that it necessarily could or should be added

Andrew Lamb @andrewlamb1111.bsky.social · 1d

BTW if anyone wants a good intro to database storage / log-structured storage (aka LSM trees), the @db.cs.cmu.edu lecture from this fall is a good one: www.youtube.com/watch?v=2_sT...

#05 - Log-Structured Database Storage ✸ SingleStore Database Talk (CMU Intro to Database Systems)

YouTube video by CMU Database Group

Andrew Lamb @andrewlamb1111.bsky.social · 6d

The only thing is getting consensus -- there is no technical blocker

Andrew Lamb @andrewlamb1111.bsky.social · 6d

It starts: github.com/clflushopt/t...

@clflushopt.bsky.social is going to make the world's fastest TPC-DS generator

GitHub - clflushopt/tpcdsgen: WIP (out of tree) Rust implementation of TPC-DS generators.

Andrew Lamb @andrewlamb1111.bsky.social · 7d

"It is not 100% clear to me how a new file format (or three) will drive additional ecosystem adoption 🤔"

However, I absolutely think this adds to the pressure for Parquet to evolve.

Speaking of, anyone interested in helping add new encodings to Parquet?
lists.apache.org/thread/djnbb...

Andrew Lamb @andrewlamb1111.bsky.social · 9d

Apache DataFusion 50 is released. Read all about it here: datafusion.apache.org/blog/2025/09...

Andrew Lamb @andrewlamb1111.bsky.social · 12d

Cloudflare's distributed R2 SQL engine is a pretty good exemplar of how to build a serverless database to process petabytes in seconds using Apache DataFusion and Apache Parquet

blog.cloudflare.com/r2-sql-deep-...

R2 SQL: a deep dive into our new distributed query engine

R2 SQL provides a built-in, serverless way to run ad-hoc analytic queries against your R2 Data Catalog. This post dives deep under the Iceberg into how we built this distributed engine, from its metad...

Reposted by Andrew Lamb

Dewey Dunnington @paleolimbot.bsky.social · 13d

I cannot say enough about DataFusion...in order to build an engine that considers spatial types at every level we needed to customize types, functions, optimizer rules, joins, Parquet pruning, and more. DataFusion not only made this possible but documented even the most obscure bits. So cool!

Andrew Lamb @andrewlamb1111.bsky.social · 14d

"Introducing SedonaDB: A single-node analytical database engine with geospatial as a first-class citizen"

Built in Rust with Apache DataFusion

sedona.apache.org/latest/blog/...

Introducing SedonaDB: A single-node analytical database engine with geospatial as a first-class citizen - Apache Sedona

Apache Sedona is a cluster computing system for processing large-scale spatial data. Sedona extends existing cluster computing systems, such as Apache Spark, Apache Flink, and Snowflake, with a set of...

Andrew Lamb @andrewlamb1111.bsky.social · 13d

So cool -- jcsherin added full-text indexes into Parquet files using the techniques from our blog

github.com/jcsherin/dat...

Andrew Lamb @andrewlamb1111.bsky.social · 19d

We just published an easier-to-find list of all PMC members and committers of Apache DataFusion, and it is quite a cool list of people and affiliations if I do say so myself 🤗
datafusion.apache.org/contributor-...

Andrew Lamb @andrewlamb1111.bsky.social · 21d

And we are also adding Geometry to the Rust Parquet implementation. Huge thanks to @kylebarron.dev: github.com/apache/arrow...

[EPIC] [Parquet] Implement Geometry and Geography type support in Parquet · Issue #8373 · apache/arrow-rs

Is your feature request related to a problem or challenge? Please describe what you are trying to do. Parquet recently adopted Geometry and Geography types: apache/parquet-format@master/Geospatial....

Andrew Lamb @andrewlamb1111.bsky.social · 21d

It was a great time on Monday at the @apachedatafusion.bsky.social meetup in NYC. We heard about distributed query plans, filter pushdown, geospatial support, and VegaFusion.

More deets here: github.com/apache/dataf...

Andrew Lamb @andrewlamb1111.bsky.social · 26d

One unfortunate aspect of new AI tools is that they make it easier to generate large amounts of plausible-looking code, which puts an even bigger burden on reviewers, and reviewers are already the resource open source projects have the least of.

Andrew Lamb @andrewlamb1111.bsky.social · 27d

"downloading [the dataset] could take days. Luckily, we could use the tpchgen-rs tool, a pure Rust implementation of the TPC-H generator that can produce large-scale datasets on the laptop in just a few hours."

Thanks for the shout out 🙇 to
github.com/clflushopt/t...
(fyi @clflushopt.bsky.social)

GitHub - clflushopt/tpchgen-rs: TPC-H benchmark data generation in pure Rust

Andrew Lamb @andrewlamb1111.bsky.social · 27d

Dynamic Filters for TopK and Join queries landing in DataFusion 50.0.0: datafusion.apache.org/blog/2025/09...

Andrew Lamb @andrewlamb1111.bsky.social · 28d

What is LiquidCache in these slides: what-is-liquid-cache.xiangpeng.systems

BTW @xiangpeng.systems is looking for some early adopters who want to be on the bleeding edge. Hit me up if interested

Andrew Lamb @andrewlamb1111.bsky.social · Sep 6

Recording of "Introduction to Variant in @ApacheParquet": www.youtube.com/watch?v=nlOJ...

Here are the slides: docs.google.com/presentation...

Reposted by Andrew Lamb

Phil Eaton @eatonphil.bsky.social · Sep 4

I've been helping our analytics team integrate our DataFusion-based query engine for Postgres into EDB Postgres Distributed and finally here's an end-to-end demo.

You get HA Postgres plus seamless replication and DataFusion-based queries. This query turned out 6x faster than PG.

Andrew Lamb @andrewlamb1111.bsky.social · Sep 5

We want `brew install tpchgen-cli` to work, but that requires the project to be "popular" enough according to Homebrew (30 forks, 30 watchers and 75 stars)

We just need 6 more forks and 24 watchers. Can you help us out?

github.com/clflushopt/t...

Deets: github.com/clflushopt/t...
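
Read literally, the shortfall quoted above implies the repository currently sits at 24 forks and 6 watchers. A minimal Python sketch of that arithmetic, assuming the three thresholds are exactly as stated in the post and that the star count already meets its threshold (the post does not give it, so the zero shortfall for stars is a placeholder); all names are illustrative:

```python
# Thresholds as quoted in the post (not taken from Homebrew's documentation).
REQUIRED = {"forks": 30, "watchers": 30, "stars": 75}

# Shortfalls quoted in the post; stars are not mentioned, so 0 is an assumption.
SHORTFALL = {"forks": 6, "watchers": 24, "stars": 0}

# Implied current counts (only a lower bound for stars, under the assumption above).
current = {name: REQUIRED[name] - SHORTFALL[name] for name in REQUIRED}


def missing(counts: dict[str, int]) -> dict[str, int]:
    """Return how many more of each metric are needed to reach the thresholds."""
    return {name: max(0, REQUIRED[name] - counts.get(name, 0)) for name in REQUIRED}


print(current)
print(missing(current))
```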

Reposted by Andrew Lamb

InfoQ @infoq.com · Aug 28

Discover why #RustLang is the right choice for building #LowLatency systems: not just for code’s performance, but also for productivity & developer joy.

🎧 Listen to the #InfoQ #podcast with Andrew Lamb for more insights: bit.ly/47kgmwU

#AI #ProgrammingLanguages #Performance #Concurrency #DevEx

Andrew Lamb @andrewlamb1111.bsky.social · Sep 4

Thanks to @clflushopt.bsky.social, make massive TPC-H datasets with tpchgen-cli 2.0:

SF1000 (1TB raw, 220GB in @ApacheParquet) in less than 10 mins (6m45s) on an aging laptop

Try it now:

pip install tpchgen-cli
tpchgen-cli --scale-factor 1000 --parts 100 --format=parquet

github.com/clflushopt/t...
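
For a rough sense of what those numbers imply, here is a back-of-the-envelope sketch in Python. It assumes decimal units (1 TB = 1000 GB) and takes the quoted sizes and the 6m45s runtime at face value; the variable names are illustrative only.

```python
# Figures quoted in the post; decimal units (1 TB = 1000 GB) are an assumption.
raw_gb = 1000.0        # SF1000 raw size
parquet_gb = 220.0     # size when written as Parquet
elapsed_s = 6 * 60 + 45  # 6m45s expressed in seconds

# Raw-equivalent data produced per second of wall-clock time.
raw_rate = raw_gb / elapsed_s
# Parquet bytes actually written per second.
parquet_rate = parquet_gb / elapsed_s
# How much smaller the Parquet output is than the raw data.
compression_ratio = raw_gb / parquet_gb

print(f"elapsed: {elapsed_s} s")
print(f"raw-equivalent throughput: {raw_rate:.2f} GB/s")
print(f"Parquet write throughput: {parquet_rate:.2f} GB/s")
print(f"raw-to-Parquet size ratio: {compression_ratio:.2f}x")
```

Under those assumptions the run works out to roughly 2.5 GB/s of raw-equivalent data, with the Parquet output about 4.5 times smaller than the raw size.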

Andrew Lamb @andrewlamb1111.bsky.social · Aug 15

It is a common misconception that Parquet requires (slow) reparsing of metadata and is limited to built-in indexing structures.

Using External Indexes, Metadata Stores, Catalogs and Caches to Accelerate Queries on Parquet with @apachedatafusion.bsky.social

datafusion.apache.org/blog/2025/08...

Andrew Lamb @andrewlamb1111.bsky.social · Jul 30

We are doing another DataFusion meetup in Boston Wednesday Nov 12, 2025: lu.ma/w9pw5rce

Boston Apache DataFusion Meetup · Luma

Join us for an evening of talks, panel discussion, and community discussion about Apache DataFusion and its growing role in modern data infrastructure. This…

Andrew Lamb @andrewlamb1111.bsky.social · Jul 30

In my opinion, the only actual criticism of Parquet that cannot be solved with more software engineering (rather than changing the format) is adding new encodings.

FastLanes, FSST and the BtrBlocks-style cascaded encodings are great candidates. Now we need to get them adopted into Parquet.
