Andrew Lamb
@andrewlamb1111.bsky.social
630 followers 22 following 120 posts
Apache {DataFusion PMC}, Database Internals
andrewlamb1111.bsky.social
I 100% agree that including a WASM-based encoder/decoder will be a barrier to implementation for any file format (including Parquet).

My broader point was that there is no technical reason it could not be added to Parquet, not that it necessarily could or should be added
andrewlamb1111.bsky.social
The only thing is getting consensus -- there is no technical blocker
andrewlamb1111.bsky.social
"It is not 100% clear to me how a new file format (or three) will drive additional ecosystem adoption 🤔"

However, I absolutely think this adds to the pressure for Parquet to evolve.

Speaking of, anyone interested in helping add new encodings to Parquet?
lists.apache.org/thread/djnbb...
andrewlamb1111.bsky.social
Apache DataFusion 50 is released. Read all about it here: datafusion.apache.org/blog/2025/09...
Reposted by Andrew Lamb
paleolimbot.bsky.social
I cannot say enough about DataFusion...in order to build an engine that considers spatial types at every level we needed to customize types, functions, optimizer rules, joins, Parquet pruning, and more. DataFusion not only made this possible but documented even the most obscure bits. So cool!
andrewlamb1111.bsky.social
So Cool -- jcsherin added full text indexes into Parquet files using the techniques from our blog

github.com/jcsherin/dat...
andrewlamb1111.bsky.social
We just published an easier-to-find list of all PMC members and committers of Apache DataFusion, and it is quite a cool list of people and affiliations if I do say so myself 🤗
datafusion.apache.org/contributor-...
andrewlamb1111.bsky.social
It was a great time on Monday at the @apachedatafusion.bsky.social meetup in NYC. We heard about distributed query plans, filter pushdown, geospatial support, and VegaFusion.

More deets here github.com/apache/dataf...
andrewlamb1111.bsky.social
One unfortunate aspect of new AI tools is that they make it easier to generate large amounts of plausible-looking code. That puts an even bigger burden on reviewers, who are already the resource open source projects have the least of.
andrewlamb1111.bsky.social
"downloading [the dataset] could take days. Luckily, we could use the tpchgen-rs tool, a pure Rust implementation of the TPC-H generator that can produce large-scale datasets on the laptop in just a few hours."

Thanks for the shout out 🙇 to
github.com/clflushopt/t...
(fyi @clflushopt.bsky.social)
GitHub - clflushopt/tpchgen-rs: TPC-H benchmark data generation in pure Rust
andrewlamb1111.bsky.social
Dynamic Filters for TopK and Join queries landing in DataFusion 50.0.0: datafusion.apache.org/blog/2025/09...
andrewlamb1111.bsky.social
What is LiquidCache in these slides: what-is-liquid-cache.xiangpeng.systems

BTW @xiangpeng.systems is looking for some early adopters who want to be on the bleeding edge. Hit me up if interested
andrewlamb1111.bsky.social
Recording of "Introduction to Variant in @ApacheParquet": www.youtube.com/watch?v=nlOJ...

Here are the slides: docs.google.com/presentation...
Reposted by Andrew Lamb
eatonphil.bsky.social
I've been helping our analytics team integrate our DataFusion-based query engine for Postgres into EDB Postgres Distributed and finally here's an end-to-end demo.

You get HA Postgres plus seamless replication and DataFusion-based queries. This query turned out 6x faster than PG.
andrewlamb1111.bsky.social
We want `brew install tpchgen-cli` to work, but that requires the project to be "popular" enough according to homebrew (30 forks, 30 watchers and 75 stars)

We just need 6 more forks and 24 watchers. Can you help us out?

github.com/clflushopt/t...

Deets: github.com/clflushopt/t...
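
The arithmetic behind that ask can be sketched as follows. The thresholds and the "still needed" counts come from the post. The current counts are only inferred by subtraction, and the variable names are hypothetical.

```python
# Popularity thresholds quoted in the post (forks, watchers, stars).
THRESHOLDS = {"forks": 30, "watchers": 30, "stars": 75}

# Shortfall quoted in the post. Stars are not mentioned as missing,
# so no shortfall is assumed for them here.
STILL_NEEDED = {"forks": 6, "watchers": 24}

# Inferred current counts: threshold minus what is still needed.
current = {name: THRESHOLDS[name] - missing for name, missing in STILL_NEEDED.items()}

for name, count in current.items():
    print(f"{name}: {count} of {THRESHOLDS[name]}")
```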
Reposted by Andrew Lamb
infoq.com
InfoQ @infoq.com · Aug 28
Discover why #RustLang is the right choice for building #LowLatency systems: not just for code’s performance, but also for productivity & developer joy.

🎧 Listen to the #InfoQ #podcast with Andrew Lamb for more insights: bit.ly/47kgmwU

#AI #ProgrammingLanguages #Performance #Concurrency #DevEx
andrewlamb1111.bsky.social
Thanks to @clflushopt.bsky.social, make massive TPCH datasets with tpchgen-cli 2.0:

SF1000 (1TB raw, 220GB in @ApacheParquet) in less than 10 mins (6m45s) on an aging laptop

Try it now:

pip install tpchgen-cli
tpchgen-cli --scale-factor 1000 --parts 100 --format=parquet

github.com/clflushopt/t...
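
For a rough sense of scale, the figures in the post can be turned into throughput and size ratios. This is a back-of-the-envelope sketch only: it assumes decimal units (1 TB = 1000 GB) and treats the 6m45s as the wall-clock time for the whole run. Neither assumption is stated in the post.

```python
# Figures quoted in the post; decimal units are an assumption.
RAW_GB = 1000.0          # "1TB raw"
PARQUET_GB = 220.0       # "220GB in Parquet"
ELAPSED_S = 6 * 60 + 45  # "6m45s"


def summarize(raw_gb: float, parquet_gb: float, elapsed_s: float) -> dict:
    """Derive throughput and size ratio from the quoted figures."""
    return {
        "raw_equivalent_gb_per_s": raw_gb / elapsed_s,
        "parquet_gb_per_s": parquet_gb / elapsed_s,
        "raw_to_parquet_ratio": raw_gb / parquet_gb,
    }


stats = summarize(RAW_GB, PARQUET_GB, ELAPSED_S)
for name, value in stats.items():
    print(f"{name}: {value:.2f}")
```

Under those assumptions the run works out to roughly 2.5 GB/s of raw-equivalent data, with Parquet output about 4.5x smaller than the raw size.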
andrewlamb1111.bsky.social
It is a common misconception that Parquet requires (slow) reparsing of metadata and is limited to built-in indexing structures.

Using External Indexes, Metadata Stores, Catalogs and Caches to Accelerate Queries on Parquet with @apachedatafusion.bsky.social

datafusion.apache.org/blog/2025/08...
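
To illustrate the general idea of an external index, here is a toy sketch in plain Python. It is not DataFusion's API and not necessarily the approach the blog takes. `FileStats`, `prune_files`, and the file names are all hypothetical, and the index is assumed to hold per-file min/max values for a single integer column, kept outside the Parquet files themselves.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileStats:
    """Hypothetical index entry: min/max of one column for one Parquet file."""
    path: str
    min_value: int
    max_value: int


def prune_files(index: list[FileStats], lower: int, upper: int) -> list[str]:
    """Return paths whose [min, max] range overlaps the query range [lower, upper]."""
    return [
        entry.path
        for entry in index
        if entry.max_value >= lower and entry.min_value <= upper
    ]


# The index lives outside the files (for example in a catalog or cache),
# so files can be ruled out without opening them or parsing their footers.
external_index = [
    FileStats("part-0.parquet", 0, 99),
    FileStats("part-1.parquet", 100, 199),
    FileStats("part-2.parquet", 200, 299),
]

# Query: WHERE value BETWEEN 150 AND 220
to_scan = prune_files(external_index, 150, 220)
print(to_scan)
```

Only the files whose ranges overlap the predicate remain in `to_scan`; the rest are skipped without any I/O.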
andrewlamb1111.bsky.social
In my opinion, the only actual criticism of Parquet that cannot be solved with more software engineering (rather than changing the format) is adding new encodings.

FastLanes, FSST and the BtrBlocks-style cascaded encodings are great candidates. Now we need to get them adopted into Parquet