Dmitriy Ryaboy
@squarecog.bsky.social
Works with data, runs with swords.
squarecog.bsky.social
Right, it's all about the ecosystem. Writers are always going to be more conservative than readers, rightfully so. This f3 idea is essentially about letting writers adopt new stuff without worrying too much about older gen readers (once older gen can read this sort of thing, another decade later).
squarecog.bsky.social
andypavlo.bsky.social
One problem with Parquet is many implementations are not updated when the official spec improves. Everyone just uses the lowest version feature set. That means if Parquet adds a better data encoding scheme and a file uses it, many common reader libraries won't be able to retrieve the data.
Survey of the features used in public Parquet files.
squarecog.bsky.social
But anyway, the point is not whether RLE is useful, but whether there is a world where Parquet format improvements introduced since like 2018 get adopted, and more useful encodings can be propagated.
squarecog.bsky.social
RLE+delta allows filter pushdowns to work without decompressing. If you have repeated strings and sort, dict encode, and rle+delta, even regex searches become blazing fast. Parquet enables this, but who implements it?
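A minimal sketch of why this is fast, under stated assumptions: it covers only the dictionary + RLE part (not delta encoding), the function names `dict_rle_encode` and `count_matches` are hypothetical, and the in-memory layout is illustrative rather than Parquet's actual on-disk format. The point it shows is that the regex runs once per distinct value and the filter then walks runs, never individual rows.

```python
import re
from itertools import groupby


def dict_rle_encode(values):
    """Dictionary-encode a column, then run-length encode the codes.

    Returns (dictionary, runs), where runs is a list of (code, run_length).
    Sorting the column first makes the runs as long as possible.
    """
    dictionary = sorted(set(values))
    code_of = {value: code for code, value in enumerate(dictionary)}
    runs = [(code_of[value], len(list(group))) for value, group in groupby(values)]
    return dictionary, runs


def count_matches(dictionary, runs, pattern):
    """Count rows matching a regex without expanding the runs into rows."""
    rx = re.compile(pattern)
    # The regex is evaluated once per distinct value, not once per row.
    matching_codes = {code for code, value in enumerate(dictionary) if rx.search(value)}
    # Each run contributes its whole length or nothing.
    return sum(length for code, length in runs if code in matching_codes)


# Hypothetical column of repeated strings, sorted before encoding.
column = sorted(["error: disk", "ok", "ok", "error: net", "ok", "error: disk"])
dictionary, runs = dict_rle_encode(column)
matches = count_matches(dictionary, runs, r"^error")
```

Here six rows collapse into three runs, and the regex is tested against three dictionary entries instead of six rows. On a real column with millions of rows and a few thousand distinct values, the gap is what makes the filter cheap.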
squarecog.bsky.social
To be fair, you would not require it. An implementor would only do this if they want to future proof, and are ok with the whole executable data file thing. Otherwise, same as now: implement the reader for every encoding.
It's painful how little even basic RLE is being used in the wild :(
squarecog.bsky.social
I had the same 2 thoughts in the same sequence :)
squarecog.bsky.social
Do you think there's anything blocking parquet from adopting the same wasm reader approach to unlock new encodings and other schemes?
squarecog.bsky.social
This is a pretty intriguing idea for future proofing file formats.
It does assume wasm is future proof, of course, but that feels like a safer bet than "assume readers are updated"
andypavlo.bsky.social
Our F3 files embed small WASM programs to decode data. If somebody creates a new encoding and the DBMS does not have native impl, it can still read data using WASM passing Arrow buffers. Our experiments show WASM is 15-20% slower than native. We use @spiraldb.com's Vortex encoding impls.
Overview of F3's decoding pipeline with WASM support.
squarecog.bsky.social
If you love this sort of thing, read up on C-Store, which introduced this idea in 2005 and was commercialized as Vertica. Stonebraker, Sam Madden, Daniel Abadi.
Parquet was also partially inspired by Vertica (and Google's Dremel, and PAX by Natassa Ailamaki et al) :-).
duckdb.org
Are you streaming into your Lakehouse?

Traditional formats suffered with the “many small files” problem — OLAP engines merge them reactively with long jobs. ⏳

DuckLake takes a proactive path: Data Inlining + async flush to parquet while always keeping data queryable ⚡
squarecog.bsky.social
ML is just applied stats.
Stats is just applied algebra.
LLM is just ML backward and with an extra L.
squarecog.bsky.social
The obvious reaction here is to shift at least some of the hiring out of the country to get access to the talent. The obvious counter reaction is to tax payments and wages to foreign employees and contractors. Which will also provoke a reaction. And none of this makes the US stronger or smarter.
squarecog.bsky.social
About a decade late with this, but:
Someone should have started a social media ad agency called Twaddle.
squarecog.bsky.social
Ask Ketan, I've been trying to find a good excuse to get my teams to use Flyte for half a decade now 😆
squarecog.bsky.social
It's tempting to take shortcuts that give you speed today by mortgaging speed tomorrow.

Trouble is, today is yesterday's tomorrow.
squarecog.bsky.social
Thanks for the reference, hadn't seen that!
Are these all one-shotting or doing an agentic workflow to explore before formulating final answer?
squarecog.bsky.social
I tried 2 different English-to-insights SQL LLM agents from reputable vendors in the past week. Data analyst jobs remain safe.
Firmly in the toy category for now.
squarecog.bsky.social
Happened to be by the Cloudera building in the south bay earlier. Checked LinkedIn and discovered I have literally 0 1st degree connections who work there now. Not unexpected, I guess, but, man... between hnwx and cldr I used to know like 100s of folks there.
squarecog.bsky.social
Also, everyone trips at least once on average of ratios vs ratio of sums (which becomes obvious once you describe them as unweighted vs weighted means).
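A small worked example of the two quantities, using made-up numbers (the `groups` data is hypothetical, not from any real analysis). It shows that the average of ratios is the unweighted mean of per-group ratios, while the ratio of sums is the same mean weighted by each group's denominator.

```python
# Hypothetical (successes, trials) per group: one tiny group, one large one.
groups = [(1, 10), (90, 100)]

ratios = [successes / trials for successes, trials in groups]
trials = [t for _, t in groups]

# Unweighted mean: every group counts the same, regardless of size.
average_of_ratios = sum(ratios) / len(ratios)

# Ratio of sums: pool everything first, then divide.
ratio_of_sums = sum(s for s, _ in groups) / sum(trials)

# Weighting each group's ratio by its denominator recovers the ratio of sums.
weighted_mean = sum(r * t for r, t in zip(ratios, trials)) / sum(trials)
```

With these numbers the average of ratios is 0.5 and the ratio of sums is 91/110, about 0.83. Neither is wrong; they answer different questions (the typical group vs. the typical trial), which is why mixing them up silently is so easy.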
squarecog.bsky.social
It shows up in different ways in different places. The most basic being, you don't know if the ratio moved because the numerator went up or the denominator went down. Correct course of action is often different depending on which!
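To make the numerator-vs-denominator point concrete, here is a sketch with invented numbers (the helper `explain_ratio_change` and the conversion/visitor framing are assumptions for illustration). Two very different situations produce the identical ratio, so the ratio alone cannot tell you which one you are in.

```python
def explain_ratio_change(before, after):
    """Report a ratio's movement alongside the movement of its two parts.

    `before` and `after` are (numerator, denominator) pairs.
    """
    (n0, d0), (n1, d1) = before, after
    return {
        "ratio_before": n0 / d0,
        "ratio_after": n1 / d1,
        "numerator_change": n1 / n0 - 1,    # relative change in the numerator
        "denominator_change": d1 / d0 - 1,  # relative change in the denominator
    }


# Hypothetical baseline: 30 conversions from 600 visitors (5%).
baseline = (30, 600)

# Scenario A: more conversions, same traffic.
more_conversions = explain_ratio_change(baseline, (36, 600))

# Scenario B: same conversions, less traffic.
fewer_visitors = explain_ratio_change(baseline, (30, 500))
```

Both scenarios move the ratio from 5% to 6%. In the first, something is converting better; in the second, traffic was lost. Reporting the components next to the ratio is a cheap way to keep the two apart.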
squarecog.bsky.social
About once every two years I have cause to re-learn a very important data lesson: never, ever, trust analysis based on ratio metrics.
squarecog.bsky.social
Trying and failing to make the page edits look right? Is offline access lackluster? Tired of AI upsell as a replacement for poor search quality?

You might be suffering from Notion sickness.
squarecog.bsky.social
Phone book, noun: an ebook you read on your phone.
squarecog.bsky.social
Looking up Latin phrases on Google results in an AI response in French a good % of the time. Fortunately, my French is slightly better than my Latin.