Alex Miller
@alexmillerdb.bsky.social
Database Papers as a Service
Reposted by Alex Miller
southbaysystems.xyz
There was an accident with the recording where audio wasn't captured, so instead we can offer a recording from one of Jakob's practice runs on twitch: www.twitch.tv/videos/25845...
Reposted by Alex Miller
qianli.dev
Had a fun time at the South Bay Systems meetup last night. Thanks @yugabytedb.bsky.social for hosting!

@codedrift.social gave a great talk on WebAssembly: what it is (and isn't), how it connects to WASI, and promising projects. He cuts through a lot of the hype vs. reality. Recording coming soon.
alexmillerdb.bsky.social
I will note that “scan sharing” seems specifically inter-query. If you have a single query that scans the same table multiple times and you want to coalesce that to scanning only once, that seems to be classified under subplan reuse instead?
E.g. link.springer.com/content/pdf/...
alexmillerdb.bsky.social
[ASPLOS'25] Fusion: An Analytics Object Store Optimized for Query Pushdown
www.cs.princeton.edu...

Tightly integrating an Iceberg catalog with an object store means that one could make file-format aware erasure coding decisions, to permit pushing down filters and aggregations.
alexmillerdb.bsky.social
I think you’re talking about scan sharing? 15721.courses.cs.cmu.edu/spring2016/s...

I don’t know the OG citation for this. Andy cites graphs from 15721.courses.cs.cmu.edu/spring2016/p... and ir.cwi.nl/pub/12225/12... also looks pretty reasonable.
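
As a rough illustration of the inter-query idea behind scan sharing, here is a minimal sketch in which several queries attach to one pass over a table, so the table is read once rather than once per query. The `Query` class, `shared_scan` function, and toy table are hypothetical names made up for this example and do not come from the linked material. Real systems also have to handle queries that attach partway through a scan, which this sketch ignores.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable


@dataclass
class Query:
    """A hypothetical query: a name, a row filter, and the rows it collected."""
    name: str
    predicate: Callable[[dict], bool]
    rows: list = field(default_factory=list)


def shared_scan(table: Iterable[dict], queries: list) -> int:
    """Read the table once and offer each row to every attached query.

    Returns the number of rows read, which stays at one pass over the
    table no matter how many queries are attached.
    """
    rows_read = 0
    for row in table:
        rows_read += 1
        for q in queries:
            if q.predicate(row):
                q.rows.append(row)
    return rows_read


# Toy data (an assumption for illustration only).
table = [{"id": i, "v": i * 10} for i in range(5)]
q_big = Query("big", lambda r: r["v"] >= 30)
q_even = Query("even", lambda r: r["id"] % 2 == 0)

rows_read = shared_scan(table, [q_big, q_even])
```

Both queries get their results from the same five row reads; running them separately would read ten.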
alexmillerdb.bsky.social
[VLDB] Towards Principled, Practical Document Database Design
www.vldb.org/pvldb/v...

If you've ever wished that there was a document database equivalent for relational databases' 3NF-style schema design guidance, then this is the paper for you.
alexmillerdb.bsky.social
Err, I mean, I guess Yugabyte is also snapshot isolation without being linearizable, but just because of HLCs being inaccurate. LeanXcale is very intentionally not linearizable, and they mention that you have to do some extra work to even get session consistency out of it.
alexmillerdb.bsky.social
Do you think he’s just been “I told you so”-ing people since 1964? 😂
alexmillerdb.bsky.social
If you’re not ramped up on WCOJ algorithms, a lot of the papers are complicated to get through, but I thought TreeTracker Join arxiv.org/pdf/2403.01631 was pretty comprehensible and shows the minimal difference for NLJ. Or see justinjaffray.com/a-gentle-ish...
alexmillerdb.bsky.social
Even as a disliker of YouTube videos as a way to learn things, I found www.youtube.com/watch?v=-XmJ... easier to understand than the paper www.cs.ox.ac.uk/dan.olteanu/... for factorized database work

Extending SQL to Return a Subdatabase dl.acm.org/doi/pdf/10.1... also seems related?
alexmillerdb.bsky.social
[arXiv] On the Theoretical Limitations of Embedding-Based Retrieval
arxiv.org/abs/2508.2...

It's impossible to retrieve all combinations of pairs of documents post-embedding. Thus, there are use cases that vector search won't do well at. Conversely, BM25 excels in these cases.
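
For anyone who hasn't looked at BM25 directly, here is a minimal sketch of Okapi BM25 scoring, the sparse lexical approach contrasted with embeddings above. Because it scores on exact term overlap, a query naming two terms directly favors the document that contains both. The function name `bm25_scores`, the toy documents, and the parameter defaults `k1=1.5` and `b=0.75` are assumptions for illustration, not anything taken from the paper.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each tokenized document against a tokenized query with Okapi BM25."""
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n
    # Document frequency: how many documents contain each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            # Smoothed IDF variant that stays non-negative.
            idf = math.log((n - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            # Term-frequency saturation with document length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Toy corpus (an assumption for illustration only).
docs = [
    "apples and quokkas".split(),
    "apples and oranges".split(),
    "quokkas and wombats".split(),
]
scores = bm25_scores(["apples", "quokkas"], docs)
best = max(range(len(docs)), key=scores.__getitem__)
```

Here `best` is the index of the only document containing both query terms.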
alexmillerdb.bsky.social
Still looking! Would love to try it out
alexmillerdb.bsky.social
And, the alternative is that you either get a table as a giant list of numbers read out at you, or graphs and diagrams as nothing at best or random keywords at worst. So, even a bit of an inaccurate summary is still an improvement.
alexmillerdb.bsky.social
The diagram summaries from accidentally figure-heavy papers I ran through it have turned into text that was sufficiently reasonable that it seemed to fit. I checked the results of the first one when I realized what was happening mid-listen, and they seemed like reasonable interpretations.
alexmillerdb.bsky.social
I have only used full text + no additional context mode, but for a person without the superhuman blind person powers of being able to listen to a code listing and actually make sense of it, the AI generated summaries are a huge step up.
alexmillerdb.bsky.social
I text-to-speech papers often, and www.paper2audio.com finally did the one thing that I was hoping AI would enable: replace tables/figures/diagrams with a summary of what is being shown. It makes table/diagram-heavy papers actually comprehensible. There are iOS and Android apps, and it's free.
alexmillerdb.bsky.social
matklad.github.io/2023/10/23/u... and blog.janestreet.com/putting-the-... pitched better code reviewing tooling, and I really hope something polished happens there at some point too. Reviewing PRs in VSCode (local or web) is about the best experience I've had so far.
alexmillerdb.bsky.social
A bit tangential now, but I'm also a bit grumpy that there's never really been a project which took off to add issue tracking to git somewhat natively, since you can store non-vcs objects into git objects too. github.com/git-bug/git-... is about the best attempt that I've found so far.
alexmillerdb.bsky.social
jj-vcs.github.io is worth a bit of a try. I bounced off as the mental overhead was too much, but I did see the promise and the workflow outlined in ofcr.se/jujutsu-merg... did deliver the better experience it pitched when work cleanly divides into non-overlapping PRs.
alexmillerdb.bsky.social
Specifically for Go I see github.com/mmcloughlin/.... It’s worth double checking MinIO’s highwayhash implementation though, because for this sort of stuff specifically, they tend to be the ones who care the most in Go about highly efficient data crunching routines.
alexmillerdb.bsky.social
Depending on your CPU, the AES-derived ones are generally fastest for large chunks of data. Gxhash if you’re in Rust, or meowhash is a little slower but more available. Smhasher has performance tests and tons of hash functions, so it’s useful to scrape for a quick answer on best hash function.
alexmillerdb.bsky.social
[VLDB] NaviX: A Native Vector Index Design for Graph DBMSs With Robust Predicate-Agnostic Search Performance
www.vldb.org/pvldb/v...

It feels like a follow-on/improvement to ACORN. Also interesting to see HNSW built directly on a graph database working well.