Ryan Marcus
@ryanmarcus.discuss.systems.ap.brid.gy
0 followers 0 following 20 posts
Asst prof computer science @ UPenn. Machine learning for systems. Databases. He/him. [bridged from https://discuss.systems/@ryanmarcus on the fediverse by https://fed.brid.gy/ ]
ryanmarcus.discuss.systems.ap.brid.gy
During my PhD, I got paid about 22% of my highest-declined job offer.

During my postdoc, I got paid about 30% of my highest-declined job offer.

During my assistant professorship, I'm getting paid about 15% of my highest-departed job's salary.

If you want to make it in academia, you can't just […]
Original post on discuss.systems
ryanmarcus.discuss.systems.ap.brid.gy
Now that I've been the PC chair for a conference, I will never submit a late review again.
ryanmarcus.discuss.systems.ap.brid.gy
This is a class of bugs that can only exist in cyberphysical systems. Plane thinks it's on the ground, but it's actually in the air. Hard to imagine being on this support call -- "The plane says I'm on the ground." "Are you?" "Nope." "Huh. That's not good."

Salient quote:

"At that point, the […]
Original post on discuss.systems
ryanmarcus.discuss.systems.ap.brid.gy
If you still use Twitter, there's a sale on Premium today. I went through the checkout process, closed the window at the Stripe payment page, then got an email 10 minutes later with an offer for a year of free Premium.

I have a stupid blue check next to my name but there are fewer ads!

"Cancel […]
Original post on discuss.systems
ryanmarcus.discuss.systems.ap.brid.gy
I thought database folks were always hyperbolic and overly resistant to change in the peer review process, but I just saw a NeurIPS AC compare desk rejecting papers submitted by absentee reviewers to familial extermination / collective punishment during the Qin dynasty, so maybe the database […]
Original post on discuss.systems
ryanmarcus.discuss.systems.ap.brid.gy
@adamcrussell A Google Scholar page can include a link to your personal/professional website, but AFAIK, can't include your email address (although it will show if you've confirmed an email address at a particular domain).
ryanmarcus.discuss.systems.ap.brid.gy
This is your yearly reminder that anyone who publishes CS papers should have a personal website that lists their current position, research interests, publications, and email address.

If you don't, it's basically impossible for me to invite you to a PC […]

[Original post on discuss.systems]
A meme featuring Bernie Sanders standing outdoors in a winter coat, speaking directly to the camera. The caption reads, “I am once again asking PhD students to make a damn website."
ryanmarcus.discuss.systems.ap.brid.gy
OLAP workloads are dominated by repetitive queries -- how can we optimize them?

A promising direction is to do 𝗼𝗳𝗳𝗹𝗶𝗻𝗲 query optimization, allowing for a much more thorough plan search.

Two new SIGMOD papers! ⬇️

LimeQO (by Zixuan Yi), a 𝑤𝑜𝑟𝑘𝑙𝑜𝑎𝑑-𝑙𝑒𝑣𝑒𝑙 […]

[Original post on discuss.systems]
Infographic describing LimeQO, a workload-level, offline, learned query optimizer. On the left, it shows a workload consisting of multiple queries (q₁ to q₄), each with a default execution time (3s, 9s, 12s, 22s respectively). On the right, alternate plans (h₁, h₂, h₃) show varying execution times for each query, with some entries missing (represented by question marks). For example, q₁ takes 1s under h₂, much faster than the 3s default. A specific callout highlights that for q₃, plan h₃ reduced the time from 12s to 3s, but took 18s to find, resulting in a benefit of 9s gained / 18s search. The image poses the question: “Where should we explore next to maximize benefit?” The image credits Zixuan Yi et al., SIGMOD '25, and provides a link: https://rm.cab/limeqo

Infographic describing BayesQO, an offline, multi-iteration learned query optimizer. On the left, it shows a Variational Autoencoder (VAE) being pretrained to reconstruct query plans from vectors, using orange-colored plan diagrams. The decoder part of the VAE is retained. In the center and right, the image shows Bayesian optimization being performed in the learned vector space: new vectors are decoded into query plans, tested for latency, and refined iteratively. At the bottom, a library of optimized query plans is used to train a robot labeled “LLM,” which can then generate new plans directly. The caption reads: "We get a fast query, but also a library of high-quality plans. We can train an LLM to speed up the process for next time!" The image credits Jeff Tao et al., SIGMOD '25, and links to https://rm.cab/bayesqo
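The setup in the LimeQO figure is easy to sketch in code: a partially observed latency matrix over (query, hint) pairs, plus the question of which cell to try next. The toy below only illustrates that setup; the imputation heuristic is a placeholder, not LimeQO's actual matrix-completion method, and the numbers not shown in the figure are made up.

```python
import numpy as np

# Toy version of the setup in the LimeQO figure: rows are queries q1..q4,
# columns are alternate plans/hints h1..h3, entries are latencies in seconds.
# np.nan marks (query, hint) pairs that haven't been executed yet.
default = np.array([3.0, 9.0, 12.0, 22.0])     # default plan latency per query
observed = np.array([
    [np.nan, 1.0,    np.nan],                  # q1: 1s under h2
    [np.nan, np.nan, np.nan],
    [np.nan, np.nan, 3.0],                     # q3: 3s under h3
    [np.nan, np.nan, np.nan],
])

# Placeholder predictor (not LimeQO's method): assume an untried hint gives
# each query the average speedup observed for that hint so far.
speedup = observed / default[:, None]
seen = ~np.isnan(speedup)
hint_avg = np.where(seen.any(axis=0),
                    np.nansum(speedup, axis=0) / np.maximum(seen.sum(axis=0), 1),
                    1.0)
predicted = default[:, None] * hint_avg

# Greedily pick the unexplored cell with the largest predicted improvement
# over the query's current best known plan.
best_so_far = np.nanmin(np.column_stack([default[:, None], observed]), axis=1)
benefit = np.where(np.isnan(observed), best_so_far[:, None] - predicted, -np.inf)
q, h = np.unravel_index(np.argmax(benefit), benefit.shape)
print(f"explore q{q + 1} with hint h{h + 1}: predicted gain {benefit[q, h]:.1f}s")
```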
ryanmarcus.discuss.systems.ap.brid.gy
The abstract deadline for SoCC '25 is about a month away! This year's event is fully online.

https://acmsocc.org/2025/papers.html
2025 ACM Symposium on Cloud Computing
ryanmarcus.discuss.systems.ap.brid.gy
Kept writing bad code today; an expert had to take over and guide my hand.
A small, colorful bird, a Bourke's parakeet, with a mix of pastel pink, blue, green, and gray feathers is perched on a person's arm while they type on a keyboard. The setting appears to be a workspace with a computer mouse, mouse pad, and other desk items visible in the background.
ryanmarcus.discuss.systems.ap.brid.gy
"Database venues are just about LLMs now!" -- LLMs are certainly on the rise, but this claim isn't supported by the data.

Papers about LLMs are certainly growing very quickly, but last year there were:

* Almost 3x more papers about indexing,
* Almost 4x […]

[Original post on discuss.systems]
A table and a line chart depict trends in database research papers from conferences such as VLDB, SIGMOD, CIDR, and PODS.

The table on the left shows the number of papers from 2017 to 2024, categorized by topic: all papers, query optimization, transaction processing, indexing, and large language models (LLMs). It shows that overall paper counts vary each year, peaking at 778 in 2023. Query optimization papers consistently appear in large numbers, while LLM-related papers increase dramatically—from 0 in 2017 to 57 in 2024—indicating growing interest in LLMs in database research.

The line chart on the right shows the historical trend from 1970 to 2024 for each category. Total papers increase exponentially over time. Query optimization, transaction processing, and indexing papers also rise gradually. LLM papers remain flat until around 2018, then begin a sharp upward trend, reflecting their recent emergence in the field.
ryanmarcus.discuss.systems.ap.brid.gy
When modern analytic databases process `GROUP BY` queries, they tend to use a partitioning strategy for parallelism. The conventional wisdom is that partitioning has better scalability due to lower contention.

But is this wisdom still true in 2025? Penn […]

[Original post on discuss.systems]
A diagram showing a two-stage parallel processing system involving key-value pairs. On the left, a column labeled "K V" holds key-value pairs divided into morsels (small batches) for processing. Stage 1 is labeled "ticketing", where keys are matched against a shared hash table (middle column labeled "K T") to obtain a ticket (index). These tickets help arrange the data into a new table (right-middle column labeled "T K V"). Stage 2, labeled "update", uses the ticket to update values into a shared result vector (rightmost column labeled "V") at the index specified by the ticket. Blue and red colors distinguish different morsels processed concurrently.
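For readers who want the mechanics of the figure spelled out, here is a single-threaded toy sketch of the two-stage "ticketing" aggregation it describes: a shared hash table hands out dense integer tickets, then values are accumulated into a result vector indexed by ticket. The actual research question is how this behaves under concurrency (shared table, atomic updates), which a toy like this obviously doesn't capture.

```python
def ticketed_group_by_sum(morsels):
    """Toy two-stage GROUP BY SUM over morsels of (key, value) pairs,
    mirroring the ticketing scheme in the figure (single-threaded sketch)."""
    tickets = {}        # shared hash table: key -> dense integer ticket
    results = []        # shared result vector, indexed by ticket

    for morsel in morsels:
        # Stage 1: ticketing -- resolve each key to a ticket, growing the
        # table (and result vector) when a new key is seen.
        ticketed = []
        for key, value in morsel:
            t = tickets.setdefault(key, len(tickets))
            if t == len(results):
                results.append(0)
            ticketed.append((t, value))

        # Stage 2: update -- accumulate each value at its ticket's slot.
        for t, value in ticketed:
            results[t] += value

    return {key: results[t] for key, t in tickets.items()}

morsels = [[("a", 1), ("b", 2)], [("a", 3), ("c", 4)]]
print(ticketed_group_by_sum(morsels))   # {'a': 4, 'b': 2, 'c': 4}
```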
ryanmarcus.discuss.systems.ap.brid.gy
The NSF GRFP, a training grant awarded to promising American students at public and private colleges looking to earn a PhD, was cut in half this year.

When faculty admit a PhD student, they are committing to raising ~$500k to fund that student. The GRFP […]

[Original post on discuss.systems]
Stacked bar chart titled "NSF GRFP Recipients" showing the number of recipients from private and public institutions for each year from 2015 to 2025. Each year is represented by a stacked bar with blue indicating private institutions and orange indicating public institutions. The total number of recipients remains relatively stable around 2000–2100 per year, with a noticeable peak in 2023 reaching above 2500 and a sharp drop in 2025 to about 1000. The distribution between private and public institutions varies slightly each year, with public institutions generally having a slightly higher share.
ryanmarcus.discuss.systems.ap.brid.gy
The deadline for aiDM 2025 -- the SIGMOD workshop on Exploiting Artificial Intelligence Techniques for Data Management -- has been extended to this Friday, March 28th. If you were on the fence about a submission, now is your chance to make it!

http://aidm-conf.org/#dates
ryanmarcus.discuss.systems.ap.brid.gy
Just a reminder that your BibTeX citation keys get included in the PDF (both as the link anchor name and in the metadata), so if you name a citation `morons_who_copied_us`, reviewers and readers will be able to see that...
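If you want to check your own submission before a reviewer does, a snippet along these lines should surface those keys. It assumes a hyperref-built PDF (where bibliography anchors typically appear as named destinations like "cite.<key>") and the pypdf package; the filename is made up.

```python
from pypdf import PdfReader

# Hypothetical filename; assumes a LaTeX + hyperref build.
reader = PdfReader("camera_ready.pdf")
cite_keys = sorted(name for name in reader.named_destinations
                   if name.startswith("cite."))
for name in cite_keys:
    print(name)   # e.g. "cite.morons_who_copied_us" would be plainly visible
```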
ryanmarcus.discuss.systems.ap.brid.gy
A very nice paper from UMD about catalog storage on data lakes. While I'm not totally sold on their solution (I have some doubts about the hierarchical data model), the discussion of various tradeoffs and design principles is top notch.

I think there's clear space for major innovation in the […]
Original post on discuss.systems
ryanmarcus.discuss.systems.ap.brid.gy
I made a simple tool to look for related database papers (VLDB, SIGMOD, CIDR, PODS) given a new paper's title and abstract using vector embeddings. It highlights authors who are currently in the reviewer pool.

I'm sure my horrible Python hack will break at […]

[Original post on discuss.systems]
A screenshot of the linked webpage.
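The post doesn't include the code, but the core of such a tool is small: embed title + abstract, rank prior papers by cosine similarity, and flag reviewer-pool authors. A rough sketch follows; the embedding model, corpus, and reviewer-pool data here are all hypothetical stand-ins, not the actual tool.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Hypothetical corpus of prior DB papers and a hypothetical reviewer pool;
# in the real tool these would be loaded from scraped conference data.
corpus = [
    {"title": "Learned cardinality estimation", "abstract": "...", "authors": ["A. Author"]},
    {"title": "Vectorized hash joins", "abstract": "...", "authors": ["B. Writer"]},
]
reviewer_pool = {"A. Author"}

model = SentenceTransformer("all-MiniLM-L6-v2")   # assumed embedding model
corpus_vecs = model.encode([p["title"] + " " + p["abstract"] for p in corpus],
                           normalize_embeddings=True)

def related(title, abstract, k=10):
    """Rank corpus papers by cosine similarity to a new title + abstract."""
    q = model.encode([title + " " + abstract], normalize_embeddings=True)[0]
    sims = corpus_vecs @ q                         # cosine, since normalized
    for i in np.argsort(-sims)[:k]:
        p = corpus[i]
        in_pool = [a for a in p["authors"] if a in reviewer_pool]
        flag = f"  [reviewer pool: {', '.join(in_pool)}]" if in_pool else ""
        print(f"{sims[i]:.3f}  {p['title']}{flag}")

related("My new query optimizer", "We optimize queries with ...")
```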
ryanmarcus.discuss.systems.ap.brid.gy
My hot take of the day is that we're pretty good at evaluating PhD applicants (at least better than random), but the number of qualified applicants greatly exceeds the number of available slots. So there exist pairs of students (X, Y) where X is admitted and Y isn't admitted, but the […]
Original post on discuss.systems