Ben
@benjdd.com
250 followers 54 following 210 posts
databases @planetscale.com
Pinned
benjdd.com
Ben @benjdd.com · Mar 13
I implemented an SSD, HDD, and tape device.

In JavaScript.

For a blog.
benjdd.com
When we need the full count for a video, sum all counts for rows with a specific v_id. This can be done ad-hoc or on a regular interval and stored in the video table.
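A minimal sketch of the rollup, assuming a hypothetical video_view_counters(v_id, slot, cnt) table:

```sql
-- Roll up the full view count for video 42 across all of its slots.
SELECT SUM(cnt) AS views
FROM video_view_counters
WHERE v_id = 42;

-- Or persist the rollup on an interval back into the video table:
UPDATE videos
SET view_count = (SELECT SUM(cnt) FROM video_view_counters WHERE v_id = 42)
WHERE id = 42;
```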
benjdd.com
The fix? Spread the counting out across N rows in a dedicated counter table. In the example below N=3, but you can make N=100 or N=1000 depending on the load.

Every increment chooses one of the N rows randomly to increment.
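In MySQL, each increment might look something like this (hypothetical table, N=3):

```sql
-- Pick one of the N=3 slots at random and bump it.
-- Assumes PRIMARY KEY (v_id, slot) so the upsert targets exactly one row.
INSERT INTO video_view_counters (v_id, slot, cnt)
VALUES (42, FLOOR(RAND() * 3), 1)
ON DUPLICATE KEY UPDATE cnt = cnt + 1;
```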
benjdd.com
Consider a website like YouTube, where we need to track view counts for every video. It's convenient to store this as a column in our video table, but for popular videos there will be lots of queries simultaneously attempting increments on the same row!

Contention == locking == slow!
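For contrast, the naive version funnels every view into one hot row (hypothetical videos table):

```sql
-- Every concurrent view of video 42 fights for the same row lock.
UPDATE videos SET view_count = view_count + 1 WHERE id = 42;
```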
benjdd.com
Have a row contention problem? Try slotted counters!

This is a neat technique for spreading out increments across many rows to reduce contention.
benjdd.com
I love benchmarks like this.

Visualizing performance improvements over time, validating that Postgres just keeps getting better (mostly).

For all its flaws, it's great to see that the core team is dedicated to continual improvement.
benjdd.com
It's just their design choice! Postgres favors this design for the sake of MVCC and fast linear writes at the cost of having space reclamation happen as separate jobs / processes.

MySQL does things very differently with clustered table indexes and the undo log.
benjdd.com
Relying solely on autovacuum can lead to fragmented data layout. It can still be useful to do semi-regular table rebuilds either with VACUUM FULL or with extensions like pg_squeeze and pg_repack.

It's also needed to avoid xid wraparound and for updating table stats used by the query planner.
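Roughly what a rebuild looks like (my_table is a stand-in name):

```sql
-- Rewrites the table into a compact heap, but holds an
-- ACCESS EXCLUSIVE lock for the duration:
VACUUM FULL my_table;

-- Refresh the planner statistics afterwards:
ANALYZE my_table;

-- pg_repack can rebuild online with only brief locks, e.g.:
--   pg_repack --table=my_table mydb
```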
benjdd.com
Autovacuum is a background job that scans tables to determine which rows have been removed and reclaims this space. It doesn't rearrange or compact data, but makes the slots available for re-use by future inserts and updates.
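You can watch this in action via pg_stat_user_tables:

```sql
-- Tables with the most dead rows waiting to be reclaimed,
-- plus when autovacuum last visited each:
SELECT relname, n_live_tup, n_dead_tup, last_autovacuum
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 10;
```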
benjdd.com
Why does Postgres need autovacuum?

With all vacuuming disabled, Postgres never removes rows on its own. When data is deleted, it marks rows as dead by setting their xmax (the id of the deleting transaction) but does not reclaim the space. Without intervention, tables would grow indefinitely!
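You can peek at these hidden system columns yourself (my_table is a stand-in):

```sql
-- xmin = the transaction that created the row version,
-- xmax = the transaction that deleted (or locked) it; 0 means live.
SELECT xmin, xmax, * FROM my_table LIMIT 5;
```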
benjdd.com
If you're using a connection pooler like PgBouncer in transaction pooling mode, long transactions can easily result in pool exhaustion, producing failures on your clients.

The database is the most core part of your stack. Keep it as contention-free as possible.
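One way to spot offenders before the pool drains, via pg_stat_activity:

```sql
-- Longest-running open transactions first:
SELECT pid, now() - xact_start AS xact_age, state, query
FROM pg_stat_activity
WHERE xact_start IS NOT NULL
ORDER BY xact_age DESC;
```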
benjdd.com
Then there's MVCC. Postgres keeps multiple versions of rows for MVCC, and long-running transactions keep old row versions from becoming autovacuum candidates. MySQL handles this with an undo log: when there are lots of long transactions, the log has to be kept around longer, growing in length and hurting performance.
benjdd.com
Never call an API in a transaction.

In relational databases, it's best practice to keep transactions as short-lived as possible.

Longer transactions lead to resource contention. If each transaction acquires row locks, being longer-lived means higher likelihood of lock waits and execution delays.
Reposted by Ben
cynthiadunlop.bsky.social
@sarna.dev's latest picks for writethat.blog

90%// @mitsuhiko.at

Turning billions of strings into integers every second without collisions// @jaz.bsky.social

Eliminating cold starts 2: shard and conquer// Harris Hancock @cloudflare.social

Processes and threads// @benjdd.com, @planetscale.com
benjdd.com
Ben @benjdd.com · 11d
A must-read for learning how scalable vector indexes work.

Vicent is one of the best writers in the tech world.
planetscale.com
Vitess databases on PlanetScale support sophisticated vector search.

We've published a deep-dive on how we built this for MySQL in a way that is scalable, performant, and ACID-compliant.
benjdd.com
Ben @benjdd.com · 13d
This is applicable in many areas of life, but especially when building a product.

One reason working at PlanetScale is so great: we are definitely operating in the green zone.
benjdd.com
Ben @benjdd.com · 13d
Are you working in the sweet spot for productivity?

This is intuitively obvious but incredibly useful to visualize like this. Adding organization + structure increases productivity up to a point, but has diminishing and even negative impact when overdone.
benjdd.com
Ben @benjdd.com · 14d
Range sharding has the advantage of giving the database engineer control of exactly where data lives. The downside? It's difficult to get the ranges just right, and often there ends up being uneven data distribution or hot shards.

This is the challenge (and fun!) of designing a scalable database.
benjdd.com
Ben @benjdd.com · 14d
Range sharding is simpler. Each shard is assigned a range of values for a column (the model name in the example below).
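As a sketch, with hypothetical ranges and shard names over model_name:

```sql
-- Route each row by where model_name falls in the range map:
--   shard_1: model_name <  'h'
--   shard_2: 'h' <= model_name < 'p'
--   shard_3: model_name >= 'p'
SELECT CASE
         WHEN model_name < 'h' THEN 'shard_1'
         WHEN model_name < 'p' THEN 'shard_2'
         ELSE 'shard_3'
       END AS shard
FROM models;
```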
benjdd.com
Ben @benjdd.com · 14d
Hash sharding produces a hash of a column for every incoming row. It doesn't have to be cryptographically secure, but it needs to be fast. This is used to determine which shard the row resides on, each shard being assigned a range of hashes.
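A toy version using MySQL's CRC32 (fast, non-cryptographic), with four shards:

```sql
-- Deterministic: the same model name always lands on the same shard.
-- Modulo stands in for the range lookup over the 32-bit hash space.
SELECT CRC32('mustang') % 4 AS shard;
```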
benjdd.com
Ben @benjdd.com · 14d
Sharding is the best way to scale a relational database.

This is exactly how Meta, Uber, Slack, and many others achieve petabyte-scale databases.

The two most popular techniques: hash-based and range-based sharding.
benjdd.com
Ben @benjdd.com · 18d
I implemented a CPU and RAM.

In JavaScript.

For a blog.