We need more databases
It feels like a “database consolidation” has been happening over recent years.
The set of databases that come up in discussion seems to be dwindling to just Postgres and SQLite.
This is tragic.
Not only does this shrink the number of people who know how to work on databases, but we also lose many of the benefits of databases designed around specific workloads. Not everything abstracts well to SQL, relational, or even KV databases.
Sometimes, you really need the right tool for the job.
Industry-specific databases are a good thing
Optimizing for a very specific use case is a good thing, and at scale it is basically a forcing function.
Optimizing data structures, operations, and storage layouts for your use case not only makes things faster and more reliable, but it also makes them easier to use, as the data access layer becomes familiar to the industry (read: no need to translate between something like a financial transaction and a SQL transaction).
Index-free adjacency for graph databases is orders of magnitude faster than a recursive Postgres CTE against a massive index.
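To make that comparison concrete, here’s a minimal sketch of the relational side, assuming a hypothetical friendships(user_id, friend_id) edge table and a two-hop friends-of-friends query:

```sql
-- Each hop is another join resolved through an index lookup; index-free
-- adjacency in a graph database follows direct record pointers instead.
WITH RECURSIVE fof AS (
    SELECT friend_id, 1 AS depth
    FROM friendships
    WHERE user_id = 42
  UNION ALL
    SELECT f.friend_id, fof.depth + 1
    FROM friendships f
    JOIN fof ON f.user_id = fof.friend_id
    WHERE fof.depth < 2
)
SELECT DISTINCT friend_id FROM fof;
```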
Meta built TAO with QoS and availability guarantees optimized for the way Facebook users interact with the social network, and saw enormous benefits in performance, stability, and iteration speed. With a use-case-specific API, it prevented developers from making stupid mistakes and allowed them to think in terms of the domain they were operating in without having to translate into database terms.
TigerBeetle is able to make extreme optimizations for financial workloads without building layer upon layer of abstraction over relational SQL, cutting down massively on IO and resources.
Many data stores we have now came out of internal, industry-specific projects
Kafka → LinkedIn solving for data streaming
Cassandra → Facebook for powering inbox search
HBase → An OSS clone of Google BigTable, a database they made to handle the massive volume of data they were accumulating
DynamoDB → Amazon needing a highly available key-value store that scaled beyond what their relational databases could handle
Company-backed databases are reassuring when done right
I think I’m fairly speaking for the general community when I say we all hate a license rug-pull.
More than enough companies have taken perfectly good open source databases and, after realizing they failed to monetize them properly up front, pulled the rug. I have a few I’m keeping my eye on that I worry about too.
However, having a company behind a database that can offer support and guidance, the same company that makes the thing, is really useful for businesses. Especially when that company can then recognize patterns in how the database is being used and adjust efficiency, performance, and features to reflect how people are actually using it.
Regardless of whether you’re working on a startup, a Fortune 10 corporation, or an open source project, it’s a really good idea to talk to your users (or at least understand them). They reveal many things that are non-obvious.
Having a company backing the database also means there are definitely people working on it who really care about performance, correctness, stability, and scalability. When money’s on the line, people tend to get more serious.
Would you be shocked to know that SQLite, the world’s most prolific database, is maintained by like… a handful of people? Turns out you don’t need an army of developers to make incremental changes and maintain security and stability.
You get meaningful, novel ideas too
You get some crazy guy saying “we should have something like malloc but for the web”, then a decade later S3 is one of the most prolific and scaled data stores on Earth.
You get completely new ways of building databases, like FoundationDB saying “we need to do this an entirely different way, from the ground up”, then proceeding to build probably the most technically impressive and dependable database ever.
You get databases like Convex that (mostly) elegantly merge live updates, code-defined schemas, infrastructure, and consistent transactions into simple TypeScript.
Old databases carry baggage
This one will likely cause some heated discussion, but it’s undeniable that age and stability are accompanied by technical debt.
Many databases that we all know and love, such as SQLite and Postgres, have a few fundamental issues that are signs of their age and focus on stability (rather than rapid progress).
Before jumping to the comments, to be clear, I like both of these databases and use them in production (but certainly not stock).
First, both databases have checksums off by default
Postgres has shipped with them off by default since they were introduced, and only now, in Postgres 18, does it look like they will finally be on by default.
Ask any systems engineer: expecting the disk to reliably return what you wrote, or to properly warn you when it fails, is bonkers. Select filesystems can do that, but how many people running Postgres or SQLite are also running ZFS? Probably very few.
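For Postgres at least, checking where a given cluster stands is a one-liner (checksums themselves have to be enabled at initdb time with --data-checksums, or retrofitted with the offline pg_checksums tool):

```sql
-- Shows whether data page checksums are enabled for this cluster.
-- "off" means a corrupted page read back from disk can be served silently.
SHOW data_checksums;
```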
Second, both databases’ file formats were designed for HDDs, not modern SSDs
While they do benefit from decades of storage innovation, they will never get to take full advantage of it. Their layouts were fundamentally designed around HDDs, to the point where the pattern has become a file system on top of a file system, creating unnecessary overhead (doubled-up metadata, doubled-up free page tracking, etc.).
MySQL’s InnoDB was probably the first swing at fixing this, with others like Meta going as far as replacing InnoDB entirely with RocksDB (via MyRocks).
For Postgres, OrioleDB seeks to fix this with a redesigned, SSD-era storage engine, and is seeing quite notable performance gains.
Third, old databases have been stubborn about fixing their horrible MVCC
The most egregious offender here is Postgres, specifically XIDs, or “transaction IDs”.
The TLDR is that every transaction in Postgres gets a 32-bit ID that’s used in MVCC. VACUUM cleans these up and prevents what’s known as “XID wrap-around”: when the 32-bit counter overflows and you start re-using XIDs from previous transactions.
Not only does this create a garbage-collection pause that shouldn’t be necessary anymore, but it also results in things like index bloat.
Modern databases don’t have these issues.
They use time-based MVCC or, god forbid, even an unsigned 64-bit counter (18,446,744,073,709,551,615).
If you’re worried, here’s the math: at 10B txn/sec, it will take 58.5 years to create transaction wrap-around with a 64-bit unsigned integer.
With a 32-bit integer, that would take less than half a second (4,294,967,295).
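If you want to sanity-check that yourself, the arithmetic fits in a single query (assuming the same constant 10B transactions per second):

```sql
-- Years until a 64-bit unsigned counter wraps at 10B txn/sec,
-- and seconds until a 32-bit counter wraps at the same rate.
SELECT 18446744073709551615::numeric / 10000000000 / (60 * 60 * 24 * 365.25) AS years_to_wrap_64_bit,
       4294967295::numeric / 10000000000                                     AS seconds_to_wrap_32_bit;
-- ≈ 58.5 years vs ≈ 0.43 seconds
```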
Fourth, old databases don’t scale well
Amazon runs thousands of shards of Aurora Postgres. Many companies have written blog posts about the sharding they’ve built on top of Postgres, neutering the capabilities that don’t scale in the process [1] [2].
Old databases fundamentally encourage data storage and access patterns that don’t scale.
Some have patched on scaling mechanisms that complicate the mental model of interacting with the database and sacrifice guarantees, like Postgres read replicas.
Companies that do scale then have to spend enormous amounts of time and resources either building around these limits (as in the examples above), engineering a new database that preserves the guarantees (e.g. Google Spanner), or swapping databases entirely, re-integrating everything, and re-educating engineers on how to use the new database that fits their scale.
New(er) databases like DuckDB, Turso, ClickHouse, Spanner, OrioleDB, Aurora, CockroachDB, and others are addressing these pain points (often with some caveats, of course).
Other people in the space have said “I think I can do it better”, and more power to them. If you think they’re crazy, good: that’s exactly who we want.
This doesn’t mean old databases aren’t useful
Despite their shortcomings, old databases that remain maintained have a few critical benefits.
They offer an oasis for companies that don’t want to do a full database migration every few years (looking at you, JS framework ecosystem).
They promise that upgrades are minor, and rarely backwards incompatible.
They become collectively learned, such that you can work around any shortcomings in performance, scale, and correctness checking. A community is built up that shares knowledge and best practices, which ultimately influences further development.