You Really Can Replace Kafka with a Database
yep, you should
I follow Koutanov on Quora, and I just got a notification about his latest fantastic and thought-provoking article, You Can Replace Kafka with a Database. His conclusion was ironically the opposite, so I fell for his little trick. I was going to comment at the end of his article with some rebuttals, but I suspected it was going to be too long. So here’s my first Medium article in a long time.
I also love the original Top Gear, the irreverent charm, the smell of burning rubber, and the mystery of The Stig. They did things differently, they went against the grain, and that’s what I think the software industry needs to do in order to shed any bloat that’s clinging on.
Getting the design right
It would appear that Koutanov is a seasoned expert in distributed systems, and a big fan of Kafka, a fantastic and powerful tool. When it comes to databases though, he doesn’t seem to have an intuition for the best way to use a database to replace Kafka. To be fair, his design is about implementing a “messaging system on top of a database”. I propose an architecture and database design that is better suited to the job, even on an older database; I’ll use PostgreSQL for the examples.
Multiple tables
If you use a “Type” field, that’s a stinky database smell. Normalize.
Write events as records into a table from a producer process;… with an indexed attribute to denote the event type… [Koutanov]
Primarily, try to use the domain tables; they naturally represent state that can be queried by one or many processes. There are exceptions of course, and extra tables may be needed for the explicit purpose of conveying “intent” state (aka tasks) separate from the domain model, but each different intent must get its own table. By avoiding a single central table, you avoid any central locking and disk I/O bottleneck, and with separate tables each intent can have exactly the fields it needs to capture only the relevant information.
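To make this concrete, here is a minimal sketch of the shape I have in mind. Every name is hypothetical, invented purely for illustration: one domain table holding state, and one narrow table carrying a single, specific intent.

-- Hypothetical PostgreSQL sketch: a domain table plus one purpose-built intent (task) table,
-- rather than a single central "events" table with a Type column.
CREATE TABLE customer (
    customer_id  bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    email        text NOT NULL,
    created_at   timestamptz NOT NULL DEFAULT now()
);

CREATE TABLE send_welcome_email_task (
    task_id      bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    customer_id  bigint NOT NULL REFERENCES customer,
    enqueued_at  timestamptz NOT NULL DEFAULT now(),
    completed_at timestamptz  -- NULL until a consumer has finished the task
);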
Signalling, not Polling, not Messaging
The lack of any signalling mechanism in older databases is not a flaw of the relational database concept; it is only a failing of the builders of particular database systems, such as the ongoing resistance from MariaDB (MySQL). Most mainstream databases do have signalling capabilities: SQL Server, Oracle, and PostgreSQL.
Periodically poll the database from a consumer instance…(Continuous querying is … somewhat rare in mainstream SQL databases)… [Koutanov]
PostgreSQL has the LISTEN/NOTIFY commands. Set up a trigger on each table that calls [NOTIFY tablename_changed] (or pg_notify() from the trigger function). Any process connected to the database can wait using [LISTEN tablename_changed].
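Here is a minimal sketch of that wiring, reusing the hypothetical task table from earlier. It uses pg_notify() inside the trigger function so the channel name can be derived from whichever table fired the trigger.

-- Hypothetical sketch: a statement-level trigger that announces changes on its table's channel.
CREATE OR REPLACE FUNCTION notify_table_changed() RETURNS trigger AS $$
BEGIN
    -- pg_notify() takes the channel name as a string, so it can be built from the table name.
    PERFORM pg_notify(TG_TABLE_NAME || '_changed', '');
    RETURN NULL;  -- the return value is ignored for AFTER ... FOR EACH STATEMENT triggers
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER send_welcome_email_task_changed
    AFTER INSERT OR UPDATE ON send_welcome_email_task
    FOR EACH STATEMENT
    EXECUTE FUNCTION notify_table_changed();

-- A consumer connection subscribes once, then simply waits for notifications:
LISTEN send_welcome_email_task_changed;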
Query after Signal
Kafka does not offer efficient ways of locating and retrieving records based on their contents — something that databases do reasonably efficiently
As I said in the introduction, my design is not a simple “messaging with database” architecture; there are better architectures.
There is a massive difference between Signalling and Messaging. Messaging includes the state information and is typically distributed in an event-driven manner. Signalling resembles the “Observer Pattern”: the notification doesn’t include any data, and the consumer must query the subject. The query aspect is an interesting distinction, because databases are very good at querying.
Querying is very powerful — upon signal, a process can seek out data to complete the task. This is efficient with an SQL view that has the power of JOIN, query planning, and local data.
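Continuing the hypothetical tables from above, the consumer doesn’t have state pushed at it; it wakes on the signal and runs one planner-optimised query against a view:

-- Hypothetical view: joins the intent (the task) to the domain data needed to act on it.
CREATE VIEW pending_welcome_emails AS
SELECT t.task_id, c.customer_id, c.email
FROM   send_welcome_email_task t
JOIN   customer c USING (customer_id)
WHERE  t.completed_at IS NULL;

-- "Query after signal": the consumer's entire lookup is a single statement.
SELECT * FROM pending_welcome_emails;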
Security
An oft-overlooked feature of an event streaming platform is access control. Kafka offers fine-grained control over all actors
It may be overlooked, but that brings it up to par with a database system. Good to know though.
Most databases offer access control; however, its granularity is limited to table level.
That’s incorrect. In addition to tables, views can be secured, and with different permission sets bundled into roles as well as granted to individual users.
Databases exceed the capabilities of Kafka’s security mechanisms, even offering row-level security that works within a query. PostgreSQL can stamp authoritative data with the correct ChangedByUserId immediately upon receipt. Within the database that field can be used to filter records — this is an essential, simple, and important tool for enforcing domain-specific row-level security restrictions. What’s more, this can be readily code-reviewed and validated by a customer’s own DBA — and such validations can be readily automated.
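Here is a minimal sketch of those controls, still against the hypothetical tables above, using the session’s database user as a stand-in for ChangedByUserId:

-- Stamp the authoritative "changed by" value server-side, immediately upon receipt.
ALTER TABLE send_welcome_email_task
    ADD COLUMN changed_by_user text NOT NULL DEFAULT current_user;

-- Row-level security: a consumer only ever sees rows stamped with its own user.
ALTER TABLE send_welcome_email_task ENABLE ROW LEVEL SECURITY;
CREATE POLICY own_rows_only ON send_welcome_email_task
    USING (changed_by_user = current_user);

-- Views can be granted independently of their base tables, per role or per user.
CREATE ROLE reporting_reader;
GRANT SELECT ON pending_welcome_emails TO reporting_reader;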
Scale
Start Small
With less infrastructure your software development is immensely simplified. Starting out, you only need the database on one server, and you can run the processes on that same machine.
Although Koutanov didn’t lead with a great database design, he did concede that less infrastructure would be nice — if it worked.
To be fair…another set of cogs with their own deployment and maintenance overheads, infrastructure requirements, technical learning curves and all that jazz. And on the face of it, it’s a solid argument — why consciously introduce complexity when a comparable outcome may be achieved with a lot less? So, let’s dig deeper.
Of course, I can demonstrate that it does work, and even at scale. So start small with less, and don’t overengineer. Expand when and where you need to.
Don’t make decisions based on synthetic benchmarks
It’s well-established wisdom to optimise performance last. YAGNI / KISS — it’s cheaper to buy a faster server than it is to build a faster system. Furthermore, you need to measure and improve, so that any performance improvements are done in a methodical way.
From a non-functional standpoint, Kafka is optimised for high levels of throughput, both on the producer side (publishing lots of records per unit of time) and on the consumer side (processing lots of records in parallel where possible). Databases can’t touch this.
This is certainly true, but how much of that benchmarked speed translates to software success? In reality, a single stream of data isn’t enough: data is inherently relational, and information emerges from broader correlations. With Kafka, multiple queued streams of data cannot be queried with JOINs the way a database can, so you need to create microservices that consume multiple streams and keep a local copy of data for lookups. That’s extra work for developers, and a terrible issue to have: more bugs, more code to maintain, and in the end you might end up provisioning more servers in a cluster to run custom-JOIN code that cannot be query-planner-optimised. On top of that you have to deal with the unfamiliar CAP-theorem trade-offs, which adds even more complexity for developers. It cannot be overstated how much added programmer complexity hurts software success.
When a simple system is running, you can run relevant real-world tests across the complete system. You will be able to tell stakeholders the capacity limits and the statistical distributions for various types of load. It will become apparent what the real performance will be as load increases. Often, these are aspects you don’t anticipate.
Use an Event Broker for performance exceptions
Most of your system can run on database tables with no performance issues, and with the benefit of all of the simplicity. Lookup tables, slowly changing dimensions, and many processes will cope fine on a single shared database.
But there are exceptions.
But as a distributed, append-only log, Kafka is unrivalled.
Please use Kafka or similar if you need a distributed append-only log at scale: when your monitoring shows that you can no longer cater for a particular task-queue entity within a single shared database. Only that entity will need the added complexity, and that entity will get the full performance of dedicated infrastructure to solve that isolated performance issue.
Multiple Consumers
You might consider taking a leaf from Kafka’s book and employ offsets to track consumer location, while persisting records beyond their logical consumption — thereby supporting multiple sets of consumers out of a single persistent data set
PostgreSQL has supported SKIP LOCKED since version 9.5. The documentation puts it well: “With SKIP LOCKED, any selected rows that cannot be immediately locked are skipped. Skipping locked rows provides an inconsistent view of the data, so this is not suitable for general purpose work, but can be used to avoid lock contention with multiple consumers accessing a queue-like table.”
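Here is a sketch of the multi-consumer claim pattern against the same hypothetical task table. Concurrent workers can run it at the same time; rows locked by another worker are simply skipped, so no two consumers ever claim the same task.

-- Pop up to 10 tasks off the queue-like table and mark them done in one statement.
-- (A real worker would typically do the actual work inside the same transaction before
--  committing, or record a separate claimed_at / claimed_by column first.)
UPDATE send_welcome_email_task
SET    completed_at = now()
WHERE  task_id IN (
           SELECT task_id
           FROM   send_welcome_email_task
           WHERE  completed_at IS NULL
           ORDER  BY enqueued_at
           LIMIT  10
           FOR UPDATE SKIP LOCKED
       )
RETURNING task_id, customer_id;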
I think The Stig turned out to be Perry McCarthy and Ben Collins. Like the mysterious Stig, there are amazing well-credentialed unseen performers behind the success of database systems.
Follow-up: Quarter-mile Drag-racing
Quick responses to article replies from Koutanov
Building even simple DB-backed messaging systems involves a lot more incidental complexity than meets the eye; It also involves a lot more engineering skill that isn’t easily commoditised. I can clearly see that your knowledge of certain database products is higher than an average engineer, which is a case in point.
Perhaps, but I’m thinking hypothetically here: in every case, good tooling closes the skills gap and levels the starting grid.
Your knowledge of Brewer’s CAP theorem already puts you into the 95th percentile.
I learned about that from CockroachDB. Now there’s a NewSQL database system with torque.
a particular client…built a queue on top of MySQL…They are a listed company now….The developers kept trying to optimise their home-grown queue…
Not a terrible problem to have. I suggest that the opposite is worse. A startup that gold-plates, building a 100% event-driven software system with microservices and layers of microservice abstractions, is highly likely to fail.
To justify overengineering, people readily exaggerate how difficult it will be to change the software in the future. It’s essential to start simple, then measure and improve.
As he says “They are a listed company now”, they have more than enough resources to optimise for performance and scalability, as they scale.
It would be interesting to get the full details about their conundrum — I love old classics.
unless the alternative is genuinely significantly simpler/cheaper/quicker and offers a faster time to market or some other material competitive advantage. I don’t believe that implementing and maintaining a messaging platform inside a DBMS is easier or more cost-effective than spinning up a dedicated message broker.
I tend to agree. But I don’t think either extreme is universal; the answer lies somewhere in the middle. Sometimes turbo lag isn’t a big problem.