You have an order service. When a new order comes in, it does two things: writes a row to the orders
table, then publishes an order.created event to Kafka for downstream consumers. Your implementation seemed to work fine at first, but then one of those two writes failed after the other had already succeeded. When that happens you end up with an inconsistency problem with no clean way to recover, or to even know that it occurred.
That’s the problem with dual write services. One way to fix it is with an outbox pattern.
Dual write problems
The instinct when you have two things to write is to just write them in sequence. Write the DB row, then publish to Kafka. Or publish to Kafka first, then write the DB row. The problem is that there is no ordering that’s safe. Both orderings have a failure window where one write succeeds and the other doesn’t.
For this blog post we’re assuming you’re writing to a database and Kafka, but the same problem shows up whenever the second write goes somewhere else: calling a downstream service via a CRUD API, writing to disk, writing to a cache, and pretty much anything else.
In our scenario, three failure modes come up in practice:
DB succeeds, Kafka publish fails: The order exists in the database, but no event was published to Kafka. There are a few ways this can happen: the service crashes right after the transaction commits, or the publish fails and your attempt to revert the database row fails too. Without good visibility you’d have no clue this happened, and you’d only discover it when a customer asks why their order wasn’t processed by a downstream service.
Kafka publish succeeds, DB commit fails: The event is in Kafka and consumers are already acting on it, say triggering notifications about the order. But because the DB rolled back the transaction, the order doesn’t exist. Consumers are trying to process a missing order.
Retry creates a duplicate: Timeouts are the subtle one. The Kafka publish times out, but a timeout doesn’t always mean failure. The event may have landed before the connection dropped. A naive retry re-publishes the event, and downstream consumers process the same event twice.
Here’s a visualization of the first scenario: the service writes to the DB first, then attempts the Kafka publish, but the network drops. (Interactive figure: system state after request.)
The DB committed and a revert can also fail, leaving orphaned rows nobody knows about. You'd need to scan both sources after the fact to even detect this.
This is a structural problem with no simple solution.
When I first thought this through it seemed sort of easy (albeit messy): if the Kafka publish fails, just revert the database change. But that’s naive, because the revert itself can fail. You might do your best in that case, but at scale you’ll still end up with thousands of orphaned rows in your database that nobody knows about.
Ultimately, you have two systems with independent transaction semantics and no shared coordinator. Any solution that involves “write to both” has problems.
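To make the first failure mode concrete, here is a minimal sketch of the naive dual write. `FakeDB` and `FlakyBroker` are hypothetical in-memory stand-ins (not a real database driver or Kafka client), and the "network drop" is simulated with a flag.

```python
class FakeDB:
    """Hypothetical in-memory stand-in for the orders table."""

    def __init__(self):
        self.orders = {}

    def insert_order(self, order_id, total):
        self.orders[order_id] = total


class FlakyBroker:
    """Hypothetical stand-in for a Kafka producer whose network can drop."""

    def __init__(self, fail=False):
        self.fail = fail
        self.events = []

    def publish(self, topic, key):
        if self.fail:
            raise ConnectionError("network dropped before the broker acked")
        self.events.append((topic, key))


def create_order_dual_write(db, broker, order_id, total):
    db.insert_order(order_id, total)           # write 1: commits on its own
    broker.publish("order.created", order_id)  # write 2: can fail independently


def find_orphans(db, broker):
    """Orders with no matching event. Detecting them means scanning both sides."""
    published = {key for _, key in broker.events}
    return sorted(set(db.orders) - published)


db, broker = FakeDB(), FlakyBroker(fail=True)
try:
    create_order_dual_write(db, broker, "order-1", 42)
except ConnectionError:
    pass  # the caller sees an error, but the DB row is already committed

orphans = find_orphans(db, broker)  # the order exists, the event does not
```

Swapping the two lines in `create_order_dual_write` only moves the window: the event would then exist without the order.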
The fix: write to one system atomically
Since we can never guarantee that writes to two systems are persisted atomically, what if we change where we’re writing? This is the approach the outbox takes: instead of publishing to Kafka directly, we write the event to a separate outbox table in the transactional database.
Preferably the database is the same one that we want to write to anyways, although it isn’t necessary. If the outbox table lives in the same database, it gets the same transactional guarantees. Either both rows commit, or neither does.
BEGIN;
INSERT INTO orders (id, user_id, total, ...)
VALUES ($1, $2, $3, ...);
INSERT INTO outbox (event_type, payload, state, created_at)
VALUES ('order.created', $4, 'pending', now());
COMMIT;

Then we can have a separate process that runs in the background and polls the outbox every so often. It looks for unpublished rows in the outbox table, publishes them to Kafka (or any other service), and marks them as sent.
The worker can safely retry failed publishes, because the rows are in a durable database and are atomically marked as completed or not.
-- Poller: runs on a timer, separate from the request path
SELECT id, event_type, payload
FROM outbox
WHERE state = 'pending'
ORDER BY created_at
LIMIT 100;
-- After successful publish:
UPDATE outbox SET state = 'published', published_at = now()
WHERE id = ANY($1);

Since the poller runs outside the request transaction, it only reads committed rows. It will never see a row from a transaction that rolled back or hasn’t committed yet.
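Here is a rough end-to-end sketch of both halves, using Python’s built-in `sqlite3` as a stand-in for the real database and a plain `publish` callback as a stand-in for a Kafka producer. The function names, the simplified schema, and ordering by `id` instead of `created_at` are assumptions for illustration. It also assumes a single poller; running several would need some form of row locking.

```python
import json
import sqlite3


def open_db():
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE orders (id TEXT PRIMARY KEY, user_id TEXT, total INTEGER);
        CREATE TABLE outbox (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            event_type TEXT NOT NULL,
            payload TEXT NOT NULL,
            state TEXT NOT NULL DEFAULT 'pending'
        );
    """)
    return conn


def create_order(conn, order_id, user_id, total, event):
    # `with conn` wraps both inserts in one transaction:
    # commit if the block succeeds, rollback if anything raises.
    with conn:
        conn.execute(
            "INSERT INTO orders (id, user_id, total) VALUES (?, ?, ?)",
            (order_id, user_id, total),
        )
        conn.execute(
            "INSERT INTO outbox (event_type, payload) VALUES (?, ?)",
            ("order.created", json.dumps(event)),
        )


def poll_once(conn, publish, batch_size=100):
    """One poller sweep. Returns how many rows were published."""
    rows = conn.execute(
        "SELECT id, event_type, payload FROM outbox "
        "WHERE state = 'pending' ORDER BY id LIMIT ?",
        (batch_size,),
    ).fetchall()
    published = 0
    for row_id, event_type, payload in rows:
        try:
            publish(event_type, payload)
        except Exception:
            continue  # leave the row pending; the next sweep retries it
        with conn:
            conn.execute(
                "UPDATE outbox SET state = 'published' WHERE id = ?", (row_id,)
            )
        published += 1
    return published
```

Note the gap between `publish` returning and the `UPDATE` committing: if the poller crashes there, the row is published again on the next sweep. That is the at-least-once behavior, so consumers should tolerate duplicates.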
Visually, the architecture looks like this. The orders and outbox rows live inside one transaction, and the poller runs periodically as the only thing that writes to Kafka.
Visualized outbox worker
Here’s a visualization of the outbox worker. The table represents the current set of entries in the outbox table. You can play around with the failure rate, request speed, and more.
Things worth trying:
- Raise request rate above 30 req/s with a slow interval, watch pending depth grow and the annotation appear.
- Set poller interval to 5s or more, avg lag jumps to roughly interval / 2.
- Add 30–40% failure rate, rows turn red and retry, lag builds but nothing is lost.
- Crank failure to 100%, let the backlog fill, then drop it to 0%.
What the simulation illustrates
The poller is your new bottleneck
Once you add the outbox, the poller becomes a primary system component. Its throughput is batch_size / interval. If that number is lower than your request rate, the outbox depth grows without bound, and you’ve just traded a failure-window problem for a queueing problem. Whether that’s better or worse depends on the domain.
Little’s Law makes the relationship precise for a poller that keeps up: average outbox depth = arrival rate × average time in the outbox. When the poller can’t keep up there is no steady state, and depth just keeps growing. For example, at 20 req/s with a 2s poller interval and a batch of 10, throughput is 5 publishes/s against 20 req/s of arrivals, so depth grows at 15 rows/s. In an hour you’ll have 54,000 unprocessed rows and an unhappy on-call rotation.
Ideally size your poller to run comfortably ahead of your write rate, with headroom for bursty traffic.
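The sizing arithmetic above is simple enough to put in a few helper functions. This is a back-of-the-envelope sketch that assumes a steady arrival rate and a poller that always publishes a full batch; the function names and the 2x headroom default are arbitrary choices for illustration.

```python
import math


def poller_throughput(batch_size, interval_s):
    """Maximum rows the poller can publish per second."""
    return batch_size / interval_s


def backlog_after(arrival_rate, batch_size, interval_s, seconds):
    """Outbox depth after `seconds`, starting from an empty outbox."""
    growth_per_s = arrival_rate - poller_throughput(batch_size, interval_s)
    return max(0.0, growth_per_s * seconds)


def min_batch_size(arrival_rate, interval_s, headroom=2.0):
    """Smallest batch that keeps up with arrivals, with a safety factor."""
    return math.ceil(arrival_rate * interval_s * headroom)


# The example from the text: 20 req/s, 2s interval, batch of 10.
throughput = poller_throughput(10, 2)         # 5.0 publishes/s
backlog = backlog_after(20, 10, 2, 3600)      # 54000.0 rows after an hour
suggested_batch = min_batch_size(20, 2)       # 80 rows per sweep at 2x headroom
```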
Lag becomes a more important metric than error rate
With an outbox pattern you can have a 0% error rate even when the system is severely lagging. Messages themselves might not be explicitly erroring but they might be in a huge backlog.
It’s worth adding metrics and alerts around outbox lag: both the number of messages still waiting to be processed and the end-to-end latency from insert to publish.
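As a sketch of what those two metrics could look like, the query below computes pending depth and the age of the oldest pending row. It assumes an outbox table with a `state` column and a `created_at` stored as epoch seconds; `sqlite3` stands in for the real database, and the alert thresholds are made-up placeholders to tune for your own traffic.

```python
import sqlite3


def outbox_lag(conn, now):
    """Return (pending_depth, age_of_oldest_pending_row_in_seconds)."""
    depth, oldest = conn.execute(
        "SELECT COUNT(*), MIN(created_at) FROM outbox WHERE state = 'pending'"
    ).fetchone()
    age = 0.0 if oldest is None else now - oldest
    return depth, age


def should_alert(depth, age, max_depth=1000, max_age_s=30.0):
    # Hypothetical thresholds: alert on either a deep or an old backlog.
    return depth > max_depth or age > max_age_s


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE outbox (id INTEGER PRIMARY KEY, state TEXT, created_at REAL)"
)
conn.executemany(
    "INSERT INTO outbox (state, created_at) VALUES (?, ?)",
    [("published", 100.0), ("pending", 140.0), ("pending", 155.0)],
)

depth, age = outbox_lag(conn, now=200.0)  # 2 pending rows, oldest is 60s old
```

The age of the oldest pending row is a useful proxy for end-to-end latency: it stays near zero when the poller keeps up and climbs steadily when it does not, even while the error rate reads 0%.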
Failure recovery is easy
Simulate a Kafka outage: crank failure rate to 100%, let the outbox fill for a while, then drop it back to 0%.
You’ll see the backlog drain as failed rows retry. No data loss, just a lag spike that resolves once the poller can publish again.
This is the structural guarantee the outbox buys you: messages survive until they’re confirmed published, which makes recovery from outages much easier.
The broken dual write from section one has no equivalent recovery. If the Kafka publish failed and you didn’t notice, the event is gone. Without complicated tooling to reconcile database rows against Kafka messages, there’s nothing to recover from.
Ordering
Within a single poller sweep, rows publish in insertion order. But across sweeps, if a row fails and is retried in the next sweep, it may publish after rows that were inserted later. The outbox guarantees at-least-once delivery, but generally not strict ordering.
Toggle strict ordering in the sim above. With it on, the poller stops at the first failed row and blocks everything behind it. This guarantees that events arrive in sequence, but a single stuck row stalls the whole queue. With it off, the poller skips failed rows and publishes whatever it can which keeps throughput high, but consumers may see events out of insertion order.
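The two modes differ by a single branch in the sweep loop. The sketch below is a simplified in-memory model (a list of row ids instead of a table), with a hypothetical publisher that fails row 2 on its first attempt only.

```python
def sweep(pending, publish, strict_ordering):
    """Publish rows in insertion order. Returns the rows still pending."""
    still_pending = []
    for i, row in enumerate(pending):
        try:
            publish(row)
        except ConnectionError:
            if strict_ordering:
                # Stop here: the failed row blocks everything behind it.
                return pending[i:]
            still_pending.append(row)  # skip it and keep going
    return still_pending


def make_publisher(fail_once, log):
    """Hypothetical publisher that fails each row in `fail_once` one time."""
    already_failed = set()

    def publish(row):
        if row in fail_once and row not in already_failed:
            already_failed.add(row)
            raise ConnectionError("publish failed")
        log.append(row)

    return publish


def run(strict_ordering):
    log = []
    publish = make_publisher({2}, log)
    pending = [1, 2, 3]
    while pending:
        pending = sweep(pending, publish, strict_ordering)
    return log


strict_order = run(strict_ordering=True)    # [1, 2, 3]
relaxed_order = run(strict_ordering=False)  # [1, 3, 2]
```

With a publisher that never recovers for one row, the strict variant loops on that row forever, which is the stalled queue described above.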
Consider this scenario of requests against the same row:
1. CREATE: {name: "hello world"}
2. UPDATE: {name: "test"}
3. UPDATE: {name: "hi"}
Suppose (1) is processed first, then (2) fails, (3) gets processed, and finally (2) is successfully retried. {name: "test"} ends up as the final value, which is not what the user expects.
There are ways around this like using a last_updated_at column on the row and rejecting updates with a stale timestamp, but it’s an extra thing to get right.
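A minimal sketch of that stale-update check on the consumer side follows. It assumes every event carries the `last_updated_at` value assigned when the row was written, and that those values increase with each write to the same row. A per-row version counter would work the same way and avoids clock skew. The event shape and function name are illustrative.

```python
def apply_event(store, event):
    """Apply an event unless an equal or newer one was already applied.

    Returns True if the event changed the store, False if it was rejected.
    """
    current = store.get(event["id"])
    if current is not None and event["last_updated_at"] <= current["last_updated_at"]:
        return False  # stale or duplicate: ignore it
    store[event["id"]] = {
        "name": event["name"],
        "last_updated_at": event["last_updated_at"],
    }
    return True


create = {"id": 1, "name": "hello world", "last_updated_at": 1}
update_test = {"id": 1, "name": "test", "last_updated_at": 2}
update_hi = {"id": 1, "name": "hi", "last_updated_at": 3}

# Delivery order from the scenario above: (1), (3), then the retried (2).
store = {}
results = [apply_event(store, e) for e in (create, update_hi, update_test)]
final_name = store[1]["name"]  # "hi": the late retry of (2) is rejected
```

Using `<=` rather than `<` also drops exact redeliveries, which helps with the duplicates that at-least-once delivery produces.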
Downsides
We’ve sort of shown the outbox pattern as a miracle that just makes distributed systems easier, but there are some things you have to actively consider when implementing it.
Eventual consistency breaks typical CRUD
This is what eventual consistency looks like in practice. The system is correct in the limit, but at any single moment a row can be in flight.
In a typical CRUD app you expect that if an operation on an object returns a good status code, the change is persisted everywhere. With the outbox pattern we have to break this a little, because it’s more of an async job framework now: a 200 likely only means the request was added to the outbox table. The poller may still be having trouble delivering your new record.
We often have to add a new column that records the intent (creating, updating, or deleting the row), and the system doesn’t consider the operation complete until the outbox has processed it. Otherwise a user or the UI would have no way to tell that the entity isn’t fully updated yet.
Before outbox
CREATE: {1, "hello world"} -> 200 OK
UPDATE: {1, "test"} -> 200 OK
Each call responds immediately, and the 200 means the change is done.
With an outbox, the 200 only tells you the request was accepted. The row carries an explicit status (creating, updating, deleting) until the poller finishes the downstream work.
The UI now has to model an in-between state. A read right after a write may still show creating, and you need to decide whether to poll, subscribe, or just show a spinner.
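One way to picture the in-between state is as a small state machine on the row. The sketch below uses in-memory dicts and lists instead of tables; in a real service the row write and the outbox insert would share one transaction. The `ACTIVE` status name and both function names are assumptions, since the post only names the three in-flight statuses.

```python
from enum import Enum


class Status(str, Enum):
    CREATING = "creating"
    UPDATING = "updating"
    DELETING = "deleting"
    ACTIVE = "active"  # hypothetical name for the settled state


def accept_create(rows, outbox, row_id, name):
    """Request path: record the intent and enqueue the work, then return 200."""
    rows[row_id] = {"name": name, "status": Status.CREATING}
    outbox.append({"row_id": row_id, "op": "create"})
    return 200  # accepted, not yet done


def complete(rows, job):
    """Poller path: called after the downstream work for `job` succeeded."""
    if job["op"] == "delete":
        rows.pop(job["row_id"], None)
    else:
        rows[job["row_id"]]["status"] = Status.ACTIVE


rows, outbox = {}, []
http_status = accept_create(rows, outbox, 1, "hello world")
status_right_after_write = rows[1]["status"]  # still "creating"

complete(rows, outbox.pop(0))
status_after_poller = rows[1]["status"]       # now "active"
```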
Downstream Validation
The more annoying part of an outbox pattern shows up when the poller calls a downstream service’s API directly. What if that service runs a ton of validations on its end? Say, checking that an item is still in stock, that the shipping address is in its delivery zone, or that the user hasn’t been added to some new blocklist.
This is extremely tricky because if you don’t validate it on your side, your outbox poller will just always fail even though you told the user that it’s working. So you’ll end up in cases where specific rows of the outbox are completely unresolvable and the rows are stuck in updating or creating forever.
The user got a 200, the row is in the database, but the work will never complete. The only way out is to mirror every downstream validation in your own service before writing to the outbox, and then hope the downstream team doesn’t quietly add a new validation rule next quarter without telling you.
(Interactive figure: the outbox with retries on and 100% downstream failure.)
This is annoying and bad UX, but you can add alerting around it and page an engineer to go fix it. They may need to delete the job, add the missing validation to your service, and try to revert the affected rows in the database. Definitely a pain, but hopefully less painful (and more obvious about impact) than subtle hidden race conditions.
When to use it
The outbox fits most backend services that own a transactional database and need to reliably emit events. If you control the schema and your write volume isn’t absurd, it’s the right default.
If you have very high throughput with tight latency requirements, a polling outbox is likely not for you. If you need sub-100ms publish lag at tens of thousands of writes per second, polling the DB is going to be inefficient. That’s where CDC tools (Debezium, Kafka Connect with a WAL reader) end up being useful: they read the database write-ahead log directly, add essentially no overhead to the write path, and get you sub-second lag without polling.
The tradeoff is tighter coupling to your database internals and a more complex deployment. Both approaches give you at least once delivery.
If you’re not at that scale, the outbox is simpler, easier to reason about, and auditable. The outbox table is a queue you can query directly when something goes wrong.
Conclusion
The outbox adds real overhead. You get a new bottleneck to size, eventual consistency to handle in the UI, and a poller to operate. But the alternative is a failure mode you can’t see or query, that silently corrupts state until a customer notices.
For most services, that trade is worth it.
If you hate the outbox or I got something wrong, let me know on Twitter.