The Gentle Art of Not Losing Data: An Ode to the Outbox Pattern | Tech Blog

The Friday afternoon problem

4:50 PM on a Friday. You just shipped a new feature. A user places an order. Your service writes to the orders table, then calls Kafka to publish an OrderCreated event. The database COMMIT succeeds. Your code moves to kafkaProducer.send(...).

The network hiccups. A Kubernetes pod restarts. The message broker is momentarily unavailable. The send call throws.

You now have an order in your database that the rest of your distributed system knows nothing about. Shipping won't see it. Inventory won't update. The user won't get a confirmation email. Enjoy your weekend.

This is the classic dual-write problem. I once designed a CQRS-based inventory system that worked perfectly in staging. Under the first real load spike, it started dropping events. The state drift got so bad we took it offline and manually reconciled data for hours. My belief in the atomicity of a single function call died that day.

The illusion of atomic operations

A database transaction and a network call can never be part of the same atomic unit. You cannot COMMIT a database write and a Kafka publish simultaneously. One can succeed while the other fails.

Two-phase commit (2PC) exists, but unless you have a dedicated team to manage that complexity, it's usually a path to more problems. Too much overhead and fragility.

The elegant solution: the Outbox Pattern.

Your database is your best message queue

If you can't guarantee two separate operations are atomic, don't perform two separate operations. Make the act of wanting to send a message part of your primary database transaction.

Single transaction: When processing a command like CreateOrder, start a database transaction.
Write business data: Insert the order into the orders table.
Write event data: In the same transaction, insert a record into an outbox_events table.
Commit. Both the order and the event record are saved, or neither is.

The outbox_events table becomes a temporary, reliable queue inside your primary database.

Implementation with NestJS and TypeScript

The schema

CREATE TABLE "orders" (
  "id" UUID PRIMARY KEY,
  "product_id" VARCHAR(255) NOT NULL,
  "quantity" INT NOT NULL,
  "status" VARCHAR(50) NOT NULL DEFAULT 'PENDING',
  "created_at" TIMESTAMP WITH TIME ZONE NOT NULL DEFAULT NOW()
);
 
CREATE TABLE "outbox_events" (
  "id" UUID PRIMARY KEY,
  "aggregate_id" UUID NOT NULL,
  "event_type" VARCHAR(255) NOT NULL,
  "payload" JSONB NOT NULL,
  "status" VARCHAR(50) NOT NULL DEFAULT 'PENDING',
  "created_at" TIMESTAMP WITH TIME ZONE NOT NULL DEFAULT NOW()
);

The transactional write

Both tables, one transaction:

import { Injectable } from '@nestjs/common';
import { EntityManager } from 'typeorm';
 
@Injectable()
export class OrderService {
  constructor(private readonly entityManager: EntityManager) {}
 
  async createOrder(productId: string, quantity: number): Promise<Order> {
    return this.entityManager.transaction(async (transactionalManager) => {
      const order = new Order();
      order.productId = productId;
      order.quantity = quantity;
      await transactionalManager.save(order);
 
      const event = new OutboxEvent();
      event.aggregateId = order.id;
      event.eventType = 'OrderCreated';
      event.payload = {
        orderId: order.id,
        productId: order.productId,
        quantity: order.quantity
      };
      await transactionalManager.save(event);
 
      // If anything fails, the entire transaction rolls back.
      return order;
    });
  }
}

Now it's impossible to have an order without its corresponding event record. Consistency problem solved.

The relayer

A separate process reads from the outbox and publishes to the message broker:

import { Injectable } from '@nestjs/common';
import { Cron, CronExpression } from '@nestjs/schedule';
import { getRepository } from 'typeorm';
import { KafkaProducerService } from '../kafka/kafka-producer.service';
 
@Injectable()
export class OutboxPublisher {
  private isProcessing = false;
 
  constructor(private readonly kafkaProducer: KafkaProducerService) {}
 
  @Cron(CronExpression.EVERY_5_SECONDS)
  async processOutbox() {
    if (this.isProcessing) return;
 
    this.isProcessing = true;
    try {
      const events = await getRepository(OutboxEvent).find({
        where: { status: 'PENDING' },
        take: 50,
        order: { createdAt: 'ASC' },
      });
 
      for (const event of events) {
        try {
          await this.kafkaProducer.send(event.eventType, event.payload);
          event.status = 'PROCESSED';
          await getRepository(OutboxEvent).save(event);
        } catch (error) {
          console.error(`Failed to publish event ${event.id}. Will retry.`, error);
          // Leave as PENDING. Next cycle picks it up.
        }
      }
    } finally {
      this.isProcessing = false;
    }
  }
}

The relayer is simple and resilient. It polls for pending events, publishes them, and only updates status on success. If the broker is down, events stay PENDING and get retried next cycle.

Handling poison pills: retry and dead letter topics

The outbox gives us at-least-once delivery. But what if an event is malformed or a downstream consumer has a persistent bug? That message fails every time, potentially blocking everything else. A poison pill.

Build a retry mechanism:

Initial consumption: Consumer processes message from order-events.
First failure: Message goes to order-events-retry-1 with a 1-minute delay.
Second failure: Goes to order-events-retry-2 with a 5-minute delay.
Final failure: After N retries, message lands in order-events-dlq (Dead Letter Queue).

Messages in the DLQ stop the retry loop. Engineers can inspect them, fix them manually, or discard them. A single bad message no longer halts the entire pipeline.

Build for a good night's sleep

The Outbox Pattern uses the most reliable tool you have (your transactional database) to guarantee that what you intend to happen will eventually happen. Combined with a retry/DLQ strategy on the consumer side, you get a system that is consistent and resilient to the inevitable failures of distributed computing.

It's a bit more code upfront and a bit more infrastructure to think about. The return is data integrity, system stability, and fewer 2 AM pages.