Code is Cheap, Uptime is Expensive: Why I Pivoted to SRE-First Engineering
The 3 AM reality check
3:14 AM on a Tuesday. PagerDuty screaming that payment gateway latency had spiked to 15 seconds.
The code that was failing? An AI assistant had scaffolded 90% of it three months ago, and a junior dev stitched the rest together. Syntactically perfect. Logically sound. But in the real world, where networks jitter, databases lock, and third-party APIs take naps, it was falling apart.
Staring at a waterfall of error logs, I realized: writing code is the easy part. It has always been the easy part. The actual engineering challenge isn't syntax or algorithms. It's keeping a system alive when the universe wants it to die.
The greenfield trap
Every developer loves npm init. Blank canvas, fresh dependencies, zero technical debt. We feel like gods. We build a demo in a weekend that looks like it solves a million-dollar problem.
The paradox: the first 90% of the work is easy. The remaining 10%, making it production-ready, takes 190% of the time.
Think about Joe from Accounting. Joe learns VBA. Cuts a spreadsheet process from 10 hours a week to 1. Hero. Six months later, tax laws change. The macro breaks. Daylight saving time messes up a timestamp. The file size exceeds the buffer. Joe is no longer an accountant. He's the captive maintainer of a brittle software system nobody else understands.
We do the same thing with TypeScript and microservices. Spin up services because it's architecturally "clean," ignoring that we just increased our operational surface area by an order of magnitude.
Operational excellence is the product
Users don't buy software. They don't care about your clean architecture or your recursive functions.
Users buy services. They buy a promise.
- Photos sync to the cloud seamlessly.
- The $7 matcha payment goes through instantly.
- The document saves before the battery dies.
Good software is invisible. When software becomes visible, it's usually broken.
If code is becoming a commodity (thanks to AI), then reliability is the luxury asset.
Engineering for reliability: three patterns
1. Structured logging with context
In a distributed system, console.log('Error happened') is useless. You need to trace a request across boundaries.
import pino from 'pino';
import { randomUUID } from 'crypto';
import { AsyncLocalStorage } from 'async_hooks';

// Per-request store that follows the request across async boundaries
const asyncLocalStorage = new AsyncLocalStorage<Map<string, string>>();

const logger = pino({
  formatters: {
    // Emit the level as a label ("info") instead of pino's numeric level
    level: (label) => ({ level: label }),
  },
  // mixin() runs on every log call; its result is merged into the log line
  mixin() {
    const store = asyncLocalStorage.getStore();
    if (store && store.has('requestId')) {
      return { requestId: store.get('requestId') };
    }
    return {};
  },
});

// Middleware injects a request ID automatically
export const requestContextMiddleware = (req: any, res: any, next: any) => {
  const store = new Map<string, string>();
  // Node exposes a repeated header as an array, so take the first value
  const header = req.headers['x-request-id'];
  const requestId = (Array.isArray(header) ? header[0] : header) || randomUUID();
  store.set('requestId', requestId);
  asyncLocalStorage.run(store, () => {
    logger.info({ method: req.method, url: req.url }, 'Incoming Request');
    next();
  });
};

// Deep in business logic, context is always available
export const processPayment = async (amount: number) => {
  logger.info({ amount }, 'Processing payment attempt');
  try {
    // Simulated flaky gateway: fails roughly 20% of the time
    if (Math.random() > 0.8) throw new Error('Payment Gateway Timeout');
    logger.info('Payment successful');
  } catch (err) {
    logger.error({ err }, 'Payment failed critically');
    throw err;
  }
};

When the pager goes off, you see the entire story of a specific request, threaded through your logs automatically.
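The middleware above only reads the incoming x-request-id. To follow a request across service boundaries, the same ID also has to leave with every outbound call. Here is a minimal sketch of that step. The helper name withRequestId and the standalone requestStore are my own for illustration; in a real service you would reuse the asyncLocalStorage instance the middleware populates.

```typescript
import { AsyncLocalStorage } from 'async_hooks';

// Stand-in for the store the middleware populates (assumption: same Map shape)
const requestStore = new AsyncLocalStorage<Map<string, string>>();

// Hypothetical helper: copies the current request ID onto outbound headers.
// Outside a request context it returns the headers unchanged.
function withRequestId(headers: Record<string, string> = {}): Record<string, string> {
  const requestId = requestStore.getStore()?.get('requestId');
  return requestId ? { ...headers, 'x-request-id': requestId } : headers;
}

// Usage inside a handler, e.g. with fetch:
//   await fetch(url, { headers: withRequestId({ accept: 'application/json' }) });
```

If the downstream service runs the same middleware, it picks the ID up from the header and its logs join the same thread.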
2. Circuit breaker
If an external API hangs, your system shouldn't hang with it. Detect failures and fail fast.
enum CircuitState {
  CLOSED,    // Normal operation
  OPEN,      // Failing fast
  HALF_OPEN, // Testing recovery
}

class CircuitBreaker {
  private state: CircuitState = CircuitState.CLOSED;
  private failures = 0;
  private lastFailureTime = 0;

  constructor(
    private threshold: number = 3, // consecutive failures before tripping
    private timeout: number = 5000 // ms to stay OPEN before probing again
  ) {}

  async execute<T>(action: () => Promise<T>): Promise<T> {
    if (this.state === CircuitState.OPEN) {
      if (Date.now() - this.lastFailureTime > this.timeout) {
        // Cool-down elapsed: let a trial call through
        this.state = CircuitState.HALF_OPEN;
      } else {
        throw new Error('Circuit is OPEN: Failing fast.');
      }
    }
    try {
      const result = await action();
      this.reset();
      return result;
    } catch (error) {
      this.recordFailure();
      throw error;
    }
  }

  private recordFailure() {
    this.failures++;
    this.lastFailureTime = Date.now();
    // A failed HALF_OPEN trial lands here too: the count is still at or above
    // the threshold, so the circuit re-opens and the cool-down restarts
    if (this.failures >= this.threshold) {
      this.state = CircuitState.OPEN;
      console.warn('Circuit Breaker TRIPPED to OPEN state');
    }
  }

  private reset() {
    this.failures = 0;
    this.state = CircuitState.CLOSED;
  }
}

// Placeholder for the real third-party call
declare function callUnreliableApi(): Promise<void>;

const paymentBreaker = new CircuitBreaker();

async function handleCheckout() {
  try {
    await paymentBreaker.execute(() => callUnreliableApi());
  } catch (e) {
    console.log('System under load, switching to queued processing');
  }
}

This isn't just code. It's admitting that failure is the default state and planning for it.
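One caveat: the breaker above only counts calls that reject. An API that truly hangs never rejects, so the breaker never trips. A rough way to close that gap is to put a deadline on the action before handing it to execute. The sketch below is an assumption of mine, not part of the original breaker, and withTimeout is a hypothetical helper name.

```typescript
// Hypothetical helper: wraps an action so it rejects if it has not settled
// within `ms`. It does not cancel the underlying request; that would need
// something like an AbortController passed into the action itself.
function withTimeout<T>(action: () => Promise<T>, ms: number): () => Promise<T> {
  return () =>
    new Promise<T>((resolve, reject) => {
      const timer = setTimeout(
        () => reject(new Error(`Timed out after ${ms}ms`)),
        ms
      );
      Promise.resolve()
        .then(action)             // also turns a synchronous throw into a rejection
        .then(resolve, reject)    // first settlement wins; later ones are ignored
        .finally(() => clearTimeout(timer));
    });
}

// Usage with the breaker from this section (2000 ms is an arbitrary example):
//   await paymentBreaker.execute(withTimeout(() => callUnreliableApi(), 2000));
```

With the deadline in place, a hung call counts as a failure like any other, and three of them in a row open the circuit.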
3. Semantic health checks
Your /health endpoint returning { status: 'ok' } just because the Node process is running? That's lying. Check real dependencies.
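import { createClient } from 'redis';
import { Pool } from 'pg';

const db = new Pool();
const redis = createClient();
// node-redis v4+ does not connect on its own; connect once at startup
redis.on('error', (err) => console.error('Redis client error', err));
redis.connect().catch((err) => console.error('Redis connect failed', err));

export const deepHealthCheck = async () => {
  const status = {
    uptime: process.uptime(),
    timestamp: new Date().toISOString(),
    services: { database: 'UNKNOWN', redis: 'UNKNOWN' },
    healthy: false,
  };

  // Check each dependency on its own so one failure doesn't hide the other
  try {
    await db.query('SELECT 1');
    status.services.database = 'UP';
  } catch (error) {
    status.services.database = 'DOWN';
    console.error('Health Check Failed: database', error);
  }

  try {
    await redis.ping();
    status.services.redis = 'UP';
  } catch (error) {
    status.services.redis = 'DOWN';
    console.error('Health Check Failed: redis', error);
  }

  status.healthy =
    status.services.database === 'UP' && status.services.redis === 'UP';
  return { code: status.healthy ? 200 : 503, body: status };
};

Your load balancer now knows exactly when to pull a dying instance out of rotation.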
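One thing to watch: load balancers probe often, and a deep check hits the database and Redis on every probe. A short cache in front of the check keeps the probes cheap. This is a minimal sketch under my own assumptions: cacheHealthCheck is a hypothetical name, the 2-second TTL is arbitrary, and it assumes the wrapped check resolves with a result instead of rejecting, as deepHealthCheck does.

```typescript
type HealthResult = { code: number; body: unknown };

// Hypothetical wrapper: reuses the last result for `ttlMs` so frequent probes
// don't each trigger a round trip to every dependency. Concurrent callers
// share the same in-flight promise. `now` is injectable to make testing easy.
function cacheHealthCheck(
  check: () => Promise<HealthResult>,
  ttlMs = 2000,
  now: () => number = Date.now
): () => Promise<HealthResult> {
  let cached: Promise<HealthResult> | undefined;
  let cachedAt = 0;
  return () => {
    if (!cached || now() - cachedAt >= ttlMs) {
      cachedAt = now();
      cached = check();
    }
    return cached;
  };
}

// Usage with the check from this section:
//   const health = cacheHealthCheck(deepHealthCheck);
//   const { code, body } = await health();
```

The trade-off is staleness: a dependency can be down for up to the TTL before the endpoint reports it, so keep the TTL shorter than the load balancer's probe interval times its failure threshold.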
From builder to guardian
Before you merge your next PR, ask yourself:
- Observability: If this breaks at 3 AM, will the logs tell me why immediately?
- Recovery: Is there a manual database script I'll need to run? (If yes, automate it now.)
- Scale: Does this algorithm work with 10 users? 10,000? 10 million?
AI will kill the ticket-taker role. But it leaves a vacuum that only the system thinker can fill. Code is cheap. Anyone can generate a React component now. But uptime, latency, data integrity, security? Those are expensive. Those require judgment, architectural intuition, and the scars of past failures.
Don't just write software. Operate services.