Engineering

Stop Managing SSH Keys: Designing a Trustless Certificate Authority System

SSH Certificates: Designing a Trustless Access System

The Day the Keys Broke

Back in 2018, I rendered an entire fleet of production instances completely inaccessible. It was a classic administrative misstep during a routine key rotation. Instead of appending a new contractor's public key to the remote authorized_keys file, my deployment script used a single > instead of >>. In a fraction of a second, fifty servers slammed their doors shut. For the next three hours, my team and I had to laboriously bridge into our own infrastructure via cloud console serial ports, sweating through our shirts while customer availability alerts blared. That incident permanently altered my perspective. Managing secure shell access via scattered text files is a fundamentally fragile architecture.

We tolerate archaic practices in infrastructure simply because we are used to them. You generate a public-private key pair, copy the public half to a remote server, and hope nobody ever loses their laptop or moves to a competitor. When you spin up a new server and connect for the first time, you are greeted with a terrifying terminal warning about an unknown host fingerprint.

What do you do? You type yes. Everyone types yes.

This mechanism is known as Trust on First Use (TOFU). It is an absolute user experience disaster masquerading as a security policy. As a solopreneur who scales systems for a living, I eventually realized that relying on TOFU and manual key distribution is the antithesis of System Thinking. We need declarative trust to replace imperative key copying.

The Problem: Imperative Access in a Declarative World

Traditional SSH key authentication relies on a decentralized, manual web of trust. Every server must maintain an explicit list of authorized cryptographic identities.

Consider the operational friction this creates:

  1. Onboarding/Offboarding: Adding a new engineer requires mutating configuration on every single target machine. Removing access when they leave is a race against time.
  2. Host Key Rotation: If you rebuild a server, its identity changes. Suddenly, your entire engineering team is staring at a massive WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED! error. You then spend the next hour fielding Slack messages from panicked developers asking if the company is being hacked, followed by instructing them to run ssh-keygen -R.
  3. Auditability: A public key in a text file lacks metadata about expiration. It also fails to specify the owner's identity or restrict permitted commands.

I have seen corporate environments attempt to solve this by deploying massive, convoluted sync scripts that distribute authorized keys via cron jobs. I have also watched network security vendors inject their own root certificates to intercept traffic, breaking developer tools like Git and Nix in the process. Slapping a centralized distribution mechanism onto a decentralized protocol does not fix the underlying architectural flaw.

The Solution: Certificate-Based SSH Authentication

The elegant alternative has actually been built into OpenSSH since version 5.4 (released in 2010): SSH Certificates.

Instead of trusting individual user keys, servers trust a single cryptographic entity: a Certificate Authority (CA). When you want to grant a developer access, the CA signs their public key, creating an SSH Certificate. This certificate is a digital passport. It proves to the target server that a trusted authority has vouched for the user.

This works in reverse as well. The CA can sign the server's host key. When a client connects, the server presents its signed certificate. The client, knowing the CA, instantly trusts the server. The TOFU prompt vanishes entirely. Warnings and friction disappear entirely, replaced by cryptographic certainty.

By leveraging SSH certificates, you unlock profound capabilities:

  • Ephemeral Access: Certificates can be issued with a strict Time-To-Live (TTL). You can grant access that automatically evaporates after four hours. No more lingering keys.
  • Principal Restriction: Certificates specify exactly which user accounts the bearer is allowed to assume (e.g., ubuntu, but not root).
  • Granular Constraints: You can bake restrictions directly into the cryptographic signature. Want to restrict an offshore contractor so they can only connect from a specific corporate VPN IP and only execute a single restart command? The certificate enforces this at the protocol level.

Implementation: Building an Automated CA System

Transitioning to SSH certificates requires generating the central CA and configuring the fleet to trust it. You also need an automated issuing mechanism.

Since manually signing certificates in the terminal defeats the purpose of automation, we are going to build a lightweight, automated Signing Authority using Node.js and TypeScript.

Phase 1: Generating the Cryptographic Roots

First, we need to generate the CA keys. We will create two pairs: one for signing users, and one for signing hosts. Keep these secure. If these private keys are compromised, your entire infrastructure is compromised. Ideally, these live in a Hardware Security Module (HSM), but for this guide, we will generate them on a secure administrative machine.

# Generate the User CA (used to sign developer keys)
ssh-keygen -t ed25519 -C "User CA" -f user_ca_key
 
# Generate the Host CA (used to sign server keys)
ssh-keygen -t ed25519 -C "Host CA" -f host_ca_key

Phase 2: Configuring the Infrastructure

Servers must be told to trust the User CA. Distribute the user_ca_key.pub to your servers (e.g., placing it in /etc/ssh/user_ca.pub). Then, update the /etc/ssh/sshd_config file:

TrustedUserCAKeys /etc/ssh/user_ca.pub

To eliminate the TOFU prompt for your developers, sign the server's own host key using your Host CA:

ssh-keygen -s host_ca_key -I "production-web-01" -h -n web.internal.net -V +52w /etc/ssh/ssh_host_ed25519_key.pub

This generates /etc/ssh/ssh_host_ed25519_key-cert.pub. Tell the server to present this certificate to connecting clients:

HostCertificate /etc/ssh/ssh_host_ed25519_key-cert.pub

Finally, developers add a single line to their local ~/.ssh/known_hosts file to intrinsically trust any server signed by your Host CA:

@cert-authority *.internal.net ecdsa-sha2-nistp256 AAAAE2Vj...<contents of host_ca_key.pub>

Phase 3: The TypeScript Issuing Service

Manually running ssh-keygen -s every time someone needs access is toil. Let us engineer a specialized internal microservice that authenticates a request and issues a short-lived certificate dynamically.

We will use Node.js with Express and the native child_process module to interact with the OpenSSH tooling.

// src/server.ts
import express, { Request, Response } from 'express';
import { execFile } from 'child_process';
import { promisify } from 'util';
import fs from 'fs/promises';
import path from 'path';
import crypto from 'crypto';
 
const execFileAsync = promisify(execFile);
const app = express();
app.use(express.json());
 
// Configuration
const CA_KEY_PATH = process.env.CA_KEY_PATH || '/secure/ca/user_ca_key';
const CERT_VALIDITY = '+4h'; // Certificates expire in 4 hours
 
interface SignRequest {
  publicKey: string;
  username: string;
  principals: string[];
}
 
/**
 * Validates the SSH public key format to prevent injection attacks.
 */
function isValidPublicKey(key: string): boolean {
  const sshKeyPattern = /^(ssh-rsa|ecdsa-sha2-nistp256|ssh-ed25519)\s+[A-Za-z0-9+/=]+\s*.*$/;
  return sshKeyPattern.test(key.trim());
}
 
app.post('/api/sign', async (req: Request<{}, {}, SignRequest>, res: Response) => {
  const { publicKey, username, principals } = req.body;
 
  // 1. Authenticate the user (Pseudo-code for your IdP integration)
  // const isAuthenticated = await verifyWithOktaOrGitHub(req.headers.authorization);
  // if (!isAuthenticated) return res.status(401).send('Unauthorized');
 
  // 2. Input Validation
  if (!publicKey || !isValidPublicKey(publicKey)) {
    return res.status(400).json({ error: 'Invalid Public Key format.' });
  }
  if (!username || !/^[a-zA-Z0-9_-]+$/.test(username)) {
    return res.status(400).json({ error: 'Invalid username.' });
  }
 
  const tempDir = await fs.mkdtemp(path.join('/tmp', 'ssh-ca-'));
  const pubKeyPath = path.join(tempDir, `${username}.pub`);
  const certPath = path.join(tempDir, `${username}-cert.pub`);
 
  try {
    // Write the requested public key to an ephemeral file
    await fs.writeFile(pubKeyPath, publicKey.trim() + '\n', { mode: 0o600 });
 
    // 3. Execute the OpenSSH signing command securely
    const args = [
      '-s', CA_KEY_PATH,         // The CA private key
      '-I', `${username}-auth`,  // Certificate Identity (logged on target servers)
      '-n', principals.join(','),// Allowed principals (e.g., 'ubuntu,ec2-user')
      '-V', CERT_VALIDITY,       // Validity duration
      '-z', crypto.randomInt(1000, 9999).toString(), // Serial number
      pubKeyPath                 // The public key to sign
    ];
 
    await execFileAsync('ssh-keygen', args);
 
    // 4. Retrieve and return the newly minted certificate
    const signedCert = await fs.readFile(certPath, 'utf8');
    res.status(200).json({ 
      message: 'Certificate issued successfully',
      certificate: signedCert,
      expiresIn: CERT_VALIDITY
    });
 
  } catch (error) {
    console.error('Signing failure:', error);
    res.status(500).json({ error: 'Internal signing process failed.' });
  } finally {
    // Ensure ephemeral files are thoroughly scrubbed
    await fs.rm(tempDir, { recursive: true, force: true });
  }
});
 
const PORT = process.env.PORT || 3000;
app.listen(PORT, () => {
  console.log(`[CA Issuer] Service locked and listening on port ${PORT}`);
});

Phase 4: The Developer Experience

Engineers loathe friction. If obtaining a certificate is harder than dealing with authorized_keys, they will bypass the system. We need a zero-friction client script.

Here is a minimalist Node.js CLI tool that a developer runs at the start of their day. It generates an ephemeral key pair and requests a certificate from our new API before loading both into the native SSH agent.

// src/client.ts
import { execFile } from 'child_process';
import { promisify } from 'util';
import fs from 'fs/promises';
import path from 'path';
import os from 'os';
 
const execFileAsync = promisify(execFile);
 
async function authenticateAndLoad() {
  const keyName = 'ephemeral_id_ed25519';
  const sshDir = path.join(os.homedir(), '.ssh');
  const privateKeyPath = path.join(sshDir, keyName);
  const publicKeyPath = `${privateKeyPath}.pub`;
  const certPath = `${privateKeyPath}-cert.pub`;
 
  console.log('Generating ephemeral daily SSH key...');
  await execFileAsync('ssh-keygen', ['-t', 'ed25519', '-f', privateKeyPath, '-N', '']);
 
  const publicKey = await fs.readFile(publicKeyPath, 'utf8');
 
  console.log('Requesting signed certificate from Authority...');
  const response = await fetch('https://ca.internal.net/api/sign', {
    method: 'POST',
    headers: { 
      'Content-Type': 'application/json',
      'Authorization': `Bearer ${process.env.SSO_TOKEN}` // Assuming SSO token is exported
    },
    body: JSON.stringify({
      publicKey,
      username: os.userInfo().username,
      principals: ['ubuntu', 'developer']
    })
  });
 
  if (!response.ok) throw new Error(`CA rejected request: ${response.statusText}`);
  
  const data = await response.json();
  await fs.writeFile(certPath, data.certificate, { mode: 0o600 });
 
  console.log('Loading certificate into ssh-agent...');
  // Adding the private key automatically loads the adjacent -cert.pub file
  await execFileAsync('ssh-add', [privateKeyPath]);
 
  console.log('Access granted. Certificate expires in 4 hours.');
}
 
authenticateAndLoad().catch(console.error);

With this workflow, an engineer types npm run ssh-login, and immediately gains highly constrained, perfectly auditable access to the entire fleet. When their shift ends, the certificate mathematically rots away. If a laptop is stolen, there is no mad dash to strip their keys from production; the key is either already invalid or will be shortly.

Conclusion: Architecting for Resilience

Shifting to SSH certificates delivers a profound architectural simplification alongside a security upgrade.

Are there trade-offs? Absolutely. Introducing a centralized Certificate Authority means you have created a single point of failure for issuing new access. If your signing service goes down, nobody can get a new certificate. However, because the target servers independently verify the cryptographic math, existing valid certificates will continue to work flawlessly even if the CA is offline.

Furthermore, logging becomes infinitely superior. Instead of seeing a generic Accepted publickey for ubuntu, your auth logs will read Accepted certificate ID "jane.doe-auth" signed by CA. You know exactly who logged in, regardless of which shared UNIX account they assumed.

Stop treating secure shell access like a chaotic 1990s BBS. Eliminate TOFU anxiety and delete your authorized_keys deployment scripts to start thinking in systems. By delegating trust to cryptography rather than sync jobs, you build an infrastructure that scales organically and fails gracefully.