We Built Our Own Jarvis: An AI Error Analyst for Happier On-Calls

The 3 AM call

3 AM. PagerDuty goes off. You fumble for the laptop, brain still half-dreaming about refactoring a legacy monolith. You open the alert: a cryptic stack trace. Null pointer? Downstream timeout? Did someone deploy at 5 PM on a Friday?

The next 30 minutes are a scramble. Tailing logs across five services, correlating timestamps, searching Slack for who pushed the last commit to main. Mean Time to Resolution might be acceptable to the business. Mean Time to Coffee is through the roof.

This was our reality. Our on-call engineers are good, but we were burning them out on tedious, repetitive triage. We wanted a really smart assistant that never sleeps and isn't picky about coffee.

What if the alert came pre-analyzed?

Instead of a raw JSON payload from Sentry, what if you received a calm message in Slack:

Jarvis Error Analysis: TypeError: Cannot read properties of undefined (reading 'id') in OrderService

Summary: TypeError in processNewOrder. The user object on the incoming order payload was undefined.

Potential cause: Commit a4e8b1c by jane.doe 15 minutes ago modified user hydration logic in AuthGateway. Most likely culprit.

Suggested action: Review recent changes in AuthGateway. Consider rolling back the last deployment while investigating.

We built this. We call it Jarvis.

The architecture: simple and serverless

No complex microservices needed. It's an event-driven flow stitching together tools we already had.

  1. Error captured: Sentry catches an exception.
  2. Webhook fired: Sentry sends the payload to our endpoint.
  3. Cloud function triggered: A TypeScript Cloud Function wakes up.
  4. Context gathered: The function parses the stack trace, fetches the last 5 commits from GitHub, and optionally queries observability metrics.
  5. AI analysis: The function bundles everything into a prompt and sends it to GPT-4.
  6. Slack report: The response is formatted and posted to #on-call-alerts.

The whole process takes about 10 seconds.
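Step 4 mentions parsing the stack trace before it goes into the prompt. A minimal sketch of what that could look like is below. The `StackFrame` shape is an assumption loosely modeled on Sentry's event JSON (frames ordered oldest to newest, with an `in_app` flag), and `summarizeFrames` is a hypothetical helper, not part of our production code.

```typescript
// Assumed frame shape, loosely modeled on Sentry's event JSON.
// Check the payload your own webhook receives before relying on it.
interface StackFrame {
  filename: string;
  function?: string;
  lineno?: number;
  in_app?: boolean;
}

// Hypothetical helper: drop library frames, keep the innermost `limit`
// application frames, and format them one per line, most recent call first.
function summarizeFrames(frames: StackFrame[], limit = 5): string {
  // Frames without an explicit in_app flag are kept to be safe.
  const appFrames = frames.filter(frame => frame.in_app !== false);
  return appFrames
    .slice(-limit) // assumes the last frame is the most recent call
    .reverse()
    .map(f => `at ${f.function ?? '<anonymous>'} (${f.filename}:${f.lineno ?? '?'})`)
    .join('\n');
}
```

Trimming to application frames keeps the prompt short and points the model at code that recent commits could actually have touched.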

The core logic

Simplified, but this illustrates the flow:

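import type { Request, Response } from "express";
import { Octokit } from "@octokit/rest";
import { OpenAI } from "openai";
 
const octokit = new Octokit({ auth: process.env.GITHUB_TOKEN });
const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });
 
// HTTP entry point: receives the Sentry webhook and runs the triage pipeline.
// Request/Response are Express-style types; adjust them to your function runtime.
export const errorTriageHandler = async (req: Request, res: Response) => {
  try {
    const errorPayload = req.body;
 
    const recentCommits = await getRecentCommits('my-org', 'my-repo');
    const prompt = buildAnalysisPrompt(errorPayload, recentCommits);
    const analysis = await getAIAnalysis(prompt);
    await postToSlack(analysis);
 
    res.status(200).send('Analysis complete.');
  } catch (err) {
    // Always answer the webhook, even when GitHub, OpenAI, or Slack fails.
    console.error('Error triage failed:', err);
    res.status(500).send('Analysis failed.');
  }
};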
 
async function getRecentCommits(owner: string, repo: string) {
  const { data } = await octokit.repos.listCommits({
    owner,
    repo,
    per_page: 5,
  });
  return data.map(commit => ({
    sha: commit.sha.substring(0, 7),
    message: commit.commit.message,
    author: commit.commit.author?.name,
    date: commit.commit.author?.date,
  }));
}
 
function buildAnalysisPrompt(error: any, commits: any[]): string {
  const stackTrace = JSON.stringify(error.stacktrace, null, 2);
  const commitHistory = JSON.stringify(commits, null, 2);
 
  return `
    Analyze the following production error. Provide a summary, likely cause, and suggested action.
 
    **Error Stack Trace:**
    ${stackTrace}
 
    **Recent Commits (last 5):**
    ${commitHistory}
 
    Be direct and technical. Assume you are talking to a senior engineer.
  `;
}
 
async function getAIAnalysis(prompt: string): Promise<string> {
  const completion = await openai.chat.completions.create({
    model: 'gpt-4-turbo-preview',
    messages: [
      {
        role: 'system',
        content: 'You are an expert Senior Software Engineer named Jarvis. Analyze production errors and provide concise, actionable summaries. No apologies, no filler.',
      },
      { role: 'user', content: prompt },
    ],
  });
  return completion.choices[0].message.content || 'Analysis failed.';
}
 
async function postToSlack(message: string) { /* ... */ }
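The Slack step is elided above. As a rough sketch of one way to fill it in, the snippet below assumes a Slack incoming webhook, which accepts a JSON body with a `text` field. The helper names `buildSlackPayload` and `postToSlackWebhook`, and the idea of passing the webhook URL in as a parameter, are illustrative assumptions rather than what the listing above does.

```typescript
// Hypothetical helper: wraps the analysis in the JSON body that Slack
// incoming webhooks accept ({ text: string }). Asterisks render as bold.
function buildSlackPayload(errorTitle: string, analysis: string): { text: string } {
  return {
    text: `*Jarvis Error Analysis: ${errorTitle}*\n\n${analysis.trim()}`,
  };
}

// Hypothetical sender: POSTs the payload to an incoming-webhook URL.
// Uses the global fetch available in Node 18+.
async function postToSlackWebhook(
  webhookUrl: string,
  payload: { text: string },
): Promise<void> {
  const response = await fetch(webhookUrl, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(payload),
  });
  if (!response.ok) {
    // Surface delivery failures so the handler can log them.
    throw new Error(`Slack webhook returned HTTP ${response.status}`);
  }
}
```

Keeping payload construction separate from the network call makes the formatting easy to unit test without touching Slack.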

The real magic: the prompt

The difference between a useless summary and an actionable insight is often a few sentences in the system prompt. We don't want "I'm sorry you're seeing this error." We want "The error is in the payment module, likely caused by commit X."

Providing structured context (stack trace, commit history) gives the model the raw data to draw logical connections. Time spent tuning these prompts was probably the highest-ROI activity we did all year.
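One way to make the output as structured as the input is to ask for a fixed set of labeled lines and parse them before posting. The sketch below is illustrative only: `OUTPUT_FORMAT`, `TriageReport`, and `parseTriageReport` are hypothetical names, and the three labels simply mirror the example Slack message earlier in this post.

```typescript
// Hypothetical output contract that could be appended to the system prompt.
const OUTPUT_FORMAT = [
  'Respond with exactly three lines, in this order:',
  'Summary: <one sentence>',
  'Potential cause: <one sentence>',
  'Suggested action: <one sentence>',
].join('\n');

interface TriageReport {
  summary: string;
  potentialCause: string;
  suggestedAction: string;
}

// Pull a labeled line such as "Summary: ..." out of the model's reply.
function extractField(reply: string, label: string): string | undefined {
  const prefix = `${label.toLowerCase()}:`;
  const line = reply
    .split('\n')
    .map(l => l.trim())
    .find(l => l.toLowerCase().startsWith(prefix));
  return line?.slice(prefix.length).trim();
}

// Returns null when the reply does not follow the contract, so the caller
// can fall back to posting the raw text instead of a broken report.
function parseTriageReport(reply: string): TriageReport | null {
  const summary = extractField(reply, 'Summary');
  const potentialCause = extractField(reply, 'Potential cause');
  const suggestedAction = extractField(reply, 'Suggested action');
  if (!summary || !potentialCause || !suggestedAction) return null;
  return { summary, potentialCause, suggestedAction };
}
```

A reply that opens with an apology or skips a label parses to `null`, which is a cheap signal that the prompt needs another round of tuning.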

Results

  • MTTR is down. Engineers get a head start, often with the correct root cause already identified. Diagnoses that used to take 30 to 45 minutes now take under 5 minutes.
  • Cognitive load is reduced. Instead of a puzzle, engineers get a set of clues.
  • Better post-mortems. Jarvis's initial analysis provides a solid starting point.

The system costs a few dollars a month in Cloud Function and API fees. It's paid for itself many times over in saved hours and reduced stress.

Build your own Jarvis. Your sleep schedule will improve.