Regex Will Betray You: Building an AST and AI-Driven i18n Pipeline

Confession: I used to think regular expressions were the ultimate hammer for every text-processing nail. I once brought down an entire production dashboard because my "clever" regex replaced a core database variable name instead of a frontend UI string. We’ve all been there, sweating profusely at 2 AM while a database rollback takes its sweet, agonizing time.

Recently, this exact trauma resurfaced. I decided to try out the newly released 1.0 version of a blazing-fast code editor that everyone has been talking about. The performance was spectacular, but there was a glaring omission: no multi-language support. I prefer working in my native Korean, and I quickly discovered that the underlying architecture simply didn't support internationalization (i18n). Every UI label, menu item, and error message was hardcoded directly into the source code.

To make matters worse, the community forks attempting to translate the editor into other languages were relying heavily on regex search-and-replace scripts. They were running brittle shell commands across thousands of files to swap English strings with localized ones. As a systems thinker, I felt my skin crawl. Regex does not understand context. It does not know the difference between a UI button labeled "Save" and an internal function named save(). Relying on pattern matching to patch source code that is about to be compiled into a shipped binary is a ticking time bomb.

I knew I needed a better system: a pipeline that could parse source code with structural awareness, extract the UI strings safely, translate them intelligently using large language models, and inject them back without breaking the build. Today, I want to walk you through exactly how I built an AST (Abstract Syntax Tree) and AI-driven localization pipeline. The editor I patched is written in Rust, and I parsed it with Tree-sitter through its Python bindings. Here I am going to demonstrate the same system using TypeScript and React, as it is highly applicable to the modern web stacks most of us wrestle with daily.

The Problem: The Hardcoded Monolith

Imagine inheriting a massive codebase where the original developers, in their rush to ship version 1.0, ignored i18n entirely. You have thousands of files looking like this:

// Header.tsx
export const Header = () => {
  return (
    <header>
      <button className="btn-primary">Save Changes</button>
      <p>Welcome back, user!</p>
    </header>
  );
}

If you try to translate this by running a blanket regex like content.replace(/Save Changes/g, "변경사항 저장"), you are playing Russian Roulette with your compiler. What happens if a string literal is part of an object key? What happens if it's inside a comment?
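
To make the failure concrete, here is a minimal sketch of that blanket replacement run against a hypothetical file. The handler name onSave and the track() call are invented for illustration; the point is only that the same phrase can appear in several syntactic roles.

```typescript
// Hypothetical file contents: the same phrase appears in four different roles.
const source = [
  'const handlers = { "Save Changes": onSave };   // object key used for lookup',
  "// TODO: rename the Save Changes handler",
  'track("Save Changes");                          // analytics event name',
  '<button className="btn-primary">Save Changes</button>',
].join("\n");

// The blanket replacement from the paragraph above.
const patched = source.replace(/Save Changes/g, "변경사항 저장");

// Count how many occurrences were rewritten and how many survived.
const rewritten = patched.split("변경사항 저장").length - 1;
const remaining = patched.split("Save Changes").length - 1;

console.log(patched);
console.log(`rewritten: ${rewritten}, remaining: ${remaining}`);
```

All four occurrences are rewritten, although only the last one was UI text. The object key and the analytics event name now silently differ from whatever the rest of the system expects.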

The core issue is that text files are an illusion. Source code is a highly structured mathematical tree rather than a simple string of characters. When you treat source code as plain text, you lose all contextual metadata. You throw away the exact information the compiler uses to understand your application.

To fix this without rewriting the entire application to use an i18n provider, we have to intercept the build process. We need to read the code and identify only the strings that end up in the DOM or UI. Then we can translate them and write them back to disk.

The Solution: Enter the Abstract Syntax Tree (AST)

An Abstract Syntax Tree is the data structure your compiler creates when it reads your code. Tools like Babel, ESLint, and Tree-sitter use syntax trees to understand the relationship between different parts of your code. By utilizing an AST parser, we can confidently ask our code: "Give me all the string literals, but only if they are the children of a JSX element or the direct argument of a specific UI function."
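
As a rough illustration of that kind of query, the sketch below uses the raw TypeScript compiler API (the typescript package, which ts-morph wraps) on an inline snippet. The snippet and the helper name collectJsxText are assumptions made for this example, not part of the original pipeline.

```typescript
import * as ts from "typescript";

// The phrase appears as an object key, in a comment, and as real JSX text.
const source = `
const labels = { "Save Changes": 1 };
// Save Changes is handled elsewhere
export const Header = () => (
  <header>
    <button className="btn-primary">Save Changes</button>
  </header>
);
`;

// Parse as TSX so JSX elements become proper nodes in the tree.
const sourceFile = ts.createSourceFile(
  "Header.tsx",
  source,
  ts.ScriptTarget.Latest,
  true,
  ts.ScriptKind.TSX,
);

// Walk the tree and keep only non-empty JsxText nodes.
function collectJsxText(root: ts.Node): string[] {
  const found: string[] = [];
  const visit = (node: ts.Node): void => {
    if (ts.isJsxText(node)) {
      const text = node.text.trim();
      if (text.length > 0) found.push(text);
    }
    ts.forEachChild(node, visit);
  };
  visit(root);
  return found;
}

const uiStrings = collectJsxText(sourceFile);
console.log(uiStrings);
```

The object key is a string literal node and the comment is not a node at all, so neither is collected. Only the text between the button tags comes back.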

Once we have safely extracted these strings, we pass them to an AI agent. I opted for this because traditional translation APIs often struggle with the weird formatting quirks of software UI. These include {variable} injections and %s placeholders, as well as bizarre capitalization. An LLM can be prompted with the exact context of a software interface, resulting in translations that actually sound like a human software engineer wrote them.
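
Prompting alone does not guarantee that placeholders survive, so it can help to verify them after translation. The following is a minimal sketch, assuming placeholders look like {name}, {0}, %s, or %d; the pattern and function names are hypothetical and should be adapted to whatever formats your UI actually uses. A regex is reasonable here because the input is a single UI string rather than source code.

```typescript
// Matches {name}, {0}, and printf-style tokens such as %s or %d.
const PLACEHOLDER_PATTERN = /\{[A-Za-z0-9_]+\}|%[sd]/g;

// Return the placeholders in a string, sorted so that word order does not matter.
function extractPlaceholders(text: string): string[] {
  return (text.match(PLACEHOLDER_PATTERN) ?? []).sort();
}

// True when the translation contains exactly the same placeholders as the source.
function placeholdersPreserved(source: string, translated: string): boolean {
  const expected = extractPlaceholders(source);
  const actual = extractPlaceholders(translated);
  return (
    expected.length === actual.length &&
    expected.every((token, index) => token === actual[index])
  );
}

console.log(placeholdersPreserved("Welcome back, {user}!", "{user}님, 다시 오신 것을 환영합니다!"));
console.log(placeholdersPreserved("%s files saved", "파일이 저장되었습니다"));
```

Sorting before comparing matters because a natural translation often moves the placeholder to a different position in the sentence, as Korean does in the first call.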

Finally, we use the AST to traverse the code again. We replace the original nodes with our newly translated strings to generate a localized version of the source code ready for compilation.

Implementation: Extract, Translate, Inject

Let's build this pipeline. For our TypeScript environment, we will use ts-morph, a fantastic wrapper around the TypeScript Compiler API that makes AST manipulation genuinely enjoyable.

Step 1: Extracting the UI Strings

First, we need to extract the strings safely. We will initialize ts-morph and load our project files before hunting for JSX text nodes and string literals used in UI contexts.

import { Project, SyntaxKind, JsxText, StringLiteral } from "ts-morph";
import * as fs from "fs";
 
// Initialize the project
const project = new Project({
  tsConfigFilePath: "./tsconfig.json",
});
 
const dictionary: Record<string, string> = {};
 
function extractStrings() {
  const sourceFiles = project.getSourceFiles();
 
  for (const file of sourceFiles) {
    // 1. Extract plain text inside JSX elements
    const jsxTexts = file.getDescendantsOfKind(SyntaxKind.JsxText);
    jsxTexts.forEach((node: JsxText) => {
      const text = node.getLiteralText().trim();
      if (text.length > 0) {
        // We use the English text as the key for simplicity in this example
        dictionary[text] = ""; 
      }
    });
 
    // 2. Extract string literals inside JSX Attributes (e.g., placeholder="Search...")
    const jsxAttributes = file.getDescendantsOfKind(SyntaxKind.JsxAttribute);
    jsxAttributes.forEach((attr) => {
      const name = attr.getNameNode().getText();
      // Only grab specific attributes we know are user-facing
      if (["placeholder", "title", "alt", "aria-label"].includes(name)) {
        const initializer = attr.getInitializer();
        if (initializer && initializer.getKind() === SyntaxKind.StringLiteral) {
          const text = (initializer as StringLiteral).getLiteralValue();
          if (text.length > 0) dictionary[text] = "";
        }
      }
    });
  }
 
  // Save our extracted dictionary map
  fs.writeFileSync("./extracted_strings.json", JSON.stringify(dictionary, null, 2));
  console.log(`Extracted ${Object.keys(dictionary).length} strings safely via AST.`);
}
 
extractStrings();

Notice the surgical precision here. We are explicitly telling the compiler API to only grab text that exists between React tags, or text that acts as a user-facing attribute such as placeholder or aria-label, without relying on guesswork. If the word "Save" exists as an internal object property, our traversal never visits it. This is the power of systems thinking applied to code analysis.
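
The script above covers JSX text and attributes, but not the "direct argument of a specific UI function" case mentioned earlier. Here is a minimal sketch of that query using the raw TypeScript compiler API. The function name showToast is hypothetical; substitute whatever your codebase uses to display text.

```typescript
import * as ts from "typescript";

// Hypothetical allow-list of functions whose first argument is user-facing.
const UI_FUNCTIONS = new Set(["showToast"]);

const source = `
const cache = new Map<string, string>();
cache.set("Save", "internal-key");
showToast("Save");
showToast(buildMessage());
`;

const sourceFile = ts.createSourceFile(
  "actions.ts",
  source,
  ts.ScriptTarget.Latest,
  true,
  ts.ScriptKind.TS,
);

// Collect string literals passed directly as the first argument of a UI function.
function collectUiArguments(root: ts.Node): string[] {
  const found: string[] = [];
  const visit = (node: ts.Node): void => {
    if (
      ts.isCallExpression(node) &&
      ts.isIdentifier(node.expression) &&
      UI_FUNCTIONS.has(node.expression.text)
    ) {
      const firstArg = node.arguments[0];
      if (firstArg && ts.isStringLiteral(firstArg)) {
        found.push(firstArg.text);
      }
    }
    ts.forEachChild(node, visit);
  };
  visit(root);
  return found;
}

const uiArguments = collectUiArguments(sourceFile);
console.log(uiArguments);
```

The "Save" passed to cache.set is skipped because the callee is a property access rather than an allow-listed identifier, and the computed argument is skipped because it is not a literal.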

Step 2: The AI Translation Engine

Now that we have an extracted_strings.json file filled with keys and empty values, we need to translate them. I could have manually translated the Korean strings, but I wanted to support 13 different languages from day one. Doing that manually is a recipe for burnout.

I wrote a Node script that batches the JSON keys and sends them to an OpenAI model. The trick here is the system prompt. You must strictly instruct the LLM to preserve placeholders and technical casing while returning valid JSON.

import OpenAI from "openai";
import * as fs from "fs";
 
const openai = new OpenAI({
  apiKey: process.env.OPENAI_API_KEY,
});
 
async function translateDictionary(targetLang: string) {
  const rawData = fs.readFileSync("./extracted_strings.json", "utf-8");
  const dictionary = JSON.parse(rawData);
  const keys = Object.keys(dictionary);
 
  console.log(`Translating ${keys.length} items to ${targetLang}...`);
 
  const systemPrompt = `
    You are an expert software localization engineer. 
    Translate the following JSON keys into ${targetLang}. 
    Rules:
    1. Maintain all UI contexts (e.g., "Save" should be the verb for saving a file).
    2. Do NOT translate technical placeholders like {0}, %s, or React variables.
    3. Return ONLY a valid JSON object where the key is the original English string and the value is the translated string.
  `;
 
  // In a real scenario with thousands of strings, you must chunk the array to fit context limits.
  // For brevity, we pass it all at once here.
  const response = await openai.chat.completions.create({
    model: "gpt-4o",
    messages: [
      { role: "system", content: systemPrompt },
      { role: "user", content: JSON.stringify(keys) }
    ],
    response_format: { type: "json_object" },
    temperature: 0.1, // Low temperature keeps output stable (not strictly deterministic)
  });
 
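  const translatedJSON = response.choices[0].message.content;
  if (translatedJSON) {
    JSON.parse(translatedJSON); // Throws early if the model returned malformed JSON
    fs.mkdirSync("./locales", { recursive: true }); // Ensure the output directory exists
    fs.writeFileSync(`./locales/${targetLang}.json`, translatedJSON);
    console.log(`Successfully generated ${targetLang} localization file.`);
  }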
}
 
// Usage: translateDictionary("Korean");
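
The comment in the script notes that a real run must chunk the keys. One way to do that is sketched below: split the keys into batches, translate each batch separately, then merge the results and flag anything the model dropped. The helper names and the idea of a fixed batch size are assumptions; in practice the right size depends on token counts rather than item counts.

```typescript
// Split the extracted keys into fixed-size batches so each request stays small.
function chunk<T>(items: T[], size: number): T[][] {
  if (size < 1) throw new Error("size must be at least 1");
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// Merge per-batch results and report any source key that is absent or blank.
function mergeBatches(
  keys: string[],
  results: Record<string, string>[],
): { merged: Record<string, string>; missing: string[] } {
  const merged: Record<string, string> = Object.assign({}, ...results);
  const missing = keys.filter((key) => {
    const value = merged[key];
    return !value || value.trim() === "";
  });
  return { merged, missing };
}

const batches = chunk(["Save", "Open", "Close", "Undo", "Redo"], 2);
console.log(batches.length);
```

Each batch would be sent through the same chat completion call, and any keys listed in missing can be retried before the locale file is written.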

This approach saved me weeks of manual labor. I ran the script, and within minutes, I had 13 distinct JSON files populated with highly accurate, context-aware translations. I cross-referenced the Korean output with the official VS Code language packs just to check the quality, and the AI nailed the specific jargon (like "리팩토링" for refactoring) beautifully.

Step 3: AST Injection and Automating the Build

Having the translations is only half the battle. Because the original application didn't use an i18n library (like react-i18next), we have to physically modify the source code before we compile the localized application.

We spin up ts-morph one more time. This script reads the translated JSON and finds the exact same AST nodes we identified in Step 1 to replace their text.

import { Project, SyntaxKind, JsxText, StringLiteral } from "ts-morph";
import * as fs from "fs";
 
function injectTranslations(targetLang: string) {
  const project = new Project({
    tsConfigFilePath: "./tsconfig.json",
  });
 
  const rawData = fs.readFileSync(`./locales/${targetLang}.json`, "utf-8");
  const dictionary: Record<string, string> = JSON.parse(rawData);
 
  const sourceFiles = project.getSourceFiles();
 
  for (const file of sourceFiles) {
    let modified = false;
 
    // 1. Replace JSX Text
    const jsxTexts = file.getDescendantsOfKind(SyntaxKind.JsxText);
    jsxTexts.forEach((node: JsxText) => {
      const rawText = node.getLiteralText();
      const originalText = rawText.trim();
      const translatedText = dictionary[originalText];
      
      if (translatedText && originalText !== translatedText) {
        // Swap only the trimmed text so surrounding indentation and line breaks survive
        node.replaceWithText(rawText.replace(originalText, () => translatedText));
        modified = true;
      }
    });
 
    // 2. Replace String Literals in UI attributes
    const jsxAttributes = file.getDescendantsOfKind(SyntaxKind.JsxAttribute);
    jsxAttributes.forEach((attr) => {
      const name = attr.getNameNode().getText();
      if (["placeholder", "title", "alt", "aria-label"].includes(name)) {
        const initializer = attr.getInitializer();
        if (initializer && initializer.getKind() === SyntaxKind.StringLiteral) {
          const originalText = (initializer as StringLiteral).getLiteralValue();
          const translatedText = dictionary[originalText];
          
          if (translatedText && originalText !== translatedText) {
            // JSX attribute strings don't process backslash escapes, so encode any double quotes
            initializer.replaceWithText(`"${translatedText.replace(/"/g, "&quot;")}"`);
            modified = true;
          }
        }
      }
    });
 
    if (modified) {
      file.saveSync(); // Write changes to disk
    }
  }
 
  console.log(`Source code successfully localized to ${targetLang}. Ready for compilation.`);
}
 
// Usage: injectTranslations("Korean"); // Must match the name passed to translateDictionary
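
One thing to watch during injection is that a translation may contain characters that JSX would parse as syntax. The helpers below are a minimal sketch of how the replacement text could be sanitized before it is handed to replaceWithText. The function names are hypothetical, and the sketch assumes the translated text should be shown literally, with no entities or markup of its own.

```typescript
// Escape characters that would be parsed as JSX syntax instead of text.
function escapeJsxText(text: string): string {
  return text
    .replace(/\{/g, "&#123;")
    .replace(/\}/g, "&#125;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;");
}

// Keep the indentation and line breaks that surrounded the original text.
// Assumes rawNodeText contains at least one non-whitespace character.
function replacePreservingWhitespace(rawNodeText: string, translated: string): string {
  const leading = rawNodeText.match(/^\s*/)?.[0] ?? "";
  const trailing = rawNodeText.match(/\s*$/)?.[0] ?? "";
  return `${leading}${escapeJsxText(translated)}${trailing}`;
}

// JSX attribute strings do not process backslash escapes, so quotes become entities.
function toJsxAttributeLiteral(translated: string): string {
  return `"${translated.replace(/"/g, "&quot;")}"`;
}

console.log(JSON.stringify(replacePreservingWhitespace("\n      Save Changes\n    ", "변경사항 저장")));
console.log(toJsxAttributeLiteral('"검색" 입력'));
```

Because JSX decodes HTML entities in both text and attribute strings, the rendered UI still shows the original characters while the source stays parseable.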

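By executing this script right before the build command, the compiler receives a codebase localized entirely in Korean, with no broken imports and no accidental variable renaming. Keep in mind that saveSync() overwrites the source files in place, so run it against a fresh checkout of the upstream code rather than your working copy.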

The Reality of CI/CD: The 10-Hour Marathon

Architecture is useless if you can't deliver it reliably. When you're patching and compiling software from source, especially for multiple operating systems, you need a transparent pipeline. Users are rightfully paranoid about downloading unofficial binaries. If I'm going to distribute a customized, localized version of an editor, I need to prove that I didn't inject malware into the source.

I set up GitHub Actions to handle the entire lifecycle. The workflow automatically pulls the upstream repository and runs the AST extraction and injection scripts before triggering the heavy compilation step.

Here is where the solopreneur reality check hits hard. Compiling a massive codebase (especially one in Rust or a heavy Electron app) takes a ridiculous amount of compute. The free tier of GitHub Actions offers runners that are incredibly generous, but they are still shared machines.

My build matrix, compiling 13 language variants across Windows, macOS, and Linux, took over 10 hours to complete. I spent days watching a tiny loading spinner, praying that the job wouldn't time out at the 599th minute. There is a specific kind of developer despair that occurs when you successfully finish a 10-hour build for version 1.2.5, only to check the upstream repository and realize the core team just released version 1.2.6 an hour ago. You sigh and grab a coffee before pushing the trigger button again.

Despite the friction, the transparency is worth it. Anyone can look at the GitHub Actions logs and review the AST manipulation scripts to verify exactly how the binaries are generated. Trust is the currency of open source, and a verifiable pipeline is how you mint it.

Conclusion

Treating your code as a raw string is a habit born of convenience, but it scales terribly. When we elevate our thinking to treat code as data, as a structured tree that we can query and manipulate programmatically, entire classes of bugs disappear.

Combining AST parsing with the linguistic capabilities of modern AI completely changed my perspective on what a solo developer can accomplish. A few years ago, maintaining 13 different localized forks of a fast-moving upstream project would have required a dedicated team of maintainers and translators. Today, it requires a few TypeScript files and a clear CI/CD pipeline, along with a whole lot of patience with free-tier build runners.

The next time you reach for regex to patch code, stop. Take a breath. Look into Tree-sitter or the TypeScript compiler API. The learning curve is steeper up front, but the tree will not betray you the way the string does.