A Frontend Team Trained a 99% Accurate CNN Captcha Solver in 30 Minutes Using AI-Generated Code

Captcha Recognition in Practice: Frontend Stops Writing Pages and Starts Training Models?

1. Introduction: Frontend Engineers Training Models Is No Longer a Fantasy

Frontend developers who work with Canvas, WebGL, and image upload components every day have long been dealing with "pixel matrices", which are exactly the input that convolutional neural networks operate on. The only obstacle was that the barrier to training models used to be too high. Now, with AI assistance, someone who writes TypeScript can train a CNN on the side.

This article documents the complete process of using two approaches, DDDD low-code training and a PyTorch CNN, to tackle samples of different difficulty from the same type of captcha (4 characters, digits and letters). It is worth emphasizing that the core code for the CNN solution was generated by AI: we spent only about half an hour on several rounds of prompt adjustments and parameter fine-tuning to get the training pipeline running. The purpose of this article is not to show off technical depth, but to prove one thing: in the AI era, a frontend developer's toolbox can absolutely include a "model training" screwdriver, and this screwdriver is handed to you by AI.

2. Practical Background: Same Type of Captcha, Two Difficulty Levels, Two Strategies

We had captcha images encountered in the same line of business, but with significant differences in difficulty:

Type	Characteristics	Training Strategy
Easy: 4-character alphanumeric captcha	4 characters, slight noise, regular deformation	DDDD quick solution: configure and train, covering multiple easy samples
Hard: conjoined and rotationally distorted captcha	4 characters, severe conjoining, obvious rotational distortion	AI-assisted CNN solution: AI generates the code skeleton, then we fine-tune for the hard cases

3. Solution 1: DDDD (ddddocr) — Low-Code Rapid Coverage of Multiple Easy-Difficulty Samples

The dddd_trainer companion to ddddocr reduces model training to two steps, "modify config + run command," which is extremely frontend-friendly. We used it to handle multiple easy-difficulty 4-character alphanumeric captcha samples.

3.1 Environment Setup

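conda create -n captcha python=3.10
conda activate captcha
pip install torch torchvision
# dddd_trainer is run from a clone of its repository (app.py lives there)
git clone https://github.com/sml2h3/dddd_trainer.git
cd dddd_trainer
pip install -r requirements.txt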

3.2 Dataset Organization

DDDD's training data organization method is a flat folder with filenames containing labels: all images are placed in the same directory, with the filename format label_randomvalue.extension. The part before the underscore is the captcha content, and the part after is a random hash (to prevent duplicate names).

/root/images_set/
├── 3x9k_a1b2c3d4.png
├── ab2c_e5f6g7h8.jpg
└── 0000_x9y8z7w6.png

If your filenames are inconvenient to rename into this format, DDDD also supports a second method: mapping via a labels.txt file, where image filenames can be completely arbitrary and labels are written separately in the txt file.

Leveraging the DDDD solution, we successively annotated and trained multiple easy-difficulty 4-digit captcha samples, totaling tens of thousands of images. The validation set ratio is configured in config.yaml via Val: 0.03, and the tool automatically splits the data when the cache command is executed, eliminating the need for manual folder separation.
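
The label_randomvalue.extension naming convention described above can be sketched in a few lines. This is a hypothetical helper of ours, not part of dddd_trainer; the 8-character MD5 prefix is an assumption chosen to mirror the example filenames, and any collision-resistant suffix would do.

```python
import hashlib
from pathlib import Path


def dddd_style_name(label: str, image_bytes: bytes, ext: str = ".png") -> str:
    """Build a 'label_hash.ext' filename (hypothetical helper, not a dddd_trainer API)."""
    if "_" in label:
        # The underscore separates label from hash, so it cannot appear in the label.
        raise ValueError("label must not contain an underscore")
    # Assumption: 8 hex characters of the content hash are enough to avoid name clashes.
    digest = hashlib.md5(image_bytes).hexdigest()[:8]
    return f"{label}_{digest}{ext}"


def label_from_name(filename: str) -> str:
    """Recover the label: everything before the first underscore of the stem."""
    return Path(filename).stem.split("_")[0]


name = dddd_style_name("3x9k", b"fake image bytes")
print(name)                   # 3x9k_<8 hex chars>.png
print(label_from_name(name))  # 3x9k
```

Hashing the image content rather than using a random value also means that an exact duplicate image with the same label maps to the same filename, which is one simple way to deduplicate while annotating.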

3.3 Configuration File and Training

Model:
  CharSet: []
  ImageChannel: 1
  ImageHeight: 64
  ImageWidth: -1
System:
  GPU: true
  Val: 0.03
Train:
  BATCH_SIZE: 64
  CNN: {NAME: ddddocr}
  LR: 0.01
  TARGET:
    Accuracy: 0.97
    Epoch: 20

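python app.py create std_captcha                      # creates the project and its config.yaml
python app.py cache std_captcha /root/images_set/     # caches the data and splits off the validation set
python app.py train std_captcha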

Process and Results: The technical research and solution implementation for DDDD took about a day or two; subsequently, based on the data annotation experience accumulated from this process, we rapidly covered multiple easy-difficulty 4-digit captcha samples. Ultimately, on an RTX 4070S, training a single type took about 10 minutes, and validation set accuracy generally reached over 97%. The DDDD solution is suitable for rapid delivery of easy-difficulty captchas, but when encountering hard-difficulty samples with severe character conjoining and intense rotational distortion, the default configuration falls short.

4. Solution 2: AI-Assisted PyTorch CNN — From Zero to Running on Hard Difficulty in Half an Hour

To be honest, we initially knew nothing about CNNs. Faced with those severely conjoined and rotationally distorted captchas, DDDD repeatedly failed to produce results during training, and we didn't know what else to do. With a try-it-and-see attitude, we directly sent a few of the hardest sample images to the AI and asked it how to solve this kind of captcha. After looking at the images, the AI told us: this degree of deformation and conjoining cannot be handled by low-code tools, and suggested building a custom CNN using PyTorch, providing a complete solution outline.

At the time, we had no concept of what a CNN even was, let alone writing a model, adjusting learning rates, or reading loss curves. But because the AI had pointed the way, we decided to follow it through.

In the end, we described our requirements to the AI, it generated a complete code skeleton (config + dataset + model + train), and we spent only about half an hour on data adaptation and parameter fine-tuning to get the training running.

4.1 Project Structure: AI-Generated Engineering Layering

The following project structure was generated by AI based on our requirements, and it aligns very well with frontend engineering intuition:

captcha_cnn/
├── config.py          # Centralized configuration (like frontend's constants.ts)
├── dataset.py         # Data loading and cleaning
├── model.py           # CNN network definition
├── train.py           # Main training loop
└── data/
    └── raw/           # Raw images, named like: a3b9_001.png

4.2 config.py: AI Suggested, We Decided

The AI-generated config.py centrally manages the character set, image dimensions, and training parameters. We adjusted it based on the actual data: for example, the captcha has 4 characters, and the image size is 420×80 (width×height):

# config.py
IMG_W, IMG_H = 420, 80
CHARS = "0123456789abcdefghijklmnopqrstuvwxyz"   # 36 character classes
NUM_CLASSES = len(CHARS)
MAX_LEN = 4          # Captcha length (filename prefixes are all 4 characters)
CHAR2IDX = {c: i for i, c in enumerate(CHARS)}
IDX2CHAR = {i: c for i, c in enumerate(CHARS)}

DATA_DIRS = [
    "data/raw",      # Must match the data/raw/ folder in the project structure
]
TRAIN_RATIO = 0.9

BATCH_SIZE = 64
EPOCHS = 60
LR = 1e-3            # Initial value; we later lowered it to 5e-4 during tuning
MODEL_PATH = "best_captcha.pth"
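
To make the role of CHAR2IDX and IDX2CHAR concrete, here is a minimal round-trip sketch. The encode and decode helpers are illustrative names of ours and do not appear in the generated project; the mappings are copied from config.py so the snippet runs on its own.

```python
CHARS = "0123456789abcdefghijklmnopqrstuvwxyz"
CHAR2IDX = {c: i for i, c in enumerate(CHARS)}
IDX2CHAR = {i: c for i, c in enumerate(CHARS)}


def encode(label: str) -> list[int]:
    """Turn a captcha string into class indices (one per character position)."""
    return [CHAR2IDX[c] for c in label.lower()]


def decode(indices: list[int]) -> str:
    """Turn class indices (e.g. the argmax of each head) back into a string."""
    return "".join(IDX2CHAR[i] for i in indices)


ids = encode("a3b9")
print(ids)          # [10, 3, 11, 9]
print(decode(ids))  # a3b9
```

Digits occupy indices 0 to 9 and letters 10 to 35, so the training target for each image is simply a length-4 vector of integers in the range 0 to 35.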

4.3 dataset.py: AI Wrote the Skeleton, We Filled in the Business Logic

The AI-generated dataset.py already included standard implementations for data augmentation and loading. We made a key adjustment based on the actual data—added strict dirty sample filtering:

import random
from pathlib import Path
from PIL import Image
import torch
from torch.utils.data import Dataset, DataLoader
from torchvision import transforms
from config import *

# Training set gets augmentation, validation set stays original
train_tf = transforms.Compose([
    transforms.Grayscale(),
    transforms.Resize((IMG_H, IMG_W)),
    transforms.RandomRotation(5),                    # Slight rotation to simulate real scenarios
    transforms.ColorJitter(brightness=0.3, contrast=0.3),  # Brightness/contrast jitter
    transforms.GaussianBlur(kernel_size=3, sigma=(0.1, 1.0)),  # Slight blur
    transforms.ToTensor(),
    transforms.Normalize([0.5], [0.5]),             # Normalize to [-1, 1]
])

val_tf = transforms.Compose([
    transforms.Grayscale(),
    transforms.Resize((IMG_H, IMG_W)),
    transforms.ToTensor(),
    transforms.Normalize([0.5], [0.5]),
])


class CaptchaDataset(Dataset):
    def __init__(self, img_paths, tf):
        self.paths = img_paths
        self.tf = tf

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        p = self.paths[idx]
        # Parse label from filename, e.g., a3b9_001.png -> label "a3b9"
        label_str = Path(p).stem.split("_")[0].lower()[:MAX_LEN]
        img = Image.open(p).convert("RGB")
        x = self.tf(img)
        y = torch.tensor([CHAR2IDX[c] for c in label_str], dtype=torch.long)
        return x, y


def get_loaders():
    all_paths = []
    for d in DATA_DIRS:
        all_paths += [
            str(p) for p in Path(d).glob("*.png")
            if len(Path(p).stem.split("_")[0]) == MAX_LEN          # Filter wrong length
            and all(c in CHARS for c in Path(p).stem.split("_")[0].lower())  # Filter illegal characters
        ]
    random.shuffle(all_paths)
    n_train = int(len(all_paths) * TRAIN_RATIO)

    train_ds = CaptchaDataset(all_paths[:n_train], train_tf)
    val_ds   = CaptchaDataset(all_paths[n_train:], val_tf)

    train_loader = DataLoader(train_ds, batch_size=BATCH_SIZE, shuffle=True,  num_workers=2)
    val_loader   = DataLoader(val_ds,   batch_size=BATCH_SIZE, shuffle=False, num_workers=2)
    print(f"train: {len(train_ds)}  val: {len(val_ds)}")
    return train_loader, val_loader

A few pitfalls we stepped into together with the AI:

The AI initially suggested num_workers=4, which threw errors immediately when we ran it on Mac MPS; reducing it to 2 stabilized things. This is a common issue when the AI doesn't know your hardware environment—a human must intervene to verify.
We discussed the RandomRotation(5) parameter with the AI over two rounds: the AI initially suggested 15 degrees, but we found characters rotated out of bounds after testing, so we compromised at 5 degrees. The AI cannot guess the characteristics of your business data; a human must tell it.
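The model expects a 1-channel grayscale input, but Image.open does not guarantee any particular mode: it returns whatever the file stores (palette, RGBA, grayscale, and so on). The AI's initial code didn't handle this detail; we added convert("RGB") to normalize the mode first and then let transforms.Grayscale() reduce it to one channel, avoiding anomalies with certain PNG formats.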
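
The dirty-sample filter in get_loaders repeats the same filename parsing several times. One way to keep the rule in a single place is a small helper like the sketch below; parse_label is a hypothetical name of ours, and the constants are copied from config.py so the snippet is self-contained.

```python
from pathlib import Path
from typing import Optional

CHARS = "0123456789abcdefghijklmnopqrstuvwxyz"
MAX_LEN = 4


def parse_label(path: str) -> Optional[str]:
    """Return the lowercase label encoded in the filename, or None for a dirty sample."""
    prefix = Path(path).stem.split("_")[0].lower()
    if len(prefix) != MAX_LEN:
        return None          # wrong length
    if any(c not in CHARS for c in prefix):
        return None          # illegal character
    return prefix


samples = [
    "data/raw/a3b9_001.png",   # valid
    "data/raw/A3B9_002.png",   # valid after lowercasing
    "data/raw/a3b_003.png",    # too short
    "data/raw/a3-9_004.png",   # illegal character
]
kept = [p for p in samples if parse_label(p) is not None]
print(kept)
```

Because the same function can serve both the filter and __getitem__, the two can no longer drift apart, for example by one of them forgetting the lower() call.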

4.4 model.py: AI-Generated Network Architecture

model.py was generated directly by AI from the requirement "4-character captcha, 36 character classes, 420×80 grayscale input." It uses four convolutional layers + adaptive pooling + multi-head classification: each character position is predicted by an independent classification head, rather than by a single fully connected layer that outputs all positions at once:

import torch
import torch.nn as nn
from config import IMG_W, IMG_H, NUM_CLASSES, MAX_LEN


class CaptchaCNN(nn.Module):
    """
    Input: (B, 1, H, W)  → Output: (B, MAX_LEN, NUM_CLASSES)
    Each character position is classified independently (multi-head classification, not CTC)
    """
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.BatchNorm2d(32), nn.ReLU(),
            nn.MaxPool2d(2),                                        # 40×210
            nn.Conv2d(32, 64, 3, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
            nn.MaxPool2d(2),                                        # 20×105
            nn.Conv2d(64, 128, 3, padding=1), nn.BatchNorm2d(128), nn.ReLU(),
            nn.MaxPool2d(2),                                        # 10×52
            nn.Conv2d(128, 256, 3, padding=1), nn.BatchNorm2d(256), nn.ReLU(),
            nn.AdaptiveAvgPool2d((2, 4)),                           # 2×4
        )
        flat = 256 * 2 * 4
        self.heads = nn.ModuleList([
            nn.Sequential(
                nn.Linear(flat, 256), nn.ReLU(), nn.Dropout(0.3),
                nn.Linear(256, NUM_CLASSES)
            )
            for _ in range(MAX_LEN)
        ])

    def forward(self, x):
        feat = self.features(x).flatten(1)
        return torch.stack([h(feat) for h in self.heads], dim=1)  # (B, 4, 36)

Structure explanation: The four convolutional blocks extract features, while the three max-pooling layers halve the resolution each time (80×420 → 40×210 → 20×105 → 10×52); finally, AdaptiveAvgPool2d fixes the feature map to 2×4. Flattening gives a 256*2*4 = 2048-dimensional vector, which feeds 4 independent classification heads (one per character position of the captcha), each a Linear → ReLU → Dropout → Linear stack. This "multi-head" design allows the model to learn each character position separately, which is more stable than a single large fully connected layer directly outputting 4×36 in a coupled manner.
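
The shape comments in model.py can be checked with plain arithmetic, without running PyTorch. The sketch below assumes the defaults used in the code above: 3×3 convolutions with padding=1 keep the spatial size, and MaxPool2d(2) rounds down.

```python
def conv3x3_same(h: int, w: int) -> tuple[int, int]:
    # A 3x3 convolution with padding=1 and stride 1 keeps height and width.
    return h, w


def max_pool_2(h: int, w: int) -> tuple[int, int]:
    # MaxPool2d(2) floors by default (ceil_mode=False), so 105 -> 52.
    return h // 2, w // 2


h, w = 80, 420            # IMG_H, IMG_W
shapes = []
for _ in range(3):        # the first three conv blocks are each followed by a pool
    h, w = max_pool_2(*conv3x3_same(h, w))
    shapes.append((h, w))
print(shapes)             # [(40, 210), (20, 105), (10, 52)]

flat = 256 * 2 * 4        # 256 channels after AdaptiveAvgPool2d((2, 4))
print(flat)               # 2048

# Parameters of one head: Linear(2048, 256) + Linear(256, 36), weights plus biases.
head_params = (flat * 256 + 256) + (256 * 36 + 36)
print(head_params)        # 533796
```

With four heads, most of the model's parameters sit in the classification heads rather than in the convolutional stack, which is worth keeping in mind if the model overfits on a small dataset.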

4.5 train.py: AI Wrote the Logic, We Tuned the Parameters

The training script was also generated by AI, including two metrics: character-level accuracy and sequence-level accuracy. We mainly adjusted the learning rate and batch size:

import torch
import torch.nn as nn
from torch.optim import Adam
from torch.optim.lr_scheduler import CosineAnnealingLR
from config import *
from dataset import get_loaders
from model import CaptchaCNN


def accuracy(logits, targets):
    # logits: (B, MAX_LEN, NUM_CLASSES)  targets: (B, MAX_LEN)
    preds = logits.argmax(-1)          # (B, MAX_LEN)
    char_acc = (preds == targets).float().mean().item()
    seq_acc  = (preds == targets).all(dim=1).float().mean().item()
    return char_acc, seq_acc


def train():
    # Prefer Mac MPS, then CUDA, finally CPU
    device = torch.device("mps" if torch.backends.mps.is_available() else
                          "cuda" if torch.cuda.is_available() else "cpu")
    print("device:", device)

    train_loader, val_loader = get_loaders()
    model = CaptchaCNN().to(device)
    criterion = nn.CrossEntropyLoss()
    optimizer = Adam(model.parameters(), lr=LR)
    scheduler = CosineAnnealingLR(optimizer, T_max=EPOCHS)

    best_seq_acc = 0.0
    for epoch in range(1, EPOCHS + 1):
        model.train()
        total_loss = 0
        for x, y in train_loader:
            x, y = x.to(device), y.to(device)
            logits = model(x)                          # (B, 4, 36)

            # Calculate cross-entropy for each character position separately, then sum
            loss = sum(criterion(logits[:, i], y[:, i]) for i in range(MAX_LEN))

            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            total_loss += loss.item()

        scheduler.step()

        # Validation
        model.eval()
        all_char, all_seq, n = 0, 0, 0
        with torch.no_grad():
            for x, y in val_loader:
                x, y = x.to(device), y.to(device)
                logits = model(x)
                ca, sa = accuracy(logits, y)
                bs = x.size(0)
                all_char += ca * bs
                all_seq  += sa * bs
                n += bs

        char_acc = all_char / n
        seq_acc  = all_seq  / n
        print(f"epoch {epoch:3d}  loss={total_loss/len(train_loader):.4f}"
              f"  char_acc={char_acc:.4f}  seq_acc={seq_acc:.4f}")

        if seq_acc > best_seq_acc:
            best_seq_acc = seq_acc
            torch.save(model.state_dict(), MODEL_PATH)
            print(f"  -> saved (best seq_acc={best_seq_acc:.4f})")

    print("done. best seq_acc:", best_seq_acc)


if __name__ == "__main__":
    train()
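
The difference between the two metrics returned by accuracy() is easiest to see on a tiny example. The sketch below recomputes them in plain Python on made-up predictions; the strings are illustrative data, not results from our model.

```python
def char_and_seq_accuracy(preds: list[str], targets: list[str]) -> tuple[float, float]:
    """Character-level and sequence-level accuracy for equal-length strings."""
    char_hits = char_total = seq_hits = 0
    for p, t in zip(preds, targets):
        matches = [a == b for a, b in zip(p, t)]
        char_hits += sum(matches)      # count every correct character
        char_total += len(t)
        seq_hits += int(all(matches))  # a sequence counts only if all 4 characters match
    return char_hits / char_total, seq_hits / len(targets)


preds   = ["a3b9", "x7k2", "0o0o", "q1w2"]
targets = ["a3b9", "x7k1", "0000", "q1w2"]
char_acc, seq_acc = char_and_seq_accuracy(preds, targets)
print(char_acc, seq_acc)  # 0.8125 0.5
```

Sequence accuracy is the stricter metric and the one that matters for a captcha: if character errors were independent, a character accuracy of 0.99 would give a sequence accuracy of only about 0.99 ** 4 ≈ 0.96, which is why the training script saves the checkpoint with the best seq_acc.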

The real process of tuning parameters with AI:

In the first few epochs, sequence accuracy hovered around 30%~40%. We thought there was a problem with the model structure, but after asking the AI, we learned that LR=1e-3 (Adam's default) was on the high side for our data. Dropping it to 5e-4 led to significant improvement. AI knows the principles, but it doesn't know your data distribution; adjustments must be made based on the actual situation.
The AI suggested CosineAnnealingLR, and after trying it, we found it indeed much better than a fixed learning rate. This is the advantage of AI's "knowledge base"—it has seen too many best practices, so you don't need to dig through papers yourself.
Because we had already accumulated data annotation experience during the earlier DDDD phase, data preparation for the CNN solution didn't take much time; the main effort went into tuning and validation. Ultimately, for the hard-difficulty captcha, with about 10,000 samples and 60 epochs of training, sequence accuracy on the held-out validation split reached 99%. From the AI generating the first version of the code to this result, it took a total of just over half an hour.
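
To see what CosineAnnealingLR does to the learning rate over the 60 epochs, the closed-form schedule from the PyTorch documentation can be evaluated directly. This is a sketch assuming the scheduler's default eta_min=0 and the tuned base rate of 5e-4; cosine_lr is our own helper name, not a PyTorch function.

```python
import math


def cosine_lr(base_lr: float, t_cur: int, t_max: int, eta_min: float = 0.0) -> float:
    """Closed-form cosine annealing: starts at base_lr and decays to eta_min at t_max."""
    return eta_min + 0.5 * (base_lr - eta_min) * (1 + math.cos(math.pi * t_cur / t_max))


# t_cur is the number of scheduler.step() calls made so far (one per epoch).
for t in (0, 15, 30, 45, 60):
    print(t, f"{cosine_lr(5e-4, t, 60):.2e}")
```

The rate stays close to its initial value in the early epochs, passes exactly half of it at the midpoint, and approaches zero at the end, so the last epochs make only small refinements to the weights.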

5. Frontend + AI: What Did We Actually Gain?

After completing this project, our biggest takeaway wasn't "we can train models now," but rather a change in working methodology:

AI took on the work of "writing boilerplate code": Network structure definitions, training loops, evaluation metrics—AI generates this boilerplate code quickly and accurately. We only need to focus on business logic (data cleaning, parameter tuning).
Image processing experience was reused: The "pixel intuition" accumulated from writing Canvas image compression and WebGL filters in the past directly came into play when understanding convolution kernels and pooling. But understanding principles and writing runnable code are two different things; the latter can now be handed off to AI.
Engineering mindset is universal: The "centralized constant management" (config.py), "data cleaning" (filtering dirty samples), and "performance monitoring" (loss/acc curves) that frontend developers deal with daily are completely isomorphic to model training. These are things we, as frontend developers, are already good at—AI helped us transfer this experience to a new domain.
From "consumer" to "producer": Previously, we called third-party OCR APIs; now we produce our own models, export to ONNX, and even run inference in the browser using ONNX Runtime. Frontend is no longer the end consumer of AI capabilities, but can participate in the model production chain—and the production tool itself is also AI.

6. Conclusion: AI Doesn't Replace You, It Amplifies You

This article is not an algorithm tutorial, but a frontend team's "AI collaboration workflow" report. Captcha recognition is just an entry point; the same path can extend to: image classification, sensitive content filtering, handwriting recognition, and even simple object detection.

The boundaries of a role are not determined by a job description, but by the moment you dare to throw requirements at AI and then sit down to tune the parameters.

If you are also a frontend developer, AI has long been your daily coding partner—but you might not have thought that this "electric screwdriver" can also help you tighten the screws of model training. Our experience is: don't see it as "switching careers to algorithms," but as another expansion of your toolbox. You don't need to write PyTorch from scratch, just like you don't need to write Webpack from scratch. Knowing how to describe business requirements, knowing where the generated code needs modification, and knowing how to ask AI when problems arise—that's already enough to build things you never dared to imagine before.

Comments

Top 1 from juejin.cn, machine-translated. The original thread is authoritative.

Pikachu803

This issue's quality is mediocre.