跪拜 Guibai

A Frontend Team Trained a 99% Accurate CNN Captcha Solver in 30 Minutes Using AI-Generated Code

By 转转技术团队
Read original on juejin.cn

Frontend teams can now produce their own vision models without hiring ML specialists or spending weeks learning PyTorch. The bottleneck isn't algorithm knowledge—it's the willingness to describe requirements clearly, recognize when generated code breaks on real hardware, and tune parameters against actual data distributions.

Summary

Two approaches tackled the same 4-character captcha problem. A low-code DDDD trainer handled easy samples with 97% accuracy in 10 minutes of GPU time, but failed on conjoined, rotationally distorted variants. The team then fed the hardest samples to an AI assistant, which generated a full PyTorch project—config, dataset loader with dirty-sample filtering, a 4-layer CNN with multi-head classification, and a training loop with CosineAnnealingLR scheduling.
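The dirty-sample filtering mentioned above can be as simple as a label check that runs before any image is loaded. What follows is a minimal sketch, not the team's actual loader. It assumes labels are encoded in file names as `<label>_<id>.<ext>`, that the 36 classes are the digits plus lowercase letters, and that labels are case-insensitive; all function names are hypothetical.

```python
import string
from pathlib import Path

# Assumed alphabet: 10 digits + 26 lowercase letters = 36 classes.
CHARSET = string.digits + string.ascii_lowercase
CAPTCHA_LEN = 4
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg"}


def label_from_filename(name: str) -> str:
    """Hypothetical naming scheme: '<label>_<id>.<ext>', e.g. 'a3k9_0001.png'."""
    return Path(name).stem.split("_")[0].lower()


def is_clean(name: str) -> bool:
    """Keep a sample only if it is an image with a well-formed 4-character label."""
    if Path(name).suffix.lower() not in IMAGE_SUFFIXES:
        return False
    label = label_from_filename(name)
    return len(label) == CAPTCHA_LEN and all(ch in CHARSET for ch in label)


def filter_samples(names):
    """Split file names into (clean, dirty) lists, preserving input order."""
    clean, dirty = [], []
    for name in names:
        (clean if is_clean(name) else dirty).append(name)
    return clean, dirty


names = ["a3k9_0001.png", "ab1_0002.png", "A3K9_0003.jpg",
         "a3k!_0004.png", "notes.txt", "abcd.png"]
clean, dirty = filter_samples(names)
```

Keeping the rejected names in a separate list makes it easy to log how many samples were dropped and why, which is the kind of check that catches a mislabeled batch early.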

With no prior CNN knowledge, the developers spent 30 minutes on data adaptation and parameter tuning, dropping the learning rate from 1e-3 to 5e-4 and capping rotation augmentation at 5 degrees after characters spun out of bounds at 15. On 10,000 samples and 60 epochs, sequence-level accuracy hit 99%.
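Sequence-level accuracy is stricter than per-character accuracy: a captcha counts as correct only when all four characters match. The sketch below shows the difference on made-up predictions; the function names and sample strings are illustrative and do not come from the article.

```python
def sequence_accuracy(predictions, targets):
    """Fraction of captchas whose predicted string matches the target exactly."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not targets:
        return 0.0
    correct = sum(pred == target for pred, target in zip(predictions, targets))
    return correct / len(targets)


def character_accuracy(predictions, targets):
    """Fraction of individual character positions predicted correctly."""
    hits = 0
    total = 0
    for pred, target in zip(predictions, targets):
        total += len(target)
        hits += sum(p == t for p, t in zip(pred, target))
    return hits / total if total else 0.0


# Made-up example: one wrong character in one of four captchas.
preds = ["a3k9", "b7x2", "zz00", "q1w2"]
truth = ["a3k9", "b7x3", "zz00", "q1w2"]
seq_acc = sequence_accuracy(preds, truth)    # 3 of 4 strings fully correct
char_acc = character_accuracy(preds, truth)  # 15 of 16 characters correct
```

If character errors were independent, a 99% sequence-level score on 4-character captchas would correspond to roughly 99.75% per-character accuracy, since 0.9975 to the fourth power is about 0.99.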

The real shift is workflow: AI wrote the boilerplate while the team contributed pixel intuition from Canvas and WebGL work, engineering habits like centralized configs and dirty-data filtering, and the judgement to override AI suggestions that didn't match their hardware or data distribution. The trained model exports to ONNX and runs inference in the browser via ONNX Runtime, turning frontend from an AI consumer into a model producer.

Takeaways
DDDD low-code training reached 97% accuracy on clean 4-character captchas in about 10 minutes on an RTX 4070S, but failed entirely on conjoined, rotationally distorted samples.
An AI assistant generated a complete PyTorch CNN pipeline—config, dataset, model, and training loop—from a description of the hard captcha and a few sample images.
The generated 4-layer CNN uses multi-head classification: four independent output heads each predict one character position, rather than one head outputting all four at once.
Sequence-level accuracy on the hard captcha reached 99% after 60 epochs on roughly 10,000 samples.
The learning rate was the critical tuning lever: 1e-3, Adam's default, proved too high for this dataset, and dropping to 5e-4 unlocked convergence.
AI initially suggested 15-degree rotation augmentation, but characters rotated out of frame; the team capped it at 5 degrees based on their own data.
num_workers=4 crashed on Mac MPS; reducing to 2 stabilized training—a hardware detail the AI could not know.
The trained model exports to ONNX and runs inference in the browser via ONNX Runtime, eliminating the need for a server-side OCR API.
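The rotation takeaway above can be checked with rough geometry: when an image is rotated about its center, a point far from the center moves more than a point near it. The sketch below estimates that shift. The article does not give image dimensions, so the 160×60 canvas, the 60 px horizontal offset of the outermost character, and the 20 px half-height are hypothetical values chosen only for illustration.

```python
import math


def vertical_shift(x_offset_px, angle_deg):
    """Vertical displacement of a point that sits x_offset_px to the side of
    the image center when the image is rotated by angle_deg about its center."""
    return abs(x_offset_px) * math.sin(math.radians(abs(angle_deg)))


def stays_in_frame(x_offset_px, char_half_height_px, image_height_px, angle_deg):
    """Rough check: the shifted character center plus half the character
    height must still fit inside half the image height. This ignores the
    character's own rotated extent, so it is an optimistic estimate."""
    shift = vertical_shift(x_offset_px, angle_deg)
    return shift + char_half_height_px <= image_height_px / 2


# Hypothetical layout: 160x60 image, outermost character centered 60 px
# from the image center, character half-height of 20 px.
shift_15 = vertical_shift(60, 15)        # about 15.5 px
shift_5 = vertical_shift(60, 5)          # about 5.2 px
fits_15 = stays_in_frame(60, 20, 60, 15)
fits_5 = stays_in_frame(60, 20, 60, 5)
```

Under these assumed numbers the outermost character is clipped at 15 degrees but fits at 5, which matches the direction of the team's observation. The actual threshold depends on the real image size and character layout.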
Conclusions

AI-generated boilerplate is now reliable enough that domain expertise shifts from writing code to verifying it—knowing which generated defaults will break on your specific hardware or data.

The same engineering instincts that serve frontend work—centralized configuration, dirty-data filtering, performance monitoring—transfer directly to model training, making the jump smaller than it appears.
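A centralized configuration for such a project might look like the sketch below. The values for character count, class count, epochs, learning rate, rotation cap, and worker count are the ones reported in this summary; `batch_size` and `data_dir` are hypothetical placeholders, and the class itself is an illustration rather than the generated project's config.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class TrainConfig:
    # Values reported in the summary.
    num_chars: int = 4             # fixed captcha length
    num_classes: int = 36          # size of the character alphabet
    epochs: int = 60
    learning_rate: float = 5e-4    # lowered from 1e-3
    max_rotation_deg: float = 5.0  # lowered from 15
    num_workers: int = 2           # 4 crashed on Mac MPS
    # Hypothetical fields, not taken from the article.
    batch_size: int = 64
    data_dir: str = "data/captcha"

    def __post_init__(self):
        # Fail fast on values that can never be valid.
        if self.learning_rate <= 0:
            raise ValueError("learning_rate must be positive")
        if self.num_workers < 0:
            raise ValueError("num_workers must be non-negative")
        if not 0 <= self.max_rotation_deg <= 180:
            raise ValueError("max_rotation_deg must be within [0, 180]")


cfg = TrainConfig()
# Per-machine overrides produce a new object and leave the defaults untouched.
gpu_cfg = replace(cfg, num_workers=4)
```

Freezing the dataclass keeps every training script reading the same values, and `dataclasses.replace` gives hardware-specific variants without mutating shared state.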

Low-code trainers hit a hard ceiling on distorted data, but the fallback to a custom CNN is now cheap enough that teams can try both in a single afternoon.

Multi-head classification for fixed-length captchas is a simpler, more stable alternative to CTC loss when the character count is known and constant.
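To make the comparison concrete, here is a framework-free sketch of how four heads are scored and decoded. Each head gets its own cross-entropy term against the character at its position, and decoding is an independent argmax per head, with no alignment step of the kind CTC needs. Averaging the four losses (rather than summing them) and the digits-plus-lowercase alphabet are assumptions; in PyTorch the same idea is typically written with four `nn.Linear` heads and `nn.CrossEntropyLoss`.

```python
import math
import string

CHARSET = string.digits + string.ascii_lowercase  # assumed 36-class alphabet
NUM_HEADS = 4


def log_softmax(logits):
    """Numerically stable log-softmax over one head's logits."""
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_sum for x in logits]


def multi_head_loss(head_logits, label):
    """Mean cross-entropy over the heads; head i is scored against label[i]."""
    losses = [-log_softmax(logits)[CHARSET.index(ch)]
              for logits, ch in zip(head_logits, label)]
    return sum(losses) / len(losses)


def decode(head_logits):
    """Pick the most likely character independently for each position."""
    return "".join(
        CHARSET[max(range(len(logits)), key=logits.__getitem__)]
        for logits in head_logits
    )


def confident_logits(label, margin=10.0):
    """Toy logits that favor the given label, used only for this demo."""
    rows = []
    for ch in label:
        row = [0.0] * len(CHARSET)
        row[CHARSET.index(ch)] = margin
        rows.append(row)
    return rows


logits = confident_logits("a3k9")
decoded = decode(logits)
loss_right = multi_head_loss(logits, "a3k9")
loss_wrong = multi_head_loss(logits, "b3k9")
loss_uniform = multi_head_loss([[0.0] * 36 for _ in range(NUM_HEADS)], "a3k9")
```

A uniform, untrained output gives a loss of ln(36), about 3.58, which is a useful sanity value to compare against the first logged training loss.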

Cosine annealing schedulers often outperform fixed learning rates, and AI assistants surface this practice without requiring the developer to read optimization literature.
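The shape of a cosine annealing schedule is easy to inspect without a training framework. The sketch below evaluates the closed-form expression that the PyTorch documentation gives for `CosineAnnealingLR`, using the 5e-4 starting rate and 60 epochs from this summary. Setting `t_max` equal to the epoch count and `eta_min` to 0 are assumptions, since the article's exact scheduler arguments are not given.

```python
import math


def cosine_annealing_lr(epoch, base_lr=5e-4, t_max=60, eta_min=0.0):
    """Closed-form cosine annealing value for an epoch in [0, t_max]."""
    if not 0 <= epoch <= t_max:
        raise ValueError("epoch must lie in [0, t_max]")
    cosine = math.cos(math.pi * epoch / t_max)
    return eta_min + (base_lr - eta_min) * (1 + cosine) / 2


# Learning rate at the start of each epoch, plus the final value at t_max.
schedule = [cosine_annealing_lr(epoch) for epoch in range(61)]
```

The rate starts at 5e-4, passes 2.5e-4 at the halfway point, and reaches `eta_min` at epoch 60. It changes slowly at both ends and fastest in the middle, unlike step decay, which drops abruptly.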

For a task like this one, the gap between consuming an OCR API and producing a model has shrunk to roughly 30 minutes of prompting, data adaptation, and parameter tuning.

Concepts & terms
Multi-head classification
A model architecture where each output position gets its own independent classifier head, rather than one head producing all outputs. For a 4-character captcha, four separate linear layers each predict one character over the 36-class alphabet; the article describes this as more stable than a single layer outputting all 4×36 logits at once.
DDDD / dddd_trainer
A low-code OCR training tool that reduces model training to editing a YAML config file and running two commands (cache and train). It handles data splitting, augmentation, and training automatically but struggles with severe image distortion.
CosineAnnealingLR
A learning rate scheduler that smoothly decreases the learning rate following a cosine curve over the course of training, often yielding better convergence than a fixed rate or step decay.
AdaptiveAvgPool2d
A PyTorch pooling layer that averages a feature map of any spatial size down to a fixed output size (e.g., 2×4). Because the output shape no longer depends on the input resolution, the same CNN can handle variable-sized inputs.
ONNX Runtime
A cross-platform inference engine that runs models in the ONNX format. A PyTorch model must first be exported to ONNX; the engine's browser build, ONNX Runtime Web, can then run the exported model client-side, enabling inference without a backend server.
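The AdaptiveAvgPool2d entry above is easier to picture with a small pure-Python version. This sketch splits each axis into the requested number of bins using the floor/ceil rule commonly described for adaptive pooling and averages each window; exact agreement with PyTorch's layer on every input size is an assumption, and the function name is hypothetical.

```python
def adaptive_avg_pool2d(feature_map, out_h, out_w):
    """Average a 2D list of numbers down to an out_h x out_w grid."""
    in_h, in_w = len(feature_map), len(feature_map[0])

    def bins(in_size, out_size):
        # Bin i covers indices [floor(i*in/out), ceil((i+1)*in/out)).
        return [((i * in_size) // out_size,
                 ((i + 1) * in_size + out_size - 1) // out_size)
                for i in range(out_size)]

    pooled = []
    for r0, r1 in bins(in_h, out_h):
        row = []
        for c0, c1 in bins(in_w, out_w):
            window = [feature_map[r][c]
                      for r in range(r0, r1) for c in range(c0, c1)]
            row.append(sum(window) / len(window))
        pooled.append(row)
    return pooled


# Two differently sized inputs both come out as 2x4.
small = [[r * 8 + c for c in range(8)] for r in range(4)]  # 4x8 input
odd = [[1.0] * 7 for _ in range(5)]                        # 5x7 input
pooled_small = adaptive_avg_pool2d(small, 2, 4)
pooled_odd = adaptive_avg_pool2d(odd, 2, 4)
```

When the input size is not a multiple of the output size, neighboring bins overlap slightly, which is why a 5×7 map still yields a complete 2×4 grid. The classifier heads that follow can therefore keep a fixed input width.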
From the discussion

A single comment dismisses the article's quality as mediocre, offering no further elaboration or counterpoints. No substantive technical debate or alternative perspectives surfaced.
