The 12-Point Engineering Checklist That Keeps AI Agents From Going Off the Rails
The gap between a demo agent and one that survives production is entirely in these subsystems. Skipping them means unreliable retrieval, runaway tool calls, unrecoverable failures, and no way to debug what went wrong — costs that compound fast once an agent is handling real user tasks or touching production APIs.
Most agent demos collapse under real workloads because they skip the engineering: one big prompt, a while-loop, and a vector database. A reliable agent needs at least a dozen subsystems — a finite state machine to prevent chaotic state jumps, a workflow engine that breaks tasks into deterministic steps, and a Planner/Executor/Verifier split so one model isn't guessing its way through everything.
RAG alone requires six pieces to be usable: hybrid search (BM25 + vector), semantic chunking with parent-child references, metadata filtering, query rewriting, a reranker, and source attribution on every returned chunk. Without them, retrieval either misses exact keywords or drowns in semantically-similar noise.
Beyond retrieval, the system needs memory layering (session, preference, task-state, long-term), context assembly rules that separate instructions from untrusted data, prompt-injection guardrails, structured output contracts, idempotency keys on side-effect tools, a sandbox for code execution, cost budgets, caching, and trace logging that records every tool call, chunk, and state transition. The article provides a master prompt that forces an LLM to design all of this before writing a single line of code.
Most RAG quality complaints trace back to chunking, not the model or the vector store — bad splits break context even when the right document is retrieved.
BM25 is not a legacy fallback; it is the only retrieval path that reliably hits code symbols, config keys, and error codes, which semantic search routinely misses.
Query rewriting is the difference between an agent that handles vague follow-ups and one that silently returns nothing — users rarely type search-engine-optimized queries.
A Planner/Executor/Verifier split does not require three separate models; the same model with different prompts, permissions, and output contracts at each step achieves the same boundary enforcement.
Prompt injection is not just about malicious users; any document containing phrases like 'ignore previous instructions' can hijack an agent if retrieval results are treated as trusted input.
Idempotency is the only thing standing between a retry loop and duplicate emails, payments, or deletions — every side-effect tool needs an action_id check before execution.
The master prompt at the end of the article is itself a specification generator: it forces an LLM to produce architecture decisions before code, which prevents the 'one big prompt plus while-loop' anti-pattern.
Business-specific agents beat general-purpose ones not because they are smarter, but because their flows, rules, and tools are shaped to the domain's exact failure modes and data shapes.