Frontend · AI Programming · Artificial Intelligence

The 12-Point Engineering Checklist That Keeps AI Agents From Going Off the Rails

By 恋猫de小郭 · Jun 29, 2026 · 342 views · 9 likes

Read original on juejin.cn ↗ Google Translate ↗ Alt translation

The gap between a demo agent and one that survives production is entirely in these subsystems. Skipping them means unreliable retrieval, runaway tool calls, unrecoverable failures, and no way to debug what went wrong — costs that compound fast once an agent is handling real user tasks or touching production APIs.

Summary

Most agent demos collapse under real workloads because they skip the engineering: one big prompt, a while-loop, and a vector database. A reliable agent needs at least a dozen subsystems — a finite state machine to prevent chaotic state jumps, a workflow engine that breaks tasks into deterministic steps, and a Planner/Executor/Verifier split so one model isn't guessing its way through everything.

RAG alone requires six pieces to be usable: hybrid search (BM25 + vector), semantic chunking with parent-child references, metadata filtering, query rewriting, a reranker, and source attribution on every returned chunk. Without them, retrieval either misses exact keywords or drowns in semantically-similar noise.

Beyond retrieval, the system needs memory layering (session, preference, task-state, long-term), context assembly rules that separate instructions from untrusted data, prompt-injection guardrails, structured output contracts, idempotency keys on side-effect tools, a sandbox for code execution, cost budgets, caching, and trace logging that records every tool call, chunk, and state transition. The article provides a master prompt that forces an LLM to design all of this before writing a single line of code.

Takeaways

— A production agent needs an explicit finite state machine; letting an LLM decide its own flow produces chaotic state jumps and low success rates.

— RAG must combine BM25 keyword search with vector search, then merge, deduplicate, normalize scores, and rerank — vector-only retrieval misses exact codes, config keys, and error strings.

— Chunking strategy matters more than the vector database: semantic chunking with parent-child references prevents broken context across snippet boundaries.

— Every chunk must carry metadata (source, section, tags, language, date, project) so retrieval can filter by scope instead of searching the entire library blindly.

— User questions should be rewritten into 3–5 query variants (natural language, keywords, English terms, code/config names, synonyms) before retrieval, or the search will fail on vague inputs.

— Tools need full schemas with parameter validation, error codes, permission levels, and side-effect flags; high-risk tools must enter a human-approval gate.

— Tool routing rules must specify when to query RAG, when to search the web, when to answer directly, and when to block tool calls — leaving selection entirely to the LLM produces wrong-tool errors.

— Memory must be layered into session, user-preference, task-state, and long-term stores; dumping all history into the prompt hits context limits and pollutes new tasks with old noise.

— Context assembly should follow a fixed order: task goal first, then state, relevant memory, retrieved evidence, and output format last — with source and credibility labels on every piece.

— Prompt injection defense requires treating all retrieved documents as untrusted data, never as instructions, and downgrading any document content that mimics system commands.

— Every workflow step needs a failure-recovery strategy: max retries, parameter changes on retry, downgrade paths, and human-escalation triggers — side-effect tools must check idempotency keys before re-executing.

— Trace logging must capture trace_id, session_id, state transitions, retrieval queries, hit chunks, tool calls and results, LLM input/output, token usage, latency, errors, and retry counts.

— An eval dataset should include normal, ambiguous, multi-language, proper-noun, unanswerable, prompt-injection, tool-failure, long-context, and cost-pressure test cases, each with expected sources and pass/fail rules.

— Cost control requires step limits, tool-call caps, token budgets, small-model routing for simple tasks, and caching of embeddings, retrieval results, and tool outputs.

— Code execution and file operations belong in a sandbox; real-world changes must produce a diff and wait for user confirmation before applying.

Conclusions

Most RAG quality complaints trace back to chunking, not the model or the vector store — bad splits break context even when the right document is retrieved.

BM25 is not a legacy fallback; it is the only retrieval path that reliably hits code symbols, config keys, and error codes, which semantic search routinely misses.

Query rewriting is the difference between an agent that handles vague follow-ups and one that silently returns nothing — users rarely type search-engine-optimized queries.

A Planner/Executor/Verifier split does not require three separate models; the same model with different prompts, permissions, and output contracts at each step achieves the same boundary enforcement.

Prompt injection is not just about malicious users; any document containing phrases like 'ignore previous instructions' can hijack an agent if retrieval results are treated as trusted input.

Idempotency is the only thing standing between a retry loop and duplicate emails, payments, or deletions — every side-effect tool needs an action_id check before execution.

The master prompt at the end of the article is itself a specification generator: it forces an LLM to produce architecture decisions before code, which prevents the 'one big prompt plus while-loop' anti-pattern.

Business-specific agents beat general-purpose ones not because they are smarter, but because their flows, rules, and tools are shaped to the domain's exact failure modes and data shapes.

Concepts & terms

FSM (Finite State Machine)

A model that defines a fixed set of states, allowed transitions between them, and conditions that trigger each transition. In agents, it prevents the LLM from jumping between steps unpredictably.

Hybrid Search

Combining vector similarity search (good for semantic matches) with BM25 keyword search (good for exact terms, codes, and symbols), then merging and reranking the combined results.

BM25

A traditional keyword-based retrieval algorithm used in search engines like Elasticsearch and Lucene. It excels at exact word matching and short-phrase queries, compensating for vector search's weakness with precise technical terms.

Reranker

A model that takes the top candidates from initial retrieval and re-scores them for relevance to the user's query, producing a more accurate final ordering before results are sent to the LLM.

Semantic Chunking

Splitting documents along semantic boundaries (headings, paragraphs, functions) rather than fixed token counts, so each chunk contains a coherent unit of meaning.

Parent-Child Chunking

A retrieval pattern where small chunks are indexed for precise search hits, but the larger parent chunk (or adjacent chunks) is returned to the LLM to preserve full context.

Query Rewrite

Transforming a user's raw, often vague question into multiple targeted search queries (natural language, keywords, technical terms, code names) before retrieval, to improve recall.

Planner / Executor / Verifier

An agent architecture that separates task decomposition (Planner), step execution (Executor), and result validation (Verifier) into distinct roles, preventing a single model from both planning and executing without checks.

Tool Schema

A structured definition for each tool that specifies its purpose, input parameters, output format, error codes, permission level, side effects, and whether it requires human approval.

Tool Routing

Rules that determine which tool to call based on the task type, rather than letting the LLM freely choose — preventing inappropriate tool usage like searching when a direct answer suffices.

Prompt Injection

An attack where untrusted input (such as a retrieved document) contains instructions that override the system prompt, tricking the agent into executing unauthorized actions.

Idempotency

The property that repeating an operation produces the same result as doing it once. In agents, it prevents duplicate side effects (emails, payments, deletions) during retries by checking an action_id.

Context Engineering

The deliberate assembly of what goes into the LLM's context window, in what order, with what labels — prioritizing goals and state over raw history to avoid noise and instruction confusion.

Guardrails

A constraint layer that validates inputs and outputs, enforces tool permissions, blocks sensitive operations without human approval, and defends against prompt injection.

Human-in-the-loop

A design pattern where high-risk actions (deletion, publishing, financial operations) pause for explicit human confirmation before execution, rather than running fully autonomously.

Output Contract

A required JSON Schema for each agent step's output, ensuring downstream logic receives structured, parseable results instead of relying on free-text regex extraction.

Source: juejin.cn ↗ Google Translate ↗ Backup ↗