跪拜 Guibai

Becoming an AI Agent Engineer: A Field Guide from a Full-Stack Practitioner

Preface

From the explosion of ChatGPT in 2023 to now, Large Language Models (LLMs) are no longer just "tools that can chat." In just over three years, LLMs have evolved from conversational assistants into "intelligent agents" that can write code, manipulate file systems, execute shell commands, and even autonomously break down tasks and complete them step by step—what the industry calls AI Agents. Consequently, more and more companies are integrating AI Agents into their products and workflows, and the demand for AI Agent engineers is growing rapidly.

I started as a frontend engineer. In 2022 I gradually began using Node.js for backend development and moved toward full-stack work. Later, by chance, I joined a company building AI Agent products and helped develop several of them in different forms, including web and desktop applications, in a full-stack role covering frontend, backend, and AI Agent work. In my spare time, I also wrote an open-source AI Agent CLI project to deepen my own understanding of how Agents work.

I've said all this mainly to convey one point: the content shared below comes from real project development experience, not from just reading a few tutorials or browsing a few documents and then summarizing out of thin air. This article is a summary of that experience, intended for those who want to enter the field of AI Agent development. AI Agent is a rapidly changing field, so this content is more about providing a reference direction rather than standard answers.

AI Agent Engineer: Not an Independent Job Role

My understanding of this role is: AI Agent engineer is not an independent job role, but a composite role of "X + AI Agent." Here, X can be frontend, backend, full-stack, or even product manager, but a standalone "AI Agent engineer" hardly exists.

AI Agent development uses code to orchestrate the capabilities of LLMs: letting LLMs call tools, managing context, handling errors, and interfacing with external systems. These tasks are software engineering at heart; the only change is that the thing being operated on is now an LLM rather than a database or an API.

So if you are already a frontend, backend, or full-stack engineer with a few years of development experience, the barrier to moving in an AI Agent direction is not high. There isn't much new knowledge to learn (the "AI Agent Core Knowledge System" section elaborates). What takes real time is the engineering work, such as managing context, recovering from errors, and controlling costs, and that is precisely where experienced developers excel.

From a job-seeking perspective, frontend/backend engineers with AI Agent development experience have more opportunities than their peers without such experience. More and more positions require candidates to have AI Agent-related development experience, so mastering this capability broadens your options. If interested, search for "AI Agent" on major recruitment platforms to see actual job postings.

Mindset Adjustment

Not Much Knowledge, But Hard to Do Well

There are not many concepts to learn for AI Agents: Tool Use, Agent Loop, RAG, Memory, Prompt Engineering, and a few others. You can get a rough understanding of them in a few weeks. Compared with the APIs, frameworks, and browser compatibility issues that frontend/backend development demands, the volume of knowledge is relatively small.

But understanding the principles and concepts does not mean you can build a usable product. In my experience, the truly time-consuming parts are engineering issues such as managing context, recovering from errors, and controlling costs.

These problems cannot be fully solved by theoretical knowledge alone; they require continuous trial and error and iterative refinement in real projects.

Embrace Uncertainty

In traditional development work, there is often an implicit psychological expectation: if the code is written correctly, it should produce the correct result. But AI Agent development is not like that. The same Prompt may produce a perfect solution one time and a buggy solution the next. The same sequence of tool calls may take 3 steps one time and 10 steps the next. This uncertainty can be very uncomfortable initially and requires a mindset adjustment.

The adjustment is to shift from "writing correct code" to "designing constraints." You don't need the model to get it right every time; instead, design sufficiently good guardrails so that when the model makes mistakes, they can be detected and corrected. Permission systems, loop detection, output validation—these "fences" are where Agent engineers spend most of their daily time.

Learn by Doing

In learning science, there is a widely cited model called the "Learning Pyramid," commonly attributed to Edgar Dale's Cone of Experience, although Dale's original cone did not rank methods by retention. Its core idea is that just listening or reading results in low knowledge retention, while hands-on practice improves retention significantly. The principle behind this is related to the Ebbinghaus Forgetting Curve: newly learned information, if not actively recalled, is forgotten very quickly. Hands-on practice is a form of active recall that deepens understanding, and the deeper the understanding, the harder it is to forget.

AI Agent development is especially suited to this approach. The knowledge itself is not extensive, but many engineering issues—like under what circumstances the model repeatedly calls the same tool without stopping, or to what extent context must be compressed without losing key information—cannot be anticipated by reading documentation alone. You only encounter these problems by actually doing. When you encounter a problem and solve it, that knowledge truly becomes your own.

A Real Lesson

Early in the launch of an AI Agent product I helped develop, we encountered a memorable failure—an Agent infinite loop.

Here's what happened: A user asked the Agent to fetch the content of a webpage. The Agent called the WebFetch tool, but the target URL returned an error. The normal expectation was that the model would try a different approach upon failure. But the opposite happened—the model thought, "It didn't work just now; trying again should work," and called WebFetch again with exactly the same parameters. The second time naturally failed too, then the third, fourth... five consecutive calls with the exact same tool, same parameters, same error.

The cost of these five invalid calls wasn't huge, but each failure appended a request-response pair to the message history, significantly expanding the context after a few rounds. Later, we saw the same pattern with another user, except this time the Agent was stuck in a loop on a Bash command call, lasting longer—the context ballooned beyond the model's 262K token limit (the actual request reached 294K), the API rejected the request, and the user's entire session was interrupted.

Both cases stemmed from the same problem: the Agent fell into a retry loop when a tool call failed, and each retry expanded the context, eventually causing overflow. The solution isn't conceptually hard—"detect the loop and stop" or "compress context when nearing the limit"—but before launch, we never anticipated the model would repeatedly call the same tool with identical parameters, as this behavior never occurred in testing. When we actually fixed it, we realized it wasn't just adding an if statement: loop detection requires comparing the parameters of recent tool calls to determine repetition; context compression needs to trigger promptly near the limit, replacing old messages with summaries while retaining key information. This experience taught me that the hardest part of AI Agent development isn't understanding concepts but dealing with the model's unpredictable behavior in production.
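The "compare the parameters of recent tool calls" idea can be sketched roughly as follows. This is a minimal illustration, not the product's actual fix: the `ToolCall` shape, the `isStuckInLoop` name, and the default threshold of 3 are all assumptions made for the example.

```typescript
// Hypothetical shape of a recorded tool call; not taken from any specific SDK.
interface ToolCall {
  name: string;
  args: Record<string, unknown>;
}

// Serialize with sorted keys so {a:1,b:2} and {b:2,a:1} compare as equal.
function stableStringify(value: unknown): string {
  if (Array.isArray(value)) {
    return `[${value.map(stableStringify).join(",")}]`;
  }
  if (value !== null && typeof value === "object") {
    const obj = value as Record<string, unknown>;
    const body = Object.keys(obj)
      .sort()
      .map((k) => `${JSON.stringify(k)}:${stableStringify(obj[k])}`)
      .join(",");
    return `{${body}}`;
  }
  return JSON.stringify(value) ?? "null";
}

// True when the last `threshold` calls used the same tool with identical arguments.
function isStuckInLoop(history: ToolCall[], threshold = 3): boolean {
  if (threshold < 2 || history.length < threshold) return false;
  const signatures = history
    .slice(-threshold)
    .map((call) => `${call.name}:${stableStringify(call.args)}`);
  return signatures.every((sig) => sig === signatures[0]);
}
```

When the check fires, one option is to stop executing the tool and append a message telling the model that the same call has already failed several times, so it has a reason to change approach. What to do at that point is a design choice; the sketch only covers detection.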

A Pitfall to Watch Out For

There are many "no-code AI Agent building" tools on the market now—Dify, Coze, n8n, etc. These tools lower the entry barrier, allowing non-programmers to drag and drop an Agent workflow. For quick idea validation, they're fine. But if your goal is to become an AI Agent engineer, only using pre-built components without understanding their construction principles will severely limit your ability to solve new problems.

For example: Suppose you build an Agent workflow with Dify. It works well for single-turn Q&A or simple information queries. But after deployment, you might encounter situations where the Agent repeatedly calls the same tool without stopping (i.e., infinite loop) on certain tasks, or it starts "forgetting" earlier instructions after long conversations. Dify provides configuration items like maximum iteration count and memory window size for coarse adjustments. But if finer control is needed—like detecting infinite loops by comparing the content of recent tool calls, or when context is too long, having the LLM generate a summary of earlier conversations and replace the original messages (rather than directly discarding early messages and losing key information)—these are beyond what the platform offers. Those who understand the underlying mechanisms can write code to implement these strategies; those who don't can only tweak a few parameters provided by the platform.

So my advice is: You can use frameworks or low-code tools to accelerate development, but at least implement a minimal Agent from scratch once—even if it's just a 50-line while loop with one or two tools. Understand the basic mechanisms of message passing, tool calling, and loop control. With this understanding, learning any framework or tool won't feel like a black box. The "Recommended Learning Path" section later provides specific reference projects and resources.

AI Agent Core Knowledge System

AI Agent knowledge can be understood in three layers. The first layer is the foundation (LLM basics), the second is core mechanisms (key capabilities that distinguish Agents from ordinary LLM calls), and the third is engineering (how to make Agents run stably in production).

graph BT
    L1["Layer 1: LLM Basics\nCapability Boundaries · API Calls · Prompt Engineering"] --> L2["Layer 2: Agent Core Mechanisms\nAgent Loop · Tool Use · RAG · Memory · Multi-Agent"]
    L2 --> L3["Layer 3: Engineering Capabilities\nContext Management · Permissions & Security · Evaluation · Cost Control"]

    style L1 fill:#e8f5e9
    style L2 fill:#fff3e0
    style L3 fill:#fce4ec

Layer 1: LLM Basics

You don't need to understand the mathematical details of Transformers or training processes, but you need to understand what LLMs can and cannot do as an "interface."

LLM Capability Boundaries. An LLM is essentially a text continuator—given a piece of text (Prompt), it generates subsequent text (Completion). It excels at natural language understanding and generation, code writing, information extraction and summarization, and reasoning. But it is not good at real-time information retrieval, interacting with external systems, deterministic logical judgments, or answering questions beyond its training data range. Understanding this boundary is crucial because the core design philosophy of AI Agents is: use tools to compensate for LLM shortcomings. Need real-time information? Provide a search tool. Need to manipulate the file system? Provide file read/write tools. Need to execute code? Provide a Shell tool. Most of the engineering work for Agents is spent on "how to make the LLM use these tools correctly."

API Calls. The starting point for all Agents is calling the LLM's API. Taking the mainstream Messages API as an example, the core concepts are few: a message list (Messages, a sequence of conversation messages arranged by time, each with roles system/user/assistant/tool and content), streaming output (Streaming, where the model generates tokens one by one and the client receives them in real-time), and Tokens and billing (both input and output are billed by token; one Chinese character is roughly 1-2 tokens). These concepts are not complex. The recommended Datawhale "Hands-on Learning LLM" tutorial starts from API calls; following along will help you understand.
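To make the message list and token accounting concrete, here is a small sketch. The field names (`toolCalls`, `toolCallId`) are illustrative assumptions rather than any provider's exact schema, and the token estimate is only a heuristic (about 4 characters per token for English-like text, about 1.5 tokens per CJK character); real tokenizers will give different numbers.

```typescript
// Simplified message shapes, loosely modeled on common chat APIs.
// Field names are illustrative assumptions, not a specific provider's schema.
type Message =
  | { role: "system" | "user"; content: string }
  | {
      role: "assistant";
      content: string;
      toolCalls?: { id: string; name: string; args: unknown }[];
    }
  | { role: "tool"; toolCallId: string; content: string };

// Rough budgeting heuristic only; use the provider's tokenizer for exact counts.
function estimateTokens(text: string): number {
  let cjk = 0;
  for (let i = 0; i < text.length; i++) {
    const code = text.charCodeAt(i);
    if (code >= 0x4e00 && code <= 0x9fff) cjk++; // CJK Unified Ideographs
  }
  const other = text.length - cjk;
  return Math.ceil(cjk * 1.5 + other / 4);
}

// Every request resends the whole list, so input cost follows this total.
function historyTokens(messages: Message[]): number {
  return messages.reduce((sum, m) => sum + estimateTokens(m.content), 0);
}
```

The point of `historyTokens` is that billing is driven by the entire list on each call, not just by the newest message.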

Prompt Engineering. Writing good prompts is a fundamental skill in Agent development. Common techniques include: using system prompts to define the Agent's role, capability boundaries, and behavior rules; using few-shot learning to provide examples in the prompt to guide the model to respond in a specific format; using chain-of-thought to guide the model to reason before answering, improving accuracy. Prompt engineering covers a lot, but Agent engineers don't need to master all techniques. In practice, the three most common tasks are: how to write system prompts for more stable model behavior, how to write tool descriptions to reduce mis-calls, and which common writing patterns tend to cause the model to deviate from expected output.

Recommended Materials for This Layer: Datawhale "Hands-on Learning LLM" is a Chinese tutorial from the domestic open-source community, covering from API calls to RAG application development, suitable for zero-based beginners. For Prompt Engineering, Andrew Ng's collaboration with OpenAI, "ChatGPT Prompt Engineering for Developers", is in English with community-translated Chinese subtitles on Bilibili, viewable in an hour or two. If English reading is not an issue, Anthropic's Prompt Engineering Guide is currently the most systematic reference.

Layer 2: Agent Core Mechanisms

This layer is what distinguishes AI Agents from ordinary LLM conversations.

Agent Loop. The core operating mode of an Agent can be summarized in one sentence: Think → Act → Observe → Re-think. This loop, in code, is a while loop. Below is a simplified version from x-code-cli:

while (turn < maxTurns) {
  turn++  // Count this turn so the maxTurns guard can actually end the loop

  // 1. Call LLM, pass message history and available tool list
  const outcome = await runTurn(state, model, tools)

  // 2. Model returns tool_calls → execute tools, append results to message history
  if (outcome.finishReason === 'tool-calls') {
    await processToolCalls(outcome.toolCalls, state)
    continue  // Return to loop top, let model see tool results and continue thinking
  }

  // 3. Model returns stop → task complete, exit loop
  if (outcome.finishReason === 'stop') break
}

Although this code is only a dozen lines, it is the skeleton of all Agent products. Whether it's Claude Code, Cursor, or x-code-cli, the core is this loop—the differences lie only in the richness of the tool set, the completeness of error handling, and the sophistication of context management strategies. Represented in a flowchart:

flowchart LR
    A["User Input"] --> B["Call LLM\n(messages + tools)"]
    B --> C{"Model Returns"}
    C -->|"tool_call"| D["Execute Tool"]
    D --> E["Append Result to messages"]
    E --> B
    C -->|"stop"| F["Output Final Answer"]

Tool Use / Function Calling. Tool calling is the bridge that turns LLMs from "only talking" to "able to do things." A tool definition consists of two parts: the input parameter Schema (telling the model "what parameters this tool accepts") and a natural language description (telling the model "what this tool can do and when to use it"). Taking the Shell tool definition from x-code-cli as an example:

export const shell = tool({
  description: 'Execute a shell command and return stdout/stderr...',
  inputSchema: z.object({
    command: z.string().describe('The command to execute'),
    timeout: z.number().optional().describe('Timeout in milliseconds'),
  }),
})

When the model sees this tool definition, it will call it when it needs to execute a system command, generating the correct JSON parameters according to the schema. The actual execution logic of the tool is handled on the Agent engine side; the model itself does not execute any code—it only sees the tool's description and parameter definitions. So the quality of the tool description directly determines whether the model calls the right tool at the right time. If the description is too vague, the model may misuse it when it shouldn't or miss it when it should. In practice, repeatedly adjusting the wording of tool descriptions is very common, often taking more time than writing the tool's code itself.
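The split between "what the model sees" and "what the engine executes" can be sketched as a small dispatcher. This is a hypothetical illustration, not x-code-cli's implementation: the `ToolDef` shape, the `registry`, the `add` tool, and the hand-rolled required-parameter check (standing in for a real schema validator) are all assumptions.

```typescript
// Hypothetical engine-side registry: the model only ever sees `description`
// and the parameter names; `execute` runs inside the agent process.
interface ToolDef {
  description: string;
  required: string[];
  execute: (args: Record<string, unknown>) => string;
}

const registry: Record<string, ToolDef> = {
  add: {
    description: "Add two numbers `a` and `b` and return the sum.",
    required: ["a", "b"],
    execute: (args) => String(Number(args["a"]) + Number(args["b"])),
  },
};

// Always returns a string, so failures go back to the model as an
// observation instead of crashing the agent loop.
function dispatch(name: string, rawArgs: string): string {
  const def = Object.prototype.hasOwnProperty.call(registry, name)
    ? registry[name]
    : undefined;
  if (!def) return `Error: unknown tool "${name}"`;

  let parsed: unknown;
  try {
    parsed = JSON.parse(rawArgs);
  } catch {
    return "Error: arguments are not valid JSON";
  }
  if (parsed === null || typeof parsed !== "object" || Array.isArray(parsed)) {
    return "Error: arguments must be a JSON object";
  }

  const args = parsed as Record<string, unknown>;
  const missing = def.required.filter((key) => !(key in args));
  if (missing.length > 0) {
    return `Error: missing parameter(s): ${missing.join(", ")}`;
  }
  return def.execute(args);
}
```

Returning errors as plain text is a deliberate choice in this sketch: the model can read the message on the next turn and correct its own call.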

Retrieval-Augmented Generation (RAG). LLMs have knowledge cut-off dates and cannot access internal enterprise data. RAG is designed to address these two shortcomings: pre-store internal documents, product manuals, latest information, etc., into a knowledge base. When a user asks a question, relevant fragments are retrieved from the knowledge base and sent to the model along with the question, allowing the model to answer based on these supplementary materials. A typical RAG process has two phases. Indexing: document chunking → vectorization (Embedding) → storage in a vector database. Querying: embed the question → similarity search → take the most relevant document chunks → incorporate them into the Prompt. RAG is not conceptually complex, but there are many engineering details: how to choose the chunking strategy, what chunk size is appropriate, which embedding model to use, how to evaluate retrieval recall. These questions have no standard answers and need to be tuned for the specific scenario.
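The retrieval flow can be sketched end to end in memory. Everything here is a toy under stated assumptions: `embed` is a word-count stand-in for a real embedding model, an array plays the role of the vector database, and the function names are hypothetical.

```typescript
type Vector = Record<string, number>;

// Toy "embedding": word counts. A real system would call an embedding model;
// this stand-in only keeps the sketch runnable offline.
function embed(text: string): Vector {
  const vec: Vector = Object.create(null); // no prototype, so any word is a safe key
  const words: string[] = text.toLowerCase().match(/[a-z0-9]+/g) || [];
  words.forEach((w) => {
    vec[w] = (vec[w] || 0) + 1;
  });
  return vec;
}

function cosine(a: Vector, b: Vector): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  Object.keys(a).forEach((k) => {
    const av = a[k] || 0;
    normA += av * av;
    dot += av * (b[k] || 0);
  });
  Object.keys(b).forEach((k) => {
    const bv = b[k] || 0;
    normB += bv * bv;
  });
  return normA === 0 || normB === 0 ? 0 : dot / Math.sqrt(normA * normB);
}

interface Chunk {
  id: string;
  text: string;
  vector: Vector;
}

// Indexing: here each input string is already one chunk.
function indexChunks(texts: string[]): Chunk[] {
  return texts.map((text, i) => ({ id: `chunk-${i}`, text, vector: embed(text) }));
}

// Retrieval: rank chunks by similarity to the query and keep the top K.
function retrieve(index: Chunk[], query: string, topK: number): Chunk[] {
  const q = embed(query);
  return index
    .map((chunk) => ({ chunk, score: cosine(q, chunk.vector) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, topK)
    .map((entry) => entry.chunk);
}

// Augmentation: place the retrieved chunks in front of the question.
function buildPrompt(question: string, chunks: Chunk[]): string {
  const context = chunks.map((c) => `[${c.id}] ${c.text}`).join("\n");
  return `Answer using only the context below.\n\n${context}\n\nQuestion: ${question}`;
}
```

Swapping `embed` for a real model and the array for a vector store changes the quality of results, but not the shape of the pipeline.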

Memory. Agent memory comes in two types. Short-term memory is the current conversation's message history. As the conversation progresses, the message history grows and eventually hits the context window limit, requiring compression or truncation (specific strategies are discussed in Layer 3's "Context Management" section). Long-term memory is cross-session persistent information—user preferences, project context, agreements made in previous conversations. Without long-term memory, each conversation starts from scratch: yesterday you told it the project uses ESM module syntax, today it might write code in CommonJS module syntax again. The difficulty of long-term memory lies in filtering: a single conversation may have dozens of messages, but perhaps only one or two facts are worth remembering long-term. A mechanism is needed to automatically perform this extraction. x-code-cli's approach is to use an LLM in the background after each conversation to distill facts worth remembering, writing them to disk in a structured format:

const MemoryItemSchema = z.object({
  category: z.enum(['user', 'feedback', 'project', 'reference']),
  scope: z.enum(['project', 'user']),
  key: z.string().describe('Short slug. Same key overwrites the previous fact.'),
  fact: z.string().describe('The fact itself.'),
})

At the start of the next session, these memories are loaded into the system prompt, allowing the Agent to "remember" previous context. The main Agent itself does not have a tool to write memories—all memory writes are done through this silent background extractor to prevent the model from being distracted by "managing its own memory" during conversation.
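The "same key overwrites" rule and the load-into-system-prompt step might look like the sketch below. The `MemoryItem` interface mirrors the schema fields shown above; `upsertMemory`, `renderMemory`, and the heading text are hypothetical names chosen for illustration, and persistence to disk is left out.

```typescript
interface MemoryItem {
  category: "user" | "feedback" | "project" | "reference";
  scope: "project" | "user";
  key: string;
  fact: string;
}

// A new item with an existing key replaces the earlier fact;
// otherwise it is appended. The input array is not mutated.
function upsertMemory(store: MemoryItem[], item: MemoryItem): MemoryItem[] {
  return store.filter((m) => m.key !== item.key).concat([item]);
}

// Render stored facts as a block that can be appended to the system prompt.
function renderMemory(store: MemoryItem[]): string {
  if (store.length === 0) return "";
  const lines = store.map((m) => `- (${m.category}) ${m.fact}`);
  return ["## Known facts from earlier sessions"].concat(lines).join("\n");
}
```

Keying facts by a short slug keeps the store from accumulating contradictory entries: when the project switches module systems, the newer fact replaces the older one instead of sitting next to it.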

Multi-Agent Collaboration. When tasks are sufficiently complex, a single Agent can hit bottlenecks: context gets filled with intermediate steps, the model tries to track multiple different concerns simultaneously, and output quality degrades. The Multi-Agent approach breaks down tasks among specialized Agents—for example, one responsible for exploring the codebase, one for coding, one for review. x-code-cli includes 4 sub-Agents (explore, general-purpose, plan, code-reviewer), each running in its own context, returning only conclusions to the main Agent, keeping the main conversation concise. The challenge of Multi-Agent lies in coordination: how to reasonably assign tasks, how to pass context, how to handle conflicts between sub-Agents. This is an area the industry is still actively exploring, with no ready-made standard answers.

Recommended Materials for This Layer: Andrew Ng's "Agentic AI" covers four major Agent design patterns (Reflection, Tool Use, Planning, Multi-Agent Collaboration), free, with community-translated Chinese subtitles on Bilibili, currently the best introductory course on Agents. For documentation, Anthropic's "Building Effective Agents" is an industry-recognized classic document on Agent design methodology, and the companion piece "Writing Effective Tools for AI Agents" focuses on tool design. Both are short; if English reading is not an issue, they are recommended reading.

Layer 3: Engineering Capabilities

With the knowledge from the first two layers, building a prototype Agent that can run a basic flow is not a problem. But to deploy it online for real business, a series of engineering issues need to be addressed.

Context Management. Although current models have large context windows (200K to 1M+ tokens), "large window" does not mean "no need to manage." Longer contexts mean higher API costs and slower response times, and the model's ability to retrieve information from the middle of very long contexts degrades (the "Lost in the Middle" phenomenon). For a concrete example: a code modification task involves reading multiple files and multiple tool calls; after a few rounds of conversation, the context may have accumulated tens of thousands of tokens. Without management, the model might "forget" the initial requirement description or repeat already completed steps. Therefore, a strategy is needed to actively manage context size during conversation. x-code-cli's approach: when context usage exceeds a threshold, keep the most recent rounds of messages unchanged, and have the LLM generate a summary of the older messages to replace them:

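async function compressMessages(messages, model) {
  // Nothing old enough to summarize yet; return the history unchanged
  if (messages.length <= KEEP_RECENT) return messages

  const recent = messages.slice(-KEEP_RECENT)  // Keep most recent messages
  const old = messages.slice(0, -KEEP_RECENT)  // Everything older gets summarized
  const { text: summary } = await generateText({
    model,
    system: 'Summarize the conversation, preserving key decisions...',
    messages: old,
  })
  // Replace the old messages with a single summary message
  return [{ role: 'user', content: `[Summary]\n${summary}` }, ...recent]
}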

This is the most basic strategy for context management. More sophisticated approaches include: prompt cache (reusing prefixes to reduce duplicate input costs), selective loading (only putting file content relevant to the current task into context), and hierarchical knowledge bases (long-term invariant knowledge in system prompt, short-term context in message history).
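Two details around compression are easy to get wrong: when to trigger it, and where to cut. The sketch below is one possible approach under stated assumptions. The 0.8 ratio is an arbitrary example value, and the rule of never starting the kept portion with a tool result reflects the fact that many chat APIs expect a tool result to follow the assistant message that requested it; check your provider's rules before relying on it.

```typescript
interface Msg {
  role: "system" | "user" | "assistant" | "tool";
  content: string;
}

// Trigger compression once usage crosses a fraction of the context window.
function shouldCompress(usedTokens: number, windowTokens: number, ratio = 0.8): boolean {
  return usedTokens >= windowTokens * ratio;
}

// Split history into old (to summarize) and recent (to keep verbatim).
// The cut moves earlier while `recent` would begin with a tool result,
// so a tool result is never separated from the assistant turn before it.
function splitForCompression(
  messages: Msg[],
  keepRecent: number
): { old: Msg[]; recent: Msg[] } {
  let cut = Math.max(0, messages.length - keepRecent);
  while (cut > 0 && messages[cut]?.role === "tool") {
    cut--;
  }
  return { old: messages.slice(0, cut), recent: messages.slice(cut) };
}
```

As a result, `recent` may hold a few more messages than `keepRecent`; trading a slightly larger context for a history that stays valid is usually the safer choice.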

Permissions and Security. There is a fundamental difference between Agents and ordinary LLM conversations: Agents can actually perform actions—shell commands, file reads/writes, network requests. This means that if the model hallucinates, the consequences are no longer just a wrong answer. For example, the model might misjudge the directory scope when cleaning up a project and directly execute a delete command, or overwrite production configuration files during debugging. In ordinary conversations, a hallucination might result in a wrong text; in Agent scenarios, a single hallucination can cause file loss or system damage. Therefore, every Agent intended for production needs a permission mechanism to constrain what the model can and cannot do. x-code-cli implements a three-level permission model: default mode requires user confirmation for all write operations; trust mode skips confirmation; plan mode allows the Agent only to read and explore, not perform any write operations.
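A three-mode check of this kind can be expressed as a small decision function. This is a sketch of the idea, not x-code-cli's code: the tool names in `READ_ONLY_TOOLS` and the `decide` function are hypothetical, and any tool not on the read-only list is treated as a write operation so that unknown tools fall on the cautious side.

```typescript
type Mode = "default" | "trust" | "plan";
type Decision = "allow" | "ask" | "deny";

// Hypothetical classification of tools that cannot change state.
const READ_ONLY_TOOLS = ["readFile", "glob", "grep"];

function decide(mode: Mode, toolName: string): Decision {
  const readOnly = READ_ONLY_TOOLS.indexOf(toolName) !== -1;
  if (readOnly) return "allow"; // reading and exploring is allowed in every mode
  if (mode === "plan") return "deny"; // plan mode: no write operations at all
  if (mode === "trust") return "allow"; // trust mode: skip confirmation
  return "ask"; // default mode: confirm every write with the user
}
```

The agent loop would call `decide` before executing each tool call: `"ask"` pauses for user confirmation, and `"deny"` returns a refusal to the model as the tool result so it can adjust its plan.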

Evaluation and Observability. Evaluating Agent quality cannot rely solely on traditional unit tests; a different set of methods is needed—LLM-based automatic scoring, A/B comparison, success rate statistics, and manual spot checks. But knowing whether the result is good or bad is not enough; when problems occur, you need to be able to locate the cause, which depends on observability. Every step of Agent execution—model reasoning, tool calls, context changes—needs to be recorded for traceability and troubleshooting.
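As a rough illustration of what "recording every step" can mean in practice, the sketch below defines a trace event type and two aggregations. The event fields and function names are assumptions for the example; a production setup would typically also persist events and attach a session identifier.

```typescript
// One event per LLM call or tool execution. Field names are illustrative.
type TraceEvent =
  | { kind: "llm"; turn: number; inputTokens: number; outputTokens: number }
  | { kind: "tool"; turn: number; name: string; ok: boolean; durationMs: number };

interface TraceSummary {
  llmCalls: number;
  toolCalls: number;
  toolFailures: number;
  totalTokens: number;
}

// Collapse one task's trace into numbers that are easy to chart or alert on.
function summarizeTrace(events: TraceEvent[]): TraceSummary {
  const summary: TraceSummary = { llmCalls: 0, toolCalls: 0, toolFailures: 0, totalTokens: 0 };
  events.forEach((e) => {
    if (e.kind === "llm") {
      summary.llmCalls++;
      summary.totalTokens += e.inputTokens + e.outputTokens;
    } else {
      summary.toolCalls++;
      if (!e.ok) summary.toolFailures++;
    }
  });
  return summary;
}

// Success rate over many task runs, e.g. from an evaluation set.
function successRate(outcomes: boolean[]): number {
  if (outcomes.length === 0) return 0;
  return outcomes.filter((ok) => ok).length / outcomes.length;
}
```

A sudden rise in `toolFailures` per task or in `totalTokens` per task is often the first visible sign of the retry loops described earlier.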

Cost Control. The operating cost of an AI Agent is directly related to the number of API calls and token consumption. An unoptimized Agent running a complex task might cost tens or even hundreds of yuan. A ten-person team running dozens of tasks daily could face monthly API costs in the tens of thousands. Common cost optimization strategies include: choosing the right model (not all tasks need the most expensive model), prompt cache to reuse prefixes, reducing unnecessary tool call rounds, and compressing context at appropriate times.
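The reason costs climb faster than expected is that every turn resends the whole history, so total input tokens grow roughly quadratically with the number of turns. The sketch below works through that arithmetic. The `Price` fields and any numbers you plug in are placeholders, not real model prices, and the linear growth per turn is a simplifying assumption.

```typescript
// Prices in currency units per million tokens. Values are supplied by the
// caller; nothing here reflects a real provider's price list.
interface Price {
  inputPerMTok: number;
  outputPerMTok: number;
}

// Turn t (0-based) sends baseInput + t * growthPerTurn input tokens,
// because each earlier turn left its request and result in the history.
function taskCost(
  turns: number,
  baseInput: number,
  growthPerTurn: number,
  outputPerTurn: number,
  price: Price
): number {
  let inputTotal = 0;
  for (let t = 0; t < turns; t++) {
    inputTotal += baseInput + t * growthPerTurn;
  }
  const outputTotal = turns * outputPerTurn;
  return (inputTotal * price.inputPerMTok + outputTotal * price.outputPerMTok) / 1000000;
}
```

Under this model, doubling the number of turns more than doubles the cost, which is why cutting unnecessary tool rounds and compressing context both pay off.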

Recommended Materials for This Layer: The best way to learn engineering is to read the source code of mature Agent products—see how others implement context compression, design permission models, and handle tool call failures. This is more effective than just reading tutorials on concepts. There are many open-source AI Agent CLI projects now, such as OpenAI's Codex CLI, community-driven opencode, etc., all worth referencing. My own x-code-cli is one of them. Its advantage is that it comes with a companion Juejin booklet "Building an AI Agent CLI from Scratch", which has dedicated chapters on the engineering issues mentioned above—context management, permission control, loop detection, cost optimization—each chapter first explaining the design approach and then providing code implementation, suitable for readers who want to systematically understand these mechanisms.

Learning Path and Practice

Changes in Learning Methods in the AI Era

In traditional programming learning, the best approach is to find a well-validated course, learn theory while doing accompanying experiments. Nand2Tetris, MIT 6.828, SICP—these courses are repeatedly recommended because they strike a good balance between theory and practice.

In the AI era, learning methods themselves are changing. Now you can tell an AI: "I want to learn how to build an AI Agent from scratch. My goal is to create a CLI tool that can automate my daily development tasks. I have TypeScript and Node.js basics," and let the AI plan a learning path for you while generating accompanying practice code and experiments.

This learning method offers a level of personalization far beyond traditional courses—AI can adjust content in real-time based on your foundation, goals, and progress. However, note that having AI customize a learning path also requires significant time investment, and the code and explanations generated by AI are not necessarily correct. You need to verify each step's output yourself, and this verification cost is not low. So if there are already well-validated high-quality courses available online, prioritize them; AI-customized learning is better as a supplementary tool to fill gaps not covered by existing courses.

Recommended Learning Path

Step 1: Run a Minimal Agent. Don't start by installing LangChain. Use a language you're familiar with (Python, TypeScript, etc.), call the LLM API in the most primitive way, and write a while loop of 50 lines or less—the Agent Loop shown earlier. Make it accept user input, call one or two simple tools (like a calculator, file reader), and return results. The purpose of this step is not to make something useful, but to understand how each part of the Agent works: what the message format looks like, how tools are defined, how the loop is controlled, and when the loop should stop. If you want to see code directly, refer to my hello-agent, 56 lines of TypeScript, one readFile tool, clone and run. For Chinese tutorials, Datawhale "Building Agents from Scratch" is a free tutorial from the domestic open-source community, starting from a minimal Agent, balancing theory and practice.
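If you want to see the moving parts before wiring up a real API, the sketch below runs the whole loop offline. It is not hello-agent: `fakeModel` is a scripted stand-in for the LLM, the `add` tool and all type names are invented for illustration, and the calls are synchronous only because nothing goes over the network (a real model call would be `async`).

```typescript
type Msg =
  | { role: "user" | "assistant"; content: string }
  | { role: "tool"; name: string; content: string };

type ModelReply =
  | { finishReason: "tool-calls"; toolCalls: { name: string; args: { a: number; b: number } }[] }
  | { finishReason: "stop"; text: string };

// Scripted stand-in for the LLM: request the calculator once,
// then answer based on the tool result it finds in the history.
function fakeModel(messages: Msg[]): ModelReply {
  const last = messages[messages.length - 1];
  if (last && last.role === "tool") {
    return { finishReason: "stop", text: `The answer is ${last.content}.` };
  }
  return { finishReason: "tool-calls", toolCalls: [{ name: "add", args: { a: 2, b: 3 } }] };
}

const tools: Record<string, (args: { a: number; b: number }) => string> = {
  add: (args) => String(args.a + args.b),
};

function runAgent(userInput: string, maxTurns = 5): string {
  const messages: Msg[] = [{ role: "user", content: userInput }];
  for (let turn = 0; turn < maxTurns; turn++) {
    const reply = fakeModel(messages);
    if (reply.finishReason === "stop") return reply.text;

    // Record the assistant's request, then each tool result, in the history.
    const names = reply.toolCalls.map((c) => c.name).join(", ");
    messages.push({ role: "assistant", content: `[tool-calls] ${names}` });
    reply.toolCalls.forEach((call) => {
      const impl = Object.prototype.hasOwnProperty.call(tools, call.name)
        ? tools[call.name]
        : undefined;
      const result = impl ? impl(call.args) : `Error: unknown tool "${call.name}"`;
      messages.push({ role: "tool", name: call.name, content: result });
    });
  }
  return "Stopped: reached maxTurns without a final answer.";
}
```

Replacing `fakeModel` with a real API call is the only change needed to turn this into a working minimal Agent; the message list, the tool table, and the turn limit stay the same.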

Step 2: Understand Prompt Engineering. After running the minimal Agent, start learning Prompt Engineering systematically. The focus is not memorizing "100 Prompt techniques" but the three tasks named in the Prompt Engineering section: writing system prompts that keep model behavior stable, wording tool descriptions to reduce mis-calls, and recognizing the writing patterns that tend to pull the model away from the expected output.

Step 3: Expand the Tool Set. Based on the minimal Agent from Step 1, gradually add tools: file read/write, Shell execution, code search, web scraping. With each added tool, observe the model's behavior—when does it use it correctly, when incorrectly, and why. After several rounds, you'll understand how to write tool schemas and descriptions for correct model invocation.

Step 4: Add Memory and RAG. At this point, your Agent can complete tasks within a single conversation, but it remembers nothing when a new conversation starts: agreements from the previous conversation and your project documents are both out of its reach. Solving this requires two mechanisms working together. Long-term memory saves key information from conversations: store facts worth remembering to a local file, then load them into the system prompt at the next startup. RAG connects to external documents: chunk and vectorize a set of documents, store them in a vector database, retrieve relevant fragments when a question is asked, and incorporate them into the Prompt. The Memory section earlier showed x-code-cli's background extraction scheme; you can refer to that approach to implement your own filtering logic. For RAG, you need to understand the basic concepts of vector embeddings and similarity search; the balance between chunk granularity and retrieval quality is the aspect you will tune most often.

Step 5: Build a Complete Project. By this step, you should be able to build a relatively complete Agent project. If you haven't decided what to build, try one of these directions: build a CLI Agent (the purest form of Agent, no frontend needed, focus on core Agent logic); build a RAG Q&A bot (turn your notes, documents into a knowledge base); build an automated workflow Agent (hand over repetitive daily development tasks to the Agent). The Layer 3 recommended materials mentioned several open-source CLI Agent projects; at this step, you can pick one of interest and compare its source code to validate the understanding accumulated in the first four steps.

Conclusion

The core point of this article is: AI Agent engineer is not a new career that requires learning from scratch, but a combination of existing development experience plus a specific set of knowledge. The concepts to learn are few; the difficulty lies in turning concepts into engineering implementations that can go into production.

If you're still hesitating about whether to invest time in learning AI Agent development, my advice is to first spend an hour or two running a minimal Agent—just Step 1 of the learning path is enough. hello-agent was written for this purpose: 56 lines of code, clone and run. After running it, you'll have a basic understanding of how an Agent works, and then decide whether to go deeper.

AI Agent engineering is still in a rapidly evolving stage. What was done six months ago may now have better alternatives. Rather than pursuing learning everything at once, maintain a hands-on rhythm and accumulate experience through real projects. This field changes quickly, but the core principles—message passing, tool calling, context management—will not become obsolete easily. Master these fundamentals, and you won't be at a loss when facing new frameworks and model capabilities.