
The Agent Anatomy: How MLLMs, Memory, and Multi-Agent Loops Build AI That Thinks Like Us

AI Agent Composition: An Intelligent Agent That Thinks Like a Human

© 2026 by ethan.tan (Tan Ming) · All Rights Reserved · Illustrated First Edition · 2026.07.02

One-Liner: The goal of an AI Agent is to enable AI to perceive, think, remember, and act like a human, autonomously solving our various problems.

Era Judgment: MLLMs (Multimodal Large Models) are now powerful enough, reaching professional levels in certain domains. The conclusion is simple—embrace AI, embrace change.

Global Architecture Diagram

Before diving into each layer, here is a global architecture overview—centered on the MLLM, it maps perception, brain, thinking methods, memory, actions, capability collaboration, and Skills onto a single diagram to show how they connect into an agent that thinks like a human:

image.png

Diagram Guide (from top-left to bottom-right, inside-out):

This diagram is the "map" of the entire article—each subsequent chapter is a detailed breakdown of a specific block in the diagram. It is recommended to understand this master diagram first before diving into the details of each layer.




Introduction: A New Computer Organization

The goal of an AI Agent is to enable AI to perceive, think, remember, and act like a human. A regular LLM is a passive tool that "answers one question at a time"; an Agent is an active executor that can plan autonomously, call tools, maintain persistent memory, and adjust dynamically. The difference lies in having "hands and feet" (actions and tools), memory, and a "methodology" (thinking methods).

The best way to understand an Agent is to view it as a "new computer", while referencing two frames of reference: humans (the object of imitation) and traditional computers (the engineering carrier).

Human | Traditional Computer | AI Agent | Essence
Eyes / Ears (Senses) | Input Devices (Keyboard/Mouse) | Perception Layer (NLP/CV/ASR) | Receiving external info
Hands / Mouth (Body & Language) | Output Devices (Monitor/Printer) | Action Layer (Virtual Output/Devices/Robots) | Acting on the external world
Brain | CPU | Intelligent Brain (LLM/MLLM) | Core computing & reasoning
Thinking Style / Methodology | Controller / Program | Thinking Methods | Organizing logic for reasoning & action
Memory (Short-term/Long-term) | RAM / Hard Drive | Memory Layer | State & knowledge storage
Tools / Toolbox | Peripherals / Bus | MCP / Tools | Connecting external tools & systems
Language / Collaboration | Network Protocol | A2A (Agent to Agent) | Inter-agent communication
Professional Skills / Experience | Software Libraries / Experience | Skill | Experience reuse

This table is the key to the entire article: Every layer of the Agent has a corresponding counterpart in humans. Each subsequent chapter will return to this "Human vs. Agent" main thread.

The Four-Module Cognitive Cycle: The Mainstream Theoretical Framework in the Industry

The above mapping has theoretical backing from academia and industry. Mainstream LLM Agent architecture surveys align with the "Perception-Brain-Action-Memory" four-module framework. This article uses that as a skeleton, separates "Thinking Methods" into its own chapter, covers "Action" together with perception in a single chapter, and expands the whole into a complete structure:

Perceive Environment → Think → Take Action → Form Memory → (Use memory to guide the next round of thinking and action) → Loop
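The loop above can be sketched in a few lines. This is a minimal illustration only: the `LoopAgent` class and its method names are hypothetical, and a rule stands in for the LLM/MLLM call.

```python
from dataclasses import dataclass, field


@dataclass
class LoopAgent:
    """Toy agent; every name here is hypothetical, not from any framework."""
    memory: list = field(default_factory=list)

    def perceive(self, raw: str) -> str:
        # Perception: normalize raw input into a form the "brain" can use.
        return raw.strip().lower()

    def think(self, observation: str) -> str:
        # Brain: a stand-in rule; a real agent would call an LLM/MLLM here.
        return "skip" if observation in self.memory else "handle"

    def act(self, decision: str, observation: str) -> str:
        # Action: produce an effect on the outside world (here, just a string).
        return f"{decision}:{observation}"

    def step(self, raw: str) -> str:
        observation = self.perceive(raw)
        decision = self.think(observation)
        result = self.act(decision, observation)
        self.memory.append(observation)  # Memory guides the next round.
        return result


agent = LoopAgent()
first = agent.step("  Open the report ")
second = agent.step("open the report")
```

The second call takes a different branch only because the first call left something in memory, which is the property a one-shot Q&A lacks.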

The four modules form a continuous cognitive loop, which is the fundamental difference between an Agent and a "one-shot Q&A":

image.png

The Essence of Architectural Evolution: Agent architecture has evolved from a "single model wrapper" into a modular system. The core idea is to borrow from human cognitive models, decoupling capabilities into modules that are both independent and collaborative. Below, we break it down layer by layer, centered on the MLLM.


I. Perception and Action: Input and Output Layers

For a person to do things, they must first "perceive" the environment (see, hear, read) and then "act" to produce results (speak, write, operate). These two layers form the interaction boundary between the Agent and the world.

1. Perception Layer (Input)

Perception is the entry point for the Agent to receive external information. The three types of input modalities correspond directly to the mechanization of three human senses:

Human Sense | Agent Perception | Capability
Eyes (Seeing) | Vision (Image/Picture Input) | Image understanding, OCR document recognition
Ears (Hearing) | Auditory (Voice/Audio Input) | ASR speech-to-text
Reading Text/Speaking | NLP (Natural Language Input) | Text intent parsing, currently the most mature channel

Unified Representation of Multimodal Information — The primary task of the perception module is to unify heterogeneous data sources into a form the brain can process:

Key Technologies and Correspondences:

Technology | Corresponding Sense | Role
NLP | Reading/Speaking | Intent recognition, entity extraction, sentiment analysis, long-text understanding
CV | Seeing | UI operation Agent locating buttons/input boxes; robot obstacle recognition
ASR | Hearing | Voice interaction, key for intelligent customer service/smart homes
Multimodal Fusion | Synthesis | Achieves deep cross-modal correlation via Cross-Attention, producing a "1+1>2" effect

Trend: Multimodal fusion. A single MLLM consumes text, images, and voice simultaneously, avoiding the information loss of multi-model splicing. Humans don't split "seeing" and "hearing" into two independent systems, and neither should Agents.

Human vs. Agent: Human senses are naturally fused and come with common sense; an Agent's multimodality still requires deliberate splicing and easily loses cross-modal associations.

2. Action Layer (Output)

The Agent must be able to "do things." Categorized by the target of action, there are three types, with capabilities extending from virtual to physical:

Human Action | Agent Action | Capability
Writing/Drawing/Using a PC | Virtual Output | Content generation (text/image/video/files), browser automation
Using a Phone/Switching Appliances | Device Operation | Phone/PC control, smart home hardware manipulation
Physical Labor/Operating Machinery | Robotics | Software-hardware synergy to execute physical actions (Embodied Intelligence)

Progressive Expansion of Three Action Types:

Tools: Infinite Expansion of Capabilities

The action layer is implemented as tool invocation. By combining tools, the Agent breaks through the LLM's own limitations to complete multi-step tasks. Common tool types:

Tool Type | Examples
Information Retrieval | Search engines, database queries, weather/stock/news APIs
Calculation & Analysis | Calculator, code interpreter, data analysis libraries
Content Generation | Image generation, speech synthesis
Application Control | Sending emails, creating calendar events, operating CRM
Physical World Interaction | Controlling robots, drones, smart homes

The three action types progressively extend from the digital world to the physical world. Embodied intelligence is the ultimate form of the Agent—enabling AI not just to "think online," but to "act offline."

Human vs. Agent: Humans far surpass current Agents in fine physical manipulation, but Agents already have an advantage in virtual output and parallel cross-device control.


II. The Intelligent Brain: Computing and Model Base

The brain is the core computing unit of the Agent. This layer is divided into two parts: computing power types and the model base.

1. Three Types of Computing Power: Evolution from Judgment to Multimodality

The "brain" has followed a clear evolutionary path—Classifier → LLM → MLLM. Each leap breaks through the bottlenecks of the previous stage.

Three Stages of Evolution:

Stage 1 · Classifier — Traditional machine learning, solving clearly bounded classification problems. Lightweight, deterministic, low cost, but each task requires specifically trained data and cannot generate or reason.

Stage 2 · LLM — A general reasoning and generation engine, serving as the Agent's "main brain." One model handles thousands of tasks, but only understands text and cannot perceive multimodal information.

Stage 3 · MLLM — Unifies the processing of text, images, voice, and other modalities on top of the LLM. It is the evolutionary direction of the "all-around brain" and the center of this article's architecture.

Evolution Comparison Table:

Stage | Representative | Approximate Time | Breakthrough | Limitation
Classifier | Traditional ML | 1950s–2010s | Learned to categorize | Specialized, needs retraining, cannot generate
LLM | Large Language Model | 2018–2022 | General-purpose, understands intent | Only understands text, no perception
MLLM | Multimodal Large Model | 2023–Present | Unified seeing, hearing, speaking, thinking | Current central node, continuously evolving

Correction Note: The LLM era began with GPT-1 and BERT in 2018; 2017 was the year the Transformer paper was published, serving as the foundation, not the era itself. The CoT paper was published in January 2022.

Brain Evolution Chain Visualization:

image.png

Key Milestones:

Classifiers solve "what is it," LLMs solve "how to do it," and MLLMs solve "all-around perception and decision-making." These three form an evolutionary chain of capability leaps.

Human vs. Agent: The human brain relies on intuition and common sense, is energy-efficient, and can extrapolate; Agents rely on statistical patterns, excelling in breadth and speed but weak in causal understanding and physical common sense.

Underlying Cornerstone: Chain of Thought (CoT)

Before diving into specific thinking methods, understand their common underlying technology—Chain of Thought (CoT). Proposed by Google researchers in January 2022, its core is to guide the LLM to generate a step-by-step reasoning process before answering, improving accuracy on multi-step logic problems.

Zero-shot CoT Example:

Q: There are 5 apples in a basket. Xiao Ming takes 2 and then puts 1 back. How many are there now?

A: Let's think step by step.

Initially 5 → Take away 2 leaves 3 → Put back 1 leaves 4 → Final Answer: 4

CoT provides structured expression for the Agent's thinking and is the foundation for subsequent complex thinking methods.
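As a small illustration of the example above, the sketch below builds a zero-shot CoT prompt and spells out the intermediate states a step-by-step answer should pass through. The function names are hypothetical, and no model is called.

```python
COT_TRIGGER = "Let's think step by step."


def zero_shot_cot_prompt(question: str) -> str:
    """Append the zero-shot CoT trigger so the model reasons before answering."""
    return f"Q: {question.strip()}\nA: {COT_TRIGGER}"


def apple_steps(start: int, taken: int, returned: int) -> list[int]:
    """Intermediate states for the apple problem: start, after taking, after returning."""
    after_take = start - taken
    after_return = after_take + returned
    return [start, after_take, after_return]


prompt = zero_shot_cot_prompt(
    "There are 5 apples in a basket. Xiao Ming takes 2 and then puts 1 back. "
    "How many are there now?"
)
steps = apple_steps(5, 2, 1)
```

The only change to the prompt is the trailing trigger sentence; the reasoning chain itself is produced by the model.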

2. Model Base and Further Reading: "Building a Large Model from Scratch"

Model capability comes from two types of bases:

Selection Logic: Use a general base whenever possible; only use proprietary fine-tuning when vertical domain accuracy is insufficient.

To truly understand the "Classifier → LLM → MLLM" evolution chain and the internal structure behind the model base, further reading is recommended: "Building a Large Model from Scratch"—it breaks down how a large model is "stacked" step-by-step, from data preparation, architecture design, pre-training, and fine-tuning to instruction alignment. We've drawn the book's core thread into two flowcharts: The first clarifies the main "input to output" link, and the second clarifies "how capabilities are expanded."

Figure 1: Core Workflow of a Large Model

image.png

Key Takeaway: The essence of a large model is "slicing text into tokens, converting tokens into vectors, and then predicting one token at a time via autoregression."
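The takeaway can be made concrete with a toy sketch. Everything here is a stand-in chosen for brevity: whitespace splitting replaces a real tokenizer, one-hot rows replace learned embeddings, and a bigram count table replaces the trained network. The `embed` function is shown only to illustrate the token-to-vector step; the bigram table does not consume it.

```python
from collections import Counter, defaultdict

corpus = "the agent reads the file then the agent writes the report"

# 1) Slice text into tokens (whitespace split stands in for a real tokenizer).
tokens = corpus.split()
vocab = sorted(set(tokens))
token_to_id = {tok: i for i, tok in enumerate(vocab)}


# 2) Convert tokens into vectors (one-hot rows stand in for learned embeddings).
def embed(token: str) -> list[float]:
    vec = [0.0] * len(vocab)
    vec[token_to_id[token]] = 1.0
    return vec


# 3) "Model": bigram counts stand in for a trained network.
next_counts: dict[str, Counter] = defaultdict(Counter)
for prev, nxt in zip(tokens, tokens[1:]):
    next_counts[prev][nxt] += 1


def generate(prompt: str, max_new_tokens: int) -> list[str]:
    """Autoregression: predict one token, append it, and predict again."""
    out = prompt.split()
    for _ in range(max_new_tokens):
        candidates = next_counts.get(out[-1])
        if not candidates:  # No known continuation: stop early.
            break
        out.append(candidates.most_common(1)[0][0])
    return out
```

Each generated token becomes part of the input for the next prediction, which is what "one token at a time" means in practice.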


III. Thinking Methods: Control Flow

This is the soul that distinguishes an Agent from a regular LLM.

A regular LLM is a "one-shot Q&A." An Agent is a cyclical "Reasoning → Action → Observation" process that can dynamically adjust based on intermediate results. The logic that determines "how to cycle" is the thinking method.

Four thinking methods have clear boundaries:

① ReAct

Mechanism: Thought → Action → Observation loop, deciding the next step based on the observation at each step. Jointly proposed by Princeton University and Google, it is currently the most widely used Agent thinking method, with the core being the combination of CoT and tool invocation.

Key Constraint: An exit condition must be determined in advance, otherwise it will fall into an infinite loop.

Advantages: Dynamic adaptation, explainable and controllable, strong error correction. When a step fails, the Agent can remedy it in the next round (re-search with different keywords, switch APIs).

Challenges: Requires multiple interactions with the LLM and tools, leading to higher latency and cost.

Suitable for: Tasks with strong exploratory nature and high uncertainty (open research, information retrieval, debugging).
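A minimal sketch of the loop, with two exit conditions (a step budget and an explicit "finish" action). The `scripted_policy` function is a hypothetical stand-in for the LLM, and the single `lookup` tool is invented for illustration.

```python
from typing import Callable

# Hypothetical tool registry; a real agent would wrap search APIs, databases, etc.
TOOLS: dict[str, Callable[[str], str]] = {
    "lookup": lambda key: {"capital of france": "Paris"}.get(key.lower(), "NOT_FOUND"),
}


def scripted_policy(question: str, history: list[tuple[str, str, str]]) -> tuple[str, str, str]:
    """Stand-in for the LLM: returns (thought, action, action_input)."""
    if not history:
        return ("I should look this up.", "lookup", question)
    last_observation = history[-1][2]
    if last_observation == "NOT_FOUND":
        return ("Retry with a simpler query.", "lookup", "capital of france")
    return ("I have the answer.", "finish", last_observation)


def react(question: str, max_steps: int = 5) -> str:
    history: list[tuple[str, str, str]] = []  # (thought, action, observation)
    for _ in range(max_steps):  # Exit condition 1: hard step budget.
        thought, action, arg = scripted_policy(question, history)
        if action == "finish":  # Exit condition 2: the policy declares it is done.
            return arg
        observation = TOOLS[action](arg)
        history.append((thought, action, observation))
    return "GAVE_UP"
```

With the default budget, `react("What is the capital of France?")` fails its first lookup, retries with a simpler query, and then finishes; with `max_steps=1` it gives up, which shows why the budget is needed as a backstop.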

Flowchart: a Thought → Action → Observation loop that converges via an "exit condition":

image.png

② Plan-and-Execute

Mechanism: First make a global plan, decomposing the task into ordered steps, then execute step-by-step.

Characteristics: Good global perspective; high efficiency and low cost when tasks are clear.

Trade-off with ReAct: ReAct is locally flexible but may drift from the global goal; Plan-and-Execute (abbreviated PlanExe below) is globally clear but less flexible, and the plan may need adjustment if the environment changes during execution. Mature implementations usually include a replan mechanism.

Suitable for: Tasks with relatively standard processes and predictable steps.
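The two stages plus a replan step can be sketched as follows. The planner, executor, and replan rule are all hypothetical stand-ins; a real system would delegate planning and replanning to an LLM.

```python
def make_plan(goal: str) -> list[str]:
    """Stand-in planner; a real system would ask an LLM to decompose the goal."""
    return ["fetch data", "clean data", "write report"]


def execute(step: str, environment: dict) -> bool:
    """Stand-in executor: a step fails if the environment marks it as broken."""
    return step not in environment.get("broken", set())


def replan(failed_step: str, remaining: list[str]) -> list[str]:
    """Hypothetical replan rule: swap the failed step for a fallback, keep the rest."""
    return [f"fallback for {failed_step}"] + remaining


def plan_and_execute(goal: str, environment: dict, max_replans: int = 2) -> list[str]:
    plan = make_plan(goal)          # Stage 1: plan everything up front.
    done, replans = [], 0
    while plan:                     # Stage 2: execute in order.
        step, plan = plan[0], plan[1:]
        if execute(step, environment):
            done.append(step)
        elif replans < max_replans:
            replans += 1
            plan = replan(step, plan)
        else:
            raise RuntimeError(f"could not complete: {step}")
    return done
```

Capping the number of replans keeps the correction loop from running forever when the environment never cooperates.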

Flowchart — Two-stage "plan first, then execute," with a replan correction loop:

image.png

③ Reflection

Mechanism: Generate an initial version → Identify flaws → Improve and optimize, iterating to enhance quality. Represented by Reflexion and LATS.

Characteristics: "Have it first, then make it good"—first solve "is there one," then solve "is it good."

Suitable for: Quality-oriented tasks that can be iteratively polished (code generation, copywriting, solution design).
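A minimal sketch of the generate, critique, improve cycle. The three helper functions are hypothetical stand-ins: in practice each would be an LLM call or a test run, and this sketch does not reproduce Reflexion or LATS.

```python
def generate_draft(task: str) -> str:
    return "def add(a, b): return a - b"  # Deliberately flawed first version.


def critique(draft: str) -> list[str]:
    """Stand-in critic; a real one would be an LLM call or a test run."""
    issues = []
    if "a - b" in draft:
        issues.append("uses subtraction instead of addition")
    return issues


def improve(draft: str, issues: list[str]) -> str:
    if "uses subtraction instead of addition" in issues:
        draft = draft.replace("a - b", "a + b")
    return draft


def reflect(task: str, max_rounds: int = 3) -> tuple[str, int]:
    """Return the final draft and the number of improvement rounds used."""
    draft = generate_draft(task)
    for round_number in range(max_rounds):
        issues = critique(draft)
        if not issues:  # Stop as soon as the critic finds nothing to fix.
            return draft, round_number
        draft = improve(draft, issues)
    return draft, max_rounds


final_draft, rounds_used = reflect("write an add function")
```

The round cap bounds the extra cost that reflection adds on top of the initial generation.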

Flowchart — Self-iterative loop of "Generate → Reflect → Improve":

image.png

④ Multi-Agent

Mechanism: One orchestrating Agent (Master/Orchestrator) dispatches multiple specialized sub-Agents. Its essence is a Multi-Agent System (MAS).

Why MAS is needed: ① Specialized division of labor; ② Tasks can be parallelized; ③ Scalable, failure of a single Agent does not crash the system; ④ Can simulate complex systems.
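The sketch below illustrates three of these points (division of labor, parallelism, and fault isolation) with plain threads. The sub-agents are hypothetical one-line functions standing in for LLM-backed agents.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical specialists; each would normally be its own LLM-backed agent.
SUB_AGENTS = {
    "research": lambda task: f"[research] notes on {task}",
    "code": lambda task: f"[code] prototype for {task}",
    "review": lambda task: f"[review] checklist for {task}",
}


def run_sub_agent(role: str, task: str) -> str:
    try:
        return SUB_AGENTS[role](task)
    except Exception as exc:  # One failing sub-agent must not crash the run.
        return f"[{role}] failed: {exc}"


def orchestrate(task: str, roles: list[str]) -> dict[str, str]:
    """Dispatch the task to each role in parallel and aggregate the results."""
    with ThreadPoolExecutor(max_workers=len(roles)) as pool:
        futures = {role: pool.submit(run_sub_agent, role, task) for role in roles}
        return {role: future.result() for role, future in futures.items()}
```

An unknown role produces a failure message in the aggregated result instead of an exception, so the orchestrator can still return the work of the other sub-agents.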

Flowchart — Orchestrator decomposes and dispatches, sub-Agents perform their roles, results are aggregated and returned:

image.png

Comparison and Selection of the Four Methods

Thinking Method | Core Logic | Globality | Advantages | Disadvantages | Suitable Scenarios | Analogy
ReAct | Step-by-step | Weak (Local) | Dynamic adaptation, explainable, strong error correction | High cost, high latency | Exploratory, uncertain tasks | Taking career planning one step at a time
PlanExe | Plan then execute | Strong (Global) | Structured, efficient when tasks are clear | Poor flexibility, hard to handle surprises | Standard processes, predictable tasks | Decompose first, then act
Reflection | Have then optimize | Medium (Iterative) | Self-learning iteration, high output quality | Further increases cost and latency | Quality-oriented, polishable tasks | Agile development iteration
Multi-Agent | Each does its job | Strong (Division of labor) | Specialized division, parallel, scalable | Complex coordination | Complex, cross-domain tasks | Team specialization

Selection Principle: The more uncertain the task → lean towards ReAct; the more standard the task → lean towards PlanExe; the higher the quality requirement → layer on Reflection; the higher the complexity → go Multi-Agent.

Combined Use in Practice: Complex systems can first use PlanExe to formulate a macro plan, use ReAct for the details of each macro step, and introduce Reflection after key nodes for checking.

Human vs. Agent: These four methods make explicit the unconscious thinking habits of humans. Humans excel at "metacognition," knowing what method they are using to think; an Agent's thinking method is still preset and requires human selection.


IV. Memory Layer: Storage and Retrieval

Memory is the Agent's state storage. Without memory, the Agent starts from scratch with every conversation, unable to learn or understand user preferences. It is divided into two levels by scope: short-term memory and long-term memory.

1. Two Levels by Scope

① Session-Level Memory

Short-term memory stores the current task context and disappears when the task ends. Its main form is conversation history.

Implementation: Directly utilizes the LLM's context window. When the conversation becomes too long, compression is needed:

② Cross-Session / Persistent Memory

Long-term memory stores information across tasks and sessions. The core technology is RAG. It comes in three deployment forms:

RAG (Retrieval-Augmented Generation)

The LLM's context window is limited and cannot hold all knowledge. RAG's solution is to attach an external knowledge base to the LLM: before generation, it retrieves the most relevant information from an external database and feeds it to the LLM as additional context. The mechanism is "fetch on demand," not "cram everything into the brain."

RAG's Four-Step Mechanism (using "user likes lattes" as an example):

Step | Action | Example
① Store | Convert long-term memory into high-dimensional vectors via an embedding model and store in a vector database | "I like lattes" → vector stored
② Retrieve | When relevant clues appear in a subsequent conversation, convert the question into a vector and perform a similarity search | "Recommend a coffee" → recall "likes lattes"
③ Augment | Use the retrieved memory as context and send it to the LLM along with the question | Known info: user likes lattes
④ Generate | The LLM generates a personalized response based on the augmented context | "Perhaps a classic latte is a good choice"
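The first three steps can be sketched without any external service. This is a toy under stated assumptions: bag-of-words counts stand in for a real embedding model, so retrieval here only works when the query shares a word with the memory (a real embedding would also match "lattes" to "coffee" semantically). The `MemoryStore` class is hypothetical, and step ④ (the LLM call) is omitted.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: bag-of-words counts stand in for a real embedding model."""
    return Counter(text.lower().replace(".", "").split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class MemoryStore:
    def __init__(self) -> None:
        self.items: list[tuple[str, Counter]] = []

    def store(self, text: str) -> None:                       # Step 1: Store
        self.items.append((text, embed(text)))

    def retrieve(self, query: str, k: int = 1) -> list[str]:  # Step 2: Retrieve
        q = embed(query)
        ranked = sorted(self.items, key=lambda item: cosine(q, item[1]), reverse=True)
        return [text for text, _ in ranked[:k]]


def augment(query: str, memories: list[str]) -> str:          # Step 3: Augment
    return "Known info: " + "; ".join(memories) + f"\nQuestion: {query}"


memory = MemoryStore()
memory.store("The user likes lattes as their coffee order.")
memory.store("The user lives in Shanghai.")
hits = memory.retrieve("Recommend a coffee")
augmented_prompt = augment("Recommend a coffee", hits)
```

The augmented prompt is what would be sent to the LLM in step ④.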

RAG Four-Step Mechanism Flowchart:

image.png

2. Storage Base

Mainstream Vector Database Comparison (2026):

Database | Type | Core Advantages | Main Application Scenarios
Pinecone | Commercial Cloud Service | Fully managed, out-of-the-box, stable performance | Rapid prototyping, SMB applications
Milvus | Open Source | Distributed architecture, high scalability, rich features | Large-scale production environments, high scalability needs
Weaviate | Open Source | Multimodal support, built-in multiple Embedding models, GraphQL interface | Complex data types, multimodal retrieval applications
ChromaDB | Open Source | Lightweight, Python-native, developer-friendly | Local development, data science experiments, small applications
Redis | Open Source/Commercial | In-memory database, extremely low latency, versatile (combined with RediSearch) | Scenarios demanding extreme real-time performance, existing Redis systems

In practice, hybrid retrieval (vector + keyword) is often used to balance semantics and exact matching.

Human vs. Agent: Human memory has emotional weighting and associative triggers, and actively forgets irrelevant details; Agent memory relies on explicit storage and recall, precise and lossless but lacking emotion and contextual association.

Memory Dimension | Human | Agent | Comparison Point
Short-term | Working memory (approx. 7±2 items) | Session-level memory (context window) | Agent has larger capacity but easily overflows and loses info
Long-term | Experience, skills, emotional memory | Persistent memory (MD/SQLite/Vector DB) | Agent is precise and lossless; humans rely on associative reconstruction
Retrieval | Association + emotional trigger | RAG vector/keyword retrieval | Agent can recall fully; human recall rate is lower but relevance is higher
Forgetting | Active forgetting | Requires designed decay mechanism | Forgetting is a noise-reduction advantage for humans
Cross-device | Cannot be migrated | Multi-device sync | Agent is migratable

The memory layer is an area where Agents structurally surpass humans—precise, lossless, cross-device, migratable. The cost is the need to actively design a decay mechanism, otherwise "remembering too much" actually dilutes relevance.
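One possible shape for such a decay mechanism is an exponential half-life applied to the retrieval score. This is an assumption for illustration, not a mechanism the article prescribes: the function names, the 30-day half-life, and the 0.2 threshold are all hypothetical.

```python
import math


def decayed_score(similarity: float, age_days: float, half_life_days: float = 30.0) -> float:
    """Down-weight a memory's relevance exponentially with its age."""
    return similarity * math.pow(0.5, age_days / half_life_days)


def prune(memories: list[dict], threshold: float = 0.2) -> list[dict]:
    """Keep only memories whose decayed score stays at or above the threshold."""
    return [
        m for m in memories
        if decayed_score(m["similarity"], m["age_days"]) >= threshold
    ]


memories = [
    {"text": "prefers lattes", "similarity": 0.8, "age_days": 0},
    {"text": "asked about tea once", "similarity": 0.8, "age_days": 90},
]
kept = prune(memories)
```

Two memories with the same raw similarity end up ranked differently once age is taken into account, which is the noise-reduction effect that human forgetting provides for free.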


V. Capabilities, Collaboration, and Connection: Peripherals and Bus

Perception allows the Agent to "input," the brain allows it to "think," and action allows it to "output." Connecting to external tools and collaborating with other Agents requires a connection layer.

Three-Layer Evolution Overview — Moving from "monolithic capability" to a "collaborative network," three types of infrastructure progressively expand the Agent's boundaries:

image.png

Connection Layer | Analogy | Role | Scope | Representative
Tools | Hand / Single Peripheral | Callable functions, execute atomic actions like queries and calculations, no unified standard | Monolithic Capability | Function Calling
MCP | USB Port / Bus | Tool side implements once per protocol, all Agents reuse, unified discovery and management | Tool Ecosystem Standardization | stdio / SSE
A2A | Network Protocol / Internet | Agent Card supports discovery, mutual trust, and cross-Agent collaboration | Collaborative Network | Agent Card

1. Tools

Specific functions the Agent can call to execute atomic actions like queries, calculations, and external operations. Without a unified standard, each Agent integrates independently, and the integration cost grows linearly with the number of tools.
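A minimal sketch of what "calling a function as a tool" involves: a registry that keeps the callable alongside the description shown to the model, and a dispatcher that executes a structured call. The registry layout and the `{"name": ..., "arguments": {...}}` call shape are assumptions made for illustration, not any vendor's exact function-calling format.

```python
import json
from typing import Callable

TOOL_REGISTRY: dict[str, dict] = {}


def register_tool(name: str, description: str, parameters: dict) -> Callable:
    """Decorator that records a function plus the schema shown to the model."""
    def wrap(fn: Callable) -> Callable:
        TOOL_REGISTRY[name] = {"fn": fn, "description": description, "parameters": parameters}
        return fn
    return wrap


@register_tool("add", "Add two numbers.", {"a": "number", "b": "number"})
def add(a: float, b: float) -> float:
    return a + b


def dispatch(tool_call_json: str) -> str:
    """Execute a model-emitted call of the form {"name": ..., "arguments": {...}}."""
    call = json.loads(tool_call_json)
    tool = TOOL_REGISTRY[call["name"]]
    return json.dumps({"result": tool["fn"](**call["arguments"])})


response = dispatch('{"name": "add", "arguments": {"a": 2, "b": 3}}')
```

Every Agent that wants this tool has to repeat the registration in its own format, which is the duplication a shared protocol is meant to remove.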

2. MCP

MCP (Model Context Protocol), open-sourced by Anthropic in 2024, turns "tools" into "plug-and-play peripherals." The tool side implements an MCP Server once, and any MCP-supporting Agent can connect.

MCP Core Three Elements:

Element | Role
Resources | Readable data exposed to the Agent
Tools | Executable functions, actively called by the Agent
Prompts | Preset prompt templates, standardized interaction

Boundary Note: MCP Prompts are "reusable interaction templates" exposed outward by the tool side; Skills in Chapter VI are "workflow prompts + domain knowledge" internalized by the Agent.
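On the wire, MCP messages are JSON-RPC 2.0 requests. The sketch below builds one request per element using the method names `tools/list`, `tools/call`, `resources/read`, and `prompts/get` as I understand them from the MCP specification; the tool name, URI, and prompt name are hypothetical, and the exact parameter shapes should be checked against the spec version in use.

```python
import json
from itertools import count
from typing import Optional

_ids = count(1)


def jsonrpc_request(method: str, params: Optional[dict] = None) -> str:
    """Serialize a JSON-RPC 2.0 request with an auto-incrementing id."""
    message = {"jsonrpc": "2.0", "id": next(_ids), "method": method}
    if params is not None:
        message["params"] = params
    return json.dumps(message)


# "get_weather", the file URI, and "code_review" are hypothetical examples.
list_tools = jsonrpc_request("tools/list")
call_tool = jsonrpc_request(
    "tools/call", {"name": "get_weather", "arguments": {"city": "Paris"}}
)
read_resource = jsonrpc_request("resources/read", {"uri": "file:///notes/todo.md"})
get_prompt = jsonrpc_request(
    "prompts/get", {"name": "code_review", "arguments": {"language": "python"}}
)
```

Because every server speaks the same envelope, an Agent only needs one client implementation to reach any tool, resource, or prompt.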

3. A2A

A2A (Agent2Agent Protocol), proposed by Google, allows Agents from different vendors and frameworks to discover, negotiate, and delegate tasks to each other, building an "Internet of Agents."

Agent Card: Each Agent publishes a standardized card declaring its capabilities, endpoints, authentication methods, and supported inputs/outputs.

Four Steps of Collaboration: Discover → Negotiate → Delegate → Return.
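The four steps can be sketched with in-memory cards. The card field names below are illustrative and not copied from the A2A schema, the endpoints are placeholders, and `delegate` stands in for an authenticated network call.

```python
AGENT_CARDS = [
    {   # Field names are illustrative, not taken from the A2A schema.
        "name": "translator",
        "endpoint": "https://agents.example.com/translator",
        "auth": "bearer",
        "skills": ["translate"],
        "input_modes": ["text"],
    },
    {
        "name": "transcriber",
        "endpoint": "https://agents.example.com/transcriber",
        "auth": "bearer",
        "skills": ["transcribe"],
        "input_modes": ["audio"],
    },
]


def discover(skill: str) -> list[dict]:
    return [card for card in AGENT_CARDS if skill in card["skills"]]


def negotiate(card: dict, input_mode: str) -> bool:
    return input_mode in card["input_modes"]


def delegate(card: dict, payload: str) -> dict:
    # Stand-in for an authenticated network call to card["endpoint"].
    return {"agent": card["name"], "status": "completed", "output": f"handled: {payload}"}


def collaborate(skill: str, input_mode: str, payload: str) -> dict:
    for card in discover(skill):                # 1. Discover
        if negotiate(card, input_mode):         # 2. Negotiate
            return delegate(card, payload)      # 3. Delegate, 4. Return
    return {"agent": None, "status": "no_match", "output": None}
```

Discovery and negotiation rely only on what the card declares, so the caller never needs to know how the remote Agent is implemented.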

MCP vs. A2A: MCP solves "how an Agent uses tools" (vertical integration), A2A solves "how an Agent finds another Agent" (horizontal collaboration).

4. Architectural Patterns for Multi-Agent Systems

The core differences between the three architectural patterns can be summarized as "who calls the shots" and "how Agents communicate with each other." The diagram below shows the three common patterns side-by-side:

image.png

Diagram Guide:

Architectural Pattern | Structure | Characteristics | Typical Representative
Hierarchical | Manager Agent decomposes tasks and assigns them to worker Agents, results are aggregated and reported upward | Similar to corporate management structure, most common | AutoGen
Peer-to-Peer | All Agents have equal status, communicate and negotiate directly | Decentralized, flexible | CrewAI
Hybrid | Macro-level hierarchical management, local peer-to-peer collaboration | Combines advantages of both |

Relationship of the Three: Tools are the hands, MCP is the interface standard, A2A is the network protocol. Moving from monolithic capability to a collaborative network.

5. Consensus: The Semantic Foundation for Agent Collaboration

Protocols (A2A) solve the problem of Agents "being able to communicate," but not "being able to align"—different Agents may have completely different understandings of the same thing. This is the problem that consensus aims to solve.

What is Consensus?

In the absence of a global command, the ability for multiple entities to form a consistent judgment on the same matter. Humans achieve it through language, history, culture, and tacit understanding; Agents require explicit alignment mechanisms.

Three Levels of Consensus in AI Agents:

Level | Scenario | Core Mechanism | Current Maturity
Single Agent Internal | Maintaining consistent judgment standards across conversations | Skill solidifies logic, memory retains historical references | ✅ Relatively mature
Between Multiple Agents | Agents with different roles aligning on goals and standards | Orchestrator Agent enforces allocation; or peer-to-peer consensus negotiation via shared context | ⚠️ Hierarchical is mature, peer-to-peer is still early-stage
Between Human and Agent | Human needs correctly understood by Agent, Agent output recognized by human | Human-in-the-loop intervention and correction + jointly aligned judgment benchmarks | ⚠️ Fallback solutions are mature, deep alignment is still a bottleneck

Consensus is a hurdle for Agents moving from "usable" to "reliable."

A single Agent can output stably with Skills and memory, which is relatively easy. But how multiple Agents self-negotiate to reach agreement without a unified brain, and the deep semantic alignment between humans and Agents, are the core bottlenecks in current engineering.

Essentially, the "things humans are responsible for" in human-machine collaboration is a continuous process of reaching consensus with AI—humans clarify needs to the Agent, the Agent explains its output to the human, and each round of interaction narrows the consensus gap. When consensus breaks, human fallback (Chapter VIII) is the final alignment mechanism.

This is precisely the meaning of "becoming someone who masters AI"—it's not about throwing a task at the Agent and being done, but continuously calibrating consensus in every round of alignment.


VI. Skill: Reusing Experience

Encapsulate reusable capabilities into Skills to avoid starting from scratch every time.

1. Structure of a Skill

A standard Skill typically contains three parts:

Component | Content | Role
Metadata | name, description, tags, version, author | Allows the Agent to discover and decide whether to activate
Instructions | Role setting, workflow steps, constraints, output specifications | Guides the Agent on how to execute the task
Resources | Template files, reference documents, sample code, data sources | Provides materials for execution
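The three parts map naturally onto a small data structure. The `Skill` dataclass and the tag-matching `select_skills` function below are hypothetical; in a real Agent the model itself usually decides whether to activate a Skill based on its name and description.

```python
from dataclasses import dataclass, field


@dataclass
class Skill:
    # Metadata: what the Agent reads to decide whether to activate the Skill.
    name: str
    description: str
    tags: list[str] = field(default_factory=list)
    version: str = "0.1.0"
    # Instructions: how to carry out the task once activated.
    instructions: str = ""
    # Resources: paths only, so materials are loaded on demand.
    resources: list[str] = field(default_factory=list)


def select_skills(task: str, skills: list[Skill]) -> list[Skill]:
    """Naive activation: match words in the task against each Skill's tags."""
    words = set(task.lower().split())
    return [s for s in skills if words & {t.lower() for t in s.tags}]


code_review = Skill(
    name="code-review",
    description="Review a pull request against the team checklist.",
    tags=["review", "pr"],
    instructions="1. Read the diff. 2. Check tests. 3. Summarize risks.",
    resources=["templates/review_checklist.md"],
)
data_report = Skill(
    name="data-report",
    description="Produce a weekly metrics report.",
    tags=["report", "metrics"],
)
selected = select_skills("Please review this PR", [code_review, data_report])
```

Keeping metadata separate from instructions means only the short description has to sit in context until a Skill is actually needed.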

2. Problems Solved by Skills

Problem 1: Repetitive Labor

In every conversation, the Agent understands the task from scratch, repeatedly consuming tokens for the same process. Skills solidify the "how-to," encapsulating once for multiple reuses.

Problem 2: Unstable Quality

The output quality of the Agent across different scenarios is highly dependent on prompt quality. Skills standardize best practices to ensure stable output.

Problem 3: Difficulty in Accumulating Experience

Human experience disappears with the conversation, and a team's best practices cannot be passed on. Skills turn tacit experience into explicit knowledge, making it an asset that can accumulate.

Problem 4: Difficulty in Cross-Tool Migration

Different platforms (Claude Code, Cursor, DeepSeek) have different interaction methods. Skills enable knowledge to flow across tools through a unified specification.

Difference from Memory: Memory stores "what happened" (e.g., user preferences, history logs), Skills store "how to do it" (e.g., code review process, data analysis framework).


VII. Orchestration Framework Implementation: From Thought to Engineering

Thinking methods are the "thought," orchestration frameworks are the "tool." This section uses LangChain's official Deep Agents as the main thread, mapping "thinking methods / memory / Skills / capability collaboration" one-to-one onto engineering components.

1. Three-Layer Architecture

Layer | Representative | Core Capabilities | What It Solves
Runtime | LangGraph | Graph orchestration, persistence, state management, streaming output, human-in-the-loop | Complex process control and execution engine
Framework | LangChain | Model abstraction, create_agent (ReAct loop + tool invocation), tool interface, middleware | Standardization of single Agent basic capabilities
Harness | Deep Agents | Planning, virtual file system, sub-agents, memory, skills | Reliability for end-to-end complex tasks

The relationship between the three layers can be understood as "base → middleware → application suite": LangGraph is responsible for state flow and graph orchestration, LangChain packages LLM + tools into a standard Agent, and Deep Agents provides out-of-the-box complex task capabilities like long tasks, memory, file systems, and sub-agents on top of the first two layers.

image.png

Diagram Guide:

Deep Agents is not a replacement for LangGraph, but an "application suite" built on top of the LangGraph runtime + LangChain framework. Simple tasks can use just the LangChain single layer; end-to-end complex tasks require Deep Agents.

How LangGraph Builds a Graph

image.png

Core Capabilities of LangChain create_agent

LangChain's create_agent series of functions (create_react_agent, create_tool_calling_agent, etc.) are factory functions that package "one LLM + a set of tools" into a runnable Agent. Their core responsibilities are:

  1. Bind Tools: Inject functions/packaged tools into the scope callable by the LLM.
  2. Construct Prompt Templates: Assemble the system prompt, role setting, and tool descriptions in a fixed format to feed to the model.
  3. Implement the ReAct Loop: Let the model output Thought in each round, then decide which tool's Action to call, get the Observation, and enter the next round—this is the Thought → Action → Observation loop discussed in Chapter III.
  4. State Flow: Maintain multi-turn conversation state, re-injecting each tool's return result back into the context until the exit condition is met and a final answer is given.

In one sentence: create_agent is not a specific thinking method, but a "standard Agent launcher" that packages ReAct thinking method + tool invocation + state maintenance into a single line of code to start. It eats a model and tools, and spits out an Agent capable of cyclic reasoning-action. LangGraph then goes one level up: turning this cyclic link into a visual state graph node, supporting branching, concurrency, persistence, and human-in-the-loop.

2. Deep Agents' Four Capability Pillars

Capability Pillar | Key Components | Corresponding Principles in This Article
Execution Environment | Virtual File System, Tools/MCP, Code Sandbox, Streaming Output | Action Layer, Capability Collaboration
Context Management | Skills, Long-term Memory, Summarization & Context Offloading, Prompt Caching | Memory, Skill
Delegation | write_todos, task | PlanExe, Multi-Agent
Control | interrupt_on, File System Permissions | Human Fallback

3. Virtual File System

Traditional Agents stuff large chunks of information into the prompt, causing context bloat. Deep Agents uses a file system for Context Engineering: letting the Agent read on demand and store in categories, rather than spreading all materials on the table at once.

The documentation currently mentions three types of mechanisms:

  1. Six File Operations: ls, read_file, write_file, edit_file, glob, grep — these are the atomic commands for the Agent to interact with the virtual file system.
  2. Automatic Offloading of Large Results: When the content returned by a tool call exceeds a token threshold, the full content is written to the file system, and the conversation history only keeps the file path + content preview, preventing prompt explosion.
  3. Automatic History Summarization: When the context reaches the window limit and there is no offloadable content, a summary is generated to replace the original conversation, and the original conversation is written to the file system for retention.
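The offloading mechanism (item 2) can be sketched in plain Python. This is not the Deep Agents implementation: the `VirtualFS` class, the `record_tool_result` helper, the path layout, and the character-based threshold (used here instead of a token count) are all assumptions for illustration.

```python
class VirtualFS:
    """In-memory stand-in for a virtual file system."""

    def __init__(self) -> None:
        self.files: dict[str, str] = {}

    def write_file(self, path: str, content: str) -> None:
        self.files[path] = content

    def read_file(self, path: str, offset: int = 0, limit: int = 2000) -> str:
        # Read a slice so the Agent can page through large files on demand.
        return self.files[path][offset:offset + limit]


def record_tool_result(
    history: list[str],
    fs: VirtualFS,
    call_id: str,
    result: str,
    max_chars: int = 200,      # Hypothetical threshold, in characters not tokens.
    preview_chars: int = 40,
) -> None:
    """Keep small results inline; offload large ones and keep only a pointer."""
    if len(result) <= max_chars:
        history.append(result)
        return
    path = f"/tool_results/{call_id}.txt"
    fs.write_file(path, result)
    history.append(f"[offloaded to {path}] preview: {result[:preview_chars]}...")


fs = VirtualFS()
history: list[str] = []
record_tool_result(history, fs, "call_1", "short answer")
record_tool_result(history, fs, "call_2", "x" * 5000)
```

After the second call the history holds a short pointer line while the full 5,000 characters sit in the file system, retrievable in slices through `read_file`.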

Security Isolation: Execution Sandbox on Top of the File System

The virtual file system solves "how to manage context," but Agents often need to execute code, call command lines, or access networks, so the question of "is execution safe" must be addressed. In engineering, sandboxes are typically chosen in layers based on isolation strength:

image.png

Selection Logic: The virtual file system is the Agent's "work desktop," and sandbox isolation is the "protective shield around the workbench." The two are complementary—the file system is responsible for context organization, the sandbox for execution security. Ordinary document processing may not need a sandbox; but once an Agent needs to execute user-submitted code, access external networks, or operate on sensitive data, the runtime environment must be placed in a sandbox.

The virtual file system is the infrastructure for context management and is the most essential upgrade of Deep Agents over a regular ReAct loop.

4. Task Planning write_todos: Engineering Plan-and-Execute

Mechanism: The write_todos tool is automatically injected when calling create_deep_agent(), requiring no manual configuration. Each task contains three fields: subject (title), description, and status. Status flows linearly: pending → in_progress → completed.

Three Execution Phases: Make a plan (all pending) → Execute step-by-step (mark status) → Dynamic adjustment (new tasks can be added/adjusted if new needs are discovered during execution).

Correspondence to Chapter III Thinking Methods: Its essence is Plan-and-Execute (PlanExe), but it is not a strict two-phase separation—the Agent can modify the plan during execution, making it "PlanExe with dynamic adjustment capability" (echoing the replan mechanism mentioned earlier).

Key Design: The task list is persisted in Agent State, not in the conversation history. This means even if the conversation history is summarized and compressed, the list remains intact—it acts as the Agent's "North Star," solving the problem of "forgetting the goal midway" during long tasks.
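A minimal sketch of this design, assuming the three fields and the status flow described above. `Todo`, `AgentState`, `advance`, and `summarize_history` are hypothetical names for illustration and are not the library's own types. The point is that the task list sits beside the message history rather than inside it.

```python
from dataclasses import dataclass, field

# Linear status flow described in the text.
NEXT_STATUS = {"pending": "in_progress", "in_progress": "completed"}


@dataclass
class Todo:
    subject: str
    description: str
    status: str = "pending"


@dataclass
class AgentState:
    messages: list[str] = field(default_factory=list)  # conversation history
    todos: list[Todo] = field(default_factory=list)    # plan, stored separately


def advance(todo: Todo) -> None:
    """Move a task one step forward; completed tasks cannot advance."""
    if todo.status not in NEXT_STATUS:
        raise ValueError(f"cannot advance a task in status {todo.status!r}")
    todo.status = NEXT_STATUS[todo.status]


def summarize_history(state: AgentState) -> None:
    """Stand-in for history compression: only messages are touched."""
    state.messages = [f"(summary of {len(state.messages)} earlier messages)"]


state = AgentState()
# Phase 1: make a plan (all pending).
state.todos = [
    Todo("Collect sources", "Search and save raw material"),
    Todo("Write report", "Draft from saved notes"),
]
state.messages += ["user: write a report", "tool: ...long search output..."]
# Phase 2: execute step by step.
advance(state.todos[0])
advance(state.todos[0])
advance(state.todos[1])
# Phase 3: dynamic adjustment, a new task discovered mid-run.
state.todos.append(Todo("Verify citations", "Added after spotting a gap"))
summarize_history(state)
statuses = [t.status for t in state.todos]
```

After `summarize_history`, the two messages have collapsed into one summary line, but all three tasks and their statuses are untouched.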

5. Sub-Agent task: Solving Bloat with Isolated Context

Mechanism: A built-in task tool allows the main Agent to dispatch sub-tasks to specialized sub-agents for execution.

Core Problem Solved: Context window bloat. Each sub-agent has its own independent context window, executes autonomously, and returns only a single final report to the main Agent. All intermediate searches, file reads, and trial-and-error stay isolated within the sub-agent's own context and do not pollute the main Agent's.

Correspondence to Chapter III Thinking Methods: Multi-Agent (each does its own job). The main Agent is responsible for orchestration, sub-agents for specialized execution, with context naturally isolated.

Insight: This is the implementation of "divide and conquer" on Agents—the main Agent's context remains lean (only holding plans and result summaries), while heavy exploration is thrown to sub-agents. Sub-agents are another sharp tool for context management, complementary to file system offloading: the file system offloads "data," sub-agents offload "process." The tutorial also mentions "async subagents" for further parallelization.
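The isolation idea can be sketched in a few lines. This is an illustration only: `run_subagent` is a hypothetical helper, and the `steps` callables stand in for the model and tool calls a real sub-agent would make.

```python
from typing import Callable

Step = Callable[[list[str]], str]


def run_subagent(task: str, steps: list[Step]) -> str:
    """Run a sub-task in a private context and return only a final report."""
    context: list[str] = [f"task: {task}"]  # private to this sub-agent
    for step in steps:
        context.append(step(context))       # intermediate work stays here
    # Only the last result leaves the sub-agent; the rest is discarded.
    return f"report for '{task}': {context[-1]} ({len(context) - 1} internal steps)"


main_context = ["user: compare frameworks A and B"]
steps: list[Step] = [
    lambda ctx: "search: 40 raw results about A and B",
    lambda ctx: "read: 3 long documents",
    lambda ctx: "conclusion: A suits prototypes, B suits production",
]
report = run_subagent("compare A and B", steps)
main_context.append(report)
```

The main context grows by exactly one entry, the report, no matter how many internal steps the sub-agent took.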

6. Skills: Progressive Disclosure + Cross-Tool Standard

Specification: Deep Agents' Skill format is evolving towards an open specification (like the SKILL.md convention: YAML frontmatter metadata + Markdown instruction body). This concept is similar to skills/instructions files in tools like Claude Code, OpenAI Codex, and Cursor, but the ecosystem is still evolving, and the specific schema of each platform should be checked during actual cross-tool migration.

Progressive Disclosure Three-Level Loading — This is the most core design decision for Skills:

| Level | Loaded Content | Timing | Cost |
| --- | --- | --- | --- |
| L1 Metadata | Only frontmatter (name + description) | Injected into system prompt at startup | ~hundreds of tokens for 20 Skills |
| L2 Instructions | Full SKILL.md body | Loaded only after Agent matches via description | On-demand |
| L3 Resources | Files under references/, assets/ | Read at the LLM's discretion when referenced by instructions | On-demand |

Key Point: description is the sole basis on which the Agent decides whether to activate a Skill; the Agent does not read the body in advance to match. As a result, startup overhead grows only linearly with the number of Skills, at a small metadata cost per Skill, so the Skill library can keep growing without crowding the context window.

Analogy: From the original tutorial—"Skills are to AI Agents what npm packages are to Node.js." Tools are atomic operations (search once, read a file); Skills are packaged reuse of "multi-step workflows + domain knowledge + template resources." This is precisely the engineering implementation of "experience reuse" from Chapter VI.

7. Long-Term Memory: memory.md + LangGraph Store

Mechanism: Declare memory file paths (e.g., memory.md, preferences.md) via the memory= parameter. The Agent automatically loads them into the system prompt at startup. Memory writes go to the /memories/ path and are persisted to the LangGraph Store via StoreBackend (InMemoryStore for development, PostgresStore for production), retained across sessions.

Naming Note: Deep Agents documentation examples might use AGENTS.md as the memory/preference file. To avoid confusion with the "AGENTS.md global Agent configuration protocol," this article uses memory.md as an example. If your project's Deep Agents template indeed uses AGENTS.md, treat it as the persistent memory file convention under that framework, which is not the same thing as the global configuration protocol.

Self-Updating: When the Agent learns new information during a conversation, it uses the built-in edit_file to update the memory file, and the changes persist to the next conversation—the Agent can "self-evolve" and develop its own professional capabilities.

Isolation: Isolated by namespace via assistant_id (Agent-level), user_id (User-level), org_id (Organization-level), supporting multi-user isolation and organization-level sharing.

Relationship of the Three (sharing the same set of file operation interfaces, differentiated by path prefix and backend):

| Dimension | Virtual File System workspace | Skills | Long-term Memory |
| --- | --- | --- | --- |
| Storage Backend | StateBackend | StoreBackend | StoreBackend |
| Lifecycle | Within a single conversation | Persistent across conversations | Persistent across conversations |
| Content Nature | Temporary working files | Procedural memory (how to do) | Semantic memory (what to know) |

Insight: Memory, Skills, and File System share the same set of read/write/edit interfaces, differing only in storage backend and path prefix. This unified abstraction is the brilliance of Deep Agents' design—Chapter IV talks about "Memory," Chapter VI talks about "Skill," but in engineering, they are actually three usages of the same file system.
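One way to picture "same interface, different backend" is a router that picks a backend by path prefix. The sketch below is an assumption-laden illustration, not the Deep Agents backend classes. `DictBackend`, `RoutedFS`, `new_conversation`, and the `/skills/` and `/workspace/` prefixes are hypothetical, and only `/memories/` comes from the text above.

```python
class DictBackend:
    """Toy storage backend backed by a dict."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.files: dict[str, str] = {}

    def write(self, path: str, content: str) -> None:
        self.files[path] = content

    def read(self, path: str) -> str:
        return self.files[path]


class RoutedFS:
    """One read/write interface; the path prefix decides where data lives."""

    def __init__(self, default: DictBackend, routes: dict[str, DictBackend]) -> None:
        self.default = default
        # Check longer prefixes first so the most specific route wins.
        self.routes = sorted(routes.items(), key=lambda kv: len(kv[0]), reverse=True)

    def _backend(self, path: str) -> DictBackend:
        for prefix, backend in self.routes:
            if path.startswith(prefix):
                return backend
        return self.default

    def write_file(self, path: str, content: str) -> None:
        self._backend(path).write(path, content)

    def read_file(self, path: str) -> str:
        return self._backend(path).read(path)


persistent = DictBackend("store")  # survives across conversations


def new_conversation() -> RoutedFS:
    # Each conversation gets a fresh ephemeral backend but shares the persistent one.
    return RoutedFS(DictBackend("state"),
                    {"/memories/": persistent, "/skills/": persistent})


fs1 = new_conversation()
fs1.write_file("/workspace/draft.txt", "temp notes")
fs1.write_file("/memories/memory.md", "User prefers tables")

fs2 = new_conversation()
remembered = fs2.read_file("/memories/memory.md")
draft_survived = "/workspace/draft.txt" in fs2.default.files
```

The second conversation can still read the memory file, while the workspace draft from the first conversation is gone, which mirrors the Lifecycle row of the table.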

8. Orchestration Framework Panorama and Selection

Beyond the official main thread of LangChain/LangGraph/Deep Agents, there are more orchestration frameworks in the ecosystem:

| Framework | Positioning | Suitable For |
| --- | --- | --- |
| LangGraph | Runtime engine, graph orchestration | Developers needing ultimate controllability |
| LangChain | Framework building blocks, single Agent capability | Developers building custom Agents |
| Deep Agents | Application suite, out-of-the-box | Teams needing reliable implementation of complex tasks |
| Dify | Low-code visual orchestration | Rapid prototyping, business validation |
| AutoGen Studio / Flowise, etc. | Low-code / visual platforms | Writing less code, rapid prototyping |
| Claude Desktop / Claude Code | Out-of-the-box Agent experience | End-users directly using Agents |

Framework selection is also problem-driven: For ultimate flexibility and control → LangChain/LangGraph; for rapid implementation and lower barrier → Dify; for out-of-the-box → platform-type products. Three layers are not always better—simple tasks can use just the LangChain single layer; end-to-end complex tasks require a Harness like Deep Agents. Blindly adopting high-level architecture only introduces unnecessary complexity.

Back to "Human vs. Agent": Every capability of Deep Agents compensates for a human cognitive shortcoming: write_todos for "easily losing track in long tasks," sub-agents for "limited attention," the virtual file system for "small working memory capacity," and Skills for "difficulty in passing on experience." In engineering terms, these mechanisms externalize good human mental habits into structures that machines won't forget.

9. Extension: OpenViking—ByteDance's Open-Source Agent Context Database

Volcano Engine (ByteDance) open-sourced OpenViking, which is highly aligned with the Deep Agents philosophy: it is not a vector database, but a context database for AI Agents, solving the problem of "how to uniformly organize, load on-demand, and self-iterate" Agent context.

The core difference in one sentence:

Vector databases (like Milvus, Pinecone, VikingDB) solve "how to store vectors and retrieve them quickly"; OpenViking solves a higher-level problem—how Agent context is managed like a file system. The two have a "file system" to "hard drive" relationship.

Five Core Features (another industrial implementation of nearly the same philosophy as Deep Agents):

| Core Feature | Description | Corresponding Deep Agents Mechanism |
| --- | --- | --- |
| 📁 File System Management Paradigm | Maps memory, resources, and skills uniformly to a viking:// virtual directory, using ls / find for standardized location and management. | Virtual File System (VFS) |
| 🧠 Layered Context Loading (L0/L1/L2) | Adopts a pre-generated hierarchical summary strategy: loads on-demand from overview (L0) to details (L2), significantly reducing token consumption. | Skill Progressive Disclosure (L1/L2/L3) |
| 🔍 Directory Recursive Retrieval | Follows a high-precision retrieval link of "intent analysis → directory location → vector retrieval → sub-directory drill-down → result aggregation." | Context Offloading + RAG |
| 👁️ Visual Retrieval Trajectory | Completely records and displays the directory browsing and file location path, achieving observability and debuggability of the retrieval process. | File System Natural Path Tracking |
| 🔄 Automatic Session Management & Self-Iteration | Asynchronously analyzes execution results and user feedback after a session ends, automatically updating user profiles (memory.md) and the Agent's experience library. | Long-Term Memory Storage (LangGraph Store) |

Relationship with VikingDB: VikingDB is ByteDance's cloud vector database service. OpenViking can use it as a storage base—the open-source version can run locally, and the commercial version leverages VikingDB for large-scale storage and high-performance retrieval. This reaffirms: Vector databases are infrastructure; context databases are higher-level Agent infrastructure.

Endorsement and Performance (according to Volcano Engine's official materials; early data, for reference only): OpenViking open-sources a core subset of the capabilities described in the VLDB 2026 paper "VikingMem: A Memory Base Management System for Stateful LLM-based Applications." The official materials claim that accuracy on the LoCoMo user memory benchmark improved from 57.21% to 80.32%, with token consumption reduced by 63.2%.

One-Sentence Summary: OpenViking and Deep Agents arrive at the same destination by different routes—both are using the "file system paradigm + layered on-demand loading + memory self-iteration" to answer the same question: For a long-running Agent, how should context be managed? Deep Agents is LangChain's official suite (deeply integrated with the LangGraph runtime), while OpenViking is ByteDance's independent open-source implementation (multi-model Provider, locally runnable, academically grounded). Two routes validate the same thing: context engineering is becoming the new infrastructure of the Agent era.

Deep Agents vs. OpenViking Comparison Diagram — Two implementations, one set of context engineering concepts:

image.png


VIII. Implementation Scenarios: Empowerment and Replacement, Human Fallback

Core Judgment: Wherever there is a need for "people" and "processes," MLLMs can be used for empowerment and replacement.

Division of Labor Boundary: Agents empower and replace; humans are responsible for fallback.

This is the key design when implementing the entire architecture: Agents can autonomously complete a large amount of execution work, but at the level of "what to do" and "is it right," humans remain irreplaceable decision-makers.

Why is Human Fallback Irreplaceable?

Because every person is unique. Everyone's cognition, experience, and values are different, and these differences shape their individual needs and value judgments. An Agent can efficiently execute the "how," but cannot replace a human in deciding "what" to do and whether it is "done well"—the latter requires a person's unique understanding of their own situation and independent judgment of quality.

Human fallback is not a technical compromise, but a confirmation of "human irreplaceability." This is consistent with the discussion on "consensus" in Chapter V—humans and Agents continuously align their understanding in collaboration, with humans ultimately steering the direction and quality.

Engineering Implementation: LangGraph provides the interrupt primitive, and Deep Agents exposes interrupt_on together with file system permissions as human intervention points. It is recommended to explicitly set Human-in-the-loop for high-risk steps (transfers, publishing, deletion, external commitments), rather than just emphasizing "human fallback" at the policy level.
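The recommendation can be sketched as an approval gate in front of tool execution. This is a plain-Python stand-in that does not use LangGraph's interrupt mechanism. `HIGH_RISK`, `execute_tool`, the tool names, and the `approve` callback are hypothetical.

```python
from typing import Any, Callable

# Assumption: the team lists its own high-risk tools explicitly.
HIGH_RISK = {"transfer_funds", "publish_post", "delete_records"}

Approver = Callable[[str, dict[str, Any]], bool]


def execute_tool(name: str, args: dict[str, Any],
                 tools: dict[str, Callable[..., str]], approve: Approver) -> str:
    """Run a tool, pausing for a human decision when the tool is high-risk."""
    if name in HIGH_RISK and not approve(name, args):
        return f"{name} blocked: human reviewer declined"
    return tools[name](**args)


tools: dict[str, Callable[..., str]] = {
    "search": lambda query: f"results for {query}",
    "transfer_funds": lambda to, amount: f"sent {amount} to {to}",
}


def deny_all(name: str, args: dict[str, Any]) -> bool:
    return False


def approve_all(name: str, args: dict[str, Any]) -> bool:
    return True


low_risk = execute_tool("search", {"query": "agents"}, tools, deny_all)
blocked = execute_tool("transfer_funds", {"to": "bob", "amount": 10}, tools, deny_all)
allowed = execute_tool("transfer_funds", {"to": "bob", "amount": 10}, tools, approve_all)
```

Low-risk tools run without interruption, while a high-risk call proceeds only when the reviewer callback returns True. In a real deployment the callback would surface the pending call to a person rather than return a constant.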


Conclusion

1. In this new Age of Exploration, become someone who masters AI (human-machine coordination), not someone who is replaced.

The essence of an AI Agent is an "intelligent agent that thinks like a human," but it is ultimately a tool. What truly determines value is whether one can use the Agent well—human-machine coordination, not human-machine confrontation.

2. Every round of technological change brings new productivity, as well as new opportunities and jobs.

From classifiers to LLMs, from single Agents to Multi-Agents, every paradigm shift has eliminated a batch of old jobs and created a batch of new ones (Prompt Engineers, Agent Orchestrators, Skill Designers…). The trend is irreversible, but there are always opportunities within the trend—the key is not to resist change, but to stand on the favorable side of change.

3. The Future is Now: Four Major Evolutionary Directions

Looking back and forward from 2026, the technical architecture of AI Agents is evolving in the following directions: