The Agent Anatomy: How MLLMs, Memory, and Multi-Agent Loops Build AI That Thinks Like Us

AI Agent Composition: An Intelligent Agent That Thinks Like a Human

© 2026 by ethan.tan (Tan Ming) · All Rights Reserved · Illustrated First Edition · 2026.07.02

One-Liner: The goal of an AI Agent is to enable AI to perceive, think, remember, and act like a human, autonomously solving our various problems.

Era Judgment: MLLMs (Multimodal Large Models) are now powerful enough, reaching professional levels in certain domains. The conclusion is simple—embrace AI, embrace change.

Global Architecture Diagram

Before diving into each layer, here is a global architecture overview—centered on the MLLM, it maps perception, brain, thinking methods, memory, actions, capability collaboration, and Skills onto a single diagram to show how they connect into an agent that thinks like a human:

Diagram Guide (from top-left to bottom-right, inside-out):

The Brain is the Center: The MLLM acts as the "CPU," handling reasoning and planning—it is the hub of everything.
Cognitive Loop: External World → Perception → Brain → Action → Influence the World. Action results are deposited into memory, and memory feeds back into the brain—forming a continuous cycle.
Two Supports for the Brain: One relies on the "Model Base" for computing power, the other on "Thinking Methods" to organize the logic of reasoning and action. CoT is the underlying foundation for both.
Memory Spans Across: Short-term ↔ Long-term communication, read and written by the brain; RAG solves the problem of "knowledge not fitting."
Capability Collaboration is External: Tools/MCP/A2A allow the brain to act, go online, and collaborate with other Agents.
Skill Injection: Reusable experience is solidified into skill packages, ready to be injected into the brain for reuse.
Orchestration Framework as the Foundation: The three-layer stack of LangGraph → LangChain → Deep Agents engineers the above concepts into reality.

This diagram is the "map" of the entire article—each subsequent chapter is a detailed breakdown of a specific block in the diagram. It is recommended to understand this master diagram first before diving into the details of each layer.

Introduction: A New Computer Organization
I. Perception and Action: Input and Output Layers
II. The Intelligent Brain: Computing and Model Base
III. Thinking Methods: Control Flow
IV. Memory Layer: Storage and Retrieval
V. Capabilities, Collaboration, and Connection: Peripherals and Bus
VI. Skill: Reusing Experience
VII. Orchestration Framework Implementation: From Thought to Engineering
VIII. Implementation Scenarios: Empowerment and Replacement, Human Fallback
Conclusion

Introduction: A New Computer Organization

The goal of an AI Agent is to enable AI to perceive, think, remember, and act like a human. A regular LLM is a passive tool that "answers one question at a time"; an Agent is an active executor that can plan autonomously, call tools, maintain persistent memory, and adjust dynamically. The difference lies in having "hands and feet" (actions/tools), a "memory" (memory), and "methodology" (thinking methods).

The best way to understand an Agent is to view it as a "new computer", while referencing two frames of reference: humans (the object of imitation) and traditional computers (the engineering carrier).

Human	Traditional Computer	AI Agent	Essence
Eyes / Ears (Senses)	Input Devices (Keyboard/Mouse)	Perception Layer (NLP/CV/ASR)	Receiving external info
Hands / Mouth (Body & Language)	Output Devices (Monitor/Printer)	Action Layer (Virtual Output/Devices/Robots)	Acting on the external world
Brain	CPU	Intelligent Brain (LLM/MLLM)	Core computing & reasoning
Thinking Style / Methodology	Controller / Program	Thinking Methods	Organizing logic for reasoning & action
Memory (Short-term/Long-term)	RAM / Hard Drive	Memory Layer	State & knowledge storage
Tools / Toolbox	Peripherals / Bus	MCP / Tools	Connecting external tools & systems
Language / Collaboration	Network Protocol	A2A (Agent to Agent)	Inter-agent communication
Professional Skills / Experience	Software Libraries / Experience	Skill	Experience reuse

This table is the key to the entire article: Every layer of the Agent has a corresponding counterpart in humans. Each subsequent chapter will return to this "Human vs. Agent" main thread.

The Four-Module Cognitive Cycle: The Mainstream Theoretical Framework in the Industry

The above mapping has theoretical backing from academia and industry. Mainstream LLM Agent architecture surveys align with the "Perception-Brain-Action-Memory" four-module framework. This article uses that as a skeleton, separates "Thinking Methods" into its own chapter, merges "Action" into the perception layer, and expands it into a complete structure:

Perceive Environment → Think → Take Action → Form Memory → (Use memory to guide the next round of thinking and action) → Loop

Perception — The Agent's "five senses": Captures information from user commands, files, databases, API return results, and even raw data from cameras/microphones, transforming it into structured information the brain can understand.
Brain — The Agent's "central nervous system", with the LLM at its core: Responsible for reasoning and planning, understanding the user's ultimate intent, and decomposing complex tasks into executable sub-tasks.
Action — The Agent's "hands and feet": Interacts with the external world by calling tools (search, calculation, code execution, robot control, etc.).
Memory — The key to the Agent's learning and evolution: Short-term memory stores the current task context; long-term memory stores cross-task knowledge, experience, and user preferences.

The four modules form a continuous cognitive loop, which is the fundamental difference between an Agent and a "one-shot Q&A":

The Essence of Architectural Evolution: Agent architecture has evolved from a "single model wrapper" into a modular system. The core idea is to borrow from human cognitive models, decoupling capabilities into modules that are both independent and collaborative. Below, we break it down layer by layer, centered on the MLLM.

I. Perception and Action: Input and Output Layers

For a person to do things, they must first "perceive" the environment (see, hear, read) and then "act" to produce results (speak, write, operate). These two layers form the interaction boundary between the Agent and the world.

1. Perception Layer (Input)

Perception is the entry point for the Agent to receive external information. The three types of input modalities correspond directly to the mechanization of three human senses:

Human Sense	Agent Perception	Capability
Eyes (Seeing)	Vision (Image/Picture Input)	Image understanding, OCR document recognition
Ears (Hearing)	Auditory (Voice/Audio Input)	ASR speech-to-text
Reading Text/Speaking	NLP (Natural Language Input)	Text intent parsing, currently the most mature channel

Unified Representation of Multimodal Information — The primary task of the perception module is to unify heterogeneous data sources into a form the brain can process:

Information Sources: Text (commands/web pages/documents/code), Images (diagrams/photos/UI screenshots), Audio (voice/environmental sounds), Video (dynamic streams of image+audio), Structured Data (API JSON, database tables).
Unified Encoding: Each modality is converted into a unified high-dimensional vector (Embeddings) via a dedicated encoder—text uses Transformer, images use ViT, audio uses Whisper, etc. Unified vectors allow the brain to comprehensively understand different modalities within the same semantic space.

Key Technologies and Correspondences:

Technology	Corresponding Sense	Role
NLP	Reading/Speaking	Intent recognition, entity extraction, sentiment analysis, long-text understanding
CV	Seeing	UI operation Agent locating buttons/input boxes; robot obstacle recognition
ASR	Hearing	Voice interaction, key for intelligent customer service/smart homes
Multimodal Fusion	Synthesis	Achieves deep cross-modal correlation via Cross-Attention, producing a "1+1>2" effect

Trend: Multimodal fusion. A single MLLM consumes text, images, and voice simultaneously, avoiding the information loss of multi-model splicing. Humans don't split "seeing" and "hearing" into two independent systems, and neither should Agents.

Human vs. Agent: Human senses are naturally fused and come with common sense; an Agent's multimodality still requires deliberate splicing and easily loses cross-modal associations.

2. Action Layer (Output)

The Agent must be able to "do things." Categorized by the target of action, there are three types, with capabilities extending from virtual to physical:

Human Action	Agent Action	Capability
Writing/Drawing/Using a PC	Virtual Output	Content generation (text/image/video/files), browser automation
Using a Phone/Switching Appliances	Device Operation	Phone/PC control, smart home hardware manipulation
Physical Labor/Operating Machinery	Robotics	Software-hardware synergy to execute physical actions (Embodied Intelligence)

Progressive Expansion of Three Action Types:

Virtual Output: Content generation, browser automation.
Device Operation: Phone/PC control, smart home hardware manipulation.
Robotics: Software-hardware synergy to execute physical actions (Embodied Intelligence).

Tools: Infinite Expansion of Capabilities

The action layer is implemented as tool invocation. By combining tools, the Agent breaks through the LLM's own limitations to complete multi-step tasks. Common tool types:

Tool Type	Examples
Information Retrieval	Search engines, database queries, weather/stock/news APIs
Calculation & Analysis	Calculator, code interpreter, data analysis libraries
Content Generation	Image generation, speech synthesis
Application Control	Sending emails, creating calendar events, operating CRM
Physical World Interaction	Controlling robots, drones, smart homes

The three action types progressively extend from the digital world to the physical world. Embodied intelligence is the ultimate form of the Agent—enabling AI not just to "think online," but to "act offline."

Human vs. Agent: Humans far surpass current Agents in fine physical manipulation, but Agents already have an advantage in virtual output and parallel cross-device control.

II. The Intelligent Brain: Computing and Model Base

The brain is the core computing unit of the Agent. This layer is divided into two parts: computing power types and the model base.

1. Three Types of Computing Power: Evolution from Judgment to Multimodality

The "brain" has followed a clear evolutionary path—Classifier → LLM → MLLM. Each leap breaks through the bottlenecks of the previous stage.

Three Stages of Evolution:

Stage 1 · Classifier — Traditional machine learning, solving clearly bounded classification problems. Lightweight, deterministic, low cost, but each task requires specifically trained data and cannot generate or reason.

Stage 2 · LLM — A general reasoning and generation engine, serving as the Agent's "main brain." One model handles thousands of tasks, but only understands text and cannot perceive multimodal information.

Stage 3 · MLLM — Unifies the processing of text, images, voice, and other modalities on top of the LLM. It is the evolutionary direction of the "all-around brain" and the center of this article's architecture.

Evolution Comparison Table:

Stage	Representative	Approximate Time	Breakthrough	Limitation
Classifier	Traditional ML	1950s–2010s	Learned to categorize	Specialized, needs retraining, cannot generate
LLM	Large Language Model	2018–2022	General-purpose, understands intent	Only understands text, no perception
MLLM	Multimodal Large Model	2023–Present	Unified seeing, hearing, speaking, thinking	Current central node, continuously evolving

Correction Note: The LLM era began with GPT-1 and BERT in 2018; 2017 was the year the Transformer paper was published, serving as the foundation, not the era itself. The CoT paper was published in January 2022.

Brain Evolution Chain Visualization:

Key Milestones:

2012: AlexNet broke the performance ceiling of the classifier era.
2017: The Transformer paper was published, becoming the underlying architecture for large models.
2018: BERT / GPT-1 established the pre-training paradigm.
2020: GPT-3 validated "scale equals capability."
2022.01: CoT was proposed, the underlying cornerstone for all thinking methods in Chapter III.
2022.11: ChatGPT brought LLMs to the masses.
2023: GPT-4 / Gemini introduced multimodality, the first year of MLLM.
2024–Present: MLLMs are maturing, driving the full-scale implementation of Agents.

Classifiers solve "what is it," LLMs solve "how to do it," and MLLMs solve "all-around perception and decision-making." These three form an evolutionary chain of capability leaps.

Human vs. Agent: The human brain relies on intuition and common sense, is energy-efficient, and can extrapolate; Agents rely on statistical patterns, excelling in breadth and speed but weak in causal understanding and physical common sense.

Underlying Cornerstone: Chain of Thought (CoT)

Before diving into specific thinking methods, understand their common underlying technology—Chain of Thought (CoT). Proposed by Google researchers in January 2022, its core is to guide the LLM to generate a step-by-step reasoning process before answering, improving accuracy on multi-step logic problems.

Zero-shot CoT Example:

Q: There are 5 apples in a basket. Xiao Ming takes 2 and then puts 1 back. How many are there now?

A: Let's think step by step.

Initially 5 → Take away 2 leaves 3 → Put back 1 leaves 4 → Final Answer: 4

CoT provides structured expression for the Agent's thinking and is the foundation for subsequent complex thinking methods.

2. Model Base and Further Reading: "Building a Large Model from Scratch"

Model capability comes from two types of bases:

General Large Model Base: ChatGPT / DeepSeek / GLM / Kimi / Doubao, etc.—out-of-the-box general capabilities.
Proprietary Models: Built on top of a general base through pre-training + Supervised Fine-Tuning (SFT) to inject industry knowledge, suitable for specialized domains not well-covered by general models.

Selection Logic: Use a general base whenever possible; only use proprietary fine-tuning when vertical domain accuracy is insufficient.

To truly understand the "Classifier → LLM → MLLM" evolution chain and the internal structure behind the model base, further reading is recommended: "Building a Large Model from Scratch"—it breaks down how a large model is "stacked" step-by-step, from data preparation, architecture design, pre-training, and fine-tuning to instruction alignment. We've drawn the book's core thread into two flowcharts: The first clarifies the main "input to output" link, and the second clarifies "how capabilities are expanded."

Figure 1: Core Workflow of a Large Model

Key Takeaway: The essence of a large model is "slicing text into tokens, converting tokens into vectors, and then predicting one token at a time via autoregression."

III. Thinking Methods: Control Flow

This is the soul that distinguishes an Agent from a regular LLM.

A regular LLM is a "one-shot Q&A." An Agent is a cyclical "Reasoning → Action → Observation" process that can dynamically adjust based on intermediate results. The logic that determines "how to cycle" is the thinking method.

Four thinking methods have clear boundaries:

① ReAct

Mechanism: Thought → Action → Observation loop, deciding the next step based on the observation at each step. Jointly proposed by Princeton University and Google, it is currently the most widely used Agent thinking method, with the core being the combination of CoT and tool invocation.

Key Constraint: An exit condition must be determined in advance, otherwise it will fall into an infinite loop.

Advantages: Dynamic adaptation, explainable and controllable, strong error correction. When a step fails, the Agent can remedy it in the next round (re-search with different keywords, switch APIs).

Challenges: Requires multiple interactions with the LLM and tools, leading to higher latency and cost.

Suitable for: Tasks with strong exploratory nature and high uncertainty (open research, information retrieval, debugging).

Flowchart — Thought → Action → Observation loop, converging via an "exit condition":

② Plan-and-Execute

Mechanism: First make a global plan, decomposing the task into ordered steps, then execute step-by-step.

Characteristics: Good global perspective; high efficiency and low cost when tasks are clear.

Trade-off with ReAct: ReAct is locally flexible but may deviate from the global goal; PlanExe is globally clear but less flexible, and the plan may need adjustment if the environment changes during execution. Mature implementations usually include a replan mechanism.

Suitable for: Tasks with relatively standard processes and predictable steps.

Flowchart — Two-stage "plan first, then execute," with a replan correction loop:

③ Reflection

Mechanism: Generate an initial version → Identify flaws → Improve and optimize, iterating to enhance quality. Represented by Reflexion and LATS.

Characteristics: "Have it first, then make it good"—first solve "is there one," then solve "is it good."

Suitable for: Quality-oriented tasks that can be iteratively polished (code generation, copywriting, solution design).

Flowchart — Self-iterative loop of "Generate → Reflect → Improve":

④ Multi-Agent

Mechanism: One orchestrating Agent (Master/Orchestrator) dispatches multiple specialized sub-Agents. Its essence is a Multi-Agent System (MAS).

Why MAS is needed: ① Specialized division of labor; ② Tasks can be parallelized; ③ Scalable, failure of a single Agent does not crash the system; ④ Can simulate complex systems.

Flowchart — Orchestrator decomposes and dispatches, sub-Agents perform their roles, results are aggregated and returned:

Comparison and Selection of the Four Methods

Thinking Method	Core Logic	Globality	Advantages	Disadvantages	Suitable Scenarios	Analogy
ReAct	Step-by-step	Weak (Local)	Dynamic adaptation, explainable, strong error correction	High cost, high latency	Exploratory, uncertain tasks	Taking career planning one step at a time
PlanExe	Plan then execute	Strong (Global)	Structured, efficient when tasks are clear	Poor flexibility, hard to handle surprises	Standard processes, predictable tasks	Decompose first, then act
Reflection	Have then optimize	Medium (Iterative)	Self-learning iteration, high output quality	Further increases cost and latency	Quality-oriented, polishable tasks	Agile development iteration
Multi-Agent	Each does its job	Strong (Division of labor)	Specialized division, parallel, scalable	Complex coordination	Complex, cross-domain tasks	Team specialization

Selection Principle: The more uncertain the task → lean towards ReAct; the more standard the task → lean towards PlanExe; the higher the quality requirement → layer on Reflection; the higher the complexity → go Multi-Agent.

Combined Use in Practice: Complex systems can first use PlanExe to formulate a macro plan, use ReAct for the details of each macro step, and introduce Reflection after key nodes for checking.

Human vs. Agent: These four methods make explicit the unconscious thinking habits of humans. Humans excel at "metacognition," knowing what method they are using to think; an Agent's thinking method is still preset and requires human selection.

IV. Memory Layer: Storage and Retrieval

Memory is the Agent's state storage. Without memory, the Agent starts from scratch with every conversation, unable to learn or understand user preferences. It is divided into two levels by scope: short-term memory and long-term memory.

1. Two Levels by Scope

① Session-Level Memory

Short-term memory stores the current task context and disappears when the task ends. Its main form is conversation history.

Implementation: Directly utilizes the LLM's context window. When the conversation becomes too long, compression is needed:

Sliding Window: Only keep the most recent N turns.
Summarization: Periodically summarize the conversation, replacing verbose history with a summary.

② Cross-Session / Persistent Memory

Long-term memory stores information across tasks and sessions. The core technology is RAG. It comes in three deployment forms:

Personal Multi-Device Migration: Personal memory + Markdown files + memory decay mechanism.
Local Privacy Deployment: SQLite + Vector Retrieval.
Production-Grade Distributed: Ensures memory consistency in a distributed environment.

RAG (Retrieval-Augmented Generation)

The LLM's context window is limited and cannot hold all knowledge. RAG's solution is to attach an external knowledge base to the LLM: before generation, it retrieves the most relevant information from an external database and feeds it to the LLM as additional context. The mechanism is "fetch on demand," not "cram everything into the brain."

RAG's Four-Step Mechanism (using "user likes lattes" as an example):

Step	Action	Example
① Store	Convert long-term memory into high-dimensional vectors via an embedding model and store in a vector database	"I like lattes" → vector stored
② Retrieve	When relevant clues appear in a subsequent conversation, convert the question into a vector and perform a similarity search	"Recommend a coffee" → recall "likes lattes"
③ Augment	Use the retrieved memory as context and send it to the LLM along with the question	Known info: user likes lattes
④ Generate	The LLM generates a personalized response based on the augmented context	"Perhaps a classic latte is a good choice"

RAG Four-Step Mechanism Flowchart:

2. Storage Base

Traditional Storage: Markdown documents; databases like ES / Redis / PostgreSQL, etc.
RAG Vector Store: Vector databases for semantic retrieval.

Mainstream Vector Database Comparison (2026):

Database	Type	Core Advantages	Main Application Scenarios
Pinecone	Commercial Cloud Service	Fully managed, out-of-the-box, stable performance	Rapid prototyping, SMB applications
Milvus	Open Source	Distributed architecture, high scalability, rich features	Large-scale production environments, high scalability needs
Weaviate	Open Source	Multimodal support, built-in multiple Embedding models, GraphQL interface	Complex data types, multimodal retrieval applications
ChromaDB	Open Source	Lightweight, Python-native, developer-friendly	Local development, data science experiments, small applications
Redis	Open Source/Commercial	In-memory database, extremely low latency, versatile (combined with RediSearch)	Scenarios demanding extreme real-time performance, existing Redis systems

In practice, hybrid retrieval (vector + keyword) is often used to balance semantics and exact matching.

Human vs. Agent: Human memory has emotional weighting and associative triggers, and actively forgets irrelevant details; Agent memory relies on explicit storage and recall, precise and lossless but lacking emotion and contextual association.

Memory Dimension	Human	Agent	Comparison Point
Short-term	Working memory (approx. 7±2 items)	Session-level memory (context window)	Agent has larger capacity but easily overflows and loses info
Long-term	Experience, skills, emotional memory	Persistent memory (MD/SQLite/Vector DB)	Agent is precise and lossless; humans rely on associative reconstruction
Retrieval	Association + emotional trigger	RAG vector/keyword retrieval	Agent can recall fully; human recall rate is lower but relevance is higher
Forgetting	Active forgetting	Requires designed decay mechanism	Forgetting is a noise-reduction advantage for humans
Cross-device	Cannot be migrated	Multi-device sync	Agent is migratable

The memory layer is an area where Agents structurally surpass humans—precise, lossless, cross-device, migratable. The cost is the need to actively design a decay mechanism, otherwise "remembering too much" actually dilutes relevance.

V. Capabilities, Collaboration, and Connection: Peripherals and Bus

Perception allows the Agent to "input," the brain allows it to "think," and action allows it to "output." Connecting to external tools and collaborating with other Agents requires a connection layer.

Three-Layer Evolution Overview — Moving from "monolithic capability" to a "collaborative network," three types of infrastructure progressively expand the Agent's boundaries:

Connection Layer	Analogy	Role	Scope	Representative
Tools	Hand / Single Peripheral	Callable functions, execute atomic actions like queries and calculations, no unified standard	Monolithic Capability	Function Calling
MCP	USB Port / Bus	Tool side implements once per protocol, all Agents reuse, unified discovery and management	Tool Ecosystem Standardization	stdio / SSE
A2A	Network Protocol / Internet	Agent Card supports discovery, mutual trust, and cross-Agent collaboration	Collaborative Network	Agent Card

1. Tools

Specific functions the Agent can call to execute atomic actions like queries, calculations, and external operations. Without a unified standard, each Agent integrates independently, and the integration cost grows linearly with the number of tools.

2. MCP

MCP (Model Context Protocol), open-sourced by Anthropic in 2024, turns "tools" into "plug-and-play peripherals." The tool side implements an MCP Server once, and any MCP-supporting Agent can connect.

MCP Core Three Elements:

Element	Role
Resources	Readable data exposed to the Agent
Tools	Executable functions, actively called by the Agent
Prompts	Preset prompt templates, standardized interaction

Boundary Note: MCP Prompts are "reusable interaction templates" exposed outward by the tool side; Skills in Chapter VI are "workflow prompts + domain knowledge" internalized by the Agent.

3. A2A

A2A (Agent2Agent Protocol), proposed by Google, allows Agents from different vendors and frameworks to discover, negotiate, and delegate tasks to each other, building an "Internet of Agents."

Agent Card: Each Agent publishes a standardized card declaring its capabilities, endpoints, authentication methods, and supported inputs/outputs.

Four Steps of Collaboration: Discover → Negotiate → Delegate → Return.

MCP vs. A2A: MCP solves "how an Agent uses tools" (vertical integration), A2A solves "how an Agent finds another Agent" (horizontal collaboration).

Architectural Patterns for Multi-Agent Systems

The core differences between the three architectural patterns can be summarized as "who calls the shots" and "how Agents communicate with each other." The diagram below shows the three common patterns side-by-side:

Diagram Guide:

Hierarchical: A manager Agent sits in the center, responsible for task decomposition and result aggregation; clear structure, easiest to implement, similar to traditional management structures.
Peer-to-Peer: No fixed center among Agents; they negotiate directly with each other; flexible but consistency is hard to guarantee, suitable for open collaboration requiring frequent alignment.
Hybrid: Uses hierarchy at the macro level to control direction, while allowing worker Agents to collaborate as peers locally; large complex systems often use this "fractal" structure.

Architectural Pattern	Structure	Characteristics	Typical Representative
Hierarchical	Manager Agent decomposes tasks and assigns them to worker Agents, results are aggregated and reported upward	Similar to corporate management structure, most common	AutoGen
Peer-to-Peer	All Agents have equal status, communicate and negotiate directly	Decentralized, flexible	CrewAI
Hybrid	Macro-level hierarchical management, local peer-to-peer collaboration	Combines advantages of both	—

Relationship of the Three: Tools are the hands, MCP is the interface standard, A2A is the network protocol. Moving from monolithic capability to a collaborative network.

5. Consensus: The Semantic Foundation for Agent Collaboration

Protocols (A2A) solve the problem of Agents "being able to communicate," but not "being able to align"—different Agents may have completely different understandings of the same thing. This is the problem that consensus aims to solve.

What is Consensus?

In the absence of a global command, the ability for multiple entities to form a consistent judgment on the same matter. Humans achieve it through language, history, culture, and tacit understanding; Agents require explicit alignment mechanisms.

Three Levels of Consensus in AI Agents:

Level	Scenario	Core Mechanism	Current Maturity
Single Agent Internal	Maintaining consistent judgment standards across conversations	Skill solidifies logic, memory retains historical references	✅ Relatively mature
Between Multiple Agents	Agents with different roles aligning on goals and standards	Orchestrator Agent enforces allocation; or peer-to-peer consensus negotiation via shared context	⚠️ Hierarchical is mature, peer-to-peer is still early-stage
Between Human and Agent	Human needs correctly understood by Agent, Agent output recognized by human	Human-in-the-loop intervention and correction + jointly aligned judgment benchmarks	⚠️ Fallback solutions are mature, deep alignment is still a bottleneck

Consensus is a hurdle for Agents moving from "usable" to "reliable."

A single Agent can output stably with Skills and memory, which is relatively easy. But how multiple Agents self-negotiate to reach agreement without a unified brain, and the deep semantic alignment between humans and Agents, are the core bottlenecks in current engineering.

Essentially, the "things humans are responsible for" in human-machine collaboration is a continuous process of reaching consensus with AI—humans clarify needs to the Agent, the Agent explains its output to the human, and each round of interaction narrows the consensus gap. When consensus breaks, human fallback (Chapter VIII) is the final alignment mechanism.

This is precisely the meaning of "becoming someone who masters AI"—it's not about throwing a task at the Agent and being done, but continuously calibrating consensus in every round of alignment.

VI. Skill: Reusing Experience

Encapsulate reusable capabilities into Skills to avoid starting from scratch every time.

1. Structure of a Skill

A standard Skill typically contains three parts:

Component	Content	Role
Metadata	name, description, tags, version, author	Allows the Agent to discover and decide whether to activate
Instructions	Role setting, workflow steps, constraints, output specifications	Guides the Agent on how to execute the task
Resources	Template files, reference documents, sample code, data sources	Provides materials for execution

2. Problems Solved by Skills

Problem 1: Repetitive Labor

In every conversation, the Agent understands the task from scratch, repeatedly consuming tokens for the same process. Skills solidify the "how-to," encapsulating once for multiple reuses.

Problem 2: Unstable Quality

The output quality of the Agent across different scenarios is highly dependent on prompt quality. Skills standardize best practices to ensure stable output.

Problem 3: Difficulty in Accumulating Experience

Human experience disappears with the conversation, and a team's best practices cannot be passed on. Skills make experience explicit from tacit, becoming an accumulable asset.

Problem 4: Difficulty in Cross-Tool Migration

Different platforms (Claude Code, Cursor, DeepSeek) have different interaction methods. Skills enable knowledge to flow across tools through a unified specification.

Difference from Memory: Memory stores "what happened" (e.g., user preferences, history logs), Skills store "how to do it" (e.g., code review process, data analysis framework).

VII. Orchestration Framework Implementation: From Thought to Engineering

Thinking methods are the "thought," orchestration frameworks are the "tool." This section uses LangChain's official Deep Agents as the main thread, mapping "thinking methods / memory / Skills / capability collaboration" one-to-one onto engineering components.

1. Three-Layer Architecture

Layer	Representative	Core Capabilities	What It Solves
Runtime	LangGraph	Graph orchestration, persistence, state management, streaming output, human-in-the-loop	Complex process control and execution engine
Framework	LangChain	Model abstraction, `create_agent` (ReAct loop + tool invocation), tool interface, middleware	Standardization of single Agent basic capabilities
Harness	Deep Agents	Planning, virtual file system, sub-agents, memory, skills	Reliability for end-to-end complex tasks

The relationship between the three layers can be understood as "base → middleware → application suite": LangGraph is responsible for state flow and graph orchestration, LangChain packages LLM + tools into a standard Agent, and Deep Agents provides out-of-the-box complex task capabilities like long tasks, memory, file systems, and sub-agents on top of the first two layers.

Diagram Guide:

Bottom Layer: LangGraph: Provides graph orchestration, state management, persistence, streaming output, human-in-the-loop—the "operating system" level engine for all Agent execution.
Middle Layer: LangChain: Packages LLM, tools, and prompts into a standardized Agent (mainly the ReAct loop of create_agent), making single Agent capabilities reusable.
Top Layer: Deep Agents: Aimed at end-to-end complex tasks, providing advanced capabilities like planning, sub-agents, virtual file system, long-term memory, Skills, and human fallback.
Dependency Direction: Deep Agents' components call down to LangChain's Agent / Tool / Prompt capabilities; LangChain's Agent runs on top of LangGraph's state graph and persistence mechanisms.

Deep Agents is not a replacement for LangGraph, but an "application suite" built on top of the LangGraph runtime + LangChain framework. Simple tasks can use just the LangChain single layer; end-to-end complex tasks require Deep Agents.

How LangGraph Builds a Graph

Core Capabilities of LangChain create_agent

LangChain's create_agent series of functions (create_react_agent, create_tool_calling_agent, etc.) are factory functions that package "one LLM + a set of tools" into a runnable Agent. Their core responsibilities are:

Bind Tools: Inject functions/packaged tools into the scope callable by the LLM.
Construct Prompt Templates: Assemble the system prompt, role setting, and tool descriptions in a fixed format to feed to the model.
Implement the ReAct Loop: Let the model output Thought in each round, then decide which tool's Action to call, get the Observation, and enter the next round—this is the Thought → Action → Observation loop discussed in Chapter III.
State Flow: Maintain multi-turn conversation state, re-injecting each tool's return result back into the context until the exit condition is met and a final answer is given.

In one sentence: create_agent is not a specific thinking method, but a "standard Agent launcher" that packages ReAct thinking method + tool invocation + state maintenance into a single line of code to start. It eats a model and tools, and spits out an Agent capable of cyclic reasoning-action. LangGraph then goes one level up: turning this cyclic link into a visual state graph node, supporting branching, concurrency, persistence, and human-in-the-loop.

2. Deep Agents' Four Capability Pillars

Capability Pillar	Key Components	Corresponding Principles in This Article
Execution Environment	Virtual File System, Tools/MCP, Code Sandbox, Streaming Output	Action Layer, Capability Collaboration
Context Management	Skills, Long-term Memory, Summarization & Context Offloading, Prompt Caching	Memory, Skill
Delegation	`write_todos`, `task`	PlanExe, Multi-Agent
Control	`interrupt_on`, File System Permissions	Human Fallback

3. Virtual File System

Traditional Agents stuff large chunks of information into the prompt, causing context bloat. Deep Agents uses a file system for Context Engineering: letting the Agent read on demand and store in categories, rather than spreading all materials on the table at once.

The documentation currently mentions three types of mechanisms:

Six File Operations: ls, read_file, write_file, edit_file, glob, grep — these are the atomic commands for the Agent to interact with the virtual file system.
Automatic Offloading of Large Results: When the content returned by a tool call exceeds a token threshold, the full content is written to the file system, and the conversation history only keeps the file path + content preview, preventing prompt explosion.
Automatic History Summarization: When the context reaches the window limit and there is no offloadable content, a summary is generated to replace the original conversation, and the original conversation is written to the file system for retention.

Security Isolation: Execution Sandbox on Top of the File System

The virtual file system solves "how to manage context," but Agents often need to execute code, call command lines, or access networks, so the question of "is execution safe" must be addressed. In engineering, sandboxes are typically chosen in layers based on isolation strength:

Selection Logic: The virtual file system is the Agent's "work desktop," and sandbox isolation is the "protective shield around the workbench." The two are complementary—the file system is responsible for context organization, the sandbox for execution security. Ordinary document processing may not need a sandbox; but once an Agent needs to execute user-submitted code, access external networks, or operate on sensitive data, the runtime environment must be placed in a sandbox.

The virtual file system is the infrastructure for context management and is the most essential upgrade of Deep Agents over a regular ReAct loop.

4. Task Planning `write_todos`: Engineering Plan-and-Execute

Mechanism: The write_todos tool is automatically injected when calling create_deep_agent(), requiring no manual configuration. Each task contains three fields: subject (title), description (description), status (status). Status flows linearly: pending → in_progress → completed.

Three Execution Phases: Make a plan (all pending) → Execute step-by-step (mark status) → Dynamic adjustment (new tasks can be added/adjusted if new needs are discovered during execution).

Correspondence to Chapter III Thinking Methods: Its essence is Plan-and-Execute (PlanExe), but it is not a strict two-phase separation—the Agent can modify the plan during execution, making it "PlanExe with dynamic adjustment capability" (echoing the replan mechanism mentioned earlier).

Key Design: The task list is persisted in Agent State, not in the conversation history. This means even if the conversation history is summarized and compressed, the list remains intact—it acts as the Agent's "North Star," solving the problem of "forgetting the goal midway" during long tasks.

5. Sub-Agent `task`: Solving Bloat with Isolated Context

Mechanism: A built-in task tool allows the main Agent to dispatch sub-tasks to specialized sub-agents for execution.

Core Problem Solved: Context window bloat. Sub-agents have their own independent context window, autonomously execute, and then only return a single final report to the main Agent—all intermediate searches, file reads, and trial-and-error processes are isolated within the sub-agent's own context, not polluting the main Agent's.

Correspondence to Chapter III Thinking Methods: Multi-Agent (each does its own job). The main Agent is responsible for orchestration, sub-agents for specialized execution, with context naturally isolated.

Insight: This is the implementation of "divide and conquer" on Agents—the main Agent's context remains lean (only holding plans and result summaries), while heavy exploration is thrown to sub-agents. Sub-agents are another sharp tool for context management, complementary to file system offloading: the file system offloads "data," sub-agents offload "process." The tutorial also mentions "async subagents" for further parallelization.

6. Skills: Progressive Disclosure + Cross-Tool Standard

Specification: Deep Agents' Skill format is evolving towards an open specification (like the SKILL.md convention: YAML frontmatter metadata + Markdown instruction body). This concept is similar to skills/instructions files in tools like Claude Code, OpenAI Codex, and Cursor, but the ecosystem is still evolving, and the specific schema of each platform should be checked during actual cross-tool migration.

Progressive Disclosure Three-Level Loading — This is the most core design decision for Skills:

Level	Loaded Content	Timing	Cost
L1 Metadata	Only frontmatter (name + description)	Injected into system prompt at startup	~hundreds of tokens for 20 Skills
L2 Instructions	Full SKILL.md body	Loaded only after Agent matches via description	On-demand
L3 Resources	Files under references/, assets/	Read at the LLM's discretion when referenced by instructions	On-demand

Key Point: description is the sole basis for the Agent to decide whether to activate a Skill—the Agent will not read the body in advance to match. This ensures that as the number of Skills grows, the startup overhead only increases linearly, allowing for "infinite scalability."

Analogy: From the original tutorial—"Skills are to AI Agents what npm packages are to Node.js." Tools are atomic operations (search once, read a file); Skills are packaged reuse of "multi-step workflows + domain knowledge + template resources." This is precisely the engineering implementation of "experience reuse" from Chapter VI.

7. Long-Term Memory: `memory.md` + LangGraph Store

Mechanism: Declare memory file paths (e.g., memory.md, preferences.md) via the memory= parameter. The Agent automatically loads them into the system prompt at startup. Memory writes go to the /memories/ path and are persisted to the LangGraph Store via StoreBackend (InMemoryStore for development, PostgresStore for production), retained across sessions.

Naming Note: Deep Agents documentation examples might use AGENTS.md as the memory/preference file. To avoid confusion with the "AGENTS.md global Agent configuration protocol," this article uses memory.md as an example. If your project's Deep Agents template indeed uses AGENTS.md, treat it as the persistent memory file convention under that framework, which is not the same thing as the global configuration protocol.

Self-Updating: When the Agent learns new information during a conversation, it uses the built-in edit_file to update the memory file, and the changes persist to the next conversation—the Agent can "self-evolve" and develop its own professional capabilities.

Isolation: Isolated by namespace via assistant_id (Agent-level), user_id (User-level), org_id (Organization-level), supporting multi-user isolation and organization-level sharing.

Relationship of the Three (sharing the same set of file operation interfaces, differentiated by path prefix and backend):

Dimension	Virtual File System workspace	Skills	Long-term Memory
Storage Backend	StateBackend	StoreBackend	StoreBackend
Lifecycle	Within a single conversation	Persistent across conversations	Persistent across conversations
Content Nature	Temporary working files	Procedural memory (how to do)	Semantic memory (what to know)

Insight: Memory, Skills, and File System share the same set of read/write/edit interfaces, differing only in storage backend and path prefix. This unified abstraction is the brilliance of Deep Agents' design—Chapter IV talks about "Memory," Chapter VI talks about "Skill," but in engineering, they are actually three usages of the same file system.

8. Orchestration Framework Panorama and Selection

Beyond the official main thread of LangChain/LangGraph/Deep Agents, there are more orchestration frameworks in the ecosystem:

Framework	Positioning	Suitable For
LangGraph	Runtime engine, graph orchestration	Developers needing ultimate controllability
LangChain	Framework building blocks, single Agent capability	Developers building custom Agents
Deep Agents	Application suite, out-of-the-box	Teams needing reliable implementation of complex tasks
Dify	Low-code visual orchestration	Rapid prototyping, business validation
AutoGen Studio / Flowise, etc.	Low-code / visual platforms	Writing less code, rapid prototyping
Claude Desktop / Claude Code	Out-of-the-box Agent experience	End-users directly using Agents

Framework selection is also problem-driven: For ultimate flexibility and control → LangChain/LangGraph; for rapid implementation and lower barrier → Dify; for out-of-the-box → platform-type products. Three layers are not always better—simple tasks can use just the LangChain single layer; end-to-end complex tasks require a Harness like Deep Agents. Blindly adopting high-level architecture only introduces unnecessary complexity.

Back to "Human vs. Agent" : Every capability of Deep Agents is compensating for a human cognitive shortcoming—write_todos compensates for "easily losing track in long tasks," sub-agents for "limited attention," the virtual file system for "small working memory capacity," and Skills for "difficulty in passing on experience." In engineering, these are about "externalizing good human brain habits into mechanisms that machines won't forget."

9. Extension: OpenViking—ByteDance's Open-Source Agent Context Database

Volcano Engine (ByteDance) open-sourced OpenViking, which is highly aligned with the Deep Agents philosophy: it is not a vector database, but a context database for AI Agents, solving the problem of "how to uniformly organize, load on-demand, and self-iterate" Agent context.

The core difference in one sentence:

Vector databases (like Milvus, Pinecone, VikingDB) solve "how to store vectors and retrieve them quickly"; OpenViking solves a higher-level problem—how Agent context is managed like a file system. The two have a "file system" to "hard drive" relationship.

Five Core Features (another industrial implementation of nearly the same philosophy as Deep Agents):

Core Feature	Description	Corresponding Deep Agents Mechanism
📁 File System Management Paradigm	Maps memory, resources, and skills uniformly to a `viking://` virtual directory, using `ls` / `find` for standardized location and management.	Virtual File System (VFS)
🧠 Layered Context Loading (L0/L1/L2)	Adopts a pre-generated hierarchical summary strategy: loads on-demand from overview (L0) to details (L2), significantly reducing token consumption.	Skill Progressive Disclosure (L1/L2/L3)
🔍 Directory Recursive Retrieval	Follows a high-precision retrieval link of "intent analysis → directory location → vector retrieval → sub-directory drill-down → result aggregation."	Context Offloading + RAG
👁️ Visual Retrieval Trajectory	Completely records and displays the directory browsing and file location path, achieving observability and debuggability of the retrieval process.	File System Natural Path Tracking
🔄 Automatic Session Management & Self-Iteration	Asynchronously analyzes execution results and user feedback after a session ends, automatically updating user profiles (`memory.md`) and the Agent's experience library.	Long-Term Memory Storage (LangGraph Store)

Relationship with VikingDB: VikingDB is ByteDance's cloud vector database service. OpenViking can use it as a storage base—the open-source version can run locally, and the commercial version leverages VikingDB for large-scale storage and high-performance retrieval. This reaffirms: Vector databases are infrastructure; context databases are higher-level Agent infrastructure.

Endorsement and Performance (according to Volcano Engine's official materials, early data for reference only): Open-sourced a core capability subset of the VLDB 2026 paper "VikingMem: A Memory Base Management System for Stateful LLM-based Applications"; officially claims accuracy improved from 57.21% to 80.32% on the LoCoMo user memory benchmark, with token consumption reduced by 63.2%.

One-Sentence Summary: OpenViking and Deep Agents arrive at the same destination by different routes—both are using the "file system paradigm + layered on-demand loading + memory self-iteration" to answer the same question: For a long-running Agent, how should context be managed? Deep Agents is LangChain's official suite (deeply integrated with the LangGraph runtime), while OpenViking is ByteDance's independent open-source implementation (multi-model Provider, locally runnable, academically grounded). Two routes validate the same thing: context engineering is becoming the new infrastructure of the Agent era.

Deep Agents vs. OpenViking Comparison Diagram — Two implementations, one set of context engineering concepts:

VIII. Implementation Scenarios: Empowerment and Replacement, Human Fallback

Core Judgment: Wherever there is a need for "people" and "processes," MLLMs can be used for empowerment and replacement.

Division of Labor Boundary: Agents empower and replace; humans are responsible for fallback.

This is the key design when implementing the entire architecture—Agents can autonomously complete a large amount of execution work, but at the level of "what to do" and "is it right," humans remain irreplaceable decision-makers:

Agent is Responsible For: Execution, generation, initial screening, workflow routing.
Human is Responsible For: Requirement generation and clarification, review and verification, decision-making and fallback.

Why is Human Fallback Irreplaceable?

Because every person is unique. Everyone's cognition, experience, and values are different, and these differences shape their individual needs and value judgments. An Agent can efficiently execute the "how," but cannot replace a human in deciding "what" to do and whether it is "done well"—the latter requires a person's unique understanding of their own situation and independent judgment of quality.

Human fallback is not a technical compromise, but a confirmation of "human irreplaceability." This is consistent with the discussion on "consensus" in Chapter V—humans and Agents continuously align their understanding in collaboration, with humans ultimately steering the direction and quality.

Engineering Implementation: LangGraph provides interrupt / interrupt_on, and Deep Agents provides file system permissions and human intervention points. It is recommended to explicitly set Human-in-the-loop for high-risk steps (transfers, publishing, deletion, external commitments), rather than just emphasizing "human fallback" at the policy level.

Conclusion

1. In this new Age of Exploration, become someone who masters AI (human-machine coordination), not someone who is replaced.

The essence of an AI Agent is an "intelligent agent that thinks like a human," but it is ultimately a tool. What truly determines value is whether one can use the Agent well—human-machine coordination, not human-machine confrontation.

2. Every round of technological change brings new productivity, as well as new opportunities and jobs.

From classifiers to LLMs, from single Agents to Multi-Agents, every paradigm shift has eliminated a batch of old jobs and created a batch of new ones (Prompt Engineers, Agent Orchestrators, Skill Designers…). The trend is irreversible, but there are always opportunities within the trend—the key is not to resist change, but to stand on the favorable side of change.

3. The Future is Now: Four Major Evolutionary Directions

Looking back and forward from 2026, the technical architecture of AI Agents is evolving in the following directions:

Stronger Autonomous Learning Capability: Future Agents will not only use predefined tools but also autonomously discover and learn new tools—automatically learning to call new services by reading API documentation, or even self-generalizing new skills by observing human operations.
From the Digital World to the Physical World: With the development of embodied intelligence, an Agent's "actions" will not be limited to calling APIs and operating software, but will be able to control robots, drones, and other physical entities to complete tasks in reality, becoming a key bridge connecting digital intelligence and physical reality.
Edge Computing and Decentralization: To protect privacy and reduce latency, more and more lightweight Agents will be deployed on edge devices (phones, cars, smart glasses); simultaneously, an "Internet of Agents" based on open protocols like A2A will gradually form, with a massive number of decentralized Agents discovering, negotiating, and collaborating with each other, constituting an unprecedented global intelligent network.
Deep Integration of Human-Machine Collaboration: Future architectures will pay more attention to "Human-in-the-loop" design—Agents will no longer completely replace humans, but act as human "super-assistants" or "cognitive exoskeletons," working under human supervision and guidance, with the ability to intervene and correct behavior at any time, forming a seamless human-machine collaborative workflow.