
The Agent Anatomy: How MLLMs, Memory, and Multi-Agent Loops Build AI That Thinks Like Us

By ethantan
Read original on juejin.cn

The shift from monolithic LLM calls to modular, memory-backed agent systems changes what production AI looks like. Context management—not model size—is becoming the hard engineering problem, and the frameworks that solve it (virtual file systems, isolated sub-agent contexts, progressive skill loading) directly determine whether an agent can run a long, multi-step task without losing the plot or blowing its token budget.

Summary

AI agents are moving from single-model wrappers to modular systems modeled on human cognition. A central multimodal large model (MLLM) acts as the brain, orchestrating perception, memory, action, and four explicit thinking methods: ReAct for step-by-step exploration, Plan-and-Execute for structured tasks, Reflection for iterative quality improvement, and Multi-Agent for parallel specialization. Memory splits into session-level context windows and persistent RAG-backed stores, while Tools, MCP, and the A2A protocol form a three-layer connection stack that scales from single functions to an internet of agents. Deep Agents, LangChain's application harness, engineers these ideas into a runtime with a virtual file system for context management, `write_todos` for dynamic planning, isolated sub-agents to prevent context bloat, and Skills with progressive disclosure for reusable expertise. ByteDance's OpenViking pursues the same context-engineering goals with a file-system-native context database. The architecture converges on a clear boundary: agents handle execution and generation, while humans own requirements, judgment, and fallback—making consensus alignment the core bottleneck as agents move from usable to reliable.
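
As one illustration of the thinking methods named above, a ReAct loop can be sketched in plain Python. This is a minimal sketch under stated assumptions: `scripted_model`, `TOOLS`, and `react` are hypothetical names, and the model is a scripted stand-in rather than a real LLM call. It shows the Thought, Action, Observation cycle and the two exit conditions such a loop needs.

```python
from typing import Callable

# Hypothetical tool registry; the tool name and behavior are illustrative only.
TOOLS: dict[str, Callable[[str], str]] = {
    "add": lambda arg: str(sum(int(x) for x in arg.split("+"))),
}

def scripted_model(history: list[str]) -> dict:
    """Stand-in for an LLM call: picks the next step from the history so far."""
    observations = [h for h in history if h.startswith("Observation: ")]
    if not observations:
        return {"thought": "I need to add the numbers.", "action": "add", "input": "2+3"}
    answer = observations[-1].removeprefix("Observation: ")
    return {"thought": "I have the result.", "action": "finish", "input": answer}

def react(question: str, model=scripted_model, max_steps: int = 5) -> str:
    history = [f"Question: {question}"]
    for _ in range(max_steps):              # exit condition 1: hard step limit
        step = model(history)
        history.append(f"Thought: {step['thought']}")
        if step["action"] == "finish":      # exit condition 2: explicit finish action
            return step["input"]
        observation = TOOLS[step["action"]](step["input"])
        history.append(f"Action: {step['action']}({step['input']})")
        history.append(f"Observation: {observation}")
    return "stopped: step limit reached"

answer = react("What is 2+3?")
```

The step limit matters as much as the finish action: a model that never decides it is done would otherwise loop indefinitely.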

Takeaways
An AI agent is a modular system with an MLLM as its central CPU, surrounded by perception, memory, action, and thinking-method layers that mirror human cognition.
Four thinking methods define how an agent loops: ReAct (step-by-step with tool calls), Plan-and-Execute (plan first, then execute with replan capability), Reflection (generate, critique, improve), and Multi-Agent (orchestrator dispatches to specialized sub-agents).
Chain-of-Thought prompting, proposed by Google researchers in January 2022, underlies all four thinking methods: it prompts the model to reason step by step before giving its final answer.
Memory is split into short-term (context window, managed by sliding windows or summarization) and long-term (RAG with vector databases like Milvus, Pinecone, or ChromaDB), with hybrid vector-plus-keyword retrieval as the practical default.
Tools, MCP, and A2A form a three-layer connection stack: Tools are atomic functions, MCP standardizes tool interfaces so any agent can reuse them, and A2A lets agents discover and delegate to each other via Agent Cards.
Deep Agents engineers the theory into a runtime with a virtual file system that offloads large results and auto-summarizes history, `write_todos` for dynamic task planning persisted in Agent State, and `task` for spawning sub-agents with isolated context windows.
Skills use progressive disclosure: only metadata loads at startup, and full instructions load on demand when the agent matches a task to a skill description. Startup token cost therefore grows only with the number of short metadata entries, even with hundreds of skills.
ByteDance's OpenViking is a context database—not a vector database—that maps memory, resources, and skills into a `viking://` virtual directory with layered loading, claiming a jump from 57.21% to 80.32% accuracy on a user-memory benchmark while cutting token use by 63.2%.
Human fallback is not a compromise but the architecture's design boundary: agents execute and generate, humans own requirements, verification, and final decisions, with LangGraph's `interrupt` and Deep Agents' file-system permissions as the engineering hooks.
Consensus—getting multiple agents or human-agent pairs to align on meaning—is the unsolved bottleneck that separates usable agents from reliable ones.
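
The hybrid vector-plus-keyword retrieval mentioned in the takeaways can be sketched as a weighted sum of two scores. This is a toy sketch, not how Milvus, Pinecone, or ChromaDB work internally: `embed` is a hypothetical stand-in that counts character trigrams instead of calling an embedding model, and `alpha` is an assumed mixing weight.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: character-trigram counts."""
    t = text.lower()
    return Counter(t[i:i + 3] for i in range(len(t) - 2))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[k] * b[k] for k in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def keyword_score(query: str, doc: str) -> float:
    """Fraction of query words that appear verbatim in the document."""
    terms = set(query.lower().split())
    words = set(doc.lower().split())
    return len(terms & words) / len(terms) if terms else 0.0

def hybrid_search(query: str, docs: list[str], alpha: float = 0.5, k: int = 2) -> list[str]:
    """Rank docs by alpha * vector similarity + (1 - alpha) * keyword match."""
    q = embed(query)
    scored = [
        (alpha * cosine(q, embed(d)) + (1 - alpha) * keyword_score(query, d), d)
        for d in docs
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [d for _, d in scored[:k]]

docs = [
    "Milvus is a vector database",
    "The cat sat on the mat",
    "Error code E42 means timeout",
]
top = hybrid_search("E42 timeout", docs, k=1)
```

The keyword term is what rescues exact identifiers such as error codes, which similarity search alone can rank poorly; the vector term handles near matches such as "databases" against "database".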

Conclusions

Context management has quietly become the central engineering challenge of the agent era. The virtual file system, sub-agent context isolation, and progressive skill loading are not convenience features—they are the difference between an agent that finishes a complex task and one that drowns in its own conversation history.
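
To make the offloading idea concrete, here is a minimal sketch of replacing a large tool result with a short pointer to a file. It does not use the Deep Agents API: `VirtualFS`, `offload_if_large`, the `/tool_results/` path, and the 500-character threshold are all assumptions chosen for illustration.

```python
class VirtualFS:
    """Hypothetical in-memory file store standing in for a virtual file system."""

    def __init__(self) -> None:
        self.files: dict[str, str] = {}

    def write(self, path: str, content: str) -> None:
        self.files[path] = content

    def read(self, path: str, offset: int = 0, limit: int = 200) -> str:
        # Read a slice so the agent can page through a large file.
        return self.files[path][offset:offset + limit]

def offload_if_large(fs: VirtualFS, call_id: str, result: str, max_chars: int = 500) -> str:
    """Return the text that goes into the prompt: the result or a file pointer."""
    if len(result) <= max_chars:
        return result
    path = f"/tool_results/{call_id}.txt"
    fs.write(path, result)
    return f"[{len(result)} chars saved to {path}; preview: {result[:80]}...]"

fs = VirtualFS()
small = offload_if_large(fs, "call-1", "ok")
pointer = offload_if_large(fs, "call-2", "x" * 10_000)
first_page = fs.read("/tool_results/call-2.txt", limit=10)
```

The prompt carries only the pointer, and the agent pays for the full content only if it later chooses to read it, one slice at a time.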

Calling Deep Agents a 'harness' rather than a framework is precise: it does not replace LangGraph or LangChain but adds the file system, planning, and delegation primitives that turn a reasoning loop into a reliable long-running worker.

The MCP versus A2A distinction clarifies a confusion that trips up many teams. MCP is vertical (agent-to-tool), A2A is horizontal (agent-to-agent), and both are necessary once you move beyond single-agent demos.
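
The vertical and horizontal directions can be pictured with two small data structures. This sketch is only an analogy: `ToolServer`, `AgentCard`, `find_agent`, and every field name are simplified assumptions, not the actual MCP or A2A schemas or wire formats, and the endpoints are placeholders.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional

@dataclass
class ToolServer:
    """Vertical link (MCP-style): an agent lists and calls tools on a server."""
    tools: dict[str, Callable[..., str]] = field(default_factory=dict)

    def list_tools(self) -> list[str]:
        return sorted(self.tools)

    def call_tool(self, name: str, **args) -> str:
        return self.tools[name](**args)

@dataclass
class AgentCard:
    """Horizontal link (A2A-style): an agent advertises what it can do."""
    name: str
    endpoint: str
    capabilities: list[str]

def find_agent(cards: list[AgentCard], capability: str) -> Optional[AgentCard]:
    """Pick the first agent whose card declares the needed capability."""
    return next((c for c in cards if capability in c.capabilities), None)

# Vertical: one agent, one tool call.
server = ToolServer({"get_weather": lambda city: f"Sunny in {city}"})
weather = server.call_tool("get_weather", city="Paris")

# Horizontal: an orchestrator discovers a peer agent to delegate to.
cards = [
    AgentCard("translator", "https://example.invalid/translator", ["translate"]),
    AgentCard("researcher", "https://example.invalid/researcher", ["search", "summarize"]),
]
delegate = find_agent(cards, "summarize")
```

In the first case the caller knows exactly which function it wants; in the second it knows only the capability it needs and has to discover who offers it.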

Persisting the `write_todos` task list in Agent State rather than conversation history is a small design choice with outsized impact. It means the agent keeps its plan even after the chat context is summarized away—a direct fix for the 'forgot what it was doing mid-task' failure mode.
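
The effect of keeping the plan outside the conversation history can be shown with a small sketch. `AgentState`, the local `write_todos` function, and `summarize_history` are hypothetical stand-ins written for this example, not the Deep Agents implementation, and the summary line is a placeholder for what would normally be an LLM-written summary.

```python
from dataclasses import dataclass, field

@dataclass
class AgentState:
    messages: list[str] = field(default_factory=list)  # conversation history
    todos: list[dict] = field(default_factory=list)    # plan, stored separately

def write_todos(state: AgentState, todos: list[dict]) -> None:
    """Stand-in for a planning tool: replaces the whole task list in state."""
    state.todos = [dict(t) for t in todos]

def summarize_history(state: AgentState, keep_last: int = 2) -> None:
    """Collapse older messages into one summary line; todos are not touched."""
    if len(state.messages) <= keep_last:
        return
    dropped = len(state.messages) - keep_last
    state.messages = [f"[summary of {dropped} earlier messages]"] + state.messages[-keep_last:]

state = AgentState()
write_todos(state, [
    {"content": "collect sources", "status": "completed"},
    {"content": "draft report", "status": "in_progress"},
])
state.messages = [f"message {i}" for i in range(10)]
summarize_history(state)
```

After summarization the history has shrunk to three entries, while both todos and their statuses are still available to the next step.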

Skills with progressive disclosure solve the token-economy problem that naive prompt-stuffing creates. An agent can carry hundreds of skills but only pays the token cost for the ones it actually activates, making skill libraries economically viable.
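
A minimal sketch of the loading pattern, assuming a hypothetical `Skill` and `SkillRegistry` (not the Deep Agents classes): only names and descriptions go into the startup prompt, and a skill's full instructions are read the first time it is activated. In this sketch activation is by name; in a real agent the model would choose the skill by matching the task against the descriptions.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Skill:
    name: str
    description: str                # metadata: always present in the prompt
    load_body: Callable[[], str]    # full instructions: fetched on activation
    _body: Optional[str] = None

    def body(self) -> str:
        if self._body is None:      # load once, then reuse the cached text
            self._body = self.load_body()
        return self._body

class SkillRegistry:
    def __init__(self, skills: list[Skill]) -> None:
        self.skills = {s.name: s for s in skills}

    def startup_prompt(self) -> str:
        """One short line per skill; no bodies are loaded here."""
        return "\n".join(f"- {s.name}: {s.description}" for s in self.skills.values())

    def activate(self, name: str) -> str:
        return self.skills[name].body()

loaded = []  # records which skill bodies were actually read

def make_loader(name: str) -> Callable[[], str]:
    def load() -> str:
        loaded.append(name)
        return f"Full instructions for {name}"
    return load

registry = SkillRegistry(
    [Skill(f"skill{i}", f"handles task type {i}", make_loader(f"skill{i}")) for i in range(100)]
)
prompt = registry.startup_prompt()
instructions = registry.activate("skill7")
```

With 100 registered skills, the startup prompt holds 100 one-line entries and exactly one body has been read.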

OpenViking and Deep Agents arriving at nearly identical context-engineering patterns—file-system abstraction, layered loading, self-updating memory—from completely independent codebases suggests this is not a fashion but a convergence on what agent infrastructure actually requires.

The consensus layer is where agent reliability hits a wall that better models alone cannot fix. Multi-agent systems without explicit alignment mechanisms will produce inconsistent results, and human-agent alignment remains a manual, conversation-by-conversation process.

Human fallback is framed here as an architectural feature, not a deficiency. The explicit stance that humans are irreplaceable for 'what to do' and 'is it right' decisions draws a clean line that many agent hype cycles blur.

Concepts & terms
MLLM (Multimodal Large Language Model)
A large model that processes text, images, audio, and video through a unified architecture. Modality-specific encoders (for example a Transformer text encoder, ViT for images, and Whisper for audio) project their inputs into a shared semantic vector space. It is the 'brain' at the center of the agent architecture.
Chain of Thought (CoT)
A prompting technique proposed by Google researchers in January 2022 that guides an LLM to generate step-by-step reasoning before producing a final answer. It is the foundational technique underlying all four agent thinking methods (ReAct, PlanExe, Reflection, Multi-Agent).
ReAct
A thinking method that loops through Thought → Action → Observation, deciding each next step based on the previous observation. Requires an explicit exit condition to avoid infinite loops. Best for exploratory, high-uncertainty tasks.
Plan-and-Execute (PlanExe)
A two-phase thinking method that first decomposes a task into ordered steps, then executes them sequentially. Mature implementations include a replan mechanism for when the environment changes mid-execution. Best for standard, predictable workflows.
RAG (Retrieval-Augmented Generation)
A four-step mechanism (Store → Retrieve → Augment → Generate) that gives an LLM access to an external knowledge base. Information is embedded into vectors, stored in a vector database, retrieved by similarity search when relevant, and injected as additional context before generation.
MCP (Model Context Protocol)
An open protocol by Anthropic (2024) that standardizes how agents connect to tools. A tool provider implements an MCP Server once, and any MCP-compatible agent can discover and use its Resources (data), Tools (functions), and Prompts (interaction templates).
A2A (Agent-to-Agent Protocol)
A protocol proposed by Google that lets agents from different vendors and frameworks discover each other, negotiate capabilities, and delegate tasks. Each agent publishes an Agent Card declaring its capabilities, endpoints, and authentication requirements.
Progressive Disclosure
A three-level loading strategy for Skills: L1 loads only metadata (name, description) at startup, L2 loads full instructions only when the agent matches a task to the skill's description, and L3 loads referenced resource files on demand. As the skill library grows, startup token cost grows only through the short metadata entries, not through the full instructions.
Virtual File System (Deep Agents)
A context-engineering primitive that lets an agent read, write, edit, list, glob, and grep files instead of stuffing everything into the prompt. Large tool results are automatically offloaded to files, and conversation history is summarized to files when the context window fills up.
OpenViking
ByteDance's open-source context database for AI agents. It maps memory, resources, and skills into a `viking://` virtual directory with layered loading (L0/L1/L2), directory-recursive retrieval, and automatic session-level memory self-updating. It is a context database, not a vector database.
Consensus (in Agent Systems)
The problem of getting multiple agents or human-agent pairs to form consistent judgments without a central authority dictating the outcome. It operates at three levels: single-agent internal consistency (mature), multi-agent alignment (hierarchical is mature, peer-to-peer is early-stage), and human-agent semantic alignment (deep alignment remains a bottleneck).