
AI's Body Isn't Ready: Why Agents Fail and What Harness Engineering Really Means

Author: vivo Internet Project Team - Jiang Zuohan
This article redefines the AI system architecture from the core perspective that "Large models are not horses, but brains," pointing out that the current problem lies not in model capability but in the immaturity of the Agent as a "body." It analyzes engineering defects in perception, action, feedback, and orchestration, compares Harness-type systems to the life-support mechanisms of an ICU, emphasizes that the current chaos stems from the lack of convergence on best practices, and argues that the present stage is essentially the early era of "not knowing how to use tools," where humanity is gradually defining the correct way to use AI through practice.

Large models are not horses; they are brains — and brains that have just awakened.

1. First, Throw Away the Metaphor "Large Models Are Horses, Harness Is the Saddle"

The concept of Harness Engineering has become very popular recently, and along with it, a metaphor has started to circulate: "Large models are horses, Harness is the saddle."

This metaphor is inaccurate.

A horse has an independent will and must be tamed, restrained, and guided. The relationship between rider and horse is essentially one of confrontation and negotiation. AI systems are not built this way. We don't obtain capabilities by "taming" the model, nor do we make the model more obedient by "cracking a whip."

More critically, this metaphor implies a premise that large models are primitive, cumbersome objects that need to be constrained. But the truth is the opposite. Large models are among the most complex intelligent organs developed in recent years; they are more like brains than livestock.

If we must give a more realistic metaphor, then large models are more like brains, and Agents are more like bodies.

The advantage of this metaphor is that it better explains the real problem with current AI systems: the problem is not that "the brain isn't smart enough," but that "the body hasn't fully developed yet."

2. The Development of AI Is More Like a Super-Accelerated Evolution Where the Brain Comes Before the Body

On a longer time scale, life did not first have a brain and then a body; the two co-evolved over a long period.

From the initial stress responses, to ganglia, to more complex sensory systems, and then to the cerebral cortex that truly supports reasoning and planning, life took an extremely long time to complete this evolutionary path. Meanwhile, eyes, ears, limbs, and nervous systems evolved in tandem.

The body is not a container for the brain; it is the infrastructure through which the brain perceives and acts upon the world.

The development of human technology has similar characteristics.

The evolution of agricultural society was measured in millennia, the Industrial Revolution in centuries, and the Information Age in decades. It wasn't until the last decade or so that the pace of technological evolution suddenly changed.

Cities increased information density, networks broke down geographical barriers to information flow, and systems like navigation, recommendation, and instant messaging distilled a large number of "high-frequency cognitive actions" into directly callable best practices.

From this perspective, AI is not a simple tool upgrade but a larger-scale explosion of cognitive capability.

From AlexNet in 2012 to today, in just over a decade, AI has completed a full leap from recognition, understanding, and generation to multimodal processing, code generation, and tool calling. AlphaGo's defeat of Lee Sedol in 2016 and Ke Jie in 2017 is a very clear watershed moment: it means "the brain has lit up."

Subsequent model evolution has been even more dramatic. Models like ChatGPT, GPT-4, Claude, and Gemini have iterated rapidly, and ecosystems like chat dialogs, code interpreters, API calls, workflows, and multi-agent collaboration have emerged simultaneously.

On the surface, it seems AI already has eyes, ears, hands, and feet.

But the problem is that although these organs exist, they are far from forming a mature, stable, and coordinated body system.

3. The Core Problem with Current Agent Systems Is That the Body Hasn't Grown Properly

If we say large models are the brain, then the most realistic state of many current Agent systems is "the brain has developed too fast, but the body is still in a premature infant stage."

This problem manifests mainly in four aspects.

3.1 Immature Sensory System

Multimodal models, speech recognition, document understanding, and web parsing solve the problem of "seeing and hearing," but they don't automatically equate to "seeing clearly and understanding accurately."

For example:

PDF parsing may have directory misalignment, table breakage, and disordered image-text sequences.
Web scraping may introduce a lot of noise, and the main text may not be fully recognized.
Image recognition may miss key elements.
Speech transcription may be accurate, but without scene context the semantic interpretation still deviates.

These issues all point to one thing: current AI systems have input capabilities, but the quality of input is unstable, lacking reliable pre-processing and context localization mechanisms.

In other words, the eyes are there, but the retina hasn't fully developed.

3.2 Uncoordinated Motor System

Tool calling is one of the Agent's core action capabilities. It can call APIs, access web pages, execute code, and operate applications, seemingly possessing "hands and feet."

But the reality is that this motor system is far from stable.

Common problems include:

Incorrect parameter filling, leading to API call failures.
UI operations drift off target, clicking the wrong element.
Inconsistent execution environments, causing code to fail.
No feedback confirmation after an operation completes, so the loop never closes.

These problems are not about "not being able to move" but about "uncoordinated movements." It is as if the neuromuscular junction had not yet formed a stable connection: the system can issue action commands, but action quality and the feedback loop are unreliable.
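As a rough illustration of what "coordination" could mean in code, the sketch below checks arguments before a tool call and confirms the effect afterwards. Everything here is hypothetical: the spec format (parameter name mapped to expected type), the `call_with_confirmation` helper, and the in-memory `write_note` tool are assumptions made for the example, not part of any particular Agent framework.

```python
from typing import Any, Callable

def validate_args(spec: dict[str, type], args: dict[str, Any]) -> list[str]:
    """Compare arguments against a hypothetical spec of name -> expected type."""
    problems = []
    for name, expected in spec.items():
        if name not in args:
            problems.append(f"missing parameter: {name}")
        elif not isinstance(args[name], expected):
            problems.append(f"{name} should be {expected.__name__}")
    problems += [f"unexpected parameter: {name}" for name in args if name not in spec]
    return problems

def call_with_confirmation(
    tool: Callable[..., Any],
    spec: dict[str, type],
    args: dict[str, Any],
    confirm: Callable[[Any], bool],
) -> dict[str, Any]:
    """Validate first, act second, then confirm the effect before reporting success."""
    problems = validate_args(spec, args)
    if problems:
        return {"ok": False, "stage": "validate", "problems": problems}
    result = tool(**args)
    if not confirm(result):
        return {"ok": False, "stage": "confirm", "result": result}
    return {"ok": True, "stage": "done", "result": result}

# Usage with an in-memory stand-in for a real tool.
notes: dict[str, str] = {}

def write_note(key: str, text: str) -> str:
    notes[key] = text
    return key

spec = {"key": str, "text": str}
good = call_with_confirmation(
    write_note, spec, {"key": "a", "text": "hi"}, lambda k: notes.get(k) == "hi"
)
bad = call_with_confirmation(write_note, spec, {"key": "a"}, lambda k: True)
```

The point of the `stage` field is diagnostic: a failure at "validate" means the command was malformed, while a failure at "confirm" means the action ran but its effect could not be verified.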

3.3 Crude Resource Scheduling System

Large models are high-energy-consumption systems. Context windows, tokens, inference costs, and latency are essentially resource scheduling problems.

Many current Agent systems are still relatively primitive in resource usage, mainly showing two extremes:

Too little information is given, the context is insufficient, and the reasoning chain breaks.
Too much information is given, the prompt is overloaded, key points are drowned out, and system performance degrades.

This type of problem is less about "insufficient model capability" and more about "an immature blood supply system."
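One way to picture this scheduling problem is as packing context under a fixed budget. The sketch below is a minimal version under stated assumptions: the `Chunk` type, the priority scores, and the word-count token estimate are all hypothetical stand-ins (a real system would use the model's own tokenizer and a better relevance signal).

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    text: str
    priority: int  # higher means more important to the current task

def estimate_tokens(text: str) -> int:
    # Crude stand-in: counts whitespace-separated words, not real tokens.
    return max(1, len(text.split()))

def pack_context(chunks: list[Chunk], budget: int) -> list[Chunk]:
    """Keep the highest-priority chunks that fit, then restore document order."""
    ranked = sorted(range(len(chunks)), key=lambda i: chunks[i].priority, reverse=True)
    keep: set[int] = set()
    used = 0
    for i in ranked:
        cost = estimate_tokens(chunks[i].text)
        if used + cost <= budget:
            keep.add(i)
            used += cost
    return [chunks[i] for i in sorted(keep)]

# Usage: a budget of 6 "tokens" forces the lowest-priority chunk out.
chunks = [
    Chunk("background on the project", 1),
    Chunk("user question", 3),
    Chunk("latest meeting minutes summary", 2),
]
packed = pack_context(chunks, budget=6)
```

Both extremes described above show up as parameters here: a budget that is too small drops needed chunks, and a budget with no limit lets low-priority text crowd the prompt.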

3.4 Missing Autonomic Nervous System

This is the most critical point.

The human body has a large number of background automatic regulation mechanisms, such as heartbeat, breathing, temperature control, and digestion, which do not require explicit commands from the person.

Many current Agent systems precisely lack this kind of background maintenance capability.

For example:

Incomplete error recovery mechanisms.
Task retries rely on stacking manual rules.
Lack of stable strategies for context cleanup and compression.
Unsystematic degradation and fallback plans.
Incomplete health checks and anomaly monitoring.

These capabilities should be system-level infrastructure, but at the current stage many systems still rely on hardcoded if-else statements to keep running.
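To make the contrast with scattered if-else handling concrete, here is a minimal sketch of retry and fallback pulled into one reusable function. The name `run_with_recovery`, its parameters, and the exponential backoff schedule are assumptions chosen for illustration; the `sleep` parameter is injectable only so the example runs without waiting.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")

def run_with_recovery(
    primary: Callable[[], T],
    fallback: Callable[[], T],
    retries: int = 2,
    base_delay: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
) -> tuple[T, str]:
    """Try the primary path with backoff, then degrade to the fallback path.

    Returns the value together with the path that produced it, so callers
    can tell a degraded answer from a normal one.
    """
    for attempt in range(retries + 1):
        try:
            return primary(), "primary"
        except Exception:
            if attempt < retries:
                sleep(base_delay * 2 ** attempt)  # 0.1s, 0.2s, ...
    return fallback(), "fallback"

# Usage: one step that recovers after two failures, one that never does.
attempts = {"n": 0}

def flaky() -> str:
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise TimeoutError("upstream timed out")
    return "fresh answer"

def always_down() -> str:
    raise ConnectionError("service unavailable")

no_wait = lambda seconds: None
recovered = run_with_recovery(flaky, lambda: "cached answer", sleep=no_wait)
degraded = run_with_recovery(always_down, lambda: "cached answer", sleep=no_wait)
```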

Therefore, the problem with current Agents is not that the brain is not strong enough, but that the body system is far from forming a complete physiological structure.

4. The Biggest Vacuum in the AI Field Right Now Is the Vacuum of Best Practices

After a rapid technological explosion, a common problem often emerges: capability grows faster than methods can consolidate.

The development of cities didn't start with traffic rules, building codes, and mature infrastructure. The development of the internet didn't start with stable forms like navigation, search, and recommendations.

AI is the same.

It's only been about ten years since AlphaGo, and only a few years since ChatGPT truly entered the public eye. This stage is still one where methods have not yet converged and practices are still diverging.

Many current common methods have obvious transitional characteristics.

4.1 Prompt Engineering Is More Like "Asking for Directions Orally"

Prompt Engineering is characterized by reliance on experience, expression skills, and specific model versions.

For the same task, a slight change in the prompt, or a change in the model, context, or temperature parameter, can lead to significantly different output quality.

This shows that Prompt Engineering is more like a temporary communication skill than a stable system method.

4.2 RAG Is More Like a "Static Map"

RAG solves the problem of "how to connect external knowledge to the model," but it doesn't inherently solve whether the knowledge is up-to-date, whether retrieval is precise, or whether the path is dynamically optimized.

Maps are certainly important, but a map is not real-time traffic conditions.

Therefore, although RAG is an important component, it still cannot be equated to a complete cognitive system.

4.3 Agent Frameworks Are More Like "Assembled Prosthetics"

Today's Agent frameworks commonly suffer from inconsistent interface standards, inconsistent tool integration methods, and fragmented state management capabilities.

They are all trying to solve the problem of "how to form a closed loop of perception, cognition, and action," but most are still in the assembly stage, far from a truly unified, stable, and low-cognitive-load engineering system.

Therefore, what AI engineering truly lacks right now is not another new concept, but several more fundamental things:

Systematic Anatomy of Agents: How perception, cognition, and action coordinate.
Diagnostic Methodology for Agents: When a system fails, should you check the brain or the body first?
Rehabilitation Mechanisms for Agents: How to enable the system to form stable experience from failures, rather than re-reasoning from scratch each time.

These problems are essentially not pure algorithmic problems, but engineering system problems.

And engineering system problems can never be solved by a single design; they can only be repeatedly verified, corrected, and consolidated in real-world scenarios.

5. The True Role of Harness Is Not a Saddle, but an ICU

If we continue using the "brain + body" model, then the role of Harness Engineering becomes clearer.

Harness is not a saddle.

A saddle serves a healthy horse that can already run, but many current Agent systems are not at this stage. They are more like a premature infant with an advanced brain but an unstable body.

In this situation, what the system first needs is not a rein, but monitoring.

Therefore, Harness is more like an ICU.

The capabilities it truly provides include:

Lifecycle Monitoring: Observing token consumption, latency, error rates, and context pressure.
Resource Maintenance: Supplementing information when context is insufficient, and cleaning and compressing when information is overloaded.
Signal Regulation: Filtering noisy inputs and constraining the risk of output actions.
Fault Rescue: When a module fails, quickly switching to a backup path to keep the overall system running.
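The monitoring item in this list can be sketched as a small record of "vital signs" updated after every model call. This is only an illustration: the `VitalSigns` class, its field names, and the alert thresholds (20% error rate, 80% context pressure) are hypothetical values picked for the example, not recommendations.

```python
from dataclasses import dataclass, field

@dataclass
class VitalSigns:
    context_limit: int                 # model context window, in tokens
    calls: int = 0
    errors: int = 0
    tokens: int = 0                    # cumulative token consumption
    context_tokens: int = 0            # context size of the most recent call
    latencies: list[float] = field(default_factory=list)

    def record(self, latency_s: float, tokens_used: int, context_tokens: int, ok: bool) -> None:
        self.calls += 1
        self.errors += 0 if ok else 1
        self.tokens += tokens_used
        self.context_tokens = context_tokens
        self.latencies.append(latency_s)

    @property
    def error_rate(self) -> float:
        return self.errors / self.calls if self.calls else 0.0

    @property
    def context_pressure(self) -> float:
        return self.context_tokens / self.context_limit

    def alerts(self, max_error_rate: float = 0.2, max_pressure: float = 0.8) -> list[str]:
        found = []
        if self.error_rate > max_error_rate:
            found.append("error rate high: consider switching to a backup path")
        if self.context_pressure > max_pressure:
            found.append("context pressure high: consider compressing history")
        return found

# Usage: two calls, one of which failed, with the context nearly full.
vitals = VitalSigns(context_limit=1000)
vitals.record(latency_s=0.5, tokens_used=300, context_tokens=900, ok=True)
vitals.record(latency_s=1.0, tokens_used=200, context_tokens=900, ok=False)
warnings = vitals.alerts()
```

In this framing, resource maintenance and fault rescue are the actions taken in response to these alerts, which is why observation comes first in the list.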

These capabilities are not glamorous, but they are critical.

Because this is not about "controlling the brain," but about maintaining the basic vital signs of the body.

Only by first keeping the system stably alive can subsequent continuous growth and self-optimization be discussed.

6. The Current State of AI Is Not Failure, but an Early Norm

When AlphaGo defeated Ke Jie, many saw it as a breakthrough in AI intelligence.

From a systems perspective, it means something else: the brain has matured ahead of time, but the body is still in its infancy.

This is not a bad thing; it's a very typical state in the early stages of a technological revolution.

Cities weren't built in a day, navigation systems weren't stable in their first version, and recommendation systems also went through a long period of trial and error and convergence. AI's Agent systems will also go through this process, only their iteration speed is faster than any past infrastructure.

Previously, many engineering systems evolved on a yearly basis; now, many systems iterate on a weekly basis.

Therefore, we feel a very strong sense of contradiction:

On one hand, model capabilities are already stronger than expected.
On the other hand, systems engineering is still fragile and unstable.

These two judgments are not contradictory; they are both true.

So the most important thing now is not to pretend this system is mature, but to admit the reality: we do have an extremely smart brain, but it is still strapped to an underdeveloped body.

Systems like Harness are a collection of wheelchairs, crutches, and monitors. They are not perfect, but they are a necessity today.

Because before the body can run stably, the system first needs to be maintained, protected, and monitored.

And so-called best practices will not be designed once at the theoretical level. They will only slowly emerge from a large number of real tasks, real failures, and real deliveries.

7. AI-Generated PPTs Are a Typical Scenario for Observing This Problem

If you want to find a scenario that best embodies the engineering problems of Agents, then AI-generated PPTs are a very typical sample.

On the surface, it seems like just a problem of "letting the model write a 20-page document."

But in reality, it's a systems engineering problem spanning requirement understanding, information completion, structure organization, page generation, visual matching, online editing, and final delivery.

A truly deployable AI PPT project typically includes at least the following stages:

1) Requirement Input

Input information such as topic, audience, page range, scenario template, and source materials.

2) Research Completion

When original information is insufficient or outdated, a research system is needed to supplement the latest information.

3) Outline Generation

First, form a structured outline, rather than generating pages one by one directly.

4) Task Decomposition

Decompose the outline into trackable tasks, clarifying current progress, failure nodes, and rollback points.

5) Page and Visual Generation

Generate content, layout, illustrations, and template style based on the page type.

6) Editing and Delivery

Support online adjustments, speaker note supplementation, note generation, and multi-format export such as PDF, PPTX, and HTML.
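Stage 4 above (task decomposition with progress, failure nodes, and rollback points) is the most mechanical of the six, so it is the easiest to sketch. The `TaskBoard` below is a hypothetical minimal version: it treats each outline entry as one page task, stops at the first failure, and resumes from the first task that is not yet done.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable

class Status(Enum):
    PENDING = "pending"
    DONE = "done"
    FAILED = "failed"

@dataclass
class Task:
    name: str
    status: Status = Status.PENDING

class TaskBoard:
    def __init__(self, outline: list[str]):
        self.tasks = [Task(f"page {i + 1}: {title}") for i, title in enumerate(outline)]

    def progress(self) -> str:
        done = sum(t.status is Status.DONE for t in self.tasks)
        return f"{done}/{len(self.tasks)}"

    def failed(self) -> list[str]:
        return [t.name for t in self.tasks if t.status is Status.FAILED]

    def rollback_point(self) -> int:
        """Index of the first task that is not done; work resumes from here."""
        for i, task in enumerate(self.tasks):
            if task.status is not Status.DONE:
                return i
        return len(self.tasks)

    def run(self, worker: Callable[[str], None]) -> None:
        for task in self.tasks[self.rollback_point():]:
            try:
                worker(task.name)
                task.status = Status.DONE
            except Exception:
                task.status = Status.FAILED
                break  # stop here so the failure node stays visible

# Usage: the second page fails once, then a retry finishes the deck.
board = TaskBoard(["Cover", "Market data", "Summary"])

def unreliable_worker(name: str) -> None:
    if "Market data" in name:
        raise RuntimeError("chart service unavailable")

board.run(unreliable_worker)
after_failure = (board.progress(), board.failed(), board.rollback_point())
board.run(lambda name: None)
after_retry = (board.progress(), board.failed())
```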

This chain illustrates one thing: the difficulty of AI-generated PPTs has never been just writing ability, but whether the entire chain is coordinated.

Using the earlier metaphor:

Document parsing is the sensory system.
Research capability is external memory.
Outline generation is the prefrontal cortex.
The task board is the nervous system.
Templates, layouts, and illustrations are the skeleton and skin.
Export, notes, and sharing are the hands and feet that actually act on the external world.

Therefore, the scenario of AI-generated PPTs very intuitively shows: when we say Agents need a "body," we are not talking about an abstract concept, but a complete set of engineering organs that must work together.

8. The vivoPPT Project Is Itself a Sample of This Judgment

Zooming in a bit more, the vivoPPT project itself is a sample of this judgment.

This chain was not fully designed from the start; it was gradually converged upon during development.

8.1 Initially, It Was "Directly Generate an Outline + Provide Many Templates"

This was a natural starting point.

The user inputs a topic, the system first generates an outline, and then lets the user choose from many templates. It seemed both intelligent and flexible.

But this path quickly exposed problems.

On one hand, the outline itself was unstable; on the other hand, the template was an additional variable. Before the content structure was stable, introducing style choices brought a second layer of uncertainty. The result was: the system seemed very free, but the actual output was unstable, and it was difficult for the user to determine whether the problem was with the content or the template.

In other words, this approach handed both "content planning" and "visual choice" to the model and the user at the same time. On the surface there were more choices, but system complexity grew along with them.

8.2 Later, It Gradually Converged to "Fixed Template + Content First"

So the project made a key convergence later: instead of treating the template as a completely open variable, templates were organized into fixed solutions, and further emphasis was placed on "single template, content first."

The core judgment behind this change is: for most presentation scenarios, the real difficulty is not "which template to choose," but "what exactly to say on this page."

Therefore, the system began to require users to input more complete source materials, rather than just a single topic. Meeting minutes, project summaries, full proposals, research conclusions, speech drafts — these long texts were input as completely as possible. The system first organizes the presentation ideas, then generates an outline, and then decides the responsibility of each page.

This is essentially redefining the input layer: the system no longer assumes the model can produce high-quality generation based on a title alone; it requires the user to provide enough original text for the model to first understand the content, and then organize the content.

8.3 Further On, the Generation Target Changed from "Directly Output a Page" to "First Generate a DSL"

Once the template was fixed, the second problem became very apparent: if the system directly generates the final page, whether HTML or the final rendered result, subsequent editing, validation, reuse, and export are very difficult.

So the project continued to converge, introducing DSL as an intermediate layer.

This step is important.

The essence of a DSL is not "generating in a different format" but adding a structured intermediate representation layer to the system. The page is no longer just the final result; it is first broken down into an editable, compilable, and checkable semantic structure. This provides stable interfaces between templates, content, layout, components, and export. The subsequent editor, preview, export, and AI rewriting finally have a unified object to work on.

From an engineering perspective, this step is equivalent to adding a skeleton to "page generation."
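The article does not show the project's actual DSL, so the following is only a hypothetical miniature of the idea: the model produces a structured `Slide` object, the system checks it, and rendering is a separate, deterministic step. The field names, layout names, and bullet limit are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from html import escape

ALLOWED_LAYOUTS = {"cover", "bullets"}  # hypothetical layout names

@dataclass
class Slide:
    layout: str
    title: str
    bullets: list[str] = field(default_factory=list)

def check(slide: Slide, max_bullets: int = 5) -> list[str]:
    """Validate the intermediate representation before anything is rendered."""
    issues = []
    if slide.layout not in ALLOWED_LAYOUTS:
        issues.append(f"unknown layout: {slide.layout}")
    if not slide.title.strip():
        issues.append("empty title")
    if len(slide.bullets) > max_bullets:
        issues.append(f"too many bullets: {len(slide.bullets)} > {max_bullets}")
    return issues

def render_html(slide: Slide) -> str:
    """One of several possible back ends; PDF or PPTX export would read the same object."""
    items = "".join(f"<li>{escape(b)}</li>" for b in slide.bullets)
    return (
        f'<section class="{escape(slide.layout)}">'
        f"<h1>{escape(slide.title)}</h1><ul>{items}</ul></section>"
    )

# Usage: a valid slide renders; an invalid one is caught before rendering.
slide = Slide(layout="bullets", title="Q3 <Review>", bullets=["Revenue up", "Costs flat"])
page = render_html(slide) if not check(slide) else ""
broken = check(Slide(layout="grid", title="  "))
```

Because editing, validation, and export all operate on `Slide` rather than on rendered HTML, a user edit or an AI rewrite only has to change one field and re-render.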

8.4 Large Model Fine-Tuning

When the input expanded from plain text to rich text, the system gained stronger expressive power, but immediately encountered new problems.

Rich text is not just about adding bold, headings, and lists; it also brings information like images, tables, citations, and context hierarchy. Images in particular cannot be treated as mere attachments.

If only the src address of an image in rich text is preserved, the model actually knows nothing about it. It doesn't know what the text before and after the image is about, what the caption is, or which chapter, page, or topic it belongs to.

Therefore, the project later added another layer of context parsing: in addition to preserving the HTML and plain text content of the rich text, it also extracts heading hierarchy, list structure, and table structure; for images, it combines the title, caption, adjacent paragraphs, and block-level text to generate semantic summaries, topic tags, and material descriptions, and then converts them into project materials.
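A simplified version of this context parsing can be sketched with Python's standard `html.parser`. This is not the project's implementation: it assumes reasonably clean HTML, only tracks the nearest heading, the preceding paragraph, and a following `figcaption`, and leaves the semantic summary and topic tags (which would need a model call) out of scope.

```python
from html.parser import HTMLParser

HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

class ImageContextParser(HTMLParser):
    """Attach the nearest heading, preceding paragraph, and caption to each image."""

    def __init__(self) -> None:
        super().__init__()
        self.heading = ""
        self.last_paragraph = ""
        self.images: list[dict[str, str]] = []
        self._tag: str | None = None      # text-bearing tag currently open
        self._buffer: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in HEADINGS or tag in ("p", "figcaption"):
            self._tag = tag
            self._buffer = []
        elif tag == "img":
            a = dict(attrs)
            self.images.append({
                "src": a.get("src") or "",
                "alt": a.get("alt") or "",
                "heading": self.heading,
                "before": self.last_paragraph,
                "caption": "",
            })

    def handle_data(self, data):
        if self._tag:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag != self._tag:
            return
        text = " ".join("".join(self._buffer).split())  # normalize whitespace
        if tag in HEADINGS:
            self.heading = text
        elif tag == "p":
            self.last_paragraph = text
        elif tag == "figcaption" and self.images:
            self.images[-1]["caption"] = text
        self._tag = None

# Usage: the image record now carries its surroundings, not just a src.
parser = ImageContextParser()
parser.feed(
    "<h2>Q3 revenue</h2><p>Revenue grew in every region.</p>"
    '<figure><img src="chart.png" alt="bar chart">'
    "<figcaption>Revenue by region</figcaption></figure>"
)
image = parser.images[0]
```

A record like `image` is the kind of input a later step could use to generate the semantic summary and material description mentioned above.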

This step illustrates a more direct point: when input capabilities are enhanced, the system does not automatically become stronger; instead, it forces you to make the "sensory system" more complete. An image is not considered processed just because it is "seen"; only when it is placed back into context does it truly become usable information for the model.

This development process shows that the best practices that actually take hold are usually not a universal Prompt, but a few simple yet important process disciplines:

Research first, then write.
Outline first, then pages.
Taskify first, then parallelize.
Make it editable first, then deliverable.

9. Best Practices Are Never Designed

So, the current stage looks chaotic, but it's not surprising.

Some emphasize Prompt, some emphasize Agent, some do Memory, some do Workflow. Everyone is trying different paths, but overall, there hasn't been complete convergence.

This is not because everyone's understanding is insufficient, but because best practices are not a priori.

They are not designed through discussion; they gradually emerge through real use.

Only after repeated trial and error in a large number of real scenarios will the system gradually form a consensus:

Which steps must be retained.
Which capabilities must be pushed down into infrastructure.
Which risks must be covered by fallbacks.
Which division of labor is most stable.

Ultimately, so-called best practices will slowly settle from "experience" into "intuition."

10. In the Future, We Will No Longer Discuss "Whether to Use AI"

Perhaps in the future, we will no longer discuss "whether to use Agents," just as we don't seriously discuss "whether to use navigation" today.

These choices will eventually shift from "technical options" to "default actions."

Real change will not happen when model parameters expand a bit more, or when the leaderboard rises a bit more.

Real change will happen when we begin to truly understand this entire system:

When to let it think.
When to let it act.
When to use tools.
When to hand it over to processes.
When to let humans intervene.

At that point, AI will truly evolve from a "collection of capabilities" into a "system that can be used long-term."

11. We Are Living in an Era of "Not Yet Knowing How to Use Tools"

And right now, we are in the early stages of this phase.

The tools are already powerful enough, but the methods of use have not yet fully formed.

This is somewhat like the period when humans first got maps, first had cars, and first accessed the internet. The tools themselves already have enormous potential, but the corresponding usage methods, engineering standards, and societal best practices are still being formed.

This is also a very rare stage.

Because in this stage, people are not just using tools; they are also participating in defining the correct way to use tools in the future.

In other words, we are participating in answering a question:

In the future, what will be the "correct way to use" AI?

Note: The article was created with AI assistance; the "living organism evolution" perspective and the "technological explosion" framework were proposed by the author.