跪拜 Guibai
← Back to the summary

Agent Reliability Isn't a Model Problem — It's a Trace Problem

Event Date: Thursday, June 18, 2026 Talk Transcript: Smart Minutes: Databend Community Event June 18, 2026 PPT: https://www.bohutang.me/talks/2026-trace-is-evals/

How can AI Agent deployment tackle the challenges of stability, cost, and evaluation?

This article is based on a transcript of Zhang Yanfei's talk at a Databend community event. Centered on the concept of "Trace is Evals," it systematically outlines the evolution of AI Agents from Prompt Engineering and Context Engineering to Harness Engineering, and explains why Agent stability, cost attribution, and performance evaluation must be built upon complete execution traces. Using comparative cases with Agents like Claude Code, Evot, and Pi, the article clarifies why Agent Trace differs from traditional Trace, and how Databend uses object storage, VARIANT, accelerated columns, full-text search, and Stream/Task to build a minimalist Trace storage and analysis foundation. Reading time is about 12 minutes.

From "Store and Query" to "Supporting Intelligent Decisions"

With the rapid development of large language models and AI Agents, we can clearly feel the technological paradigm shifting. Many people feel anxious amidst this change and hope to find more breakthroughs. In such an era, the role of the underlying data infrastructure is also changing. In the past, our core requirements for data infrastructure were: store it down, and query it out. Now, we hope it can support the operation of intelligent systems, participating in the process of intelligent decision-making and continuous optimization.

Databend is a cloud-native data warehouse. With the development of large models, new demands are being placed on data warehouses. Databend has recently been collaborating with some leading domestic large model manufacturers on related projects. During this collaboration, we have practiced and summarized some work around Agent trace analysis and attribution.

Today, let's look at this problem from a different perspective: An Agent runs for 3 hours, costs $12, and ultimately fails, but you don't know which step went wrong; even if it succeeds, you don't know which step was right. The next time you change the model, modify the Prompt, or add a Tool, it's hard to say whether the result got better or worse. This is no longer a Prompt Tuning problem, but a Harness Engineering problem.

Three Problems This Sharing Aims to Solve

The core question we are discussing today is: How can we use AI better than others? This sharing hopes to solve three specific problems.

First, reduce Token waste. We need to see clearly at which step the Agent is burning money, going in circles, or making redundant calls. Because AI can hallucinate, and hallucinations cause it to get lost during task execution. For a human, a task might be simple, and the path can be determined in a few steps; but for AI, which lacks human consciousness, the context it sees is essentially predicting one Token after another.

Second, make the Agent more stable. For the same Agent product, we need to find where failures, hallucinations, and quality fluctuations begin. Only by finding the cause can we have a way to do better than others.

Third, and a point of great confusion for many: If I want to make an Agent product and I change a Prompt, add a Tool, or switch a model, how do I quantify the effect of this change? Did the result get better or worse? If you can only judge by feeling, stable iteration is impossible.

From Databend's practice in collaborating with leading model manufacturers, a core method has been summarized: record the execution trace of every step of the Agent, and use data, not feeling, to optimize the Agent. This is also an important trend that model manufacturers are paying attention to.

Three Stages of Agent Engineering Evolution

Before diving into specific project practices, let's look at the three stages of Agent engineering evolution.

The first stage is Prompt Engineering. When ChatGPT first appeared, people mainly typed instructions into the chat dialog box, such as performing a search, translating a piece of text, or making a summary. The core of this stage was tuning instructions, adding examples, and writing chains of thought—optimizing the input text for a single model call.

The second stage is Context Engineering. Entering 2025, people realized that large models, based on general knowledge, couldn't directly combine with internal company documents and knowledge to answer questions, so RAG technology began to develop. Models could call upon company documents and knowledge bases, leading to the emergence of customer service assistants and knowledge base Q&A Agents. The core of this stage is managing what the model should see at each step, including RAG retrieval, memory compression, and window scheduling—optimizing the information flow in multi-turn interactions.

The third stage is Harness Engineering. By 2026, with the explosion of OpenClaw, people discovered that just by giving an Agent a tool, it could actively operate a local computer and autonomously execute tasks. At this stage, Coding Agents like Claude Code, Codex, and Coze began to proliferate, and the form of Agents evolved from single-turn Q&A to autonomous task execution and multi-Agent collaboration.

Development Stage Prompt Engineering Context Engineering Harness Engineering
Time 2022-2024 2025 2026
Characteristics Tuning instructions, adding examples, writing chains of thought — optimizing the input text for a single model call Managing what the model should see at each step: RAG retrieval, memory compression, window scheduling — optimizing the information flow in multi-turn interactions The entire execution infrastructure around the model: state maintenance, tool mediation, feedback injection, constraint enforcement, progress verification
Representative Forms Chatbot, single-turn Q&A, translation/summary RAG applications, multi-turn dialogue assistants, knowledge base Q&A Coding Agent, autonomous task execution, multi-Agent collaboration

The so-called Harness is scaffolding engineering: how we constrain the Agent and the model to make them perform better. From our observation, Harness contains two major parts. For product builders, Harness is the product framework, aiming to make the Agent run better; for model trainers, Harness is the examination room, aiming to make the model train better within it.

Harness from the Application Side: Making Agents Controllable, Observable, and Attributable

From the application side, the goal of Harness is to turn the model into a controllable product.

In the early days, we directly interfaced with APIs for single-turn interactions. For example, having the model send an email and returning the result after sending. Now, Agents have evolved to execute complex multi-turn tasks. A task might run for hours and involve permission and cost control. If you want to provide an Agent for internal company use, you must observe the execution process of every step and evaluate the results of every change.

Therefore, the goal of application-side Harness is to make the Agent controllable, observable, attributable, and capable of continuous improvement. And whether "improvement is effective" needs to be judged by Eval.

Harness from the Model Side: Training Tools and Processes into the Model

From the model side, the goal of Harness is to "train" tools and processes into the model.

Previously, external Harness would tell the model what to do: tool usage was described by Prompts, context compression relied on external logic, and execution strategies were constrained by scaffolding. Now, models are starting to Rollout on real Harnesses, using Traces and Rewards to learn tool calls, gradually training context management and execution strategies into the model.

From the actual scenarios of our cooperation with large model manufacturers, just changing the Harness can bring a 10x performance improvement.

For example, Cursor was an early product to create a Coding Agent. Recently, Cursor released the Composer 2.5 model, trained based on Kimi. It achieves good results at nearly one-tenth the price of Opus. Although there might still be a gap compared to Opus, the effect is already quite considerable.

Why can it do so well based on a relatively cheap model and the Cursor software? The reason is that Cursor lets the model learn how to use Cursor. Cursor itself retains a large amount of user usage data; the model can learn what tools the user called and what operations were taken at each step, and then learn these behavior patterns into the model, similar to implanting a strong generalization ability into the model.

For example, what tools are available in Cursor, and which tool the model should preferentially use when encountering a certain type of task—these preferences will be trained into the model, becoming a more important part of the model weights. So, if you use Cursor to call Composer, the effect will be very good; but if you use Claude Code or the open-source OpenCode to call the same model, the effect might be poor. The reason is that Composer learned behaviors within Cursor; it has already been trained for Cursor's Harness.

Therefore, Harness capabilities are gradually sinking into the model. Model manufacturers also know that this is their moat. It's not that the Harness disappears, but that the model begins to learn within the Harness, gradually internalizing tool usage, context management, and long-term task execution. And how to judge the quality of the Rollout also depends on the Trace.

Why Tool Names and Case Affect Performance

As long as the design follows the tool behaviors preferred by the model, the upper-level Agent product doesn't even need a very long Harness, and the Prompt doesn't need to be written very complexly.

For example, there are about 20+ tools in Claude Code. To make the model perform as it does in Claude Code, the tool names must be accurate, the case must be consistent, and the tool descriptions need to be kept as consistent as possible. This way, when calling Opus, the effect will be close to Claude Code's own performance.

The case of a tool name affects performance. The reason is that Opus was trained based on Claude Code. If a Tool's name had a fixed case during training, and you give it a lowercase version, the model might deviate when predicting the next Token. This deviation seems small, but it gets continuously amplified in multi-step Agent execution.

Why You Must Unfold the Trace

From the perspective of building an Agent product, everyone eventually encounters the same question: Did our change make it better or worse?

This question is hard to answer because an Agent is a system of uncertainty. Large models can hallucinate; the same task, the same software, executed at different times, can yield good or bad results.

How can we clarify this? From our practice, observability is needed. That is, the execution trace of every step of the interaction between the Agent and the large model must be recorded, supporting every step of judgment with clear data. We need to know from which step the Agent started to go bad, or why it got better; we also need to be able to compare the differences between two executions step-by-step, locating the "which step" in terms of cost and quality.

Ultimately, whether from the Agent product perspective or the model training perspective, Trace is needed. Trace is both the evaluation evidence for Agent products and the data fuel for model training. Only by unfolding the complete trace can we clarify "better or worse."

A Comparative Case: Claude Code, Evot, and Pi

Next, let's look at a comparative case. The three Agents are Claude Code, Evot, and Pi, and the model they all call is DeepSeek V4 Pro. In this case, Claude Code executed for 15 minutes, Evot for 5 minutes, and the open-source Agent Pi for 4 minutes. Here, Claude Code calling DeepSeek V4 Pro serves as a comparative reference.

One might think, Claude Code itself is very good, why is the execution time so long? Is the DeepSeek V4 Pro model too weak? Actually, no. Let's look again at Claude Code calling its own model, Opus 4.6; the same task was completed in 3 minutes.

Agent Called Model Tool Call / Execution Time Core Conclusion
Claude Code Opus 4.6 ~30 steps / 3 min 18 sec Model and tools perfectly matched, high efficiency
Claude Code DeepSeek V4 pro ~60+ steps / 15 min 02 sec Tool guidance ineffective for third-party models, low efficiency
Evot Opus 4.6 2 min 02 sec General agent, no deep adaptation for specific models
Evot DeepSeek V4 pro 5 min 38 sec
Pi Opus 4.6 2 min 38 sec General agent, no deep adaptation for specific models
Pi DeepSeek V4 pro 4 min 48 sec

This shows that Claude Code's Harness capabilities are highly matched with its own model. If you switch to another model, the Prompt, Tool, or other system settings might cause the model to get lost during execution, not knowing what to do.

So, why do model manufacturers want to build their own Coding Agents? This is the reason. They will combine their Agent's behavior, tool usage, and tool descriptions with model pre-training, sinking the Harness into the model. The model knows the tool preferences and how to cooperate with the Agent. In contrast, the effect of Claude Code calling DeepSeek V4 Pro is much worse.

You can also see that when Evot and Pi call DeepSeek V4 Pro and Opus 4.6, the execution time gap isn't large. This is because these two Agents are more general-purpose and haven't been deeply adapted specifically for Claude Code or a particular model. Claude Code is effective for the Opus model but not necessarily for third-party models.

Therefore, when you use Claude Code to call a third-party model, the poor effect isn't necessarily because the model is bad, but possibly because the Agent and model aren't well-coordinated. From the model manufacturer's perspective, they will train the Agent and model into a highly matched system. This is also why it becomes harder for third-party Agents and applications to fully leverage a model's capabilities.

If Claude Code calls DeepSeek V4 Pro, you'll find the execution time is longer, Token consumption is greater, and uncertainty is higher. If you only look at the final result without understanding the underlying mechanism, it's easy to conclude "this model is terrible." But in reality, just looking at the final result is far from enough; you must unfold the Trace to know where Tokens were burned, where time was slow, and at which step it started going in circles.

Agent Path Dependency and Bifurcation Points

The variability of Agents comes from path dependency. One tool call choice, one context trim, one error recovery, will change all subsequent steps.

After unfolding the Trace, you'll find that the differences start from the Tool Call sequence. In the path diagrams of the three Agents, each colored block represents a tool call; some steps have two colored blocks, one above the other, indicating the large model returned two parallel calls. For example, the second step of Claude Code has two parallel colored blocks.

This shows that the cooperation between the model and the Agent is very important. If the Agent and model cooperate well, the degree of parallelism will be higher. This process is similar to parallel execution in databases: once parallelism increases, task execution time shortens, and quality doesn't necessarily drop.

The Agent guides the model, telling it what to do next. If the model has undergone reinforcement training, it knows how to cooperate with the Agent and will execute very smoothly. During this process, the system collects a large number of Traces, such as how users use it and how each step is broken down, and then lets the model learn these behaviors.

Let's look at another bifurcation point example. The upper path cooperates well; step 4 chooses Edit, precisely modifies the target file, and then completes the task step by step. The lower path chooses Bash at step 4, producing too much output, causing it to circle around for 17 steps afterward, consuming more time and resources. Although it eventually completes the task, the process is clearly less efficient.

The interaction between the Agent and the large model is a chain reaction. Each step executed depends on the result of the previous step. The execution result is then handed back to the large model, which gives the next action, cycling continuously until a final result is obtained.

The large model itself has no state; it doesn't naturally maintain a complete task state. So, as long as one step goes wrong, and you feed the erroneous result as the next round's input to it, the subsequent path will continuously diverge.

If we know the divergence happened at step 4, we need to stop and analyze: Why did it diverge at step 4? Did the Prompt's guidance cause the model to make a wrong choice at this step? The bifurcation point is hidden within the complete execution trace; the Tool, Context, Token, and time consumed at each step must be stored, queried, and compared.

What Exactly Does the JSON in an LLM Request Contain?

What many people see is an IDE: you give the Agent a sentence, a task, and it starts executing back and forth. But how the underlying layer interacts with the large model isn't intuitive. When we do quantification, we constantly observe this process.

A single request sent to an LLM typically contains system instructions, the tools and descriptions available to the Agent, the back-and-forth interaction between the Agent and the large model, and the results produced by each interaction.

Looking at the previous Demo, we can see every step of Claude Code calling the DeepSeek model. When Claude Code is used, it first sends three consecutive System instructions, then returns the current task name. Therefore, Claude Code's first step isn't actually doing work, but getting the task title.

In the second step, Claude Code will hand over about 26 tools, plus the "Fix Bug" task we issued, to the large model. The large model returns an updated progress plan. In the third step, the large model still isn't doing real work, but continues updating the plan, similar to planning what to do first.

By the fourth step, the large model truly starts working, for example, returning which file to view first. Through this process, you can see the behavior of Claude Code's Harness: it plans the task first, clarifies what to do, and then guides subsequent steps to prevent the model from going off track. But this behavior might not work on models like DeepSeek V4 Pro. If the model isn't familiar with this Harness, it might not know how to cooperate.

If Claude Code calls its own Opus model, since the model has undergone reinforcement training, it will follow this Harness more closely. It knows what to do and is more "obedient" because both sides have trained together and cooperate more smoothly. This is also why calling DeepSeek V4 Pro yields poorer results, while calling Opus yields better ones.

Therefore, if you don't record these Traces and don't observe each step, it's hard to discover why the execution process went astray. This Demo ran about 69 steps, and every step can be observed. After visualization, a person can see the path differences; if you let AI track each step, the analysis speed will be even faster.

If the Agent Pi completes the task in 32 steps, and you hand Pi's Trace and Claude Code's Trace to a large model together, the large model can quickly analyze the gap and the reasons. The prerequisite is that you have already stored the detailed Trace data by various dimensions, including system instructions, tool descriptions, call results, etc.

Pi uses relatively few tools; the underlying Agent framework OpenCode's Pi mainly uses four types of tools: Read, Bash, Edit, and Write. Pi has its own set of Harness, with very concise system instructions, oriented towards general models, not specifically targeting Opus or GPT. Therefore, when Pi calls Opus or GPT, the gap won't be particularly large.

In contrast, Claude Code's system components are very large, divided into several blocks, each heavily customized for its own model. For example, System Reminder is an instruction for models like Opus, because the model has already been trained and knows how to follow these instructions. From this, it's clear that the Harness is a part that different Agent products need to polish over the long term for different models.

Why Agent Trace Differs from Traditional Trace

A single request sent to an LLM is essentially a large JSON. In the Demo, you can see that Claude Code's JSON is very large, while the JSON Pi gives to the large model is very short: the System instruction tells the model "You are a programming expert," plus four tools and the current task, very direct overall.

But by the second step, you'll find the JSON has become very long. This is because the system must also include the complete JSON from the first step, plus the execution result of the first step. In the third step, it must again stuff the result of the second step into the new JSON, piecing together a complete state.

Therefore, every input to the large model is the previous content plus new content, and the window gets larger and larger. Eventually, context compression is needed. After compression, it's handed back to the large model to continue execution. This is the basic mechanism of Agent execution.

So, we need to save the JSONs from the entire Session process, because they reflect the different results of each step's execution.

Traditional Trace records service call chains; a single request is typically at the second or minute level, fields come from SDKs or Instrumentation, the Schema is relatively stable, and the analysis focus is Latency, Status, and Error—very fixed and clear.

But Agent Trace is more complex. A task might last from tens of minutes to several hours, Spans are continuously appended, and state evolves across steps. Content comes from Prompts, Messages, Tool Calls, and Tool Results. The results returned by large models are often not valid JSON, and field types can drift. The analysis focus is no longer just latency and errors, but Tokens, Cost, Tool Choice, and key bifurcation points.

Therefore, Agent Trace and traditional Trace do not have a one-to-one correspondence. It's not that a vendor who does traditional Trace well can necessarily do Agent Trace well. The challenges of Agent Trace are long duration, large JSONs, and dirty data. The real trouble isn't "storing a trace chain," but handling incremental events in long-lifespan tasks, cleaning, disassembling, indexing, and aggregating dirty, nested, ever-changing large JSONs, and using them for attribution.

The Scale of a Real Agent Trace

The scale of a single real trace can be very large: a single Trace can range from 500KB to 500MB, nested 3 to 8 layers deep.

For example, an Agent Swarm is cluster-style. It first divides the work, executes in parallel after division, and then a unified Agent does scheduling and summarization. Such a single task might produce about 500MB of data, run for over ten hours, and contain over 100,000 Spans, which are the 100,000+ steps we mentioned earlier.

A Trace of this magnitude is simply impossible to handle manually. This is the pain point. A complex Trace like 500MB places high demands on the storage architecture. When dirty JSON comes in, the system needs to answer: How to clean it? How to disassemble it? How to build indexes?

Most importantly, store the Trace data first. But Trace isn't valuable just by being stored. Only after cleaning, disassembling, indexing, and aggregating can it support attribution, replay, and evaluation.

What Database Capabilities Are Needed for the Agent Trace Data Layer

To do Agent Trace well, the database needs several core capabilities.

First, the database must natively support JSON storage and retrieval.

Second, the database must have JSON cleaning and transformation capabilities. Because real Traces will have dirty JSON, the system must have powerful function capabilities to handle this data.

Third, JSON indexing must be fast enough. For JSONs of several hundred megabytes, even if they can be stored, you can't read them in full every time you query; that's too inefficient. The system needs to enable fast retrieval of trace data through common indexes. For example, fields like trace_id and model can be extracted and indexed, forming separate JSON acceleration columns.

Fourth, it must support full-text indexing within JSON. Full-text indexing is very important. For example, if a user complains that a certain task executed poorly, the system needs to find the corresponding conversation based on keywords, pull out the Trace for analysis, and finally judge if there's room for improvement.

Fifth, to meet compliance requirements, it must support JSON Path RBAC. For example, a Trace might contain sensitive information like user passwords, requiring authorization or masking by path. For instance, a certain field in a Message might be visible to the company's DA and business personnel, but not to others.

Sixth, Agent Trace data is huge, potentially generating hundreds of TBs of data per day, requiring long-term storage. Therefore, the database must support low-cost object storage.

Seventh, massive numbers of users and Agent Swarms will concurrently generate Traces, so the database also needs to support high-throughput writes.

Databend's Minimalist Trace Storage and Analysis Path

A large number of users continuously use Agents, everyone is constantly producing data, constantly writing large JSONs, which is a big challenge for the system.

Databend has chosen a minimalist Trace storage and analysis path. The traditional path might require source data, transactional databases, and data warehouses to be built independently. Databend's path is: Trace production data is continuously written to object storage S3; Databend's internal Tasks complete the ingestion and cleaning from S3, and after cleaning, write to the events table.

The entire JSON can be written into one field of one table. This table can perform full-text indexing, JSON acceleration, and dirty JSON cleaning, all within the same table. After cleaning, Stream captures new events, driving subsequent incremental computation. Aggregate Tasks automatically refresh traces, with the entire process requiring no external scheduler.

The most core aspect is that this table itself needs to possess many capabilities. For example, the JSON part will have materialized fields, requiring the creation of materialized tables; dirty data cleaning needs comprehensive function capabilities, and these capabilities can all be expressed through SQL. Databend has made many enhancements for this scenario.

First, sink the data first; don't just look at the final Pass/Fail. The original Trace needs to be retained long-term.

Second, computing power must keep up. After the data is stored, it must also be queryable and computable quickly. Databend provides acceleration columns, full-text search, and incremental aggregation capabilities.

Third, build on demand at the upper layer. Eval, Replay, and RL no longer each store a separate copy, but provide multiple upper-layer capabilities based on the same Trace data.

Summary: Trace is the Infrastructure for Agent Reliability

In summary, to do Agent trace analysis and attribution well, a Trace storage and computation foundation like Databend is needed.

It is built on object storage, VARIANT, accelerated columns, and Stream/Task, helping teams build upper-layer evaluation, replay, attribution, and training data on the same dataset.

Agent reliability won't come solely from larger models, nor from more complex Prompts. To truly move an Agent from Demo to production, you must first store every step's Trace, make it queryable and computable quickly, and be able to answer with data: which step started to go wrong, which step consumed the Tokens, and which change truly made the system better.

This is the core of "Trace is Evals."

CTA: