Agent · Database · Big Data

Agent Reliability Isn't a Model Problem — It's a Trace Problem

By Databend
Read original on juejin.cn

Teams shipping Agent products are flying blind if they only look at final pass/fail rates. A single tool-call choice at step four can fork an execution into a 17-step detour that wastes dollars and minutes, and without per-step trace data, that fork is invisible. The gap between a 3-minute Opus run and a 15-minute DeepSeek run on the same task isn't model quality—it's Harness-model mismatch, and only traces reveal it.

Summary

A three-hour Agent run that costs $12 and fails leaves no clue which step went wrong. Even a successful run hides which decisions were correct. This opacity makes Agent iteration guesswork. The core claim from Databend's work with major model providers is that full execution traces are the only reliable evidence for evaluation and the only fuel for training better models.

Agent engineering has moved through three stages: Prompt Engineering (tuning single-call inputs), Context Engineering (managing multi-turn information flow via RAG), and now Harness Engineering, the scaffolding that constrains, observes, and corrects model behavior. A comparative run of three Harnesses (Claude Code, Evot, and Pi), each driving DeepSeek V4 Pro, shows that Claude Code's Harness is tightly coupled to its own Opus model. Swap in a third-party model and the same Harness causes the Agent to wander, burning more tokens and time. Tool names and even their casing matter because models are trained on specific Harness behaviors; a lowercase tool name where the model expects uppercase can introduce prediction drift that compounds across steps.
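The casing point can be checked mechanically before a run. The following is a minimal sketch, assuming you have a list of the tool names a Harness exposes and a list of the names a given model is believed to expect; both lists here are hypothetical and are not taken from any real Harness or model.

```python
def case_mismatches(harness_tools, expected_tools):
    """Map each Harness tool name to the expected name when they differ only by case."""
    expected_by_lower = {name.lower(): name for name in expected_tools}
    mismatches = {}
    for name in harness_tools:
        expected = expected_by_lower.get(name.lower())
        if expected is not None and expected != name:
            mismatches[name] = expected
    return mismatches


# Hypothetical tool names, for illustration only.
harness_tools = ["read", "Bash", "edit"]
expected_tools = ["Read", "Bash", "Edit"]

drift = case_mismatches(harness_tools, expected_tools)
print(drift)
```

A check like this only catches case-only differences; whether a mismatch actually degrades a run still has to be confirmed by comparing traces.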

Storing Agent traces is not the same as traditional distributed tracing. A single Agent Swarm task can produce 500MB of deeply nested, semi-structured JSON across 100,000 spans over hours. The data is dirty, the schema drifts, and the analysis shifts from latency and errors to token cost, tool choice, and bifurcation points. Databend's approach sinks raw JSON directly into object storage, then uses VARIANT columns, accelerated indexes, full-text search, and Stream/Task pipelines to clean, index, and aggregate traces inside the database—without external schedulers.
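To make the shift in analysis concrete, here is a minimal sketch in plain Python (not Databend SQL) of the kind of cleaning and aggregation described above: walk a nested span tree, coerce dirty token counts, and total tokens per tool. The span shape and the field names `tool`, `tokens`, and `children` are assumptions for illustration, not a documented trace schema.

```python
import json
from collections import defaultdict

# Hypothetical trace: nested spans with inconsistent token values (int, string, null).
RAW_TRACE = """
{"name": "root", "tokens": 120, "children": [
  {"name": "plan", "tool": "read_file", "tokens": 300, "children": []},
  {"name": "act", "tool": "bash", "tokens": "450", "children": [
    {"name": "retry", "tool": "bash", "tokens": null, "children": []}
  ]}
]}
"""


def iter_spans(span):
    """Yield a span and all of its descendants, depth first."""
    yield span
    for child in span.get("children") or []:
        yield from iter_spans(child)


def clean_tokens(value):
    """Coerce a dirty token count to a non-negative int; unusable values become 0."""
    try:
        return max(int(value), 0)
    except (TypeError, ValueError):
        return 0


def tokens_by_tool(trace):
    """Sum cleaned token counts per tool across the whole span tree."""
    totals = defaultdict(int)
    for span in iter_spans(trace):
        tool = span.get("tool") or "(no tool)"
        totals[tool] += clean_tokens(span.get("tokens"))
    return dict(totals)


totals = tokens_by_tool(json.loads(RAW_TRACE))
print(totals)
```

At 100,000 spans per task this logic belongs in the storage layer rather than in a script, which is the motivation for doing the cleaning and aggregation inside the database.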

Takeaways
Agent engineering has progressed from Prompt Engineering to Context Engineering to Harness Engineering, where the execution scaffolding is the product differentiator.
Claude Code's Harness is deeply coupled to Opus; calling a third-party model through it can increase step count by 2x and execution time by 5x.
Tool name casing affects model performance because models are trained on specific Harness signatures, and a case mismatch introduces token-prediction drift.
A single Agent Swarm task can generate 500MB of trace data with 100,000+ spans, requiring a storage layer that handles dirty, deeply nested JSON at scale.
Agent Trace differs from traditional APM tracing: spans last hours, schemas drift, and the analysis targets token cost, tool choice, and path bifurcation, not just latency and errors.
Databend's trace pipeline writes raw JSON to S3, then uses internal Tasks, Streams, and VARIANT columns to clean, index, and aggregate traces without external orchestration.
Full-text search over trace JSON lets teams pull the exact conversation behind a user complaint and analyze whether the execution path can be improved.
JSON Path RBAC enables access control at the level of individual JSON paths within trace data, so sensitive fields like passwords can be masked per role.
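The per-role masking idea in the last takeaway can be sketched as follows. This is an illustration of the concept in plain Python, not Databend's implementation or syntax; the policy table, role names, and trace fields are all hypothetical.

```python
import copy

# Hypothetical policy: role -> JSON paths (as key tuples) that the role must not see.
MASKED_PATHS = {
    "analyst": [("tool_call", "args", "password"), ("user", "email")],
    "admin": [],
}


def mask_for_role(record, role, placeholder="***"):
    """Return a copy of record with the role's restricted paths masked.

    An unknown role raises KeyError, so access fails closed.
    """
    masked = copy.deepcopy(record)
    for path in MASKED_PATHS[role]:
        node = masked
        for key in path[:-1]:
            node = node.get(key) if isinstance(node, dict) else None
            if node is None:
                break
        if isinstance(node, dict) and path[-1] in node:
            node[path[-1]] = placeholder
    return masked


span = {
    "tool_call": {"name": "db_login", "args": {"user": "svc", "password": "hunter2"}},
    "user": {"email": "a@example.com"},
}

analyst_view = mask_for_role(span, "analyst")
admin_view = mask_for_role(span, "admin")
print(analyst_view)
```

The original record is never mutated, so the same stored trace can serve several roles with different views.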
Conclusions

Harness Engineering is splitting into two distinct tracks: product builders use it to constrain Agent behavior, while model providers use it as a training gym where models learn tool use and execution strategies through rollout and reward.

Cursor's Composer 2.5 achieves near-Opus quality at one-tenth the cost because it trains the model on Cursor's own usage data—effectively baking the Harness into the model weights. This makes third-party Agent builders structurally disadvantaged unless they control the training loop.

Claude Code sends three System messages before returning a task title, then spends two more steps planning before any real work. This shows that Agent Harnesses impose a hidden planning tax: a model not trained on that ritual will fight it rather than benefit from it.

Agent path dependency means a single wrong tool choice at step four cascades into a 17-step recovery loop. This is not a model hallucination problem in isolation; it's a system dynamics problem where the Harness feeds bad state back into the context window.
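Locating such a fork is a comparison of two traces step by step. The sketch below assumes each run has already been reduced to an ordered list of (tool, argument) pairs; that reduction and the sample runs are hypothetical, chosen so the fork lands at step four as in the example above.

```python
from itertools import zip_longest


def first_divergence(run_a, run_b):
    """Return the 1-based step where two runs first differ, or None if identical."""
    for step, (a, b) in enumerate(zip_longest(run_a, run_b), start=1):
        if a != b:
            return step
    return None


# Hypothetical runs of the same task: identical for three steps, then a fork.
fast_run = [
    ("read_file", "main.py"),
    ("grep", "TODO"),
    ("read_file", "util.py"),
    ("edit", "util.py"),
    ("bash", "pytest"),
]
slow_run = [
    ("read_file", "main.py"),
    ("grep", "TODO"),
    ("read_file", "util.py"),
    ("bash", "find . -name '*.py'"),
    ("read_file", "setup.py"),
    ("read_file", "README.md"),
    ("edit", "util.py"),
]

fork_step = first_divergence(fast_run, slow_run)
print(fork_step)  # 4
```

Because `zip_longest` pads the shorter run with `None`, a run that simply stops early is also reported as diverging at the first missing step.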

Traditional observability vendors are poorly positioned for Agent Trace because their systems assume stable schemas and short-lived spans. Agent data is the opposite: multi-hour sessions, malformed JSON, and analysis that cares about semantic choices, not just status codes.

Storing traces is cheap; making them queryable for attribution is the hard part. The value isn't in the raw JSON lake—it's in the indexing, cleaning, and aggregation pipeline that turns 500MB blobs into answerable questions about cost and correctness.

Concepts & terms
Harness Engineering
The third phase of Agent engineering, focused on the execution scaffolding around a model: state maintenance, tool mediation, feedback injection, constraint enforcement, and progress verification. It treats the model as a component to be guided and corrected, not just prompted.
Agent Trace
A full, step-by-step record of an Agent's interaction with an LLM, including system instructions, tool definitions, messages, tool calls, and results. Unlike traditional distributed traces, Agent traces span hours, contain deeply nested JSON, and are analyzed for token cost, tool choice, and path bifurcation rather than just latency and errors.
Path Dependency
In Agent execution, each step's output becomes the next step's input. A single tool-call choice or context-trimming decision alters all subsequent steps, creating divergent execution paths that can only be understood by comparing full traces.
VARIANT
A semi-structured data type in Databend that stores JSON natively, allowing querying, indexing, and transformation of nested JSON documents directly within SQL without pre-defining a schema.
JSON Path RBAC
Role-based access control applied at the level of individual JSON paths, enabling fine-grained permissions where specific fields within a trace (such as passwords or PII) can be masked or restricted per user role.
Agent Swarm
A multi-Agent architecture where a coordinating Agent divides work among parallel sub-Agents, then aggregates their results. A single Swarm task can produce around 500MB of trace data across roughly 100,000 execution spans.