Agent Reliability Isn't a Model Problem — It's a Trace Problem
Teams shipping Agent products are flying blind if they only look at final pass/fail rates. A single tool-call choice at step four can fork an execution into a 17-step detour that wastes dollars and minutes, and without per-step trace data, that fork is invisible. The gap between a 3-minute Opus run and a 15-minute DeepSeek run on the same task isn't model quality—it's Harness-model mismatch, and only traces reveal it.
A three-hour Agent run that costs $12 and fails leaves no clue which step went wrong. Even a successful run hides which decisions were correct. This opacity makes Agent iteration guesswork. The core claim from Databend's work with major model providers is that full execution traces are the only reliable evidence for evaluation and the only fuel for training better models.
Agent engineering has moved through three stages: Prompt Engineering (tuning single-call inputs), Context Engineering (managing multi-turn information flow via RAG), and now Harness Engineering—the scaffolding that constrains, observes, and corrects model behavior. A comparative run of Claude Code, Evot, and Pi against DeepSeek V4 Pro shows that Claude Code's Harness is tightly coupled to its own Opus model. Swap in a third-party model and the same Harness causes the Agent to wander, burning more tokens and time. Tool names and even their casing matter because models are trained on specific Harness behaviors; a lowercase tool name where the model expects uppercase can introduce prediction drift that compounds across steps.
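The casing point above can be checked mechanically before a model swap. The following is a minimal sketch, assuming you can list the tool names your Harness exposes and the names the target model was trained to expect. The function name and all tool names here are hypothetical, chosen only for illustration.

```python
def find_casing_mismatches(harness_tools, expected_tools):
    """Return (harness_name, expected_name) pairs that differ only by letter case."""
    expected_by_lower = {name.lower(): name for name in expected_tools}
    mismatches = []
    for name in harness_tools:
        expected = expected_by_lower.get(name.lower())
        # Flag only names that match case-insensitively but not exactly.
        if expected is not None and expected != name:
            mismatches.append((name, expected))
    return mismatches


# Hypothetical tool names, for illustration only.
harness_tools = ["read", "Bash", "edit"]
expected_tools = ["Read", "Bash", "Edit"]

mismatches = find_casing_mismatches(harness_tools, expected_tools)
for harness_name, expected_name in mismatches:
    print(f"Harness exposes {harness_name!r}, model may expect {expected_name!r}")
```

A check like this only surfaces naming differences. Whether a given mismatch actually causes drift still has to be confirmed from traces.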
Storing Agent traces is not the same as traditional distributed tracing. A single Agent Swarm task can produce 500MB of deeply nested, semi-structured JSON across 100,000 spans over hours. The data is dirty, the schema drifts, and the analysis shifts from latency and errors to token cost, tool choice, and bifurcation points. Databend's approach sinks raw JSON directly into object storage, then uses VARIANT columns, accelerated indexes, full-text search, and Stream/Task pipelines to clean, index, and aggregate traces inside the database—without external schedulers.
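To make the "dirty data, drifting schema" problem concrete, here is a small Python sketch of tolerant ingestion: it skips malformed lines, walks nested spans, and copes with a token count that arrives as a string or under a different key. This is not Databend's pipeline, which does the equivalent work inside the database. The field names (`span_id`, `tool`, `tokens`, `usage_tokens`, `children`) are assumptions made for the example.

```python
import json


def to_int(value):
    """Coerce a token count that may be an int, a numeric string, or missing."""
    try:
        return int(value)
    except (TypeError, ValueError):
        return 0


def iter_spans(node):
    """Yield every span dict in a nested span tree, depth first."""
    if isinstance(node, dict):
        yield node
        yield from iter_spans(node.get("children"))
    elif isinstance(node, list):
        for item in node:
            yield from iter_spans(item)


def load_spans(lines):
    """Parse JSON lines, skipping malformed ones. Returns (flat_rows, bad_line_count)."""
    rows, bad = [], 0
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            doc = json.loads(line)
        except json.JSONDecodeError:
            bad += 1  # count it and move on instead of failing the whole batch
            continue
        for span in iter_spans(doc):
            rows.append({
                "span_id": span.get("span_id"),
                "tool": span.get("tool"),
                # Schema drift: accept either key (both names are hypothetical).
                "tokens": to_int(span.get("tokens", span.get("usage_tokens"))),
            })
    return rows, bad


raw_lines = [
    '{"span_id": "a", "tool": "Read", "tokens": 120, '
    '"children": [{"span_id": "b", "tool": "Bash", "tokens": "80"}]}',
    '{"span_id": "c", "tool": "Edit"',  # truncated line, will be skipped
    '{"span_id": "d", "tool": "Bash", "usage_tokens": 40}',
]

rows, bad_lines = load_spans(raw_lines)
total_tokens = sum(row["tokens"] for row in rows)
```

Counting bad lines rather than discarding them silently keeps the data-quality problem itself visible.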
Harness Engineering is splitting into two distinct tracks: product builders use it to constrain Agent behavior, while model providers use it as a training gym where models learn tool use and execution strategies through rollout and reward.
Cursor's Composer 2.5 achieves near-Opus quality at one-tenth the cost because it trains the model on Cursor's own usage data—effectively baking the Harness into the model weights. This makes third-party Agent builders structurally disadvantaged unless they control the training loop.
Claude Code sends three System messages before returning a task title, then spends two more steps planning before any real work. This shows that Agent Harnesses impose a hidden planning tax. A model not trained on that ritual will fight it, not benefit from it.
Agent path dependency means a single wrong tool choice at step four can cascade into a 17-step recovery loop. This is not a model hallucination problem in isolation; it's a system dynamics problem where the Harness feeds bad state back into the context window.
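One way to locate such a fork is to compare the tool-call sequence of a detoured run against a run that went well. The sketch below assumes each run has already been reduced to an ordered list of tool names. The sequences are made up for illustration, and real traces would likely also need tool arguments compared.

```python
def first_divergence(run_a, run_b):
    """Return the 0-based index of the first step where two runs differ, or None."""
    for index, (step_a, step_b) in enumerate(zip(run_a, run_b)):
        if step_a != step_b:
            return index
    # One run is a strict prefix of the other: they diverge where the shorter ends.
    if len(run_a) != len(run_b):
        return min(len(run_a), len(run_b))
    return None


# Hypothetical tool-call sequences, for illustration only.
baseline = ["Read", "Grep", "Read", "Edit", "Bash"]
detour = baseline[:3] + ["Bash"] + ["Read", "Bash"] * 8

fork = first_divergence(baseline, detour)
steps_after_fork = len(detour) - fork
print(f"Runs diverge at step {fork + 1}; detour spends {steps_after_fork} steps from there")
```

The fork index only says where the paths split. Whether the detoured choice was actually wrong still needs the step's inputs and outputs from the trace.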
Traditional observability vendors are poorly positioned for Agent traces because their systems assume stable schemas and short-lived spans. Agent data is the opposite: multi-hour sessions, malformed JSON, and analysis that cares about semantic choices, not just status codes.
Storing traces is cheap; making them queryable for attribution is the hard part. The value isn't in the raw JSON lake—it's in the indexing, cleaning, and aggregation pipeline that turns 500MB blobs into answerable questions about cost and correctness.
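As an illustration of the kind of question a cleaned trace can answer, the sketch below attributes cost to tools and to everything from a given step onward. It assumes spans have already been flattened into rows with `step`, `tool`, `input_tokens`, and `output_tokens` fields. Those field names and the prices are hypothetical, not taken from any provider's price list.

```python
from collections import defaultdict

# Hypothetical prices in USD per million tokens; real prices vary by provider.
PRICE_PER_MTOK = {"input": 3.0, "output": 15.0}


def span_cost(span):
    """Cost of one span in USD under the assumed prices."""
    return (
        span["input_tokens"] * PRICE_PER_MTOK["input"]
        + span["output_tokens"] * PRICE_PER_MTOK["output"]
    ) / 1_000_000


def cost_by_tool(spans):
    """Total cost per tool name."""
    totals = defaultdict(float)
    for span in spans:
        totals[span["tool"]] += span_cost(span)
    return dict(totals)


def cost_from_step(spans, first_step):
    """Total cost of every span at or after the given 1-based step."""
    return sum(span_cost(span) for span in spans if span["step"] >= first_step)


# Made-up rows, for illustration only.
spans = [
    {"step": 1, "tool": "Read", "input_tokens": 10_000, "output_tokens": 500},
    {"step": 2, "tool": "Bash", "input_tokens": 20_000, "output_tokens": 1_000},
    {"step": 3, "tool": "Bash", "input_tokens": 40_000, "output_tokens": 2_000},
    {"step": 4, "tool": "Edit", "input_tokens": 50_000, "output_tokens": 1_000},
]

per_tool = cost_by_tool(spans)
tail_cost = cost_from_step(spans, first_step=3)
```

Combined with a known fork step, `cost_from_step` gives a rough price for a detour.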