A Self-Evolving Harness That Finds Its Own Bugs and Submits Fix PRs
Agent products degrade silently because Harness bugs don't always crash—they waste tokens, return wrong results, or fail intermittently. A self-inspecting Harness catches these regressions before users report them and generates concrete, reviewable PRs, turning a passive support queue into an automated maintenance loop.
Most AI agent failures originate in the Harness—the engineering layer that manages tool calls, context windows, and error recovery. A new open-source demo, evo-agent-demo, shows how to instrument that layer with structured tracing, then feed the trace data into an LLM-powered inspection pipeline that buckets errors, generates classification patterns, and backfills historical data. The system distinguishes provider errors from user errors from genuine Harness bugs, and for the latter, it spawns a fix agent that reads source files, applies patches, and submits a PR.
Beyond crash-level errors, a behavior analyzer clusters successful operations by intent and flags unhealthy patterns: a simple question that burns eight tool calls, a database tool with a 65% success rate, a research task consuming 10× the normal token budget. Each flagged group gets a Harness-level improvement suggestion, and critical ones enter the same auto-fix queue as bugs.
The architecture is language-agnostic and the cost is modest—roughly ¥30–80 per month on DeepSeek for hourly inspections and daily auto-fixes. A planned replay engine would close the loop by verifying that a generated PR actually fixes the bug before a human ever sees it, using recorded LLM responses and tool outputs to turn the agent loop into a pure function for deterministic testing.
Structured tracing is not observability theater—it is the raw material for an automated maintenance loop, and without immediate writes, crash forensics are impossible.
Error classification that separates provider flakiness from genuine code defects prevents alert fatigue and keeps the auto-fix queue focused on actionable items.
Using the same agent-loop architecture for both the product and the fix agent means the system debugs itself with the same tools it offers users, which is elegant but also means any loop-level bug can impair the fixer too.
Behavior analysis on successful operations catches waste that error logs never see: a 15-second, 3-tool answer to 'introduce yourself' is a Harness problem, not a model problem.
The three-level maturity model (manual → assisted → led) gives teams a gradual on-ramp; just adding tracing moves a project from L1 to L2 overnight.
A replay engine that substitutes recorded LLM outputs makes Harness testing deterministic—the model's non-determinism is removed from the equation, isolating the engineering layer.
Accumulating a regression corpus from every merged fix PR creates a growing safety net that prevents previously patched bugs from returning, with zero additional developer effort per bug.