跪拜 Guibai
← All articles
Frontend · Backend · AI Programming

A Self-Evolving Harness That Finds Its Own Bugs and Submits Fix PRs

By 谭sir ·
Read original on juejin.cn ↗ Google Translate ↗ Alt translation

Agent products degrade silently because Harness bugs don't always crash—they waste tokens, return wrong results, or fail intermittently. A self-inspecting Harness catches these regressions before users report them and generates concrete, reviewable PRs, turning a passive support queue into an automated maintenance loop.

Summary

Most AI agent failures originate in the Harness—the engineering layer that manages tool calls, context windows, and error recovery. A new open-source demo, evo-agent-demo, shows how to instrument that layer with structured tracing, then feed the trace data into an LLM-powered inspection pipeline that buckets errors, generates classification patterns, and backfills historical data. The system distinguishes provider errors from user errors from genuine Harness bugs, and for the latter, it spawns a fix agent that reads source files, applies patches, and submits a PR.

Beyond crash-level errors, a behavior analyzer clusters successful operations by intent and flags unhealthy patterns: a simple question that burns eight tool calls, a database tool with a 65% success rate, a research task consuming 10× the normal token budget. Each flagged group gets a Harness-level improvement suggestion, and critical ones enter the same auto-fix queue as bugs.

The architecture is language-agnostic and the cost is modest—roughly ¥30–80 per month on DeepSeek for hourly inspections and daily auto-fixes. A planned replay engine would close the loop by verifying that a generated PR actually fixes the bug before a human ever sees it, using recorded LLM responses and tool outputs to turn the agent loop into a pure function for deterministic testing.

Takeaways
Harness bugs are the dominant failure mode in AI agents, not model hallucinations, and they require structured tracing to diagnose.
Tracing must write each step to the database immediately; buffering in memory loses the most valuable data when a crash occurs.
Error bucketing groups failures by provider, error type, status code, tool name, and message before sending summaries to an LLM for pattern generation.
Generated patterns classify errors into four categories: user_error, provider_error, harness_bug, and ignore—only harness_bug triggers auto-fix.
Auto-fix uses the same agent-loop architecture as the chat product, equipping an LLM with file-search, read, write, and commit tools.
Behavior analysis clusters successful operations by intent and scores them on duration, step count, token usage, success rate, and tool error rate.
Unhealthy behavior groups get Harness-level improvement suggestions; those marked critical enter the same auto-fix pipeline as bugs.
A replay engine can verify fixes deterministically by substituting recorded LLM responses and tool outputs, turning the agent loop into a pure function.
LobeHub's production data shows pattern count saturating after 9 inspection rounds and agent success rate climbing from 75% to over 95%.
Monthly LLM costs for the full inspection–analysis–fix pipeline run roughly ¥30–80 on DeepSeek, with auto-fix being the most expensive step.
Conclusions

Structured tracing is not observability theater—it is the raw material for an automated maintenance loop, and without immediate writes, crash forensics are impossible.

Error classification that separates provider flakiness from genuine code defects prevents alert fatigue and keeps the auto-fix queue focused on actionable items.

Using the same agent-loop architecture for both the product and the fix agent means the system debugs itself with the same tools it offers users, which is elegant but also means any loop-level bug can impair the fixer too.

Behavior analysis on successful operations catches waste that error logs never see: a 15-second, 3-tool answer to 'introduce yourself' is a Harness problem, not a model problem.

The three-level maturity model (manual → assisted → led) gives teams a gradual on-ramp; just adding tracing moves a project from L1 to L2 overnight.

A replay engine that substitutes recorded LLM outputs makes Harness testing deterministic—the model's non-determinism is removed from the equation, isolating the engineering layer.

Accumulating a regression corpus from every merged fix PR creates a growing safety net that prevents previously patched bugs from returning, with zero additional developer effort per bug.

Concepts & terms
Harness
The engineering layer that wraps an LLM and manages tool calls, context windows, error recovery, and the agent loop—analogous to an operating system for the model.
Agent Loop
The cyclic process where an LLM receives a user message, decides which tool to call, receives the tool's output, and repeats until it produces a final response.
Tracing
Structured, step-level recording of every LLM call and tool invocation within an agent operation, including inputs, outputs, token counts, duration, and errors.
Error Pattern
A database-stored classification rule with match conditions (provider, status code, error type, message regex) that maps a recurring error to a category like harness_bug or provider_error.
Error Bucketing
Grouping raw errors by provider, error type, status code, tool name, and message before sending summarized buckets to an LLM for pattern generation, avoiding redundant analysis.
Behavior Analysis
Clustering successful agent operations by user intent and scoring each cluster on health metrics (duration, steps, tokens, success rate) to detect inefficiency that never produces an error.
Replay Engine
A verification mechanism that feeds recorded LLM responses and tool outputs back through the agent loop, turning Harness execution into a deterministic pure function for testing fixes.
Safety Gate
A pre-push verification step in the auto-fix pipeline that replays the bug-triggering trace against both unpatched and patched code to confirm the fix works before a PR is submitted.
Source: juejin.cn ↗ Google Translate ↗ Backup ↗