A Self-Evolving Harness That Finds Its Own Bugs and Submits Fix PRs

Author: 谭sir
Tags: Frontend, Backend, AI Programming

This article introduces how to give the engineering code (Harness) of an AI Agent the ability to self-evolve—automatically recording runtime data, automatically identifying error patterns, and automatically generating fix PRs (Pull Requests). The content covers monitoring, error pattern recognition, automatic fixes, behavior analysis, and production implementation plans. Each chapter will be explained with the code and running results of the demo project evo-agent-demo.

Starting with a Bug

Suppose you have built an AI Agent product that can search for information, query databases, and execute code. Before going live, it was repeatedly tested internally, and no problems were found, so it was officially launched.

But after the product had been live for a while, error reports started coming in. Internal testing cannot cover every usage scenario: real users far outnumber the team's testers, and their phrasing, habits, and edge-case operations go well beyond the test cases. For example, customer service received complaints like "I asked a question, and the AI just reported an error," someone posted on a tech forum that the program crashed after a multi-turn conversation, and the operations team reported that search worked only intermittently.

Discovering problems through user feedback is passive: by the time a report arrives, many users have already been affected. What's worse for developers, a single feedback report rarely pinpoints the real cause of the bug, because it carries no execution context from when the problem occurred.

In most AI Agent products, the bugs mostly appear in the layer of engineering code that wraps the model. This layer has a specific name: Harness. The Harness is the engineering system that drives the model's operation. An analogy: the model is the CPU, and the Harness is the operating system. The operating system manages process scheduling and memory allocation, while the Harness manages tool calls, context windows, and error recovery.

From Passive to Active

Since the problem lies in the Harness, we need to know exactly what happens each time the Harness executes. We can add a layer of Tracing to the Agent, which is like installing a dashcam on the Agent. Every time the Agent executes, what tools it called, what parameters were passed, what the response results were, etc., are all recorded in the database. This way, we don't have to wait for user feedback; we can check the database ourselves to see which requests had problems.

But a product can generate thousands or tens of thousands of requests a day, with many errors among them, so reviewing them manually one by one is unrealistic. The system therefore also needs to group errors by type automatically (for example, putting all 429 rate limit errors together) and then determine whether each type is a user problem, a provider problem, or a bug in the Harness itself. If it is a Harness bug, the LLM analyzes the relevant source code based on the error cause, generates fix code, and finally submits a PR. The rest of this article follows this order, using the code and running results of evo-agent-demo for explanation.

Quick Demo Experience

Let's run evo-agent-demo once to see the effect, and later we will break down the implementation of each step:

git clone https://github.com/woai3c/evo-agent-demo
cd evo-agent-demo && pnpm install
cp .env.example .env          # Fill in at least one LLM API Key
pnpm db:seed                  # Populate demo data
pnpm simulate --mock          # Simulate 100 user requests (including various errors, no API calls)
pnpm dev                      # Start frontend and backend development servers

After the project starts, open the http://localhost:5173/admin/inspections page and click the "Inspect: Identify Patterns" button to observe the log panel outputting the inspection progress in real time. After the inspection is complete, you can switch to the overview page to see the changes in various metrics.

Quickly Understanding the Full Picture of AI Agent Harness Engineering

Readers with AI Agent development experience can skip this section and start directly from "Tracing."

This section introduces what components an Agent product consists of and what each is responsible for, giving readers who haven't been exposed to AI Agent development an overall concept first.

What is an AI Agent

Simply put, an AI Agent is an LLM application that can call tools to perform tasks. Ordinary chatbots can only reply with text, but an Agent can also perform tasks like searching, querying databases, and modifying code. The chatbot's process is "user speaks → model replies → end," while the Agent's is "user speaks → model decides what tool to call → gets results → continues thinking → may call more tools → ... → finally replies to the user." This looping process is called the Agent Loop.

Let's use the demo in this article (codenamed Evo) as an example. The user asks "Help me research the Hono framework," the Agent will first call the webSearch tool to search for information, then select a few URLs from the results, then call the webFetch tool to fetch the page content, and finally synthesize the information to write a summary. Throughout the process, the model autonomously decides the choice and order of tool calls, without human intervention.

Let's break down the internal process with a sequence diagram:

sequenceDiagram
    autonumber
    actor U as User
    participant AL as Agent Loop
    participant LLM as LLM API
    participant T as Tool

    U->>AL: "Help me research the Hono framework"
    AL->>LLM: Send message + tool list
    LLM-->>AL: tool_call: webSearch("Hono framework 2025")
    AL->>T: Execute webSearch
    T-->>AL: Return search results
    AL->>LLM: Pass tool results back
    LLM-->>AL: tool_call: webFetch("https://hono.dev")
    AL->>T: Execute webFetch
    T-->>AL: Return page content
    AL->>LLM: Pass tool results back
    LLM-->>AL: Generate text reply (streaming output)
    AL->>U: Stream display reply

The Agent Loop is the dispatch center between the user, model, and tools. The user's message is given to the model, the model decides what tool to call, the Agent Loop executes the tool and returns the results to the model, and the model decides whether to continue calling tools or reply directly to the user. This loop might turn once or ten times, depending on the complexity of the task.
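
The dispatch cycle described above can be sketched in a few lines. This is a minimal illustration, not the demo's implementation: the model is replaced by a scripted function, and `ModelTurn`, `runAgentLoop`, `scriptedModel`, and `demoTools` are hypothetical names introduced only for this example.

```typescript
// One model turn: either a request to call a tool, or a final text reply.
type ModelTurn =
  | { kind: "tool_call"; toolName: string; input: string }
  | { kind: "text"; text: string };

type ToolFn = (input: string) => string;
// Stand-in for an LLM: decides the next turn from the transcript so far.
type ModelFn = (transcript: string[]) => ModelTurn;

function runAgentLoop(
  userMessage: string,
  model: ModelFn,
  tools: Record<string, ToolFn>,
  maxSteps = 15, // step cap to prevent infinite loops
): { reply: string; toolCalls: string[] } {
  const transcript: string[] = [`user: ${userMessage}`];
  const toolCalls: string[] = [];

  for (let step = 0; step < maxSteps; step++) {
    const turn = model(transcript);
    if (turn.kind === "text") {
      return { reply: turn.text, toolCalls }; // the model chose to answer: loop ends
    }
    const tool = tools[turn.toolName];
    // Unknown tools are reported back to the model instead of throwing.
    const output = tool ? tool(turn.input) : `error: unknown tool ${turn.toolName}`;
    toolCalls.push(turn.toolName);
    transcript.push(`tool ${turn.toolName}(${turn.input}) -> ${output}`);
  }
  return { reply: "Stopped: step limit reached.", toolCalls };
}

// Scripted model: search first, then fetch, then answer.
const scriptedModel: ModelFn = (transcript) => {
  const toolTurns = transcript.filter((line) => line.startsWith("tool ")).length;
  if (toolTurns === 0) return { kind: "tool_call", toolName: "webSearch", input: "Hono framework" };
  if (toolTurns === 1) return { kind: "tool_call", toolName: "webFetch", input: "https://hono.dev" };
  return { kind: "text", text: "Here is a summary of Hono." };
};

const demoTools: Record<string, ToolFn> = {
  webSearch: () => "https://hono.dev",
  webFetch: () => "<page content>",
};

const loopResult = runAgentLoop("Help me research the Hono framework", scriptedModel, demoTools);
console.log(loopResult.toolCalls.join(" -> "), "|", loopResult.reply);
```

The loop itself contains no task-specific logic: which tool runs next is decided entirely by the model's turn, and the loop only executes it and feeds the result back.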

Core Components of an Agent

An Agent product roughly consists of these parts:

    ┌──────────────────────────────────────────────┐
    │                   Harness                    │
    │                                              │
    │  ┌──────────────┐  ┌──────────────────────┐  │
    │  │  Agent Loop  │  │     Tool System      │  │
    │  │ (Scheduler)  │  │ (Search / DB /       │  │
    │  │              │  │  Code Exec / ...)    │  │
    │  └──────┬───────┘  └──────────┬───────────┘  │
    │         │                     │              │
    │  ┌──────┴─────────────────────┴───────────┐  │
    │  │      Context Management (Memory)       │  │
    │  └────────────────────────────────────────┘  │
    │                                              │
    │  ┌────────────────────────────────────────┐  │
    │  │           Tracing (Dashcam)            │  │
    │  └────────────────────────────────────────┘  │
    │                                              │
    │  ┌────────────────────────────────────────┐  │
    │  │  Evolution Engine (Self-Improvement)   │  │
    │  └────────────────────────────────────────┘  │
    └──────────────────────────────────────────────┘
                           ↕
                      ┌─────────┐
                      │   LLM   │  ← Replaceable (DeepSeek / OpenAI / Anthropic / ...)
                      └─────────┘

Harness is all the engineering code that wraps the model. The model provides reasoning ability, and the Harness is responsible for turning that ability into a usable product.

The Agent Loop is the scheduler, driving the "ask model → call tool → ask model again" loop. In the demo, the entry point for this loop is the agentLoop() function (packages/server/src/agent/loop.ts), which sends streaming requests to the model via the Vercel AI SDK and consumes the events returned by the model as they arrive. When the model requests a tool call, the SDK automatically executes the corresponding tool and passes the results back. To prevent infinite loops, a single conversation in this demo executes at most 15 steps (a production environment would typically allow far more than 15).

The tool system defines what the Agent can do. In the Evo demo, there are 6 tools: web search, fetch webpage, read file, execute code, query database, send email. Each tool consists of three parts: a natural language description for the model to see (the model decides whether to call this tool based on the description), a set of parameter formats defined with Zod Schema, and a function that actually performs the operation.

Context management solves the token limit problem. Every time the Agent calls a tool, the tool's input and output must be placed into the conversation history and sent to the model. After 10 rounds of conversation plus a few web searches, the Context Window might be nearly full. The demo uses two methods to deal with this: message compression (packages/server/src/context/compression.ts)—when the token count of the conversation history exceeds 70% of the window capacity, old messages are automatically compressed into a summary; tool result truncation (packages/server/src/context/truncation.ts)—a webpage returned by a search might be dozens of KB, only the first 30,000 characters plus a tail segment are kept.
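
The two context-management rules above can be sketched as follows. This is an illustration under stated assumptions rather than the demo's code: the 30,000-character head and the 70% threshold come from the text, while the tail length (`TAIL_CHARS`), the marker format, and the function names are assumptions made for this example.

```typescript
const HEAD_CHARS = 30_000;      // from the article: keep the first 30,000 characters
const TAIL_CHARS = 2_000;       // assumed: the article only says "a tail segment"
const COMPRESSION_RATIO = 0.7;  // from the article: compress above 70% of the window

// Keep the head and tail of an oversized tool result and note how much was dropped.
function truncateToolResult(text: string, headChars = HEAD_CHARS, tailChars = TAIL_CHARS): string {
  if (text.length <= headChars + tailChars) return text; // small enough: unchanged
  const omitted = text.length - headChars - tailChars;
  const head = text.slice(0, headChars);
  const tail = text.slice(text.length - tailChars);
  return `${head}\n[... ${omitted} characters omitted ...]\n${tail}`;
}

// Decide whether old messages should be compressed into a summary.
function shouldCompress(historyTokens: number, contextWindowTokens: number): boolean {
  return historyTokens > contextWindowTokens * COMPRESSION_RATIO;
}

const hugePage = "H".repeat(30_000) + "M".repeat(50_000) + "T".repeat(2_000);
const truncatedPage = truncateToolResult(hugePage);
console.log(truncatedPage.length, shouldCompress(46_000, 65_536));
```

Keeping a tail segment as well as the head is useful because the end of a page or command output often carries the conclusion or the error message.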

Tracing records the execution data of each step—what tool was called, how many tokens were spent, how long it took, whether it succeeded. This data is the foundation for all subsequent analysis, detailed in the next section.

The evolution engine automatically discovers errors from Tracing data, classifies causes, and generates fix code. Its three parts—error Pattern recognition, automatic fixing, and behavior analysis—will be discussed separately later.

Demo Project Overview

Evo is a pnpm monorepo containing three packages:

packages/shared (@evo/shared) — TypeScript types and constants, shared by server and web
packages/server (@evo/server) — Hono + Node.js backend, containing Agent Loop, tool system, Tracing, evolution engine, REST API
packages/web (@evo/web) — Vite + React 19 + Tailwind + shadcn/ui frontend, containing chat interface and admin dashboard

The database uses better-sqlite3, LLM integration uses Vercel AI SDK, supporting six providers: DeepSeek, OpenAI, Anthropic, Alibaba Bailian, Zhipu, Moonshot, with DeepSeek as the default.

Key code locations: packages/server/src/agent/loop.ts is the entry point for the Agent Loop, packages/server/src/evolution/ is the full implementation of the evolution engine. Readers can refer to these files for the complete code while reading the subsequent content.

Differences Between Demo and Production Environment

The demo has made many simplifications. The main differences from a production environment are:

Dimension	Demo Implementation	Production Alternative
Inspection Trigger	Manually click "Start Inspection" in admin dashboard	Scheduled task (every 1-2 hours), analyzing newly added errors from the previous cycle
Auto-fix Trigger	Manually click "Auto Fix" in admin dashboard	Scheduled task (daily at midnight), generating PRs for unfixed harness_bugs
Database	SQLite single file, suitable for local development	PostgreSQL / MySQL, supporting concurrency and multiple instances
Simulated Data	`pnpm simulate --mock` batch injection	Trace data generated by real user traffic
Error Injection	`simulate-errors.ts` randomly injects errors	Errors occurring naturally in the production environment
PR Submission	Submit to local Git repository	Submit to code repository via GitHub/GitLab API, going through CI/CD (Continuous Integration/Continuous Deployment) process

All operations in the demo are manually triggered: click a button to run an inspection, click a button to generate a fix PR. This lets you see the input and output of each step as it happens. In a production environment, scheduled tasks should run these operations automatically, calling the same functions that the buttons trigger in the demo.

Tracing: Installing a "Dashcam" on the Agent

The purpose of a dashcam is to watch the footage when an accident happens, and to review driving habits even when nothing goes wrong—Tracing in an Agent system serves a similar purpose. However, unlike regular logs, Tracing records structured data, and this data is continuously consumed by the subsequent automatic analysis programs, not just looked at when problems occur.

Trace Data Structure

Let's take an example. The demo has a built-in music library as sample data. Suppose a user asks "Which singer has the most albums," Tracing will record the entire execution process of the Agent:

    Operation: op_abc123
      ├─ Step 0: call_llm  → Model decides to call dbQuery     (Duration 800ms, tokens: 1200)
      ├─ Step 1: call_tool → dbQuery executes SQL        (Duration 50ms, success)
      ├─ Step 2: call_llm  → Model organizes answer based on query results (Duration 600ms, tokens: 900)
      └─ Result: success, total duration 1450ms

The entire process is broken down into one Operation and several Steps. An Operation corresponds to a complete user request, recording summary information like total duration, token usage, and final status. Each Step corresponds to one LLM call or one tool call within the Agent Loop.

Each Step records the following fields:

Field	Description
`type`	Step type: `call_llm` (calling the model) or `call_tool` (calling a tool)
`duration_ms`	Duration of this step (milliseconds)
`tokens`	Number of tokens consumed (only for `call_llm` steps), includes three components: input, output, cached
`tool_name`	Name of the tool called (only for `call_tool` steps)
`tool_success`	Whether the tool call was successful (boolean)
`tool_input`	Input parameters for the tool call
`tool_output`	The complete result returned by the tool
`error`	If an error occurred, detailed error information
`llm_response`	The complete text response from the model (only for `call_llm` steps)

tool_input and tool_output retain the complete context of the tool call. In the admin panel, you can view the input and output of each step, making it easier to troubleshoot problems. llm_response retains the model's reply, which the subsequent replay engine needs to reproduce the execution process.
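
As a rough sketch, the Step fields in the table above could be typed as shown below, with an Operation summary derived from its steps. The field names follow the table; the `OperationSummary` shape, the rule that any step error marks the whole operation as failed, and the per-step token split are assumptions for illustration only.

```typescript
// Field names follow the Step table; optional fields apply to one step type only.
type TraceStep = {
  type: "call_llm" | "call_tool";
  duration_ms: number;
  tokens?: { input: number; output: number; cached: number }; // call_llm only
  tool_name?: string;     // call_tool only
  tool_success?: boolean; // call_tool only
  tool_input?: unknown;
  tool_output?: unknown;
  error?: string;
  llm_response?: string;  // call_llm only
};

// Hypothetical summary shape for one Operation (one complete user request).
type OperationSummary = {
  status: "success" | "error";
  totalDurationMs: number;
  totalTokens: number;
  stepCount: number;
};

function summarizeOperation(steps: TraceStep[]): OperationSummary {
  let totalDurationMs = 0;
  let totalTokens = 0;
  let failed = false;
  for (const step of steps) {
    totalDurationMs += step.duration_ms;
    // Assumption: cached tokens are a subset of input tokens, so they are not added again.
    if (step.tokens) totalTokens += step.tokens.input + step.tokens.output;
    // Assumption: any recorded step error marks the operation as failed.
    if (step.error !== undefined) failed = true;
  }
  return { status: failed ? "error" : "success", totalDurationMs, totalTokens, stepCount: steps.length };
}

// The "Which singer has the most albums" example; the input/output split is made up.
const albumQuerySteps: TraceStep[] = [
  { type: "call_llm", duration_ms: 800, tokens: { input: 1000, output: 200, cached: 0 } },
  { type: "call_tool", duration_ms: 50, tool_name: "dbQuery", tool_success: true },
  { type: "call_llm", duration_ms: 600, tokens: { input: 800, output: 100, cached: 0 } },
];
const albumQuerySummary = summarizeOperation(albumQuerySteps);
console.log(albumQuerySummary);
```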

How to Integrate Tracing

Source code: packages/server/src/agent/loop.ts

In the demo, the Agent Loop records a trace data entry after each step is completed: after a tool call finishes, tracer.onToolResult() is executed to record the call result; after the LLM reply finishes, tracer.onStepFinish() is executed to record the token usage. This way, the execution code and the recording code always appear in pairs, preventing a situation where a step is executed but the record is missed.

Simplified code showing where the trace calls sit in the Agent Loop (see the source code for the full implementation):

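export async function* agentLoop(params: AgentLoopParams): AsyncGenerator<StreamEvent> {
  // Elided in this simplified listing: reading userId, conversationId, provider,
  // modelId, llmModel, messages, and tools out of params.

  // One Tracer per user request; it lives for the whole Operation.
  const tracer = new Tracer({ userId, conversationId, provider, model: modelId })

  // maxSteps caps the loop at 15 steps to prevent infinite loops.
  const result = streamText({ model: llmModel, system: SYSTEM_PROMPT, messages, tools, maxSteps: 15 })

  for await (const part of result.fullStream) {
    switch (part.type) {
      case 'step-start':
        tracer.onStepStart()
        break
      case 'tool-call':
        tracer.onToolCallStart()
        yield { type: 'tool-call', toolName: part.toolName, input: part.args }
        break
      case 'tool-result':
        // toolInput, success, outputSize, resultObj, and errorMsg are computed
        // in the full source (elided here).
        tracer.onToolResult(part.toolName, toolInput, success, outputSize, resultObj, errorMsg)
        yield { type: 'tool-result', toolName: part.toolName, success, outputSize }
        break
      case 'step-finish':
        // tokenUsage, contextInfo, and stepText are computed in the full source (elided here).
        tracer.onStepFinish(tokenUsage, contextInfo, stepText)
        break
      case 'error':
        tracer.onError(String(part.error))
        yield { type: 'error', message: String(part.error) }
        break
    }
  }

  // status is the final outcome of the Operation (its derivation is elided here).
  const operationId = tracer.finish(status)
  yield { type: 'done', operationId }
}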

Each case branch has a tracer.onXxx() call. The Tracer is created at the function entry and closed with finish() at the exit, so however many steps the Agent Loop runs, the trace records that many steps.

The Tracer in the demo is self-implemented because the subsequent evolution engine needs to query trace data using SQL (bucketing, matching, backfilling). If your product has already integrated platforms like Sentry or PostHog, you can use these platforms' APIs for manual instrumentation at the same positions in the Agent Loop, recording the same data fields as in the table above. However, the instrumentation data ultimately needs to be synchronized to your own database, or exported via the platform API, because the evolution engine needs to query this data directly.

Two Design Points: Immediate Write and Token Recording

Immediate Write. After the Agent completes each step, the Tracer immediately writes that step's data to SQLite instead of caching it in memory, because the data from just before a crash is the most valuable. Suppose the Agent crashes at step 5: the records of the first 4 steps are already in the database, so there is context to examine when analyzing the cause. If steps were cached in memory for batch writing, a crash would lose the entire trace. In the demo, each onXxx() method of the Tracer immediately calls traceStore.insertStep() to write to the database. With SQLite's WAL (Write-Ahead Logging) mode enabled, writing once per step does not cause performance issues.
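
The difference between the two write strategies can be shown with a small simulation. This is a sketch only: plain arrays stand in for the database (the demo writes to SQLite), and `createImmediateTracer`, `createBufferedTracer`, and `runUntilCrash` are hypothetical names.

```typescript
type StepRow = { operationId: string; index: number; summary: string };
type StepRecorder = { recordStep(summary: string): void; flush(): void };

// Writes every step to the store as soon as it finishes.
function createImmediateTracer(store: StepRow[], operationId: string): StepRecorder {
  let index = 0;
  return {
    recordStep(summary) {
      store.push({ operationId, index: index++, summary });
    },
    flush() {
      // Nothing to do: every step is already persisted.
    },
  };
}

// Collects steps in memory and writes them only when flush() is called.
function createBufferedTracer(store: StepRow[], operationId: string): StepRecorder {
  const buffer: StepRow[] = [];
  return {
    recordStep(summary) {
      buffer.push({ operationId, index: buffer.length, summary });
    },
    flush() {
      for (const row of buffer) store.push(row);
    },
  };
}

// Simulated agent run that crashes while executing step 5.
function runUntilCrash(tracer: StepRecorder): void {
  for (let step = 1; step <= 5; step++) {
    if (step === 5) throw new Error("crash during step 5");
    tracer.recordStep(`step ${step} ok`);
  }
  tracer.flush(); // never reached because of the crash
}

const immediateStore: StepRow[] = [];
const bufferedStore: StepRow[] = [];
try { runUntilCrash(createImmediateTracer(immediateStore, "op_1")); } catch { /* simulated crash */ }
try { runUntilCrash(createBufferedTracer(bufferedStore, "op_2")); } catch { /* simulated crash */ }

console.log(immediateStore.length, bufferedStore.length); // prints "4 0"
```

With immediate writes, the four completed steps survive the crash; with buffering, nothing reaches the store.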

Token Recording. The Tracer records promptTokens, completionTokens, and cache hit counts in each step-finish event (different providers have different field names—DeepSeek calls it promptCacheHitTokens, Anthropic calls it cacheReadInputTokens, OpenAI calls it cachedPromptTokens). When the Operation ends, Tracer.finish() aggregates the token usage of all steps. The subsequent behavior analysis will use this data, for example, discovering that "the token consumption of a certain type of operation is 10 times that of other operations."
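
Because the cache-hit field has a different name per provider, the Tracer needs a small normalization step before aggregating. The sketch below shows one way to do it. The three field names come from the text above; the metadata layout (a per-provider object keyed by provider name), the `StepUsage` type, and the function names are assumptions for illustration.

```typescript
// Hypothetical per-step usage record; providerMetadata is assumed to be keyed by provider name.
type StepUsage = {
  provider: string;
  promptTokens: number;
  completionTokens: number;
  providerMetadata?: Record<string, Record<string, unknown>>;
};

// Cache-hit field name per provider, as listed in the article.
const CACHE_FIELD_BY_PROVIDER: Record<string, string> = {
  deepseek: "promptCacheHitTokens",
  anthropic: "cacheReadInputTokens",
  openai: "cachedPromptTokens",
};

// Returns the cache-hit count, or 0 when the provider does not report one.
function cachedTokensOf(usage: StepUsage): number {
  const field = CACHE_FIELD_BY_PROVIDER[usage.provider];
  if (!field) return 0;
  const value = usage.providerMetadata?.[usage.provider]?.[field];
  return typeof value === "number" ? value : 0;
}

// Sums usage over all steps of one Operation.
function aggregateUsage(steps: StepUsage[]): { input: number; output: number; cached: number } {
  const total = { input: 0, output: 0, cached: 0 };
  for (const step of steps) {
    total.input += step.promptTokens;
    total.output += step.completionTokens;
    total.cached += cachedTokensOf(step);
  }
  return total;
}

const usageTotal = aggregateUsage([
  { provider: "deepseek", promptTokens: 1000, completionTokens: 200, providerMetadata: { deepseek: { promptCacheHitTokens: 640 } } },
  { provider: "anthropic", promptTokens: 800, completionTokens: 100, providerMetadata: { anthropic: { cacheReadInputTokens: 300 } } },
  { provider: "zhipu", promptTokens: 500, completionTokens: 50 }, // no cache data reported
]);
console.log(usageTotal);
```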

Automatic Error Pattern Recognition

Trace can tell us "what error occurred," but it cannot tell us "how to handle it." This section explains how the system automatically classifies errors and how the classification rules (Patterns) are generated.

Error Records in Trace

Each error recorded by Trace contains fields like provider, error type, HTTP status code, tool name, and error message. Here are several typical errors:

LLM provider rate limit:

provider=deepseek, error_type=rate_limit, status_code=429, tool_name=null
message: "429 Too Many Requests"

SQL query on a non-existent table:

provider=null, error_type=tool_error, status_code=null, tool_name=dbQuery
message: "no such table: orders"

Model returned unparseable JSON:

provider=openai, error_type=invalid_json, status_code=null, tool_name=null
message: "Unexpected token at position 1234"

Context exceeded token limit:

provider=deepseek, error_type=context_overflow, status_code=400, tool_name=null
message: "This model's maximum context length is 65536 tokens"

These error data all carry fields like provider, type, and status code, but the fields themselves only describe "what happened," not "how to handle it." So we need a rule to determine: what category does this error belong to? This is the role of the Pattern.

Pattern: Error Classification Rules

Source code: packages/server/src/evolution/pattern-matcher.ts

A Pattern is an error classification rule stored in the database. Each Pattern consists of three parts: a name, matching conditions (which fields must satisfy what values), and a classification (which category this error belongs to). When an error's fields match a Pattern's matching conditions, the system knows which category this error belongs to, without manual judgment. For example:

Pattern Example:
  Name: deepseek-rate-limit-429
  Category: provider_error (provider's problem, not fixable by Harness)
  Match Rule: provider = deepseek AND statusCode = 429

Each category maps to one of four handling methods:

user_error (the user's problem, e.g., an expired API Key): no code change needed
provider_error (the provider's problem, e.g., rate limiting): no code change needed, though retries might be added
harness_bug (our own bug): needs a code fix; this is what the "Auto Fix" section handles later
ignore (an admin decided no handling is needed): skipped, and also ignored during auto-fix

When each error is written to the database, the system calls matchError() to try matching existing Patterns—comparing provider, errorType, statusCode, toolName one by one to see if conditions are met, and finally checking the error message with a regex. A match is only counted if all conditions are met. Unmatched errors remain there, waiting for the next inspection round to be bucketed, analyzed by the LLM, and have new Patterns generated.

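export function matchError(error: ErrorToMatch): Pattern | null {
  const patterns = loadPatterns()
  // First match wins: patterns are checked in the order loadPatterns() returns them.
  for (const pattern of patterns) {
    const rule = pattern.matchRule
    // A condition left unset in the rule is not checked; '*' matches any provider.
    if (rule.provider && rule.provider !== '*' && rule.provider !== error.provider) continue
    if (rule.errorType && rule.errorType !== error.errorType) continue
    if (rule.statusCode != null && rule.statusCode !== error.statusCode) continue
    if (rule.toolName && rule.toolName !== error.toolName) continue
    if (rule.messageRegex) {
      // Case-insensitive regex test against the error message.
      if (!new RegExp(rule.messageRegex, 'i').test(error.message)) continue
    }
    // Hit: the full implementation updates the pattern's hit count here (elided).
    return pattern
  }
  // No pattern matched: the error stays unmatched until the next inspection.
  return null
}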

LLM Inspection: Generating New Patterns

Source code: packages/server/src/evolution/error-bucketer.ts, packages/server/src/evolution/inspector.ts

The previous section mentioned that unmatched errors remain waiting for inspection processing. When the system first goes online, the Pattern library is empty, and all errors are unmatched. Inspection is the process of using an LLM to analyze these errors and generate new Patterns.

Bucketing

The LLM doesn't need to look at errors one by one: the root cause is the same whether an error appears 15 times or once. So the inspection first groups unmatched errors by type. bucketErrors() runs a GROUP BY over five dimensions (provider, error type, HTTP status code, tool name, and error message) and returns the error count for each bucket:

const rows = db.prepare(
  `SELECT e.provider, e.error_type, e.status_code, e.tool_name, e.message, COUNT(*) as count
   FROM errors e ${where}
   GROUP BY e.provider, e.error_type, e.status_code, e.tool_name, e.message
   ORDER BY count DESC`
).all(...params)

Bucketing result example:

Bucket	Count
deepseek × rate_limit × 429 × null	15
null × tool_error × null × dbQuery	8
openai × invalid_json × null × null	3

For example, the 15 deepseek rate limit errors in the table collapse into one bucket, so the LLM only needs to analyze them once. When its unmatched parameter is set, bucketErrors() returns only the error buckets not yet covered by a Pattern; buckets that already have Patterns don't need re-analysis.
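
For readers who prefer code to SQL, the same bucketing can be sketched in memory. This is an illustrative equivalent of the GROUP BY, not the demo's implementation; `ErrorRecord`, `bucketErrorsInMemory`, and `repeatError` are hypothetical names, and a null `patternId` is assumed to mean "not yet covered by a Pattern".

```typescript
type ErrorRecord = {
  provider: string | null;
  errorType: string;
  statusCode: number | null;
  toolName: string | null;
  message: string;
  patternId: number | null; // assumed: null means no Pattern covers this error yet
};

type ErrorBucket = Omit<ErrorRecord, "patternId"> & { count: number };

// Groups errors on the five dimensions and returns buckets sorted by count, largest first.
function bucketErrorsInMemory(errors: ErrorRecord[], unmatchedOnly: boolean): ErrorBucket[] {
  const buckets = new Map<string, ErrorBucket>();
  for (const e of errors) {
    if (unmatchedOnly && e.patternId !== null) continue;
    // The JSON-encoded tuple plays the role of the GROUP BY key.
    const key = JSON.stringify([e.provider, e.errorType, e.statusCode, e.toolName, e.message]);
    const existing = buckets.get(key);
    if (existing) {
      existing.count += 1;
    } else {
      buckets.set(key, {
        provider: e.provider, errorType: e.errorType, statusCode: e.statusCode,
        toolName: e.toolName, message: e.message, count: 1,
      });
    }
  }
  return Array.from(buckets.values()).sort((a, b) => b.count - a.count);
}

function repeatError(record: ErrorRecord, times: number): ErrorRecord[] {
  return Array.from({ length: times }, () => ({ ...record }));
}

const sampleErrors: ErrorRecord[] = [
  ...repeatError({ provider: null, errorType: "tool_error", statusCode: null, toolName: "dbQuery", message: "no such table: orders", patternId: null }, 8),
  ...repeatError({ provider: "deepseek", errorType: "rate_limit", statusCode: 429, toolName: null, message: "429 Too Many Requests", patternId: null }, 15),
  ...repeatError({ provider: "openai", errorType: "invalid_json", statusCode: null, toolName: null, message: "Unexpected token at position 1234", patternId: 3 }, 3),
];

const unmatchedBuckets = bucketErrorsInMemory(sampleErrors, true);
console.log(unmatchedBuckets.map((b) => `${b.errorType}: ${b.count}`));
```

Note that grouping on the full message text means two messages that differ only in a dynamic detail (such as a token position) land in different buckets.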

LLM Analysis

After bucketing, the system sends the summary information of unmatched error buckets—provider, error type, status code, tool name, error message, occurrence count—to the LLM, letting it determine the root cause of each bucket and generate the corresponding Pattern.

The results returned by the LLM are written directly to the database, so the format cannot be arbitrary. A Schema is defined using Zod and passed to the model via the Vercel AI SDK's generateObject. The model must return JSON that conforms to this structure; output that fails schema validation makes the call fail with an error instead of reaching the database:

const PatternSuggestionSchema = z.object({
  patterns: z.array(z.object({
    name: z.string().describe('Human-readable pattern name, e.g. deepseek-rate-limit-429'),
    category: z.enum(['user_error', 'provider_error', 'harness_bug', 'ignore']),
    errorType: z.string(),
    matchRule: z.object({
      statusCode: z.number().nullable().optional(),
      provider: z.string().optional(),
      toolName: z.string().nullable().optional(),
      messageRegex: z.string().optional(),
    }),
    reasoning: z.string().describe('Why you classified it this way'),
  })),
  bugs: z.array(z.object({
    title: z.string(),
    description: z.string(),
    rootCause: z.string(),
    suggestedFix: z.string(),
    severity: z.enum(['low', 'medium', 'high']),
  })),
  summary: z.string(),
})

const result = await generateObject({
  model,
  schema: PatternSuggestionSchema,
  prompt,  // Contains detailed information of unmatched error buckets
  abortSignal: AbortSignal.timeout(120_000),
})

In the code above, the patterns field is the Pattern generated by the LLM for each error bucket—name, category, match rule, and classification reasoning. The bugs field is a detailed analysis of harness_bug: root cause, fix suggestion, severity—this information will be used in the "Auto Fix" section later. The summary field is a summary of this inspection round.

After the LLM returns the results, applyFixes() writes the new Patterns to the database and immediately backfills all historical errors—finding all records with empty pattern_id, comparing them one by one with the new Patterns, and marking those that match. Errors that were previously unrecognized are now automatically classified after having Patterns.
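
The backfill step can be sketched as below. It is a simplified in-memory illustration, not the applyFixes() source: the matching conditions mirror the matchError() logic shown earlier, while `StoredError`, `ErrorPattern`, `ruleMatches`, and `backfillPatterns` are hypothetical names.

```typescript
type MatchRule = {
  provider?: string;
  errorType?: string;
  statusCode?: number | null;
  toolName?: string | null;
  messageRegex?: string;
};
type ErrorPattern = { id: number; name: string; matchRule: MatchRule };
type StoredError = {
  provider: string | null;
  errorType: string;
  statusCode: number | null;
  toolName: string | null;
  message: string;
  patternId: number | null;
};

// Every condition set in the rule must hold; unset conditions are not checked.
function ruleMatches(rule: MatchRule, error: StoredError): boolean {
  if (rule.provider && rule.provider !== "*" && rule.provider !== error.provider) return false;
  if (rule.errorType && rule.errorType !== error.errorType) return false;
  if (rule.statusCode != null && rule.statusCode !== error.statusCode) return false;
  if (rule.toolName && rule.toolName !== error.toolName) return false;
  if (rule.messageRegex && !new RegExp(rule.messageRegex, "i").test(error.message)) return false;
  return true;
}

// Marks every still-unclassified error that a new Pattern covers; returns how many were updated.
function backfillPatterns(errors: StoredError[], newPatterns: ErrorPattern[]): number {
  let updated = 0;
  for (const error of errors) {
    if (error.patternId !== null) continue; // already classified: leave untouched
    const hit = newPatterns.find((p) => ruleMatches(p.matchRule, error));
    if (hit) {
      error.patternId = hit.id;
      updated++;
    }
  }
  return updated;
}

const historicalErrors: StoredError[] = [
  { provider: "deepseek", errorType: "rate_limit", statusCode: 429, toolName: null, message: "429 Too Many Requests", patternId: null },
  { provider: null, errorType: "tool_error", statusCode: null, toolName: "dbQuery", message: "no such table: orders", patternId: null },
  { provider: "deepseek", errorType: "rate_limit", statusCode: 429, toolName: null, message: "429 Too Many Requests", patternId: 7 },
  { provider: "openai", errorType: "invalid_json", statusCode: null, toolName: null, message: "Unexpected token at position 1234", patternId: null },
];
const generatedPatterns: ErrorPattern[] = [
  { id: 1, name: "deepseek-rate-limit-429", matchRule: { provider: "deepseek", statusCode: 429 } },
  { id: 2, name: "dbquery-missing-table", matchRule: { toolName: "dbQuery", messageRegex: "no such table" } },
];

const backfilledCount = backfillPatterns(historicalErrors, generatedPatterns);
console.log(backfilledCount, historicalErrors.map((e) => e.patternId));
```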

Readers can verify this themselves using the demo. After running pnpm simulate --mock to inject simulated requests, manually run an inspection in the admin panel. The inspection collects all unmatched error buckets, the LLM generates a Pattern for each bucket, and the new Patterns are then backfilled onto historical errors. After the inspection completes, most of the previously accumulated unmatched errors will be covered.

Auto Fix: Let the System Submit Its Own PRs

The previous section identified harness_bug. This section lets the system automatically generate fix code and submit PRs. From locating files to creating PRs, no developer involvement is needed in between, but merging PRs still requires developer review.

Fix Flow

Source code: packages/server/src/evolution/auto-pr.ts

Auto fix uses the same Agent Loop architecture as chat—the LLM autonomously searches for files, reads code, and modifies code through tools.

The inspection in the previous section marks the Harness's own bugs as harness_bug Patterns and updates the status to fix_status = 'unfixed'. Auto fix processes these unfixed Patterns one by one, generating fix code and submitting PRs for each Pattern. In addition to harness_bug, unhealthy behaviors marked as critical (discussed in the next section) are also processed. Bug fix branches use the fix/ prefix, behavior optimizations use improve/.
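
The target selection and branch naming described above might look roughly like this. It is a sketch under assumptions: the category names, the fix_status values, and the fix/ and improve/ prefixes come from the text, while the row types, the non-critical severity levels, and the slug format of branch names are invented for illustration.

```typescript
type PatternRow = {
  name: string;
  category: "user_error" | "provider_error" | "harness_bug" | "ignore";
  fixStatus: "unfixed" | "pr_created";
};
// Severity levels other than "critical" are assumed for this example.
type BehaviorRow = { name: string; severity: "info" | "warning" | "critical" };
type FixTarget = { kind: "bug" | "behavior"; name: string; branch: string };

// Turns a display name into a branch-safe slug (assumed naming scheme).
function slugify(name: string): string {
  return name.toLowerCase().replace(/[^a-z0-9]+/g, "-").replace(/^-+|-+$/g, "");
}

// Unfixed harness bugs get fix/ branches; critical behaviors get improve/ branches.
function selectFixTargets(patterns: PatternRow[], behaviors: BehaviorRow[]): FixTarget[] {
  const targets: FixTarget[] = [];
  for (const p of patterns) {
    if (p.category === "harness_bug" && p.fixStatus === "unfixed") {
      targets.push({ kind: "bug", name: p.name, branch: `fix/${slugify(p.name)}` });
    }
  }
  for (const b of behaviors) {
    if (b.severity === "critical") {
      targets.push({ kind: "behavior", name: b.name, branch: `improve/${slugify(b.name)}` });
    }
  }
  return targets;
}

const fixTargets = selectFixTargets(
  [
    { name: "dbquery-missing-table", category: "harness_bug", fixStatus: "unfixed" },
    { name: "deepseek-rate-limit-429", category: "provider_error", fixStatus: "unfixed" },
    { name: "context-overflow", category: "harness_bug", fixStatus: "pr_created" },
  ],
  [
    { name: "Web Research + Summarization", severity: "critical" },
    { name: "Database Query", severity: "warning" },
  ],
);
console.log(fixTargets.map((t) => t.branch));
```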

For each target to be fixed, the system first creates a git branch, then starts a fix Agent. This Agent is equipped with tools for file searching (glob, grep), reading and writing (readFile, editFile, writeFile), and committing (submitFix). After receiving the error information, the Agent searches for relevant files, reads code, locates the problem, and applies modifications on its own—just like the chat Agent, all done autonomously by the LLM through tools.

After the Agent completes the modifications, the system commits all modified files. If a remote repository is configured, it automatically pushes the branch and creates a PR; otherwise, it only creates the branch locally. Finally, fix_status is updated to pr_created.

[Screenshots: the auto fix running process and the PRs submitted by auto fix]

Safety Boundary: Why Not Let It Merge Code Itself

Auto fix only generates PRs; merging still requires developer review. LLM-generated code may introduce new problems. Without automated tests covering the modified behavior, letting the system merge code itself is too risky. If you want to skip developer review, at least two things are needed: a set of automated tests covering the modified behavior, and a replay engine that can reproduce the operation that triggers the bug. The "Implementation Guide" section later will discuss the design of the replay engine.

Behavior Analysis: No Error ≠ No Problem

The previous sections all addressed "what errors occurred." This section looks at another type of problem—"no error reported, but something's not quite right."

Error Patterns vs. Behavior Patterns

The previous sections analyzed records of actual errors—error Patterns. Behavior analysis has a different object; it looks at all operations, including successful ones. The difference between the two:

	Error Patterns	Behavior Patterns
Analysis Object	Records of actual errors	All operations, including successful ones
What's Discovered	System crashes or exceptions	Not crashed but inefficient, low quality
Example	Search API 429 rate limit, SQL syntax error	Simple question called 8 tools, search-type operations consume too many tokens
Classification Method	Four fixed categories (user_error / provider_error / harness_bug / ignore)	LLM dynamically groups by user intent
Final Action	harness_bug → Auto-generate fix PR	Generate improvement suggestions, severe ones also auto-generate fix PRs

What Counts as "Unhealthy"

A user asks a simple question ("Introduce yourself"), but the Agent calls 3 tools and takes 15 seconds to answer. The status is "success," but the efficiency is clearly off. It might be that the system prompt's guidance on tool usage is not precise enough, causing searches even when tools aren't needed.

The success rate of database query operations is only 65%. The Agent will switch to another way to answer after a tool fails, so it won't report an error, but one-third of the query results are incorrect. It might be that the dbQuery tool's Schema description lacks table structure information, causing the model to write incorrect SQL.

The average token consumption of search research operations is 10 times that of other operations. Either the pages returned by webFetch aren't truncated, or the Agent fetched too many irrelevant pages after searching.

Analysis Flow

Source code: packages/server/src/evolution/behavior-analyzer.ts

To discover the above types of problems, cluster analysis of the operation data is needed first. The system loads summaries of the last 100 operations (one operation corresponds to one complete user request, containing multiple steps, as introduced in the Tracing section) from the database, formats each into a line of text, and concatenates them all into the prompt sent to the LLM:

[0] "Help me research the Hono framework" → tools: webSearch→webFetch→webFetch → success (5 steps, 12.3s)
[1] "Which singer has the most albums" → tools: dbQuery → success (3 steps, 2.1s)
[2] "Summarize this article" → tools: webFetch → success (3 steps, 4.5s)
[3] "Check the top 10 albums by sales" → tools: dbQuery → error (4 steps, 3.2s)
...

Each line of text contains the user message, tool call chain, status, step count, and duration. The LLM groups these operations based on user intent and tool usage patterns, and returns the name, description, representative tool sequence for each group, as well as a list of numbers belonging to that group. For example, [0] and [2] above both involve web search and content fetching, so they would be grouped as "Web Research + Summarization"; [1] and [3] both only used dbQuery, so they would be grouped as "Database Query." There are no predefined categories for grouping; the LLM decides based on the actual data. The output format is constrained using generateObject + Zod Schema.
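
Formatting an operation into the one-line summary shown above is straightforward. The sketch below reproduces the line format from the example; the `OperationLine` type, the "none" placeholder for operations without tool calls, and the function names are assumptions.

```typescript
type OperationLine = {
  userMessage: string;
  tools: string[]; // tool names in call order
  status: "success" | "error";
  stepCount: number;
  durationMs: number;
};

// Produces one prompt line per operation, in the format shown in the article.
function formatOperationLine(index: number, op: OperationLine): string {
  const chain = op.tools.length > 0 ? op.tools.join("→") : "none"; // "none" is an assumed placeholder
  const seconds = (op.durationMs / 1000).toFixed(1);
  return `[${index}] "${op.userMessage}" → tools: ${chain} → ${op.status} (${op.stepCount} steps, ${seconds}s)`;
}

// Joins all summaries into the text block that goes into the clustering prompt.
function buildSummaryBlock(ops: OperationLine[]): string {
  return ops.map((op, i) => formatOperationLine(i, op)).join("\n");
}

const summaryBlock = buildSummaryBlock([
  { userMessage: "Help me research the Hono framework", tools: ["webSearch", "webFetch", "webFetch"], status: "success", stepCount: 5, durationMs: 12_300 },
  { userMessage: "Which singer has the most albums", tools: ["dbQuery"], status: "success", stepCount: 3, durationMs: 2_100 },
]);
console.log(summaryBlock);
```

The bracketed index matters: it is what lets the LLM return "a list of numbers belonging to that group" instead of repeating whole messages.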

The demo takes 100 entries because all summaries must go into a single prompt, and too many would exceed the context window. In production, this should become incremental analysis by time window. To avoid duplicate analysis, the system records the operation set signature of the previous round and skips the run if no new operations have been added.
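One way to implement the "skip if nothing changed" check is to derive a signature from the set of operation IDs. This is a sketch under the assumption that each operation has a string ID; the helper names are hypothetical, and the demo may compute its signature differently (for example, by hashing).

```typescript
// Hypothetical helper: a signature that changes only when the set of
// operation IDs changes.
function operationSetSignature(operationIds: string[]): string {
  // Sort a copy so the signature does not depend on load order.
  return operationIds.slice().sort().join(',')
}

// Returns true when the analysis should run, i.e. the current set
// differs from the one recorded in the previous round.
function shouldAnalyze(operationIds: string[], previousSignature: string | null): boolean {
  return operationSetSignature(operationIds) !== previousSignature
}
```

For large windows, storing a hash of the joined string instead of the string itself would keep the stored signature short.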

After grouping comes scoring. The performance of different types of tasks varies greatly—web research involving webSearch/webFetch is naturally much slower than local dbQuery, so using the same set of thresholds to score both is unreasonable. Therefore, the system automatically selects the corresponding threshold based on the behavior group's tool sequence:

Dimension	Lightweight Tasks (Local Tools)	Heavy Tasks (Including Network Tools)
Avg Duration per Operation	≤ 15 sec	≤ 60 sec
Avg Steps per Operation	≤ 10	≤ 20
Avg Tokens per Operation	≤ 50k	≤ 150k

Success rate (≥ 80%) and tool error rate (≤ 20%) use the same thresholds for all task types, since these two metrics are unrelated to complexity. Each of the five dimensions is worth 0.2 points, for a maximum score of 1.0.

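// Health thresholds. Durations are in milliseconds (15_000 = 15 sec).
const HEALTH_THRESHOLDS = {
  minSuccessRate: 0.8,   // shared by all task types
  maxToolErrorRate: 0.2, // shared by all task types
  light: { maxAvgDuration: 15_000, maxAvgSteps: 10, maxAvgTokens: 50_000 },
  heavy: { maxAvgDuration: 60_000, maxAvgSteps: 20, maxAvgTokens: 150_000 },
}

// Network tools: if any of them appears, the behavior group counts as "heavy".
const WEB_TOOLS = ['webSearch', 'webFetch']

// Pick the threshold tier from the group's representative tool sequence.
function getThresholdTier(toolSequence: string) {
  return WEB_TOOLS.some(t => toolSequence.includes(t))
    ? HEALTH_THRESHOLDS.heavy : HEALTH_THRESHOLDS.light
}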

The judgment logic is simple: check if the behavior group's representative tool sequence contains any network tools. If yes, use the heavy standard; if not, use light. If your Agent has more types of tools, just add new classifications.

A behavior group is judged as "unhealthy" if its health score is below 0.8 (meaning at least two dimensions are substandard), or if it triggers the two key flags low_success_rate or high_tool_error_rate.
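The scoring and the "unhealthy" rule can be sketched as follows. The thresholds and the two key flag names come from the text; the `GroupMetrics` shape, the function name, and the other three flag names are assumptions made for illustration.

```typescript
// Thresholds as described in the article; durations are in milliseconds.
const THRESHOLDS = {
  minSuccessRate: 0.8,
  maxToolErrorRate: 0.2,
  light: { maxAvgDuration: 15_000, maxAvgSteps: 10, maxAvgTokens: 50_000 },
  heavy: { maxAvgDuration: 60_000, maxAvgSteps: 20, maxAvgTokens: 150_000 },
}

// Hypothetical metrics shape for one behavior group.
interface GroupMetrics {
  successRate: number
  toolErrorRate: number
  avgDuration: number
  avgSteps: number
  avgTokens: number
}

function assessHealth(m: GroupMetrics, tier: 'light' | 'heavy') {
  const t = THRESHOLDS[tier]
  const lowSuccess = m.successRate < THRESHOLDS.minSuccessRate
  const highToolError = m.toolErrorRate > THRESHOLDS.maxToolErrorRate

  const flags: string[] = []
  if (lowSuccess) flags.push('low_success_rate')
  if (highToolError) flags.push('high_tool_error_rate')
  if (m.avgDuration > t.maxAvgDuration) flags.push('high_duration') // assumed name
  if (m.avgSteps > t.maxAvgSteps) flags.push('high_steps')          // assumed name
  if (m.avgTokens > t.maxAvgTokens) flags.push('high_tokens')       // assumed name

  // Five dimensions at 0.2 points each. Counting passes and dividing
  // avoids accumulating floating-point error from repeated 0.2 additions.
  const score = (5 - flags.length) / 5

  // Unhealthy: score below 0.8, or either key flag regardless of score.
  const unhealthy = score < 0.8 || lowSuccess || highToolError
  return { score, flags, unhealthy }
}
```

With these rules, a database query group whose only problem is a 65% success rate still scores 0.8, but it is reported as unhealthy because `low_success_rate` is a key flag.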

Finally, generating suggestions. For groups with low scores, the LLM is called again to provide specific Harness-layer improvement suggestions. The information the LLM receives includes the behavior name, specific health metric values, tool call sequences, and which dimensions are substandard. The requirement is that suggestions target the Harness code itself (e.g., "add retry to webFetch," "optimize dbQuery's Schema description"), not prompt-level adjustments.

Each suggestion given by the LLM is marked with severity: critical indicates a clear code modification plan that can directly enter the auto fix flow; suggestion indicates a suggested improvement, left for the developer to decide whether to implement. critical level suggestions are marked as fix_status = 'unfixed' and will later enter the auto fix processing queue.


Implementation Guide: From Demo to Production, and Verifying Fixes with a Replay Engine

The previous sections explained each module in the demo. This section discusses how to migrate these things to a real product, and the current biggest shortcoming—how to automatically verify whether a fix actually works.

Migrating to Your Project

Although the demo uses TypeScript + Node.js, the implementation ideas are language-agnostic:

Step	What to Do	Explanation
1	Add Tracing in the Agent Loop	Immediately write to the database after the Agent completes each step. The key is immediate writing—records of completed steps won't be lost even if a crash occurs
2	Error Bucketing	Perform GROUP BY grouping by provider × error type × status code × tool name × error message, see the most frequent error types through SQL
3	Scheduled LLM Inspection	Periodically send unmatched error buckets to the LLM, constrain the output format using `generateObject` + Zod Schema (or the corresponding language's JSON Schema library), automatically generate new Patterns
4	Automatic Backfill	Immediately backfill historical errors after new Patterns are generated; previously unmatched errors can now be automatically classified
5	Auto Fix	`harness_bug` identified by inspection and issues marked as `critical` by behavior analysis both enter the auto fix queue. During fixing, the Agent will read relevant source code, generate fix code, and submit a PR

The most important is Step 1. Without trace data, nothing else can be done.
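Step 2 is normally a SQL GROUP BY, but the same bucketing can be sketched in memory to show what the bucket key looks like. The `ErrorRecord` fields below are assumptions that mirror the five bucketing dimensions; they are not the demo's actual column names.

```typescript
// Hypothetical error record; field names mirror the bucketing dimensions.
interface ErrorRecord {
  provider: string
  errorType: string
  statusCode: number | null
  toolName: string | null
  message: string
}

interface Bucket {
  key: string
  count: number
  sample: ErrorRecord // first error seen in this bucket
}

// In-memory equivalent of:
//   GROUP BY provider, error type, status code, tool name, error message
//   ORDER BY COUNT(*) DESC
function bucketErrors(errors: ErrorRecord[]): Bucket[] {
  const buckets: Record<string, Bucket> = {}
  for (const e of errors) {
    // JSON.stringify of a tuple gives an unambiguous composite key.
    const key = JSON.stringify([e.provider, e.errorType, e.statusCode, e.toolName, e.message])
    if (buckets[key]) buckets[key].count += 1
    else buckets[key] = { key, count: 1, sample: e }
  }
  return Object.keys(buckets)
    .map(k => buckets[k])
    .sort((a, b) => b.count - a.count)
}
```

Keeping one sample per bucket gives the later LLM inspection a concrete error to look at without sending every duplicate.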

Scheduling Strategy

All operations in the demo require manual button clicks. In a production environment, scheduled tasks should be configured:

Task	In Demo	Production Recommendation	Reason
Error Inspection (Identify Patterns)	Manual click	Automatically execute every 1-2 hours	Errors greatly impact user experience and need quick discovery. Only analyze newly generated unmatched errors from the previous cycle each time
Behavior Analysis (Health Assessment)	Manual click	Automatically execute every 24 hours	Behavior patterns are slow variables, requiring enough new operation data to be meaningful
Auto Fix (Generate PR)	Manual click	Execute once daily at a fixed time, e.g., midnight	Code modifications require prudence, schedule during off-peak hours

Execution order has dependencies: Inspection executes first → Behavior Analysis executes next → Finally Auto Fix processes all pending fixes. The implementation method is flexible—cron job, GitHub Actions, Temporal all work—the core is just calling the same functions on a schedule.
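Whatever scheduler is used, the dependency order can be enforced by a small pure function that decides which tasks are due. This is only a sketch: the task names are hypothetical, and auto fix is modeled as a 24-hour interval rather than a fixed wall-clock time such as midnight.

```typescript
type TaskName = 'inspection' | 'behaviorAnalysis' | 'autoFix'

const HOUR = 60 * 60 * 1000

// Declared in dependency order: inspection, then behavior analysis, then auto fix.
const SCHEDULE: { name: TaskName; everyMs: number }[] = [
  { name: 'inspection', everyMs: 1 * HOUR },        // every 1-2 hours
  { name: 'behaviorAnalysis', everyMs: 24 * HOUR }, // every 24 hours
  { name: 'autoFix', everyMs: 24 * HOUR },          // once daily (simplified)
]

// Returns the tasks due at `now` (epoch ms), always in dependency order,
// given the time each task last ran.
function dueTasks(now: number, lastRun: Record<TaskName, number>): TaskName[] {
  return SCHEDULE
    .filter(task => now - lastRun[task.name] >= task.everyMs)
    .map(task => task.name)
}
```

A cron job or workflow step would call `dueTasks` on every tick and run the returned tasks sequentially, so auto fix never starts before the inspection of the same tick has finished.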

Cost Reference

Estimated for a production environment: inspection every hour, behavior analysis and auto fix once daily each.

Task	Single Round Token Consumption	Monthly Frequency	Monthly Tokens
Inspection	~10k (error bucket summary + LLM analysis)	24 times/day × 30 days = 720 times	~7.2 million
Behavior Analysis	~20k (100 operation summaries + clustering)	1 time/day × 30 days = 30 times	~0.6 million
Auto Fix	200k-1M/target (full Agent Loop, including multiple rounds of file reading and code generation)	~10 targets/month	~2-10 million

Auto fix consumes the most, because the fix Agent carries the full conversation history with each LLM call. The more files read and the more places modified, the larger the context. Simple fixes (modifying one or two files) are about 200k tokens, complex fixes (spanning multiple files, requiring multiple iterations) can exceed 1 million tokens.

Estimated with DeepSeek V4-Pro pricing (input ¥3/million tokens, cache hit ¥0.025/million tokens, output ¥6/million tokens), the monthly cost is about ¥30-80, depending on the number and complexity of fix targets. Inspection prompts are highly repetitive, so costs drop further when the cache hit rate is high. With models like GPT-4o or Claude, the price will be one to two orders of magnitude higher.

The above is only a rough estimate based on the demo scale; actual costs depend on system complexity, number of errors, and fix frequency.
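The monthly token totals in the table can be reproduced with a few lines of arithmetic. The per-round figures and frequencies are the ones given above; the function name is made up for this sketch.

```typescript
// Monthly tokens = tokens per round × rounds per month.
function monthlyTokens(tokensPerRound: number, roundsPerMonth: number): number {
  return tokensPerRound * roundsPerMonth
}

const inspectionTokens = monthlyTokens(10_000, 24 * 30) // 7,200,000
const behaviorTokens = monthlyTokens(20_000, 1 * 30)    // 600,000
const autoFixLow = monthlyTokens(200_000, 10)           // 2,000,000
const autoFixHigh = monthlyTokens(1_000_000, 10)        // 10,000,000

// Overall range across all three tasks.
const totalLow = inspectionTokens + behaviorTokens + autoFixLow   // 9,800,000
const totalHigh = inspectionTokens + behaviorTokens + autoFixHigh // 17,800,000
```

Converting these totals into currency additionally requires the split between input, cached input, and output tokens, which varies by workload and is not broken down here.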

Three Levels of Self-Evolution

The various steps discussed earlier can be divided into three levels based on the degree of automation:

Level	What Humans Do	What the System Does	Covered in This Article
L1 Manual	Read logs, classify errors, write fix code	None	—
L2 Assisted	Confirm classification results + decide whether to fix	Automatically collect data, identify suspicious errors	—
L3 Led	Review PRs + high-risk decisions	Collect → Identify → Classify → Generate changes → Submit PR	Mainly demonstrated in this article

flowchart LR
    subgraph L1[L1 Manual]
        A1[Human reads logs] --> A2[Human classifies] --> A3[Human fixes]
    end
    subgraph L2[L2 Assisted]
        B1[System collects data] --> B2[System identifies suspicious errors] --> B3[Human confirms + Human fixes]
    end
    subgraph L3[L3 Led]
        C1[System collects] --> C2[System identifies] --> C3[System classifies] --> C4[System generates changes] --> C5[System submits PR] --> C6[Human reviews PR]
    end
    L1 -.->|Add Tracing| L2
    L2 -.->|Add LLM Inspection + Auto Fix| L3

The three levels in the table correspond to the steps in "Migrating to Your Project" above: after adding Tracing, you reach L2; then adding LLM inspection and auto fix reaches L3.

Replay Engine: Automatically Verifying Fix Effectiveness

The biggest shortcoming of the current system is: auto fix generates a PR, but whether the PR actually fixes the bug still relies on developer testing and review. The replay engine aims to solve this problem, letting the system verify the fix's effectiveness itself.

Why Deterministic Replay is Feasible

Recall the Tracing section: the steps table records the input and output of each step—llm_response stores the model's reply, tool_output stores the JSON returned by the tool. During replay, there's no need to call a real LLM or execute real tools; just pass the recorded data to the Agent Loop in step order, and the entire execution process becomes a pure function.

This is particularly effective for harness_bugs. These bugs occur in how the Harness processes return values, independent of the return values themselves. For example, "compression wasn't triggered when tokens were near the limit, causing the next API call to fail"—regardless of what the model said, as long as the context management logic has a problem, the bug will appear. Replaying the same LLM replies and tool return values has a high probability of reproducing it.

Implementation Plan: Reusing the Same agentLoop

No need to write a new execution engine. Replay still uses the same agentLoop(), just replacing the LLM and tools with mock versions that directly return recorded data.

First, look at how agentLoop works during normal execution:

sequenceDiagram
    participant AL as agentLoop
    participant LLM as Real LLM API
    participant T as Real Tool

    AL->>LLM: Send message
    LLM-->>AL: Call dbQuery
    AL->>T: Execute dbQuery(SQL)
    T-->>AL: Query result JSON
    Note over AL: Harness processes return value<br/>(context management, error recovery, etc.)
    AL->>LLM: Pass tool result back
    LLM-->>AL: Generate final reply

During replay, replace both the LLM and tools, but the agentLoop code itself does not need modification:

sequenceDiagram
    participant AL as agentLoop (same code)
    participant SP as ScriptedProvider
    participant TR as ToolReplayer

    Note over SP: Read recorded llm_response<br/>from steps table
    AL->>SP: Send message
    SP-->>AL: Return recorded call dbQuery
    Note over TR: Read recorded tool_output<br/>from steps table
    AL->>TR: Execute dbQuery(SQL)
    TR-->>AL: Return recorded query result JSON
    Note over AL: Harness processes return value<br/>(this is what's being tested)
    AL->>SP: Pass tool result back
    SP-->>AL: Return recorded final reply

ScriptedProvider implements the same interface as a real LLM (LanguageModelV1), but doesn't connect to an upstream API; it directly returns the recorded llm_response from the steps table in step order. ToolReplayer replaces the tool registry; each tool_call directly returns the recorded tool_output, without sending emails, executing code, or requesting external APIs.

From agentLoop's perspective, replay and real execution are no different: it cannot tell whether the other side is a real LLM or a recording being played back. What's being tested is agentLoop's processing logic after it receives the return values: whether context management correctly triggered compression, whether tool results were correctly truncated, and whether error recovery missed any edge cases.
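The substitution idea can be sketched with deliberately simplified interfaces. Everything below is hypothetical: the real ScriptedProvider would implement the AI SDK's LanguageModelV1, which is richer than the `Provider` interface here, and the toy `agentLoop` stands in for the real one, with tool-result truncation playing the role of the Harness logic under test.

```typescript
// Simplified, hypothetical shapes (not the AI SDK's real types).
type LlmReply =
  | { kind: 'tool_call'; tool: string; args: string }
  | { kind: 'final'; text: string }

interface Provider { next(messages: string[]): LlmReply }
interface Tools { run(tool: string, args: string): string }

// One recorded row from the steps table (assumed shape).
interface RecordedStep { llm_response: LlmReply; tool_output?: string }

// Returns recorded LLM replies in step order instead of calling an API.
class ScriptedProvider implements Provider {
  private cursor = 0
  private steps: RecordedStep[]
  constructor(steps: RecordedStep[]) { this.steps = steps }
  next(): LlmReply {
    if (this.cursor >= this.steps.length) throw new Error('replay script exhausted')
    return this.steps[this.cursor++].llm_response
  }
}

// Returns recorded tool outputs in order; no real tool is executed.
class ToolReplayer implements Tools {
  private cursor = 0
  private outputs: string[]
  constructor(steps: RecordedStep[]) {
    this.outputs = steps
      .filter(s => s.tool_output !== undefined)
      .map(s => s.tool_output as string)
  }
  run(): string {
    if (this.cursor >= this.outputs.length) throw new Error('no recorded tool output left')
    return this.outputs[this.cursor++]
  }
}

// Toy agent loop: it only sees the two interfaces, so it behaves the same
// for a live run and a replay.
function agentLoop(userMessage: string, provider: Provider, tools: Tools, maxToolChars: number) {
  const messages = [userMessage]
  for (let step = 0; step < 10; step++) {
    const reply = provider.next(messages)
    if (reply.kind === 'final') return { text: reply.text, messages }
    const output = tools.run(reply.tool, reply.args)
    messages.push(output.slice(0, maxToolChars)) // harness-side truncation
  }
  throw new Error('step limit reached')
}
```

A replay then amounts to building both mocks from the same recorded steps and calling `agentLoop(userMessage, new ScriptedProvider(steps), new ToolReplayer(steps), limit)`. Because the loop receives its dependencies as parameters, no branch inside it needs to know about replay.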

Integration with Auto Fix

After auto-pr.ts generates code modifications and before pushing, add a verification step (Safety Gate):

Reproduction verification is divided into two steps: first, replay the operation that triggered the bug using the unpatched code—it must fail (confirming the trace can reproduce the bug); second, replay the same operation using the patched code—it must succeed (confirming the fix is effective). If the first step succeeds, it means this bug might be related to conditions outside the Harness, and replay cannot verify it, so skip. Regression verification is running the patched code against traces corresponding to previously fixed bugs; only if all pass is submission allowed.
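The gate logic described above fits in one function. This is a sketch under assumptions: the `Replay` callback is a hypothetical stand-in for "run this recorded trace against this build of the Harness", and the verdict names are illustrative.

```typescript
type ReplayResult = 'pass' | 'fail'

// Hypothetical replay function: runs one recorded trace against a build.
type Replay = (traceId: string, build: 'unpatched' | 'patched') => ReplayResult

type GateDecision =
  | { verdict: 'skip'; reason: string }
  | { verdict: 'reject'; reason: string }
  | { verdict: 'allow' }

function safetyGate(bugTraceId: string, regressionTraceIds: string[], replay: Replay): GateDecision {
  // Step 1: the unpatched code must fail, otherwise the trace does not
  // reproduce the bug and replay cannot verify the fix.
  if (replay(bugTraceId, 'unpatched') === 'pass') {
    return { verdict: 'skip', reason: 'bug not reproducible by replay' }
  }
  // Step 2: the patched code must succeed on the same trace.
  if (replay(bugTraceId, 'patched') === 'fail') {
    return { verdict: 'reject', reason: 'patch does not fix the target bug' }
  }
  // Regression: traces of previously fixed bugs must still pass.
  for (const id of regressionTraceIds) {
    if (replay(id, 'patched') === 'fail') {
      return { verdict: 'reject', reason: `regression on trace ${id}` }
    }
  }
  return { verdict: 'allow' }
}
```

What the caller does with a `skip` verdict (for example, still opening the PR but labeling it as unverified) is a policy choice this sketch leaves open.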

Automatic Accumulation of Regression Corpus

Whenever a fix PR for a harness_bug is merged (fix_status → merged), the trace that triggered this bug is automatically stored into a regression case set. Afterwards, each time a new fix PR is generated, besides verifying that the new PR itself can fix the target bug, all existing traces in the regression case set must also be replayed to ensure the new changes haven't reintroduced previously fixed bugs. The more bugs fixed, the larger the regression case set, and the broader the regression protection coverage.

Boundaries

What the replay engine verifies is the Harness's processing logic, not the LLM's answer quality. Because the LLM's replies are fixed recorded data during replay, how well the model answers is not within the verification scope. The replay engine only cares about one thing: after the Harness receives these return values, whether the code logic for context management, tool result processing, and error recovery executed correctly.

Because of its complexity, I did not implement the replay engine in the demo. The approach is feasible, though: I have already implemented it in another product.

What You Get After Doing All This

Readers can directly verify the effectiveness of this process in the demo. First execute pnpm simulate --mock to inject simulated data (you can also omit the --mock parameter to use real API calls), then run an inspection in the admin panel. Because the Pattern library is still empty at this point, all errors are in an "unmatched" state. After the inspection executes, the system will group similar errors together, generate corresponding Patterns for each group of errors, and then use the new Patterns to backfill previous historical errors, marking those that can be matched. As a result, most errors in the originally all "unmatched" error list are now classified, and the unmatched error rate drops significantly.

But inspection only identifies and classifies errors; what truly reduces errors is the subsequent auto fix. After inspection marks harness_bug type Patterns, auto fix will generate fix PRs for these bugs. After developers review and merge the PRs, the corresponding bugs will no longer appear in subsequent requests. As inspection continuously discovers new bugs, auto fix continuously generates PRs, and developers continuously merge fixes, the errors generated by the system will become fewer and fewer, and the problems requiring manual handling will also decrease.

This approach comes from LobeHub's production practice ("What needs self-evolution is not the Agent, but the Harness"). Their published data:

After 9 rounds of inspection, Patterns grew from 31 to 104 and then saturated, with the number of new Patterns added per round continuously dropping from 31 to 0
During the inspection process, over 20 defects in the Harness itself were discovered, including Schema incompatibility, negative max_tokens, reasoning_content loss, Context Window overload, etc.
Agent success rate improved from about 75% early on to over 95%

Once this mechanism is running, the errors generated by each request are recorded in the trace, and the next scheduled inspection automatically classifies them into an existing Pattern or generates a new one. Each type of error needs developer review only once (to confirm that the Pattern classification is accurate); when the same type of error occurs again, the system matches it directly, without developer intervention. As the Pattern library grows with each inspection round, unmatched errors become fewer and fewer.
