Dewu's AI Harness Wraps the Full PDCA Loop Around Recommendation Agents

This article is a technical record compiled from a presentation by Dewu technical experts at AICon Shanghai.

It is the opening piece of the "Dewu Recommendation AI Harness Engineering Practice Series," a three-part series. The series systematically breaks down the complete technical system for AI code generation, protection and verification, and safe launch in Dewu's complex recommendation business scenarios. It covers the overall architecture of the self-developed AI Harness, the full-process protection mechanism, the core algorithm implementation of the hybrid agent, and the practical details of industrial-grade engineering.

Part 1 (this article) gives an overview of the team's self-developed AI Harness system: the construction approach, the full-lifecycle protection mechanism, the hybrid agent architecture, and the results in practice.

1. From AI Coding to AI Builder

AI writing code is nothing new. The real difficulty is getting AI to keep producing inside complex business systems while respecting goals, boundaries, and quality standards. The Dewu recommendation team's answer is not to reinvent a tool that writes code better, but to build an AI Harness around the entire PDCA chain, so that requirements can be constrained, execution is not interrupted, effects can be measured, and experience can be reused.

Over the past year, the AI Coding experience has matured rapidly. It can write code, supplement tests, fix bugs, and even perform very efficiently in local tasks. But in real engineering systems, "being able to run" does not equal "producing according to goals."

This is especially true of recommendation systems. The chains are long and the modules are many, so a change in one place may affect multiple recall paths. Changes in effect are hard to explain, and experience is difficult to turn into standards. If AI stays only in the Do phase, it is just a faster code generator, not an engineering partner that can drive business iteration.

Core change: bringing in AI is not limited to the development phase; it covers the entire iteration loop.

2. Why AI Coding Alone Is Not Enough

Traditional engineering iteration can be abstracted as PDCA: Plan aligns goals and boundaries, Do completes development and implementation, Check verifies effects and risks, and Act consolidates the review and feeds the next round of optimization. AI Coding mainly solves Do, but failures in complex systems often occur outside Do as well.

Therefore, our goal for recommendation AI is not to make AI "better at writing code," but to bring AI into the complete iteration flywheel: clearer goals, uninterrupted execution, quantifiable effects, and reusable experience.

AI Coding to AI Builder: Being able to run does not equal producing according to goals.

3. The Essence of Harness: Not an Iron Cage, but an Environment

Before talking about Harness, consider a movie: The Truman Show. Truman is locked in a giant fake world, but the truly effective constraints are not cameras, the island, or the actors, but the environment itself: it makes Truman feel that this is just the way the world is.

A good AI Harness is the same. It is not about hanging a series of hard rules outside the AI, but embedding goals, boundaries, dependencies, verification, and feedback capabilities into the collaborative environment, making it difficult for the AI to cross boundaries while "acting naturally."

A good Harness is not an iron cage, but an environment. It lets the AI feel that it is acting freely, while every step naturally stays within an engineering context that can be verified, rolled back, and reused.

The Truman Show: The most effective harness is the environment, making him feel that the world is supposed to be this way.

Seven-Stage Guardrails: Breaking PDCA into Measurable Collaboration Surfaces

Seven-Stage Guardrails: Full Coverage of PDCA

4. Plan: Using Contracts to Turn Requirements into Guardrails

Many requirement failures happen not because the code was poorly written, but because the understanding was wrong from the start. Natural language PRDs are ambiguous for humans, and even more so for AI. So the core of the Plan phase is to transform requirements into structured contracts that AI can understand, execute, and verify.

In Dewu's recommendation practice, T-PRD breaks requirements down into EPs, and each EP is bound to an impact scope, a metric direction, stability red lines, and acceptance assertions. Take "negative feedback re-weighting" as an example. The product side says: "when users click 'not interested', they hope to see fewer similar products." Engineering must break this down into executable units such as signal access, multi-granularity down-weighting strategies, experiments, and metric guardrails.

feature: negative_feedback_rerank
goal: After a user clicks "not interested," reduce the exposure of similar products.
scope:
  - Signal: not_interested / dislike
  - Ranking: item / spu / shop / brand
guardrails:
  - Prohibit significant degradation of core click-through rate
  - Must retain diversity and novelty observations
  - All affected modules must have a rollback path
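One way to make such a contract checkable before it reaches an agent is to load it into a typed structure and reject it when required parts are missing. The sketch below is only an illustration under assumptions: the class `RequirementContract`, the function `validate_contract`, and the rule that at least one guardrail must mention rollback are hypothetical and are not Dewu's actual T-PRD schema.

```python
from dataclasses import dataclass, field


@dataclass
class RequirementContract:
    """Hypothetical in-memory form of one requirement contract."""
    feature: str
    goal: str
    scope: dict[str, list[str]] = field(default_factory=dict)
    guardrails: list[str] = field(default_factory=list)


def validate_contract(contract: RequirementContract) -> list[str]:
    """Return a list of problems; an empty list means the contract is usable."""
    problems = []
    if not contract.goal.strip():
        problems.append("goal is empty")
    if not contract.scope:
        problems.append("scope is empty: impact range is undefined")
    if not contract.guardrails:
        problems.append("no guardrails declared")
    # Assumed rule for this sketch: at least one guardrail must mention rollback.
    if not any("rollback" in g.lower() for g in contract.guardrails):
        problems.append("no rollback guardrail")
    return problems


contract = RequirementContract(
    feature="negative_feedback_rerank",
    goal='After a user clicks "not interested," reduce the exposure of similar products.',
    scope={
        "Signal": ["not_interested", "dislike"],
        "Ranking": ["item", "spu", "shop", "brand"],
    },
    guardrails=[
        "Prohibit significant degradation of core click-through rate",
        "Must retain diversity and novelty observations",
        "All affected modules must have a rollback path",
    ],
)
problems = validate_contract(contract)  # empty for the contract above
```

The point of the design is that a contract with gaps is rejected with named reasons before any code is generated, rather than being interpreted loosely later.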

5. Do: Let AI Develop with Zero Waiting

The biggest fear in autonomous AI development is "waiting for people." If the AI finishes writing code but cannot run it, cannot get logs, or depends on unstable services, it will keep turning back to ask humans and end up as a very expensive auto-complete.

6. Check: Make Recommendation Effects Measurable 24/7

Checking is difficult in recommendation systems because the team itself often cannot easily judge whether a recommendation is good. Traditional methods rely on AUC, GAUC, online experiments, and manual reviews, which are costly and slow to give feedback.

The Axis Recommendation AI Evaluation Platform introduces AI reviewers that simulate different user profiles and score recommendation results on dimensions such as novelty, quality, and relevance. It does not replace online experiments; it adds an early-warning layer for experience risks before launch. AI scores all results, experts review a sample, and the knowledge from those reviews is fed back into the evaluation system.

Key Point: AI evaluation is not to prove the model is definitely right, but to expose experience risks earlier and allow review criteria to be continuously accumulated.
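The scoring and sampling flow described above can be sketched roughly as follows. This is a minimal illustration, not the Axis platform's implementation: the 1 to 5 scale, the threshold of 3.0, the 10% sampling rate, and all function names are assumptions made for the example.

```python
import random
from statistics import mean

DIMENSIONS = ("novelty", "quality", "relevance")


def aggregate_scores(reviews: list[dict[str, float]]) -> dict[str, float]:
    """Average the per-dimension scores given by several simulated personas."""
    return {dim: mean(r[dim] for r in reviews) for dim in DIMENSIONS}


def flag_risks(scores: dict[str, float], threshold: float = 3.0) -> list[str]:
    """Return the dimensions whose average score falls below the threshold."""
    return [dim for dim in DIMENSIONS if scores[dim] < threshold]


def sample_for_experts(result_ids: list[str], rate: float, seed: int = 0) -> list[str]:
    """Draw a reproducible random subset of results for human expert review."""
    k = min(len(result_ids), max(1, round(len(result_ids) * rate)))
    return random.Random(seed).sample(result_ids, k)


# Hypothetical scores from three simulated personas for one recommendation list.
reviews = [
    {"novelty": 2.0, "quality": 4.0, "relevance": 4.5},
    {"novelty": 3.0, "quality": 4.0, "relevance": 4.0},
    {"novelty": 2.5, "quality": 5.0, "relevance": 3.5},
]
scores = aggregate_scores(reviews)
risks = flag_risks(scores)  # novelty averages 2.5, so it is flagged
expert_queue = sample_for_experts([f"result-{i}" for i in range(20)], rate=0.1)
```

A flagged dimension here is a prompt for a closer look, not a verdict; the expert sample is what keeps the AI reviewers' criteria honest over time.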

Check: Axis Recommendation AI Evaluation Platform, turning experience review into 24/7 automatic review.

7. Act: Turning Bad Cases into the Next Round of Capability

When an anomaly occurs online, the system goes through Bad Case capture, diagnosis, sandbox replay, and recording the outcome as a Story. A single investigation should not leave behind only a conclusion; it should leave a path that can be reused directly next time.

8. After the Seven Stages, Three Deep Pain Points Remain

Process guardrails can solve many problems, but the Agent itself still has limitations: knowledge gets lost, behavior drifts, and paths are opaque. These problems do not belong to any one phase; they are inherent to Agent engineering.

9. Knowledge Governance: Documents for People, Coding Puts Shackles on AI

There is a programmer joke: programmers dislike two things most. The first is other people not writing documentation, and the second is writing documentation themselves. AI is the same. If you do not tell it the rules, it runs wild; if you tell it only in natural language, it struggles to understand the boundaries reliably.

Dewu's recommendation team divides knowledge into three layers: L1 is the overall architecture, defining non-negotiable action boundaries; L2 is module design documents, explaining key trade-offs and dependency relationships; L3 is code comments, closest to the AI, fetched on demand while reading code.
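To make the L3 layer concrete, here is what a comment written for an AI reader might look like. The function, its constants, and the boundaries stated in the docstring are all hypothetical and exist only to show the shape of such a comment; they are not taken from Dewu's code.

```python
def apply_negative_feedback_penalty(score: float, dislike_count: int) -> float:
    """Down-weight an item's ranking score after "not interested" signals.

    L3 notes for AI readers (hypothetical example):
    - Boundary: never return a negative score; downstream sorting assumes >= 0.
    - Boundary: the penalty is capped, so repeated signals cannot zero out an item.
    - Dependency: the caller must pass a score already calibrated to [0, 1].
    """
    max_penalty = 0.5          # cap: at most halve the score
    penalty_per_signal = 0.2   # each signal removes 20% until the cap is hit
    penalty = min(max_penalty, penalty_per_signal * max(0, dislike_count))
    return max(0.0, score * (1.0 - penalty))
```

The constraints sit next to the code they govern, so an agent that reads the function picks them up without having to retrieve a separate document.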

In experiments, after L3 comments were added, accuracy on simple problems rose from 52% to 91% and accuracy on complex problems reached 100%; total token consumption fell by 48% for simple problems and by 26% for complex ones. A single context may become longer, but the number of turns needed to complete a task dropped significantly, so overall cost went down.
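The trade-off between longer contexts and fewer turns is simple arithmetic: total tokens are roughly turns multiplied by average tokens per turn. The numbers below are invented purely to illustrate how a 48% reduction can arise even when each turn gets more expensive; the source does not report turn counts or per-turn sizes.

```python
def total_tokens(turns: int, avg_tokens_per_turn: int) -> int:
    """Rough cost model: every turn consumes about the same number of tokens."""
    return turns * avg_tokens_per_turn


# Hypothetical figures, chosen only to illustrate the trade-off.
before = total_tokens(turns=10, avg_tokens_per_turn=4_000)  # 40,000 tokens
after = total_tokens(turns=4, avg_tokens_per_turn=5_200)    # 20,800 tokens

context_growth = 5_200 / 4_000 - 1   # each turn is 30% larger
reduction = 1 - after / before       # yet the total is 48% smaller
```

In this made-up case the per-turn context grows by 30%, but cutting the turns from 10 to 4 more than pays for it.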

L3 Comment Evaluation: Moving the model from guessing to verifiable.

10. TuiChaCha: A Hybrid Agent Architecture of Highway and ATV

In the recommendation chain troubleshooting scenario, a realistic observation is: 80% of problems are high-frequency, categorizable, and reproducible, while 20% are long-tail, complex, and require exploration. These two types of problems should not be solved by the same Agent path.

Highway: Determinism Comes from Code

A classic joke goes: your girlfriend tells you to buy two bananas and, if you see apples, to buy four. A human will guess whether that means bananas or apples; code will not, it simply executes the conditions as written.

The Highway principle is the same: a good Highway is not better at guessing; it does not guess at all. Stable paths are written into code, so every run executes at the same points, is observed at the same points, and has its errors located at the same points. The LLM is responsible only for polishing the final result.
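A minimal sketch of this idea, assuming a table that maps known issue types to fixed step lists. The issue type `item_not_shown`, the step functions, and the context fields are hypothetical; the source does not describe TuiChaCha's actual routes.

```python
from typing import Callable, Optional

# Each Highway route is a fixed, ordered list of steps written in code.
Step = Callable[[dict], dict]


def check_recall(ctx: dict) -> dict:
    """Record whether the item was recalled at all."""
    ctx["recall_ok"] = ctx.get("recall_count", 0) > 0
    return ctx


def check_filter(ctx: dict) -> dict:
    """Record whether the item was removed by a block list."""
    ctx["filtered_out"] = ctx.get("item_id") in ctx.get("blocked_items", set())
    return ctx


HIGHWAY_ROUTES: dict[str, list[Step]] = {
    "item_not_shown": [check_recall, check_filter],
}


def run_highway(issue_type: str, ctx: dict) -> Optional[dict]:
    """Run the fixed steps for a known issue type; return None if unknown."""
    steps = HIGHWAY_ROUTES.get(issue_type)
    if steps is None:
        return None  # no guessing: the caller hands off to the exploratory path
    for step in steps:
        ctx = step(ctx)
    return ctx


known = run_highway(
    "item_not_shown",
    {"item_id": "A1", "recall_count": 3, "blocked_items": {"A1"}},
)
unknown = run_highway("never_seen_before", {})  # None, so it goes to ATV
```

Because the steps and their order live in code, two runs on the same input take the same path, and a failure points at a specific step rather than at a model's reasoning.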

ATV: Long-tail Problems Need Controlled Exploration

The remaining 20% of long-tail problems cannot be covered by hard-coded programs. ATV provides tools, MCP, and constraints, and lets the Agent decompose the problem, call tools, read results, and draw conclusions autonomously in a ReAct manner. After a successful exploration, Memory prunes the trajectory, lifts one-off values such as UIDs into business variables, and, once the result passes a Dry Run admission check, stores it as a new Highway capability.
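The pruning and variable-lifting step can be sketched as below. The trajectory format, the tool names, and the `{uid}` placeholder syntax are assumptions for illustration; the Dry Run admission check is not modeled here.

```python
def generalize_trajectory(steps: list[dict], one_off_values: dict[str, str]) -> list[dict]:
    """Drop failed steps and replace one-off literals with named variables.

    one_off_values maps a concrete value seen during exploration (for example
    a specific UID) to the business variable name that should replace it.
    """
    template = []
    for step in steps:
        if not step.get("ok", False):
            continue  # prune dead ends from the exploration
        args = {
            key: "{" + one_off_values[val] + "}" if val in one_off_values else val
            for key, val in step["args"].items()
        }
        template.append({"tool": step["tool"], "args": args})
    return template


# Hypothetical trajectory recorded from one successful exploration.
trajectory = [
    {"tool": "get_user_profile", "args": {"uid": "12345"}, "ok": True},
    {"tool": "query_wrong_table", "args": {"uid": "12345"}, "ok": False},
    {"tool": "get_recall_log", "args": {"uid": "12345", "scene": "homepage"}, "ok": True},
]
template = generalize_trajectory(trajectory, {"12345": "uid"})
```

The resulting template keeps only the two successful tool calls and refers to `{uid}` instead of a literal ID, which is what makes it a candidate for replay against other users.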

Memory: Turning a single success into the next default capability.

11. NOW: From Single-Point Efficiency to Engineering Compound Interest

When Plan, Do, Check, and Act are all governed by the AI Harness, the benefit is not just "someone writes code faster," but the entire iteration system starts turning.

The value of this system is not to let AI replace engineers, but to make the engineering system itself more suitable for humans and AI to work together.

12. Epilogue: The Carbon-Silicon Butterfly Dream

Over two thousand years ago, Zhuangzi woke up, not knowing if he had dreamed of being a butterfly, or if a butterfly had dreamed of being him. Today's AI collaboration has a similar illusion: on one hand, we write Prompts for large models, feed them Context, and encourage them to enter a creative state; on the other hand, we ourselves, within processes, work orders, SOPs, and evaluation metrics, increasingly resemble an interface.

Thus, an interesting reversal emerges: we treat AI like humans, accepting its emergence, hallucinations, and uncertainty; simultaneously, we also treat humans like AI, engineering communication prerequisites, inputs and outputs, execution boundaries, and health metrics.

The Harness is the edge of the dream. It doesn't judge who is dreaming, it only ensures that when AI wakes, there are rules to fall back on, and when humans are tired, there are processes for support. The ultimate proposition is not "Can AI write code?" but "Can we incorporate AI into a controllable, measurable, and reusable engineering collaboration system?" This is the real leap from wild code to goal-oriented production.

Carbon-Silicon Butterfly Dream: The Harness is the edge of the dream, and also the safety net for engineering collaboration.

Next Part Preview: Part 2 of the "Dewu Recommendation AI Harness Engineering Practice Series," titled "Recommendation System Diagnostic Agent: From 'Calling APIs' to 'Thinking' | Dewu Technology," builds on this article and breaks down more of the principles and engineering implementation details.


Text / San Bai


Reproduction without permission from Dewu Technology is strictly prohibited, otherwise legal liability will be pursued according to law.