VLM-Driven UI Testing That Heals Itself Across iOS, Android, and HarmonyOS

1. Why We Need AI-Native UI Testing

Three Pain Points of Traditional Solutions

Pain Point 1: High Cost of Test Case Migration

Test case platforms accumulate large numbers of descriptive test cases that are not directly executable. QA must translate them by hand, one by one: understanding the business logic, writing element locators, and debugging the execution path. For a medium-sized module, the conversion can cost several person-days.

Pain Point 2: Low Debugging Efficiency, High Manual Intervention

After a test case fails, troubleshooting follows a fixed routine: view the screenshots, compare pages, determine the cause of failure, modify the script, and re-execute. When the failure has a non-obvious cause, such as a pop-up occluding the page or a changed flow, debugging becomes very expensive.

Pain Point 3: Writing One Set Per Platform Triples Maintenance Costs

iOS, Android, and HarmonyOS have completely different element location methods. When the UI is revamped, three sets of scripts fail simultaneously.

The AI-Native Solution

Image

2. Capability 1: Automatic Conversion of Test Case Platform Data

Problem Scenario

The JSON exported from the internal test case platform is a multi-layered tree (directory nodes plus test case nodes), and each test case carries fields such as breadcrumb path, priority, and tags. In one business module, for example, the tree contains hundreds of nodes and over a hundred test cases, with a maximum depth of more than ten layers. Done by hand, translating those hundred-plus test cases takes QA several person-days.

Solution: Automated Pipeline

ai_uitester provides an automated pipeline that converts the test case platform JSON into executable scripts.

Platform JSON Export
 ↓
Phase 1: Tree Flattening — Extract all leaf nodes and breadcrumb paths
 ↓
Phase 2: Test Case Parsing — Breadcrumb path → Structured test case data
 ↓
Phase 3: Deduplication — Deduplicate against existing suites
 ↓
Phase 4: LLM Enhancement — Generate executable steps + Inject Wiki knowledge
 ↓
Phase 5: Persistence — Write to configuration file
 ↓
Phase 6: Version Archiving — Record version history
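As an illustration of Phase 1, here is a minimal sketch of tree flattening. The field names ("name", "children", "priority", "tags") and the rule that a node without a "children" key is a test case are assumptions made for the example; the real export format is not shown in this article.

```python
from typing import Any, Iterator


def flatten_cases(node: dict[str, Any], trail: tuple[str, ...] = ()) -> Iterator[dict[str, Any]]:
    """Yield one record per test case node, carrying its breadcrumb path."""
    path = trail + (node["name"],)
    if "children" not in node:          # assumed marker of a test case node
        yield {
            "breadcrumb": " / ".join(path),
            "priority": node.get("priority"),
            "tags": node.get("tags", []),
        }
        return
    for child in node["children"]:      # directory node: recurse into it
        yield from flatten_cases(child, path)


# Hypothetical export with two directory levels and two test cases.
export = {
    "name": "Community",
    "children": [
        {"name": "Feed", "children": [
            {"name": "List displays normally", "priority": "P0", "tags": ["smoke"]},
        ]},
        {"name": "Publish entry is visible", "priority": "P1"},
    ],
}

cases = list(flatten_cases(export))
for case in cases:
    print(case["breadcrumb"], case["priority"])
```

Because every record keeps its full breadcrumb, the later parsing and deduplication phases can work on a flat list without walking the tree again.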

Core Phase: LLM Enhancement

The LLM enhancement module converts "descriptive test cases" into "executable scripts." The input is the Checkpoint description from the test case platform (e.g., "A certain list displays normally and can be swiped to view more"), and the output is a complete JSON script containing step types such as App, Tap, Wait, Assertion, and Swipe.
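Since the output of this phase comes from an LLM, it is reasonable to check it before persisting. The sketch below assumes the script shape used in the JSON example later in this article ("steps" with "type" and "instruction") and only the five step types named above; the real type list may be longer, and validate_script is a hypothetical helper, not part of ai_uitester.

```python
import json

# Step types named in the article, spelled in lowercase as in its JSON example.
ALLOWED_TYPES = {"app", "tap", "wait", "assertion", "swipe"}


def validate_script(raw: str) -> list[str]:
    """Return the problems found in an LLM-generated script (empty list if none)."""
    try:
        script = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    steps = script.get("steps") if isinstance(script, dict) else None
    if not isinstance(steps, list) or not steps:
        return ["'steps' must be a non-empty list"]
    problems = []
    for i, step in enumerate(steps):
        if not isinstance(step, dict):
            problems.append(f"step {i}: not an object")
            continue
        if step.get("type") not in ALLOWED_TYPES:
            problems.append(f"step {i}: unknown type {step.get('type')!r}")
        instruction = step.get("instruction")
        if not isinstance(instruction, str) or not instruction.strip():
            problems.append(f"step {i}: empty instruction")
    return problems


good = '{"steps": [{"type": "tap", "instruction": "Click the publish button"}]}'
bad = '{"steps": [{"type": "drag", "instruction": ""}]}'
print(validate_script(good))
print(validate_script(bad))
```

A non-empty result could be fed back to the model for a retry, or used to trigger the fallback test case path.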

Prompt Engineering: Enabling LLM to Generate High-Quality Steps

StepGenerator uses carefully designed prompts. Key constraints include: Step type specifications (each type has a strict Instruction format):

Image

Key Design Decisions:

Parallel Enhancement and Checkpoint Resumption

LLM enhancement runs with high concurrency and preloads Wiki content per module: test cases in the same module share the same Wiki content, which avoids duplicate calls. After each pipeline phase completes, a checkpoint file is written; after an interruption, completed phases are skipped automatically. When enhancement fails completely, a fallback test case is constructed, and its name is chosen through a multi-level fallback strategy.
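The checkpoint behaviour can be sketched as follows. This is a simplified illustration, not the actual implementation: run_pipeline, the checkpoint file layout, and the two toy phases are all assumptions, and each phase output is assumed to be JSON-serializable.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Callable

Phase = tuple[str, Callable[[Any], Any]]


def run_pipeline(phases: list[Phase], data: Any, checkpoint: Path) -> Any:
    """Run phases in order, skipping those already recorded in the checkpoint file."""
    state = json.loads(checkpoint.read_text()) if checkpoint.exists() else {"done": {}}
    for name, func in phases:
        if name in state["done"]:
            data = state["done"][name]          # reuse the saved output
            continue
        data = func(data)
        state["done"][name] = data
        checkpoint.write_text(json.dumps(state))  # persist after every phase
    return data


calls: list[str] = []
fail_next = {"enhance": True}                   # simulate one interruption


def flatten(_: Any) -> list[str]:
    calls.append("flatten")
    return ["case-a", "case-b"]


def enhance(cases: list[str]) -> list[str]:
    calls.append("enhance")
    if fail_next.pop("enhance", False):
        raise RuntimeError("simulated interruption")
    return [c.upper() for c in cases]


phases = [("flatten", flatten), ("enhance", enhance)]
with tempfile.TemporaryDirectory() as tmp:
    ckpt = Path(tmp) / "checkpoint.json"
    try:
        run_pipeline(phases, None, ckpt)        # first run is interrupted
    except RuntimeError:
        pass
    result = run_pipeline(phases, None, ckpt)   # second run resumes

print(result, calls)
```

On the second run the flatten phase is not called again; only the interrupted enhance phase is repeated.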

Actual Results

Image

Wiki Knowledge Base: Consumption Panorama and Core Principles

The Wiki knowledge base is not an independent document collection but an infrastructure layer embedded in multiple core modules of ai_uitester. It is consumed in 5 major scenarios:

Wiki Quality Directly Affects Three Core Metrics:

Image

The principle of "quality over quantity" runs throughout: a five-level fallback matching strategy (Exact match → Remove priority suffix → Remove parentheses → LLM semantic match → Skip and cache) ensures that only knowledge matched with reasonable certainty is injected into the prompt, because incorrect knowledge is more harmful than no knowledge.
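One way to read the five levels is as a chain of increasingly loose lookups that ends in a cached miss. The sketch below is an interpretation under stated assumptions: the suffix pattern (a trailing "P0"-style marker), the parenthesis handling, and the semantic_match callback that stands in for the LLM call are all hypothetical.

```python
import re
from typing import Callable, Optional


def match_wiki(
    name: str,
    wiki: dict[str, str],
    semantic_match: Callable[[str, list[str]], Optional[str]],
    miss_cache: set[str],
) -> Optional[str]:
    """Return the Wiki entry for a module name, or None if nothing matches."""
    if name in miss_cache:                        # a miss cached by an earlier call
        return None
    no_priority = re.sub(r"[-_ ]P\d$", "", name)                       # level 2
    no_parens = re.sub(r"\s*[(（][^)）]*[)）]", "", no_priority).strip()  # level 3
    for candidate in (name, no_priority, no_parens):                   # levels 1 to 3
        if candidate in wiki:
            return wiki[candidate]
    key = semantic_match(name, list(wiki))        # level 4: stand-in for the LLM
    if key in wiki:
        return wiki[key]
    miss_cache.add(name)                          # level 5: skip and cache
    return None


wiki = {"Feed List": "The feed shows a paged list of posts."}
cache: set[str] = set()
no_llm = lambda name, keys: None                  # semantic matcher that never matches

print(match_wiki("Feed List (new) P1", wiki, no_llm, cache))
print(match_wiki("Checkout", wiki, no_llm, cache), cache)
```

Returning None instead of a best guess reflects the stated principle: a skipped lookup costs less than wrong knowledge in the prompt.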

3. Capability 2: AI Intelligent Debugging and Test Case Self-Healing

The Plight of Traditional Debugging

The troubleshooting loop after a test case fails: Fail → View screenshot → Determine cause → Modify script → Re-execute → Fail again → ……, a single test case may need to be debugged multiple times before passing.

AI Intelligent Debugging Mode

ai_uitester has a built-in AI intelligent debugging mode that implements automatic diagnosis and self-healing repair:

Test Case Execution (while loop, supports dynamic step changes)
  ↓ Step Failure
Failure Classifier (Rule Engine)
  ├─ device / timeout / network → Auto switch device or retry
  └─ business → Enter AI Diagnosis
  ↓
AI Diagnosis (VLM)
  ├─ Input: Steps annotated with ✓✗○ + Error Info + Failure Screenshot + Wiki Knowledge
  └─ Output: diagnosis + confidence + complete_steps + resume_from_index
  ↓
┌────────────────────────────────┬─────────────────────────┐
│  Confidence >= Threshold       │  Confidence < Threshold │
│  → Auto-apply fix              │  → Pop up manual review │
│  → Replace execution steps     │  → Show step diff       │
│  → Re-execute from             │  → Timeout countdown    │
│    resume_from_index           │  → Accept/Reject        │
│  → Execution passes → Solidify │  → Timeout auto Reject  │
│    test case                   │                         │
│  → Execution fails → Rollback  │                         │
│    to original test case       │                         │
└────────────────────────────────┴─────────────────────────┘
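The decision box above can be sketched in a few lines. The Diagnosis fields mirror the output fields listed in the diagram; everything else is an assumption for illustration: the heal function, the execute and review callbacks, and the choice to execute a manually accepted fix the same way as an auto-applied one. The threshold is passed in because its value for diagnosis is not stated here.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Diagnosis:
    diagnosis: str
    confidence: float
    complete_steps: list[str]
    resume_from_index: int


def heal(
    original: list[str],
    result: Diagnosis,
    execute: Callable[[list[str], int], bool],
    review: Callable[[list[str], list[str]], bool],
    threshold: float,
) -> list[str]:
    """Return the step list to keep: the fix if accepted and passing, else the original."""
    if result.confidence < threshold and not review(original, result.complete_steps):
        return original                               # rejected, or review timed out
    passed = execute(result.complete_steps, result.resume_from_index)
    return result.complete_steps if passed else original   # solidify or roll back


steps = ["launch app", "tap entry"]
fix = Diagnosis(
    diagnosis="ui_change",
    confidence=0.9,
    complete_steps=["launch app", "tap entry", "wait for page", "tap entry again"],
    resume_from_index=1,
)
always_pass = lambda new_steps, start: True
reject = lambda old, new: False

print(heal(steps, fix, always_pass, reject, threshold=0.8))
```

The original list is never mutated, so rolling back amounts to returning it unchanged.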

Failure Classifier: Filter First, Then Diagnose

Not all failures require AI diagnosis. The system uses a rule engine to filter out non-business failures such as device faults, timeouts, and network issues, automatically retrying them; only business logic failures enter the AI diagnosis process.
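A rule engine of this kind can be as simple as an ordered list of patterns. The sketch below uses the four categories from the diagram above; the regular expressions and the action names are illustrative assumptions, not the rules used in production.

```python
import re

# Hypothetical rules, checked in order; the first match wins.
RULES = [
    ("device", re.compile(r"device (offline|disconnected)|\badb\b|session not created", re.I)),
    ("timeout", re.compile(r"timed? ?out", re.I)),
    ("network", re.compile(r"connection (reset|refused)|\bdns\b|network", re.I)),
]


def classify_failure(error_message: str) -> str:
    """Map an error message to device, timeout, network, or business."""
    for category, pattern in RULES:
        if pattern.search(error_message):
            return category
    return "business"            # nothing matched: treat as a business failure


def next_action(category: str) -> str:
    """Only business failures are sent to the (more expensive) AI diagnosis."""
    if category == "business":
        return "ai_diagnosis"
    return "switch_device" if category == "device" else "retry"


for message in ("ADB: device offline", "Button 'Publish' not found on page"):
    category = classify_failure(message)
    print(message, "->", category, "->", next_action(category))
```

Filtering first keeps VLM calls for the failures where a screenshot and step history can actually explain the cause.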

Five Types of Root Cause Diagnosis

Image

Three Real Self-Healing Cases

Case 1: UI Change — Functional button position moved (Confidence: 0.9)

Original: [tap] Click a function button in the bottom toolbar → Failed: Button not found
Fix: [tap] Click the corresponding button in the top function menu bar of the page

Case 2: UI Change — Entering a page requires an extra operation (Confidence: 0.8)

Original: [tap] Click entry → Failed: Did not enter the target page after clicking
Fix: Insert a wait after the step, then click again
  [tap] Click entry
  [wait] Wait for the target page to load      ← New
  [tap] Click the entry button again            ← New

Case 3: Preceding Step Failure — Pop-up occlusion causes all subsequent steps to fail (Confidence: 0.9)

After a cold start of the App, a confirmation pop-up appeared, occluding the home page. The AI diagnosis detected that although intermediate steps were marked as ✓, they did not actually produce the expected effect. The diagnosis result was "preceding step failure," rolling back the execution pointer to step 2 and inserting a conditional Action to handle the pop-up after launching the App.

Confidence Mechanism: Better to Miss a Click Than to Click Wrong

In automated testing, "clicking the wrong place" is far more harmful than "not clicking." Confidence calibration anchors:

Image

Two iron rules: (1) MatchedText must be copied character-by-character from the screenshot, no fabrication allowed; (2) Better not to click than to click wrong.

4. Capability 3: VLM-Driven Cross-Platform Unification

The Revolutionary Nature of the VLM Approach

The core execution model of ai_uitester is a closed loop of "Screenshot → Understand → Execute." The VLM sees a pixel-level screenshot, not a DOM structure. This has three consequences: cross-platform unification is inherent (the same set of instructions works across all three platforms); the approach tolerates UI changes (a button can still be found after it moves); and what you see is what you get (the test logic is consistent with the interface a human sees).

Unified API Interface

The execution engine provides a unified API covering categories such as operations, assertions, queries, and waits, shielding the underlying platform differences.

Image

The same JSON script can be executed on Android, iOS, and HarmonyOS without any modification.

Automatic Underlying Driver Selection

The corresponding underlying driver framework is automatically selected based on the device type, completely transparent to the upper-layer code.
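A common way to get this kind of transparent selection is a registry keyed by device type. The following is a generic sketch of that pattern; the class names, the register decorator, and create_driver are hypothetical and do not describe the actual ai_uitester code.

```python
DRIVERS: dict[str, type] = {}


def register(platform: str):
    """Class decorator that records a driver class under a platform key."""
    def wrap(cls: type) -> type:
        DRIVERS[platform] = cls
        return cls
    return wrap


class PlatformDriver:
    def __init__(self, device_id: str) -> None:
        self.device_id = device_id


@register("android")
class AndroidDriver(PlatformDriver): ...


@register("ios")
class IOSDriver(PlatformDriver): ...


@register("harmonyos")
class HarmonyDriver(PlatformDriver): ...


def create_driver(device_type: str, device_id: str) -> PlatformDriver:
    """Pick the driver class for a device type; callers never name a class."""
    key = device_type.strip().lower()
    if key not in DRIVERS:
        raise ValueError(f"unsupported device type: {device_type!r}")
    return DRIVERS[key](device_id)


driver = create_driver("HarmonyOS", "dev-1")
print(type(driver).__name__, driver.device_id)
```

Upper-layer code depends only on create_driver and the shared base class, so adding a platform means registering one more class.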

Image

Core Execution Engine: BaseAIDriver

BaseAIDriver is the abstract base class for all platform drivers. It implements the core perception-decision loop: "Screenshot → Large Model Analysis → Decision Execution → Logging → Re-screenshot." The loop runs for at most 20 rounds. Tap operations go through a confidence check, and after a round that queries the knowledge base, the loop is forced to continue.
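The loop can be sketched with an abstract base class. LoopDriver below is a hypothetical stand-in, not BaseAIDriver itself: the method names, the Decision fields, and the action names are assumptions. The 20-round limit comes from the text above, and the 0.5 tap threshold is the value discussed in the trade-offs section of this article.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass

MAX_ROUNDS = 20  # upper bound stated in the article


@dataclass
class Decision:
    action: str              # e.g. "tap", "swipe", "query_wiki", "done"
    target: str = ""
    confidence: float = 1.0


class LoopDriver(ABC):
    """Hypothetical stand-in for the base driver; method names are assumptions."""

    @abstractmethod
    def screenshot(self) -> bytes: ...

    @abstractmethod
    def decide(self, instruction: str, image: bytes, history: list[Decision]) -> Decision: ...

    @abstractmethod
    def perform(self, decision: Decision) -> None: ...

    def run_step(self, instruction: str, tap_threshold: float = 0.5) -> bool:
        history: list[Decision] = []
        for _ in range(MAX_ROUNDS):
            decision = self.decide(instruction, self.screenshot(), history)
            history.append(decision)                  # log every round
            if decision.action == "done":
                return True
            if decision.action == "tap" and decision.confidence < tap_threshold:
                continue                              # skip a doubtful tap, look again
            self.perform(decision)                    # then take a fresh screenshot
        return False                                  # round budget exhausted


class FakeDriver(LoopDriver):
    """Test double that replays scripted decisions instead of calling a model."""

    def __init__(self, script: list[Decision]) -> None:
        self.script = iter(script)
        self.performed: list[str] = []

    def screenshot(self) -> bytes:
        return b""

    def decide(self, instruction, image, history) -> Decision:
        return next(self.script)

    def perform(self, decision: Decision) -> None:
        self.performed.append(f"{decision.action}:{decision.target}")


fake = FakeDriver([Decision("tap", "Publish", 0.3), Decision("tap", "Publish", 0.9), Decision("done")])
print(fake.run_step("Click the publish button"), fake.performed)
```

In the replay, the first tap is dropped because its confidence is below the threshold, and only the second one is performed.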

Image

Four Major Constraints of Prompt Engineering

Deep Thinking Mode

When deep thinking mode is enabled, the model gains three capabilities: sub-goal decomposition, progress tracking (✓/→/○ visualization), and before/after screenshot comparison, suitable for complex business processes.

5. Architectural Design Trade-offs

Why "Step-by-Step Execution" Instead of "One-Time Planning"? The core challenge of UI testing is state uncertainty: the screen changes after each operation, so a plan made in advance may rest on outdated information. The cost is that a single operation may require multiple rounds of LLM calls (up to 20), which is offset by sub-goal decomposition in deep thinking mode.

Why is the Confidence Threshold Set to 0.5? After extensive real-world testing and tuning, it strikes a balance between accuracy and coverage — if the threshold is too high, many operations are rejected, leading to low execution efficiency; if too low, the risk of mis-clicks increases. The current threshold ensures high confidence that "pass means correct."

Why Does Self-Healing Return a Complete Step List Instead of an Incremental Diff? Incremental diffs can easily cause index shifts after multiple fixes; a complete list is more intuitive and reliable. Token consumption is higher, but it avoids "fixes introducing new bugs."
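The index-shift problem is easy to reproduce. In the toy example below, two fixes are both expressed against the indices of the original list; applying them one after another corrupts the script, while a complete replacement list has no such dependency. The diff format (operation, index, text) is an assumption made for the illustration.

```python
def apply_diff(steps: list[str], diff: tuple[str, int, str]) -> list[str]:
    """Apply one index-based edit: ("insert", i, text) or ("replace", i, text)."""
    op, index, text = diff
    patched = list(steps)
    if op == "insert":
        patched.insert(index, text)
    elif op == "replace":
        patched[index] = text
    else:
        raise ValueError(f"unknown operation: {op}")
    return patched


original = ["launch app", "tap entry", "assert page title"]

# Two fixes, both computed against the ORIGINAL indices.
fix_1 = ("insert", 1, "dismiss pop-up")             # handle a pop-up after launch
fix_2 = ("replace", 2, "assert new page title")     # update the assertion text

# Applied in sequence, fix_2 lands on the wrong step: index 2 is now "tap entry".
shifted = apply_diff(apply_diff(original, fix_1), fix_2)

# A complete list states the intended result directly.
complete = ["launch app", "dismiss pop-up", "tap entry", "assert new page title"]

print(shifted)
print(complete)
```

The shifted result loses the "tap entry" step and keeps the stale assertion, which is the kind of fix-introduced bug the complete-list design avoids.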

6. Industry Comparison: Why It's a "New Paradigm"

Currently, there are three main technical routes for UI automation testing in the industry. We conduct a systematic comparison across four dimensions: core technology stack, cross-platform capability, maintenance cost, and self-healing capability.

Traditional Solutions: Appium / Selenium / XCUITest

Core Principle: Based on element location — find UI elements through locators such as ID, XPath, Accessibility ID, Class Name, and then perform operations like Click/Input/Swipe. The underlying layer obtains the View Tree/DOM structure through each platform's Accessibility API or UIAutomator.

Typical Code:

# Android (Appium, Python)
driver.find_element(By.ID, 'com.example.app:id/btn_login').click()
# iOS (XCUITest, Swift)
app.buttons["Login"].tap()
# HarmonyOS (Hypium, Python)
driver.find_component('Login').click()

Advantages: Fast execution speed (single-step operations in milliseconds), mature community, well-established CI/CD integration solutions, rich assertion capabilities.

Disadvantages:

AI-Assisted Solutions: Test.ai / Applitools

Core Principle: This approach overlays AI capabilities on top of traditional automation frameworks, in two main directions. AI element location (Test.ai) relies on CV or NLP models to replace hardcoded IDs and XPaths, matching elements by the visual features of screenshot regions or by natural language descriptions instead of locators. AI visual comparison (Applitools) uses visual AI models to compare screenshots and flag visual regressions automatically, but does not replace the underlying execution engine.

Advantages: Reduces locator maintenance costs; natural language descriptions are more readable than IDs; visual regression testing can discover UI issues missed by traditional assertions.

Disadvantages:

AI-Native Solution: ai_uitester (This Article)

Core Principle: Uses VLM as the execution engine, replacing element location with a "Screenshot → Understand → Execute" closed loop. The VLM not only identifies UI elements but also understands page semantics, business processes, and context. The knowledge base (Wiki) decouples business rules from test execution.

Typical Code:

{
  "steps": [
    {"type": "tap", "instruction": "Click the first Tab 'Community' in the bottom navigation bar"},
    {"type": "tap", "instruction": "Click the publish button in the top right corner of the page"},
    {"type": "assertion", "instruction": "Assert that the function entry appears on the page"}
  ]
}

The same JSON script is universally applicable across Android, iOS, and HarmonyOS without any modification.

Core Differences from Traditional and AI-Assisted Solutions:

Image

The Evolutionary Relationship of the Three Routes

The relationship between the three is not one of replacement, but a leap in capability dimensions:

Traditional Solution (Appium)
  └─ Solves "Can it be tested?" → Provides basic execution capability
      └─ AI-Assisted Solution (Test.ai)
            └─ Optimizes "Is it easy to test?" → Reduces locator maintenance cost
                └─ AI-Native Solution (ai_uitester)
                      └─ Redefines "Who tests?" → Shifts from human-driven to AI-driven

The key difference summed up in one sentence: The assumption of traditional and AI-assisted solutions is "the UI doesn't change," so humans need to maintain locators; the assumption of the AI-Native solution is "the UI will definitely change," so it lets AI understand and adapt to changes. This is a fundamental shift in testing philosophy — from "resisting change" to "embracing change."

7. Business Results Data

ai_uitester has been deployed and running in the client testing of the Dewu App. Core metrics are as follows:

Core Efficiency Metrics

Image

Quality Improvement Metrics

Image

Note: The self-healing success rate is affected by the quality of the knowledge base and the execution scenario; complex process changes still require manual confirmation. The above data is based on the measured averages of multiple core business modules.

8. Summary

ai_uitester represents a new AI-Native paradigm for UI automation testing:

Image

This is not just a tool upgrade, but a paradigm shift in testing — from "code-driven" to "vision-driven," from "manual debugging" to "AI self-healing," and from "three separate platforms" to "unified abstraction." The closed-loop design of the Wiki knowledge base ensures this is not a one-off tool, but a testing infrastructure that becomes smarter with use.

Text / Lin Lin

Reproduction without permission from Dewu Technology is strictly prohibited, otherwise legal liability will be pursued according to law.