VLM-Driven UI Testing That Heals Itself Across iOS, Android, and HarmonyOS

1. Why We Need AI-Native UI Testing

Three Pain Points of Traditional Solutions

Pain Point 1: High Cost of Test Case Migration

Test case platforms accumulate large numbers of descriptive test cases that are not directly executable. QA has to translate them manually, one by one: understanding the business logic, writing element locators, and debugging execution paths. For a medium-sized module, the conversion can cost several person-days.

Pain Point 2: Low Debugging Efficiency, High Manual Intervention

Troubleshooting a failed test case means viewing screenshots, comparing pages, determining the cause of failure, modifying the script, and re-executing. When the cause is not obvious, such as a pop-up occluding the page or a changed flow, debugging becomes extremely expensive.

Pain Point 3: Writing One Set Per Platform Triples Maintenance Costs

iOS, Android, and HarmonyOS have completely different element location methods. When the UI is revamped, three sets of scripts fail simultaneously.

The AI-Native Solution

2. Capability 1: Automatic Conversion of Test Case Platform Data

Problem Scenario

The JSON exported from the internal test case platform is a multi-layered tree structure (directory nodes + test case nodes), with each test case carrying fields such as breadcrumb path, priority, and tags. Taking a certain business module as an example, this structure contains hundreds of nodes, over a hundred test cases, and a maximum depth of more than ten layers. The traditional processing method requires QA to manually translate them one by one, taking several person-days to complete the conversion of over a hundred test cases.

Solution: Automated Pipeline

ai_uitester provides an automated pipeline that converts the test case platform JSON into executable scripts.

Platform JSON Export
 ↓
Phase 1: Tree Flattening — Extract all leaf nodes and breadcrumb paths
 ↓
Phase 2: Test Case Parsing — Breadcrumb path → Structured test case data
 ↓
Phase 3: Deduplication — Deduplicate against existing suites
 ↓
Phase 4: LLM Enhancement — Generate executable steps + Inject Wiki knowledge
 ↓
Phase 5: Persistence — Write to configuration file
 ↓
Phase 6: Version Archiving — Record version history
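
Phase 1 can be pictured as a depth-first walk over the exported tree. The sketch below is illustrative only: the field names (`name`, `children`, `priority`, `tags`) and the rule that a node without children is a test case are assumptions, since the real export schema is not shown in this article.

```python
from typing import Any, Dict, Iterator


def flatten_cases(node: Dict[str, Any], trail: tuple = ()) -> Iterator[Dict[str, Any]]:
    """Yield one record per leaf node, carrying its breadcrumb path."""
    path = trail + (node["name"],)
    children = node.get("children") or []
    if not children:
        # Assumed rule: a node with no children is a test case.
        yield {
            "breadcrumb": " / ".join(path),
            "priority": node.get("priority"),
            "tags": node.get("tags", []),
        }
        return
    for child in children:
        yield from flatten_cases(child, path)


# Hypothetical export with one directory level and two test cases.
export = {
    "name": "Community",
    "children": [
        {"name": "Feed", "children": [
            {"name": "List displays normally", "priority": "P0", "tags": ["smoke"]},
        ]},
        {"name": "Publish entry is visible", "priority": "P1"},
    ],
}
cases = list(flatten_cases(export))
```

Each record keeps the full breadcrumb, which is what Phase 2 would then parse into structured test case data.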

Core Phase: LLM Enhancement

The LLM enhancement module converts "descriptive test cases" into "executable scripts." The input is the Checkpoint description from the test case platform (e.g., "A certain list displays normally and can be swiped to view more"), and the output is a complete JSON script containing step types such as App, Tap, Wait, Assertion, and Swipe.

Prompt Engineering: Enabling LLM to Generate High-Quality Steps

StepGenerator uses carefully designed prompts. The key constraint is the step type specification: each step type has a strict Instruction format.

Key Design Decisions:

Conditional pop-ups use the Action type: handled when present, automatically skipped when absent;
Non-standard steps are automatically downgraded: a validator acts as a safety net, preferring correction over discarding;
Every test case must contain an App step, otherwise validation reports an error directly.
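
A minimal sketch of such a safety-net validator follows. The lowercase type names mirror the JSON scripts in this article, but the downgrade target (the generic `action` type) and the function name `validate_steps` are assumptions made for illustration.

```python
from typing import Dict, List

KNOWN_TYPES = {"app", "tap", "wait", "assertion", "swipe", "action"}


def validate_steps(steps: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Correct non-standard steps rather than discard them; require an App step."""
    fixed = []
    for step in steps:
        step_type = step.get("type", "").strip().lower()
        if step_type not in KNOWN_TYPES:
            # Assumed downgrade target: the generic "action" type.
            step_type = "action"
        fixed.append({"type": step_type, "instruction": step.get("instruction", "")})
    if not any(s["type"] == "app" for s in fixed):
        raise ValueError("test case must contain an App step")
    return fixed


steps = validate_steps([
    {"type": "App", "instruction": "Launch the app"},
    {"type": "long_press", "instruction": "Long-press the first card"},
])
```

The unknown `long_press` step survives as an `action` step instead of being dropped, while a script without any App step is rejected outright.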

Parallel Enhancement and Checkpoint Resumption

LLM enhancement supports high-concurrency processing and implements module-level Wiki preloading — test cases within the same module share the same Wiki content, avoiding duplicate calls. After each phase of the pipeline is completed, a checkpoint file is written; upon interruption, completed phases are automatically skipped. When enhancement fails completely, a Fallback test case is constructed, with its name determined through a multi-level fallback strategy.
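
Checkpoint resumption can be sketched as follows. This assumes a single JSON checkpoint file that records the names of completed phases; the file layout and the function names are hypothetical.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable, Dict, List, Tuple

Phase = Tuple[str, Callable[[Dict], Dict]]


def run_pipeline(phases: List[Phase], checkpoint: Path) -> Dict:
    """Run phases in order, skipping any already recorded in the checkpoint file."""
    if checkpoint.exists():
        state = json.loads(checkpoint.read_text())
    else:
        state = {"done": [], "data": {}}
    for name, phase in phases:
        if name in state["done"]:
            continue  # completed before the interruption
        state["data"] = phase(state["data"])
        state["done"].append(name)
        checkpoint.write_text(json.dumps(state))  # persist after every phase
    return state


calls: List[str] = []


def flatten(data: Dict) -> Dict:
    calls.append("flatten")
    return {**data, "cases": ["case-1", "case-2"]}


def parse(data: Dict) -> Dict:
    calls.append("parse")
    return {**data, "parsed": len(data["cases"])}


checkpoint_path = Path(tempfile.mkdtemp()) / "pipeline.ckpt.json"
# First run stops after phase 1, simulating an interruption.
run_pipeline([("flatten", flatten)], checkpoint_path)
# Resumed run: "flatten" is skipped, only "parse" executes.
result = run_pipeline([("flatten", flatten), ("parse", parse)], checkpoint_path)
```

Because the checkpoint is written after every phase, a crash costs at most the phase that was in flight.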

Actual Results

Wiki Knowledge Base: Consumption Panorama and Core Principles

The Wiki knowledge base is not an independent document collection but an infrastructure layer embedded in multiple core modules of ai_uitester. It is consumed in 5 major scenarios:

Test Case Enhancement Phase: Inject Wiki knowledge into test case generation to make generated steps more precise;
Self-Healing Diagnosis Phase: Load Wiki by module to assist the LLM in distinguishing "UI changes" from "test case description errors";
Runtime Execution Phase: Inject an index at each operation; the LLM loads the corresponding page on demand when encountering blind spots;
Skill Consumption: Multiple downstream skills read the Wiki as background knowledge;
Feedback Loop: Record search logs after each execution to continuously optimize matching strategies.

Wiki Quality Directly Affects Three Core Metrics:

The principle of "quality over quantity" runs throughout — a five-level fallback matching strategy (Exact match → Remove priority suffix → Remove parentheses → LLM semantic match → Skip and cache) ensures that the knowledge injected into the prompt is accurate and reliable; incorrect knowledge is more harmful than no knowledge.

3. Capability 2: AI Intelligent Debugging and Test Case Self-Healing

The Plight of Traditional Debugging

The troubleshooting loop after a test case fails is: Fail → View screenshot → Determine cause → Modify script → Re-execute → Fail again → ... A single test case may need several such rounds before it passes.

AI Intelligent Debugging Mode

ai_uitester has a built-in AI intelligent debugging mode that implements automatic diagnosis and self-healing repair:

Test Case Execution (while loop, supports dynamic step changes)
  ↓ Step Failure
Failure Classifier (Rule Engine)
  ├─ device / timeout / network → Auto switch device or retry
  └─ business → Enter AI Diagnosis
  ↓
AI Diagnosis (VLM)
  ├─ Input: Steps annotated with ✓✗○ + Error Info + Failure Screenshot + Wiki Knowledge
  └─ Output: diagnosis + confidence + complete_steps + resume_from_index
  ↓
┌────────────────────────────────┬──────────────────────────┐
│  Confidence >= Threshold       │  Confidence < Threshold  │
│  → Auto-apply fix              │  → Pop up manual review  │
│  → Replace execution steps     │  → Show step diff        │
│  → Re-execute from             │  → Timeout countdown     │
│    resume_from_index           │  → Accept/Reject         │
│  → Execution passes → Solidify │  → Timeout auto Reject   │
│    test case                   │                          │
│  → Execution fails → Rollback  │                          │
│    to original test case       │                          │
└────────────────────────────────┴──────────────────────────┘
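
The confidence branch in the flow above can be sketched as a small gate. The field names follow the diagnosis output listed in the diagram; the threshold value of 0.8 and the `run_from` callback are assumptions, since the article does not state the self-healing threshold.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Step = Dict[str, str]


@dataclass
class Diagnosis:
    diagnosis: str
    confidence: float
    complete_steps: List[Step]
    resume_from_index: int


def apply_diagnosis(
    original: List[Step],
    result: Diagnosis,
    run_from: Callable[[List[Step], int], bool],
    threshold: float = 0.8,  # assumed value, not taken from the article
) -> Tuple[List[Step], str]:
    """Return (steps_to_keep, outcome) after trying a proposed fix."""
    if result.confidence < threshold:
        return original, "manual_review"
    if run_from(result.complete_steps, result.resume_from_index):
        return result.complete_steps, "solidified"
    return original, "rolled_back"


original = [
    {"type": "app", "instruction": "Launch the app"},
    {"type": "tap", "instruction": "Click a function button in the bottom toolbar"},
]
proposed = Diagnosis(
    diagnosis="ui_change",
    confidence=0.9,
    complete_steps=[
        original[0],
        {"type": "tap", "instruction": "Click the corresponding button in the top function menu bar"},
    ],
    resume_from_index=1,
)
steps, outcome = apply_diagnosis(original, proposed, run_from=lambda s, start: True)
```

The manual review path (step diff, countdown, automatic Reject on timeout) is left out of the sketch; it would hang off the `manual_review` outcome.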

Failure Classifier: Filter First, Then Diagnose

Not all failures require AI diagnosis. The system uses a rule engine to filter out non-business failures such as device faults, timeouts, and network issues, automatically retrying them; only business logic failures enter the AI diagnosis process.
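
A rule engine of this kind can be as simple as an ordered list of patterns. The keyword patterns below are hypothetical, and the mapping of device failures to a device switch and of timeout or network failures to a retry is an assumed reading of the flow above.

```python
import re

# Hypothetical keyword rules; a real rule set would be tuned on observed error messages.
RULES = [
    ("device", re.compile(r"device (offline|disconnected)|\badb\b|session (lost|not created)", re.I)),
    ("timeout", re.compile(r"timed? ?out", re.I)),
    ("network", re.compile(r"connection (reset|refused)|\bdns\b|network", re.I)),
]


def classify_failure(error_message: str) -> str:
    """Return 'device', 'timeout', 'network', or 'business' for a failed step."""
    for category, pattern in RULES:
        if pattern.search(error_message):
            return category
    return "business"  # nothing infrastructural matched: hand over to AI diagnosis


def next_action(error_message: str) -> str:
    """Map a failure category to what the runner does next."""
    category = classify_failure(error_message)
    if category == "device":
        return "switch_device"
    if category in ("timeout", "network"):
        return "retry"
    return "ai_diagnosis"
```

Running cheap rules first keeps VLM calls for the failures that actually need visual reasoning.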

Five Types of Root Cause Diagnosis

Three Real Self-Healing Cases

Case 1: UI Change — Functional button position moved (Confidence: 0.9)

Original: [tap] Click a function button in the bottom toolbar → Failed: Button not found
Fix: [tap] Click the corresponding button in the top function menu bar of the page

Case 2: UI Change — Entering a page requires an extra operation (Confidence: 0.8)

Original: [tap] Click entry → Failed: Did not enter the target page after clicking
Fix: Insert a wait after the step, then click again
  [tap] Click entry
  [wait] Wait for the target page to load      ← New
  [tap] Click the entry button again            ← New

Case 3: Preceding Step Failure — Pop-up occlusion causes all subsequent steps to fail (Confidence: 0.9)

After a cold start of the App, a confirmation pop-up appeared, occluding the home page. The AI diagnosis detected that although intermediate steps were marked as ✓, they did not actually produce the expected effect. The diagnosis result was "preceding step failure," rolling back the execution pointer to step 2 and inserting a conditional Action to handle the pop-up after launching the App.
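
The annotated step list that the diagnosis receives could be produced along these lines. The reading of the marks (✓ executed without error, ✗ failed, ○ not yet executed) and the function name are assumptions.

```python
from typing import List


def annotate_steps(instructions: List[str], failed_index: int) -> str:
    """Mark steps before the failure with ✓, the failing step with ✗, the rest with ○."""
    lines = []
    for i, text in enumerate(instructions):
        if i < failed_index:
            mark = "✓"
        elif i == failed_index:
            mark = "✗"
        else:
            mark = "○"
        lines.append(f"{mark} {i + 1}. {text}")
    return "\n".join(lines)


report = annotate_steps(
    [
        "Launch the app",
        "Click the Community tab",
        "Click the publish button",
        "Assert the editor opens",
    ],
    failed_index=2,
)
```

As Case 3 shows, ✓ only records that a step raised no error, so the diagnosis is still free to point resume_from_index at an earlier step.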

Confidence Mechanism: Better to Miss a Click Than to Click Wrong

In automated testing, "clicking the wrong place" is far more harmful than "not clicking." Confidence calibration anchors:

Two iron rules: (1) MatchedText must be copied character-by-character from the screenshot, no fabrication allowed; (2) Better not to click than to click wrong.

4. Capability 3: VLM-Driven Cross-Platform Unification

The Revolutionary Nature of the VLM Approach

The core execution model of ai_uitester is a "Screenshot → Understand → Execute" closed loop. What the VLM sees is a pixel-level screenshot, not a DOM structure. Three properties follow. Cross-platform unification is inherent: the same set of instructions works on all three platforms. UI changes are tolerated by design: a button can still be found after it moves. And what you see is what you get: the test logic matches the interface a human sees.

Unified API Interface

The execution engine provides a unified API covering categories such as operations, assertions, queries, and waits, shielding the underlying platform differences.

The same JSON script can be executed on Android, iOS, and HarmonyOS without any modification.

Automatic Underlying Driver Selection

The corresponding underlying driver framework is automatically selected based on the device type, completely transparent to the upper-layer code.
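
One common way to get this transparency is a registry keyed by device type. The class names and the registry below are placeholders, not the project's actual drivers.

```python
from typing import Callable, Dict


class AndroidDriver:
    platform = "android"


class IOSDriver:
    platform = "ios"


class HarmonyDriver:
    platform = "harmonyos"


# Hypothetical registry: device type -> driver factory.
DRIVERS: Dict[str, Callable[[], object]] = {
    "android": AndroidDriver,
    "ios": IOSDriver,
    "harmonyos": HarmonyDriver,
}


def create_driver(device_type: str):
    """Pick the platform driver for a device type; callers never name a framework."""
    try:
        return DRIVERS[device_type.lower()]()
    except KeyError:
        raise ValueError(f"unsupported device type: {device_type!r}") from None


driver = create_driver("iOS")
```

Upper-layer code only ever calls `create_driver`, so supporting a new platform means adding one registry entry.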

Core Execution Engine: BaseAIDriver

BaseAIDriver is the abstract base class for all platform drivers. It implements the core perception-decision loop of "Screenshot → Large Model Analysis → Decision Execution → Logging → Re-screenshot," which runs for at most 20 rounds. Click operations go through a confidence check, and after the model queries the knowledge base, execution is forced to continue.
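
The loop can be sketched as an abstract class with a bounded round count. This is not the project's BaseAIDriver: the class name `AIDriverSketch`, the method names, the `analyze` callback standing in for the VLM call, and the `"done"` / `"null"` action strings are all assumptions.

```python
from abc import ABC, abstractmethod
from typing import Callable, Dict, List

MAX_ROUNDS = 20  # upper bound on rounds stated in the article


class AIDriverSketch(ABC):
    """Screenshot -> analyse -> act -> log, repeated until done or out of rounds."""

    def __init__(self, analyze: Callable[[bytes, str], Dict]):
        self.analyze = analyze  # stand-in for the VLM call
        self.log: List[Dict] = []

    @abstractmethod
    def screenshot(self) -> bytes: ...

    @abstractmethod
    def perform(self, decision: Dict) -> None: ...

    def run_step(self, instruction: str) -> bool:
        for round_no in range(1, MAX_ROUNDS + 1):
            decision = self.analyze(self.screenshot(), instruction)
            self.log.append({"round": round_no, **decision})
            if decision["action"] == "done":
                return True
            if decision["action"] != "null":  # "null": do not act, look again
                self.perform(decision)
        return False  # round budget exhausted


class FakeDriver(AIDriverSketch):
    """In-memory driver whose screen changes after every performed action."""

    def __init__(self, analyze):
        super().__init__(analyze)
        self.taps = 0

    def screenshot(self) -> bytes:
        return b"screen-%d" % self.taps

    def perform(self, decision: Dict) -> None:
        self.taps += 1


def scripted_model(image: bytes, instruction: str) -> Dict:
    # Pretend the goal is reached once one tap has changed the screen.
    if image != b"screen-0":
        return {"action": "done"}
    return {"action": "tap", "x": 10, "y": 20}


driver = FakeDriver(scripted_model)
ok = driver.run_step("Click the publish button")
```

Every decision is taken on a fresh screenshot, which is why the loop needs a hard round limit to guarantee termination.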

Four Major Constraints of Prompt Engineering

Constraint 1: Only perform one action at a time. The screen state changes after each operation step; executing step-by-step ensures each decision is based on the latest screen.
Constraint 2: Strict rules for element matching. MatchedText must be copied character-by-character from the screenshot. When Confidence <= 0.5, Action: Null must be returned.
Constraint 3: Automatic injection of high-priority knowledge. Pop-ups, permissions, login pages, etc., do not need to be explicitly written in the test case; the VLM handles them automatically.
Constraint 4: Platform-specific adaptation. The prompt automatically switches system operation instructions based on the platform, transparent to the upper-layer code.
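
Constraint 2 is stated as a prompt rule, but the same rule can also be enforced on the model's reply. The sketch below assumes that a list of on-screen strings is available from some other source such as OCR, and that the reply is a dict with `action`, `matched_text`, and `confidence` keys; none of this is confirmed by the article.

```python
from typing import Dict, Iterable

CONFIDENCE_FLOOR = 0.5  # at or below this, the rule requires Action: Null

NULL_ACTION = {"action": "null"}


def gate_decision(decision: Dict, visible_texts: Iterable[str]) -> Dict:
    """Downgrade a tap to a null action unless it is confident and grounded in on-screen text."""
    if decision.get("action") != "tap":
        return decision
    if decision.get("confidence", 0.0) <= CONFIDENCE_FLOOR:
        return NULL_ACTION
    if decision.get("matched_text") not in set(visible_texts):
        return NULL_ACTION  # text was not copied verbatim from the screen
    return decision


screen = ["Community", "Publish", "Me"]
good = {"action": "tap", "matched_text": "Publish", "confidence": 0.9}
checked = gate_decision(good, screen)
```

A rejected tap costs one more round of the loop, which is the cheaper failure mode compared with clicking the wrong element.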

Deep Thinking Mode

When deep thinking mode is enabled, the model gains three capabilities: sub-goal decomposition, progress tracking (✓/→/○ visualization), and before/after screenshot comparison, suitable for complex business processes.

5. Architectural Design Trade-offs

Why "Step-by-Step Execution" Instead of "One-Time Planning"? The core challenge of UI testing is state uncertainty: the screen changes after every operation, so a plan made up front may rest on outdated information. The cost is that a single operation may need several rounds of LLM calls (up to 20), which deep thinking mode offsets by decomposing the task into sub-goals.

Why is the Confidence Threshold Set to 0.5? After extensive real-world testing and tuning, it strikes a balance between accuracy and coverage — if the threshold is too high, many operations are rejected, leading to low execution efficiency; if too low, the risk of mis-clicks increases. The current threshold ensures high confidence that "pass means correct."

Why Does Self-Healing Return a Complete Step List Instead of an Incremental Diff? Incremental diffs can easily cause index shifts after multiple fixes; a complete list is more intuitive and reliable. Token consumption is higher, but it avoids "fixes introducing new bugs."

6. Industry Comparison: Why It's a "New Paradigm"

Currently, there are three main technical routes for UI automation testing in the industry. We conduct a systematic comparison across four dimensions: core technology stack, cross-platform capability, maintenance cost, and self-healing capability.

Traditional Solutions: Appium / Selenium / XCUITest

Core Principle: Based on element location — find UI elements through locators such as ID, XPath, Accessibility ID, Class Name, and then perform operations like Click/Input/Swipe. The underlying layer obtains the View Tree/DOM structure through each platform's Accessibility API or UIAutomator.

Typical Code:

# Android (Appium, Python client)
driver.find_element(By.ID, 'com.example.app:id/btn_login').click()
# iOS (XCUITest, Swift)
app.buttons["Login"].tap()
# HarmonyOS (Hypium, Python)
driver.find_component('Login').click()

Advantages: Fast execution speed (single-step operations in milliseconds), mature community, well-established CI/CD integration solutions, rich assertion capabilities.

Disadvantages:

Cross-platform redundant construction: Three sets of scripts and three sets of locators for the same functionality. For one button, Android uses resource-id, iOS uses accessibilityLabel, HarmonyOS uses componentId — the locators are completely different across the three platforms;
UI changes cause immediate failure: Changing button text from "Next" to "Confirm" invalidates the locator; a page revamp adding a nested layer breaks the XPath. A medium-sized App typically requires 15-30% of test cases to be modified with each version iteration;
Maintenance cost grows linearly: The more test cases, the higher the maintenance cost. For regression testing of a large-scale test case suite, a single UI revamp can take weeks to fix;
No self-healing capability: Stops upon failure, requiring manual intervention to determine the cause and modify the script.

AI-Assisted Solutions: Test.ai / Applitools

Core Principle: This approach overlays AI capabilities on top of traditional automation frameworks, in two main directions. AI Element Location (Test.ai): uses CV or NLP models in place of hardcoded ID/XPath locators, matching elements by the visual features of screenshot regions or by natural language descriptions. AI Visual Comparison (Applitools): uses visual AI to compare screenshots and flag visual regressions automatically, but does not replace the underlying execution engine.

Advantages: Reduces locator maintenance costs; natural language descriptions are more readable than IDs; visual regression testing can discover UI issues missed by traditional assertions.

Disadvantages:

Essentially still element location: Although AI is used for "flexible matching," the execution model hasn't changed — it still needs to find the element and click it. When the element doesn't exist (the UI has genuinely changed), AI location also fails to find it;
Cross-platform adaptation still required: Android and iOS have different screenshot resolutions, font rendering, and component styles; AI models require platform-specific training or adaptation;
Self-healing limited to re-location: If a button moves from the bottom to the top, AI location can find it; but if the interaction flow changes (a new intermediate page is added, the operation sequence changes), AI location is helpless. This is "element-level self-healing," not "flow-level self-healing";
No business understanding: Doesn't know business rules, doesn't know why a step failed, can only report "element not found."

AI-Native Solution: ai_uitester (This Article)

Core Principle: Uses VLM as the execution engine, replacing element location with a "Screenshot → Understand → Execute" closed loop. The VLM not only identifies UI elements but also understands page semantics, business processes, and context. The knowledge base (Wiki) decouples business rules from test execution.

Typical Code:

{
  "steps": [
    {"type": "tap", "instruction": "Click the first Tab 'Community' in the bottom navigation bar"},
    {"type": "tap", "instruction": "Click the publish button in the top right corner of the page"},
    {"type": "assertion", "instruction": "Assert that the function entry appears on the page"}
  ]
}

The same JSON script is universally applicable across Android, iOS, and HarmonyOS without any modification.

Core Differences from Traditional and AI-Assisted Solutions:

The Evolutionary Relationship of the Three Routes

The relationship between the three is not one of replacement, but a leap in capability dimensions:

Traditional Solution (Appium)
  └─ Solves "Can it be tested?" → Provides basic execution capability
      └─ AI-Assisted Solution (Test.ai)
            └─ Optimizes "Is it easy to test?" → Reduces locator maintenance cost
                └─ AI-Native Solution (ai_uitester)
                      └─ Redefines "Who tests?" → Shifts from human-driven to AI-driven

The key difference summed up in one sentence: The assumption of traditional and AI-assisted solutions is "the UI doesn't change," so humans need to maintain locators; the assumption of the AI-Native solution is "the UI will definitely change," so it lets AI understand and adapt to changes. This is a fundamental shift in testing philosophy — from "resisting change" to "embracing change."

7. Business Results Data

ai_uitester has been deployed and running in the client testing of the Dewu App. Core metrics are as follows:

Core Efficiency Metrics

Quality Improvement Metrics

Note: The self-healing success rate is affected by the quality of the knowledge base and the execution scenario; complex process changes still require manual confirmation. The above data is based on the measured averages of multiple core business modules.

8. Summary

ai_uitester represents a new AI-Native paradigm for UI automation testing:

This is not just a tool upgrade, but a paradigm shift in testing — from "code-driven" to "vision-driven," from "manual debugging" to "AI self-healing," and from "three separate platforms" to "unified abstraction." The closed-loop design of the Wiki knowledge base ensures this is not a one-off tool, but a testing infrastructure that becomes smarter with use.

Text / Lin Lin

Reproduction without permission from Dewu Technology is strictly prohibited, otherwise legal liability will be pursued according to law.