Testing · AIGC · LLM

VLM-Driven UI Testing That Heals Itself Across iOS, Android, and HarmonyOS

By 得物技术 · Jul 2, 2026


Cross-platform UI testing has been a three-script tax that grows with every release. A single VLM-driven script that self-heals across platforms collapses the maintenance burden and makes CI pipelines viable for teams shipping to Android, iOS, and HarmonyOS simultaneously.

Summary

Dewu's ai_uitester swaps traditional UI element locators for a VLM that reads screenshots, understands page semantics, and decides what to tap next. The same JSON test script runs unchanged on Android, iOS, and HarmonyOS because the model sees pixels, not platform-specific view trees. When a test fails, a rule engine filters out device and network issues before handing business failures to the VLM for diagnosis and repair. The system can detect that a pop-up silently blocked earlier steps, roll back the execution pointer, insert a conditional dismissal action, and resume, all without human intervention.
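As a rough illustration of why one script can serve every platform, the sketch below runs a locator-free JSON script through an observe, decide, act loop. The script schema, the `locate` callback (standing in for the VLM call), and the `FakeDriver` class are all assumptions made for this example; the article does not publish ai_uitester's actual format or API.

```python
import json

# Hypothetical script: natural-language instructions only, no platform locators.
SCRIPT = json.loads("""
{"name": "search smoke test",
 "steps": [{"instruction": "Tap the search box"},
           {"instruction": "Tap the first result"}]}
""")

class FakeDriver:
    """Stand-in for an Android, iOS, or HarmonyOS driver (assumed interface)."""
    def __init__(self):
        self.taps = []

    def screenshot(self):
        return b"fake-png-bytes"

    def tap(self, x, y):
        self.taps.append((x, y))

def run(script, driver, locate):
    """Execute each step: screenshot, ask `locate` for coordinates, tap.

    `locate(screenshot, instruction) -> (x, y)` stands in for the VLM call.
    """
    for step in script["steps"]:
        x, y = locate(driver.screenshot(), step["instruction"])
        driver.tap(x, y)

# Usage with a stub locator that always answers the same point.
driver = FakeDriver()
run(SCRIPT, driver, lambda shot, instruction: (100, 200))
```

Only the driver differs per platform in this sketch; the script and the loop stay the same, which is the property the summary attributes to the pixel-based approach.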

A six-phase pipeline converts hundreds of descriptive test cases from a legacy platform into executable scripts. An LLM enhancement stage injects module-level Wiki knowledge so generated steps match real UI flows, and a five-level fallback matching strategy keeps bad knowledge out of prompts. The Wiki is consumed at enhancement time, during self-healing diagnosis, at runtime for on-demand page lookups, by downstream skills, and in a feedback loop that tunes matching over time.

Production data from the Dewu app shows a 90% reduction in test-case conversion time, a 70% drop in debugging effort, and an 80% self-healing success rate. The architecture bets that the UI will always change, so it builds for adaptation rather than resistance, treating the knowledge base as living infrastructure that improves with every execution.

Takeaways

— VLM sees pixel-level screenshots, not DOM or view trees, so one JSON test script runs on Android, iOS, and HarmonyOS without modification.

— A six-phase pipeline converts hundreds of descriptive test cases from a legacy platform into executable JSON scripts, cutting conversion time by 90%.

— LLM enhancement injects module-level Wiki knowledge during step generation so scripts match real UI flows, not idealized descriptions.

— A five-level fallback matching strategy (exact, strip priority suffix, strip parentheses, semantic match, skip-and-cache) prevents bad knowledge from entering prompts.

— Failure classifier filters device, timeout, and network errors before routing business failures to the VLM for diagnosis.

— Self-healing can detect silent preceding-step failures (e.g., a pop-up that blocked earlier actions), roll back the execution pointer, insert a fix, and resume.

— Confidence threshold is set to 0.5; below that the system returns Action: Null rather than risk a mis-click.

— Deep-thinking mode adds sub-goal decomposition, progress tracking with ✓/→/○ markers, and before/after screenshot comparison.

— Production metrics: 90% reduction in test-case conversion time, 70% drop in debugging effort, 80% self-healing success rate.

— Wiki knowledge base is consumed in five scenarios: enhancement, self-healing diagnosis, runtime execution, downstream skills, and a feedback loop that continuously optimizes matching.

Conclusions

Treating the UI as a visual surface rather than a widget tree is the architectural bet that makes cross-platform unification trivial. The model doesn't care whether a button is a UIButton or an android.widget.Button; it cares what the button looks like and what it says.

Self-healing that returns a complete step list instead of an incremental diff is a deliberate trade-off: higher token cost, but it avoids the index-drift bugs that plague diff-based repair after multiple fixes.
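The index-drift problem can be shown with a toy example. The step names and the `(index, step)` diff format below are invented for illustration and are not the article's repair format.

```python
def apply_diffs_naively(steps, diffs):
    """Apply insert diffs in order, using each index exactly as written."""
    out = list(steps)
    for index, new_step in diffs:
        out.insert(index, new_step)
    return out

original = ["open app", "tap Search", "type query", "assert results"]

# Two repairs, each expressed against the ORIGINAL step indices.
diffs = [(1, "dismiss coupon pop-up"), (3, "wait for results")]

# The first insert shifts everything after it, so the second lands one slot early.
drifted = apply_diffs_naively(original, diffs)

# A full-list repair carries no indices, so there is nothing to drift.
repaired = [
    "open app",
    "dismiss coupon pop-up",
    "tap Search",
    "type query",
    "wait for results",
    "assert results",
]
```

In `drifted`, "wait for results" ends up before "type query" instead of before "assert results". Returning the whole list costs more tokens, but the result replaces the old list outright.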

The confidence threshold of 0.5 encodes a safety-first philosophy that is rare in AI testing tools. Most systems optimize for coverage; this one optimizes for not breaking the app by clicking the wrong thing.

Wiki quality directly gates three metrics (generation accuracy, self-healing success rate, and execution pass rate), which means the knowledge base is not documentation; it is a runtime dependency with production consequences.

The five-level fallback matching strategy is a quiet piece of defensive engineering that prevents mismatched knowledge from poisoning prompts, and it is arguably more important than the LLM call itself.

Running the same script on HarmonyOS alongside Android and iOS is a signal that Huawei's platform is being treated as a first-class target, not an afterthought, which matters for teams with a China-market footprint.

Concepts & terms

VLM (Vision-Language Model)

A model that processes images and text together, enabling it to look at a screenshot, read natural-language instructions, and decide where to tap or what to assert.

Self-healing in testing

The ability of a test framework to detect a failure, diagnose its root cause (e.g., a moved button, a new pop-up), and rewrite the test steps automatically so the test can pass without human editing.
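The pop-up case from the summary (roll back the execution pointer, insert a conditional dismissal, resume) might look like the sketch below. The step dictionaries, the `tap_if_present` action name, and the function signature are hypothetical, not ai_uitester's real schema.

```python
def heal_blocked_by_popup(steps, failed_index, first_blocked_index):
    """Return (new_steps, resume_index) for a pop-up that blocked earlier steps.

    `first_blocked_index` is the earliest step the diagnosis believes was
    silently swallowed by the pop-up; execution resumes from there.
    """
    if not 0 <= first_blocked_index <= failed_index < len(steps):
        raise ValueError("indices out of range")
    dismiss = {
        "action": "tap_if_present",        # hypothetical conditional action
        "target": "pop-up close button",
    }
    # Build a complete new step list; the input list is left untouched.
    new_steps = steps[:first_blocked_index] + [dismiss] + steps[first_blocked_index:]
    return new_steps, first_blocked_index

steps = [
    {"action": "launch"},
    {"action": "tap", "target": "Search"},
    {"action": "input", "text": "shoes"},
    {"action": "assert", "target": "results"},
]
# The assert at index 3 failed, but the tap at index 1 was the first blocked step.
healed, resume_at = heal_blocked_by_popup(steps, failed_index=3, first_blocked_index=1)
```

Because the dismissal is conditional, the healed script still passes on runs where the pop-up never appears.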

Failure classifier

A rule engine that categorizes test failures into device, timeout, network, or business errors, routing only business failures to the AI diagnosis path to avoid wasting LLM calls on infrastructure flakiness.
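One way to sketch such a rule engine is an ordered list of patterns with a business-failure default. The regular expressions and error strings here are illustrative guesses; the article names the four categories but not the rules behind them.

```python
import re
from enum import Enum

class FailureKind(Enum):
    DEVICE = "device"
    TIMEOUT = "timeout"
    NETWORK = "network"
    BUSINESS = "business"

# Hypothetical patterns, checked in order; the first match wins.
RULES = [
    (FailureKind.DEVICE,
     re.compile(r"device (offline|not found)|\badb\b|session not created", re.I)),
    (FailureKind.TIMEOUT, re.compile(r"timed? ?out", re.I)),
    (FailureKind.NETWORK,
     re.compile(r"connection (reset|refused)|\bdns\b|http 5\d\d", re.I)),
]

def classify(error_message):
    """Map an error message to a FailureKind; unmatched errors are BUSINESS."""
    for kind, pattern in RULES:
        if pattern.search(error_message):
            return kind
    return FailureKind.BUSINESS

def should_call_vlm(error_message):
    """Only business failures are worth a VLM diagnosis call."""
    return classify(error_message) is FailureKind.BUSINESS
```

For example, `should_call_vlm("Connection reset by peer")` is false, so a flaky network never reaches the model.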

Confidence threshold (0.5)

A cutoff below which the VLM refuses to act. If the model's confidence that it has identified the correct UI element is below 0.5, it returns Action: Null rather than risk a mis-click.
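The gate itself is simple to sketch. The 0.5 value comes from the article; the `Proposal` type and the use of `None` to represent Action: Null are assumptions for this example.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

CONFIDENCE_THRESHOLD = 0.5  # value stated in the article

@dataclass
class Proposal:
    action: str               # e.g. "tap"
    target: Tuple[int, int]   # screen coordinates (assumed representation)
    confidence: float         # model-reported confidence in [0, 1]

def gate(proposal: Proposal,
         threshold: float = CONFIDENCE_THRESHOLD) -> Optional[Proposal]:
    """Return the proposal if it clears the threshold, else None (Action: Null)."""
    if proposal.confidence < threshold:
        return None
    return proposal
```

A caller would treat `None` as "do nothing and report", which trades a skipped step for the guarantee that a low-confidence guess never touches the app.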

Wiki knowledge base (testing context)

A structured store of per-module UI knowledge (page layouts, common pop-ups, navigation flows) that is injected into LLM prompts during test generation, diagnosis, and runtime execution to improve accuracy.

Deep-thinking mode

An execution mode where the VLM decomposes a complex test step into sub-goals, tracks progress with visual markers, and compares before/after screenshots to verify each sub-goal completed.
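The progress markers could be rendered as in the sketch below. The article names the ✓/→/○ markers but not their exact meaning; reading them as done, in progress, and pending is an assumption, as are the sub-goal texts.

```python
def render_progress(sub_goals, current):
    """Render sub-goals with ✓ (done), → (in progress), ○ (pending).

    `current` is the index of the sub-goal being worked on.
    """
    lines = []
    for i, goal in enumerate(sub_goals):
        if i < current:
            marker = "✓"
        elif i == current:
            marker = "→"
        else:
            marker = "○"
        lines.append(f"{marker} {goal}")
    return "\n".join(lines)

progress = render_progress(
    ["Open the cart", "Select all items", "Tap checkout"],
    current=1,
)
```

A tracker like this can be fed back into the next prompt so the model knows which sub-goal the next screenshot should satisfy.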

Source: juejin.cn