VLM-Driven UI Testing That Heals Itself Across iOS, Android, and HarmonyOS
Cross-platform UI testing has been a three-script tax that grows with every release. A single VLM-driven script that self-heals across platforms collapses the maintenance burden and makes CI pipelines viable for teams shipping to Android, iOS, and HarmonyOS simultaneously.
Dewu's ai_uitester swaps traditional UI element locators for a VLM that reads screenshots, understands page semantics, and decides what to tap next. The same JSON test script runs unchanged on Android, iOS, and HarmonyOS because the model sees pixels, not platform-specific view trees. When a test fails, a rule engine filters out device and network issues before handing business failures to the VLM for diagnosis and repair. The system can detect that a pop-up silently blocked earlier steps, roll back the execution pointer, insert a conditional dismissal action, and resume, all without human intervention.
A six-phase pipeline converts hundreds of descriptive test cases from a legacy platform into executable scripts. An LLM enhancement stage injects module-level Wiki knowledge so generated steps match real UI flows, and a five-level fallback matching strategy keeps bad knowledge out of prompts. The Wiki is consumed at enhancement time, during self-healing diagnosis, at runtime for on-demand page lookups, and in a feedback loop that tunes matching over time.
Production data from the Dewu app shows a 90% reduction in test-case conversion time, a 70% drop in debugging effort, and an 80% self-healing success rate. The architecture bets that the UI will always change, so it builds for adaptation rather than resistance, treating the knowledge base as living infrastructure that improves with every execution.
Treating the UI as a visual surface rather than a widget tree is the architectural bet that makes cross-platform unification trivial. The model doesn't care whether a button is a UIButton or a android.widget.Button; it cares what the button looks like and what it says.
Self-healing that returns a complete step list instead of an incremental diff is a deliberate trade-off: higher token cost, but it avoids the index-drift bugs that plague diff-based repair after multiple fixes.
The confidence threshold of 0.5 encodes a safety-first philosophy that is rare in AI testing tools. Most systems optimize for coverage; this one optimizes for not breaking the app by clicking the wrong thing.
Wiki quality directly gates three metrics (generation accuracy, self-healing success rate, and execution pass rate), which means the knowledge base is not documentation — it is a runtime dependency with production consequences.
The five-level fallback matching strategy is a quiet piece of defensive engineering that prevents hallucinated knowledge from poisoning prompts, and it is more important than the LLM call itself.
Running the same script on HarmonyOS alongside Android and iOS is a signal that Huawei's platform is being treated as a first-class target, not an afterthought, which matters for teams with a China-market footprint.