A 500ms-Latency 3D AI Companion You Can Interrupt Mid-Sentence
Most 3D avatar solutions push full video streams from the cloud, burning bandwidth and limiting where they can run. A parameter-stream approach that renders locally in the browser cuts cost and latency enough to make embodied AI agents practical for kiosks, in-car screens, and web apps, not just high-end demos. The SDK described here also handles the hard synchronization problems (lip-sync, gesture timing, interruption) that teams would otherwise spend months rebuilding.
A new JS SDK from Mofa Nebula replaces high-bandwidth video streams with a parameter-stream architecture: the cloud sends lightweight motion and voice data, and the browser renders the 3D avatar locally. The result is a talking digital human with roughly 500ms end-to-end latency that runs on ordinary web devices, not just powerful GPUs.
Integration takes three steps: load the script, initialize with an App ID and Secret, and call `speak()`. The SDK handles TTS, lip-sync, facial expressions, and body movement automatically. State-control APIs (`idle()`, `interactiveidle()`, `think()`) let the avatar switch between listening, processing, and speaking, while `speak()` accepts SSML to trigger named actions like waving or dancing mid-sentence.
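As a rough illustration of how the state-control calls might be sequenced in one conversational turn, here is a minimal sketch. Only the method names (`idle()`, `interactiveidle()`, `think()`, `speak()`) come from the article. The `AvatarLike` interface, its argument types, the `RecordingAvatar` stand-in, and the `runTurn` and `interrupt` helpers are all hypothetical, and the real SDK object may look different.

```typescript
// Hypothetical interface: only the method names appear in the article;
// argument and return types here are assumptions.
interface AvatarLike {
  idle(): void;
  interactiveidle(): void;
  think(): void;
  speak(text: string): void;
}

// Stand-in that records calls, so the turn logic runs without the real SDK.
class RecordingAvatar implements AvatarLike {
  calls: string[] = [];
  idle(): void { this.calls.push("idle"); }
  interactiveidle(): void { this.calls.push("interactiveidle"); }
  think(): void { this.calls.push("think"); }
  speak(text: string): void { this.calls.push(`speak:${text}`); }
}

// One turn: show "thinking" while the reply is produced, then speak it.
// getReply is synchronous here for brevity; a real chat API call would be async.
function runTurn(
  avatar: AvatarLike,
  question: string,
  getReply: (q: string) => string
): void {
  avatar.think();
  avatar.speak(getReply(question));
}

// Interrupt handler: per the article, interactiveidle() stops the current utterance.
function interrupt(avatar: AvatarLike): void {
  avatar.interactiveidle();
}

const demo = new RecordingAvatar();
runTurn(demo, "Hi", (q) => `You said: ${q}`);
interrupt(demo);
console.log(demo.calls); // ["think", "speak:You said: Hi", "interactiveidle"]
```

Coding against a small interface like this keeps the turn logic testable without loading the avatar, and the real SDK object can be swapped in wherever `AvatarLike` is expected if its methods match.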
Hooking this up to DeepSeek's chat API produces a fully interactive AI companion. The demo uses a non-streaming call for simplicity, but the SDK supports true streaming by passing `is_start` and `is_end` flags across multiple `speak()` invocations. An interrupt button calls `interactiveidle()` to stop the current utterance immediately, and `onVoiceStateChange` callbacks let application logic react when speech begins or ends. Browser microphone access and a speech-recognition service would close the loop for voice-in, voice-out interaction.
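The streaming pattern described above can be sketched as a small helper that tags text chunks as they arrive from a language model. The flag names `is_start` and `is_end` come from the article, but the `SpeakChunk` shape, the `StreamTagger` class, and the way flags would be passed to `speak()` are assumptions made for illustration. Because the last chunk is only known once the stream closes, the sketch holds back one chunk at a time.

```typescript
// Hypothetical payload shape: the flag names come from the article, but how
// they are actually handed to speak() is an assumption.
interface SpeakChunk {
  text: string;
  is_start: boolean;
  is_end: boolean;
}

// Tags chunks from a text stream so the first one opens the utterance and the
// last one closes it. One chunk is held back until we know whether more follow.
class StreamTagger {
  send: (chunk: SpeakChunk) => void;
  pending: string | null = null;
  started = false;

  constructor(send: (chunk: SpeakChunk) => void) {
    this.send = send;
  }

  // Call for every piece of text the model streams back.
  push(text: string): void {
    if (text.length === 0) return;
    if (this.pending !== null) this.flush(false);
    this.pending = text;
  }

  // Call once when the model stream has finished.
  finish(): void {
    if (this.pending !== null) this.flush(true);
  }

  flush(isEnd: boolean): void {
    this.send({
      text: this.pending as string,
      is_start: !this.started,
      is_end: isEnd,
    });
    this.started = !isEnd; // reset after the closing chunk so the tagger is reusable
    this.pending = null;
  }
}
```

In an application, the `send` callback would forward each chunk to the avatar's `speak()` call in whatever form the SDK expects.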
Parameter-stream architectures shift the cost model for 3D avatars from per-minute video bandwidth to a one-time resource download plus tiny data packets, which is what makes sub-second latency feasible on a plain browser.
The SDK's state machine (idle, thinking, speaking, interrupted) mirrors the turn-taking mechanics of human conversation. Getting this right is what separates a convincing companion from a janky text-to-speech demo.
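To make the turn-taking idea concrete, here is a minimal application-side state machine. The four state names follow the list above. The event names, the transition table, and the `nextState` function are assumptions for illustration and do not describe the SDK's internal implementation.

```typescript
type TurnState = "idle" | "thinking" | "speaking" | "interrupted";
// Event names are hypothetical; they are not part of the SDK.
type TurnEvent = "user_input" | "reply_ready" | "speech_ended" | "interrupt";

// Allowed transitions; any event not listed for a state is ignored.
const TRANSITIONS: Record<TurnState, Partial<Record<TurnEvent, TurnState>>> = {
  idle: { user_input: "thinking" },
  thinking: { reply_ready: "speaking", interrupt: "interrupted" },
  speaking: { speech_ended: "idle", interrupt: "interrupted" },
  interrupted: { user_input: "thinking" },
};

// Returns the next state, or the current one if the event does not apply.
function nextState(state: TurnState, event: TurnEvent): TurnState {
  return TRANSITIONS[state][event] ?? state;
}

// Walk through one interrupted turn.
let state: TurnState = "idle";
for (const ev of ["user_input", "reply_ready", "interrupt"] as TurnEvent[]) {
  state = nextState(state, ev);
}
console.log(state); // "interrupted"
```

Keeping the table explicit makes illegal transitions easy to spot. For example, a stray `speech_ended` while idle simply leaves the state unchanged.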
SSML-triggered actions are a pragmatic bridge between pure NLP and embodied expression: the language model doesn't need to output joint angles, just semantic intent tags that the renderer already knows how to animate.
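One way to picture the intent-tag idea: the language model emits bracketed markers such as `[wave]`, and a small function turns them into SSML before the text reaches `speak()`. In the sketch below, `<speak>` is the standard SSML root element, but the `<action name="..."/>` tag, the bracket convention, and the action names are hypothetical placeholders. The SDK's actual action markup is not given in the article.

```typescript
// Hypothetical action names, loosely based on the article's "waving or dancing".
const KNOWN_ACTIONS = new Set(["wave", "dance"]);

// Escape characters that would otherwise break the XML.
function escapeXml(s: string): string {
  return s.replace(/&/g, "&amp;").replace(/</g, "&lt;").replace(/>/g, "&gt;");
}

// Convert "[name]" markers in model output into SSML action tags.
// The <action> element is an assumed placeholder, not the SDK's real tag.
function toSsml(llmText: string): string {
  // split() with a capture group alternates plain text and captured names.
  const parts = llmText.split(/\[([a-z_]+)\]/g);
  let body = "";
  parts.forEach((part, i) => {
    if (i % 2 === 0) {
      body += escapeXml(part);
    } else if (KNOWN_ACTIONS.has(part)) {
      body += `<action name="${part}"/>`;
    }
    // Unknown markers are dropped so the avatar never receives an unsupported action.
  });
  return `<speak>${body}</speak>`;
}

console.log(toSsml("Hi [wave] there & you"));
// <speak>Hi <action name="wave"/> there &amp; you</speak>
```

The allowlist is the main design choice here: the model can only express intents the renderer already knows how to animate.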
Streaming speech synthesis with interruption is a deceptively hard problem. The `is_start`/`is_end` flag pattern marks where an utterance begins and ends, so the SDK can treat chunks from multiple `speak()` calls as one continuous utterance and play them back without glitching, while `interactiveidle()` provides a clean abort path.
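Interruption also has an application-side wrinkle: chunks from the language model may keep arriving after the user has cut the avatar off. A common guard is a generation counter that invalidates in-flight chunks, sketched below. The `SpeechSession` class and its method names are hypothetical. In a real integration, `interrupt()` would also call the SDK's `interactiveidle()`, and `deliver()` would forward accepted chunks to `speak()`.

```typescript
// App-side guard against stale chunks; not part of the SDK.
class SpeechSession {
  generation = 0;
  spoken: string[] = []; // stands in for chunks actually forwarded to speak()

  // Begin a new utterance and return its token.
  start(): number {
    this.generation += 1;
    return this.generation;
  }

  // Invalidate the current utterance; later chunks carrying the old token are dropped.
  // A real handler would also call interactiveidle() on the avatar here.
  interrupt(): void {
    this.generation += 1;
  }

  // Forward a chunk only if it belongs to the utterance that is still current.
  deliver(token: number, chunk: string): boolean {
    if (token !== this.generation) return false;
    this.spoken.push(chunk);
    return true;
  }
}

const session = new SpeechSession();
const token = session.start();
session.deliver(token, "Hello, ");
session.interrupt();
session.deliver(token, "this part arrives too late.");
console.log(session.spoken); // ["Hello, "]
```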
Voice-in, voice-out closes the loop but introduces ASR latency and error. The article's decision to defer that to a separate piece is honest: speech recognition in noisy environments is its own deep engineering challenge.
Building this with Trae (an AI coding assistant) for UI iteration while hand-writing the SDK integration logic is a realistic picture of 2025 development: AI accelerates boilerplate, but wiring real-time, stateful APIs still demands human judgement.