A 500ms-Latency 3D AI Companion You Can Interrupt Mid-Sentence

By 石小石Orz

Most 3D avatar solutions push full video streams from the cloud, which burns bandwidth and limits where they can run. A parameter-stream approach that renders locally in the browser cuts cost and latency enough to make embodied AI agents practical for kiosks, in-car screens, and web apps, not just high-end demos. The SDK also handles the hard synchronization problems (lip-sync, gesture timing, interruption) that teams otherwise spend months rebuilding.

Summary

A new JS SDK from Mofa Nebula replaces high-bandwidth video streams with a parameter-stream architecture: the cloud sends lightweight motion and voice data, and the browser renders the 3D avatar locally. The result is a talking digital human with roughly 500ms end-to-end latency that runs on ordinary web devices, not only on machines with powerful GPUs.

Integration takes three steps—load the script, initialize with an App ID and Secret, and call `speak()`. The SDK handles TTS, lip-sync, facial expressions, and body movement automatically. State-control APIs (`idle()`, `interactiveidle()`, `think()`) let the avatar switch between listening, processing, and speaking, while `speak()` accepts SSML to trigger named actions like waving or dancing mid-sentence.
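
The state switching described above can be sketched as a single conversation turn. This is a minimal sketch, not the SDK's documented usage: it assumes an already-initialized `sdk` object exposing the `think()`, `speak()`, and `interactiveidle()` methods named above, and it assumes `speak()` takes `(text, is_start, is_end)` in that order, which this summary does not confirm. `runTurn`, `getReply`, and `makeFakeSdk` are hypothetical names; the fake stands in for the real avatar so the flow can be run without credentials.

```javascript
// Hypothetical stand-in for an initialized SDK instance; it only records calls.
function makeFakeSdk() {
  const calls = [];
  return {
    calls,
    think: () => calls.push(["think"]),
    interactiveidle: () => calls.push(["interactiveidle"]),
    // Assumed argument order: (text, is_start, is_end).
    speak: (text, isStart, isEnd) => calls.push(["speak", text, isStart, isEnd]),
  };
}

// One conversation turn: show "thinking" while waiting, then speak the reply.
async function runTurn(sdk, getReply, userText) {
  sdk.think(); // avatar visibly processes the request
  try {
    const reply = await getReply(userText);
    // A complete, non-streamed utterance is both the first and the last chunk.
    sdk.speak(reply, true, true);
    return reply;
  } catch (err) {
    // On failure, return to a listening-ready posture instead of staying in think().
    sdk.interactiveidle();
    throw err;
  }
}
```

With a real instance, `getReply` would be whatever function calls the language model; keeping it as a parameter means the turn logic does not depend on any particular model provider.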

Hooking this up to DeepSeek's chat API produces a fully interactive AI companion. The demo uses a non-streaming call for simplicity, but the SDK supports true streaming by passing `is_start` and `is_end` flags across multiple `speak()` invocations. An interrupt button calls `interactiveidle()` to stop the current utterance immediately, and `onVoiceStateChange` callbacks let application logic react when speech begins or ends. Browser microphone access and a speech-recognition service would close the loop for voice-in, voice-out interaction.
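
The summary does not reproduce the demo's request code, so the following is only a sketch of what a non-streaming call could look like, based on DeepSeek's OpenAI-compatible chat completions format. `askDeepSeek`, the system prompt, and the injectable `fetchImpl` parameter are illustrative choices, not taken from the article.

```javascript
// Send one non-streaming chat request and return the reply text.
// `fetchImpl` is injectable so the function can be exercised without network
// access; it defaults to the global fetch (browsers, Node 18+).
async function askDeepSeek(userText, apiKey, fetchImpl = fetch) {
  const res = await fetchImpl("https://api.deepseek.com/chat/completions", {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${apiKey}`,
    },
    body: JSON.stringify({
      model: "deepseek-chat",
      stream: false, // whole reply in one response, as in the demo
      messages: [
        // Illustrative prompt; short replies keep the spoken turn brief.
        { role: "system", content: "You are a friendly companion. Reply briefly." },
        { role: "user", content: userText },
      ],
    }),
  });
  if (!res.ok) throw new Error(`Chat request failed with status ${res.status}`);
  const data = await res.json();
  return data.choices[0].message.content;
}
```

The returned string can then be handed to `speak()`. In a production setup the URL would more likely point at your own backend, which holds the API key and forwards the request.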

Takeaways
Loading a 3D avatar and making it speak requires only a script tag, an SDK constructor with an App ID and Secret, and a call to `sdk.speak()`.
End-to-end latency is about 500ms because the cloud sends lightweight motion and voice parameters, not a video stream; the browser renders the avatar locally.
The SDK manages TTS, lip-sync, facial expressions, and body movement automatically—no separate animation pipeline is needed.
State-control methods (`idle()`, `interactiveidle()`, `think()`) let the avatar visibly switch between listening, processing, and speaking.
SSML passed to `speak()` can trigger named actions like Hello or Welcome, so the avatar gestures in sync with its speech.
Streaming AI output works by calling `speak()` repeatedly with `is_start` and `is_end` flags; the first call sets `is_start=true`, the last sets `is_end=true`.
Calling `interactiveidle()` immediately stops the current utterance, giving users a working interrupt button.
`onVoiceStateChange` fires `voice_start`/`start` and `voice_end`/`end` events so application logic can react to speech boundaries.
Browser support covers Chrome 90+, Edge 90+, Safari 15+, and Firefox 90+, making the SDK viable on web, PC clients, in-vehicle systems, and large screens.
Hardcoding API keys in the frontend works for a demo; production apps should proxy requests through a backend to avoid leaking secrets.
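
The interrupt and speech-boundary takeaways can be combined into one small controller. This is a sketch under assumptions: it supposes the `onVoiceStateChange` callback receives the state as a plain string, and it accepts both spellings listed above because the summary does not say which one the SDK emits. `createInterruptController` and its method names are hypothetical; only `interactiveidle()` comes from the article.

```javascript
// Tracks whether the avatar is speaking and exposes an interrupt action.
function createInterruptController(sdk) {
  let speaking = false;
  return {
    // Intended to be registered as the onVoiceStateChange callback.
    // Assumption: the callback is invoked with a state string.
    handleVoiceState(state) {
      if (state === "voice_start" || state === "start") speaking = true;
      if (state === "voice_end" || state === "end") speaking = false;
    },
    isSpeaking: () => speaking,
    // Intended to be wired to the interrupt button's click handler.
    interrupt() {
      if (!speaking) return false; // nothing to cut off
      sdk.interactiveidle(); // stops the current utterance, per the article
      speaking = false;
      return true;
    },
  };
}
```

Returning a boolean from `interrupt()` lets the UI decide whether to show feedback, for example disabling the button when there is nothing to stop.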

Conclusions

Parameter-stream architectures shift the cost model for 3D avatars from per-minute video bandwidth to a one-time resource download plus tiny data packets, which is what makes sub-second latency feasible on a plain browser.

The SDK's state machine—idle, thinking, speaking, interrupted—mirrors the turn-taking mechanics of human conversation. Getting this right is what separates a convincing companion from a janky text-to-speech demo.

SSML-triggered actions are a pragmatic bridge between pure NLP and embodied expression: the language model doesn't need to output joint angles, just semantic intent tags that the renderer already knows how to animate.
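
To make the idea concrete, here is a small sketch of turning reply text plus one intent tag into SSML. The `<speak>` root and the XML escaping are standard SSML practice; the action markup is not. This summary does not show the SDK's real action syntax, so `placeholderActionTag` emits an invented tag purely as a stand-in, and `buildSsml` takes the renderer as a parameter so the documented form can be swapped in.

```javascript
// Escape characters that are special in XML so model output cannot break the markup.
function escapeXml(text) {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&apos;");
}

// PLACEHOLDER markup: not the SDK's real action syntax, which this summary
// does not show. Replace with the form given in the SDK documentation.
function placeholderActionTag(name) {
  return `<action name="${escapeXml(name)}"/>`;
}

// Wrap text in an SSML <speak> root, optionally preceded by one named action.
function buildSsml(text, actionName, renderAction = placeholderActionTag) {
  const action = actionName ? renderAction(actionName) : "";
  return `<speak>${action}${escapeXml(text)}</speak>`;
}

console.log(buildSsml("Hi & welcome", "Hello"));
// prints: <speak><action name="Hello"/>Hi &amp; welcome</speak>
```

Escaping matters here because language-model output regularly contains `&` and `<`, which would otherwise produce malformed markup.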

Streaming speech synthesis with interruption is a deceptively hard problem. The `is_start`/`is_end` flag pattern lets the SDK buffer and concatenate audio chunks without glitching, while `interactiveidle()` provides a clean abort path.
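
The flag pattern is easy to get subtly wrong, because the caller only learns that a chunk was the last one after the stream has ended. One way to handle that, sketched below, is to hold each chunk back by one step. The sketch assumes `speak()` takes `(text, is_start, is_end)` positionally, which this summary does not confirm; `speakStream` is a hypothetical helper name.

```javascript
// Feed text chunks to speak() with is_start on the first call and is_end on the last.
// Holding one chunk back is what lets us know which chunk is final.
async function speakStream(sdk, chunks) {
  let pending = null; // chunk received but not yet sent
  let first = true;   // true until the first speak() call has been made
  for await (const chunk of chunks) {
    if (!chunk) continue; // skip empty deltas from the model stream
    if (pending !== null) {
      sdk.speak(pending, first, false); // a later chunk exists, so not the end
      first = false;
    }
    pending = chunk;
  }
  if (pending !== null) {
    // A single-chunk stream gets both flags set to true.
    sdk.speak(pending, first, true);
  }
}
```

`chunks` can be any sync or async iterable of strings, such as an async generator that yields sentence-sized pieces of a streaming model response. Splitting on sentence boundaries before calling this helper is a common choice, since very small fragments give a speech engine little context for prosody.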

Voice-in, voice-out closes the loop but introduces ASR latency and error. The article's decision to defer that to a separate piece is honest—speech recognition in noisy environments is its own deep engineering challenge.

Building this with Trae (an AI coding assistant) for UI iteration while hand-writing the SDK integration logic is a realistic picture of 2025 development: AI accelerates boilerplate, but wiring real-time, stateful APIs still demands human judgement.

Concepts & terms
Embodied Driving SDK
A JavaScript library that renders a 3D digital human in a browser and drives its speech, facial expressions, and body movements from lightweight cloud-sent parameters rather than a video stream.
Parameter-stream architecture
Instead of sending rendered video frames, the server transmits compact data (motion vectors, audio, expression weights) and the client renders the final image locally, reducing bandwidth and latency.
SSML (Speech Synthesis Markup Language)
An XML-based markup for TTS that can control pronunciation, pauses, and—in this SDK—trigger named avatar actions like waving or dancing during speech.
Streaming TTS with is_start / is_end flags
A pattern for feeding incremental text to a speech synthesizer: the first chunk sets `is_start=true`, intermediate chunks set both flags to false, and the final chunk sets `is_end=true`, allowing the engine to buffer and concatenate audio smoothly.
interactiveidle()
An SDK method that immediately halts the current speech and returns the avatar to a listening-ready posture, implementing user interruption.
Source: juejin.cn