A 500ms-Latency 3D AI Companion You Can Interrupt Mid-Sentence
AI Embodied Interaction: Implementing a Talking 3D Virtual Companion
Hello everyone, I'm Shi Xiaoshi~
Introduction
The vast majority of 3D virtual avatars on the market are still stuck in the stage of looping pre-made content, with a single mode of interaction and a lack of genuine understanding and feedback capabilities. Simply connecting a large language model only solves text-based Q&A and cannot achieve synchronized responses involving voice, expressions, and movements. If a complete embodied interaction system is built, one that can interface with mainstream large models for dialogue understanding and reasoning while relying on native 3D motion generation and on-device real-time multimodal rendering capabilities, the virtual avatar can evolve from a "stiff digital shell" into a true "AI companion" with perception, expression, and companionship abilities.
Imagine if, on any terminal screen—web, mobile, large display—we could summon a 3D virtual interactive partner with large model capabilities at any time. It could understand your questions, provide real-time responses, and express itself naturally through voice, expressions, and movements. Wouldn't that be fascinating?
This article will combine Trae with the Mofa Nebula SDK to quickly guide you through implementing an intelligent AI companion that can talk, be interrupted, and interact in real time.
AI Companion Effect Demonstration
Refer to the images and video below. After running the frontend service, simply enter the App ID and App Secret on the page and click the "Connect" button to load and create a 3D virtual human. Once connected, you can type questions into the chat panel on the right to interact with it.
Demo effect video link: https://www.bilibili.com/video/BV1rtjo6MEJe
Judging from the final video, this 3D virtual agent already has complete real-time interaction capabilities. After the system calls a large model for semantic understanding and content generation, it relies on a proprietary parameter-stream architecture and on-device AI rendering and solving technology: the cloud sends only lightweight motion and voice parameters, and the terminal generates the full 3D avatar locally. As a result, the virtual human can not only understand user questions and give natural, coherent answers but also express itself in sync through voice, expressions, and movements, upgrading the interaction from simple text-based Q&A to a more realistic multimodal companionship experience.
More importantly, the 3D virtual human responds quickly, with feedback latency of roughly 500ms, so the overall interaction feels smooth and natural. Building a web AI companion like this, one that understands questions, responds in real time, and expresses itself naturally, also costs very little, and it runs smoothly on ordinary web devices.
The following text will introduce how to quickly build a minimum runnable demo using Trae and the Mofa Nebula Embodied Driving SDK.
Core Technology: Mofa Nebula Embodied Driving SDK
Implementing a 3D virtual human interaction from scratch is not easy; it requires connecting a whole complex chain:
- 3D avatar loading and end-side rendering;
- Text-to-speech (TTS);
- Lip-sync, expression, and body movement synchronization;
- Stitching streamed AI replies into a continuous broadcast;
- User interruption, thinking, listening, and idle state switching;
- Low latency and multi-terminal compatibility.
Fortunately, a mature solution already exists for these complex capabilities: Mofa Nebula Embodied Driving SDK.
Mofa Nebula is an open platform for AI embodied interaction agents built by Mofa Technology. It provides end-to-end capabilities such as AI avatar generation, multimodal perception, large model agent cognition, real-time 3D embodied expression, and robot motion control. A single SDK covers three major classes of terminal (screen terminals, humanoid service robots, and AR/VR), which sets it apart from traditional high-bandwidth video streaming solutions.
With its embodied driving SDK, we can upgrade AI's expression from "text replies" to "3D multimodal interaction": based on text input, it generates voice, expressions, and movements in real time, driving the 3D digital human to complete natural expression.
Its core features include:
- Real-time 3D digital human rendering and driving: Web, large screen, in-vehicle, and various other browser terminals can load and render 3D digital humans in real time;
- Speech synthesis and lip-sync: Supports text/SSML broadcasting and automatically completes voice, lip-sync, and expression synchronization;
- Multi-state behavior control: Supports switching between idle, interactive idle, broadcasting, and other states;
- Widget component display: Supports displaying content such as subtitles, images, and videos;
- Event callbacks and log debugging: Supports custom event callbacks, facilitating business logic integration and troubleshooting.
Browser Requirements for the Embodied Driving SDK
Currently, the embodied driving SDK is available in a JS version. This means that as long as the terminal supports a browser kernel, it can integrate a 3D virtual human with AI interaction capabilities, suitable for various scenarios such as web pages, PC clients, in-vehicle systems, and large screens. The browser version requirements supported by the SDK are as follows:
Practical Tutorial: Quickly Build a 3D Virtual Human Project with Trae
Obtain App ID and App Secret
To use the embodied driving SDK to quickly create an interactive AI avatar, you first need to log in to the Mofa Nebula Console, create a driving application in the Application Center, configure the character, voice, scene, and performance style, and obtain the App ID and App Secret required for subsequent SDK integration.
Refer to the GIF below. After logging into the console, go to the 'Driving Application' tab under 'Application Management', click 'Start Creating', fill in the application name, and confirm. You will then enter the avatar selection page, where you can choose a suitable 3D digital human avatar based on your business scenario.
After selecting the avatar, you can continue to configure the scene, voice, and performance style. Once you confirm the configuration is correct, click 'Save', and the system will automatically complete the creation of the driving application.
Now, click the 'Access SDK' button to view and copy the App ID and App Secret for use in the subsequent web demo integration.
Core Code Implementation
Integrating the embodied driving SDK is simple. The overall process has three steps: load the SDK, initialize an instance, and call the speak method.
First, load the SDK on the page:
<script src="https://media.xingyun3d.com/xingyun3d/general/litesdk/[email protected]"></script>
Then create an SDK instance and complete initialization. Here, you need to replace appId and appSecret with the content copied from the Mofa Nebula Console earlier:
const sdk = new XmovAvatar({
  containerId: "#sdk", // Required: Digital human mount container
  appId: "your_appid", // Required: Application AppID
  appSecret: "your_appsecret", // Required: Application AppSecret
  gatewayServer: "https://nebula-agent.xingyun3d.com/user/v1/ttsa/session", // Required: Service interface address
  // SDK event callback, convenient for debugging
  onMessage(message) {
    console.log("SDK message:", message);
  },
});
// Initialize the SDK, load digital human resources
await sdk.init({
  onDownloadProgress(progress) {
    console.log(`Resource loading progress: ${progress}%`);
  },
});
After initialization is complete, call the speak method to make the digital human speak:
sdk.speak("Hello, Shi Xiaoshi, I am your AI companion~", true, true);
If you are just broadcasting a complete sentence, the last two parameters are usually both passed as true.
Next, we can paste the above integration logic into Trae's AI dialog box and let it quickly generate a minimal Vue 3-based demo for us.
Below is a minimal usage demo generated by Trae:
<template>
<!-- Digital human mount container -->
<div id="sdk"></div>
</template>
<script setup>
import { onMounted, onBeforeUnmount } from "vue";
// Save SDK instance
let sdk = null;
onMounted(async () => {
  // Create SDK instance
  sdk = new window.XmovAvatar({
    containerId: "#sdk", // Digital human mount container
    appId: "Your AppID", // Replace with the AppID from the console
    appSecret: "Your AppSecret", // Replace with the AppSecret from the console
    gatewayServer: "https://nebula-agent.xingyun3d.com/user/v1/ttsa/session",
    // SDK event callback, convenient for viewing running status
    onMessage(message) {
      console.log("SDK message:", message);
    },
  });
  // Initialize SDK, load digital human resources
  await sdk.init({
    onDownloadProgress(progress) {
      console.log(`Resource loading progress: ${progress}%`);
    },
  });
  // After initialization, let the digital human broadcast a sentence
  sdk.speak("Hello, I am your web AI companion, nice to meet you.", true, true);
});
onBeforeUnmount(() => {
  // Destroy SDK when the page is unmounted to release resources
  if (sdk) {
    sdk.destroy();
  }
});
</script>
<style scoped>
#sdk {
  width: 800px;
  height: 450px;
  background: #000;
  border-radius: 12px;
  overflow: hidden;
}
</style>
After starting the project, the page will render the 3D digital human and automatically broadcast a welcome message after initialization is complete.
Connecting to DeepSeek to Give the AI Companion Intelligence
Accessing DeepSeek: Getting AI Replies
The previous demo already allows the digital human to broadcast fixed text. Next, we will separately integrate DeepSeek to give the page AI reply capabilities. For now, we won't consider complex streaming output and will only use the simplest non-streaming interface: the user inputs a sentence, the frontend requests DeepSeek, and after receiving the complete reply, it hands it over to the digital human for broadcasting.
First, prepare a DeepSeek API Key, then define it in the code:
// DeepSeek API Key, written directly in the frontend for demo convenience
const DEEPSEEK_API_KEY = "Your DeepSeek API Key";
Then encapsulate a simple request method:
async function askDeepSeek(question) {
  const response = await fetch("https://api.deepseek.com/chat/completions", {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${DEEPSEEK_API_KEY}`,
    },
    body: JSON.stringify({
      model: "deepseek-chat",
      messages: [
        {
          role: "system",
          content:
            "You are a web AI companion. Please answer user questions concisely, naturally, and colloquially in Chinese.",
        },
        {
          role: "user",
          content: question,
        },
      ],
      stream: false,
    }),
  });
  // Surface HTTP errors (invalid key, rate limit, ...) instead of treating an error body as a reply
  if (!response.ok) {
    throw new Error(`DeepSeek request failed with HTTP ${response.status}`);
  }
  const data = await response.json();
  // Extract the reply content returned by DeepSeek
  return data.choices?.[0]?.message?.content || "Sorry, I haven't figured out how to answer yet.";
}
This method does only one thing: sends the user's question to DeepSeek and returns the AI's text reply.
You can first use a simple piece of code to verify if the interface is working properly:
async function testDeepSeek() {
  const answer = await askDeepSeek("Please introduce yourself in one sentence");
  console.log("DeepSeek reply:", answer);
}
If the console can print the reply normally, it means DeepSeek is connected.
Handing DeepSeek's Reply to the Digital Human for Broadcasting
Once DeepSeek is connected, we just need to pass the answer it returns to sdk.speak(), and the digital human can speak the AI reply.
The core logic is as follows:
async function askAndSpeak() {
  if (!sdk) {
    alert("Please initialize the digital human first");
    return;
  }
  const question = inputText.value.trim();
  if (!question) {
    alert("Please enter a question first");
    return;
  }
  try {
    // After the user asks, put the digital human into a thinking state
    sdk.think();
    // Request DeepSeek to get an answer
    const answer = await askDeepSeek(question);
    // Hand the AI reply to the digital human for broadcasting
    sdk.speak(answer, true, true);
  } catch (error) {
    console.error("DeepSeek request failed:", error);
    // When an error occurs, the digital human can also broadcast an error prompt
    sdk.speak("Sorry, the AI service is temporarily unavailable. Please try again later.", true, true);
  }
}
To make the demo easier to understand, the DeepSeek API Key is written directly in the frontend here. This is not recommended in actual development; formal projects should forward requests through a backend interface to avoid key leakage.
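A backend forwarder could look roughly like the sketch below. It is deliberately framework-agnostic: `createDeepSeekProxy` and `handleChat` are hypothetical names introduced here for illustration, and the returned function is meant to be called from whatever route handler your server uses. Only the DeepSeek endpoint, model name, and response shape are taken from the request code above; the status codes and the `{ answer }` response body are assumptions.

```javascript
// Hypothetical server-side helper: the API key stays on the server and never reaches the browser.
function createDeepSeekProxy({ apiKey, fetchImpl = fetch }) {
  // Returns a function that a route handler could call with the user's question.
  return async function handleChat(question) {
    if (typeof question !== "string" || !question.trim()) {
      return { status: 400, body: { error: "question is required" } };
    }
    const upstream = await fetchImpl("https://api.deepseek.com/chat/completions", {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        Authorization: `Bearer ${apiKey}`,
      },
      body: JSON.stringify({
        model: "deepseek-chat",
        messages: [{ role: "user", content: question.trim() }],
        stream: false,
      }),
    });
    if (!upstream.ok) {
      // Report an upstream failure without leaking details of the key or request
      return { status: 502, body: { error: `DeepSeek returned HTTP ${upstream.status}` } };
    }
    const data = await upstream.json();
    return { status: 200, body: { answer: data.choices?.[0]?.message?.content ?? "" } };
  };
}
```

With something like this in place, the frontend would call your own endpoint instead of api.deepseek.com, and DEEPSEEK_API_KEY would disappear from the page source.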
Advanced API: State Control, Streaming Broadcast, and Event Listening
After completing the minimal demo, if you want to upgrade it to a truly interactive AI companion, you'll need to use some advanced APIs. Here, we focus on introducing a few of the most commonly used capabilities: state switching, streaming broadcast, interruption, volume control, and event listening.
Digital Human State Switching
In real interactions, the digital human shouldn't stay in one state all the time but should switch states based on user behavior and the AI response process. For example: idle when the user isn't speaking, thinking when the user asks a question, broadcasting when the AI replies, and returning to interactive idle when the user interrupts.
Common state APIs are as follows:
// Normal idle state
sdk.idle();
// Interactive idle state, often used to interrupt the current broadcast
sdk.interactiveidle();
// Enter offline mode, no credits consumed in this state
sdk.offlineMode();
// Switch from offline mode back to online mode
sdk.onlineMode();
In the demo, the most commonly used are idle() and interactiveidle():
// After initialization, put the digital human into an idle state
await sdk.init();
sdk.idle();
// When the user clicks the "Interrupt" button, interrupt the current broadcast
function interrupt() {
  sdk.interactiveidle();
}
Among them, interactiveidle() is very suitable for handling "interruption" scenarios. For example, if the digital human is broadcasting a long text and the user wants to ask a new question, you can first call it to return the digital human to the interactive idle state before starting the next round of dialogue.
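One detail worth handling in an interruption flow: if the user interrupts while the large model request is still in flight, the reply that arrives later should not be spoken. The sketch below shows one possible way to do that with a simple turn counter. `createConversation` is a hypothetical helper, not part of the SDK; it only relies on the interactiveidle(), think(), and speak() calls already used in this article, and `askModel` stands for any async function that returns reply text, such as askDeepSeek.

```javascript
// Hypothetical helper; sdk is an initialized SDK instance,
// askModel is an async function that returns the reply text.
function createConversation(sdk, askModel) {
  let turn = 0; // incremented on every new question and every interruption

  return {
    async ask(question) {
      const myTurn = ++turn;
      sdk.interactiveidle(); // stop any broadcast still in progress
      sdk.think();
      const answer = await askModel(question);
      if (myTurn !== turn) return false; // interrupted or superseded: drop the stale reply
      sdk.speak(answer, true, true);
      return true;
    },
    interrupt() {
      turn += 1; // invalidates any request that is still pending
      sdk.interactiveidle();
    },
  };
}
```

The "Interrupt" button would then call `conversation.interrupt()` rather than `sdk.interactiveidle()` directly, so a late reply cannot restart the broadcast the user just stopped.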
Connecting to Large Model Streaming Output
If you just want to broadcast a fixed text, you can call it like this:
sdk.speak("Welcome to Mofa Nebula", true, true);
But in AI companion scenarios, large models usually return content in a streaming manner. In this case, you can call speak() multiple times, using the is_start and is_end flags to tell the SDK which segment the current text belongs to.
// First segment: is_start is true, is_end is false
sdk.speak("Hello, I am your AI companion, ", true, false);
// Middle segment: both is_start and is_end are false
sdk.speak("I can chat with you, tell stories, ", false, false);
// Last segment: is_start is false, is_end is true
sdk.speak("and also help you answer questions.", false, true);
Parameter description:
sdk.speak(ssml, is_start, is_end);
- ssml: broadcast text, which can also be SSML;
- is_start: whether this is the first segment of the current broadcast round;
- is_end: whether this is the last segment of the current broadcast round.
Note that during streaming broadcasting, it is best to accumulate a short segment of text before each speak() call, so the digital human does not stall repeatedly on fragments that are too short. Also, after a round has ended with is_end set to true, it is not recommended to start the next round immediately; perform a state switch via interactiveidle() first.
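The accumulate-then-speak advice can be sketched as two small helpers. Both names are hypothetical and neither is part of the SDK. `createSpeechStreamer` buffers incoming text, cuts it at punctuation once at least `minChars` characters are available, and holds back the latest segment so that the final call can carry is_end = true. `createSseDeltaParser` assumes the model is requested with `stream: true` and answers in the OpenAI-compatible server-sent-events format (`data: {...}` lines carrying `choices[0].delta.content`, terminated by `data: [DONE]`); verify this against the DeepSeek documentation before relying on it.

```javascript
// Hypothetical helper: groups streamed text into segments before calling speak().
// `speak` has the same (text, is_start, is_end) shape as sdk.speak.
function createSpeechStreamer(speak, minChars = 12) {
  const boundary = /[。！？；，.!?;,\n]/;
  let buffer = ""; // text received but not yet cut into a segment
  let pending = null; // latest complete segment, held back so the last one can carry is_end
  let started = false;

  function emit(text, isEnd) {
    speak(text, !started, isEnd);
    started = true;
  }

  return {
    push(delta) {
      buffer += delta;
      let cut = -1;
      for (let i = buffer.length - 1; i >= 0; i--) {
        if (boundary.test(buffer[i])) {
          cut = i + 1;
          break;
        }
      }
      if (cut < minChars) return; // too short so far: keep accumulating
      if (pending !== null) emit(pending, false);
      pending = buffer.slice(0, cut);
      buffer = buffer.slice(cut);
    },
    end() {
      const last = (pending ?? "") + buffer;
      if (last) emit(last, true);
      buffer = "";
      pending = null;
      started = false;
    },
  };
}

// Hypothetical helper: extracts text deltas from OpenAI-style SSE chunks (assumed format).
function createSseDeltaParser(onDelta) {
  let carry = ""; // incomplete trailing line from the previous network chunk
  return function feed(text) {
    carry += text;
    const lines = carry.split("\n");
    carry = lines.pop();
    for (const line of lines) {
      const trimmed = line.trim();
      if (!trimmed.startsWith("data:")) continue;
      const payload = trimmed.slice(5).trim();
      if (payload === "[DONE]") continue;
      const delta = JSON.parse(payload).choices?.[0]?.delta?.content;
      if (delta) onDelta(delta);
    }
  };
}

// Usage sketch (browser, inside an async function, request sent with stream: true):
//   const streamer = createSpeechStreamer((t, s, e) => sdk.speak(t, s, e));
//   const feed = createSseDeltaParser((delta) => streamer.push(delta));
//   const reader = response.body.getReader();
//   const decoder = new TextDecoder();
//   for (;;) {
//     const { done, value } = await reader.read();
//     if (done) break;
//     feed(decoder.decode(value, { stream: true }));
//   }
//   streamer.end();
```

Holding one segment back means a single-sentence reply is still sent as one speak(text, true, true) call, matching the non-streaming usage shown earlier.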
Using SSML to Trigger Actions
speak() can not only accept plain text but also SSML. Through SSML, the digital human can perform specified actions during broadcasting, such as welcoming, waving, dancing, etc.
For example, to trigger a Hello action when the digital human says a welcome message:
const ssml = `
<speak>
<ue4event>
<type>ka</type>
<data><action_semantic>Hello</action_semantic></data>
</ue4event>
Welcome to the Nebula Embodied 3D Digital Human Platform, nice to meet you.
</speak>
`;
sdk.speak(ssml, true, true);
If you need to trigger actions based on semantics, you can also use ka_intent:
const ssml = `
<speak>
Warmly
<ue4event>
<type>ka_intent</type>
<data><ka_intent>Welcome</ka_intent></data>
</ue4event>
welcome everyone to today's sharing session.
</speak>
`;
sdk.speak(ssml, true, true);
This type of capability is very suitable for scenarios like virtual hosts, welcome pages, and guided tours, allowing the digital human not just to "speak" but to perform actions in coordination with semantics.
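When the broadcast text comes from a large model or from user input, hand-writing the SSML string gets error-prone. A small builder, sketched below, can produce the same `ka` structure shown above. `buildActionSsml` and `escapeXml` are hypothetical helpers, and the escaping step assumes the SDK parses the payload as XML and decodes standard entities; that is an assumption to verify, not documented behavior.

```javascript
// Escape characters that are special in XML so arbitrary text cannot break the markup.
function escapeXml(text) {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;");
}

// Hypothetical helper: builds the <speak> payload for a "ka" action followed by the text.
function buildActionSsml(text, actionSemantic) {
  return (
    "<speak>" +
    "<ue4event><type>ka</type>" +
    `<data><action_semantic>${escapeXml(actionSemantic)}</action_semantic></data>` +
    "</ue4event>" +
    escapeXml(text) +
    "</speak>"
  );
}
```

For example, `sdk.speak(buildActionSsml("Nice to meet you.", "Hello"), true, true)` would send the same kind of payload as the first SSML example.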
Listening to Broadcast Status
In real projects, we often need to know when the digital human starts speaking and when it finishes. You can listen to the audio playback status through onVoiceStateChange.
const sdk = new XmovAvatar({
  containerId: "#sdk",
  appId: "your_appid",
  appSecret: "your_appsecret",
  gatewayServer: "https://nebula-agent.xingyun3d.com/user/v1/ttsa/session",
  // Listen to digital human broadcast status
  onVoiceStateChange(status) {
    console.log("Digital human voice status:", status);
    // Start speaking
    if (status === "voice_start" || status === "start") {
      console.log("Digital human started broadcasting");
    }
    // Broadcast ended
    if (status === "voice_end" || status === "end") {
      console.log("Digital human broadcast ended");
    }
  },
});
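A common use of this callback is to keep a single "is speaking" flag, for example to enable the Interrupt button only while the digital human is talking. The sketch below is a hypothetical helper, not an SDK feature; the status strings are simply the ones checked in the callback above, and other values are ignored.

```javascript
// Hypothetical helper: turns raw voice status strings into a boolean "speaking" flag.
function createVoiceStateTracker(onChange) {
  const START = new Set(["voice_start", "start"]);
  const END = new Set(["voice_end", "end"]);
  let speaking = false;

  return {
    handle(status) {
      let next = speaking; // unknown statuses leave the flag unchanged
      if (START.has(status)) next = true;
      else if (END.has(status)) next = false;
      if (next !== speaking) {
        speaking = next;
        onChange(speaking); // notify only on real transitions
      }
    },
    get speaking() {
      return speaking;
    },
  };
}
```

It would be wired in as `onVoiceStateChange(status) { tracker.handle(status); }`, with `onChange` updating whatever UI state the page keeps.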
Listening to SDK Messages and Errors
onMessage is a very important callback during debugging. SDK error messages and running messages can be output through it.
const sdk = new XmovAvatar({
  containerId: "#sdk",
  appId: "your_appid",
  appSecret: "your_appsecret",
  gatewayServer: "https://nebula-agent.xingyun3d.com/user/v1/ttsa/session",
  onMessage(message) {
    console.log("SDK message:", message);
    // message usually contains fields like code, message, timestamp
    if (message.code) {
      console.warn("SDK error code:", message.code);
      console.warn("SDK error message:", message.message);
    }
  },
});
Common errors can be simply understood as a few categories:
- 10001: Container does not exist, usually because containerId is written incorrectly or the DOM hasn't rendered yet;
- 10002: Socket connection exception, possibly a network or service connection issue;
- 10003: Session creation failed; check appId, appSecret, and the application configuration first;
- 30001: Background image loading failed;
- 40001: Audio decoding failed;
- 50001 / 50002: Offline/online state change;
- 50003 / 50004: Network retry or network disconnected.
During the development phase, it is recommended to always keep onMessage logging, as this will make troubleshooting much faster.
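To make those logs easier to read, the code categories listed above can be kept in a small lookup table. The sketch below is a hypothetical helper: the codes and their meanings are copied from the list in this section, while the function name and the output format are made up for illustration.

```javascript
// Code-to-hint table, copied from the categories listed in this article.
const SDK_CODE_HINTS = {
  10001: "Container not found: check containerId and that the DOM has rendered",
  10002: "Socket connection exception: check network or service connectivity",
  10003: "Session creation failed: check appId, appSecret, and application configuration",
  30001: "Background image failed to load",
  40001: "Audio decoding failed",
  50001: "Offline/online state change",
  50002: "Offline/online state change",
  50003: "Network retry or network disconnected",
  50004: "Network retry or network disconnected",
};

// Hypothetical helper: returns a readable line for an SDK message, or null if it has no code.
function describeSdkMessage(message) {
  if (!message || !message.code) return null;
  const hint = SDK_CODE_HINTS[message.code] ?? "Unknown code";
  const detail = message.message ? ` (${message.message})` : "";
  return `[${message.code}] ${hint}${detail}`;
}
```

Inside onMessage, something like `const line = describeSdkMessage(message); if (line) console.warn(line);` would then replace the two separate console.warn calls.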
Volume and Debug Information
Finally, there are a few more practical methods for development and debugging.
Volume control:
// Mute
sdk.setVolume(0);
// Half volume
sdk.setVolume(0.5);
// Maximum volume
sdk.setVolume(1);
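If the volume comes from a UI slider, it is usually expressed as 0 to 100 and may arrive as a string. The hypothetical helper below maps such a value onto the 0 to 1 range used in the calls above and clamps anything out of range; it assumes setVolume accepts any number between 0 and 1, as the three examples suggest.

```javascript
// Hypothetical helper: maps a 0-100 slider value onto the 0-1 range passed to setVolume.
function applyVolumePercent(sdk, percent) {
  const value = Number(percent);
  // Non-numeric input falls back to mute; out-of-range input is clamped
  const clamped = Number.isFinite(value) ? Math.min(100, Math.max(0, value)) : 0;
  const volume = clamped / 100;
  sdk.setVolume(volume);
  return volume;
}
```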
Show or hide debug information:
// Show debug info
sdk.showDebugInfo();
// Hide debug info
sdk.hideDebugInfo();
During the joint debugging phase, you can temporarily turn on debug information; turn it off before going live to keep the page clean.
Using Trae to Refine the Interactive Experience
At this point, our demo already has basic capabilities: the page can load a 3D virtual human, get AI replies through DeepSeek, and then hand them over to the digital human for broadcasting. Next, I continued to combine the common APIs of the embodied driving SDK and let Trae help me with multiple rounds of code iteration to gradually refine the page's interactive experience.
This step mainly added several capabilities:
- Support for entering App ID and App Secret on the page, eliminating the need to manually modify the code;
- Added connection status display to easily determine if the SDK initialization was successful;
- Added a chat panel for conversing with the AI virtual companion;
- When requesting DeepSeek, put the digital human into a "thinking" state;
- After the AI replies, call the speak method to drive the digital human to broadcast;
- Added an "interrupt" capability, allowing the user to stop the current broadcast.
After multiple rounds of dialogue, debugging, and fixing with Trae, the final interactive effect interface is as follows:
In actual use, you only need to enter the App ID and App Secret at the top of the page and click "Connect" to complete initialization. After a successful connection, enter a question in the input box, and the virtual assistant will call DeepSeek to get an answer and perform voice broadcasting through the 3D digital human avatar.
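The "Connect" step could be implemented roughly as follows. This is a sketch rather than the code Trae generated: `connectAvatar` and the status strings are hypothetical, the constructor is passed in as a parameter (in the page it would be `window.XmovAvatar`), and the catch branch assumes that init() rejects when initialization fails, which should be checked against the SDK's behavior.

```javascript
const GATEWAY_SERVER = "https://nebula-agent.xingyun3d.com/user/v1/ttsa/session";

// Hypothetical helper: validates the form input, creates the SDK instance, and reports status.
async function connectAvatar(AvatarCtor, { appId, appSecret }, onStatus) {
  const id = (appId || "").trim();
  const secret = (appSecret || "").trim();
  if (!id || !secret) {
    onStatus("Please enter both App ID and App Secret");
    return null;
  }
  onStatus("Connecting...");
  const sdk = new AvatarCtor({
    containerId: "#sdk",
    appId: id,
    appSecret: secret,
    gatewayServer: GATEWAY_SERVER,
    onMessage(message) {
      console.log("SDK message:", message);
    },
  });
  try {
    await sdk.init({
      onDownloadProgress(progress) {
        onStatus(`Loading resources: ${progress}%`);
      },
    });
    onStatus("Connected");
    return sdk;
  } catch (error) {
    // Assumption: init() rejects on failure
    onStatus("Connection failed");
    return null;
  }
}
```

The click handler would then be along the lines of `sdk = await connectAvatar(window.XmovAvatar, { appId, appSecret }, setStatus)`, where `setStatus` updates the connection status display.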
For specific interaction, refer to the video below:
Demo effect video link: https://www.bilibili.com/video/BV1rtjo6MEJe
As you can see, the virtual human's response has almost no delay, and its expressions and movements are relatively natural. If you are not satisfied with the current avatar, you can create a new one in the Mofa Nebula Console and replace the App ID and App Secret.
Adding Voice Dialogue Functionality
In the example above, we have completed basic text input, AI reply, and digital human broadcasting capabilities. However, in many real-world scenarios, such as in-vehicle large screens, mall guide screens, and intelligent customer service terminals, voice dialogue is often the more natural mode of interaction.
Implementing voice dialogue is actually not complicated: the frontend requests microphone permission from the browser and captures the user's voice in real time, then converts the voice to text through a speech recognition service; the recognition result is sent to the large model to get a reply, which is finally handed to the Mofa Nebula SDK to drive the 3D virtual human to broadcast. This forms a complete voice interaction loop: microphone capture, speech recognition, large model reply, digital human broadcast.
Due to space limitations, this article does not expand on the specific implementation of speech recognition and real-time audio capture. I will write a separate article later to break down this part.
If you are interested in 3D virtual humans, AI companions, embodied intelligent interaction, voice dialogue, and related areas, you can follow my column; I will continue to share more related practices in the future.
Summary
As seen in this article, the cost of implementing a talking, interactive 3D virtual character using the Mofa Nebula Embodied Driving SDK is very low. Its support for browser and Android SDKs gives it strong terminal compatibility, allowing various screens and applications such as web, mobile, large screens, and in-vehicle systems to have the opportunity to integrate AI embodied interaction agents.
This also means that AI is no longer confined to text dialogues but can achieve more natural expression through 3D avatars, voice, expressions, and movements, bringing an interactive experience closer to real companionship.
Of course, this article only presents a minimal runnable demo, primarily used to verify the complete chain from 3D digital human loading and voice broadcasting to large model replies. In the future, capabilities such as streaming replies, voice input, custom subtitles, and motion control can be further expanded to make this AI companion more natural, intelligent, and user-friendly.
The demo in this article is open source; comment anything to get it.