
CyberVerse: A One-Person Open-Source Framework for Real-Time AI Video Chat Agents

By 稀土熊猫君 · Jul 1, 2026

Read original on juejin.cn

Real-time AI avatars have been stuck behind multi-GPU server requirements and fragmented toolchains. CyberVerse collapses the stack into a single consumer-GPU application with a plugin model, making photorealistic interactive agents accessible to individual developers and small teams who previously could only experiment with offline demos.

Summary

Built solo over three months, CyberVerse integrates local models like FlashHead and LiveAct alongside commercial APIs from Baidu and iFlytek to drive photorealistic digital humans. The system runs full-duplex, end-to-end video calls on a consumer RTX 5090, handling WebRTC streaming, audio-video sync, and seamless transitions between idle and speaking states. A plugin architecture decouples the digital human base, TTS, ASR, and LLM so users can swap components freely.
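The decoupling described above can be pictured as a set of small interfaces that a pipeline composes. What follows is only a minimal sketch, not CyberVerse's actual API: every class and method name (`ASR`, `LLM`, `TTS`, `Avatar`, `Pipeline`, and the toy implementations) is hypothetical and chosen for illustration.

```python
from dataclasses import dataclass
from typing import Protocol


# Hypothetical interfaces: one per swappable component named in the summary.
class ASR(Protocol):
    def transcribe(self, audio: bytes) -> str: ...


class LLM(Protocol):
    def reply(self, text: str) -> str: ...


class TTS(Protocol):
    def synthesize(self, text: str) -> bytes: ...


class Avatar(Protocol):
    def render(self, speech: bytes) -> list[bytes]: ...


# Toy stand-ins; a real plugin would wrap a local model or a vendor API.
class EchoASR:
    def transcribe(self, audio: bytes) -> str:
        return audio.decode("utf-8")


class UpperLLM:
    def reply(self, text: str) -> str:
        return text.upper()


class BytesTTS:
    def synthesize(self, text: str) -> bytes:
        return text.encode("utf-8")


class FrameAvatar:
    def render(self, speech: bytes) -> list[bytes]:
        # One fake "frame" per byte of synthesized speech.
        return [bytes([b]) for b in speech]


class SilentAvatar:
    """Voice-only mode: produces no video frames."""

    def render(self, speech: bytes) -> list[bytes]:
        return []


@dataclass
class Pipeline:
    asr: ASR
    llm: LLM
    tts: TTS
    avatar: Avatar

    def turn(self, audio_in: bytes) -> tuple[bytes, list[bytes]]:
        # One conversational turn: hear, think, speak, render.
        text = self.asr.transcribe(audio_in)
        answer = self.llm.reply(text)
        speech = self.tts.synthesize(answer)
        return speech, self.avatar.render(speech)


pipeline = Pipeline(EchoASR(), UpperLLM(), BytesTTS(), FrameAvatar())
speech, frames = pipeline.turn(b"hi")

# Swapping one component leaves the rest of the pipeline untouched.
voice_only = Pipeline(EchoASR(), UpperLLM(), BytesTTS(), SilentAvatar())
```

Because the pipeline depends only on the four interfaces, replacing the avatar with a silent one is enough to model the voice-only option, and the same idea would apply to swapping a TTS or ASR provider.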

Beyond conversation, a two-layer agent design (main Agent plus SubAgent built on pi Agent) lets the avatar execute tasks rather than just chat. A memory module gives each character persistent personality and context. The workspace now bundles character creation from a reference image, persona editing, offline talking-head video generation, and real-time calls into a single interface, with the option to disable the visual avatar and use it as a voice-only agent.

Takeaways

— CyberVerse reached 1,300 GitHub stars in two months as a solo-maintained project with only two community PRs.

— Real-time video calls run on an RTX 5090 using the 1.3B-parameter FlashHead model, avoiding the five H200 GPUs required by the earlier FlashTalk model.

— The framework supports four digital-human backends: local models FlashHead and LiveAct, plus Baidu Xiling and iFlytek commercial APIs.

— Model-provider integration covers OpenAI, Qwen, and Doubao, which can be plugged in as the avatar's speech recognition, reasoning, and voice synthesis layers.

— A main Agent / SubAgent split handles conversation flow and complex task execution respectively, with pi Agent powering the SubAgent.

— Offline video generation supports both text-driven and audio-driven pipelines at higher quality than real-time mode.

— Character memory and persona editing give each avatar persistent identity across sessions, with the option to run voice-only without a visual avatar.

Conclusions

Consumer-grade real-time digital humans crossed a threshold when FlashHead shrank the model to 1.3B parameters, but the integration work (WebRTC, A/V sync, agent coordination) remains the harder engineering problem.

Plugin architectures for TTS, ASR, and LLM components are becoming table stakes for avatar frameworks; the differentiator is how cleanly the real-time pipeline handles state transitions between idle and active modes.
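One way to reason about the idle and speaking transitions is as a small state machine driven by audio events. The sketch below is an assumption about how such logic could look, not the project's implementation; `AvatarState` and its event methods are hypothetical names.

```python
from enum import Enum


class State(Enum):
    IDLE = "idle"
    SPEAKING = "speaking"


class AvatarState:
    """Tracks whether the avatar should show idle or speech-driven frames."""

    def __init__(self) -> None:
        self.state = State.IDLE
        self.pending_chunks = 0  # synthesized audio chunks not yet rendered

    def on_audio_chunk(self) -> None:
        # New speech audio arrived: switch to (or stay in) SPEAKING.
        self.pending_chunks += 1
        self.state = State.SPEAKING

    def on_chunk_rendered(self) -> None:
        # Fall back to IDLE only once every queued chunk has been rendered.
        if self.pending_chunks > 0:
            self.pending_chunks -= 1
        if self.pending_chunks == 0:
            self.state = State.IDLE

    def on_interrupt(self) -> None:
        # The user barged in: drop queued speech and return to IDLE.
        self.pending_chunks = 0
        self.state = State.IDLE


avatar = AvatarState()
trace = [avatar.state]
avatar.on_audio_chunk()
avatar.on_audio_chunk()
trace.append(avatar.state)
avatar.on_chunk_rendered()
trace.append(avatar.state)
avatar.on_chunk_rendered()
trace.append(avatar.state)
avatar.on_audio_chunk()
avatar.on_interrupt()
trace.append(avatar.state)
```

Counting pending chunks keeps the avatar in the speaking state until the last chunk is rendered, which is one simple way to avoid flickering back to idle between chunks of the same utterance.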

Solo open-source maintenance at this complexity level is sustainable partly because the author treats the project as a personal long-term practice rather than chasing community growth metrics.

Offline generation and real-time calling in one workspace points to a convergence pattern: users want to author and interact with the same character through different modalities without switching tools.

The two-layer agent design (main for conversation, SubAgent for tasks) mirrors patterns emerging in voice-assistant architectures, suggesting digital-human frameworks and voice-agent frameworks are on a collision course.

Concepts & terms

Full-duplex real-time communication

A communication mode where audio and video streams flow in both directions simultaneously, without the turn-taking delays of half-duplex systems. Essential for natural-feeling video calls where interruption and overlap feel human.
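As a toy illustration of the definition, the sketch below runs an uplink and a downlink concurrently with `asyncio`, so neither direction waits for the other to finish. The function names and the string "chunks" are hypothetical; a real system would move encoded media over a transport such as WebRTC.

```python
import asyncio


async def uplink(mic_chunks: list[str], log: list[str]) -> None:
    # User-to-agent direction.
    for chunk in mic_chunks:
        log.append(f"up:{chunk}")
        await asyncio.sleep(0)  # yield so the other direction can run


async def downlink(video_frames: list[str], log: list[str]) -> None:
    # Agent-to-user direction.
    for frame in video_frames:
        log.append(f"down:{frame}")
        await asyncio.sleep(0)


async def call() -> list[str]:
    log: list[str] = []
    # Both directions are active at once instead of taking turns.
    await asyncio.gather(
        uplink(["a1", "a2"], log),
        downlink(["v1", "v2"], log),
    )
    return log


log = asyncio.run(call())
```

In the resulting log the two directions are interleaved rather than grouped, which is the property that lets a user interrupt while the avatar is still speaking.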

FlashHead

A 1.3B-parameter audio-driven digital human model from Soul-AILab that generates talking-head video from audio input. Compared with FlashTalk, it gives up some visual fidelity in exchange for running inference on a single consumer GPU such as the RTX 5090.

Main Agent / SubAgent architecture

A two-tier agent design where a primary agent handles direct user interaction and conversation routing, while subordinate agents execute longer-running or more complex tasks. This separation prevents task execution from blocking conversational responsiveness.
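The non-blocking behavior described here can be sketched with `asyncio`: the main agent hands a task to a background coroutine and replies immediately. This is an illustrative assumption, not how CyberVerse or pi Agent are actually wired; `MainAgent`, `sub_agent`, and the "please " routing rule are hypothetical.

```python
import asyncio


async def sub_agent(task: str) -> str:
    """Hypothetical long-running task executor."""
    await asyncio.sleep(0.05)  # stands in for multi-step tool use
    return f"done: {task}"


class MainAgent:
    def __init__(self) -> None:
        self.jobs: list[asyncio.Task] = []

    async def handle(self, utterance: str) -> str:
        # Toy routing rule: requests starting with "please " become tasks.
        if utterance.startswith("please "):
            task = utterance[len("please "):]
            # Run the task in the background; do not await it here.
            self.jobs.append(asyncio.create_task(sub_agent(task)))
            return f"On it: {task}"
        return f"You said: {utterance}"

    async def collect(self) -> list[str]:
        # Wait for all delegated tasks and return their results.
        return list(await asyncio.gather(*self.jobs))


async def demo() -> tuple[list[str], list[bool], list[str]]:
    agent = MainAgent()
    replies = [
        await agent.handle("please book a table"),
        await agent.handle("hello"),  # answered while the task still runs
    ]
    finished_at_reply_time = [job.done() for job in agent.jobs]
    results = await agent.collect()
    return replies, finished_at_reply_time, results


replies, finished_at_reply_time, results = asyncio.run(demo())
```

The second reply is produced before the delegated task has finished, which is the responsiveness benefit the two-tier split is meant to provide.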

pi Agent

A lightweight, extensible agent framework used as the SubAgent core in CyberVerse. Valued for its simplicity and modularity, it handles multi-step task execution behind the digital human's conversational interface.

WebRTC

A set of browser APIs and protocols enabling real-time peer-to-peer audio, video, and data streaming without plugins. CyberVerse uses it to establish low-latency video call connections between the user and the digital human.

From the discussion

The conversation centers on hardware requirements. Enthusiasm for the framework's potential is tempered by the steep RTX 5090 requirement. A question about whether the 1.3B model could run on a laptop RTX 5060 gets a direct rebuttal: the 5060 lacks sufficient VRAM, and video models demand more than typical LLMs.

— The framework's concept has strong potential for derivative products.

— The RTX 5090 requirement is a high barrier to entry.

— An RTX 5060 laptop GPU cannot run the framework due to insufficient VRAM.

— Video models have fundamentally different and higher hardware demands compared to LLMs.
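A back-of-envelope calculation helps show why parameter count alone is a poor guide here. The sketch below estimates memory for the weights only; the precisions (2 bytes for fp16, 4 bytes for fp32) are assumptions, since the source does not say how FlashHead is loaded, and the function name `weight_gib` is made up for this example.

```python
def weight_gib(params: float, bytes_per_param: int) -> float:
    """Memory needed for model weights alone, in GiB."""
    return params * bytes_per_param / 2**30


# 1.3B parameters, as stated for FlashHead; precisions are assumed.
fp16 = weight_gib(1.3e9, 2)
fp32 = weight_gib(1.3e9, 4)

print(f"fp16 weights: {fp16:.2f} GiB")
print(f"fp32 weights: {fp32:.2f} GiB")
```

The weights alone come to only a few GiB, which likely explains the commenter's intuition. The source does not break down the rest of the memory budget, but inference also needs room for activations and for any other models in the pipeline, which is consistent with the author's reply that video models differ from LLMs.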

Featured comments

羽晞 1 like

Bro, impressive. This approach feels like it could spawn a lot of products to play with, but the RTX 5090 requirement is a bit steep. A 1.3B model should be able to run on a laptop 5060, right?

稀土熊猫君

Yes, the compute requirements are still a bit high right now. The 5060 doesn't have enough VRAM to run it. Video models are still quite different from LLMs.


Source: juejin.cn