CyberVerse: A One-Person Open-Source Framework for Real-Time AI Video Chat Agents
Real-time AI avatars have been stuck behind multi-GPU server requirements and fragmented toolchains. CyberVerse collapses the stack into a single consumer-GPU application with a plugin model, making photorealistic interactive agents accessible to individual developers and small teams who previously could only experiment with offline demos.
Built solo over three months, CyberVerse integrates local models like FlashHead and LiveAct alongside commercial APIs from Baidu and iFlytek to drive photorealistic digital humans. The system runs full-duplex, end-to-end video calls on a consumer RTX 5090, handling WebRTC streaming, audio-video sync, and seamless transitions between idle and speaking states. A plugin architecture decouples the digital human base, TTS, ASR, and LLM so users can swap components freely.
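To make the plugin idea concrete, here is a minimal Python sketch of how the four components could sit behind small interfaces so that each one can be swapped without touching the others. This is not CyberVerse's actual API: every class and method name below (ASR, LLM, TTS, AvatarBase, Pipeline, and the stub plugins) is hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Protocol


# Hypothetical plugin interfaces; the real project may define these differently.
class ASR(Protocol):
    def transcribe(self, audio: bytes) -> str: ...


class LLM(Protocol):
    def reply(self, text: str) -> str: ...


class TTS(Protocol):
    def synthesize(self, text: str) -> bytes: ...


class AvatarBase(Protocol):
    def render(self, speech: bytes) -> list[bytes]: ...


@dataclass
class Pipeline:
    """Wires one plugin of each kind into a single conversational turn."""
    asr: ASR
    llm: LLM
    tts: TTS
    avatar: Optional[AvatarBase] = None  # assumption: None means no video output

    def turn(self, audio_in: bytes) -> tuple[str, bytes, list[bytes]]:
        text = self.asr.transcribe(audio_in)
        answer = self.llm.reply(text)
        speech = self.tts.synthesize(answer)
        frames = self.avatar.render(speech) if self.avatar else []
        return answer, speech, frames


# Stub plugins standing in for local models or commercial APIs.
class EchoASR:
    def transcribe(self, audio: bytes) -> str:
        return audio.decode("utf-8")


class UpperLLM:
    def reply(self, text: str) -> str:
        return text.upper()


class BytesTTS:
    def synthesize(self, text: str) -> bytes:
        return text.encode("utf-8")


class OneFramePerByteAvatar:
    def render(self, speech: bytes) -> list[bytes]:
        return [bytes([b]) for b in speech]


pipeline = Pipeline(EchoASR(), UpperLLM(), BytesTTS(), OneFramePerByteAvatar())
answer, speech, frames = pipeline.turn(b"hi")
```

Because Pipeline depends only on the method signatures, replacing BytesTTS with a wrapper around a different speech service would not require changes to the ASR, LLM, or avatar plugins.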
Beyond conversation, a two-layer agent design (main Agent plus SubAgent built on pi Agent) lets the avatar execute tasks rather than just chat. A memory module gives each character persistent personality and context. The workspace now bundles character creation from a reference image, persona editing, offline talking-head video generation, and real-time calls into a single interface, with the option to disable the visual avatar and use it as a voice-only agent.
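The two-layer design and the memory module can be sketched roughly as follows. This is an illustrative Python sketch under stated assumptions, not the project's implementation: the "/task " prefix used for routing, the Memory fields, and the SubAgent.run method are all hypothetical, and the SubAgent here is a stub that does not use pi Agent.

```python
from dataclasses import dataclass, field


@dataclass
class Memory:
    """Hypothetical per-character memory: a persona plus the turn history."""
    persona: str
    history: list[tuple[str, str]] = field(default_factory=list)

    def remember(self, role: str, text: str) -> None:
        self.history.append((role, text))


class SubAgent:
    """Stub task executor; stands in for the SubAgent layer."""
    def run(self, task: str) -> str:
        return f"done: {task}"


@dataclass
class MainAgent:
    """Handles conversation and delegates task requests to the SubAgent."""
    memory: Memory
    sub: SubAgent

    def handle(self, user_text: str) -> str:
        self.memory.remember("user", user_text)
        prefix = "/task "  # assumption: a made-up convention for task requests
        if user_text.startswith(prefix):
            result = self.sub.run(user_text[len(prefix):])
            reply = f"[{self.memory.persona}] {result}"
        else:
            reply = f"[{self.memory.persona}] you said: {user_text}"
        self.memory.remember("agent", reply)
        return reply


agent = MainAgent(Memory(persona="Ava"), SubAgent())
chat_reply = agent.handle("hello")
task_reply = agent.handle("/task book a table")
```

In a real system the routing decision would more plausibly come from the LLM itself than from a string prefix; the prefix only keeps the sketch self-contained.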
Consumer-grade real-time digital humans crossed a threshold when FlashHead shrank the model to 1.3B parameters, but the integration work (WebRTC, A/V sync, agent coordination) remains the harder engineering problem.
Plugin architectures for TTS, ASR, and LLM components are becoming table stakes for avatar frameworks; the differentiator is how cleanly the real-time pipeline handles state transitions between idle and active modes.
Solo open-source maintenance at this complexity level is sustainable partly because the author treats the project as a long-term personal practice rather than chasing community growth metrics.
Combining offline generation and real-time calling in one workspace points to a convergence pattern: users want to author and interact with the same character through different modalities without switching tools.
The two-layer agent design (main for conversation, SubAgent for tasks) mirrors patterns emerging in voice-assistant architectures, suggesting digital-human frameworks and voice-agent frameworks are on a collision course.
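The idle-to-speaking transition mentioned above can be illustrated with a small state machine. The sketch below is one possible approach, not how CyberVerse handles it: the AvatarStateMachine class and the hangover_frames parameter are hypothetical. The idea is to stay in the speaking state through a few silent frames, so brief pauses in the TTS audio do not cause the avatar to flicker back to its idle animation.

```python
from enum import Enum, auto


class AvatarState(Enum):
    IDLE = auto()
    SPEAKING = auto()


class AvatarStateMachine:
    """Switches between idle and speaking, with a short hold after speech ends."""

    def __init__(self, hangover_frames: int = 3) -> None:
        self.state = AvatarState.IDLE
        self.hangover_frames = hangover_frames  # assumption: tuning knob
        self._silent = 0  # consecutive silent frames seen while speaking

    def step(self, has_speech_audio: bool) -> AvatarState:
        if has_speech_audio:
            # Any speech audio immediately (re)enters the speaking state.
            self.state = AvatarState.SPEAKING
            self._silent = 0
        elif self.state is AvatarState.SPEAKING:
            # Hold the speaking state until enough silent frames accumulate.
            self._silent += 1
            if self._silent >= self.hangover_frames:
                self.state = AvatarState.IDLE
                self._silent = 0
        return self.state


machine = AvatarStateMachine(hangover_frames=2)
trace = [machine.step(x) for x in (False, True, False, True, False, False)]
```

With hangover_frames set to 2, a single silent frame between two speech frames keeps the avatar in the speaking state, and it returns to idle only after two consecutive silent frames.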
The conversation centers on hardware requirements. Enthusiasm for the framework's potential is tempered by the steep RTX 5090 recommendation. A question about running a smaller 1.3B model on a laptop RTX 5060 gets a direct rebuttal: the 5060 lacks sufficient VRAM, and video models demand more than typical LLMs.
Bro, impressive. This approach feels like it could spawn a lot of products to play with, but the RTX 5090 requirement is a bit steep. A 1.3B model should be able to run on a laptop 5060, right?
Yes, the compute requirements are still a bit high right now. The 5060 doesn't have enough VRAM to run it. Video models are still quite different from LLMs.
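For context on the exchange above, a back-of-the-envelope calculation shows how much memory the weights of a 1.3B-parameter model occupy on their own. The function name and the precisions chosen are assumptions for illustration; the estimate covers weights only and says nothing about activations, intermediate video frames, or the other models in the pipeline, which is where the reply suggests video models differ from LLMs.

```python
def weight_memory_gb(num_params: float, bytes_per_param: int) -> float:
    """Memory for the model weights alone, in decimal gigabytes."""
    return num_params * bytes_per_param / 1e9


PARAMS = 1.3e9  # the 1.3B parameter count cited in the discussion

fp16_gb = weight_memory_gb(PARAMS, 2)  # 16-bit weights: 2.6 GB
fp32_gb = weight_memory_gb(PARAMS, 4)  # 32-bit weights: 5.2 GB
print(f"fp16: {fp16_gb:.1f} GB, fp32: {fp32_gb:.1f} GB")
```

The weights alone are modest at 16-bit precision, so by this estimate the parameter count is not the whole story; the reply attributes the gap to how video models differ from LLMs.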