CyberVerse: A One-Person Open-Source Framework for Real-Time AI Video Chat Agents

A couple of days ago, I came across a very interesting question on Zhihu: "What kind of open-source project can one person build?"

This question resonated with me because I started an open-source project of my own this year (not an awesome-xxx list or a skill project), and I felt firsthand how hard it is for one person to build one. Even with AI assistance, a solo developer still has many obstacles to overcome. The project has been open source for over two months and has reached 1.3K stars on GitHub. If you're planning to do open source too, you might want to hear my story.

To sum up the project in one sentence: I built an open-source real-time digital human Agent framework. With just one photo, you can generate a digital human that can video chat with you.

I know many people might be a bit put off by the term "digital human." But the kind of digital human I'm talking about here might be a little different from what you imagine.

Origin: Why I wanted to build a real-time digital human

Early 2026 saw an explosion in AI video generation. At the time, I casually used Xiaoyunque to generate a video of "Tifa," and the result was stunning. It made me wonder: what if one day I could break the fourth wall and have a video call with Tifa? She could understand the world I live in, and I could listen to her talk about more than just Final Fantasy lore. I even posted about this on my Moments.

Getting Started

The turning point came in February, when I stumbled on an open-source digital human model called FlashTalk. It is an audio-driven model, and its main attraction was that it produced better results than mainstream digital human models while still being capable of real-time inference. That came at a cost: real-time inference required 5 H200 GPUs. As it happened, a friend could lend me H200 GPUs at the time, so I spent a while studying the model and gradually realized that my wish might actually come true.

Eventually, I had to return my friend's GPUs. Just as I was stuck without GPU access, the open-source community released another new model, FlashHead (also from the FlashTalk team). It is a 1.3B model, and this time it didn't require professional-grade GPUs; an RTX 5090 could run it. The quality couldn't match FlashTalk's, but I could finally afford to play with it.

So I started building an application on top of FlashHead. The core feature of this project is full-duplex, end-to-end real-time video calling. Around that core, I added conveniences such as a nice-looking UI, character management, and character definition. I also adopted a modular design in which the underlying digital human model, TTS, ASR, and LLM are all implemented as plugins, making it easy for users to customize a digital human avatar. Then came the memory module, which addresses a different question: how do you make a customized character more vivid, more personalized, and more lifelike?
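To make the plugin idea concrete, here is a minimal sketch of how swappable ASR, LLM, TTS, and avatar components could be wired together. It is not CyberVerse's actual code: the registry, the `register`/`create` helpers, the `run_turn` function, and the toy plugins are all hypothetical names chosen for illustration.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical registry: (kind, name) -> factory that builds a plugin.
# The kinds used here are "asr", "llm", "tts", and "avatar".
_REGISTRY: Dict[Tuple[str, str], Callable[[], object]] = {}


def register(kind: str, name: str):
    """Class decorator that records a plugin under (kind, name)."""
    def decorator(factory):
        _REGISTRY[(kind, name)] = factory
        return factory
    return decorator


def create(kind: str, name: str):
    """Instantiate a registered plugin, or fail with a readable error."""
    try:
        return _REGISTRY[(kind, name)]()
    except KeyError:
        raise ValueError(f"no {kind} plugin named {name!r}") from None


# Toy stand-ins for real speech, language, and video models.
@register("asr", "echo")
class EchoASR:
    def transcribe(self, audio: bytes) -> str:
        return audio.decode("utf-8")


@register("llm", "upper")
class UpperLLM:
    def reply(self, text: str) -> str:
        return text.upper()


@register("tts", "utf8")
class Utf8TTS:
    def synthesize(self, text: str) -> bytes:
        return text.encode("utf-8")


@register("avatar", "none")
class NoAvatar:
    def render(self, audio: bytes) -> List[bytes]:
        return []  # renders no video frames


def run_turn(config: Dict[str, str], audio_in: bytes):
    """One conversational turn: user audio in, reply audio and frames out."""
    asr = create("asr", config["asr"])
    llm = create("llm", config["llm"])
    tts = create("tts", config["tts"])
    avatar = create("avatar", config["avatar"])

    text = asr.transcribe(audio_in)
    answer = llm.reply(text)
    audio_out = tts.synthesize(answer)
    frames = avatar.render(audio_out)
    return audio_out, frames
```

With this layout, swapping a component is a one-line config change, for example `run_turn({"asr": "echo", "llm": "upper", "tts": "utf8", "avatar": "none"}, b"hi")`. A real-time system would stream between stages rather than run them one after another; the sketch only shows the plugin boundary.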

Progress: From a model demo to a complete application

After nearly 3 months of iteration, CyberVerse now integrates two local digital human models, FlashHead and LiveAct, as well as two commercial ones, Baidu Xiling and iFlytek Digital Human. These four are the best I can currently find among open-source and commercial solutions. CyberVerse also integrates large models from OpenAI, Qwen, and Doubao to serve as the digital human's ears, brain, and voice.

Inspired by OpenClaw and Hermes Agent, I started trying to combine the digital human with an Agent. This way, the digital human is not just a paper cutout that chats with you, but also a little helper that can get things done for you. In the overall Agent architecture design, I adopted a two-layer design of a main Agent + SubAgent. The main Agent is responsible for responding to the user, while the SubAgent handles more complex tasks. Currently, I am using pi Agent as the core of the SubAgent; I like pi Agent's simplicity and high extensibility.
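The two-layer split can be sketched with `asyncio`: the main Agent answers right away and hands slow work to a SubAgent running in the background. This is only an illustration under assumptions, not how CyberVerse or pi Agent is implemented; the `task:` prefix used for routing, the `MainAgent` class, and the `sub_agent` coroutine are all hypothetical.

```python
import asyncio
from typing import List, Set


async def sub_agent(task: str) -> str:
    """Hypothetical SubAgent: stands in for slow, multi-step tool use."""
    await asyncio.sleep(0)  # placeholder for real work
    return f"done: {task}"


class MainAgent:
    """Responds to the user immediately and delegates complex tasks."""

    def __init__(self) -> None:
        self.spoken: List[str] = []  # utterances that would go to TTS
        self._pending: Set[asyncio.Task] = set()

    def _needs_sub_agent(self, message: str) -> bool:
        # Assumed routing rule for the sketch; a real system might let
        # the LLM decide through a tool call instead.
        return message.startswith("task:")

    async def handle(self, message: str) -> None:
        if self._needs_sub_agent(message):
            task = message[len("task:"):].strip()
            self.spoken.append(f"Working on it: {task}")
            job = asyncio.create_task(self._delegate(task))
            # Keep a reference so the background task is not garbage
            # collected before it finishes.
            self._pending.add(job)
            job.add_done_callback(self._pending.discard)
        else:
            self.spoken.append(f"You said: {message}")

    async def _delegate(self, task: str) -> None:
        result = await sub_agent(task)
        self.spoken.append(result)  # report back once the work is done

    async def drain(self) -> None:
        """Wait for all background SubAgent jobs to finish."""
        if self._pending:
            await asyncio.gather(*list(self._pending))


async def demo() -> List[str]:
    agent = MainAgent()
    await agent.handle("task: summarize my inbox")
    await agent.handle("hello")  # answered while the task is pending
    await agent.drain()
    return agent.spoken
```

Running `asyncio.run(demo())` shows the point of the design: the conversation continues with "hello" while the delegated task is still in flight, and the SubAgent's result is spoken afterwards.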

Recently, I also added an offline video generation feature, making CyberVerse more like a one-stop digital human workstation: character selection, character editing, offline generation, and real-time calls are all placed in the same system. You can create your own character with just a reference image; if you don't want a digital human avatar, you can also turn off the digital human module and use it as a pure voice Agent.

How far I've taken it alone

Character Selection

After entering CyberVerse, you can see a character library. Each card corresponds to a digital human character.

Note: The characters here are only for demo purposes, are not bundled with CyberVerse, and are not used for commercial purposes.

Character Editing

The character editing page allows you to set the avatar source, character name, character description, voice model, persona style, and more.

Workspace / Offline Generation

CyberVerse supports offline video generation, making it convenient for users to create talking-head videos. The biggest advantage of offline video generation is that you don't have to worry about real-time constraints, allowing you to generate higher-quality videos. It supports both text-driven and audio-driven generation.

Real-time Call

Finally, the most important feature of this project: real-time video calls! This part took a lot of effort, because it involves WebRTC, audio-video synchronization, transitions between the idle video and speaking clips, and coordination between the main Agent and SubAgent.
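As one illustration of the idle/speaking transition problem, the sketch below picks the next frame to send: queued speaking frames take priority, and the idle loop restarts from its first frame whenever speech runs out. It is a simplified assumption of how such a scheduler could look, not CyberVerse's implementation; `ClipScheduler` and its methods are hypothetical, and frames are plain strings standing in for video frames.

```python
from collections import deque
from enum import Enum
from typing import Deque, Iterable, List


class State(Enum):
    IDLE = "idle"
    SPEAKING = "speaking"


class ClipScheduler:
    """Chooses between a looping idle clip and generated speaking frames."""

    def __init__(self, idle_frames: List[str]) -> None:
        if not idle_frames:
            raise ValueError("idle clip must contain at least one frame")
        self.idle_frames = idle_frames
        self.idle_pos = 0
        self.speech: Deque[str] = deque()  # speaking frames waiting to play
        self.state = State.IDLE

    def push_speech(self, frames: Iterable[str]) -> None:
        """Queue frames produced by the avatar model for a reply."""
        self.speech.extend(frames)

    def next_frame(self) -> str:
        # Speaking frames always win over the idle loop.
        if self.speech:
            self.state = State.SPEAKING
            return self.speech.popleft()
        # Speech just ended: restart the idle loop from its first frame
        # so the cut back to idle always lands on the same pose.
        if self.state is State.SPEAKING:
            self.state = State.IDLE
            self.idle_pos = 0
        frame = self.idle_frames[self.idle_pos]
        self.idle_pos = (self.idle_pos + 1) % len(self.idle_frames)
        return frame
```

For example, with idle frames `["i0", "i1"]`, one idle tick followed by `push_speech(["s0", "s1"])` yields `i0, s0, s1, i0, i1` over five calls to `next_frame()`. A production pipeline would also have to keep these frames aligned with the audio track.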

The real feeling of doing open source alone

I've been updating the project for over two months now, working on it basically alone. So far I've received 2 PRs from the community.

What is it like to build a project alone? It's like quietly playing a piano piece by yourself. Occasionally someone passes by and stops to watch for a moment, but there is no pressure to rush into the next performance; everything comes from the heart. Sometimes I do wish I had someone to develop it with. In every tweet, I used to say that everyone is welcome to submit PRs and issues. I no longer say that; developing slowly and at my own pace is also quite nice. Let me set a small goal: keep updating this project for a year.

I've put the project link in the comments section; feel free to grab it if you need it.

Comments

Top 2 of 3 from juejin.cn, machine-translated. The original thread is authoritative.

羽晞 1 like

Bro, impressive. This approach feels like it could spawn a lot of products to play with, but the RTX 5090 requirement is a bit steep. A 1.3B model should be able to run on a laptop 5060, right?

稀土熊猫君

Yes, the compute requirements are still a bit high right now. The 5060 doesn't have enough VRAM to run it. Video models are still quite different from LLMs.

稀土熊猫君

Project address: https://github.com/dsd2077/CyberVerse