Clone Any Voice for Free: Confucius 4-TTS Runs Locally on Your Mac

This is Cang He's 556th original article!

Hello everyone, I'm Cang He.

The other day, in the dead of night, I was scrolling through YouTube and came across a new video from my favorite blogger.

Her name is Nana, and her voice was so pleasant that I got a bit excited. I couldn't help myself and spent some time bringing her voice into my WeSight.

Now, whenever I start a task in WeSight, the desktop pet reports the task progress in real time in Nana's voice. She tells me what she's doing, so after sending a task, I no longer have to keep staring at the screen.

For simple tasks, Nana says she is starting to think, then that she is checking the results, and finally tells me in a cheerful tone that the task is complete.

It's even more fun when I'm coding with Claude Code in WeSight, haha.

This feeling is really great, especially when vibe coding late at night—it makes me feel less lonely, and the warm reminders help me focus on other things.

The best part is that you can plug in your own local TTS model, so generating speech costs no API tokens at all.

For the local TTS model, I used Confucius 4-TTS. It has only 1.3B parameters and supports zero-shot voice cloning without reference text, which makes it very suitable for local deployment.

To set it up, first enable the desktop pet in WeSight (Settings → General → Desktop Pet).

Then enable custom voice and set your favorite voice for each pet.

Currently there are several ways to configure it: use MiniMax's API with your own key, share WeSight's MiniMax API key, point it at your own custom API, or select a local TTS configuration.

You can also upload a reference audio clip and clone a voice with high fidelity in just a few seconds.

For example, I uploaded an audio clip of Nana's voice-over, clicked "Start Cloning," and was able to use that voice for WeSight's desktop pet. That's what you saw in the two videos at the beginning of the article.

To be honest, I initially chose local TTS deployment not only to save tokens but also to make good use of the DGX Spark at my company.

I did some research: models that are too large can't run locally, and many smaller ones don't perform well. Before finalizing the technical choice, I ran a lot of tests on Confucius 4-TTS. Look:

You all remember this video, right? I wanted to clone the little girl's voice.

I connected my Mac to the DGX host over Tailscale, and the DGX ran a local deployment of the open-source Confucius 4-TTS model to handle the voice cloning.

Then I got the cloned audio:

Out of boredom, I put this dubbed audio into the original video—it was pretty funny, haha.

My wife saw me messing around and wanted to try it too. She first recorded her voice—here's the original audio:

After cloning her voice with my local Confucius 4-TTS model, I had the clone recite the Three Character Classic in natural-sounding English.

Not satisfied, I had her introduce Wuhan in Japanese.

It sounds quite similar, you know? The intonation and emotion are pretty spot-on—not something you'd expect from a local model.

Confucius 4-TTS supports 14 languages in total: Chinese, English, Japanese, Korean, German, French, Spanish, Indonesian, Italian, Thai, Portuguese, Russian, Malay, and Vietnamese, with more to come.
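
If you plan to script calls in several of these languages, a small lookup table saves some typos. This is only a minimal sketch, and it rests on an assumption: that the --lang flag takes ISO 639-1 codes. The test command later in this article only confirms "en", so check the repository before relying on the other codes. The helper name lang_code is mine, not part of the project.

```python
# The 14 languages listed above, mapped to their ISO 639-1 codes.
# Assumption: the CLI's --lang flag accepts these codes; the article only shows "en".
LANGUAGE_CODES = {
    "Chinese": "zh", "English": "en", "Japanese": "ja", "Korean": "ko",
    "German": "de", "French": "fr", "Spanish": "es", "Indonesian": "id",
    "Italian": "it", "Thai": "th", "Portuguese": "pt", "Russian": "ru",
    "Malay": "ms", "Vietnamese": "vi",
}


def lang_code(name: str) -> str:
    """Return the code for a supported language name (case-insensitive)."""
    key = name.strip().title()
    if key not in LANGUAGE_CODES:
        supported = ", ".join(sorted(LANGUAGE_CODES))
        raise ValueError(f"{name!r} is not in the supported list: {supported}")
    return LANGUAGE_CODES[key]
```

Failing loudly on an unlisted language is deliberate: it is better to find out before synthesis starts than after.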

This is the original voice of my favorite blogger, Nana. I tried cloning her into different languages.

First, a Luxembourgish sales pitch:

Then a Korean-accented sales pitch:

Isn't that nice? I think this is really useful for cross-border e-commerce and content creation—no need to hire people to record voice-overs in different languages.

I also cloned my own voice. After cloning, I had myself play an AI morning news broadcast.

I usually use Xiao Tuan Tuan's voice pack for Amap navigation. I really wanted to clone Xiao Tuan Tuan's voice, so I casually recorded an audio clip, but there was some noise from the car.

Probably due to background noise, the cloned result didn't meet my expectations.

When using the Confucius 4-TTS model for local cloning, make sure to record clean audio without any noise—this will yield better results.
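
To get a rough idea of whether a clip is clean before cloning, you can compare its quietest and loudest moments. The sketch below uses only the Python standard library and assumes a 16-bit PCM WAV file that contains at least a few short pauses. The function names and the 10% cut-offs are my own choices, not anything from Confucius 4-TTS.

```python
import array
import math
import sys
import wave


def frame_levels_db(path, frame_ms=50):
    """Return the RMS level (in dBFS) of each short frame of a 16-bit PCM WAV."""
    with wave.open(path, "rb") as wav:
        if wav.getsampwidth() != 2:
            raise ValueError("expected 16-bit PCM audio")
        rate = wav.getframerate()
        channels = wav.getnchannels()
        samples = array.array("h", wav.readframes(wav.getnframes()))
    if sys.byteorder == "big":
        samples.byteswap()  # WAV sample data is little-endian
    step = max(1, rate * frame_ms // 1000) * channels
    levels = []
    for start in range(0, len(samples) - step + 1, step):
        chunk = samples[start:start + step]
        rms = math.sqrt(sum(s * s for s in chunk) / len(chunk))
        # Clamp to 1 so digital silence does not produce log10(0).
        levels.append(20 * math.log10(max(rms, 1.0) / 32768))
    return levels


def noise_report(path):
    """Estimate the noise floor and the speech-to-noise gap, both in dB."""
    levels = sorted(frame_levels_db(path))
    if len(levels) < 10:
        raise ValueError("clip is too short to judge")
    k = max(1, len(levels) // 10)
    noise_floor = sum(levels[:k]) / k      # average of the quietest 10% of frames
    speech_level = sum(levels[-k:]) / k    # average of the loudest 10% of frames
    return noise_floor, speech_level - noise_floor
```

A small gap means the pauses are almost as loud as the speech, which is what car noise does to a recording. I would not hard-code a threshold; compare a clip that cloned well against one that did not and pick a cut-off from there.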

The biggest advantage of local deployment is saving tokens. Actually, deployment isn't that troublesome—you can even run it on your Mac.

First, clone the open-source repository to your machine.

git clone https://github.com/netease-youdao/Confucius4-TTS.git
cd Confucius4-TTS

Then create a conda environment. If you already have a suitable one, skip the creation step and just activate it.

conda create -n confuciustts python=3.10 -y
conda activate confuciustts

A conda environment is simply a separate "room" for this project: its dependencies are installed there without conflicting with other projects on your computer.

Next, install the dependencies:

pip install -r requirements.txt

This will take some time. If anything goes wrong, follow the repository's installation instructions.

Once done, you can run the following command to test it. Remember to replace path/to/reference.wav with the path to your own reference audio file:

python example.py \
    --prompt_wav path/to/reference.wav \
    --text "Hello, this is a test of zero-shot voice cloning." \
    --lang en \
    --out output.wav \
    --config config/inference_config.yaml
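
If you want several lines in the same cloned voice, you can loop over the command above from Python. This is a minimal sketch that uses only the flags shown in that command. The function names, the output file naming, and the dry_run switch are my own additions, not part of the repository.

```python
import subprocess
import sys
from pathlib import Path

CONFIG = "config/inference_config.yaml"  # same config path as the command above


def build_command(prompt_wav, text, lang, out_path):
    """Build the example.py call using only the flags shown in the article."""
    return [
        sys.executable, "example.py",  # the Python of the active conda env
        "--prompt_wav", str(prompt_wav),
        "--text", text,
        "--lang", lang,
        "--out", str(out_path),
        "--config", CONFIG,
    ]


def clone_lines(prompt_wav, lines, out_dir, repo_dir=".", dry_run=False):
    """Render each (lang, text) pair to its own WAV file in the same voice."""
    # Absolute paths, because the command runs with repo_dir as working directory.
    prompt_wav = Path(prompt_wav).resolve()
    out_dir = Path(out_dir).resolve()
    out_dir.mkdir(parents=True, exist_ok=True)
    outputs = []
    for index, (lang, text) in enumerate(lines, start=1):
        out_path = out_dir / f"{index:02d}_{lang}.wav"
        cmd = build_command(prompt_wav, text, lang, out_path)
        if dry_run:
            print(" ".join(cmd))  # only show what would be run
        else:
            subprocess.run(cmd, cwd=repo_dir, check=True)
        outputs.append(out_path)
    return outputs
```

For example, clone_lines("path/to/reference.wav", [("en", "The task is complete.")], out_dir="cloned", repo_dir="Confucius4-TTS", dry_run=True) prints the command without running it. Keep in mind that each call starts a fresh process, so the model is presumably loaded again for every line; that is fine for testing but slow for real use.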

Of course, you don't have to sit at the DGX to test. With Tailscale, you can run the tests on it remotely from your Mac.

Finally, use FastAPI to wrap the model service and expose it to WeSight—that's it.
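
Here is roughly what such a wrapper can look like. Treat it as a sketch under stated assumptions, not my exact service: it shells out to the example.py command shown earlier instead of loading the model in-process, the /speak route and its JSON fields are made up for illustration, and voices/nana.wav is a hypothetical path. The request format WeSight expects from a custom or local TTS endpoint is not covered in this article, so adapt the route to whatever it actually calls.

```python
import subprocess
import sys
import tempfile
from pathlib import Path

REPO_DIR = Path("Confucius4-TTS").resolve()        # where the repo was cloned
REFERENCE_WAV = Path("voices/nana.wav").resolve()  # hypothetical reference clip


def synthesis_command(text, lang, out_path):
    """Build the example.py call using only the flags shown in the article."""
    return [
        sys.executable, "example.py",
        "--prompt_wav", str(REFERENCE_WAV),
        "--text", text,
        "--lang", lang,
        "--out", str(out_path),
        "--config", "config/inference_config.yaml",
    ]


def create_app():
    """App factory; FastAPI is imported here so the helper above has no extra deps."""
    from fastapi import FastAPI, HTTPException, Response
    from pydantic import BaseModel

    class SpeakRequest(BaseModel):
        text: str
        lang: str = "en"

    app = FastAPI(title="Local TTS sketch")

    @app.post("/speak")
    def speak(req: SpeakRequest):
        with tempfile.TemporaryDirectory() as tmp:
            out_path = Path(tmp) / "speech.wav"
            result = subprocess.run(
                synthesis_command(req.text, req.lang, out_path),
                cwd=REPO_DIR, capture_output=True, text=True,
            )
            if result.returncode != 0:
                # Return the tail of stderr so failures are visible to the caller.
                raise HTTPException(status_code=500, detail=result.stderr[-500:])
            audio = out_path.read_bytes()
        return Response(content=audio, media_type="audio/wav")

    return app
```

Saved as tts_service.py (a name I picked), it can be started with: uvicorn tts_service:create_app --factory --host 0.0.0.0 --port 8000. Spawning a process per request keeps the sketch short but is slow, because the model is loaded each time; a real service would load the model once at startup and keep it in memory.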

I looked at the technical architecture of Confucius 4-TTS. It uses a "speech encoder + LLM" architecture, with a two-stage process: Text to Semantic (text to semantic tokens) + Semantic to Acoustic (semantic tokens to mel spectrograms).

A BigVGAN vocoder then converts the mel spectrograms into the final audio.
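
To keep the data flow straight, here is a schematic sketch of the three steps chained together. It is not the project's code: every stage is a toy stand-in that only mimics the kind of data passed along, and the type names are mine. It also assumes that both stages receive the reference audio, which the architecture description above does not spell out.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

SemanticTokens = List[int]     # discrete tokens produced from the text
MelFrames = List[List[float]]  # one list of mel-bin values per frame
Waveform = List[float]         # audio samples


@dataclass
class TTSPipeline:
    """Chains the stages: text -> semantic tokens -> mel frames -> audio."""
    text_to_semantic: Callable[[str, Sequence[float]], SemanticTokens]
    semantic_to_acoustic: Callable[[SemanticTokens, Sequence[float]], MelFrames]
    vocoder: Callable[[MelFrames], Waveform]

    def synthesize(self, text: str, reference_audio: Sequence[float]) -> Waveform:
        tokens = self.text_to_semantic(text, reference_audio)
        mel = self.semantic_to_acoustic(tokens, reference_audio)
        return self.vocoder(mel)


# Toy stand-ins so the sketch runs; real stages are neural networks.
def toy_t2s(text, reference_audio):
    return [ord(ch) % 100 for ch in text]          # one fake token per character


def toy_s2a(tokens, reference_audio):
    return [[float(t)] * 4 for t in tokens]        # 4 fake mel bins per token


def toy_vocoder(mel):
    return [sum(frame) / len(frame) for frame in mel for _ in range(2)]  # 2 samples per frame


pipeline = TTSPipeline(toy_t2s, toy_s2a, toy_vocoder)
audio = pipeline.synthesize("hi", reference_audio=[0.0])
```

The point of the split is that each stage has a narrow job: the first decides what is said, the second how it sounds as a spectrogram, and the vocoder only turns that spectrogram into samples.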

To be honest, local TTS deployment allows you to use it without worrying about costs—that's quite comfortable.

And Confucius 4-TTS has only 1.3B parameters, runs locally, and is under the Apache 2.0 license, so commercial use is fine. This is really friendly for individual developers and content creators.

If you want your Agent to "speak," I highly recommend giving it a try. Of course, you can also experience it through WeSight.

I noticed that Confucius 4-TTS was open-sourced by Youdao. Remember, WeSight itself is built on top of Youdao's open-source LobsterAI.

Youdao's AI team feels like it's "quietly doing big things" these days. They don't hold press conferences to show off; they just keep releasing open-source projects one after another (TTS, Agent frameworks, multimodal), all under Apache 2.0 and ready to use.

Alright, if you found this useful, please give a like. Your support is my biggest motivation to keep tinkering.

Whose voice do you most want to clone? Let's chat in the comments.