What Is Voice Cloning?
Voice cloning uses AI to learn the characteristics of a specific voice from audio samples, then generates new speech in that voice from any text input. The result is synthetic speech that sounds like the original speaker — with the same tone, cadence, and pronunciation patterns.
Most voice cloning services operate in the cloud. You upload your voice samples to their servers, the model trains remotely, and you generate speech through their API. This means your voice data sits on someone else’s infrastructure, subject to their data retention policies and security practices.
SoundWorks takes a different approach. VibeVoice runs the entire pipeline — training, inference, and output — on your local machine. Nothing is uploaded. Nothing leaves your device.
Before You Start
You will need:
- SoundWorks installed on Windows 10 or 11
- An NVIDIA GPU with at least 6GB VRAM (recommended for reasonable inference speed)
- About 30 seconds of clear voice recordings in WAV or MP3 format
CPU mode is available but inference will take significantly longer.
Step 1: Prepare Your Voice Samples
Record yourself reading text in a natural, consistent tone, preferably the same tone you’d like to hear later. You only need around 30 seconds of audio total. A few tips for quality samples:
- Use a decent microphone (even a good headset works)
- Record in a quiet room
- Speak at your normal pace and volume
- Avoid long pauses or filler words
- Record multiple short clips and glue them together into one using SoundWorks
Save your recordings as WAV files at 44.1kHz or higher, although 22kHz works as well as 44.1kHz. Remember the golden rule of AI generation: “quality in, quality out”. SoundWorks accepts MP3 as well, but WAV preserves the most detail for cloning.
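If you want to sanity-check your clips before importing them, the sample rate and total length can be read with Python's standard `wave` module. This is a minimal sketch, not part of SoundWorks: the function names and the 22050 Hz and 30 second thresholds are assumptions taken from the guidance above, and it only handles uncompressed PCM WAV files.

```python
import wave

MIN_RATE_HZ = 22050      # assumed lower bound, from the "22kHz" guidance above
TARGET_SECONDS = 30.0    # assumed target, from the "around 30 seconds" guidance


def describe_wav(path):
    """Return (sample_rate_hz, duration_seconds) for a PCM WAV file."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        duration = wav.getnframes() / rate
    return rate, duration


def check_samples(paths):
    """Return a list of human-readable warnings for a set of WAV clips."""
    warnings = []
    total = 0.0
    for path in paths:
        rate, duration = describe_wav(path)
        total += duration
        if rate < MIN_RATE_HZ:
            warnings.append(f"{path}: sample rate {rate} Hz is below {MIN_RATE_HZ} Hz")
    if total < TARGET_SECONDS:
        warnings.append(f"total length {total:.1f} s is under {TARGET_SECONDS:.0f} s")
    return warnings
```

An empty list from `check_samples([...])` means the clips meet both thresholds. It says nothing about noise or microphone quality, which you still need to judge by ear.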
Step 2: Import Samples into SoundWorks
If VibeVoice is not installed, you can install it through the Options window. Go to Options, select the TTS Engines tab, and click “Install VibeVoice TTS”. Once it is installed, click “Configure VibeVoice TTS” in the same tab. Installation is covered in detail in a separate document and video.
Open SoundWorks and select Main Window from the launcher. In the main window, go to “Tools - AI - VibeVoice TTS”. Then either select a profile, create a profile, or simply select the voice sample, model, and other parameters. In Model Configuration, specify bfloat16 quantization (selected by default) to achieve real-time VibeVoice generation.
Step 3: Use Profiles
SoundWorks saves your settings into profiles, so you can easily select a working preset later. Saved profiles appear as voices in the SoundWorks main window. Different voices suit different models: you may notice that the faster 7B model is often as good as the large 20B model, but a particular voice or project may need that little extra effort. Larger models produce more natural speech but require more VRAM and generation time:
| Model Size | VRAM Required | Inference Time (GPU) | Quality |
|---|---|---|---|
| 1.5B parameters | 4GB | Faster than Real-time | Good |
| 7B parameters | 6GB | Real-time | Very good |
| 20B parameters | 12GB+ | Near Real-time | Excellent |
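As a rough illustration of how to read the table, the sketch below picks the largest model whose listed VRAM requirement fits a given card. The `MODELS` list and the function name are hypothetical, the figures are copied from the table, and “12GB+” is treated as a 12 GB minimum, which is an assumption.

```python
# (model name, listed VRAM requirement in GB), copied from the table above,
# ordered from smallest to largest. "12GB+" is treated as a 12 GB minimum.
MODELS = [
    ("1.5B", 4),
    ("7B", 6),
    ("20B", 12),
]


def largest_model_for(vram_gb):
    """Return the largest model whose listed requirement fits, or None."""
    best = None
    for name, required_gb in MODELS:
        if vram_gb >= required_gb:
            best = name
    return best
```

For example, `largest_model_for(8)` returns `"7B"`, and a card with less than 4 GB returns `None`, in which case CPU mode is the remaining option.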
VibeVoice doesn’t require training; it learns from your samples on the fly.
You can adjust several parameters during profile creation. Usually you don’t have to, because the default values are meant to provide the best balance between perceived quality and performance, but you still have access to all of the settings under the hood:
- Model Path: Where your model files are downloaded. This lets you specify a different model for different voices (e.g. 7B vs 20B)
- Device: Whether to use your NVIDIA card or the CPU (which is much slower)
- Single or multiple actor samples: You may clone one speaker or up to four speakers to use in the same project
- Diffusion Steps: 1 to 100, with a default of 25. More steps generally mean higher quality but longer generation time
- CFG Scale: 1.0 to 3.5. Controls how closely the result follows your sampled voice
- Seed: Enter any value to vary the generated voice. Some values may work better than others, but usually any will do
- Attention Type: It’s best to leave it at “Auto”, but you can change it to whatever works best for you
- Model Quantization: Greatly affects the result, and some models may require a specific quantization method. For best results with both the 7B and 20B models, keep the default bfloat16, which provides the best speed and quality.
- Queue Mode: Possibly the most important setting. When enabled, SoundWorks keeps the VibeVoice models in memory for a few minutes between requests, which makes sequential requests much faster. It is also required for “VibeGlue”, the mode in which SoundWorks can generate audio of unlimited length with stable high quality. The idle timeout sets how long the models stay in memory while no inference is running, so you won’t waste memory if you leave SoundWorks open in the background.
- Multi-Speaker Setup: You can specify up to four speakers and mention them in your text, so that a different voice is used for phrases belonging to each speaker.
Tip: Start with the smaller model to test your setup and sample quality. You can always switch to a larger model later with the same samples.
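The documented ranges for these settings can be summarised as a small validation sketch. The `VoiceSettings` class and its field names are hypothetical and are not SoundWorks identifiers. Only the ranges (Diffusion Steps 1 to 100 with a default of 25, CFG Scale 1.0 to 3.5, up to four speakers) and the bfloat16 default come from the list above. No default is stated for CFG Scale, so it is a required argument here.

```python
from dataclasses import dataclass


@dataclass
class VoiceSettings:
    """Hypothetical settings container; names are illustrative only."""
    cfg_scale: float                 # documented range: 1.0 to 3.5
    diffusion_steps: int = 25        # documented range: 1 to 100, default 25
    speakers: int = 1                # documented maximum: 4
    quantization: str = "bfloat16"   # documented default

    def problems(self):
        """Return a list of messages for values outside the documented ranges."""
        issues = []
        if not 1.0 <= self.cfg_scale <= 3.5:
            issues.append("cfg_scale must be between 1.0 and 3.5")
        if not 1 <= self.diffusion_steps <= 100:
            issues.append("diffusion_steps must be between 1 and 100")
        if not 1 <= self.speakers <= 4:
            issues.append("speakers must be between 1 and 4")
        return issues
```

`VoiceSettings(cfg_scale=3.0).problems()` returns an empty list, while out-of-range values each produce one message.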
Step 4: Generate Speech
In the main window, enter your text and click Process. You can monitor progress in real time. Generation runs entirely on your machine, and no internet connection is used.
You can also verify that there is no network activity during this process. On Windows, open Resource Monitor and check the Network tab: SoundWorks should show 0 bytes sent/received.
SoundWorks is not just a simple GUI for TTS engines; it works differently. For example, it has the “VibeGlue” mode for VibeVoice, which allows you to generate audio from texts of any length, e.g., whole audiobooks. TTS engines usually limit the length of the input text because quality degrades as the text gets longer. This is especially true for VibeVoice, as it is an LLM-based model. SoundWorks solves this issue.
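How VibeGlue works internally is not described here. To give a feel for the general problem, the sketch below shows one generic way to handle long text: split it at sentence boundaries into chunks under a size limit, so each chunk can be synthesized separately and the audio joined afterwards. This is an assumption-based illustration, not VibeGlue's implementation, and the function name and the 400-character default are arbitrary.

```python
import re


def split_into_chunks(text, max_chars=400):
    """Group whole sentences into chunks of at most max_chars characters.

    A single sentence longer than max_chars is kept intact as its own chunk.
    """
    # Split after sentence-ending punctuation followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks = []
    current = ""
    for sentence in sentences:
        if not sentence:
            continue
        candidate = f"{current} {sentence}" if current else sentence
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

Splitting only at sentence boundaries keeps each chunk a natural unit of speech, which is usually where a join between two audio segments is least noticeable.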
Another great VibeVoice feature of SoundWorks is the ability to create multiple audio clips in a row without unloading VibeVoice models from memory, which saves you approximately 10 seconds per inference.
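The idea of keeping a model loaded between requests and releasing it after an idle period can be sketched as a small cache. This is a generic illustration under stated assumptions, not SoundWorks code: `IdleCache`, `loader`, and `idle_timeout_s` are hypothetical names, and eviction only happens when `get` or `evict_if_idle` is called, so a real application would also need a background timer to free memory while nothing is running.

```python
import time


class IdleCache:
    """Keep one loaded object until it has been idle longer than a timeout."""

    def __init__(self, loader, idle_timeout_s, clock=time.monotonic):
        self._loader = loader              # callable that performs the slow load
        self._idle_timeout_s = idle_timeout_s
        self._clock = clock                # injectable for testing
        self._value = None
        self._last_used = None
        self.load_count = 0                # how many slow loads have happened

    def evict_if_idle(self):
        """Drop the cached object if the idle timeout has passed."""
        if self._value is None:
            return False
        if self._clock() - self._last_used > self._idle_timeout_s:
            self._value = None
            return True
        return False

    def get(self):
        """Return the cached object, loading it again if it was evicted."""
        self.evict_if_idle()
        if self._value is None:
            self._value = self._loader()
            self.load_count += 1
        self._last_used = self._clock()
        return self._value
```

With a load time of roughly 10 seconds, as mentioned above, every call to `get` that finds the object still cached skips that wait, which is where the speed-up for consecutive clips comes from.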
VibeVoice doesn’t support SSML markup, so SSML options are disabled while a VibeVoice voice is selected.
Step 5: Export and Use
Export your generated audio automatically as MP3, or convert it to FLAC, AC3, AAC, Ogg, Opus, WMA, or another format using the SoundWorks internal batch converter. The output is ready to use in any project: podcasts, videos, presentations, audiobooks, or any other application that needs spoken audio.
Privacy Considerations
Every step of this process happens on your machine:
- Voice samples are read from your local disk
- Inference runs on your local GPU (or CPU)
- Generated speech is saved to your local disk
- No network requests are made at any point during cloning or generation
VibeVoice doesn’t create a separate voice model, so there is no model file that could be leaked. No one else has access to your voice unless you explicitly share the sample files.
Important: If you are cloning someone else’s voice, make sure you have their explicit permission. Voice cloning technology should be used responsibly and ethically. Models used by SoundWorks are not censored.
Common Issues
Inference is very slow. Check that SoundWorks is using your GPU, not CPU mode. Open the Settings panel and verify that GPU acceleration is enabled. If you have multiple GPUs, select the correct one. Then check the quantization method and make sure it is bfloat16.
Generated speech sounds robotic. This usually means the voice samples are insufficient. Try adding more diverse speech samples: different sentences, different emotional tones, different speaking speeds. A low CFG Scale may also cause unrealistic results; try setting it to 3 or higher (the maximum is 3.5).
What’s Next
Once you have a working voice profile, you can use it across SoundWorks features:
- Slide-to-Video: Narrate presentations with your cloned voice
- Subtitle Studio: Generate audio from subtitle scripts
- Batch processing: Generate speech for multiple text files at once
Voice cloning is one of the most powerful features in SoundWorks, and it runs entirely on your hardware with zero privacy compromises. Download SoundWorks free and try it yourself.