What Is Local TTS?
Local text-to-speech runs entirely on your hardware. No audio is sent to the cloud, no API keys are needed, and there are no usage limits or per-character charges. SoundWorks includes three local TTS engines, each with different capabilities.
Supported Local Engines
Silero TTS
Silero is a lightweight, fast TTS engine that runs on the CPU and does not need a GPU. It supports multiple languages with pre-trained voice models and delivers consistent quality at high speed. It is ideal for batch processing where you need reliable output without GPU overhead.
- Max text length: 4,096 characters per request
- Sample rate: 48 kHz
- Device: CPU (no GPU required)
- Multiple voice models across languages
- Speed control for adjustable playback rate
IndexTTS2
IndexTTS2 is an advanced zero-shot TTS system with voice cloning and emotion control. Based on industrial-level controllable speech synthesis research, it produces expressive, natural-sounding speech with fine-grained control over delivery.
- Max text length: 8,192 characters per request
- Voice cloning: Clone any voice from a reference audio sample
- Emotion control: Adjust emotional expression through emotion vectors
- Duration control: Fine-tune speech timing and pacing
- Device: CPU or CUDA (GPU acceleration supported)
Qwen3 TTS
Qwen3-TTS is the newest engine, based on Alibaba’s Qwen3-TTS-12Hz models. It offers three generation modes and covers 10 languages with 9 premium built-in speakers.
- Model variants: 1.7B parameters (higher quality) and 0.6B parameters (faster)
- Three modes:
  - Custom Voice: Use predefined speakers with optional style instructions
  - Voice Design: Generate a voice from a text description (e.g., “a warm, elderly male narrator”)
  - Voice Clone: Clone any voice from reference audio
- 10 languages: Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, Italian
- Device: CPU or CUDA (multi-GPU supported)
- Profile system: Save and switch between voice configurations
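The capabilities listed above can be collected into a small lookup table. This is an illustrative sketch only: `EngineSpec`, `ENGINES`, and `engines_with` are hypothetical names, not part of SoundWorks, and the values simply restate the bullet lists (a feature is marked `True` only where it is listed above; Qwen3's per-request limit is left as `None` because it is not stated).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class EngineSpec:
    """Hypothetical summary of one engine, restating the lists above."""
    name: str
    max_chars: Optional[int]  # per-request text limit; None where not stated
    gpu_acceleration: bool    # True if CUDA is listed as a device option
    voice_cloning: bool
    emotion_control: bool
    voice_design: bool


ENGINES = {
    "silero": EngineSpec("Silero TTS", 4096, False, False, False, False),
    "indextts2": EngineSpec("IndexTTS2", 8192, True, True, True, False),
    "qwen3": EngineSpec("Qwen3 TTS", None, True, True, False, True),
}


def engines_with(feature: str) -> list[str]:
    """Return the keys of engines whose spec has the named boolean feature set."""
    return [key for key, spec in ENGINES.items() if getattr(spec, feature)]
```

For example, `engines_with("voice_cloning")` narrows the choice to the two engines that list voice cloning.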
Why Use Local TTS?
Complete privacy. Your text never leaves your machine. Generate speech from confidential scripts, unpublished manuscripts, business documents, or personal content without any data exposure.
No usage limits. Generate as much speech as your hardware can handle. No per-character charges, no monthly quotas, no throttling. Process entire audiobooks without worrying about API costs.
No internet dependency. Work offline on flights, in studios without network access, or in secure environments. The engines run independently of any external service.
GPU or CPU. IndexTTS2 and Qwen3 support GPU acceleration for significantly faster generation, but all three engines work on CPU. Without a GPU, generation still works; it just takes longer.
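The GPU-or-CPU fallback described above might look like the following in a script of your own. This is a minimal sketch that assumes a PyTorch-based setup, which the text does not confirm for SoundWorks; `pick_device` is a hypothetical helper, and `torch.cuda.is_available()` is the only library call used.

```python
def pick_device(prefer_gpu: bool = True) -> str:
    """Return "cuda" when a CUDA-capable PyTorch install is usable, else "cpu"."""
    if not prefer_gpu:
        return "cpu"
    try:
        # Assumption: PyTorch is the backend. Fall back to CPU if it is absent.
        import torch
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"
```

Passing `prefer_gpu=False` always returns `"cpu"`, which matches how the CPU-only Silero engine is described.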
How It Works
Step 1: Choose an engine. Select Silero for speed and simplicity, IndexTTS2 for emotion control and voice cloning, or Qwen3 for multilingual coverage and voice design.
Step 2: Configure voice. Select a pre-trained voice, clone a voice from reference audio, or design a custom voice using text descriptions (Qwen3 only).
Step 3: Enter text. Type or paste your script. For long content, SoundWorks automatically handles text splitting and chunk assembly.
Step 4: Generate. Click Generate. Audio is created locally and saved in your chosen output format. For large texts, VibeGlue technology handles unlimited-length generation seamlessly.
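To give a sense of what the text splitting in Step 3 involves, here is a minimal sketch that breaks a script into chunks no longer than an engine's per-request limit, preferring sentence boundaries. It is not SoundWorks's or VibeGlue's actual algorithm; `split_text` and its sentence-boundary rule are assumptions made for illustration.

```python
import re


def split_text(text: str, max_chars: int = 4096) -> list[str]:
    """Split text into chunks of at most max_chars, preferring sentence ends."""
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    # Split after sentence-ending punctuation that is followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        if not sentence:
            continue
        # Hard-wrap any single sentence that is longer than the limit.
        pieces = [sentence[i:i + max_chars]
                  for i in range(0, len(sentence), max_chars)]
        for piece in pieces:
            candidate = f"{current} {piece}" if current else piece
            if len(candidate) <= max_chars:
                current = candidate  # still fits: keep growing this chunk
            else:
                chunks.append(current)  # chunk is full: start a new one
                current = piece
    if current:
        chunks.append(current)
    return chunks
```

With the limits listed earlier, you would call `split_text(script, max_chars=4096)` for Silero or `max_chars=8192` for IndexTTS2, then synthesize each chunk and join the audio in order.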
Frequently Asked Questions
Which engine should I use? Silero for fast, reliable output on CPU. IndexTTS2 for voice cloning with emotion control. Qwen3 for multilingual content or when you want to design a voice from a text description.
Do I need a GPU? No. All engines work on CPU. GPU acceleration (NVIDIA CUDA) makes IndexTTS2 and Qwen3 significantly faster but is not required.
How much disk space do the models need? Silero models are small (under 100 MB each). IndexTTS2 and Qwen3 models range from 1 GB to several GB depending on the variant. SoundWorks downloads models on first use.
Can I use locally generated speech commercially? Check the license terms for each model. Silero and Qwen3 models have permissive licenses. IndexTTS2 licensing may vary by version.
Can I switch between local and cloud engines? Yes. All TTS engines share the same text input workflow. Switch engines at any time without re-entering your script.