Local AI Text-to-Speech Engines

What Is Local TTS?

Local text-to-speech runs entirely on your hardware. No audio is sent to the cloud, no API keys are needed, and there are no usage limits or per-character charges. SoundWorks includes three local TTS engines, each with different capabilities.

Supported Local Engines

Silero TTS

Silero is a lightweight, fast TTS engine that runs on CPU, with no GPU required. It supports multiple languages through pre-trained voice models and delivers consistent quality at high speed. It is ideal for batch processing where you need reliable output without GPU overhead.

Max text length: 4,096 characters per request
Sample rate: 48 kHz
Device: CPU (no GPU required)
Multiple voice models across languages
Speed control for adjustable playback rate

IndexTTS2

IndexTTS2 is an advanced zero-shot TTS system with voice cloning and emotion control. Based on industrial-level controllable speech synthesis research, it produces expressive, natural-sounding speech with fine-grained control over delivery.

Max text length: 8,192 characters per request
Voice cloning: Clone any voice from a reference audio sample
Emotion control: Adjust emotional expression through emotion vectors
Duration control: Fine-tune speech timing and pacing
Device: CPU or CUDA (GPU acceleration supported)

Qwen3 TTS

Qwen3-TTS is the newest engine, based on Alibaba’s Qwen3-TTS-12Hz models. It offers three generation modes and covers 10 languages with 9 premium built-in speakers.

Model variants: 1.7B parameters (higher quality) and 0.6B parameters (faster)
Three modes:
- Custom Voice: Use predefined speakers with optional style instructions
- Voice Design: Generate a voice from a text description (e.g., “a warm, elderly male narrator”)
- Voice Clone: Clone any voice from reference audio
10 languages: Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, Italian
Device: CPU or CUDA (multi-GPU supported)
Profile system: Save and switch between voice configurations
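
The limits listed for the three engines can be collected in a small lookup table, for example to check a script before sending it to an engine. This is only an illustrative sketch: the `EngineSpec` type, the dictionary keys, and the `fits_in_one_request` helper are hypothetical names, not part of SoundWorks. No per-request limit is stated here for Qwen3, so it is recorded as `None`.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class EngineSpec:
    """Illustrative record of the limits listed above (not a SoundWorks type)."""
    name: str
    max_chars: Optional[int]      # None where no per-request limit is stated
    devices: Tuple[str, ...]      # device names as listed for the engine
    lists_voice_cloning: bool     # whether voice cloning is listed as a feature


# Values copied from the engine descriptions above; the keys are made up.
ENGINES = {
    "silero": EngineSpec("Silero TTS", 4096, ("cpu",), False),
    "indextts2": EngineSpec("IndexTTS2", 8192, ("cpu", "cuda"), True),
    "qwen3": EngineSpec("Qwen3 TTS", None, ("cpu", "cuda"), True),
}


def fits_in_one_request(engine_key: str, text: str) -> bool:
    """Return True if the text is within the engine's stated per-request limit."""
    spec = ENGINES[engine_key]
    return spec.max_chars is None or len(text) <= spec.max_chars
```

A script of 5,000 characters would fail this check for Silero and pass it for IndexTTS2.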

Why Use Local TTS?

Complete privacy. Your text never leaves your machine. Generate speech from confidential scripts, unpublished manuscripts, business documents, or personal content without any data exposure.

No usage limits. Generate as much speech as your hardware can handle. No per-character charges, no monthly quotas, no throttling. Process entire audiobooks without worrying about API costs.

No internet dependency. Work offline on flights, in studios without network access, or in secure environments. The engines run independently of any external service.

GPU or CPU. All three engines run on CPU. IndexTTS2 and Qwen3 can also use an NVIDIA GPU through CUDA, which makes generation significantly faster. Without a GPU, CPU mode still works; it just takes longer.
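
The CPU fallback described above can be sketched as a small device check. This assumes the engines run on PyTorch, which this page does not confirm, and `pick_device` is a hypothetical helper rather than a SoundWorks function. If PyTorch is missing, the sketch simply returns "cpu".

```python
def pick_device(engine_supports_cuda: bool, prefer_gpu: bool = True) -> str:
    """Return "cuda" when a usable NVIDIA GPU is detected, otherwise "cpu"."""
    if not (engine_supports_cuda and prefer_gpu):
        return "cpu"
    try:
        # Optional dependency; assumed backend, not confirmed by this page.
        import torch
    except Exception:
        # PyTorch is missing or failed to load: fall back to CPU.
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"
```

For a CPU-only engine such as Silero, `pick_device(False)` always returns "cpu".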

How It Works

Step 1: Choose an engine. Select Silero for speed and simplicity, IndexTTS2 for emotion control and voice cloning, or Qwen3 for multilingual coverage and voice design.

Step 2: Configure voice. Select a pre-trained voice, clone a voice from reference audio, or design a custom voice using text descriptions (Qwen3 only).

Step 3: Enter text. Type or paste your script. For long content, SoundWorks automatically handles text splitting and chunk assembly.

Step 4: Generate. Click generate. Audio is created locally and saved to your chosen output format. For large texts, VibeGlue technology handles unlimited-length generation seamlessly.
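
To illustrate the text splitting mentioned in Step 3, here is a minimal sketch of one way to break a long script into chunks that respect a per-request character limit. It is not the SoundWorks or VibeGlue implementation, whose details are not described here. The `split_text` name and the sentence-boundary rule are assumptions made for the example.

```python
import re


def split_text(text: str, max_chars: int) -> list[str]:
    """Split text into chunks of at most max_chars, preferring sentence ends."""
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    # Split after ., ! or ? followed by whitespace; that whitespace is dropped.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        # Hard-cut any single sentence that is longer than the limit.
        while len(sentence) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:max_chars])
            sentence = sentence[max_chars:]
        if not sentence:
            continue
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= max_chars:
            current = candidate
        else:
            # The sentence does not fit: close the current chunk, start a new one.
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

With the limits listed earlier, a script would be split with `max_chars=4096` for Silero or `max_chars=8192` for IndexTTS2. The audio generated for each chunk would then be joined in order.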

Frequently Asked Questions

Which engine should I use? Silero for fast, reliable output on CPU. IndexTTS2 for voice cloning with emotion control. Qwen3 for multilingual content or when you want to design a voice from a text description.
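
The guidance in this answer can be written as a simple decision function. The function name, the flags, and the order in which they are checked are assumptions made for illustration. SoundWorks does not choose an engine for you.

```python
def suggest_engine(needs_voice_design: bool = False,
                   multilingual: bool = False,
                   needs_emotion_control: bool = False,
                   needs_voice_cloning: bool = False) -> str:
    """Map the needs named in the answer above to an engine name."""
    # Voice design from a text description is a Qwen3-only feature.
    if needs_voice_design or multilingual:
        return "Qwen3 TTS"
    # Emotion control and voice cloning point to IndexTTS2.
    if needs_emotion_control or needs_voice_cloning:
        return "IndexTTS2"
    # Default: fast, reliable output on CPU.
    return "Silero TTS"
```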

Do I need a GPU? No. All engines work on CPU. GPU acceleration (NVIDIA CUDA) makes IndexTTS2 and Qwen3 significantly faster but is not required.

How much disk space do the models need? Silero models are small (under 100 MB each). IndexTTS2 and Qwen3 models range from 1 GB to several GB depending on the variant. SoundWorks downloads models on first use.

Can I use locally generated speech commercially? Check the license terms for each model. Silero and Qwen3 models have permissive licenses. IndexTTS2 licensing may vary by version.

Can I switch between local and cloud engines? Yes. All TTS engines share the same text input workflow. Switch engines at any time without re-entering your script.
