Definition
Text-to-speech (TTS) synthesizes spoken audio from written text. Modern neural TTS produces voices natural enough to be hard to distinguish from a recording, and several capable models ship with open weights, so an operator can generate speech locally without sending text to a cloud service.
Two-stage pipeline
A typical neural TTS system works in two stages. First, an acoustic model maps the input text to time-aligned acoustic features, most commonly a mel spectrogram, which encodes how the sound's frequency content should evolve over time. Then, a vocoder converts that spectrogram into an audio waveform. Tacotron 2 is a landmark example: a sequence-to-sequence network predicts mel spectrograms from character embeddings, and a WaveNet-style neural vocoder turns those spectrograms into raw audio samples. Later vocoders such as WaveGlow generate high-quality speech faster than real time, which made neural TTS practical for everyday use.
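The data flow between the two stages can be sketched in a few lines of Python. This is only an illustration of the interfaces, not a working synthesizer: `toy_acoustic_model` and `toy_vocoder` are hypothetical stand-ins for trained neural networks, and the sample rate, hop length, band count, and per-character duration are assumed values chosen for the example.

```python
import math
from typing import List

SAMPLE_RATE = 22050   # audio samples per second (assumed)
HOP_LENGTH = 256      # audio samples produced per spectrogram frame (assumed)
N_MELS = 80           # mel bands per frame (assumed)
FRAMES_PER_CHAR = 5   # fixed toy duration per character (assumed)


def toy_acoustic_model(text: str) -> List[List[float]]:
    """Stage 1 stand-in: text -> mel-like frames, shape [T][N_MELS].

    A real acoustic model learns durations and spectral content.
    Here each character just activates one band for a fixed time.
    """
    frames: List[List[float]] = []
    for ch in text:
        band = ord(ch) % N_MELS
        frame = [1.0 if i == band else 0.0 for i in range(N_MELS)]
        frames.extend([frame] * FRAMES_PER_CHAR)
    return frames


def toy_vocoder(frames: List[List[float]]) -> List[float]:
    """Stage 2 stand-in: frames -> waveform samples in [-0.5, 0.5].

    A real vocoder is a neural network. This one plays a sine tone
    whose pitch follows the most active band of each frame.
    """
    samples: List[float] = []
    for t, frame in enumerate(frames):
        band = max(range(N_MELS), key=lambda i: frame[i])
        freq = 100.0 + 50.0 * band  # arbitrary band-to-pitch mapping
        for n in range(HOP_LENGTH):
            k = t * HOP_LENGTH + n  # absolute sample index
            samples.append(0.5 * math.sin(2 * math.pi * freq * k / SAMPLE_RATE))
    return samples


def synthesize(text: str) -> List[float]:
    """Chain the two stages: text -> frames -> waveform."""
    return toy_vocoder(toy_acoustic_model(text))
```

Calling `synthesize("hi")` yields one list of float samples, and `len(samples) / SAMPLE_RATE` gives the clip duration in seconds. The point of the split is that either stage can be swapped independently as long as both agree on the frame format.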
Sovereignty considerations
Running TTS locally keeps written content (which may include private notes, documentation, or correspondence) off third-party servers. Smaller open TTS models run on a CPU, while higher-fidelity or multi-speaker models benefit from a GPU. The voice-cloning capabilities of some models raise legitimate consent and misuse concerns, so a sovereign operator should treat reference voices as sensitive data.
D-Central documents text-to-speech as the output half of a voice interface; the input half, turning speech into text, is speech-to-text. Together they let a self-hosted assistant both listen and speak entirely on local hardware.
