Definition
Speech-to-text, also called automatic speech recognition (ASR), converts spoken audio into written text. The best-known open-weight example is OpenAI's Whisper, an encoder-decoder transformer released with downloadable weights that can transcribe and translate speech across many languages. Because the weights are public, Whisper can run entirely on local hardware, making private, offline transcription practical.
How Whisper works
Whisper takes a simple end-to-end approach. Input audio is resampled to 16 kHz, split into 30-second chunks, and converted into a log-Mel spectrogram, a frequency-versus-time representation of the sound. A transformer encoder turns that spectrogram into context-aware embeddings, and a transformer decoder predicts the corresponding text token by token. Special tokens steer the same model to perform language identification, timestamping, multilingual transcription, and translation into English. Whisper was trained on roughly 680,000 hours of multilingual, multitask supervised audio collected from the web, which accounts for much of its robustness to accents and background noise.
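The chunking step and the special-token prefix can be illustrated with a minimal sketch. It is not Whisper's own code: the helper names `chunk_audio` and `decoder_prompt` are hypothetical, the 16 kHz sample rate is the rate Whisper resamples to, and the prompt is shown as readable strings where the real model works with token IDs.

```python
SAMPLE_RATE = 16_000                         # Whisper resamples input audio to 16 kHz
CHUNK_SECONDS = 30                           # window length described above
CHUNK_SAMPLES = SAMPLE_RATE * CHUNK_SECONDS  # 480,000 samples per window


def chunk_audio(samples):
    """Split samples into 30-second windows, zero-padding the final one.

    Hypothetical helper for illustration only.
    """
    chunks = []
    for start in range(0, len(samples), CHUNK_SAMPLES):
        window = list(samples[start:start + CHUNK_SAMPLES])
        # Pad a short final window with silence so every chunk has equal length.
        window += [0.0] * (CHUNK_SAMPLES - len(window))
        chunks.append(window)
    return chunks


def decoder_prompt(language, task="transcribe", timestamps=True):
    """Build the special-token prefix that tells the decoder what to do.

    Hypothetical helper; tokens are shown as strings, not vocabulary IDs.
    """
    if task not in ("transcribe", "translate"):
        raise ValueError("task must be 'transcribe' or 'translate'")
    prompt = ["<|startoftranscript|>", f"<|{language}|>", f"<|{task}|>"]
    if not timestamps:
        prompt.append("<|notimestamps|>")
    return prompt
```

Calling `decoder_prompt("fr", task="translate")` yields a prefix that asks the model to read French speech and emit English text, which is how one set of weights covers several tasks.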
Model sizes and self-hosting
Whisper ships in several sizes: tiny (39M parameters), base (74M), small (244M), medium (769M), and large (1.55B). An operator can therefore trade accuracy against speed and memory. The smaller variants run comfortably on a CPU or modest GPU, so voice notes, meeting transcripts, or command input can be processed without any audio ever leaving the machine.
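As a rough illustration of that trade-off, the sketch below picks the largest size whose weights fit a memory budget and then transcribes a file locally. The helper names and the sizing rule (parameters times bytes per parameter, ignoring activations) are assumptions made for this example. The transcription step assumes the open-source `openai-whisper` Python package and ffmpeg are installed; weights are downloaded once on first use, after which the audio is processed on the local machine.

```python
# Parameter counts as listed above, ordered smallest to largest.
WHISPER_PARAMS = {
    "tiny": 39_000_000,
    "base": 74_000_000,
    "small": 244_000_000,
    "medium": 769_000_000,
    "large": 1_550_000_000,
}


def largest_model_within(budget_bytes, bytes_per_param=4):
    """Return the largest size whose weights fit the budget, or None.

    Assumption: memory ~ parameters * bytes_per_param (4 for fp32, 2 for fp16).
    Activations and decoding buffers are ignored, so this is a lower bound.
    """
    best = None
    for name, params in WHISPER_PARAMS.items():
        if params * bytes_per_param <= budget_bytes:
            best = name  # later entries are larger, so keep overwriting
    return best


def transcribe_locally(audio_path, budget_bytes):
    """Transcribe a file with the largest model that fits the budget."""
    import whisper  # assumes the `openai-whisper` package is installed

    name = largest_model_within(budget_bytes)
    if name is None:
        raise MemoryError("no Whisper size fits the given budget")
    model = whisper.load_model(name)             # loads weights for that size
    return model.transcribe(audio_path)["text"]  # plain transcript string
```

With a budget of about 1 GB and 32-bit weights, the selector settles on the small model. Real memory use is higher than this estimate, so leaving headroom is sensible.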
D-Central documents speech-to-text as the input half of a voice pipeline; the output half, generating spoken audio from text, is handled by text-to-speech. Transcribed text can then be embedded for semantic search.
