Whisper Large v3
OpenAI · Whisper family · Released November 2023
OpenAI's November 2023 open ASR model — 1.55B params, MIT-licensed, the open reference for multilingual speech-to-text.
Model card
| Developer | OpenAI |
|---|---|
| Family | Whisper |
| License | MIT |
| Modality | audio |
| Parameters (B) | 1.55 |
| Context window | n/a (30-second audio windows) |
| Release date | November 2023 |
| Primary languages | en,fr,de,es,it,pt,zh,ja,ko,ar,ru,hi |
| Hugging Face | openai/whisper-large-v3 |
| Ollama | Not on Ollama registry |
Whisper Large v3 released: OpenAI’s ASR workhorse, MIT-licensed
OpenAI just released Whisper Large v3 — the third generation of its open-weight automatic speech recognition (ASR) model, under the MIT License. Weights are live on Hugging Face at openai/whisper-large-v3 as of today, announced via the OpenAI Whisper GitHub repo. The model is a 1.55-billion-parameter encoder-decoder transformer trained on approximately 5 million hours of multilingual audio (1M hours weakly labeled plus 4M hours pseudo-labeled), and it’s the reference open model for transcription, translation, and language identification across dozens of languages.
Whisper is the quiet workhorse of the open-weight AI ecosystem. Large language models get the cultural attention; Whisper gets deployed. Podcast transcription pipelines, meeting-notes tools, accessibility tooling, subtitling, voice interfaces for self-hosted assistants — all of it runs on Whisper. The v3 release is a meaningful capability upgrade over v2 (December 2022) and further cements Whisper as the transcription default for plebs who want to keep audio data off other people’s servers. Below: what’s new in v3, the VRAM and CPU math, and the pleb workflow for self-hosted transcription.
What’s in the weights
Whisper is a standard encoder-decoder transformer adapted for audio. Audio is converted to a log-mel spectrogram (a 2D representation of sound frequency over time), chunked into 30-second windows, and fed through the transformer encoder. The decoder generates text tokens autoregressively, conditioned on the encoder output and a set of special task tokens at the start of the sequence. Those task tokens are what make Whisper multitask: the same weights handle transcription, translation to English, and language identification, selected by prompt rather than by separate models.
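To make the multitask prompting concrete, here is a minimal sketch in plain Python with no Whisper dependency. The special-token strings mirror the ones documented in Whisper's vocabulary; the helper function itself is our illustration, not part of any library:

```python
# Sketch of how Whisper's decoder prompt selects the task.
# The <|...|> token strings mirror Whisper's documented special tokens;
# build_prompt is an illustrative helper, not a library function.

def build_prompt(language: str, task: str, timestamps: bool = True) -> list[str]:
    """Assemble the special-token prefix that conditions the decoder."""
    if task not in ("transcribe", "translate"):
        raise ValueError("task must be 'transcribe' or 'translate'")
    tokens = ["<|startoftranscript|>", f"<|{language}|>", f"<|{task}|>"]
    if not timestamps:
        tokens.append("<|notimestamps|>")
    return tokens

# Same weights, different task, selected purely by prompt:
print(build_prompt("de", "transcribe"))  # German speech -> German text
print(build_prompt("de", "translate"))   # German speech -> English text
```

Swapping one token flips the model from transcription to English translation, which is why a single set of weights covers all three tasks.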
The lineage: Transformer (Vaswani et al., 2017) → encoder-decoder speech recognition research (Listen, Attend and Spell; wav2vec-style self-supervised encoders) → Whisper v1 (September 2022, trained on 680K hours) → Whisper v2 (December 2022, same architecture, improved training) → Whisper Large v3 today.
Key specs for v3:
- 1.55B parameters, encoder-decoder transformer
- Training data: ~5M hours of multilingual audio (1M hours weakly labeled plus 4M hours pseudo-labeled, vs ~680K hours for v1/v2)
- Mel spectrogram: 128 mel bins (up from 80 in v2), for finer frequency resolution
- Audio chunking: 30-second windows, with overlapping stitching for continuous audio
- Languages: 99 supported, with a new language token for Cantonese added in v3
- Tasks: transcription (source-language), translation (to English), language identification
- License: MIT — fully permissive, commercial use unrestricted
The v3 improvements, per OpenAI’s release notes: reduced errors across most languages relative to v2 (10–20% average error rate reduction depending on language), better handling of non-English transcription (particularly strong gains in low-resource languages), and improved timestamp accuracy. The 128-mel spectrogram input is the main architectural change — everything else is training data and training recipe refinement.
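The input-shape arithmetic behind that change is easy to check. Whisper resamples audio to 16 kHz and uses a 160-sample hop (10 ms) when computing the spectrogram, so a 30-second chunk always becomes 3,000 frames; v3 widens each frame from 80 to 128 mel bins:

```python
# Back-of-envelope check of Whisper's encoder input shape.
# Constants from the Whisper implementation: 16 kHz audio,
# hop length of 160 samples (10 ms), 30-second chunks.
SAMPLE_RATE = 16_000
HOP_LENGTH = 160
CHUNK_SECONDS = 30

frames = CHUNK_SECONDS * SAMPLE_RATE // HOP_LENGTH
v2_shape = (80, frames)    # 80 mel bins in large-v2
v3_shape = (128, frames)   # 128 mel bins in large-v3

print(frames)    # 3000 frames per 30 s chunk
print(v3_shape)  # (128, 3000)
```

So the encoder sees the same 3,000 time steps as before, just with 60% more frequency detail per step.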
Benchmark snapshot — word error rate
Whisper is evaluated with word error rate (WER), the standard ASR metric. Lower is better. WER counts word-level edits (insertions, deletions, substitutions) needed to transform the transcript into the ground truth, normalized by the reference length. The WER numbers below come from OpenAI’s v3 release notes and the HF model card:
- LibriSpeech test-clean: ~1.8% WER. This is near-human-level on clean, prepared English speech.
- LibriSpeech test-other: ~3.6% WER. Noisier English conditions — still strong.
- Common Voice 15: WER varies by language. English in the 5–7% range on diverse speakers and accents; low-resource languages meaningfully improved over v2.
- FLEURS (multilingual): broad gains over v2 across the benchmark’s language coverage, with v3 setting new lows for the open-weight reference on many language pairs.
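As a sanity check on the metric itself, WER is just word-level Levenshtein distance divided by the reference length. A minimal implementation (note that published numbers also apply text normalization, lowercasing, punctuation stripping, and so on, before scoring, which this sketch skips):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # One-row dynamic-programming edit distance:
    # d[j] = distance between ref[:i] and hyp[:j] after row i.
    d = list(range(len(hyp) + 1))
    for i in range(1, len(ref) + 1):
        prev_diag = d[0]
        d[0] = i
        for j in range(1, len(hyp) + 1):
            cur = min(
                d[j] + 1,                                # deletion
                d[j - 1] + 1,                            # insertion
                prev_diag + (ref[i - 1] != hyp[j - 1]),  # substitution
            )
            prev_diag, d[j] = d[j], cur
    return d[len(hyp)] / max(len(ref), 1)

print(wer("turn the heat up", "turn the heat up"))   # 0.0
print(wer("turn the heat up", "turn a heat up"))     # 0.25 (one substitution in four words)
```

One wrong word in a four-word reference is 25% WER, which is why the ~1.8% LibriSpeech number means roughly one error per fifty-plus words.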
The practical takeaway for plebs: on clean English audio (podcasts with good mics, prepared lectures, clear voice-over), Whisper Large v3 produces transcripts that are nearly publication-ready with minimal proofreading. On noisy English or on non-English content, WER climbs but remains usable for most workflows. For languages where v2 was marginal (Welsh, Amharic, Javanese, and similar low-resource tongues), v3 is meaningfully better.
Sovereign pleb implications — why this stays the default
Whisper has been the open-weight ASR default for a year now, and v3 extends rather than disrupts that position. The reasons it wins over alternatives for self-hosted use:
- MIT license — no asterisks, no revenue thresholds, no use-case restrictions. You can deploy Whisper in any product at any scale.
- Modest VRAM footprint — 1.55B parameters is small by 2023 standards and trivial by frontier-LLM standards
- Language coverage — 99 languages from a single set of weights
- Tool ecosystem — whisper.cpp, faster-whisper, WhisperX, and the native OpenAI repo all picked up v3 support quickly after release
- CPU-runnable — unusual for modern open models; Whisper runs credibly on CPU, opening deployment options GPUs don’t
The VRAM and CPU math for v3:
- Whisper Large v3 FP16 on GPU: about 3GB VRAM. Runs on anything from a GTX 1060 upward. On a used RTX 3090, transcription is 20–50x faster than real-time.
- Whisper Large v3 INT8 via faster-whisper (CTranslate2): about 1.5GB VRAM. Real-time or faster on a modest laptop GPU. The recommended deployment for most pleb workflows.
- Whisper Large v3 on CPU via whisper.cpp: 1.5–3GB RAM with Q5 quantization. A modern 8-core CPU transcribes at roughly real-time speed — usable for batch workflows and surprisingly practical for a 1.5B model.
- Smaller Whisper variants (base, small, medium): still available at 74M, 244M, 769M parameters. For plebs running on Raspberry Pi class hardware or embedded devices, these are the sensible targets.
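The VRAM figures above follow directly from parameter count times bytes per weight; actual usage adds activation and framework overhead, which this sketch ignores:

```python
# Rough weight-memory math for Whisper variants at different precisions.
# Real deployments add activation buffers and runtime overhead on top.
PARAMS = {"base": 74e6, "small": 244e6, "medium": 769e6, "large-v3": 1.55e9}
BYTES_PER_WEIGHT = {"fp32": 4, "fp16": 2, "int8": 1}

def weight_gb(model: str, precision: str) -> float:
    """Weight storage in GB for a given model size and precision."""
    return PARAMS[model] * BYTES_PER_WEIGHT[precision] / 1e9

print(round(weight_gb("large-v3", "fp16"), 2))  # ~3.1 GB
print(round(weight_gb("large-v3", "int8"), 2))  # ~1.55 GB
print(round(weight_gb("small", "fp16"), 2))     # ~0.49 GB
```

The same arithmetic explains why the small and base variants fit comfortably on Raspberry Pi class hardware.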
Deployment patterns plebs actually use in 2023:
- Podcast and video transcription: faster-whisper on a GPU box, batch-processing show archives into searchable text. A single 3090 can transcribe a 1-hour podcast in 2–3 minutes.
- Meeting notes for home labs: Whisper + a small LLM (Llama 2 or Mistral 7B) for summarization = fully self-hosted meeting transcripts without Zoom or Otter seeing the audio.
- Voice interfaces for self-hosted assistants: faster-whisper as the STT frontend for Home Assistant voice, Rhasspy, or custom voice pipelines. Paired with a TTS model (Piper, XTTS) and an LLM, this is a complete voice-enabled assistant stack running entirely on the Hashcenter.
- Subtitling and accessibility: WhisperX adds word-level timestamps and speaker diarization on top of Whisper, producing subtitle files with speaker labels. For content creators, this is the open-source subtitle pipeline.
- Archival and search: transcribing large archives of phone calls, radio recordings, or family video into searchable text corpora. The privacy argument here is the whole case for self-hosting — audio archives hold serious personal data that doesn’t belong on someone else’s cloud.
For plebs running inference-as-heater builds, Whisper doesn’t fill a GPU the way an LLM does — batch transcription is intermittent, not sustained. But it pairs well as a secondary workload on a Hashcenter already running LLMs: the same hardware handles audio and text without needing dedicated ASR infrastructure.
How to run it today
Three paths, each suited to a different deployment context:
- faster-whisper (CTranslate2 backend) — the pleb default for GPU-based transcription. Install with pip install faster-whisper, pass the model name large-v3, and it downloads and runs. Dramatically faster than the reference OpenAI implementation. github.com/SYSTRAN/faster-whisper.
- whisper.cpp — the CPU-friendly C++ implementation, with optional GPU acceleration via Metal (Apple Silicon) or CUDA. Best for embedded deployments, laptop use, or CPU-only servers. github.com/ggerganov/whisper.cpp.
- WhisperX — faster-whisper plus word-level timestamps and speaker diarization. The right choice for subtitle generation or any workflow where you need to know not just what was said but when and by whom. github.com/m-bain/whisperX.
Weights download automatically on first use for all three tools — you don’t need to fetch from HF manually unless you’re deploying to an air-gapped environment. For production deployments, the self-hosted AI troubleshooting guide covers the common Whisper-specific issues (VAD tuning, long-form audio stitching, language detection failures).
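A minimal faster-whisper sketch, assuming pip install faster-whisper and a CUDA GPU. The audio path is a placeholder, the heavy part is gated behind an environment variable so the script is safe to import, and the SRT-timestamp helper is our own illustration, not part of the library:

```python
import os

def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp, e.g. 83.5 -> '00:01:23,500'."""
    ms = int(round(seconds * 1000))
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1_000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

if os.environ.get("RUN_TRANSCRIBE"):
    # Opt-in: downloads the model on first run and needs a GPU as configured.
    from faster_whisper import WhisperModel

    model = WhisperModel("large-v3", device="cuda", compute_type="int8_float16")
    segments, info = model.transcribe("podcast.mp3", beam_size=5)  # placeholder path
    print(f"Detected language: {info.language} (p={info.language_probability:.2f})")
    for i, seg in enumerate(segments, 1):
        # Emit SRT-style numbered cues with start/end timestamps.
        print(f"{i}\n{srt_timestamp(seg.start)} --> {srt_timestamp(seg.end)}\n{seg.text.strip()}\n")
```

Swap compute_type to "int8" for CPU-only boxes, or device to "auto" to let CTranslate2 pick; the segments generator streams results, so long recordings don't need to fit in memory at once.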
What comes next
OpenAI has not published a roadmap for Whisper v4, and there’s no particular reason to expect one imminently — v3 is the kind of release that solidifies the open-weight ASR default for a couple of years. Likely developments are on the community side: continued distillation work (distil-whisper from Hugging Face produces a 6x faster variant with minor quality cost), further language-specific fine-tunes, and integration into more downstream tools.
The competitive landscape for open-weight ASR: Meta’s SeamlessM4T is the main peer, with different trade-offs (multilingual translation focus, larger model). NVIDIA’s NeMo ASR models and various academic systems exist but don’t have Whisper’s combination of permissive license, broad tool support, and multi-task capability. For plebs, Whisper Large v3 is the clear reference choice going into 2024.
Bigger picture: a 1.55B MIT-licensed model that can transcribe 99 languages on a laptop is the kind of tool that makes self-hosted AI stacks actually useful. LLMs get the headlines, but for most plebs the audio pipeline is where the rubber meets the road — podcasts, calls, meetings, voice commands. Whisper Large v3 handles all of that today, on hardware plebs already own, under a license that doesn’t restrict what they can build. See the Sovereign AI for Bitcoiners Manifesto for the broader case, the pleb’s guide to self-hosted AI for how Whisper fits into a multimodal home stack, and Bitcoin space heater for the hardware side. Pull the weights, pick the tool that fits your deployment, own your audio.
Recommended hardware
Runs on 8 GB VRAM or Apple Silicon 16 GB unified — a used 3060 or an M1/M2 Mac handles this fine.
Get it running
1. Install Ollama → Ten-minute local LLM runtime. One binary, zero cloud.
2. Give it a web UI → Open-WebUI turns Ollama into a self-hosted ChatGPT.
3. Understand quantization → GGUF Q4/Q8/FP16 — which weights fit your GPU, explained.
Further reading: the Sovereign AI for Bitcoiners Manifesto for why sovereign inference matters, and From S19 to Your First AI Hashcenter for repurposing your mining rack into a Hashcenter that runs models like this one.
