Stable Diffusion 3.5
Stability AI · Stable Diffusion family · Released October 2024
Stability AI's October 2024 MMDiT flagship — 2.5B (Medium) and 8B (Large) variants with dramatically improved prompt adherence over SDXL.
Model card
| Developer | Stability AI |
|---|---|
| Family | Stable Diffusion |
| License | Stability AI Community |
| Modality | image-gen |
| Parameters (B) | 2.5 (Medium) / 8 (Large) |
| Context window | N/A (image generation) |
| Release date | October 2024 |
| Primary languages | en |
| Hugging Face | stabilityai/stable-diffusion-3.5-large |
| Ollama | Not on Ollama registry |
Stable Diffusion 3.5 ships: Stability AI’s second run at the MMDiT era
Stability AI just released Stable Diffusion 3.5 — the full family: a 2B “Medium”, an 8B “Large”, and, shortly after, a 4-step distilled Large Turbo. Weights are on Hugging Face as of today under the Stability AI Community License: free for research and non-commercial use, and free for commercial use under $1M annual revenue. Above that threshold, you pay. The release announcement frames SD 3.5 as the successor to the SD 3 Medium release from June 2024 — the one that shipped under so much community backlash that Stability went back to retrain.
This matters because SD 3.5 is Stability’s second swing at the MMDiT (Multimodal Diffusion Transformer) architecture in the open-weight market. SD 3 Medium in June was technically novel but landed with license confusion, capability gaps on human figures, and a general sense the company had rushed it. SD 3.5 is the production release Stability meant SD 3 to be, and it arrives into a landscape where FLUX.1 dev has been eating Stability’s share of pleb attention for three months. Below: what’s in the architecture, how the variants differ for home rigs, and whether SD 3.5 has a seat in the pleb stack given FLUX is already there.
What’s in the weights
SD 3.5 is an MMDiT flow-matching model. That's a different architectural family from SDXL's U-Net — the diffusion transformer approach originated with the DiT research (Peebles & Xie, 2022) and was pulled into text-to-image by Stability's own SD 3 paper earlier this year. The lineage to credit: DDPM (Ho et al., 2020) → latent diffusion (CompVis/LMU Munich, 2022) → Scalable Diffusion Transformers (DiT, 2022) → SDXL (2023) → SD 3 (2024) → SD 3.5 today.
The variants and architecture:
- SD 3.5 Medium (2.5B parameters) — improved MMDiT-X variant, native 1MP output, designed to run on consumer hardware
- SD 3.5 Large (8B parameters) — the flagship, straight MMDiT with the quality ceiling of the family
- SD 3.5 Large Turbo (8B, 4-step distilled) — guidance-distilled variant that drops step count dramatically at a quality cost
- Native resolution: 1024×1024 for all variants, with flexibility for other aspect ratios via size conditioning
- Text encoders (Large): three encoders in parallel — CLIP-L, CLIP-G, and T5-XXL. The T5-XXL is the big improvement for long, detailed prompts
- Text encoders (Medium): CLIP-L + CLIP-G; T5 is optional and improves long-prompt fidelity when loaded
- Flow matching objective — the MMDiT is trained with rectified flow rather than classical DDPM-style denoising
- License: Stability AI Community License — commercial use free under $1M annual revenue, paid license required above
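The flow-matching bullet above compresses a lot. A toy sketch of the rectified-flow objective, with made-up dimensions and a stand-in for the MMDiT (the real model also conditions on text and timestep embeddings): interpolate data and noise on a straight line, and train the network to predict the constant velocity between them.

```python
# Toy rectified-flow training step. Shapes and the "model" are placeholders,
# not SD 3.5 internals.
import numpy as np

rng = np.random.default_rng(0)
x0 = rng.standard_normal((8, 4))   # clean latents (batch of 8)
x1 = rng.standard_normal((8, 4))   # gaussian noise
t = rng.uniform(size=(8, 1))       # per-sample timestep in [0, 1]

x_t = (1.0 - t) * x0 + t * x1      # straight-line interpolation
v_target = x1 - x0                 # constant velocity along that line

def model(x_t, t):
    # Stand-in for the MMDiT; it would predict the velocity field.
    return np.zeros_like(x_t)

loss = np.mean((model(x_t, t) - v_target) ** 2)  # MSE on velocity
print(float(loss) >= 0.0)
```

Sampling then integrates the learned velocity field from noise back to data, which is why Euler-style samplers and low step counts work well here.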
The “MMDiT” part of the architecture is the piece worth understanding for plebs. In a classical U-Net diffusion model, the text conditioning gets injected into the denoiser via cross-attention at each layer. In MMDiT, text tokens and image tokens are concatenated and processed together through joint attention — treating the text encoder’s output as first-class tokens the model attends to directly, not as a side input. That’s closer to how modern multimodal LLMs handle images than how older image models handled text. The practical effect: better prompt adherence on long and structurally complex prompts, especially when the T5-XXL encoder is loaded.
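A minimal sketch of that joint-attention idea, with single-head attention and made-up dimensions. Real MMDiT uses separate QKV projections and modulation per stream; this only shows the structural point that text and image tokens are mixed in one attention pass.

```python
# Joint attention over concatenated text + image tokens (illustrative only).
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def joint_attention(text_tokens, image_tokens, w_q, w_k, w_v):
    # Concatenate both streams: text tokens are first-class inputs,
    # not a cross-attention side channel.
    x = np.concatenate([text_tokens, image_tokens], axis=0)  # (T+I, d)
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    out = attn @ v
    # Split back into the two streams after joint mixing.
    return out[: len(text_tokens)], out[len(text_tokens):]

d = 16
text = rng.standard_normal((4, d))    # e.g. 4 text tokens
image = rng.standard_normal((9, d))   # e.g. a 3x3 grid of latent patches
wq, wk, wv = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))
t_out, i_out = joint_attention(text, image, wq, wk, wv)
print(t_out.shape, i_out.shape)  # (4, 16) (9, 16)
```

Every image token can attend to every text token and vice versa at each block, which is the mechanism behind the improved long-prompt adherence.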
What SD 3.5 does well
Release-day community reaction, cross-referenced against Stability’s own technical claims:
- Prompt adherence: meaningful step up over SDXL on complex, multi-subject, compositional prompts. The triple text encoder stack on Large is the reason.
- Human figures: fixed the worst of SD 3 Medium’s anatomy failures. Not quite FLUX-level, but credibly usable for portrait work.
- Text in images: short text is generally readable at 1024 resolution. Longer text still breaks, but the worst SDXL failure modes are mostly gone.
- Style flexibility: Stability explicitly emphasized style variety in the release — the model is less locked into a default “SD aesthetic” than SDXL was, which matters for plebs doing varied creative work.
- Photorealism at 1MP native: skin, lighting, and material response land cleaner than SDXL without requiring hi-res fix.
Where SD 3.5 still has gaps: LoRA and ControlNet ecosystems are thinner than SDXL’s deep bench, though they’re building fast now that the weights are public. FLUX.1 dev and FLUX.1 schnell still win on raw prompt adherence and photorealism in most head-to-head tests. And the Community License’s $1M revenue threshold is a real consideration for any operator thinking seriously about commercial deployment — the FLUX.1 schnell Apache 2.0 path is cleaner for commercial use, even if quality is slightly lower.
Sovereign pleb implications — VRAM and workflow
The variant choice is the whole game for home rigs:
- SD 3.5 Medium FP16: about 6GB of model VRAM, plus text encoders. Comfortable on a 12GB RTX 3060 or 4070. This is the broadest-access variant: if you have any modern consumer GPU with 12GB+, Medium runs cleanly.
- SD 3.5 Large FP16 (all text encoders loaded): about 16GB of model VRAM plus ~8GB for T5-XXL when loaded simultaneously. Total peak VRAM around 24GB with T5 in memory. Fits on a used RTX 3090 or 4090 tightly; comfortable on a 48GB A6000.
- SD 3.5 Large FP8: about 10GB for the model. T5 can be offloaded to CPU between runs. Runs on a 16GB 4080 or similar with careful management. Quality cost is minor.
- SD 3.5 Large GGUF Q5/Q4: community quants via ComfyUI-GGUF land in the 5–7GB range for the model weights. Makes Large runnable on 12GB consumer cards with T5 offload.
- SD 3.5 Large Turbo: same VRAM footprint as Large, but 4 steps instead of 30–40 means roughly 5–8x faster per image. Quality is notably below Large but above Medium for most prompts.
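The figures above are roughly bytes-per-parameter times parameter count. A back-of-envelope check, ignoring activations, the VAE, and framework overhead (the quant bytes-per-param values are approximations, and real peak VRAM runs a few GB higher):

```python
# Rough model-weight sizes for the SD 3.5 variants at different precisions.
PARAMS = {"medium": 2.5e9, "large": 8e9}
BYTES_PER_PARAM = {"fp16": 2.0, "fp8": 1.0, "q5": 0.65, "q4": 0.55}  # approx

def model_gb(variant, fmt):
    # Weights only: excludes text encoders, VAE, and runtime overhead.
    return PARAMS[variant] * BYTES_PER_PARAM[fmt] / 1e9

print(f"large fp16 ~ {model_gb('large', 'fp16'):.0f} GB")
print(f"large fp8  ~ {model_gb('large', 'fp8'):.0f} GB")
print(f"large q4   ~ {model_gb('large', 'q4'):.1f} GB")
```

Large FP16 pencils out to ~16GB of weights, which is why the ~8GB T5-XXL pushes a full-precision Large workflow right up against a 24GB card.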
A practical pleb workflow for a 24GB card: run Medium for rapid iteration and batch work, swap to Large for final renders when quality matters, keep Turbo loaded for quick thumbnails and prompt exploration. For 12–16GB cards, Medium is the daily driver and Large via quantization is the “final render” tool.
T5-XXL offload is the key lever. ComfyUI handles this via its node graph — you can structure workflows to load T5 only during the initial text encode pass, then unload before the MMDiT sampling starts. That peak-VRAM shaping is the difference between “runs on a 16GB card” and “OOM on a 24GB card” for Large workflows.
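The same encode-then-free pattern works outside ComfyUI. A hedged diffusers sketch, assuming a recent diffusers release with `StableDiffusion3Pipeline` and its `encode_prompt` helper (exact kwargs can shift between versions):

```python
# Stage the text encode first, then drop T5-XXL before sampling starts.
def two_stage_generate(prompt: str):
    import gc
    import torch
    from diffusers import StableDiffusion3Pipeline

    pipe = StableDiffusion3Pipeline.from_pretrained(
        "stabilityai/stable-diffusion-3.5-large", torch_dtype=torch.bfloat16
    ).to("cuda")

    # Stage 1: run the text encoders once, keep only the embeddings.
    with torch.no_grad():
        embeds, neg, pooled, neg_pooled = pipe.encode_prompt(
            prompt=prompt, prompt_2=prompt, prompt_3=prompt
        )

    # Stage 2: release T5-XXL (~8GB) so sampling gets the VRAM back.
    pipe.text_encoder_3 = None
    pipe.tokenizer_3 = None
    gc.collect()
    torch.cuda.empty_cache()

    return pipe(
        prompt_embeds=embeds, negative_prompt_embeds=neg,
        pooled_prompt_embeds=pooled, negative_pooled_prompt_embeds=neg_pooled,
        num_inference_steps=28, guidance_scale=4.5,
    ).images[0]
```

The embeddings are small; it's the encoder weights that cost VRAM, so freeing them between stages reshapes the peak exactly the way the ComfyUI node-graph trick does.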
Sampler and workflow notes for ComfyUI users
SD 3.5 uses flow-matching rather than classical denoising, which changes which samplers apply:
- Recommended sampler: `euler` with `simple` scheduling, or `dpmpp_2m` for slightly cleaner output. Stability's reference workflows use `euler` + `sgm_uniform`.
- Step count: 28–40 for Large, 20–30 for Medium, 4 for Large Turbo.
- CFG: generally lower than SDXL — 3.5–5.0 is the working range rather than SDXL’s 7.0–9.0.
- Workflow graphs: ComfyUI shipped reference workflows for SD 3.5 on release day. SwarmUI and Forge both have SD 3.5 support as of today as well.
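The CFG numbers above are just the scale factor in classifier-free guidance: the sampler extrapolates from the unconditional prediction toward the text-conditioned one. A toy illustration with made-up outputs (real pipelines apply this to the model's velocity/noise prediction at every step):

```python
# Classifier-free guidance: uncond + cfg * (cond - uncond).
import numpy as np

uncond = np.array([0.10, -0.20])   # pretend unconditional prediction
cond = np.array([0.30, 0.10])      # pretend text-conditioned prediction

def guide(cfg):
    return uncond + cfg * (cond - uncond)

print(guide(4.5))   # inside the SD 3.5 working range
print(guide(8.0))   # SDXL-style scale: a much harder push off the uncond path
```

At `cfg=1.0` you get the conditional prediction back unchanged; the higher scales SDXL needed tend to overcook SD 3.5, hence the lower 3.5–5.0 range.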
If you have an existing ComfyUI setup running SDXL or FLUX, adding SD 3.5 is a matter of dropping the checkpoint into models/checkpoints, downloading the T5-XXL encoder weights if you don’t already have them from FLUX, and loading the reference workflow. No reinstall, no major rewiring — the ComfyUI ecosystem has been getting progressively better at supporting new architectures without forcing pleb re-setup.
How to run it today
Weights are on Hugging Face at stabilityai/stable-diffusion-3.5-large, stabilityai/stable-diffusion-3.5-medium, and the Turbo variant from the same organization. License acceptance is required on the HF model pages before download.
The recommended tool for pleb-grade SD 3.5 workflows is ComfyUI — the flexibility of node-based workflows matches the model’s need for careful VRAM management and T5 offload. Our ComfyUI for plebs guide covers installation and the basic workflow graph. SwarmUI (a ComfyUI-backed alternative UI) and Forge (A1111 fork) both support SD 3.5 as of today. Diffusers library support is in the latest releases for anyone scripting generation.
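For the scripting path, a minimal diffusers sketch with the settings discussed above. Assumes a recent diffusers release with `StableDiffusion3Pipeline`, that you've accepted the license on the HF model page, and that you're logged in to the Hub; the prompt and output filename are placeholders.

```python
# Quickstart settings for SD 3.5 Large via diffusers.
SETTINGS = {
    "model_id": "stabilityai/stable-diffusion-3.5-large",
    "num_inference_steps": 28,   # 28-40 for Large
    "guidance_scale": 4.5,       # SD 3.5 range, lower than SDXL's 7-9
    "height": 1024,
    "width": 1024,
}

def generate(prompt: str):
    # Heavy imports kept inside the function so the settings dict is
    # usable without pulling in torch/diffusers.
    import torch
    from diffusers import StableDiffusion3Pipeline

    pipe = StableDiffusion3Pipeline.from_pretrained(
        SETTINGS["model_id"], torch_dtype=torch.bfloat16
    )
    pipe.enable_model_cpu_offload()  # keeps idle components off the GPU
    return pipe(
        prompt,
        num_inference_steps=SETTINGS["num_inference_steps"],
        guidance_scale=SETTINGS["guidance_scale"],
        height=SETTINGS["height"],
        width=SETTINGS["width"],
    ).images[0]

if __name__ == "__main__":
    generate("weathered lighthouse at dusk, volumetric fog").save("sd35.png")
```

`enable_model_cpu_offload` is the lazy version of the T5 offload lever: diffusers shuttles each component to the GPU only while it's needed.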
For troubleshooting VRAM errors, OOM crashes, or slow generation, the self-hosted AI troubleshooting guide covers the common causes.
What comes next
Expect the LoRA and ControlNet ecosystems for SD 3.5 to build out over the next few months — SDXL took roughly a year to reach its mature tooling state, and SD 3.5 will be faster because the community tooling (ComfyUI, Forge) is already mature and the training infrastructure for MMDiT is well-documented from SD 3’s release. Fine-tunes focused on anime, photorealism, and specific style domains will appear on Civitai and Hugging Face quickly.
Bigger picture: SD 3.5 is Stability’s credible entry into the post-FLUX era of open image generation. The architectural approach — MMDiT with triple text encoders — is where the field is heading. FLUX.1 dev remains the pleb choice for raw quality on a 24GB card, but SD 3.5 Medium’s 12GB accessibility and Large’s commercial-use flexibility (under the revenue threshold) both carve out real niches. For plebs running inference-as-heater builds, SD 3.5 Large at sustained batch generation is the kind of 300W+ continuous GPU load that pairs well with Hashcenter thermal profiles. See the Sovereign AI for Bitcoiners Manifesto for the case, related image-model retrospectives SDXL and FLUX.1 dev for comparison points, and the S19 to AI Hashcenter piece for the hardware transition story. Pull the weights, spin up ComfyUI, and own your pixels — that’s the play.
Recommended hardware
Runs on 12 GB VRAM — RTX 3060 12 GB / 4070 / M2 territory. Sweet spot for home rigs.
Further reading: the Sovereign AI for Bitcoiners Manifesto for why sovereign inference matters, and From S19 to Your First AI Hashcenter for repurposing your mining rack into a Hashcenter that runs models like this one.
