Command R+
Cohere · Command family · Released April 2024
Cohere's April 2024 RAG-native flagship — 104B dense, first-class grounded citation and tool use, CC-BY-NC 4.0.
Model card
| Developer | Cohere |
|---|---|
| Family | Command |
| License | CC-BY-NC 4.0 |
| Modality | text |
| Parameters (B) | 104 |
| Context window | 128000 |
| Release date | April 2024 |
| Primary languages | en, fr, de, es, it, pt, ja, ko, zh, ar |
| Hugging Face | CohereForAI/c4ai-command-r-plus |
| Ollama | ollama pull command-r-plus |
Command R+ ships: Cohere’s 104B RAG-native open-weight flagship
Cohere just released Command R+ — a 104-billion parameter dense transformer, open-weight on Hugging Face under CC-BY-NC 4.0 (non-commercial), specifically engineered for retrieval-augmented generation, tool use, and enterprise-grade multilingual work. The release blog frames it as the largest open-access Command model yet and the production-grade sibling to the smaller Command R (35B) released a month earlier. Weights are on CohereForAI/c4ai-command-r-plus as of today.
Command R+ is a different kind of release than the Mistral-style “here are the weights, figure out what to do with them” approach. It’s an opinionated model, trained with specific production workflows as first-class citizens: grounded citation output for RAG, native function/tool calling with structured JSON schemas, and broad multilingual coverage across 10 languages that Cohere considers commercially important. For plebs whose work involves retrieval against their own document corpora — the archetypal “point an LLM at my Markdown notes” or “build a local assistant that knows my codebase” use case — Command R+ is a model trained specifically for that job. Below: what’s in the weights, the benchmark snapshot, and the honest pleb verdict on whether 104B is worth the VRAM cost when Llama 3 is coming any week now.
What’s in the weights
Command R+ is a dense decoder-only transformer at 104B parameters. The Cohere Command lineage: the original Command (2023, closed) → Command R (March 2024, 35B, open-weight under CC-BY-NC) → Command R+ today. Cohere has been iterating on enterprise-grade instruct training since its founding in 2019, and Command R+ is the scaled-up version of the RAG-native instruction-tuning work they’ve been publishing for years. Credit to the transformer lineage (“Attention Is All You Need”, Vaswani et al., 2017) and to the enterprise NLP research community that defined the retrieval-augmented generation paradigm.
Key specs:
- 104B parameters, dense (not MoE)
- Context window: 128K tokens — long enough for substantial document retrieval workflows
- Grouped-Query Attention for efficient inference at this scale
- Languages: primary support for English, French, Spanish, Italian, German, Portuguese, Japanese, Korean, Arabic, and Chinese
- Native grounded generation: structured output format for citing retrieved documents with inline references
- Native tool calling: JSON-schema-based function calling with multi-step tool use support
- License: CC-BY-NC 4.0 (non-commercial research use); commercial deployment requires a Cohere license
The grounded-generation mode is the headline capability. When you pass Command R+ a query plus a set of retrieved documents, it responds in a format that includes structured citation markers — not just “according to the documents” hand-waving, but actual inline reference numbers tied to specific document IDs. That’s the output format RAG systems want to produce anyway; Command R+ bakes it into the model rather than requiring prompt-engineering gymnastics or post-processing. For production RAG use cases (customer support agents, legal document retrieval, internal knowledge bases), that’s a meaningful capability improvement over general-purpose models.
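To make that concrete, here is a minimal sketch of what a downstream pipeline does with citation-marked output: pull out which document backs each claim, and strip the markers for display. The `<co: N>…</co: N>` span syntax is an illustrative assumption, not Cohere's exact raw format — check the model card's grounded-generation template for the real one.

```python
import re

# Hypothetical grounded-generation output: inline spans tagged with the
# index of the retrieved document they cite. The marker syntax here is
# an assumption for illustration, not Cohere's exact raw format.
SAMPLE = (
    "The filing deadline is <co: 0>April 15</co: 0>, and extensions "
    "run to <co: 2>October 15</co: 2>."
)

CITE = re.compile(r"<co: (\d+)>(.*?)</co: \1>")

def extract_citations(text: str) -> list[tuple[int, str]]:
    """Return (doc_index, cited_text) pairs from a grounded response."""
    return [(int(doc), span) for doc, span in CITE.findall(text)]

def strip_markers(text: str) -> str:
    """Plain-text answer with the citation markers removed for display."""
    return CITE.sub(r"\2", text)
```

The point is that each cited claim arrives already tied to a specific retrieved document ID — exactly the audit trail a production RAG pipeline needs, with no prompt-engineering gymnastics to recover it.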
Native tool calling is similarly well-thought-out. You pass a list of available tools with JSON schemas, and the model produces structured function-call output when it wants to use a tool. Multi-step tool use (plan → call tool → reason over result → call another tool → answer) is a trained capability rather than a prompting trick. For agentic workflows, that’s the difference between “usually works” and “reliably works.”
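The shape of that workflow, sketched as a minimal dispatch loop: tools declared as JSON schemas, the model's structured call parsed and routed to a real function. The schema layout and the `tool_calls` JSON shape below are illustrative assumptions, not Cohere's exact wire format.

```python
import json

# A tool declared as a JSON schema. The layout mirrors common
# function-calling conventions; Cohere's exact schema format may differ.
TOOLS = {
    "get_weather": {
        "description": "Look up current weather for a city.",
        "parameters": {"city": {"type": "string", "required": True}},
    }
}

def get_weather(city: str) -> str:
    # Stub standing in for a real API call.
    return f"Sunny in {city}"

REGISTRY = {"get_weather": get_weather}

# Hypothetical structured output from the model when it decides to call
# a tool (the real response shape is defined by Cohere's chat template).
model_output = json.dumps(
    {"tool_calls": [{"name": "get_weather", "parameters": {"city": "Lisbon"}}]}
)

def dispatch(raw: str) -> list[str]:
    """Parse the model's tool-call JSON and invoke the matching functions."""
    results = []
    for call in json.loads(raw).get("tool_calls", []):
        fn = REGISTRY[call["name"]]  # KeyError here means a hallucinated tool
        results.append(fn(**call["parameters"]))
    return results
```

In a real multi-step loop, each result gets fed back to the model as a tool message so it can reason over it and decide whether to call another tool or answer — the part Command R+ is trained to do reliably rather than by prompting trick.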
Benchmarks at release
Cohere’s published numbers are sparser than many competitors’ — the HF model card publishes MMLU at 75.7 (from the Open LLM Leaderboard evaluation), and Cohere positions the model primarily via RAG-specific and tool-use-specific benchmarks rather than the standard reasoning suite. From the Hugging Face model card:
- MMLU: 75.7 — respectable but not frontier, and well below what the 104B parameter count might suggest if you were expecting pure reasoning gains
- Multilingual: Cohere’s own evaluations show strong cross-lingual transfer on the 10 supported languages, though the comparison set varies across languages
- RAG benchmarks: Cohere publishes RAG-specific evaluations showing Command R+ leading open-weight peers on faithful citation and multi-hop retrieval tasks
- Tool use: strong performance on ToolBench-style evaluations where multi-step agent workflows matter
The honest read: Command R+ is not a pure-reasoning frontier model. A 104B dense trained for general reasoning would likely score higher on MMLU. Cohere has deliberately traded some general reasoning capacity for specialized training on RAG, tool use, and multilingual workflows. For the use cases the model was designed for, it’s at the top of the open-weight heap today. For generic chat or reasoning benchmarks, Llama 3 70B (expected any week now) will likely be a stronger choice.
Sovereign pleb implications — honest VRAM reality
This is not a consumer-home-rig model for most plebs. 104B dense parameters is serious hardware. The VRAM math:
- fp16: ~208GB — multi-node or multi-A100 territory
- Q8: ~104GB — 2x A100 80GB, or an 8x RTX 3090 rig with careful tensor parallelism
- Q4_K_M (GGUF): ~60GB — dual 24GB 3090s (48GB total) with CPU offload, or a 4x 24GB rig without. Mac Studio M2 Ultra 128GB runs this at Q4 with unified memory.
- Q3/Q2 (aggressive quant): ~35–50GB — possible on a 2x 24GB rig, but quality degrades and speed drops
See the GGUF quant guide for trade-offs. At 104B, Q4 is a meaningful quality hit compared to higher precision — not devastating, but noticeable on complex RAG workflows where the model needs to reason carefully across retrieved documents.
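The arithmetic behind those numbers is simple enough to sanity-check yourself: parameters × bits-per-weight ÷ 8, plus a cushion for KV cache and activations. A rough back-of-envelope calculator (the effective bits-per-weight figures for the K-quants are approximations):

```python
# Rough weights-only VRAM estimate: params * bits-per-weight / 8 bytes.
# KV cache and activations add more on top (often 5-20 GB at long
# context), so treat these as floors, not exact requirements.
BITS_PER_WEIGHT = {
    "fp16": 16.0,
    "q8_0": 8.5,     # GGUF Q8_0 carries a small scale overhead
    "q4_k_m": 4.85,  # approximate effective bits for Q4_K_M
    "q2_k": 2.6,     # aggressive quant, noticeable quality loss
}

def weights_gb(params_b: float, quant: str) -> float:
    """Approximate weight footprint in decimal GB for params_b billions."""
    return params_b * 1e9 * BITS_PER_WEIGHT[quant] / 8 / 1e9

for q in BITS_PER_WEIGHT:
    print(f"104B at {q}: ~{weights_gb(104, q):.0f} GB")
```

Running this reproduces the ballpark figures above: ~208 GB at fp16, ~60–65 GB at Q4_K_M, mid-30s at the aggressive quants.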
The practical pleb deployment patterns:
- Mac Studio M2 Ultra 128GB: runs Command R+ at Q4 comfortably via Ollama or LM Studio. The Apple Silicon path is genuinely attractive for this model — unified memory and reasonable power draw, at the cost of slower inference than a dedicated GPU rig.
- Multi-GPU workstation (2x A6000 or 4x 3090): the serious pleb rig. 96GB of combined VRAM runs Command R+ at Q5/Q6 with headroom (Q8 at ~104GB still doesn’t fit). This is Hashcenter-scale hardware, not desktop territory.
- Cloud A100 on demand: the realistic compromise for plebs who need Command R+ occasionally but don’t want to buy the hardware. Hourly rentals make experimentation affordable.
For most plebs, the honest answer is: Command R (the 35B sibling, same CC-BY-NC license) is the model to run, not Command R+. Command R at Q5 fits on a single 24GB card, delivers most of the RAG-native capabilities at meaningfully lower VRAM cost, and leaves headroom for other workloads. Command R+ is the model for plebs who’ve made the multi-GPU jump or bought a Mac Studio specifically for local inference.
The CC-BY-NC license is the other big consideration. Commercial deployment requires a Cohere license — you cannot legally run Command R+ as the backbone of a paid product without negotiating terms. For research, personal use, and non-revenue projects, CC-BY-NC is fine. For any commercial Hashcenter deployment, this is a blocker. Mistral’s Apache 2.0 releases or later Llama releases (with their own quirks around acceptable use) are the permissive-license alternatives.
Real-world verdict for plebs
Command R+ is worth the VRAM if all of the following apply:
- You’re running a serious RAG workload where citation quality matters (legal research, technical documentation assistants, audit trails)
- You have 48GB+ of VRAM available or run Apple Silicon with 128GB+ unified memory
- Your use case is non-commercial or you have a Cohere license
- Multilingual coverage across Cohere’s 10 languages is a meaningful requirement rather than a nice-to-have
It’s not the right choice if:
- You’re doing general chat, coding, or reasoning work — smaller general-purpose models (Llama-class 70B, Mistral’s releases) will serve better per watt
- You only have a single 24GB card — Command R 35B is the sensible choice at that scale
- You’re building a commercial product on open-weight infrastructure — the non-commercial license is a real constraint
- Your retrieval corpus is small enough that context-window stuffing beats RAG — smaller long-context models can handle that directly
For inference-as-heater builds, 104B dense at Q4 on a multi-GPU rig is a substantial sustained heat load — 600–900W continuous for a 2x 3090 + A6000 configuration running Command R+ at reasonable throughput. That’s meaningful space heating. For ASIC-to-AI Hashcenter conversions serving non-commercial internal workloads, Command R+ is a credible flagship model, though Llama 3 (arriving any week now) will likely take that slot on pure merit once it lands.
How to run it today
Weights are on CohereForAI/c4ai-command-r-plus. Ollama registry entry is live:
ollama pull command-r-plus
New to Ollama? The 10-minute Ollama install guide covers setup. For chat UI, Open WebUI pairs with Ollama cleanly. LM Studio loads GGUF quants directly — Bartowski’s Q4 and Q5 quants appeared on HF within hours of release. For production deployments targeting the RAG-native capabilities specifically, Cohere’s own deployment documentation and the Cohere Python SDK are the canonical path (though that’s more relevant for Cohere-hosted API customers than self-hosters).
Hitting issues? The self-hosted AI troubleshooting guide covers multi-GPU tensor parallelism and large-model loading problems.
What comes next
Cohere’s open-weight release cadence has been roughly quarterly for Command R, and Command R+ continues that pattern. Expect a Command R/R+ refresh cycle over the next year. More immediately: the RAG-specific workloads the model was optimized for are a small market today but growing fast, and Cohere’s bet is that enterprise customers will increasingly want open-weight options for sovereignty and control. For plebs, that’s useful — enterprise demand funds the training runs that produce the open-weight artifacts we benefit from.
Bigger picture: Command R+ is one more layer of decentralization in the open-weight landscape, specifically in the RAG and tool-use segment where general-purpose models have historically underperformed. The CC-BY-NC license is the clear limitation — it’s decentralization-lite, not the full Apache 2.0 story — but for plebs whose work is non-commercial, Command R+ is a capability step forward. See the Sovereign AI for Bitcoiners Manifesto for the case, related retrospectives Mixtral 8x7B (the sparse-MoE alternative at a different scale) and Mistral 7B (Apache-2.0 baseline) for the license-landscape comparison, and Used RTX 3090 for LLMs plus Bitcoin space heater for the hardware side of running anything this large at home. Pull the weights if you have the rig; reach for Command R if you don’t.
Benchmarks tracked
Last benchmarked: April 4, 2024
| Benchmark | Score | Source | Measured |
|---|---|---|---|
| MMLU | 75.7 | vendor blog | April 4, 2024 |
Recommended hardware
Multi-GPU rig or cloud territory. For most plebs, Command R (the 35B sibling) is plenty.
Get it running
1. Install Ollama → Ten-minute local LLM runtime. One binary, zero cloud.
2. Give it a web UI → Open-WebUI turns Ollama into a self-hosted ChatGPT.
3. Understand quantization → GGUF Q4/Q8/FP16 — which weights fit your GPU, explained.
Further reading: the Sovereign AI for Bitcoiners Manifesto for why sovereign inference matters, and From S19 to Your First AI Hashcenter for repurposing your mining rack into a Hashcenter that runs models like this one.
