DeepSeek DualPath Explained: The Storage Bottleneck Behind Open-Weight AI Sovereignty

In February 2026 a team from DeepSeek-AI, Peking University and Tsinghua University quietly published a paper that almost nobody outside the inference-engineering world noticed. It is not a new model. It is not a chatbot. It is plumbing — the kind of unglamorous datacenter plumbing that decides whether the open-weight AI you can actually download stays cheap enough to serve at scale. The paper is called DualPath: Breaking the Storage Bandwidth Bottleneck in Agentic LLM Inference (arXiv 2602.21548), and it is a small, important piece of why the open side of the AI world keeps closing the gap with the rented, closed side.

The short answer: DualPath (arXiv 2602.21548) is a datacenter serving system from DeepSeek-AI, not a model. It fixes a storage-bandwidth bottleneck in long, multi-turn agent inference by loading KV-cache through idle decode-side network cards, reporting up to 1.87x offline and 1.96x average online throughput versus DeepSeek’s own internal baseline. The sovereignty win is the open research plus MIT open weights — not running this at home.

Key takeaways

  • DualPath is infrastructure, not a model. It is a scale-out serving system for agentic LLM inference — the layer that runs an already-trained model for thousands of users at once.
  • The bottleneck it attacks is storage I/O, not GPU compute. In long, multi-turn agent workloads the slow part is shuttling the KV-cache in and out of storage, and one half of the cluster’s network cards sit idle while the other half drowns.
  • The trick is a second load path. DualPath also pulls cache through the decode engines’ idle storage NICs, then hands it to the prefill engines over fast RDMA — with a scheduler choosing the path live.
  • The measured gains: up to 1.87x offline and an average 1.96x online throughput — but always against DeepSeek’s own unmodified internal framework (“Basic”), not a public baseline.
  • Why a sovereign Bitcoiner should care: open research + MIT-licensed open weights means the model you download can be owned, audited and run forever — it cannot be geofenced or sunset out from under you the way rented API access can.

The problem in plain terms: storage is the wall, not the GPU

If you have only ever used AI through a chat box, the mental model is simple: you type, a big expensive GPU thinks, you get an answer. That model is fine for a single short question. It falls apart the moment you point the same machinery at a long-running agent — an AI that reads a giant codebase, calls tools, takes twenty turns, and keeps the whole conversation in its head the entire time. At that scale, the GPU is often not the thing you are waiting on. You are waiting on storage.

To see why, you need two pieces of jargon, and they are worth learning because they explain almost everything about how modern inference behaves.

KV-cache: the model’s working memory

When a language model reads your prompt, it computes a pair of internal vectors for every token at every attention layer: a “key” and a “value” that summarise what that token means in context. Collectively this is the KV-cache. The cache is what lets the model generate the next word without re-reading the entire conversation from scratch each time. The catch: the cache grows with context length. A million-token agent session (DeepSeek-V4’s default context is a full 1,000,000 tokens) produces an enormous KV-cache, far too big to keep live in GPU memory across thousands of concurrent users. So it gets written out to fast distributed storage and read back when that conversation comes around again.
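
To get a feel for how the cache scales, here is a back-of-the-envelope sketch. It uses the textbook formula for a plain attention cache: one key and one value vector per token, per layer, per KV head. The model dimensions are hypothetical placeholders, not a DeepSeek configuration, and DeepSeek's own attention designs compress the cache well below this. Treat the output as an illustration of linear growth with context length, not as a real footprint.

```python
def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bytes_per_value=2):
    """Textbook KV-cache size: one key and one value vector per token, layer and KV head."""
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_value  # 2 = key + value
    return tokens * per_token

# Hypothetical model shape, chosen only for illustration (not a DeepSeek config).
LAYERS, KV_HEADS, HEAD_DIM = 60, 8, 128

for tokens in (8_000, 128_000, 1_000_000):
    gib = kv_cache_bytes(tokens, LAYERS, KV_HEADS, HEAD_DIM) / 2**30
    print(f"{tokens:>9,} tokens -> {gib:8.1f} GiB per session")
```

Doubling the context doubles the cache. Multiply that by thousands of concurrent sessions and it becomes clear why the cache lives on storage rather than in GPU memory.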

Here is the mining-world analogy. Think of the GPU as your ASIC hashboard — the thing doing the actual work — and the KV-cache as a stack of share-history and pool-state you have to load before the board can do anything useful. If the board can crunch hashes far faster than you can feed it the state it needs off the disk, you are no longer compute-bound. You are I/O-bound. The expensive silicon idles while it waits for bytes to arrive over the wire. That is precisely the situation DeepSeek measured in long agentic inference: throughput is dominated by KV-cache storage I/O, not by compute.
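
The compute-bound versus I/O-bound distinction can be sketched in a few lines. This is a toy model under loud assumptions: the request size, link speeds and compute time are made-up numbers, and it pretends loading and compute overlap perfectly, so the slower of the two sets the pace. The function name `bottleneck` is ours, for illustration only.

```python
def bottleneck(cache_gb, storage_gbit_per_s, compute_s):
    """Classify one request, assuming loading and compute overlap perfectly."""
    load_s = cache_gb * 8 / storage_gbit_per_s  # GB -> gigabits, then divide by link rate
    kind = "I/O-bound" if load_s > compute_s else "compute-bound"
    return max(load_s, compute_s), kind

# Hypothetical request: 50 GB of cached context, 0.4 s of GPU work.
for link in (400, 800):  # a single link vs. double the storage bandwidth, in Gbit/s
    seconds, kind = bottleneck(50, link, 0.4)
    print(f"{link} Gbit/s -> {seconds:.2f} s, {kind}")
```

In this toy case a faster GPU changes nothing; only more storage bandwidth shortens the request.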

Prefill vs decode, and why they get split apart

Inference happens in two phases. Prefill is the bulk read: the model ingests the whole prompt (or reloads a conversation’s KV-cache) in one big parallel pass. Decode is the trickle: the model emits the answer one token at a time, each step depending on the last. These two phases have completely different appetites — prefill is a bandwidth sprint, decode is a long, latency-sensitive jog — so large serving systems now disaggregate them, running prefill on one pool of GPUs and decode on another. This is “PD separation,” and it is standard practice at the frontier.

Disaggregation is good for compute efficiency, but it creates an ugly asymmetry in the storage network. When a returning conversation needs its KV-cache reloaded, that load lands on the prefill engines. Their storage network cards (NICs) get saturated hauling massive caches off persistent storage. Meanwhile the decode engines — busy emitting tokens, but not pulling big caches off disk — have storage NICs sitting almost idle. One side of the cluster is choking; the other side’s pipes are empty. That asymmetry is what caps system throughput. You bought all this network bandwidth and you are only allowed to use half of it.

How DualPath works: open the second lane

DualPath’s core idea is almost embarrassingly intuitive once the problem is framed correctly: if the prefill engines’ storage NICs are jammed and the decode engines’ storage NICs are idle, route some of the traffic through the idle ones. The cleverness is in making that physically work without breaking the prefill/decode split that exists for good reasons.

The system runs two load paths and chooses between them dynamically:

  • Path 1 — Storage → Prefill (the traditional route). The KV-cache for already-seen (“hit”) tokens is read directly from persistent storage into the prefill engine’s buffer. This is the conventional path, and when the prefill-side pipes have headroom, it is the right one.
  • Path 2 — Storage → Decode → Prefill (the novel route). Instead of forcing every cache read through the congested prefill NICs, DualPath loads the KV-cache into the decode engines using their otherwise-idle storage NICs, then transfers it across to the prefill engines over high-bandwidth RDMA on the compute network. RDMA — remote direct memory access — lets one machine read another’s memory over InfiniBand without bothering the CPUs, so this hand-off is fast.

An adaptive scheduler sits on top and decides, in real time, which path each load should take — balancing storage-NIC queue lengths and GPU compute load so neither side of the cluster becomes the bottleneck. It is, in effect, a load balancer for KV-cache traffic that finally lets the system use all of its network bandwidth instead of half.

A simple text diagram of the asymmetry DualPath fixes:

BEFORE (PD-disaggregated, single path)
  Persistent storage (3FS SSDs)
        |  cache reads
        v
  [PREFILL NICs] ===> SATURATED  (bottleneck)
  [DECODE  NICs] ---> idle        (wasted bandwidth)

AFTER (DualPath, two paths + scheduler)
  Persistent storage (3FS SSDs)
        |               |
        v               v
  [PREFILL NICs]   [DECODE NICs]  <- both now carry cache
        ^                |
        |   RDMA over    v
        +---- compute network -----+
        (scheduler picks the path live)
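
The scheduling idea can be sketched as a greedy rule: send each cache load down whichever path would finish sooner. This is not the paper's algorithm. `NicQueue`, `pick_path` and `rdma_penalty` are hypothetical names, the cost model looks only at queued bytes (the real scheduler is described as also weighing GPU compute load), and the queues never drain in this toy run.

```python
from dataclasses import dataclass

@dataclass
class NicQueue:
    """Bytes already waiting on one side's storage NICs (hypothetical bookkeeping)."""
    name: str
    queued_bytes: int = 0

def pick_path(load_bytes, prefill, decode, rdma_penalty=1.1):
    """Send the load down whichever path has the lower estimated cost.

    Cost is a stand-in: bytes ahead in the queue plus this load. The decode
    path is scaled by rdma_penalty to account for the extra hop to prefill.
    """
    direct_cost = prefill.queued_bytes + load_bytes
    detour_cost = (decode.queued_bytes + load_bytes) * rdma_penalty
    chosen = prefill if direct_cost <= detour_cost else decode
    chosen.queued_bytes += load_bytes
    return chosen.name

prefill = NicQueue("storage->prefill")
decode = NicQueue("storage->decode->prefill")
loads = [40, 40, 40, 40, 40, 40]  # six equal cache loads, arbitrary units
choices = [pick_path(b, prefill, decode) for b in loads]
print(choices)
print(prefill.queued_bytes, decode.queued_bytes)
```

With equal loads the rule alternates between the two paths, so both sets of NICs end up carrying traffic instead of one side queueing everything.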

The numbers — and the honest asterisk

DeepSeek reports up to 1.87x improvement in offline inference throughput and an average 1.96x improvement in online serving throughput, without SLO violations (SLO = the latency service-level the system promises users; the point is the speed-up does not come at the cost of slower responses).

Now the asterisk, which matters and which we will not bury: both numbers are measured against “Basic” — DeepSeek's own unmodified internal inference framework, not against an external or public baseline. So the honest framing is “up to 1.87x offline / average 1.96x online versus DeepSeek's own internal baseline.” It is a real, useful engineering gain on a real cluster; it is not a claim that DualPath is 1.96x faster than whatever you are running today. Anyone who rounds this to “DeepSeek doubled inference speed” has dropped the part that makes it trustworthy.

The hardware it actually runs on

This is firmly datacenter territory. Each node carries 8 NVIDIA Hopper GPUs with dual 400 Gbps NICs, wired together with InfiniBand and RDMA, backed by distributed SSD storage (DeepSeek's 3FS). Large-scale experiments reach up to roughly 1,152 GPUs. The published per-configuration split codes (things like “48P96D” for the offline run) describe how the cluster is divided between prefill and decode; the takeaway that matters is that the evaluation ran at well over a thousand GPUs. The point for our purposes: nobody runs DualPath on a home rig. It is the serving layer of a fleet.
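
A split code like “48P96D” is easy to take apart. The sketch below parses it and adds one piece of arithmetic under a stated assumption: this article does not spell out whether the two numbers count GPUs or whole nodes. If they count 8-GPU nodes, 144 nodes works out to 1,152 GPUs, which lines up with the headline figure. That reading is our inference, not a quote from the paper, and `parse_split` is a hypothetical helper.

```python
import re

def parse_split(code):
    """Parse a split code like '48P96D' into (prefill_count, decode_count)."""
    match = re.fullmatch(r"(\d+)P(\d+)D", code)
    if match is None:
        raise ValueError(f"not a prefill/decode split code: {code!r}")
    return int(match.group(1)), int(match.group(2))

GPUS_PER_NODE = 8  # from the hardware description above

prefill, decode = parse_split("48P96D")
units = prefill + decode
print(f"prefill={prefill} decode={decode} total={units}")
# ASSUMPTION: if each unit is a full 8-GPU node, this is the implied GPU count.
print(f"GPUs if units are nodes: {units * GPUS_PER_NODE}")
```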

One caveat worth stating plainly, as with any systems-research result: these gains were demonstrated on the specific cluster DeepSeek tested — the Hopper nodes, the InfiniBand/RDMA fabric, and the 3FS distributed storage described above. A serving optimization whose whole trick is rebalancing traffic across network cards is only ever as good as the network it runs on; outside a comparably provisioned datacenter, the neat multipliers should not be assumed to hold. That is normal, and worth knowing before anyone treats 1.96x as a law of nature.

The papers, and the V4 moment

DualPath did not arrive in a vacuum. It is one rung on a ladder of open DeepSeek research, and reading the ladder explains why the open-weight ecosystem keeps getting more capable.

DeepSeek-V3 (arXiv 2412.19437) is the foundation: a 671B-total / 37B-active Mixture-of-Experts model. “Mixture-of-Experts” means only a fraction of the network's parameters fire for any given token — 37 billion of the 671 billion — which is how a model that huge stays affordable to run. V3 introduced Multi-head Latent Attention (MLA) and the DeepSeekMoE design, auxiliary-loss-free load balancing, a Multi-Token Prediction objective, FP8 mixed-precision training, and 14.8T pre-training tokens. It is MIT-licensed.

DeepSeek-V3.2 (arXiv 2512.02556, released around December 2025) is the V3-family refresh DualPath was actually benchmarked on. Note that the DualPath paper states its size as 660B, not 671B; 671B is the older V3/V3.1 total. V3.2's headline feature is DeepSeek Sparse Attention (DSA), a near-linear approach to long context that uses a lightweight “lightning indexer” plus fine-grained token selection, so the model attends to the tokens that matter instead of paying a quadratic cost over the whole window. DualPath was also tested on a downscaled internal DeepSeek 27B and on dense Qwen2.5-32B, to show the technique generalises beyond one architecture.

And then, on April 24, 2026, the payoff: DeepSeek released and open-sourced the V4 preview. DeepSeek-V4-Pro is 1.6T total / 49B active; DeepSeek-V4-Flash is 284B total / 13B active — both Mixture-of-Experts, both with a 1 million-token default context across DeepSeek's official services. The official Hugging Face V4-Pro model card describes a hybrid compressed-attention scheme — CSA (Compressed Sparse Attention) + HCA (Heavily Compressed Attention) — and states that at 1M context V4-Pro requires only 27% of the single-token inference FLOPs and 10% of the KV cache compared with DeepSeek-V3.2. CNBC confirmed the open-source release that same day, noting developers can download, run locally, and modify it in most cases.
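
The total/active split quoted for these Mixture-of-Experts models is easier to compare as a percentage. A small sketch using only the parameter counts given above; `active_fraction` is just a name for the division.

```python
# (total parameters, active parameters per token), in billions, as quoted above.
MODELS = {
    "DeepSeek-V3": (671, 37),
    "DeepSeek-V4-Pro": (1600, 49),
    "DeepSeek-V4-Flash": (284, 13),
}

def active_fraction(total_b, active_b):
    """Share of the model's parameters used for a single token."""
    return active_b / total_b

for name, (total_b, active_b) in MODELS.items():
    print(f"{name:<18} {active_b:>3}B of {total_b:>4}B active = {active_fraction(total_b, active_b):.1%}")
```

In every case only a few percent of the weights do work on any given token. All of the weights still have to be stored somewhere, though, which is why the memory figures for self-hosting track the total count rather than the active one.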

One honest boundary, because precision is the whole point of a reference site: DualPath is a serving-side paper about V3.2 and Qwen, not about V4. The DualPath paper does not mention V4, CSA, HCA, or those 27%/10% efficiency figures. We are placing them side by side because they are part of the same open-research arc — serving systems and model architectures pushing the cost of open AI down together — not because DualPath produced V4's numbers. Conflating them would be exactly the kind of error this site exists to correct.

Why this is an AI-sovereignty story

Here is the part that actually matters to a sovereign Bitcoiner, and it has nothing to do with whether 1.96x is the right multiplier.

There are two ways to get access to a frontier AI model. You can rent it — call somebody's API, pay per token, and depend on that company keeping the lights on, keeping your jurisdiction supported, and not retiring the model you built on. Or you can own it — download the weights, a file, onto disk you control, and run it on hardware you control, forever. DeepSeek's entire stack — V4-Pro, V4-Flash, V3.2, V3.1, R1 — ships under the MIT license on both code and weights. Commercial use, modification, redistribution, fine-tuning, and distillation are all permitted, with no fee and no usage restriction. And research like DualPath is published openly, so the techniques that make these models cheap to serve are not a trade secret — they are in the literature for anyone to read and reproduce.

That ownership has four properties a rented API can never give you, and it is worth being precise rather than sloganeering about them:

  • Access that cannot be revoked. A downloaded weights file keeps working whether or not the company that made it still serves you.
  • Auditability. Open weights and open research mean the thing can be inspected, probed, and understood — not treated as a sealed box you have to trust.
  • Reproducibility. An open paper plus open weights means a result can be re-run and verified by someone other than the vendor.
  • Ownership. The file is yours. It can be air-gapped, archived, fine-tuned on your own data, and run a decade from now with no account, no subscription, and no permission.

Contrast that, factually and without theatrics, with the documented behaviour of rented closed models. This is not an attack on any lab — in several of these cases the lab had the least control of anyone in the chain — it is a description of the access model itself, which is a tenancy:

  • Export-control revocation. On June 12, 2026, the US government issued an export-control directive (reported June 14, 2026) ordering Anthropic to suspend all access to Claude Fable 5 and Mythos 5 for foreign nationals worldwide. Anthropic stated it had to abruptly disable both models for all customers to ensure compliance. We cover that event in depth in our cornerstone piece on the Claude Fable 5 ban; the ongoing status remains developing as of that reporting.
  • Ownership/identity-based bans. An Anthropic terms-of-service update on September 5, 2025 prohibits service to companies more than 50% owned by entities in China, Russia, Iran, or North Korea — regardless of where they physically operate. That is a shift from blocking by IP address to blocking by who owns you.
  • Regional restrictions. Both OpenAI and Anthropic publish supported-country lists; access from an unsupported region can get an account blocked.
  • Vendor-controlled deprecation. Rented models get sunset on a calendar. GPT-4.5 launched February 2025, had its API turned off in July 2025, and exits ChatGPT in June 2026; GPT-4o, 4.1, 4.1-mini and o4-mini were retired from ChatGPT on February 13, 2026; GPT-5.1 on March 11, 2026; o3 is slated to follow on August 26, 2026. A saved open-weight file does not get a retirement date.

This is the same thesis a Bitcoiner already holds about money, applied to compute: a permission is something that can be revoked, and the durable answer is to own the stack. You can read our fuller treatment of that argument in building sovereign AI in Canada and across the AI hub. DualPath fits into this story not as a sovereignty tool you use, but as evidence that the open side of the field is doing serious, frontier-grade systems work in public — which is exactly what keeps open weights worth owning.

What you can actually run at home (and what you can't)

Let us be scrupulously clear, because this is where hype usually creeps in. You cannot run DualPath. It is a thousand-GPU serving system. You also, realistically, cannot run the full flagship models at or near their native precision on home hardware: even at 4-bit quantization the full 671B-class model needs roughly 376–404GB of memory (sources vary), and the full-precision checkpoint on disk is around 715GB. That figure works out to about one byte per parameter, so it is an FP8-sized file; a 16-bit (BF16) copy of 671B parameters would be roughly 1.3TB. That is server and workstation territory, not a gaming PC. V4-Pro at 1.6T total is even further out of home reach.
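
The arithmetic behind figures like these is simple: parameters times bits per weight, divided by eight. The sketch below gives only a lower bound for the weights themselves. Real quantized builds usually come out larger, since they tend to keep some layers at higher precision, and running the model needs memory beyond the weights. `weight_gb` is an illustrative helper, not a function from any tool.

```python
def weight_gb(params_billion, bits_per_weight):
    """Lower-bound size of the weights alone, in decimal GB."""
    return params_billion * bits_per_weight / 8

for label, bits in (("16-bit", 16), ("8-bit", 8), ("4-bit", 4), ("2-bit", 2)):
    print(f"671B at {label}: at least {weight_gb(671, bits):,.0f} GB")
```

A checkpoint of around 700GB sits closest to the 8-bit row, and the gap between the 4-bit lower bound and the 376–404GB quoted above is the kind of overhead to expect in practice.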

What you can own and run is the part of the ecosystem built exactly for this. There are two honest tiers:

Tier 1 — the full model, quantized, on a high-memory box

Community quantization (notably Unsloth's dynamic quants) shrinks the full flagship enough to self-host on a single high-RAM machine, slowly. The 1.58/1.66-bit build lands around 162–170GB and can run on a single 24GB GPU plus 128GB system RAM with Mixture-of-Experts offloading. A 2-bit build is around 245–251GB and wants roughly 226GB of combined RAM+VRAM, generating at about 5 tokens/second — usable for batch work, not for snappy chat. The rule of thumb: your VRAM plus RAM should roughly equal the quant file size. Large unified-memory Apple Silicon machines are the cleanest single-box path for these big quants; reported throughput figures for them (for example, ~40 tokens/sec for a V3.2 int4 build on high-end Apple Silicon) should be treated as approximate, not spec.

Tier 2 — the distilled models (the realistic pleb tier)

This is where most people should actually live. DeepSeek distilled R1's reasoning into six dense models (1.5B, 7B, 14B and 32B on a Qwen2.5 base, and 8B and 70B on a Llama 3 base), trained on roughly 800k R1 reasoning samples. These run on ordinary tooling: Ollama, llama.cpp, vLLM, or SGLang. For Ollama and llama.cpp the usual starting point is a Q4_K_M GGUF. Honest VRAM tiers at Q4_K_M:

Distilled model | Approx. VRAM (Q4_K_M) | Realistic hardware
7B (Qwen2.5)    | ~6–8GB                | RTX 3060
8B (Llama 3)    | ~6–8GB                | RTX 3060
14B (Qwen2.5)   | ~12–16GB              | RTX 4070
32B (Qwen2.5)   | ~24GB                 | RTX 3090 / 4090 (best single-GPU reasoning)
70B (Llama 3)   | ~40–48GB              | Dual RTX 3090 / 64GB+ Apple Silicon
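
The table can be turned into a quick lookup. This sketch takes the upper end of each approximate VRAM range as the requirement, which is a deliberately conservative assumption on our part, and returns the largest distilled model that fits. `largest_fit` is a hypothetical helper, and the figures are the table's estimates rather than measured values.

```python
# (model, upper end of the approximate Q4_K_M VRAM range in GB), from the table above.
DISTILLS = [
    ("7B (Qwen2.5)", 8),
    ("8B (Llama 3)", 8),
    ("14B (Qwen2.5)", 16),
    ("32B (Qwen2.5)", 24),
    ("70B (Llama 3)", 48),
]

def largest_fit(vram_gb):
    """Return the biggest distilled model whose upper VRAM estimate fits, or None."""
    fitting = [name for name, need_gb in DISTILLS if need_gb <= vram_gb]
    return fitting[-1] if fitting else None

for vram in (6, 12, 16, 24, 48):
    print(f"{vram:>2} GB VRAM -> {largest_fit(vram)}")
```

Because it uses the top of each range, the lookup errs on the side of recommending a smaller model; a card near the bottom of a range may still run the next tier with a shorter context.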

If you want the full walkthrough — choosing a model, sizing VRAM, installing Ollama or llama.cpp, and the Canadian data-residency angle — we already wrote it. See our companion guide on running DeepSeek locally in Canada, and the tooling around it: the local-LLM VRAM calculator, the GPU comparison for local LLMs, and the local-LLM model database. This article is the “why the open ecosystem is strong” companion to that “how to run it” guide — we are not going to re-do the how-to here.

Where D-Central fits

We are a Bitcoin-mining shop, not an AI lab, and we are not going to pretend otherwise. What we care about — and have always cared about — is the same thing DeepSeek's open weights enable: owning your stack instead of renting it. A miner who runs their own node, holds their own keys, and hashes on hardware they control already understands the difference between a permission and a possession. Open-weight AI is one more layer of that same instinct: a model file you own behaves like cold storage; an API key behaves like an exchange account.

We stand on the shoulders of the people doing this work — DeepSeek-AI and the academic groups publishing in the open, the quant authors at Unsloth, the llama.cpp and Ollama maintainers, and everyone shipping open weights. None of that is ours and we would not claim it is. Our job is to translate it for the plebs: to tell you honestly what a 1.96x-vs-internal-baseline number means, what runs on your RTX 3090 and what needs a server, and where ownership genuinely beats convenience. If you want to see how this connects to the wider self-custody picture — Bitcoin, mesh, Nostr, local AI — start at the sovereignty hub, and if you are weighing whether your mining hardware can pull double duty, our breakdown of Bitcoin ASICs vs GPUs for AI compute is the place to start.

Frequently asked questions

Is DualPath a new DeepSeek AI model I can download?

No. DualPath (arXiv 2602.21548) is an inference-serving system — datacenter infrastructure for running models efficiently — not a model. There are no DualPath weights to download. The models you can download are separate releases like DeepSeek-V3.2 and the V4 preview, all MIT-licensed.

How much faster is DualPath, really?

DeepSeek reports up to 1.87x offline throughput and an average 1.96x online throughput, with no SLO (latency-promise) violations. The important caveat: both numbers are measured against DeepSeek's own unmodified internal framework, called “Basic,” not against an external or public baseline. So it is a genuine engineering gain on their stack, not a claim about how it compares to whatever you run.

What is a KV-cache, and why does it bottleneck inference?

The KV-cache is the model's working memory — the key/value vectors it stores for every token so it doesn't have to re-read the whole conversation each step. In long, multi-turn agent sessions the cache grows huge and gets written to and read from distributed storage. At scale, moving that cache in and out becomes the slow part, so throughput is limited by storage I/O rather than by GPU compute.

Can I run DualPath on my home rig?

No. DualPath runs on clusters of NVIDIA Hopper GPUs (8 per node, dual 400 Gbps NICs, InfiniBand/RDMA, distributed SSD storage), scaling to over a thousand GPUs — up to roughly 1,152. It is a fleet-scale serving optimization. The home-runnable part of the ecosystem is the quantized and distilled open weights, run via Ollama, llama.cpp, or vLLM.

Does DualPath have anything to do with DeepSeek-V4?

Only as part of the same open-research arc. The DualPath paper was benchmarked on DeepSeek-V3.2 (stated as 660B), a downscaled internal 27B, and Qwen2.5-32B — it does not mention V4, its CSA/HCA attention, or V4's efficiency figures. V4 (released and open-sourced April 24, 2026) is a separate model release. We present them together because both are open work pushing the cost of open AI down, not because one produced the other's results.

If the models are open-weight, why does any of this datacenter research matter to sovereignty?

Because open weights are only worth owning if the broader open ecosystem stays capable. Public, frontier-grade serving research like DualPath is evidence that the open side of AI is doing serious systems work in the open — which keeps open models competitive with rented closed ones. The sovereignty win is the combination: open research plus MIT-licensed open weights you can download, audit, and run with no possibility of revocation.

What can a regular person actually run instead?

The R1 distilled dense models are the realistic tier: 7B and 8B fit on an RTX 3060 (~6–8GB), 14B on an RTX 4070 (~12–16GB), 32B on an RTX 3090 or 4090 (~24GB), and 70B on dual 3090s or 64GB+ Apple Silicon (~40–48GB), all at Q4_K_M. For step-by-step setup, see our guide on running DeepSeek locally in Canada and our local-LLM VRAM calculator.
