Definition
RWKV (Receptance Weighted Key Value, pronounced "RwaKuv") is a sequence-modeling architecture that combines the parallelizable training of transformers with the cheap, constant-memory inference of recurrent neural networks. It replaces dot-product self-attention with a linear-attention formulation that can be written in two mathematically equivalent ways: a parallel "transformer" form for fast training across a whole sequence, and a recurrent form for token-by-token generation that carries a fixed-size state instead of a growing key-value cache.
Why sovereign operators care
Because RWKV inference needs constant memory and constant compute per token regardless of context length, it can run long-context language models on modest, self-owned hardware rather than rented datacenter GPUs. There is no KV cache to balloon as the conversation grows, which makes RWKV attractive for edge devices, home servers, and any setting where you want a capable model that you fully control. The architecture has been trained to 14 billion parameters and was, at release, the largest dense RNN ever trained, performing on par with similarly sized transformers.
How it is built
An RWKV model stacks identical residual blocks, each containing a time-mixing module (the RWKV analogue of attention, which mixes information across tokens) and a channel-mixing module (a feed-forward stage that mixes across feature dimensions). The receptance, weight, key, and value terms that name the architecture govern how much past state flows forward, giving the model a gated, decaying memory rather than the all-pairs comparison of softmax attention. Later generations (the project reached RWKV-7 "Goose") refine these update rules for stronger in-context recall.
RWKV sits alongside other efficient designs that trade quadratic attention for linear-time recurrence. For background on the broader family, see linear attention and gated linear attention.
In Simple Terms
RWKV (Receptance Weighted Key Value, pronounced “RwaKuv”) is a sequence-modeling architecture that combines the parallelizable training of transformers with the cheap, constant-memory inference of recurrent…
