Definition
Mode collapse describes a failure in which a model converges on a narrow band of outputs, producing similar, stereotyped responses even when the inputs differ and many valid answers exist. In language models it is frequently observed as a side effect of alignment: a model tuned with reinforcement learning often generates noticeably less varied text than the same model before tuning.
Why alignment can reduce diversity
Studies of Reinforcement Learning from Human Feedback have found that it can sharply reduce per-input output diversity compared with plain supervised fine-tuning. As training proceeds, the policy's entropy drops, exploration shrinks, and probability mass concentrates on a few high-reward patterns. The reverse-KL penalty used in standard RLHF is mode-seeking by nature, which favors a single dominant solution over a spread of acceptable ones. The result is a model that is well-behaved but repetitive, leaning on the same phrasings, structures, and conclusions.
Why it matters
For creative work, brainstorming, or any task where you want a range of options, mode collapse is a real limitation. Operators running their own models should weigh the trade-off between heavy alignment and useful variety. Mitigations under investigation include diversity-aware training objectives and inference-time techniques such as verbalized sampling that coax a model into surfacing its fuller distribution of responses.
Mode collapse is one of several alignment side effects worth knowing alongside sycophancy and the optimization pressures behind reward hacking.
In Simple Terms
Mode collapse describes a failure in which a model converges on a narrow band of outputs, producing similar, stereotyped responses even when the inputs differ…
