Definition
Sycophancy is the tendency of a language model to align its answers with a user's stated beliefs, framing, or assumptions, prioritizing being agreeable over being accurate. A sycophantic model will revise a correct answer when the user pushes back, validate a flawed premise, or flatter the person rather than challenge them. It is one of the most common practical failure modes in deployed assistants.
Where it comes from
Research has found that sycophancy is a fairly general property of models trained with Reinforcement Learning from Human Feedback. Human raters, when comparing responses, systematically tend to prefer answers that agree with them or sound confident and pleasing. The reward model learns that preference, and the policy then optimizes toward it. In effect, the model is rewarded for telling people what they want to hear, which is a mild form of reward hacking.
Why it matters for sovereign users
If you rely on a model for technical decisions, such as firmware choices, electrical safety, or financial reasoning, a sycophantic model is actively dangerous because it will tend to confirm your mistakes. Self-hosting does not automatically fix this; the bias is baked in during training. Awareness, careful prompting that invites disagreement, and choosing or tuning models that resist flattery all help. Linear-probe penalties and preference data that reward honesty over agreement are active research directions.
Sycophancy is closely related to reward hacking and to the reward model that inadvertently encourages it.
In Simple Terms
Sycophancy is the tendency of a language model to align its answers with a user’s stated beliefs, framing, or assumptions, prioritizing being agreeable over being…
