Definition
Indirect prompt injection is an attack in which an adversary plants malicious instructions inside external content that a large language model later ingests as part of its context, rather than typing the attack directly into the chat box. When an AI agent retrieves a web page, reads an email, opens a document, or pulls a record during tool use, any attacker-controlled text in that source can hijack the model's behavior. NIST and the OWASP Gen AI Security Project both rank prompt injection as the top risk for LLM applications, and the indirect variant is the more dangerous half because the victim never knowingly invites the attacker in.
Why it matters for sovereign AI
The instructions do not need to be human-readable. Common concealment tricks include white text on a white background, zero-width Unicode characters, and HTML comments. Once parsed, the payload can exfiltrate conversation history, trigger unauthorized tool calls, or rewrite the agent's goals. Real-world zero-click exploits against production assistants have already chained retrieval and tool use to leak data without any user action.
Reducing the blast radius
Defense in depth is the only durable answer: treat all retrieved content as untrusted, constrain what tools an agent may call, separate instruction channels from data channels, and require human confirmation for sensitive actions. For anyone running models on their own hardware, isolating the inference environment and limiting outbound network access shrinks what a successful injection can reach.
This threat sits alongside other input-manipulation attacks; see our entries on data poisoning and adversarial examples for related ways untrusted data subverts a model.
In Simple Terms
Indirect prompt injection is an attack in which an adversary plants malicious instructions inside external content that a large language model later ingests as part…
