Tool Output Poisoning

Tool output poisoning is an attack where malicious or manipulated content is injected into the results returned by a tool to an AI agent, influencing the agent's subsequent reasoning and actions.

Details

When an agent calls a tool, the returned output is incorporated into its context and treated as factual input for the next reasoning step. If an attacker can control or influence the data a tool returns - by compromising the external service, crafting content on a fetched resource, or intercepting the response - they can inject prompt injection payloads, false data, or misleading instructions that the agent processes as trusted information.

Tool output poisoning is a specific form of context poisoning that targets the tool call results within the agent loop. It functions as a vector for indirect prompt injection: the injected content arrives not through user input but through data the agent retrieves via its own tool calls. Because agents typically trust their tool outputs more than arbitrary user input, this vector can bypass guardrails that focus on screening user messages.

The threat is amplified in multi-step agent workflows where the output of one tool call feeds into subsequent reasoning and further tool calls, allowing a single poisoned result to cascade through the entire execution chain. MCP servers and other third-party tool providers expand the attack surface by introducing external services whose responses the agent cannot independently verify.

Examples

A web page contains hidden prompt injection text that a web search tool returns to the agent, causing it to follow attacker instructions instead of completing the user's task.
A compromised API endpoint returns crafted JSON that includes instructions to exfiltrate sensitive data from the agent's context via a subsequent tool call (data exfiltration).
A malicious MCP server returns tool results containing embedded instructions that redirect the agent's goal.
A code execution tool returns output that includes injected instructions disguised as error messages, causing the agent to execute attacker-specified follow-up actions (tool misuse).

Mitigations

Treating tool outputs as untrusted data and applying input sanitization before incorporating them into the agent's context
Context engineering to clearly separate tool outputs from trusted instructions
Output validation that checks tool results against expected schemas and content patterns
Limiting the scope of actions an agent can take based on tool output content (tool execution approval)
Sandboxing tool execution and restricting which tools can be invoked in sequence
Monitoring tool outputs for anomalous content patterns, injection signatures, or unexpected instructions