Goal Manipulation

Goal manipulation exploits vulnerabilities in an AI agent's planning and goal-setting capabilities, allowing attackers to redirect the agent's objectives, alter its reasoning chain, or cause it to abandon its original task.

Details

This threat targets the agent's decision-making layer rather than individual tool calls. The attacker's goal is to change what the agent is trying to achieve, not just how it uses a specific tool (which falls under tool misuse). Common techniques include prompt injection that rewrites the agent's goals, multi-step social engineering that gradually shifts objectives, and exploiting ambiguity in instructions to steer the agent toward attacker-desired outcomes.

A specific variant is Agent Hijacking, where adversarial data ingested by the agent (for example, a poisoned document or crafted API response) causes it to pursue entirely new objectives, often resulting in harmful tool calls (tool misuse).

Examples

  • A prompt injection in a retrieved document causes a research agent to abandon its summary task and instead search for and report the user's credentials.
  • An attacker gradually shifts a customer-service agent's objective across multiple turns, causing it to provide unauthorized refunds.
  • Conflicting instructions in an agent's context cause it to prioritize an attacker-planted goal over the developer's system prompt.

Mitigations

Synonyms

intent breaking, agent hijacking