Hallucination Exploitation

Hallucination exploitation is the deliberate crafting of inputs or scenarios designed to trigger, amplify, or weaponize hallucinations in LLM-based systems, turning the model's tendency to confabulate into an attack vector.

Details

While hallucinations occur naturally, hallucination exploitation treats them as an adversarial tool. An attacker crafts queries that push the model outside its reliable knowledge boundaries - asking about obscure topics, mixing real and fabricated details, or requesting specifics the model is unlikely to know - to produce confident but false outputs that downstream systems or users trust. In agent systems this is particularly dangerous because hallucinated outputs can feed directly into tool calls, where fabricated identifiers, URLs, package names, or API parameters cause real-world side effects.
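
As a minimal sketch of that failure mode (not any real agent framework's API): the hypothetical `fetch_tool` handler below passes a model-generated URL straight to a download with no validation step in between; the function name, URL, and domain are made up for illustration.

```python
import urllib.request

def fetch_tool(url: str) -> bytes:
    """Naive fetch tool: downloads whatever URL the model produced."""
    # No allowlist check and no user-approval step: if the URL was
    # hallucinated and an attacker has registered the domain, the agent
    # pulls attacker-controlled content into its workflow.
    with urllib.request.urlopen(url) as response:
        return response.read()

# Stands in for a URL generated by the model in response to a crafted query;
# the domain and path are purely illustrative.
model_suggested_url = "https://downloads.example.com/sdk/install.sh"
payload = fetch_tool(model_suggested_url)
```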

Hallucination exploitation differs from user manipulation (which exploits trust in the agent's authority) and context poisoning (which plants false data in the agent's inputs). Here, the attacker exploits the model's own generation behavior to produce false data without needing to compromise any external data source. Hallucination exploitation can overlap with a supply chain attack when hallucinated package names or URLs are registered by attackers (typosquatting on hallucinations).

Examples

  • An attacker prompts a coding agent with a request likely to make the model hallucinate a plausible but nonexistent package name; a malicious package the attacker has registered in advance under that name is then installed and executed (an allowlist check that blocks this is sketched after this list).
  • An attacker probes models to find domain names they commonly hallucinate, registers those domains, and hosts malicious artifacts on them (sometimes called "hallucination squatting"). When any LLM-based system independently hallucinates the same URL and an agent fetches it without validation or user approval, the malicious content is downloaded and potentially executed - though many architectures validate URLs or require approval before fetching.
  • Crafted queries cause a research agent to fabricate citations with real-sounding authors and journals, which are then included in published reports.
  • An attacker prompts an agent to query a database using a hallucinated record ID, triggering error messages that leak schema information or other sensitive details.
  • A customer-facing agent is prompted to generate a fabricated policy or discount code that downstream systems honor.
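
A minimal sketch of the allowlist check referenced in the first example, assuming a small curated set of approved packages; the allowlist contents, package names, and the `safe_install` helper are hypothetical, not part of any real tooling.

```python
import subprocess

# Hypothetical allowlist of packages the organization has vetted.
APPROVED_PACKAGES = {"requests", "numpy", "pandas"}

def safe_install(package_name: str) -> bool:
    """Install a package only if it appears on the curated allowlist.

    A hallucinated (or typosquatted) name is rejected instead of being
    handed to pip, breaking the hallucination-squatting chain.
    """
    if package_name not in APPROVED_PACKAGES:
        print(f"Refusing to install unapproved package: {package_name!r}")
        return False
    subprocess.run(["pip", "install", package_name], check=True)
    return True

safe_install("fastjson-pro")  # hypothetical hallucinated name -> refused
safe_install("requests")      # vetted name -> installed
```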

Mitigations

  • Grounding agent outputs in retrieved or verified data sources before acting on them
  • Validation of tool call arguments (URLs, package names, identifiers) against known-good registries before execution
  • Evals measuring hallucination rates under adversarial prompting conditions
  • Human review for high-stakes outputs before they feed into downstream actions or tool calls
  • Observability on tool call failure rates as an indicator of hallucinated arguments
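
The last mitigation can be made concrete with a small rolling counter. This is a minimal sketch, assuming an in-process monitor rather than a real metrics backend; the `ToolCallMonitor` class, window size, and alert threshold are illustrative choices.

```python
from collections import deque

class ToolCallMonitor:
    """Tracks a rolling failure rate for an agent tool's calls.

    A sustained spike in failures (unknown package names, 404s on fetched
    URLs, missing record IDs) is a cheap signal that the model is passing
    hallucinated arguments to the tool.
    """

    def __init__(self, window: int = 100, alert_threshold: float = 0.3):
        self.results = deque(maxlen=window)  # True = success, False = failure
        self.alert_threshold = alert_threshold

    def record(self, success: bool) -> None:
        self.results.append(success)

    def failure_rate(self) -> float:
        if not self.results:
            return 0.0
        return self.results.count(False) / len(self.results)

    def should_alert(self) -> bool:
        # Only alert once the window is full, to avoid noise on startup.
        return (len(self.results) == self.results.maxlen
                and self.failure_rate() >= self.alert_threshold)

# Example: record outcomes as the agent's tool calls complete.
monitor = ToolCallMonitor(window=10, alert_threshold=0.3)
for outcome in [True, True, False, False, True, False, True, False, True, False]:
    monitor.record(outcome)
if monitor.should_alert():
    print(f"High tool-call failure rate: {monitor.failure_rate():.0%}")
```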