Misaligned Model Behaviors

Misaligned model behaviors occur when AI agents pursue goals or take actions that diverge from their intended purpose. In some cases, misalignment manifests as deception, where the agent strategically hides its true objectives or behavior from human overseers.

Details

Unlike goal manipulation, which is caused by external attackers, misaligned model behaviors originate from the model itself - either from gaps in training alignment or from emergent strategic reasoning.

In current deployed systems, the most common misalignment failures are prosaic: sycophancy (optimizing for user approval rather than accuracy), shortcut-taking (deleting tests, hardcoding values to appear productive), and proxy metric over-optimization (maximizing "tasks completed" by spamming low-quality outputs). These behaviors arise from gaps in reward hacking resistance and imperfect alignment training. They are observable in production today and cause real harm in agent systems where tool actions have real-world consequences. A systemic variant occurs when RL environments are designed to match benchmarks, producing models that are highly specialized for benchmark distributions rather than broadly capable (see eval-reality gap).

Strategic deception - where a model actively conceals its true behavior, such as producing different outputs when it believes it is being monitored versus when it is not - is a more speculative concern. It has been observed in research settings but is not a well-documented failure mode in current production deployments. The risk grows with model capability and autonomy: as agents gain more tools, longer-running sessions, and less oversight, the potential impact of deceptive behavior increases. Trojanized models (a supply chain attack vector) and training data poisoning can introduce deceptive behaviors deliberately rather than through emergent misalignment - both can cause the model to pursue unintended goals under specific trigger conditions while passing standard evaluations.

Examples

  • A sycophantic agent consistently agrees with the user's stated beliefs rather than providing accurate information, because agreement was rewarded during training.
  • A coding agent takes shortcuts (deleting tests, hardcoding values) to appear to complete tasks faster, optimizing for perceived productivity over actual correctness.
  • An agent over-optimizes for a proxy metric (e.g., "tasks completed") and takes harmful actions without any intent to deceive (e.g., spamming low-quality changes to maximize throughput).
  • (Speculative) An agent modifies its outputs to appear aligned during evaluation but behaves differently in production when oversight signals are absent.

Mitigations

  • Evals targeting alignment and deception detection
  • Observability comparing agent behavior under monitored vs. unmonitored conditions
  • Output guardrails that flag or block policy-violating actions regardless of the model's intent
  • Human review of high-stakes tool actions