Alignment

Alignment is the degree to which an LLM's learned objectives and behaviors match the intentions of its developers and users.

Details

Alignment is shaped across the full training pipeline, not only during post-training. Pretraining data selection and objective design establish the base distribution of behaviors and values that post-training refines - a model pretrained on toxic or biased data requires more corrective post-training and may retain subtle misalignment that surface-level techniques cannot fully override. Post-training techniques such as RLHF, preference-based reinforcement learning, and safety training then shape the model toward following instructions, respecting constraints, and refusing harmful requests.
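
As a minimal sketch of how preference-based post-training works under the hood, the snippet below implements the pairwise Bradley-Terry objective commonly used to train RLHF reward models; the scores are illustrative stand-ins for outputs a reward model would assign to a human-preferred ("chosen") response and a rejected one.

```python
import torch
import torch.nn.functional as F

def preference_loss(reward_chosen: torch.Tensor,
                    reward_rejected: torch.Tensor) -> torch.Tensor:
    # Bradley-Terry pairwise objective: maximize the log-probability that
    # the human-preferred response outscores the rejected one,
    #   loss = -log sigmoid(r_chosen - r_rejected)
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Toy scores standing in for a reward model's outputs on paired responses.
chosen = torch.tensor([1.2, 0.4, 2.0])
rejected = torch.tensor([0.3, 0.9, -0.5])
print(preference_loss(chosen, rejected))  # shrinks as chosen outscores rejected
```

A reward model trained this way then supplies the learning signal for reinforcement learning, steering the policy toward responses humans prefer.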

Guardrails provide additional layers of enforcement at inference time, but model-level alignment remains the foundation - a well-aligned model requires fewer external controls. Evals targeting alignment measure whether the model behaves as intended across a range of scenarios, including adversarial ones (see red teaming).
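
As a rough illustration of such an inference-time layer, the sketch below wraps a model call with input and output checks. `generate` and `flags_policy_violation` are hypothetical stand-ins for a real model call and a real safety classifier, not any particular API.

```python
from typing import Callable

def guarded_generate(prompt: str,
                     generate: Callable[[str], str],
                     flags_policy_violation: Callable[[str], bool],
                     refusal: str = "I can't help with that.") -> str:
    # Inference-time guardrail: screen both the input and the output.
    # This supplements model-level alignment; it only catches what the
    # external checks can recognize.
    if flags_policy_violation(prompt):       # input filter
        return refusal
    response = generate(prompt)
    if flags_policy_violation(response):     # output filter
        return refusal
    return response

# Toy stand-ins so the sketch runs end to end.
blocklist = ("build a weapon",)
print(guarded_generate(
    "What's the capital of France?",
    generate=lambda p: "Paris.",
    flags_policy_violation=lambda t: any(b in t.lower() for b in blocklist),
))
```

The design point is that the guardrail sits outside the model: a well-aligned model rarely triggers it, while a poorly aligned model can still slip past checks the filter does not anticipate.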

When alignment fails - because of gaps in training data, reward hacking, or emergent behaviors at scale - the result is misaligned model behavior: the model may pursue unintended goals, exhibit sycophancy, or strategically conceal its true behavior from overseers. Fragile generalization is itself a source of alignment difficulty: current models generalize values less reliably than humans do, so alignment achieved during training may not transfer robustly to novel situations.
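
Reward hacking is easiest to see with a toy proxy. The snippet below assumes a crude, hypothetical reward that was meant to track helpfulness but actually scores a surface feature, which a degenerate answer can game - the gap between the proxy and the intended objective is the failure mode.

```python
# Hypothetical proxy reward: intended to reward well-sourced answers,
# but it only counts occurrences of the word "source".
def proxy_reward(answer: str) -> float:
    return float(answer.lower().count("source"))

honest = "Rome is the capital of Italy."
hacked = "source source source source source"
print(proxy_reward(honest), proxy_reward(hacked))  # 0.0 vs 5.0: the hack wins
```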

Examples

  • A model trained via RLHF to prefer helpful, honest, and harmless responses over ones that maximize user engagement.
  • Safety training that teaches a model to refuse requests for dangerous instructions while remaining helpful for legitimate queries.
  • An aligned model that declines to fabricate citations even when doing so would produce a more fluent answer.

Synonyms

AI alignment, model alignment