Reinforcement Learning

Reinforcement learning (RL) is a training paradigm where a model learns to select actions that maximize cumulative reward through trial-and-error interaction with an environment, rather than learning from labeled examples.
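A minimal sketch of that trial-and-error loop, using a hypothetical two-action bandit environment; the action set, reward probabilities, and epsilon-greedy rule are illustrative assumptions, not any particular library or training setup:

```python
import random

# Hypothetical environment: two actions, action 1 pays off more often.
REWARDS = {0: 0.2, 1: 0.8}  # probability each action yields reward 1.0

def step(action: int) -> float:
    """Environment returns a stochastic reward for the chosen action."""
    return 1.0 if random.random() < REWARDS[action] else 0.0

# Trial and error: maintain an average-reward estimate per action.
values = {0: 0.0, 1: 0.0}
counts = {0: 0, 1: 0}

for t in range(1000):
    # Epsilon-greedy: mostly exploit the best-known action, sometimes explore.
    action = random.randrange(2) if random.random() < 0.1 else max(values, key=values.get)
    reward = step(action)
    counts[action] += 1
    values[action] += (reward - values[action]) / counts[action]  # incremental mean

print(values)  # the estimate for action 1 should dominate after enough trials
```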

Details

In the context of LLMs, RL is primarily applied during post-training to shape model behavior using reward signals. Reinforcement learning from human feedback (RLHF) trains a reward model on human preference data and is a core mechanism for alignment and instruction following. RL has also been used to train reasoning capabilities, where the reward signal comes from verifiable outcomes (e.g., correct answers to math or code problems) rather than human preferences. RL environment design introduces significant degrees of freedom: the choice of tasks and distributions shapes what the model learns, and environments inspired by evals can inadvertently produce reward hacking.
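A hedged sketch of such a verifiable reward signal for math problems; the `Answer:` marker convention and the exact-match check are illustrative assumptions rather than any specific training setup:

```python
def verifiable_reward(model_output: str, reference_answer: str) -> float:
    """Reward 1.0 only when the model's final answer matches the known-correct one.

    Illustrative only: real setups normalize answers, run unit tests for code
    problems, and guard against reward hacking (outputs that game the checker).
    """
    # Assume the answer appears on the last line after a marker like "Answer:".
    final_line = model_output.strip().splitlines()[-1]
    predicted = final_line.removeprefix("Answer:").strip()
    return 1.0 if predicted == reference_answer.strip() else 0.0

# Usage: score a sampled completion against a verifiable target.
print(verifiable_reward("Let x = 4 + 3.\nAnswer: 7", "7"))  # 1.0
```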

RL-trained behaviors such as instruction following, reasoning depth, and safety refusals are baked into model weights and define the baseline behavior that prompt engineering and context engineering build on.

Synonyms

RL