Pretraining

Pretraining is large-scale training of a model on broad data to learn general representations, producing a general-purpose base model. For LLMs, this typically means self-supervised next-token prediction over token sequences produced by a tokenizer. The temporal boundary of the pretraining data defines the model's knowledge cutoff.
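The next-token objective can be sketched as follows; this is a minimal illustration, assuming a hypothetical tokenizer has already mapped text to integer token ids. The key point is that the labels are derived from the data itself (self-supervision): the target sequence is simply the input shifted by one position.

```python
# One training sequence of token ids (hypothetical tokenizer output).
token_ids = [12, 7, 99, 3, 42]

# Self-supervised (input, target) pairs: the model reads tokens up to
# position i and is trained to predict the token at position i + 1.
inputs = token_ids[:-1]   # what the model conditions on
targets = token_ids[1:]   # what it must predict, shifted by one

for x, y in zip(inputs, targets):
    print(f"after token {x}, predict token {y}")
```

In practice a model outputs a probability distribution over the vocabulary at each position, and training minimizes the cross-entropy between that distribution and the shifted targets.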

Details

Later post-training phases such as fine-tuning and reinforcement learning adapt the base model to specific tasks, formats, or policies. Some workflows continue training an existing model on additional broad or domain data using the same pretraining objective ("continued pretraining"), which is often contrasted with fine-tuning even though both update weights.
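The contrast with continued pretraining can be made concrete with a toy stand-in for a language model: a bigram count table trained with the "same objective" (counting next-token pairs) first on broad data, then continued on domain data. The corpus names and helper are illustrative, not any real API.

```python
from collections import Counter

def pretrain_counts(corpus):
    """Toy stand-in for the pretraining objective: count
    (token, next_token) pairs across every sequence."""
    counts = Counter()
    for seq in corpus:
        counts.update(zip(seq[:-1], seq[1:]))
    return counts

# Broad pretraining data, then additional domain data (hypothetical).
broad_corpus = [[1, 2, 3], [2, 3, 4]]
domain_corpus = [[3, 4, 5]]

base = pretrain_counts(broad_corpus)
# Continued pretraining: the same objective, applied to new data,
# accumulated on top of the existing "weights" (here, counts).
continued = base + pretrain_counts(domain_corpus)
```

The point of the sketch is the workflow, not the model: continued pretraining reuses the pretraining objective on new data, whereas fine-tuning typically switches to a task-specific objective or supervised format.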

Synonyms

pre-training, base model training, foundation model training