Pretraining
Pretraining is large-scale training of a model on broad data to learn general representations, producing a general-purpose base model. For LLMs, this typically means self-supervised next-token prediction over token sequences produced by a tokenizer from a large text corpus. The temporal boundary of the pretraining data defines the model's knowledge cutoff.
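The next-token objective can be illustrated with a minimal sketch: each position in a token sequence serves as a target for the context that precedes it, so the training pairs come from shifting the sequence by one. The function name and toy token ids below are illustrative, not from any particular library.

```python
def make_next_token_pairs(token_ids):
    """Pair each token with the token that follows it (shift-by-one).

    In real pretraining the model predicts targets[i] from the full
    prefix inputs[:i+1]; this sketch just shows the data layout.
    """
    inputs = token_ids[:-1]
    targets = token_ids[1:]
    return list(zip(inputs, targets))

tokens = [5, 17, 42, 9]  # toy token ids from a hypothetical tokenizer
pairs = make_next_token_pairs(tokens)
# pairs == [(5, 17), (17, 42), (42, 9)]
```

Because the targets are derived from the data itself, no human labels are needed, which is what makes the objective self-supervised.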
Details
Later post-training phases such as fine-tuning and reinforcement learning adapt the base model to specific tasks, formats, or policies. Some workflows instead continue training an existing model on additional broad or domain-specific data using the same pretraining objective ("continued pretraining"); this is often contrasted with fine-tuning even though both update the model's weights.
Synonyms
pre-training, base model training, foundation model training