Synthetic Data

Synthetic data is training or evaluation data generated by AI models rather than collected from real-world sources.

Details

In LLM workflows, synthetic data is commonly produced by prompting a capable model to generate examples for a specific task, format, or domain. Uses include augmenting scarce training data for fine-tuning, generating eval datasets with known ground-truth answers, producing teacher outputs for distillation pipelines, and creating adversarial examples for red teaming. Synthetic data can also be used during post-training to scale preference data or instruction-response pairs beyond what human annotation can provide.
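The generation step described above can be sketched as a simple prompt-and-parse loop. This is a minimal illustration, not any particular library's API: `call_model` is a hypothetical stand-in for a real LLM call, stubbed here so the example runs offline, and the prompt template is an assumption.

```python
import json

def call_model(prompt: str) -> str:
    # Hypothetical stand-in for a real LLM API call; returns a canned reply
    # so this sketch is runnable without network access.
    return json.dumps({
        "instruction": "Summarize the text in one sentence.",
        "response": "The text describes synthetic data generation.",
    })

# Assumed prompt template asking the model for structured JSON output.
GENERATION_PROMPT = (
    "Generate one instruction-response pair for the domain: {domain}. "
    "Return JSON with keys 'instruction' and 'response'."
)

def generate_pairs(domain: str, n: int) -> list:
    """Prompt a model n times and parse each reply into a training example."""
    pairs = []
    for _ in range(n):
        raw = call_model(GENERATION_PROMPT.format(domain=domain))
        try:
            pair = json.loads(raw)
        except json.JSONDecodeError:
            continue  # drop malformed generations instead of crashing
        if {"instruction", "response"} <= pair.keys():
            pairs.append(pair)
    return pairs

pairs = generate_pairs("customer support", 3)
```

In a real pipeline the loop would batch requests, vary the prompt (e.g. seeding it with different topics or few-shot examples) to increase diversity, and feed the results into the quality-control stage.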

Quality control is critical: synthetic data can amplify model biases, introduce subtle errors, or cause "model collapse" when models are repeatedly trained on their own outputs. Effective pipelines filter, deduplicate, and validate synthetic examples, often using model-as-a-judge scoring, automated evals, or human review to maintain data quality.
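A filtering pass like the one just described can be sketched as exact deduplication followed by a judge-score threshold. This is a hedged illustration: `judge_score` is a hypothetical placeholder (a crude length heuristic here) standing in for a real model-as-a-judge call, and the 0.5 threshold is an arbitrary assumption.

```python
def judge_score(example: dict) -> float:
    # Hypothetical judge: a crude heuristic standing in for an LLM
    # scoring the example's quality on a 0.0-1.0 scale.
    return 1.0 if len(example["response"]) > 10 else 0.2

def filter_synthetic(examples: list, threshold: float = 0.5) -> list:
    """Drop exact duplicates, then keep examples the judge scores highly."""
    seen = set()
    kept = []
    for ex in examples:
        key = (ex["instruction"], ex["response"])
        if key in seen:
            continue  # drop exact duplicates
        seen.add(key)
        if judge_score(ex) >= threshold:
            kept.append(ex)
    return kept

raw = [
    {"instruction": "Greet the user.", "response": "Hello! How can I help you today?"},
    {"instruction": "Greet the user.", "response": "Hello! How can I help you today?"},  # duplicate
    {"instruction": "Greet the user.", "response": "Hi."},  # likely low judge score
]
clean = filter_synthetic(raw)
```

Production pipelines typically layer further checks on top of this, such as near-duplicate detection (e.g. via embeddings or MinHash) and format validation, before the data reaches training.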

Examples

  • Using a frontier model to generate thousands of instruction-response pairs for fine-tuning a smaller model.
  • Generating adversarial prompt-response pairs to build a safety eval dataset.
  • A distillation pipeline where the teacher model generates solutions to coding problems that the student model trains on.
  • Creating synthetic customer support conversations to fine-tune a domain-specific chatbot.

Synonyms

AI-generated data, model-generated data