Chunking
Chunking is the process of splitting documents or other content into smaller segments (chunks) so they can be individually converted to embeddings, indexed, and retrieved in a RAG pipeline.
Details
Embedding models have input length limits, and retrieval quality depends on each indexed unit carrying focused, coherent information. Chunking bridges the gap between arbitrarily long source documents and the fixed-size vectors stored in a vector database. Chunk size also affects generation: chunks included in the context at inference time consume tokens and must fit within the model's context window.
Common strategies include:
- Fixed-size chunking: splitting by token or character count.
- Semantic chunking: splitting at paragraph, section, or sentence boundaries.
- Recursive chunking: applying progressively finer splits until a target size is reached.
- Overlapping windows: repeating a portion of text across adjacent chunks to preserve boundary context.
Smaller chunks improve retrieval precision but lose surrounding context; larger chunks preserve context but dilute relevance signals and consume more of the generation budget.
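A minimal sketch of fixed-size chunking with overlapping windows. For simplicity it counts whitespace-separated words as tokens; a real pipeline would count tokens with the embedding model's own tokenizer, and the `chunk_size` and `overlap` values are illustrative:

```python
def chunk_fixed(text, chunk_size=512, overlap=64):
    """Split text into fixed-size chunks with overlapping windows.

    "Tokens" here are whitespace-separated words; production
    systems would use the embedding model's tokenizer instead.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    tokens = text.split()
    step = chunk_size - overlap  # stride between chunk starts
    chunks = []
    for start in range(0, len(tokens), step):
        window = tokens[start:start + chunk_size]
        chunks.append(" ".join(window))
        # Stop once this window reaches the end of the text.
        if start + chunk_size >= len(tokens):
            break
    return chunks
```

Because each chunk repeats the last `overlap` tokens of its predecessor, sentences that straddle a chunk boundary still appear intact in at least one chunk.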
Examples
- Splitting a knowledge base article into 512-token fixed-size chunks with 64-token overlap for embedding and retrieval.
- Using markdown headings to split a technical document into section-level chunks that each cover a coherent subtopic.
- Recursively splitting a long PDF first by section, then by paragraph, until every chunk falls below a target token count.
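The recursive strategy in the last two examples can be sketched as follows: split on the coarsest boundary first (here, markdown `## ` headings), then fall back to progressively finer separators until each chunk fits a target token budget. The separator list and budget are illustrative, and token counts are approximated by word count:

```python
def chunk_recursive(text, max_tokens=256, separators=("\n## ", "\n\n", ". ")):
    """Recursively split text on progressively finer boundaries
    until every chunk is at or below max_tokens (counted as words).

    If the separator list is exhausted, the oversized piece is
    returned as-is rather than cut mid-word.
    """
    if len(text.split()) <= max_tokens or not separators:
        return [text]
    sep, rest = separators[0], separators[1:]
    parts = text.split(sep)
    # Re-attach the separator so no text is lost at the boundaries.
    pieces = [parts[0]] + [sep + p for p in parts[1:]]
    chunks = []
    for piece in pieces:
        chunks.extend(chunk_recursive(piece, max_tokens, rest))
    return chunks
```

Re-attaching the separator keeps the concatenation of all chunks identical to the original text, so nothing is dropped during splitting.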
Synonyms
text chunking, document chunking, text splitting