Tokenizer

A tokenizer is the encoding/decoding scheme (an algorithm plus a vocabulary and special tokens) that converts raw text into token IDs for an LLM and maps those IDs back into text.

Details

Tokenizers define what counts as a single token, which determines how context-window size is measured, how text is represented during training (including pretraining), and how inference outputs are decoded back into strings. Different models can use different tokenization schemes (e.g., BPE, WordPiece, unigram), operating either on raw bytes or on Unicode characters.
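
As a concrete illustration, the following minimal Python sketch shows the encode/decode round trip. The vocabulary, greedy longest-match rule, and <unk> handling here are toy assumptions chosen for readability; real tokenizers learn their vocabularies and merge rules from data.

    # Toy tokenizer: the vocabulary and longest-match rule below are
    # illustrative assumptions, not any real model's scheme.
    TOY_VOCAB = {"<unk>": 0, "token": 1, "izer": 2, "s": 3, " ": 4,
                 "split": 5, "text": 6}
    ID_TO_TOKEN = {i: t for t, i in TOY_VOCAB.items()}

    def encode(text: str) -> list[int]:
        """Greedy longest-match encoding against the toy vocabulary."""
        ids, i = [], 0
        while i < len(text):
            for j in range(len(text), i, -1):       # longest candidate first
                if text[i:j] in TOY_VOCAB:
                    ids.append(TOY_VOCAB[text[i:j]])
                    i = j
                    break
            else:
                ids.append(TOY_VOCAB["<unk>"])      # no match: unknown token
                i += 1
        return ids

    def decode(ids: list[int]) -> str:
        """Map token IDs back to strings and concatenate."""
        return "".join(ID_TO_TOKEN[i] for i in ids)

    ids = encode("tokenizers split text")
    print(ids)          # [1, 2, 3, 4, 5, 4, 6]
    print(decode(ids))  # "tokenizers split text"

Seven token IDs cover this 21-character input, which is why context limits measured in tokens do not map one-to-one onto characters or words.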

Examples

  • Byte-pair encoding (BPE) tokenizers that split words into common subword pieces.
  • Special tokens used to separate chat messages or indicate end-of-sequence (see the sketch after this list).
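
A minimal sketch of how special tokens can delimit chat messages; the <|im_start|>/<|im_end|> markers and role names are illustrative, loosely modeled on ChatML-style templates rather than any particular model's format.

    # Hypothetical chat template: the special-token names are assumptions.
    def render_chat(messages: list[dict]) -> str:
        """Wrap each message in start/end special tokens before encoding."""
        parts = []
        for m in messages:
            parts.append(f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n")
        return "".join(parts)

    print(render_chat([
        {"role": "user", "content": "Hi"},
        {"role": "assistant", "content": "Hello!"},
    ]))

During encoding, such markers are typically mapped to single reserved token IDs rather than being split into subword pieces, so the model can reliably detect message boundaries and end-of-sequence.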

Synonyms

tokenization (the related process), text encoder (overlapping usage)