Transformer Architecture

The transformer architecture is a neural network design for sequence modeling built around attention (especially self-attention). It underlies virtually all current LLMs and embedding models, whose parameters are learned during pretraining.

Details

The attention mechanism lets each token in a sequence attend to every other token, which is what gives LLMs their context window: the model can reference any part of the input when generating each output token. Self-attention also processes all positions of a sequence in parallel (unlike recurrent architectures, which process tokens one at a time), which makes pretraining on large datasets and fast inference prefill practical on GPU hardware. Embedding models typically use the encoder component of the transformer to produce fixed-dimensional embeddings from variable-length inputs, as sketched below.
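
To make the mechanism concrete, here is a minimal NumPy sketch of single-head scaled dot-product self-attention. The matrices, dimensions, and random inputs are illustrative assumptions; real transformers add multiple heads, causal masking, positional information, and these layers are stacked many times.

    import numpy as np

    def self_attention(x, w_q, w_k, w_v):
        # Scaled dot-product self-attention over one sequence.
        # x: (seq_len, d_model) token representations
        # w_q, w_k, w_v: (d_model, d_head) projection matrices (learned in a real model)
        q, k, v = x @ w_q, x @ w_k, x @ w_v           # project to queries, keys, values
        scores = q @ k.T / np.sqrt(k.shape[-1])       # (seq_len, seq_len) pairwise affinities
        weights = np.exp(scores - scores.max(-1, keepdims=True))
        weights /= weights.sum(-1, keepdims=True)     # softmax: each token's attention over all tokens
        return weights @ v                            # each output row mixes every position's value

    # Toy usage: 5 tokens, model width 8, one head of width 4 (all sizes illustrative).
    rng = np.random.default_rng(0)
    x = rng.normal(size=(5, 8))
    w_q, w_k, w_v = (rng.normal(size=(8, 4)) for _ in range(3))
    out = self_attention(x, w_q, w_k, w_v)            # (5, 4): one row per input token
    # An embedding model would pool per-token outputs into a fixed-size vector,
    # e.g. a simple mean over positions:
    embedding = out.mean(axis=0)                      # (4,)

Note that the outputs for all positions come from the same few matrix products, with no dependence on processing earlier tokens first; this is the parallelism that distinguishes self-attention from recurrent, token-by-token computation.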

Synonyms

transformer, attention-based architecture