Transformer

The Transformer model (Vaswani et al., 2017) replaces recurrent layers with self-attention, letting each token attend to every other token in the context window in parallel, a computation that maps well onto GPUs.
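The core computation can be illustrated with a minimal sketch of scaled dot-product self-attention using NumPy. The function name and random projection matrices here are illustrative, not from any particular library; real implementations add multiple heads, masking, and learned parameters.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax: subtract the row max before exponentiating.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    # Project each token embedding into a query, key, and value vector.
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = Q.shape[-1]
    # One matrix multiply scores every token against every other token,
    # which is what makes the computation parallel rather than sequential.
    scores = Q @ K.T / np.sqrt(d_k)
    weights = softmax(scores, axis=-1)  # each row sums to 1
    # Each output is a weighted mix of all tokens' value vectors.
    return weights @ V

rng = np.random.default_rng(0)
n_tokens, d_model = 4, 8
X = rng.normal(size=(n_tokens, d_model))          # 4 token embeddings
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)
print(out.shape)  # (4, 8): one contextualized vector per token
```

Note that the attention matrix is n_tokens × n_tokens, which is why cost grows quadratically with context length.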

This design enabled scaling to billions of parameters and laid the foundation for large language models, multimodal variants, and much of modern deep learning. In practice, most users benefit more from solid prompt engineering than from understanding the underlying matrix operations.


Key characteristics

  • The architecture that made modern language models and many multimodal models practical at scale.
  • Uses self-attention to weigh relationships between tokens, processing sequences more efficiently than older recurrent models.
  • Explains why tokens, context windows, and scaling are central concepts in current AI systems.