AI Fundamentals

What Is a Transformer (AI)? Definition, Architecture, and Why It Changed AI

One-Sentence Definition

A transformer is a neural network architecture that uses self-attention to process entire sequences in parallel, enabling the massive language and vision models that define modern AI.

How It Works

The transformer was introduced in the 2017 Google paper "Attention Is All You Need." Before transformers, sequence models like RNNs processed tokens one at a time, creating a bottleneck for long texts. The transformer's key innovation is self-attention: every token in a sequence can directly attend to every other token, regardless of distance, in a single computation step.
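The "every token attends to every other token" step described above can be sketched as scaled dot-product self-attention in plain NumPy. This is a minimal illustration, not any specific model's implementation; the weight matrices and toy dimensions are arbitrary:

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """Scaled dot-product self-attention over a sequence x of shape (seq_len, d_model)."""
    q = x @ w_q  # queries: what each token is looking for
    k = x @ w_k  # keys: what each token offers
    v = x @ w_v  # values: the content each token carries
    d_k = q.shape[-1]
    # (seq_len, seq_len): every token scores every other token in one step
    scores = q @ k.T / np.sqrt(d_k)
    # Softmax each row so a token's attention over the sequence sums to 1
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

rng = np.random.default_rng(0)
seq_len, d_model = 5, 8
x = rng.standard_normal((seq_len, d_model))
w_q, w_k, w_v = (rng.standard_normal((d_model, d_model)) for _ in range(3))
out, weights = self_attention(x, w_q, w_k, w_v)
```

Note that `scores` is computed for all token pairs in a single matrix multiply, which is exactly what lets transformers process sequences in parallel rather than token by token.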

Here is the intuition. When the model reads the sentence "The cat sat on the mat because it was tired," self-attention lets the token "it" attend to both "cat" and "mat" at once and learn that "it" refers to "cat." This happens across multiple attention heads in parallel, each learning to track a different kind of relationship -- syntactic, semantic, or positional.

A transformer block stacks self-attention with feed-forward layers and normalization. Modern LLMs chain dozens or hundreds of these blocks. GPT-style models are decoder-only transformers: they predict the next token given all previous tokens. BERT-style models are encoder-only: they build rich representations of input text for classification and search. The original architecture was encoder-decoder, still used in translation and summarization models like T5.
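A toy decoder-style block under the conventions above might look like the following. It uses pre-norm residual connections and a causal mask (so each token only sees earlier tokens, as in GPT-style next-token prediction); all weights are random and the layout is a simplification, not a faithful reproduction of any production model:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each token's features to zero mean, unit variance."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def causal_self_attention(x):
    """Self-attention where token i may only attend to tokens j <= i."""
    seq_len, d = x.shape
    scores = x @ x.T / np.sqrt(d)
    # Mask out future positions with -inf so softmax gives them zero weight
    mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    scores = np.where(mask, -np.inf, scores)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ x

def transformer_block(x, w1, w2):
    """Pre-norm block: attention, then feed-forward, each with a residual add."""
    x = x + causal_self_attention(layer_norm(x))
    h = np.maximum(0, layer_norm(x) @ w1)  # ReLU feed-forward expansion
    return x + h @ w2

rng = np.random.default_rng(2)
d_model, d_ff, seq_len = 8, 32, 4
x = rng.standard_normal((seq_len, d_model))
w1 = rng.standard_normal((d_model, d_ff)) * 0.1
w2 = rng.standard_normal((d_ff, d_model)) * 0.1
y = transformer_block(x, w1, w2)
```

Stacking dozens of such blocks, each with its own learned weights, is essentially what "a 96-layer model" means. Removing the causal mask turns this into an encoder-style (BERT-like) block.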

Scaling transformers up -- more parameters, more data, more compute -- has reliably produced more capable models. This scaling law is the empirical observation driving the multi-billion-dollar training runs behind GPT-4, Claude, and Gemini.
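The scaling observation is often summarized in the literature as a power law in parameter count, roughly loss(N) ≈ (N_c / N)^α. The constants below are purely illustrative stand-ins; real values come from fitting curves to empirical training runs:

```python
# Illustrative power-law scaling curve: loss falls smoothly and predictably
# as parameter count N grows. The constant n_c and exponent alpha here are
# placeholders for demonstration, not fitted values from any real model.
def loss(n_params, n_c=8.8e13, alpha=0.076):
    return (n_c / n_params) ** alpha

for n in (1e8, 1e9, 1e10, 1e11):
    print(f"{n:.0e} params -> loss {loss(n):.3f}")
```

The practical upshot is that labs can forecast a larger model's loss before spending the compute, which is what makes multi-billion-dollar training runs a calculated bet rather than a gamble.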

Why It Matters

The transformer is arguably the single most important architecture in the history of AI. It powers every major LLM (GPT-4, Claude, Gemini, Llama), most modern vision models (Vision Transformer / ViT), and multimodal systems that handle text, images, and audio together. Understanding what a transformer does -- parallel attention over sequences -- is the key to understanding why current AI systems are so capable and so expensive to train.

Key Takeaway

The transformer architecture replaced sequential processing with parallel self-attention, unlocking the scale that makes today's large language models and vision systems possible.

Part of the AI Weekly Glossary.