Encoder–Decoder Models
The Reader vs The Writer
8 min read
BERT reads the whole library to understand meaning. GPT reads half a sentence to predict the rest.
Encoders (like BERT) read the entire input in both directions — past and future context simultaneously — making them brilliant at understanding and classifying text. Decoders (like GPT) only see what came before, predicting the next word — making them natural generators. Most modern LLMs are decoder-only. Encoder–decoder models (like the original Transformer) do both — translate, summarise, convert.
In Plain English
Encoder models understand text deeply by reading it all at once — used for search and classification. Decoder models generate text by predicting word by word — used for ChatGPT-style conversation. Encoder–decoder models do both — used for translation and summarisation.
The Technical Picture
Encoder-only models (BERT, RoBERTa) use bidirectional attention for representation learning. Decoder-only models (GPT, Claude, Llama) use causal (masked) attention for autoregressive generation. Encoder–decoder models (T5, BART) pair both for sequence-to-sequence tasks.
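The difference between bidirectional and causal attention comes down to a mask. Here's a minimal NumPy sketch (not any library's actual implementation) showing how a causal mask zeroes out attention to future positions, while the encoder-style version attends everywhere:

```python
import numpy as np

def attention_weights(q, k, causal=False):
    """Scaled dot-product attention weights with an optional causal mask."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)
    if causal:
        # Decoder-style: each position may attend only to itself
        # and earlier positions, never to the future.
        future = np.triu(np.ones(scores.shape, dtype=bool), k=1)
        scores[future] = -np.inf
    # Softmax over the key dimension.
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

x = np.random.default_rng(0).standard_normal((4, 8))
enc = attention_weights(x, x)               # bidirectional: full matrix
dec = attention_weights(x, x, causal=True)  # causal: upper triangle is zero
```

In `dec`, every entry above the diagonal is exactly zero, which is why a decoder can only "see what came before"; `enc` has no such restriction, which is what lets BERT use future context.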
Real-World Examples
- BERT powers Google Search's understanding of your queries
- GPT-4 and Claude are decoder-only — they generate word by word
- Google Translate uses an encoder–decoder architecture
ChatGPT is decoder-only (generates). Google Search's AI is encoder-based (understands). Knowing the difference explains why they feel so different.
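That word-by-word generation loop can be sketched in a few lines. The toy bigram table below is invented purely for illustration; real decoder models predict a probability distribution over a full vocabulary at each step, but the loop structure is the same: append the prediction, then condition on everything so far.

```python
# Hypothetical next-word table standing in for a trained language model.
bigram = {"the": "cat", "cat": "sat", "sat": "down"}

def generate(prompt, steps):
    """Autoregressive decoding: each step sees only the tokens before it."""
    tokens = prompt.split()
    for _ in range(steps):
        nxt = bigram.get(tokens[-1])  # predict from prior context only
        if nxt is None:
            break
        tokens.append(nxt)
    return " ".join(tokens)

print(generate("the", 3))  # → "the cat sat down"
```

An encoder has no equivalent of this loop: it consumes the whole input in one pass and outputs representations, not a continuation.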