Encoder–Decoder Models
The Reader vs The Writer
8 min read
BERT reads the whole library to understand meaning. GPT reads half a sentence to predict the rest.
Encoders (like BERT) read the entire input in both directions — past and future context simultaneously — making them brilliant at understanding and classifying text. Decoders (like GPT) only see what came before, predicting the next word — making them natural generators. Most modern LLMs are decoder-only. Encoder–decoder models (like the original Transformer) do both — translate, summarise, convert.
In Plain English
Encoder models understand text deeply by reading it all at once — used for search and classification. Decoder models generate text by predicting word by word — used for ChatGPT-style conversation. Encoder–decoder models do both — used for translation and summarisation.
The Technical Picture
Encoder-only models (BERT, RoBERTa) use bidirectional attention for representation learning. Decoder-only models (GPT, Claude, Llama) use causal (masked) attention for autoregressive generation. Encoder–decoder models (T5, BART) pair both for sequence-to-sequence tasks.
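The difference between bidirectional and causal attention comes down to a mask. Here's a minimal NumPy sketch (not any library's actual implementation) showing how a causal mask zeroes out attention to future positions, while the encoder-style version attends everywhere:

```python
import numpy as np

def attention_weights(q, k, causal=False):
    """Scaled dot-product attention weights with an optional causal mask."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)
    if causal:
        # Decoder-style: each position may attend only to itself
        # and earlier positions, never to the future.
        future = np.triu(np.ones(scores.shape, dtype=bool), k=1)
        scores[future] = -np.inf
    # Softmax over the key dimension.
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

x = np.random.default_rng(0).standard_normal((4, 8))
enc = attention_weights(x, x)               # bidirectional: full matrix
dec = attention_weights(x, x, causal=True)  # causal: upper triangle is zero
```

In `dec`, every entry above the diagonal is exactly zero, which is why a decoder can only "see what came before"; `enc` has no such restriction, which is what lets BERT use future context.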
Real-World Examples
- BERT powers Google Search's understanding of your queries
- GPT-4 and Claude are decoder-only — they generate word by word
- Google Translate uses an encoder–decoder architecture
ChatGPT is decoder-only (generates). Google Search's AI is encoder-based (understands). Knowing the difference explains why they feel so different.
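That word-by-word generation loop can be sketched in a few lines. The toy bigram table below is invented purely for illustration; real decoder models predict a probability distribution over a full vocabulary at each step, but the loop structure is the same: append the prediction, then condition on everything so far.

```python
# Hypothetical next-word table standing in for a trained language model.
bigram = {"the": "cat", "cat": "sat", "sat": "down"}

def generate(prompt, steps):
    """Autoregressive decoding: each step sees only the tokens before it."""
    tokens = prompt.split()
    for _ in range(steps):
        nxt = bigram.get(tokens[-1])  # predict from prior context only
        if nxt is None:
            break
        tokens.append(nxt)
    return " ".join(tokens)

print(generate("the", 3))  # → "the cat sat down"
```

An encoder has no equivalent of this loop: it consumes the whole input in one pass and outputs representations, not a continuation.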