
The Transformer Breakthrough of 2017: How Eight Researchers Rewired AI

A plain-English explanation of the 'Attention Is All You Need' paper that created the architecture behind GPT, Claude, and every major AI model you use today.

Robert Soares

Eight researchers at Google published a paper in June 2017. Its title was playful, a riff on the Beatles: “Attention Is All You Need.” The paper ran 15 pages. It described an architecture called the transformer.

That paper broke AI open.

The Old World: Recurrent Neural Networks

Before transformers, language models processed text sequentially, which is a fancy way of saying they read one word at a time, left to right, the way a person reads a sentence aloud.

This architecture had a name. Recurrent neural networks. RNNs.

The problem with reading one word at a time is that you forget what came before, and RNNs made this problem worse: they squeezed everything they had read so far into a single running summary, and that summary struggled to hold information across long sequences. By word fifty, word three was mostly gone from the model’s working memory, faded into numerical noise that corrupted the computations downstream.

A variant called Long Short-Term Memory networks, or LSTMs, improved things in 1997. They added gates: mechanisms that could decide what to remember and what to forget. LSTMs worked better. They became standard.

But LSTMs had their own problem. Sequential processing. To handle word ten, you needed the output from word nine, which needed word eight, which needed word seven. No shortcuts. No parallelism. Training crawled because GPUs sat idle, waiting for previous computations to finish before starting the next ones.
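Here is what that dependency chain looks like in code. This is a toy numpy sketch, not any real library’s API, but the loop structure is the point: step ten cannot start until step nine has finished.

```python
import numpy as np

def rnn_forward(inputs, W_x, W_h):
    """Toy recurrent pass: step t cannot start until step t-1 has finished."""
    hidden = np.zeros(W_h.shape[0])
    states = []
    for x in inputs:                              # strictly one step after another
        hidden = np.tanh(W_x @ x + W_h @ hidden)  # depends on the previous hidden state
        states.append(hidden)
    return states

rng = np.random.default_rng(0)
inputs = rng.normal(size=(50, 16))        # 50 "words", 16-dim embeddings
W_x = rng.normal(size=(32, 16)) * 0.1
W_h = rng.normal(size=(32, 32)) * 0.1
states = rnn_forward(inputs, W_x, W_h)    # 50 sequential steps, no parallelism
```

Fifty words means fifty waits in a row, no matter how many GPU cores are sitting idle.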

As one Hacker News commenter noted in a 2020 discussion of the original paper: “It’s clearly important but I found that paper hard to follow.” The field was ready for something simpler. Something faster. Something that could actually scale.

The Key Insight: Look at Everything at Once

The transformer’s core innovation was abandoning sequence entirely.

Instead of processing words one by one, transformers look at all words simultaneously. Instead of remembering what came before, they compute relationships between every word and every other word in real time. Every single time.

This sounds computationally expensive. It is. But it parallelizes perfectly. Every word-to-word comparison can happen at the same time on different GPU cores. Training that took weeks on RNNs took days on transformers.

The mechanism that enables this is called attention. Specifically, self-attention.

Self-Attention: The Core Mechanism

Here is a sentence: “The dog didn’t cross the street because it was too tired.”

What does “it” refer to? The dog. Obviously. Humans resolve this instantly. We do not consciously think about it. We just know.

But how would a machine figure this out?

Self-attention computes a score between every pair of words. When processing “it,” the model calculates how much attention “it” should pay to every other word: “the,” “dog,” “didn’t,” “cross,” “the,” “street,” “because,” “was,” “too,” “tired.” The word “dog” gets a high attention score. The word “street” gets a low one.

This happens for every word simultaneously. The model builds a weighted representation where each word incorporates information from all the other words it should care about. Distance does not matter. “Dog” could be three words away or thirty. The attention mechanism finds it either way.

Jay Alammar, whose Illustrated Transformer became required reading for anyone learning this material, put it simply: “Self-attention is the method the Transformer uses to bake the ‘understanding’ of other relevant words into the one we’re currently processing.”

Multiple Perspectives: Multi-Head Attention

One attention mechanism captures one type of relationship. But language has many types of relationships happening simultaneously. Grammatical relationships. Semantic relationships. Referential relationships. Temporal relationships.

The transformer uses multiple attention “heads” running in parallel. Each head learns to focus on different patterns. One might track subject-verb agreement. Another might track pronoun references. Another might capture semantic similarity.

Alammar explains the benefit: “It expands the model’s ability to focus on different positions” and “It gives the attention layer multiple ‘representation subspaces.’”

The results from all heads get combined. The model sees the sentence from multiple angles at once, integrating different types of linguistic information into a single rich representation that captures more than any single attention mechanism could alone.
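In code, the multi-head idea is a small extension of the single-head sketch above: split the work across several smaller heads, each with its own projections, and concatenate what they produce. The sketch below uses eight heads, as the original paper did, but skips the final output projection the real architecture applies to the combined result.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, heads):
    """Run several attention 'heads' in parallel, then concatenate their outputs."""
    outputs = []
    for W_q, W_k, W_v in heads:                      # each head has its own projections
        Q, K, V = X @ W_q, X @ W_k, X @ W_v
        weights = softmax(Q @ K.T / np.sqrt(K.shape[-1]))
        outputs.append(weights @ V)                  # this head's view of the sentence
    return np.concatenate(outputs, axis=-1)          # combine all perspectives

rng = np.random.default_rng(0)
n_words, d_model, n_heads = 11, 64, 8                # 8 heads, as in the original paper
d_head = d_model // n_heads                          # each head works in a smaller subspace
X = rng.normal(size=(n_words, d_model))
heads = [tuple(rng.normal(size=(d_model, d_head)) for _ in range(3)) for _ in range(n_heads)]
combined = multi_head_attention(X, heads)            # shape (11, 64): all heads side by side
```

Each head sees only an eight-dimensional slice of the representation, which is what Alammar means by “representation subspaces.”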

Position Without Sequence

Here is a subtle problem. If you process all words simultaneously, how do you know their order? “Dog bites man” means something different from “man bites dog.”

Transformers solve this by adding positional encodings. Before processing, each word gets information about its position injected into its representation. The model learns to use this position information. Word order is preserved without sequential processing.

This was one of the clever engineering decisions that made the whole architecture work.

Why It Actually Worked

The original peer reviewers at NeurIPS 2017 saw something special. One reviewer noted: “This work introduces a quite strikingly different approach to the problem of sequence-to-sequence modeling.” Another acknowledged that “the combination of them and the details necessary for getting it to work as well as LSTMs is a major achievement.”

The results spoke loudly. On the WMT 2014 English-to-German translation benchmark, the transformer achieved 28.4 BLEU, beating existing state-of-the-art by more than 2 points. On English-to-French, it hit 41.8 BLEU. State of the art. Again.

And it trained faster. Much faster. The parallelizability that came from abandoning sequential processing meant you could throw more hardware at the problem and actually get proportional speedups.

But the real impact was not the benchmarks. It was what happened next.

The Unexpected Generality

The transformer was designed for translation. Language in, language out. Nobody expected it to work for everything else.

It did.

By 2020, researchers adapted transformers to images. The Vision Transformer, or ViT, treats an image as a sequence of patches and processes them with attention. It matched or beat convolutional neural networks that had dominated computer vision for nearly a decade.

Audio. Protein folding. Robotics. Reinforcement learning. Game playing. Code generation. One architecture kept showing up everywhere.

As one Hacker News user observed during a 2020 retrospective: “It’s crazy to me to see what still feel like new developments (come on, it was just 2017!) making their way into mainstream.”

Another user captured something deeper about what made transformers different: “The successful removal of inductive bias is really what differentiates this from previous sequence-to-sequence neural networks.”

That removal of inductive bias turned out to be the transformers’ secret weapon. RNNs assumed sequence mattered in a specific way. Convolutional networks assumed local patterns mattered in a specific way. Transformers assumed almost nothing. They learned everything from data.

This made them flexible. This made them scale.

The Path to Everything

The transformer paper did not create ChatGPT. It created the foundation.

BERT came in 2018. Google’s bidirectional encoder used transformers to understand language context from both directions. It dominated natural language understanding benchmarks.

GPT came in 2018 from OpenAI. Generative Pre-trained Transformer. The name contained “transformer” right there in the acronym. GPT-2 followed in 2019. GPT-3 in 2020 scaled to 175 billion parameters and showed capabilities nobody expected from scale alone.

Claude. Gemini. Llama. Every major language model today is a transformer or a close derivative.

The architecture that started as a translation improvement became the substrate for artificial general intelligence research.

The Costs and Limits

Transformers are not free. Self-attention compares every word to every other word. With N words, that is N-squared comparisons. Double the context length, quadruple the computation.

This creates hard limits. Early transformers handled a few thousand tokens. Modern models push into hundreds of thousands, but every extension requires engineering tricks: sparse attention, sliding windows, memory mechanisms. The quadratic cost never disappears. It just gets managed.

One Hacker News commenter noted bluntly: “The amount of computation for processing a sequence size N with a vanilla transformer is still N^2.”

Training costs escalated too. GPT-4 reportedly cost over $100 million to train. Only a handful of organizations can afford frontier model development. The democratizing architecture created a centralizing industry.

What Comes After

By 2025, researchers were actively looking for alternatives. State space models like Mamba promised linear scaling instead of quadratic. Mixture of experts architectures, reportedly used in GPT-4, activate only parts of the model for each input.

One of the original “Attention Is All You Need” authors, Llion Jones, went public in early 2025: “I’m going to drastically reduce the amount of time that I spend on transformers…I’m explicitly now exploring and looking for the next big thing.”

But transformers remain dominant. Any replacement needs to match their capabilities while solving their limitations. Nobody has done that yet.

The Paper in Retrospect

Eight authors wrote “Attention Is All You Need.” They worked at Google Brain and Google Research. The title was a joke about the Beatles. The content was serious.

What made the paper matter?

Simplicity. Throwing out recurrence and convolution left a cleaner architecture. Simpler architectures scale better. Simpler architectures transfer better. Simpler architectures survive longer.

Parallelizability. GPUs existed. Large datasets existed. The infrastructure to use transformers at scale was emerging just as the architecture arrived.

Generality. The same architecture worked for translation, then language modeling, then images, then audio, then video, then protein folding. One architecture to rule them all was not the plan. It was the outcome.

Timing. 2017 was late enough that computing power made transformers practical and early enough that the full implications took years to play out.

Why Understanding This Matters

You do not need to understand attention scores to use Claude or GPT. But understanding the basic architecture helps you understand why these systems behave the way they do.

Transformers are pattern machines. They excel at finding and generating patterns in data. They are not reasoning engines, though they simulate reasoning through sophisticated pattern matching.

Context matters because transformers see all the context you provide simultaneously. More context usually means better outputs. Inconsistent context confuses the pattern matching.

Limits exist because quadratic scaling is unforgiving. Long documents hit walls. Complex reasoning chains break down. The architecture has real constraints.

And every major model uses the same foundation. GPT and Claude and Gemini look different from the outside. Inside, they are all transformers. Understanding one architecture helps you understand them all.

The eight researchers who published “Attention Is All You Need” in 2017 could not have predicted where their architecture would go. Language models that converse. Image generators that dream. Code assistants that program. None of this was in the original paper. All of it came from transformers.

The most consequential computer science papers do not announce themselves as such. They describe a technique. They report some benchmarks. They get published.

Then they change everything.
