You’re deep into a conversation with an AI assistant. You’ve explained your project, given context about your company, listed your constraints. Thirty minutes in, you ask a follow-up question and the AI responds like you never told it any of that.
What happened? The AI didn’t crash. It didn’t get confused. It simply ran out of room to remember.
This frustrating experience comes down to two concepts most people haven’t heard of: tokens and context windows. Understanding them won’t make the problem disappear, but it will help you work with AI more effectively and hit those walls less often.
What Are Tokens, Exactly?
An AI model doesn’t read text the way you do. It doesn’t see words. It sees tokens.
A token is a chunk of text. Sometimes it’s a whole word. Sometimes it’s part of a word. Sometimes it’s just a few characters. The AI breaks everything down into these pieces before it can process anything.
Here’s how it works. The word “apple” is common enough that it’s a single token. But “hamburger” gets split into three: “ham,” “bur,” and “ger.” The word “unbelievable” might become “un,” “believ,” and “able.”
Common words stay whole. Rare words get chopped up. This happens through a process called tokenization, and different AI models use slightly different approaches. OpenAI’s newer models use a tokenizer called o200k_base. Older GPT-4 models use cl100k_base. The specific tokenizer affects exactly how text gets split, though the basic principle stays the same.
According to OpenAI, 100 tokens works out to roughly 75 English words, and a single token is approximately 4 characters. So when you’re typing a 1,000-word document, you’re actually using around 1,333 tokens.
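Want to see this for yourself? OpenAI publishes its tokenizers through an open-source library called tiktoken. Here’s a small sketch that counts tokens and checks the rule of thumb. The exact splits and counts depend on which tokenizer you load, and the sample document below is just a stand-in:

```python
# pip install tiktoken
import tiktoken

# Tokenizer used by newer OpenAI models; swap in "cl100k_base"
# to see how older GPT-4 models split the same text.
enc = tiktoken.get_encoding("o200k_base")

for text in ["apple", "unbelievable", "Cómo estás"]:
    tokens = enc.encode(text)
    pieces = [enc.decode([t]) for t in tokens]
    print(f"{text!r}: {len(tokens)} tokens -> {pieces}")

# Rule of thumb: roughly 4 characters of English text per token.
document = "The quick brown fox jumps over the lazy dog. " * 100
print(len(enc.encode(document)), "actual tokens vs", len(document) // 4, "estimated")
```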
Why does this matter? Because every AI model has a limit on how many tokens it can handle at once. Your prompt, the AI’s response, and everything in between all count against that limit.
Different languages tokenize differently too. English is relatively efficient. Non-English text often produces more tokens per word, which means the same information takes up more space. A Spanish phrase like “Cómo estás” uses 5 tokens for just 10 characters.
The Context Window: Your AI’s Working Memory
The context window is the total amount of text an AI can consider at once. Think of it as the AI’s working memory. Everything in the conversation needs to fit inside this window, or parts get pushed out.
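Chat tools enforce this limit in a very literal way: once the conversation no longer fits, the oldest turns get dropped before the request is sent. Here’s a deliberately simplified sketch of that idea, using a made-up token budget and a word count standing in for a real tokenizer:

```python
def count_tokens(text: str) -> int:
    # Placeholder: a real system would use the model's own tokenizer.
    return len(text.split())

def fit_to_window(messages: list[str], budget: int) -> list[str]:
    """Keep the most recent messages that still fit inside the token budget."""
    kept, used = [], 0
    for message in reversed(messages):   # walk from newest to oldest
        cost = count_tokens(message)
        if used + cost > budget:
            break                        # everything older gets pushed out
        kept.append(message)
        used += cost
    return list(reversed(kept))

history = [
    "We use Python 3.9.",
    "The app is a Flask API.",
    "Now add caching to the /users route.",
]
print(fit_to_window(history, budget=12))
# -> ['Now add caching to the /users route.']  (the Python 3.9 constraint is gone)
```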
Here’s where the numbers get interesting.
Gemini 2.5 Pro and Gemini 3 offer 1 million tokens. That’s roughly 750,000 words. You could fit several novels in there.
Claude models from Anthropic typically offer 200,000 tokens, with some versions recently expanded to 1 million tokens for certain use cases.
GPT-5 from OpenAI comes with a 400,000 token context window.
And Meta’s Llama 4 Scout pushed things to an unprecedented 10 million tokens. That’s approximately 7,500 pages of text.
A million tokens sounds like plenty. So why do AI tools still forget things?
Why AI “Forgets” Earlier Parts of Conversations
Here’s the part that surprises most people. Having a large context window doesn’t mean the AI uses all of it equally well.
Researchers from Stanford and other institutions published a study called “Lost in the Middle” that examined how language models actually use long contexts. The findings were striking.
Models show a U-shaped performance curve. They’re good at remembering information at the very beginning of the context and the very end. But information in the middle? It often gets overlooked.
In their experiments, GPT-3.5-Turbo’s performance dropped by more than 20% when key information was placed in the middle of the input rather than at the beginning or end. In some cases, the model performed worse than if it had no context at all.
This happens because of how attention mechanisms work in these models. The AI doesn’t give equal weight to every part of its context. It tends to focus on what came first and what came most recently.
So even if your conversation technically fits within the context window, the AI might effectively “forget” crucial details from earlier in your exchange simply because they’re buried in the middle.
This isn’t a flaw that will be easily patched. It’s a consequence of how transformer architectures process information. The attention mechanism works by assigning weights to different parts of the input, and those weights naturally favor certain positions. Researchers are working on solutions, but for now, position matters.
The Gap Between Advertised and Usable Context
There’s another issue worth understanding. The context window you see advertised isn’t always the context window you get.
Research from Chroma found that “most models break much earlier than advertised. A model claiming 200k tokens typically becomes unreliable around 130k, with sudden performance drops rather than gradual degradation.”
Models don’t gracefully degrade as they fill up. They work fine, then suddenly don’t. The drop-off can be abrupt.
This explains why you might hit strange behavior well before you’d expect to run out of room. The model hasn’t technically run out of context. It’s just that its ability to use that context has degraded past the point of being useful.
Why Context Windows Are Harder Than They Sound
Expanding context windows is one of the most active areas of AI research. You might wonder why it’s so difficult when these companies have billions of dollars and massive compute resources.
The challenge is computational. The attention mechanism that helps AI understand relationships between words scales quadratically with context length. Double the context, quadruple the compute needed. That’s expensive and slow.
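To make that concrete: standard attention compares every token against every other token, so the number of pairwise scores grows with the square of the context length. A toy calculation:

```python
for n in [4_000, 8_000, 128_000, 1_000_000]:
    pairs = n * n  # one attention score per token pair, per layer, per head
    print(f"{n:>9,} tokens -> {pairs:>19,} pairwise scores")

# Doubling the context from 4,000 to 8,000 tokens quadruples the work:
# 16,000,000 scores becomes 64,000,000.
```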
Various techniques help. Sparse attention patterns, sliding window approaches, and architectural innovations have pushed context windows from thousands to millions of tokens. But each expansion comes with tradeoffs in speed, cost, or accuracy.
Google’s Gemini team, when introducing their 1 million token capability, described achieving “near-perfect recall” on needle-in-a-haystack tests. But that benchmark tests whether the model can find a specific piece of planted information. Real-world use is messier. You’re not asking it to find a needle. You’re asking it to synthesize everything it knows about your project while keeping track of constraints you mentioned an hour ago.
Practical Ways to Work Around These Limits
Understanding the problem helps, but you probably want to know what to do about it. Here are approaches that actually work.
Front-Load the Important Stuff
Since AI models pay more attention to the beginning and end of context, put your most critical information first. If you’re asking for help with a specific task, state the key constraints upfront. Don’t bury them in paragraph five of your explanation.
Instead of describing your whole project chronologically, start with what matters most for the current question.
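Here’s what that looks like in practice. The project details below are invented; what matters is where the constraint sits:

```python
# Buried: the key constraint arrives after the backstory, deep in the middle.
buried = """We've been building an internal dashboard since 2022. It started as a
side project, then marketing adopted it, then finance...
...oh, and everything has to run on Python 3.9. Can you suggest a caching approach?"""

# Front-loaded: constraint and ask first, background last.
front_loaded = """Constraint: must run on Python 3.9 (no newer syntax or libraries).
Task: suggest a caching approach for our internal dashboard (a Flask app).
Background: we've maintained it since 2022; marketing and finance both use it."""
```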
Repeat Key Information
Don’t assume the AI remembers something you said earlier. When you’re deep into a conversation, explicitly remind it of relevant context.
“As I mentioned at the start, we need this to work with Python 3.9” takes a few extra tokens but can save you from getting responses that ignore that constraint entirely.
Summarize Periodically
If you’re in a long working session, pause occasionally to have the AI summarize what you’ve established so far. This does two things. It confirms the AI is tracking correctly, and it creates a compressed version of the context that takes up less room.
You can then start a fresh conversation using just that summary, effectively resetting your context budget.
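If you’re working through an API rather than a chat interface, you can script this handoff. A minimal sketch with the OpenAI Python SDK, where the model name, prompts, and conversation contents are all placeholders to adapt to whatever you actually use:

```python
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-4o-mini"  # placeholder model name

long_conversation = [
    {"role": "user", "content": "We're building a Flask API that must run on Python 3.9..."},
    {"role": "assistant", "content": "Understood. Here's a plan..."},
    # ...dozens of earlier turns
]

# 1. Ask the model to compress everything established so far.
summary = client.chat.completions.create(
    model=MODEL,
    messages=long_conversation + [{
        "role": "user",
        "content": "Summarize the constraints, decisions, and open questions "
                   "from this conversation in under 200 words.",
    }],
).choices[0].message.content

# 2. Start a fresh conversation seeded with just that summary.
fresh_conversation = [
    {"role": "system", "content": f"Context carried over from an earlier session:\n{summary}"},
    {"role": "user", "content": "Let's move on to the next task."},
]
```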
Use Separate Conversations for Separate Topics
Don’t try to handle everything in one thread. If you’re working on multiple aspects of a project, split them into separate conversations. Each gets its own fresh context window.
Be Strategic About What You Include
Every document you paste, every example you provide, every bit of context uses tokens. Ask yourself: does the AI actually need this for the current task?
If you’re asking it to fix a specific function, you probably don’t need to include your entire codebase. If you want feedback on one paragraph, don’t paste the whole document.
Structure Long Documents
When you do need to include lengthy content, add clear headings and markers. This helps the AI’s attention mechanisms find relevant sections rather than treating everything as one undifferentiated blob.
Put Key Info at the Beginning and End
Given the U-shaped attention pattern, strategically place your most important information where the model will pay most attention. State critical constraints at the start of your prompt. Repeat them at the end if the prompt is long. The middle is where things get lost.
Watch for Signs of Context Loss
Learn to recognize when an AI has lost context. Responses that ignore constraints you established. Answers that contradict earlier parts of the conversation. Sudden drops in relevance or quality. When you notice these, it’s usually not the AI being stubborn. It’s the context window filling up or the middle getting ignored.
The Reasoning Token Tradeoff
One more thing worth knowing. Modern AI models increasingly use “reasoning tokens” for complex problems. These are tokens the model generates internally as it thinks through a problem before giving you an answer.
Reasoning tokens improve quality on difficult tasks, but they use up context too. When a model spends more tokens thinking, fewer are available for remembering your earlier conversation.
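If you call a reasoning model through an API, you can see the split in the usage numbers the response reports. A sketch using the OpenAI Python SDK, where the model name is a placeholder and the exact usage fields may differ across SDK versions:

```python
from openai import OpenAI

client = OpenAI()
response = client.chat.completions.create(
    model="o4-mini",  # placeholder: any reasoning-capable model
    messages=[{"role": "user", "content": "Plan a migration from Python 3.9 to 3.12."}],
)

usage = response.usage
print("prompt tokens:    ", usage.prompt_tokens)
print("completion tokens:", usage.completion_tokens)
# Reasoning tokens are counted (and billed) but never appear in the reply text.
print("reasoning tokens: ", usage.completion_tokens_details.reasoning_tokens)
```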
You might notice this when asking complex analytical questions. The AI produces a thoughtful response but seems to have lost track of some context. It’s not a bug. It’s a tradeoff. More thinking, less memory.
What Context Limits Tell Us About AI
These constraints reveal something fundamental about how current AI works. Despite impressive capabilities, language models don’t actually understand or remember in the human sense. They process tokens through mathematical operations that happen to produce useful outputs.
The context window is a hard computational boundary. The “lost in the middle” problem reflects how attention mechanisms actually behave. The gap between advertised and usable context shows that benchmark performance doesn’t always match real-world use.
None of this makes AI tools less useful. It just means using them effectively requires understanding what they actually are: very sophisticated text prediction systems with specific, measurable limitations.
Context Windows Will Keep Growing
The trend is clearly toward larger context windows. A few years ago, 4,000 tokens was standard. Now we’re seeing millions. Research continues on making those larger windows more usable, not just bigger.
Techniques like retrieval-augmented generation (where the AI can look up relevant information from a database rather than holding everything in context) and better architectural approaches will likely reduce how often you hit these walls.
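The retrieval idea is simple enough to sketch. Instead of pasting every document into the prompt, store them elsewhere, pull back only what’s relevant to the current question, and include just that. This toy version ranks documents by keyword overlap; real systems use embedding-based search:

```python
documents = {
    "deploy.md": "We deploy with Docker on a single VM behind nginx.",
    "style.md": "All code must target Python 3.9 and pass flake8 checks.",
    "roadmap.md": "Q3 goals include caching and better test coverage.",
}

def retrieve(question: str, k: int = 1) -> list[str]:
    """Return the k documents sharing the most words with the question."""
    q_words = set(question.lower().split())
    ranked = sorted(
        documents.values(),
        key=lambda text: len(q_words & set(text.lower().split())),
        reverse=True,
    )
    return ranked[:k]

question = "Which Python version should the caching code target?"
prompt = "Context:\n" + "\n".join(retrieve(question)) + f"\n\nQuestion: {question}"
# The prompt now carries one relevant snippet instead of the whole knowledge base.
```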
Some researchers are working on models that can selectively compress older context, keeping important information while discarding details that no longer matter. Others are exploring ways to give AI models more explicit memory systems, separate from the context window entirely.
But for now, tokens and context windows matter. Every conversation with AI happens inside these constraints. The more you understand them, the better you can work within them.
When your AI assistant forgets something important, it’s not being difficult. It just ran out of room, or the information got lost in the middle. Now you know why, and you can plan accordingly.
The practical takeaways are straightforward. Keep important information at the start. Don’t assume earlier context survived. Summarize and reset when conversations get long. And when the AI seems to have forgotten everything, it probably has. Start fresh, front-load what matters, and you’ll get better results.