Ben •
In July 2025, Chroma published one of the most methodologically careful studies of LLM long-context performance to date. The paper - "Context Rot: How Increasing Input Tokens Impacts LLM Performance", authored by Kelly Hong, Anton Troynikov, and Jeff Huber - evaluated 18 frontier language models across five distinct experiments, isolating input length as the primary variable.
The headline finding: every single one of the 18 models showed performance degradation as input length increased. Not most. Not some. All of them. The degradation is nonuniform, sometimes sharp, and in at least one case so counterintuitive that the paper itself flags it as deserving further research.
The 18 models span four families: Claude Opus 4, Sonnet 4, Sonnet 3.7, Sonnet 3.5, and Haiku 3.5 from Anthropic; o3, GPT-4.1, GPT-4.1 mini, GPT-4.1 nano, GPT-4o, GPT-4 Turbo, and GPT-3.5 Turbo from OpenAI; Gemini 2.5 Pro, Gemini 2.5 Flash, and Gemini 2.0 Flash from Google; and Qwen3-235B-A22B, Qwen3-32B, and Qwen3-8B from Alibaba.
This post walks through what the paper actually found, what's genuinely surprising, what it does not prove, and what it means for anyone building with or alongside LLMs today.
The phenomenon of LLMs performing worse on long inputs isn't new. The Stanford "lost in the middle" research documented position-dependent attention failures years before Chroma's study. What makes Chroma's technical report specifically important comes down to three things.
Scale. Eighteen frontier models evaluated systematically makes this one of the largest comparative studies of its kind. Reaching across four major LLM families (Claude, GPT, Gemini, and Qwen) in a single controlled study is what makes the "all 18 degraded" finding genuinely hard to dismiss. This isn't one model family's idiosyncratic weakness - it's a pattern that appears to transcend architecture.
Methodological isolation. Earlier long-context benchmarks conflated input length with task difficulty. When performance degraded, you couldn't tell whether the longer context itself was the problem or whether longer contexts happened to accompany harder tasks. The Chroma study holds task complexity constant while varying input length. That's the key design choice that makes their results interpretable. Models are typically presumed to process context uniformly - treating the 10,000th token as reliably as the 100th. This study tests that presumption directly.
Scope beyond needle-in-a-haystack. The original NIAH benchmark tests one thing: can the model find a specific fact buried in filler text? The Chroma study moves substantially past this, testing distractor interference, semantic relationships between content and query, the structural properties of context, and a realistic conversational QA benchmark. That scope is why the findings have more implications for production systems than a standard NIAH test would.
The full replication codebase is publicly available on GitHub, which means the methodology can be examined and extended.
The analytical core of the Chroma paper is five experiments, each designed to isolate a different dimension of how input composition affects performance as length grows. I'll walk through them in the paper's order, which reflects the logical progression of the methodology.
What was tested: The researchers varied the semantic similarity between planted facts (needles) and the questions used to retrieve them. Some needle-question pairs shared direct vocabulary. Others required the model to infer meaning - the question asked about something described in different words than the needle used to describe it.
What they found: Low-similarity pairs - those requiring semantic inference rather than lexical matching - degrade more rapidly as input length grows. At short context lengths, even low-similarity pairs are handled reasonably well. As length increases, the gap between high- and low-similarity pairs widens sharply.
Why it matters: Lexically identical queries are the minority case in real production systems. A user asks "what did we decide about the offshore expansion?" and the relevant document says "Q3 international revenue strategy finalized." The model has to bridge that semantic gap. According to Chroma's findings, precisely that inferential bridge is what degrades first and fastest as context grows. The failure mode the paper documents is the failure mode that actually matters.
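To make the "semantic gap" concrete, here is a minimal sketch that scores word overlap between a question and a needle. This is a crude lexical proxy I'm using for illustration, not the similarity measure from the paper. The stopword list, the function names, and the second needle are all hypothetical.

```python
import re

def content_words(text: str) -> set[str]:
    """Lowercased word tokens minus a tiny, hand-picked stopword list (illustrative only)."""
    stopwords = {"what", "did", "we", "about", "the", "a", "an", "of", "is", "was"}
    return {w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in stopwords}

def jaccard(a: str, b: str) -> float:
    """Jaccard overlap of content words: |A & B| / |A | B|."""
    wa, wb = content_words(a), content_words(b)
    if not wa and not wb:
        return 0.0
    return len(wa & wb) / len(wa | wb)

question = "What did we decide about the offshore expansion?"
# The needle from the example above: same meaning, no shared vocabulary.
needle_paraphrased = "Q3 international revenue strategy finalized."
# A hypothetical needle that happens to reuse the question's wording.
needle_lexical = "We decided to approve the offshore expansion."

print(jaccard(question, needle_paraphrased))  # 0.0: nothing to match on lexically
print(jaccard(question, needle_lexical))      # about 0.33: two shared content words
```

The paraphrased needle scores zero on lexical overlap even though it is the right answer, so the model can only retrieve it by inference. That is the low-similarity condition the experiment probes.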
Nuance to preserve: The paper tested 11 distinct needle positions within the context. Contrary to the classic "lost in the middle" prediction, no strong position-based pattern emerged in this particular experiment. The degradation was driven by length and semantic distance, not placement.
What was tested: Starting with a high-similarity needle-question pair, the researchers introduced distractors - topically adjacent content that sounds relevant but doesn't actually answer the question. They compared three conditions: no distractors (baseline), one distractor, and four distractors.
What they found: A single distractor measurably reduces performance compared to the baseline. Four distractors compound the degradation further. Critically, the impact is non-uniform: some distractors are substantially more disruptive than others, regardless of their measured semantic similarity to the needle. The paper documents that distractor quality matters independently of how closely it resembles the target content.
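For readers who want to picture the setup, here is a minimal sketch of how a haystack with a needle and a variable number of distractors could be assembled. It is my own illustration under simple assumptions, not the code from Chroma's repository: `build_haystack`, the `needle_depth` parameter, and all the example sentences are hypothetical.

```python
import random

def build_haystack(filler: list[str], needle: str, distractors: list[str],
                   needle_depth: float, seed: int = 0) -> list[str]:
    """Return filler sentences with distractors scattered in and the needle
    placed at a relative depth (0.0 = start, 1.0 = end)."""
    if not 0.0 <= needle_depth <= 1.0:
        raise ValueError("needle_depth must be between 0 and 1")
    rng = random.Random(seed)  # seeded so each condition is reproducible
    sentences = list(filler)
    # Scatter distractors at random positions first.
    for distractor in distractors:
        sentences.insert(rng.randint(0, len(sentences)), distractor)
    # Then place the needle at the requested relative depth.
    sentences.insert(round(needle_depth * len(sentences)), needle)
    return sentences

filler = [f"Filler sentence {i}." for i in range(20)]
needle = "The project codename is Bluebird."
distractors = [
    "The previous project codename was Kestrel.",
    "A codename was proposed for the mobile project.",
    "The team debated whether codenames are useful.",
    "Bluebird is also the name of a conference room.",
]

# The three conditions described above: baseline, one distractor, four distractors.
conditions = {"baseline": [], "one": distractors[:1], "four": distractors[:4]}
for name, chosen in conditions.items():
    haystack = build_haystack(filler, needle, chosen, needle_depth=0.5)
    print(name, len(haystack), haystack.index(needle))
```

Holding the needle and question fixed while changing only the distractor count is what lets a drop in accuracy be attributed to the distractors rather than to the task.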
Model-specific behavior: This experiment produced the study's most notable cross-model behavioral difference. Claude models - Sonnet 4 and Opus 4 specifically - showed the lowest hallucination rates under distractor conditions. Their characteristic response to ambiguity was to abstain: to explicitly state that the context doesn't contain a reliable answer rather than generate a confident but incorrect one. GPT models, by contrast, showed the highest hallucination rates under distraction - producing incorrect responses confidently rather than abstaining.
The paper documents this pattern; it doesn't explain the mechanism, and it doesn't constitute a definitive ranking of these models across all tasks. But the behavioral signature is clear and reproducible.
Why it matters: Real-world prompts almost never arrive with a clean, isolated signal. A support ticket prompt includes similar past tickets. A code review includes adjacent functions. A meeting summary prompt includes notes from other workstreams. The paper demonstrates that near-miss content - material topically close to the answer but not actually answering the question - degrades performance in ways that adding more context window doesn't fix. Managing distractors is real engineering work, not premature optimization.
What was tested: The researchers used two thematically distinct haystack types: Paul Graham essays and arXiv academic papers. They then tested what happens when needles are semantically similar to the surrounding haystack versus when they stand out from it.
What they found: The semantic relationship between needle and haystack affects retrieval performance - and the effect is asymmetric. In Paul Graham essay haystacks, arXiv-style needles (semantically dissimilar from the surrounding content) performed notably better than Paul Graham-style needles (which blended into the surrounding text). The effect was smaller in arXiv haystacks, so the pattern does not hold equally across the two topic domains.
Why it matters: The common working assumption about irrelevant padding context is that it's neutral - that it adds tokens but doesn't change the model's ability to find what it's looking for. Chroma's findings challenge this. The type of irrelevant content matters. For production RAG systems, the background documents you inject to provide "context" aren't neutral filler if they share semantic territory with the content you need the model to focus on.
Honest caveat from the paper: Two haystack topics are insufficient to generalize this into a clean design rule. The paper is explicit about this limitation and flags needle-haystack similarity as a direction for future research. Don't treat this finding as a complete theory - treat it as an important early signal about a variable that deserves more investigation.
What was tested: The researchers compared two haystack conditions using identical content: one with natural structural coherence (essays and documents that flow logically) and one with sentences randomly shuffled (same words, same content, but no logical flow or narrative structure).
What they found: Across all 18 models and all tested configurations, models performed better on shuffled haystacks than on logically coherent ones. This result was consistent, not an outlier.
Why this is genuinely surprising: Every intuitive prior says coherent text should help. Structured information should be easier to navigate than randomized noise. The paper's suggested framing: coherent text creates more plausible-seeming distractors. When sentences flow logically, adjacent content appears more relevant even when it isn't, potentially influencing how the attention mechanism allocates focus as input length grows. A shuffled context, counterintuitively, may produce less misleading structure for the model to anchor on.
How to interpret this responsibly: The paper establishes this as a real finding from a rigorous study. What it does not establish is any prescription to shuffle documents before feeding them to models. The mechanism isn't understood, the finding is specific to the experimental conditions, and operationalizing "shuffle your context" without understanding why would be premature. Context structure is a real variable that affects performance in ways current research doesn't fully characterize. That's the finding - and it deserves to be taken seriously on its own terms rather than inflated into an actionable rule.
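Purely to illustrate what the "shuffled" experimental condition means, and not as a preprocessing recommendation, here is a minimal sketch of sentence-level shuffling. The regex sentence splitter is a naive assumption of mine, and the function names are hypothetical; the paper's own pipeline may split and shuffle differently.

```python
import random
import re

def split_sentences(text: str) -> list[str]:
    """Naive splitter: break after '.', '!' or '?' when followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def shuffle_sentences(text: str, seed: int = 0) -> str:
    """Same sentences, same words, random order: content kept, logical flow removed."""
    sentences = split_sentences(text)
    random.Random(seed).shuffle(sentences)  # seeded for reproducibility
    return " ".join(sentences)

original = (
    "We started with a small prototype. It failed under load. "
    "So we rewrote the queue. After that, latency dropped. Was it worth it? Yes."
)
shuffled = shuffle_sentences(original, seed=1)
print(shuffled)
```

The point of the control is visible in the code: nothing is added or removed, so any performance difference between the two conditions comes from ordering alone.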
What was tested: Using the LongMemEval benchmark (Wu, Wang, Yu, Zhang, Chang, and Yu), which evaluates conversational question-answering across realistic memory tasks, the researchers compared two input conditions: a "focused" input containing only the approximately 300 tokens directly relevant to the question, versus a "full" input of approximately 113,000 tokens that included all session history and irrelevant context.
What they found: Every model family - Claude, GPT, Gemini, and Qwen - showed significantly higher performance on the 300-token focused inputs than on the 113K-token full inputs. The gap was most pronounced for Claude Opus 4 and Sonnet 4, which became notably more conservative under the full-context condition, abstaining more frequently when confronted with ambiguity at scale.
Why this is the experiment that matters most for practitioners: This is the Chroma study's most realistic experimental design. LongMemEval uses actual conversational QA tasks - the kind of thing you'd ask an AI assistant managing your calendar, your codebase, or your project notes. And 113K tokens is well within the declared context windows of all the tested models - nobody here is exceeding advertised limits. The finding that a 300-token focused prompt substantially outperforms a 113K-token dump directly challenges the "include everything and let the model sort it out" approach that many production systems currently use.
The pattern across question types: Models showed the best relative performance on knowledge-update tasks (the model was told something changed; did it update its understanding?), then multi-session tasks, then temporal reasoning tasks. Models running in thinking modes narrowed the focused-versus-full gap somewhat, but did not close it.
One additional experiment from the paper is worth brief coverage because it makes the nature of the problem unusually clear.
What was tested: Models were asked to reproduce a sequence of a single repeated word - "apple apple apple..." - with one unique word inserted at a specific position. The task requires no reasoning, no retrieval, no synthesis. It's as simple as a language task gets.
What they found: Performance degraded consistently as the sequence length grew across all models. Models over-generated, under-generated, hallucinated additional words, or lost track of the unique word entirely.
Claude Opus 4 took a different approach: given a list of the word "apple" repeated 5,000 times, it refused the task, citing concerns about reproducing copyrighted material. The paper documents this behavior without editorializing; I'll follow its example. Whatever you think about the refusal itself, it illustrates something real about how Claude Opus 4 handles edge-case requests at scale.
Why this finding matters: Context rot isn't about task complexity. It's about the input length itself. If a model can't reliably reproduce a trivially simple sequence of 5,000 identical tokens, the assumption that it can reliably reason across 200,000 tokens of production context warrants scrutiny. The degradation isn't concentrated in hard tasks - it's a property of length.
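The task is simple enough to sketch in a few lines. The snippet below builds a repeated-word sequence with one unique word and scores a model's reproduction for the failure types listed above. It is a sketch under my own assumptions: the function names, the metric names, and the word "orange" are hypothetical, and the paper's actual scoring may differ.

```python
def make_sequence(common: str, unique: str, length: int, unique_index: int) -> list[str]:
    """A list of `length` copies of `common` with `unique` at `unique_index`."""
    if not 0 <= unique_index < length:
        raise ValueError("unique_index must fall inside the sequence")
    words = [common] * length
    words[unique_index] = unique
    return words

def score_reproduction(expected: list[str], output_text: str, unique: str) -> dict:
    """Compare a model's whitespace-separated output against the expected sequence."""
    produced = output_text.split()
    return {
        "exact_match": produced == expected,
        # Positive means over-generation, negative means under-generation.
        "length_delta": len(produced) - len(expected),
        "unique_position_correct": (
            unique in produced and produced.index(unique) == expected.index(unique)
        ),
        # Words that never appeared in the input (hallucinated additions).
        "unexpected_words": sorted(set(produced) - set(expected)),
    }

expected = make_sequence("apple", "orange", length=50, unique_index=20)

# Simulated outputs standing in for real model responses.
perfect = " ".join(expected)
truncated = " ".join(expected[:-5])
print(score_reproduction(expected, perfect, "orange"))
print(score_reproduction(expected, truncated, "orange"))
```

Because the expected output is fully determined by the input, every deviation is unambiguous, which is what makes this task a clean probe of length effects.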
The Chroma study is an important, carefully designed piece of research - and it has real scope limitations that its authors are explicit about. Eliding those limitations would undercut the post's credibility and misrepresent the research.
It doesn't explain the mechanism. The paper documents the phenomenon with unusual rigor. It doesn't claim to explain why it occurs. The structural coherence finding's interpretation - that coherent text creates more plausible-seeming distractors - is a hypothesis, not a demonstrated mechanism. Mechanistic interpretability research is explicitly flagged as future work. Readers who want to know why context rot happens will have to wait for that follow-on research.
It doesn't test complex real-world tasks. The five experiments deliberately hold task complexity constant while varying input length. This is the right methodological choice for isolating the variable of interest - but it means the study doesn't capture what happens during multi-step reasoning, tool use, synthesis tasks, or agentic workflows. Those use cases probably have context sensitivity too; this paper doesn't measure it.
It doesn't produce a model ranking. The paper is explicit: no single model wins across all five experiments. The findings show Claude Sonnet 4 leading on the repeated-words task, and GPT-4.1 showing competitive needle retrieval. Cross-model behavioral differences (Claude's abstention pattern vs. GPT's hallucination pattern) are documented, but the paper is an analysis of patterns in the phenomenon, not a leaderboard.
The needle-haystack similarity finding is preliminary. Two haystack topics cannot support a generalized theory about which content relationships matter. The paper says this explicitly. Don't overweight experiment 3 in architectural decisions until there's more replication.
It doesn't say "don't use long context." The paper's conclusion is specifically about how to use long context well, through careful construction of what goes into the context window. The conclusion points toward context engineering as the discipline that the findings demand. It does not point toward abandoning long context as a tool.
Three terms get conflated enough to be worth distinguishing briefly. See the full definitions of related terms.
Context rot ≠ lost in the middle. The Stanford "lost in the middle" research identified a position-dependent failure mode: information placed in the middle of a long context receives worse attention than information at the beginning or end. Context rot describes degradation as a function of length itself, regardless of where relevant content sits. Notably, the Chroma study found no strong position-based pattern in its primary experiments - the degradation they measured is length-driven, not position-driven.
Context rot ≠ context overflow. Overflow is hitting the model's hard token limit. Context rot happens well before overflow - a model with a 200K-token window can show significant performance degradation long before it fills that window, as the LongMemEval experiment demonstrates with 113K-token inputs.
Context rot ≠ hallucination. Hallucination is a failure mode that context rot exacerbates. They're related but distinct. A model that confidently generates an incorrect answer under distractor pressure (as Chroma documented with GPT models) is hallucinating, but the cause in that context is context rot. The phenomenon and its symptoms aren't the same thing.
The practical implications vary depending on how you're working with these models. I'll organize this by the way the research maps to different builder contexts.
If you're building agents or long-running LLM workflows:
"Just use a bigger context window" isn't a viable strategy - the research now documents this clearly enough that engineering teams should treat it as a design input, not a debate. The degradation the Chroma study found isn't an edge case at near-capacity context lengths; it's measurable well before that.
Distractor management deserves to be treated as first-class engineering. Old instructions that conflict with new ones. Deprecated examples still present in your system prompt. Background documents that share vocabulary with your target content. The paper demonstrates that this near-miss content degrades performance in ways you can't compensate for by enlarging the window.
Claude's abstention behavior under ambiguity is worth thinking about explicitly. In production, a model that says "I can't find a reliable answer" is often safer than one that confidently generates an incorrect one, which is what the distractor experiment shows as the alternative pattern. Design your error handling accordingly.
If you're using AI tools daily - as a developer, PM, or analyst:
The LongMemEval finding is the most directly actionable result in the study. Focused prompts containing only the ~300 tokens relevant to your question substantially outperform 113K-token context dumps - on the same models, for the same tasks. This validates the practice of assembling task-scoped context stacks rather than pasting everything into a prompt and hoping. How this plays out in Claude Code specifically is worth understanding directly.
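As a rough illustration of "task-scoped" assembly, here is a minimal sketch that picks only the snippets relevant to a question and stops at a token budget. Everything in it is an assumption for demonstration: the keyword-overlap scoring is a crude stand-in for real retrieval (and, as the first experiment shows, lexical matching misses paraphrased content), the four-characters-per-token estimate is not a real tokenizer, and the function names and snippets are hypothetical.

```python
import re

STOPWORDS = {"the", "a", "an", "is", "of", "for", "on", "to", "when", "what", "and", "in"}

def keywords(text: str) -> set[str]:
    """Lowercased word tokens minus a small illustrative stopword list."""
    return {w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOPWORDS}

def estimate_tokens(text: str) -> int:
    """Very rough estimate: about 4 characters per token (an assumption)."""
    return max(1, len(text) // 4)

def select_context(question: str, snippets: list[str], token_budget: int) -> list[str]:
    """Keep the highest-overlap snippets that fit in the budget; drop zero-overlap ones."""
    q = keywords(question)
    ranked = sorted(snippets, key=lambda s: len(q & keywords(s)), reverse=True)
    chosen, used = [], 0
    for snippet in ranked:
        if not q & keywords(snippet):
            break  # everything after this point shares no keywords with the question
        cost = estimate_tokens(snippet)
        if used + cost <= token_budget:
            chosen.append(snippet)
            used += cost
    return chosen

question = "When is the billing migration deadline?"
snippets = [
    "Lunch menu for the offsite: tacos.",
    "Billing team standup notes: migration blocked on schema review.",
    "The billing migration deadline is March 14.",
]
print(select_context(question, snippets, token_budget=300))
```

The design choice worth noting is the explicit budget: the function has to decide what to leave out, which is the opposite of pasting in the full history.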
The needle-haystack similarity finding has an interesting practical implication: when you're including reference material in a prompt, semantically distinct background material may interfere less than material that shares vocabulary with your target. Including one example from a completely different domain might be less disruptive than including five examples from the same domain as your question.
If you're writing CLAUDE.md, AGENTS.md, or system prompts:
Brevity isn't just a cost-management practice - it's a performance practice. Every instruction in a persistent prompt competes for the model's attention as context grows around it. The Chroma findings on distractors apply here: outdated guidance, conflicting rules, and deprecated examples in your instruction files aren't inert. Here are some practical examples for keeping your instruction files lean.
This is precisely the discipline that sits at the center of the broader field of context engineering - the practice of deliberately constructing what goes into the context window rather than treating context as a free resource. The Chroma study is the empirical foundation under practices that the practical guide to context engineering turns into workflows.
The full technical report is at research.trychroma.com/context-rot - it's 40+ pages, but the methodology sections are worth reading if you want to understand the experimental design in detail. The replication codebase is on GitHub.
For Anthropic's practitioner framing that builds explicitly on Chroma's findings, their "Effective Context Engineering for AI Agents" post is worth reading alongside this one.
Hamel Husain has written an annotated walkthrough of the Chroma presentation: presentation notes with the authors' own explanations, which is useful if you prefer a more conversational format.
For Claude Code users: what to do about it in Claude Code
For readers who want a tool that turns the "focused context" finding into a reusable workflow - saved context stacks you can load per task instead of rebuilding from scratch - that's what we built Mesh for. The underlying research says focused beats full. Mesh is the practical shape of that discipline.
The author is the founder of HiveTrail, where he is building context management tools for LLMs and agentic AI. HiveTrail's flagship product, Mesh, is a desktop app in beta that assembles just-in-time, task-scoped context for LLMs from Notion, local files, and prompt libraries, with built-in privacy scanning and token management. He writes about context engineering, LLM development workflows, and the research behind building reliable AI systems.
Context rot is the measurable phenomenon where an LLM's output quality degrades as input context length grows, even well before the model's maximum context window is reached. The term was formalized by Chroma's July 2025 technical report, which tested 18 frontier LLMs and found that every single one exhibited the pattern across five distinct experiments.
The Chroma study found that performance degraded with input length across all 18 tested models, including Claude Opus 4, Sonnet 4, GPT-4.1, Gemini 2.5 Pro, and Qwen3 variants, across five experiments: needle-question similarity, distractor impact, needle-haystack similarity, haystack structure, and a realistic LongMemEval conversational QA benchmark. The degradation is nonuniform, often sharp, and in at least one case counterintuitive. Logically coherent haystacks consistently produced worse performance than randomly shuffled haystacks across all 18 models.
Context rot is not the same thing as lost-in-the-middle. Lost-in-the-middle refers specifically to position: information placed in the middle of a long context gets worse attention than information at the beginning or end. Context rot describes performance degradation as a function of input length itself, regardless of where relevant content sits. Notably, the Chroma study did not observe strong position-based patterns in its primary experiments.
A larger context window does not fix context rot. The Chroma study explicitly tests this assumption and finds it doesn't hold. Models with context windows of 200K tokens or more all exhibited performance degradation well before reaching their limits. The LongMemEval experiment found significant gaps between 300-token focused inputs and 113K-token full inputs, even though the full inputs remain well within declared context window sizes. The paper concludes that context engineering, the careful construction of what goes into the window, matters more than the window's size.
The paper's own conclusion points toward context engineering: deliberately curating what goes into the context window rather than filling it.
The gap between a 300-token focused input and a 113K-token full input in the LongMemEval experiment is the clearest quantitative signal for why this discipline matters.