Ben •
The most-searched Anthropic context engineering guide in 2026 is the same one published in September 2025: Anthropic's post on effective context engineering for AI agents . It's rigorous, well-argued, and worth reading in full.
Here's what probably happened after you read it: you opened a new Claude session and built your prompt roughly the way you always have. A block of instructions at the top. A few examples. Some background pasted in. Send.
That's not a failure of comprehension. But theory rarely reshapes habit on its own - and Anthropic's post is 5,000 words of theory. It doesn't tell you what to do on Monday morning. It doesn't tell you what to cut from the system prompt you're running in production. It doesn't tell you how to look at your tool list and know which entries are costing you performance.
This post is the bridge. It's a practitioner's translation of Anthropic's context engineering vocabulary into concrete actions, organized around the same four-pillar framework Anthropic describes, with a checklist at the end you can run before your next session.
Tobi Lütke (Shopify CEO) gave the field its name in a June 18, 2025 post on X : "the art of providing all the context for the task to be plausibly solvable by the LLM." Anthropic formalized the framework a few months later. The term has momentum. What's been missing is a practitioner's version of the ideas.
That's what this is.
Anthropic's post makes three claims that sit underneath everything else.
First: context is a finite attention budget. Every token in the context window competes for the model's attention. Longer isn't better - it's more expensive in two directions simultaneously. You pay for it in latency and cost. The model "pays" for it in degraded retrieval, where signal gets buried in noise. Anthropic cites the concept of context rot - as the number of tokens in the context window increases, the model's ability to accurately recall information from that context decreases, across all models.
Second: more context is not better context. This is counterintuitive in a world where context windows keep getting larger. Anthropic is explicit about it: the goal is not to fill the window. Stuffing a context window with everything possibly relevant is roughly the same mistake as CC-ing everyone on an email when you need a decision from one person. The extra recipients don't help - they dilute accountability.
Third: the goal is the smallest set of high-signal tokens. Not minimum context. Minimum sufficient context. The question to ask before every session isn't "what might be useful?" but "what does this specific task actually require?" The gap between those two questions is where most practitioners lose performance.
These three claims aren't new in isolation. What's new is that Anthropic organized them into four specific areas where practitioners routinely get this wrong - four pillars worth operationalizing. As Anthropic summarizes it: the goal is finding "the smallest possible set of high-signal tokens that maximize the likelihood of some desired outcome."
Anthropic identifies four primary sources of context in an AI session: system prompts, tools, examples, and message history. Each one has a failure mode that most practitioners are actively hitting, and each one has a concrete fix.
What follows is not a paraphrase of Anthropic's framework. It's a translation. The goal is that after reading each pillar, you know something specific to check before your next session - not a principle to agree with.
The failure mode here has two poles, and practitioners usually live at one of them.
The first pole is over-specification: the 150-line system prompt that defines personality, tone, error-handling behavior, edge cases, fallback patterns, domain restrictions, and a dozen other things that should be downstream of the task, not baked into every session. If your system prompt would make sense for most tasks you'd ever give this assistant, it's probably too long. You've collapsed task-specific logic into what should be standing context - and now every session carries that weight whether it needs to or not.
The second pole is vague mission statements: "You are a helpful, expert software engineer who thinks carefully before responding." This tells the model almost nothing about how to behave. It's context that reads as context but functions like noise.
Anthropic describes this as finding the "Goldilocks zone" - system prompts specific enough to constrain behavior meaningfully, and narrow enough that they don't pretend to anticipate everything. A system prompt should encode things that are genuinely invariant across sessions: standing constraints, persona-level facts, access policies. Task-specific logic belongs in the task.
The practical check: Read your system prompt and ask which lines you would need to change if you handed this tool to someone doing a completely different task. If the answer is "most of it," you've collapsed task context into standing context. Cut the task-specific half and move it into a stack you assemble per session.
Here's what that shift looks like:
# Before - collapsed system prompt (~12 lines of 180)
You are an expert backend engineer specialized in Python and FastAPI.
You write clean, idiomatic code with strong type hints.
When reviewing code, always check for: SQL injection, rate limiting gaps,
missing auth decorators, unvalidated inputs, and improper exception handling.
Always suggest tests for any code you write. Prefer pytest.
Use Black formatting conventions. Target Python 3.11+.
On architectural questions, default to SOLID principles...
[continues for 168 more lines]
# After - system prompt (standing context only)
You are a backend code reviewer. You identify security issues,
suggest tests, and flag architectural problems.
Code style: Python 3.11+, Black, pytest.
# Moved to per-task stack
Task: Review the auth middleware in /app/middleware/auth.py.
Specific concerns: rate limiting, token validation, exception leakage.
Relevant context: [paste of the specific file + adjacent dependencies]The system prompt didn't get weaker. It got narrower. Task-specific context now travels with the task, not with every session.
The central argument Anthropic makes about tool design: if a human engineer couldn't definitively say which tool should be called in a given situation, the agent can't be expected to do better. Tool selection is a reasoning task. Ambiguity in your tool list becomes ambiguity in agent behavior.
The failure mode here is additive. Every new integration gets added to the tool list. MCP servers accumulate. By the time you have 18 connected tools, you have multiple tools that read files, multiple tools that search documentation, and multiple tools that query the same database via slightly different interfaces. The agent now has to do disambiguation work before it can do the actual work - and that disambiguation competes for attention budget with everything else.
Anthropic frames tools as a context source, not just capabilities. Every tool description you expose to the model occupies tokens and shapes the decision space. A large, overlapping tool list doesn't give the agent more power - it gives it more to reason about before acting.
The practical check: For each pair of tools in your list, ask: is there any task where a human developer couldn't immediately say which one to call? If you hesitate, those tools need clearer differentiation - or one of them needs to be removed from the session.
# Before - overlapping tool list (12 of 18 shown)
- read_file
- read_local_file
- file_reader
- search_docs
- search_documentation
- query_knowledge_base
- fetch_notion_page
- get_notion_content
- run_sql
- query_database
- execute_query
- check_db
# After - curated tool list (same session scope)
- read_file # reads any local file by path
- search_docs # semantic search across indexed documentation
- fetch_notion # retrieves Notion pages by ID or URL
- run_sql # executes read-only SQL against the project databaseThe second list isn't less capable. It's unambiguous. Each tool has a distinct purpose a human could explain in one sentence. That's the target. "If the human engineer can't definitively say which tool should be used, an AI agent can't be expected to do better" - Anthropic's framing, and worth posting above your MCP configuration.
Anthropic's framing here is precise: for an LLM, examples are the "pictures worth a thousand words." A well-chosen example doesn't just illustrate a pattern - it teaches the model what kind of output you're after in a way that instructions rarely achieve.
The failure mode is kitchen-sink exemplars. You write one example, then add another to cover an edge case, then another because a previous run went wrong, then another because a colleague had a different use case. By the end, you have eleven examples, several of which contradict each other in subtle ways - different formatting conventions, different levels of verbosity, different handling of the same edge condition. The model is now learning from the average of your examples, not from a coherent pattern.
Anthropic's guidance is to curate a small set of diverse, canonical examples that portray expected behavior - not a laundry list of edge cases. Three diverse examples that span the actual distribution of the work will outperform a dozen redundant ones that cover the same territory from slightly different angles.
The practical check: Look at your current examples. Can you describe in one sentence what each one is teaching? If two examples are teaching the same thing at slightly different difficulty levels, you only need one. If an example exists because a specific session went wrong, ask whether it's fixing a real pattern or a one-time edge case.
# Before - kitchen-sink examples (11 total, excerpt)
Example 1: Summarize a short customer support email (3 sentences, informal)
Example 2: Summarize a longer customer support email (5 sentences, informal)
Example 3: Summarize a customer support email that mentions refund (3 sentences)
Example 4: Summarize when the customer is angry (neutral tone, 4 sentences)
Example 5: Summarize when there's no clear ask (flag ambiguity)
...6 more
# After - three diverse examples
Example 1: Routine inquiry, clear ask → concise summary + one recommended action
Example 2: Complex complaint, multiple issues → structured summary + escalation flag
Example 3: Ambiguous message, no clear ask → summary + explicit "action unclear" noteThe first set teaches the model to vary sentence count based on email length and tone. The second set teaches the actual decision tree: is there a clear action? Is there ambiguity? Is escalation warranted? Those are the real dimensions. Three examples, covering all three cases.
This is the pillar that receives the least practitioner attention, and it may be the most consequential for anyone running long or recurring sessions.
Anthropic's concern is token accumulation over a session's lifetime. As a session grows, the message history fills with context that was relevant ten turns ago but isn't relevant now - status updates that are now stale, intermediate reasoning the agent has moved past, scaffolding from early in the task. Left uncurated, the agent risks "drowning in exhaustive but potentially irrelevant information." That phrase is Anthropic's, and it's worth sitting with for a moment: the problem isn't that the context is bad, it's that it was good, once, and now it's occupying attention budget that should go to the current task.
In a short session, this doesn't matter much. In a long agentic session - one where an agent is running multiple tool calls, looping across subtasks, and maintaining state across dozens of turns - the accumulation becomes a real problem. Performance degrades in ways that feel like capability issues but are actually context issues.
For long-horizon tasks, Anthropic recommends techniques including compaction (summarizing and reinitializing the context window when it nears capacity) and structured note-taking (writing persistent notes outside the context window and pulling them back in on demand). Both are forms of active curation: keep what's still load-bearing, discard what isn't.
Here's what the difference looks like at the message history level:
# Before - uncurated message history (turn 18 of a long refactoring session)
[Turn 1] User: Let's refactor the auth module.
[Turn 2] Assistant: Here's my plan: [600-token architectural breakdown]
[Turn 3] User: Actually start with the middleware layer.
[Turn 4] Assistant: Sure. Reading middleware files now... [tool call + 800-token result]
[Turn 5] User: The token refresh logic looks wrong.
[Turn 6] Assistant: You're right, here's the issue: [400-token explanation, now resolved]
[Turn 7] User: Good. Now move to the session handler.
...17 more turns of similar accumulation
[Turn 18] User: Why is the session handler expiring tokens early?
# After - compacted message history (same session, same turn 18)
[Summary] Refactoring auth module. Completed: middleware layer (token refresh bug fixed,
PR #214). In progress: session handler. Open question: early token expiration
in session_handler.py line 84. Key files modified: auth/middleware.py,
auth/tokens.py. Next: investigate expiry logic.
[Turn 18] User: Why is the session handler expiring tokens early?The compacted version isn't less informed. It's precisely informed - it carries what's still relevant and has released what isn't.
After building Mesh, I've spent more time thinking about context curation than I'd care to admit. The most counterintuitive thing I've learned: curation decisions made before the session starts matter as much as mid-session housekeeping. What you bring in shapes the trajectory. A session that starts with 40,000 tokens of loosely relevant background has already spent meaningful attention budget before the first message.
The practitioner version of Pillar 4 has two parts:
Pre-session: Assemble the minimum sufficient context before the session starts. Pull the specific files, notes, and reference material this task actually needs - not everything that might be relevant. This is the same discipline Anthropic is describing, applied before the model ever sees a token. It's also the same shift Anthropic flagged in their 2026 trends report : the move from prompt-centric to context-centric workflows.
In-session: For long-running sessions, build the habit of explicit history curation. Use compaction to summarize and replace long chains of intermediate reasoning. Drop context that's been acted on and is no longer decision-relevant. Treat the context window like working memory, not a log.
HiveTrail Mesh is the tool we built for the pre-session half of this. Assemble a curated stack from Notion and local files, scan for secrets, hit a token budget, and copy a clean payload to your clipboard for use with Claude, ChatGPT, Gemini, or any other LLM. (If you're working in Claude Code specifically, see our Claude Code token usage guide for the in-session side of this.)
Midway through Anthropic's post is an argument that doesn't get quoted nearly enough. The idea: agents shouldn't receive all their context upfront. Instead, they should be equipped with tools that allow them to explore for context at runtime - retrieving what they need when they need it, rather than carrying everything in from the start.
This runs against the instinct most practitioners have developed. The instinct is: give the agent more context, more upfront, so it doesn't have to ask. Anthropic's counter-argument is that this instinct creates fragile, expensive agents - and that the alternative isn't less capable agents, it's better-designed ones.
Anthropic illustrates this with Claude Code itself : rather than pre-loading an entire codebase into context, Claude Code maintains lightweight identifiers - file paths, stored queries, Bash references - and uses tools like grep and glob to retrieve files just-in-time. The model can write targeted queries, store intermediate results, and analyze large volumes of data without ever loading full data objects into context. CLAUDE.md files are loaded upfront; everything else is fetched on demand.
Here's what that architectural difference looks like at the tool design level:
# Over-provisioned upfront approach
[system prompt includes:]
<codebase_dump>
[full contents of src/auth/middleware.py] # 800 tokens
[full contents of src/auth/tokens.py] # 600 tokens
[full contents of src/auth/session.py] # 900 tokens
[full contents of tests/test_auth.py] # 1,100 tokens
[full contents of config/auth_config.yaml] # 400 tokens
</codebase_dump>
Total upfront: ~3,800 tokens, whether needed or not.
# Just-in-time approach
[system prompt includes:]
Available auth module files:
- src/auth/middleware.py
- src/auth/tokens.py
- src/auth/session.py
- tests/test_auth.py
- config/auth_config.yaml
[agent calls read_file("src/auth/session.py") only when investigating the session bug]
Total upfront: ~100 tokens. File loaded only when its content is actually needed.The agent with the just-in-time approach isn't working with less information. It's working with the right information at the right time - Anthropic's framing of what good context engineering actually means.
The practical implication for agent designers: rather than asking "what should I pre-load?", ask "what should my agent be able to fetch, and when?" Design tools for retrieval rather than context blocks for pre-loading. This shifts the cost from attention budget to tool call latency - a trade-off that's usually worth making.
For individual practitioners (not agent architects), the same principle applies at a smaller scale. You don't have to load everything you might need before a session. You can start with a tighter context and use follow-up prompts or explicit retrieval steps to pull in what you need once the task's actual shape becomes clear. It feels like starting with less. It usually produces better sessions.
The future of context engineering isn't bigger context windows. It's smarter assembly - knowing what to bring in, when to bring it in, and what to leave outside the window entirely.
If you've been working with LLMs for more than a year, you’ve built your intuitions around prompt engineering - the craft of wording instructions to get reliable outputs. That skill is still real and still matters. But it's no longer the whole story.
Anthropic published their Prompt Engineering guide in 2024. It's a solid reference - chain-of-thought, few-shot patterns, instruction clarity, and related techniques. In September 2025, they published "Effective Context Engineering for AI Agents," a post that explicitly frames context engineering as "the natural progression of prompt engineering." The evolution isn't a rebrand. It's a scope expansion.
The distinction how prompt engineering became a subset is worth being precise about: prompt engineering is the craft of wording your instructions. Context engineering is everything else - what you include in the session, what you exclude, how you structure it, how you maintain it over time. The wording of your prompt matters. But increasingly, what surrounds the prompt matters more.
Anthropic frames it this way: in the early days of LLM engineering, prompting was the biggest component of the work - most use cases required prompts optimized for one-shot tasks, and the primary focus was how to write effective prompts. As agents operating over multiple turns and longer time horizons became the norm, the work shifted. Now the question isn't just "what should I ask?" but "what should be present in the entire context window when I ask it?"
A well-worded prompt in a bloated context window will underperform a mediocre prompt in a clean, high-signal context. This is what the attention budget framing points at: the model's capacity to process your carefully worded instruction is constrained by everything else competing for attention in the same window.
Practically: if your current workflow treats context as a passive container for your prompt, context engineering asks you to make that container an active design decision.
Run this before your next session. Fourteen checks, organized by pillar.
Pillar 1 - System Prompt
Pillar 2 - Tools
Pillar 3 - Examples
Pillar 4 - Message History / Context
Pillar 4 - curating the message history - is the one we built Mesh for. If you're already curating context manually before your sessions, HiveTrail Mesh automates the pre-session half: pull from Notion and local files, scan for secrets, hit a token budget, and copy a clean payload to your clipboard for use with Claude, ChatGPT, Gemini, or any other LLM. The beta is open. Join the beta and tell us what you're building.
For practitioners who want to go deeper into Anthropic's own material, the Anthropic Cookbook covers the implementation side - memory management, context compaction, and tool clearing for production agents.
On this page
Anthropic defines context engineering as the practice of curating and maintaining the optimal set of tokens during LLM inference, including everything in the context window: system instructions, tools, examples, retrieved data, and message history. Where prompt engineering focused on how to write effective instructions, context engineering addresses the full configuration of what the model sees at any given time. Anthropic frames the goal as finding "the smallest possible set of high-signal tokens that maximize the likelihood of some desired outcome."
According to Anthropic's September 2025 post on effective context engineering , the four pillars are:
1. system prompts - encoding standing constraints at the right altitude, neither over-specified nor vague.
2. tools - designing tool sets where each tool has an unambiguous, non-overlapping purpose.
3. examples - using a small number of diverse, canonical demonstrations rather than exhaustive edge-case coverage.
4. message history - actively curating the context window over the life of a session rather than letting it accumulate passively.
The attention budget is Anthropic's framing for the model's context window as a finite resource to be allocated, not just filled. Anthropic grounds this in the transformer architecture itself: because every token attends to every other token, the number of pairwise relationships grows quadratically with context length. A context window with 100,000 tokens isn't uniformly processed - signal buried in long, dense context receives weaker representation than signal in a short, focused context. Managing the attention budget means asking, before every session, what the task actually requires - and excluding everything that doesn't meet that bar.
Prompt engineering is the craft of wording instructions to produce reliable outputs: phrasing, ordering, chain-of-thought scaffolding, and few-shot patterns. Context engineering is broader - it covers everything present in the session, including the system prompt, tools, examples, conversation history, and any retrieved documents. Anthropic explicitly frames context engineering as "the natural progression of prompt engineering." You can write a perfect prompt and still deliver poor results if the context around it is bloated, redundant, or stale.
Start with the four-pillar checklist above.
The highest-leverage first step for most practitioners is Pillar 1: audit your system prompt and separate standing context from task-specific context.
The second step is Pillar 2: look at your active tool list and eliminate any pair of tools where the selection logic is ambiguous.
These two changes are non-destructive. You're not deleting anything, just moving and clarifying, and they produce immediate, measurable improvements in session behavior.
Chroma tested 18 frontier LLMs and found every one degrades as input length grows. Here is what their context rot study proves developers must change.
Read more about Context Rot Is Real: What Chroma's 18-Model Study Found
Master the vocabulary of AI agents with this context engineering glossary. Discover 22 key terms, from context rot to attention budgets, to build better apps.
Read more about The Context Engineering Glossary: 22 Terms for AI Developers
A plain-English guide to Agentic Context Engineering (ACE). Learn how this evolving playbook framework prevents context collapse in self-improving AI agents.
Read more about Agentic Context Engineering (ACE) Explained: How Evolving Playbooks Fix Context Collapse