HiveTrail Logo HiveTrail
Isometric diagram showing raw developer sources like Git, Notion, and local files being filtered into a structured LLM context stack with XML tags for task, context, diff, and format.

How to Build an LLM Context Stack: A Practical Playbook for Developers (2026)

Avatar for Ben Ben

Here is a pattern that repeats dozens of times a day in developer workflows everywhere. A developer opens Claude or ChatGPT, copies a chunk of code from their editor, pastes a Notion spec, adds a rough description of what they want, and sends it. The output is mediocre - generic, missing context, wrong in ways that take longer to fix than to write from scratch. So the developer refines the prompt, adds more text, and pastes more files. The output gets marginally better. Twenty minutes have passed.

The problem is almost never the model. It is the context.

As Philipp Schmid, Technical Lead at Hugging Face, put it plainly in his widely-read 2025 post on context engineering :

"Most agent failures are not model failures anymore, they are context failures."

And Andrej Karpathy - former Director of AI at Tesla and co-founder of OpenAI - framed the discipline even more sharply when he endorsed the term "context engineering" on X in June 2025:

"In every industrial-strength LLM app, context engineering is the delicate art and science of filling the context window with just the right information for the next step."

This post is about building that discipline into your personal LLM workflow - not as an abstract concept, but as a concrete, repeatable context engineering workflow you can apply to any task. We will cover what a context stack is, where to source its components, how to budget tokens wisely, why structure matters, and how to add a privacy layer before anything leaves your machine. By the end, you will have a practical playbook you can apply starting today.

What is a context stack (and why you need one)

The context window is the model's working memory - the total number of tokens it can process in a single request, shared between your input and its output. As of 2026, that window ranges from 128K tokens (DeepSeek, Mistral) to 10 million (Llama 4 Scout), with Claude's 200K and GPT-5.2's 400K sitting in the mid-tier.

A context stack is different. It is the curated set of information sources you assemble before you open an LLM session - the deliberate process of deciding what goes in, in what order, and in what structure. Think of the context window as a whiteboard. The context stack is the act of deciding, before the meeting, exactly which diagrams, reference sheets, and notes to pin on it.

Most developers skip this step entirely. They treat the context window like a junk drawer - pasting in whatever seems related and hoping the model figures it out. This works for trivial tasks. For anything complex, it guarantees degraded output.

There are four layers to a well-built context stack:

  1. Task definition - exactly what you want the model to do, and any hard constraints
  2. Project knowledge - the background context that the model needs to understand your specific situation
  3. Task-specific data - the files, diffs, records, or documents directly relevant to this request
  4. Output format - the structure, tone, length, or schema you expect in the response

These four layers map to how a skilled human consultant would approach a complex task: first understand the goal, then understand the environment, then review the specific case, then produce structured output. Every LLM task benefits from this same mental model.

The science behind why context quality beats context quantity

Before we get into the mechanics, it is worth understanding why this matters at the research level - because the findings are counterintuitive and somewhat alarming.

In July 2025, Chroma Research published a landmark study: Context Rot: How Increasing Input Tokens Impacts LLM Performance . The researchers tested 18 frontier models - including GPT-4.1, Claude Opus 4, Gemini 2.5 Pro, and Qwen3-235B - and reached a finding that should change how every developer thinks about context:

"We demonstrate that even under these minimal conditions, model performance degrades as input length increases, often in surprising and non-uniform ways."

Every single model they tested got worse as the input length increased. Not some. Not most. All 18. And the degradation began well before the context window was close to full - a model with a 200K token window showed measurable performance decay at 50K tokens.

This is not a bug. It is an architectural property of transformer-based attention. As you add tokens, the signal-to-noise ratio in the model's attention mechanism deteriorates. More input means more relationships the model has to compute between tokens, and the important signal competes with an ever-larger pool of noise.

The Chroma study also identified a particularly counterintuitive finding: models perform better on shuffled context than on coherent, well-structured context when that context is long. Logical document flow actually hurts retrieval performance, because a coherent narrative creates stronger positional patterns that bias the attention mechanism away from the actual target information.

This compounds a separate, well-established finding. In 2024, Nelson F. Liu and colleagues at Stanford published "Lost in the Middle: How Language Models Use Long Contexts" in the Transactions of the Association for Computational Linguistics. They found that LLM performance follows a U-shaped curve across input positions: models attend strongly to the beginning and end of context, and significantly less to the middle. When the answer document was placed at position 10 out of 20 documents in a question-answering task, GPT-3.5-Turbo's accuracy dropped by more than 20 percentage points compared to when the answer was at position 1 or position 20.

The practical implication is stark: if you paste a long project spec, then a git diff, then a README, then your actual question, the model may largely ignore the middle material, even though it "fits" in the context window.

The solution is not a better model. The solution is a better context stack.

The 5 sources most developers already have (and underuse)

Building a context stack does not require new tools, infrastructure, or RAG pipelines. The raw material is already sitting in your existing workflow. Here is how to think about each source and when to use it.

1. Notion pages and internal documentation

If you use Notion for specs, design docs, decision logs, or sprint planning, you are sitting on the single most valuable context source for product and feature work. Yet most developers either paste raw Notion content (which includes formatting noise, sidebar metadata, and orphaned draft text) or ignore it entirely.

What to extract: the specific sections relevant to the current task. For a PR description, that might be two paragraphs from the feature spec and the acceptance criteria. For a code review, it might be the architecture decision record. Not the whole document - the relevant slices.

When not to include it: if the page is outdated, heavily formatted, or longer than ~1,500 tokens and only tangentially related to the task. Stale context is worse than no context - the model treats it as signal and reasons from incorrect assumptions.

2. Git diffs and commit history

Git output is an underused context gold, especially for developer-facing tasks like PR descriptions, release notes, code reviews, and regression analysis. A well-scoped git diff of the changed files, combined with the last 5 commit messages, gives a model nearly everything it needs to understand what changed, why, and how it fits into recent history.

The key is scoping. git diff HEAD~1 of a large codebase will produce thousands of tokens of noise. git diff HEAD~1 -- src/services/github_service.py gives you a targeted signal. Specificity is the discipline.

Here is the practical toolkit for assembling Git context:

# Get the diff for only the files you changed on this branch
git diff main...HEAD -- $(git diff --name-only main...HEAD)

# Scope to a specific file or directory
git diff main...HEAD -- src/services/github_service.py

# Get the commit narrative for this branch (concise, one line per commit)
git log main...HEAD --oneline

# Count approximate tokens before pasting (rough estimate: 1 token ≈ 4 chars)
git diff main...HEAD -- src/services/ | wc -c
# Divide by 4 for a token estimate. If over 2,000, scope more tightly.

# Show only changed function signatures without full body (useful for large files)
git diff main...HEAD -- src/services/github_service.py | grep "^[+-].*def "

These two together - git diff scoped to changed files plus git log --oneline - typically produce 500–2,000 tokens of extremely high-density context. That is more useful information per token than almost any other source in your stack.

3. Local files: specs, READMEs, configs, and schemas

Local files are the forgotten layer. Developers habitually paste code snippets into LLM chats, but rarely include the broader file context that explains why the code is written the way it is - the architecture notes in a README, the data schema that a function operates on, the config file that controls feature flags.

For code tasks, the most useful local files are usually: the file being modified (scoped to the relevant functions, not the whole file), any type definitions or interfaces it imports, and the test file for that module. This trio gives the model the implementation, the contract, and the expected behavior - the three things it needs to produce correct code.

For writing tasks (docs, PR descriptions, release notes), the most useful local files are usually the previous version of the document being updated and any style guide or template you want the output to follow.

4. Previous LLM output and conversation checkpoints

This is the most underused source of all. If you have already used an LLM to produce a structured analysis, a draft spec, or a set of decisions, that output is a compressed, high-quality summary of prior reasoning. Feeding it back into a subsequent task saves both tokens and cognitive work.

A practical pattern: after any significant LLM session, extract and save a "context summary" - a brief, structured document capturing the key decisions, constraints, and outputs from that session. Next time you return to the same feature or problem, this becomes the top-level context block in your stack.

This is especially valuable for Claude Code or Cursor workflows, where sessions can span multiple days. The manual checkpoint discipline - saving a HANDOFF.md before clearing context - is one of the highest-leverage habits in a modern dev workflow. The dev.to post "I Spent Four Weeks Reading 200+ Sources on Context Engineering" documented this concretely: a single 805-line SESSION.md was consuming 17,000 tokens on every session start, loading irrelevant historical context into every conversation. After trimming it to 59 lines, task benchmark scores jumped 32 points.

5. Reusable context blocks

Reusable context blocks are the equivalent of code snippets for context engineering. They are pre-written, curated descriptions of recurring context elements - your tech stack, coding conventions, project architecture, output templates, or persona definitions. You write them once and pull them into any relevant task.

Examples: a 300-token description of your service architecture that you include whenever asking for code that touches multiple services; a 200-token output format spec that defines the PR description template you want Claude to follow every time; a 150-token description of your team's coding standards that prevents the model from generating patterns you will just reject.

The return on investment for reusable context blocks is asymmetric: 30 minutes writing them pays dividends across hundreds of future tasks.

Token budgeting: how to decide what stays and what gets cut

Knowing your sources is not enough. The harder discipline is deciding how much of each source to include, and in what order.

The most actionable guidance comes from production data. Based on the analysis of Anthropic and Manus production systems, published by AI practitioner Dex Horthy, the safe zone for context utilization is below 60% of the available window. At 70% utilization, precision begins to drop. At 85%, hallucination rates rise meaningfully.

Horizontal infographic bar showing LLM token budgeting zones: a green optimal zone below 60% utilization, an amber zone at 70% where precision drops, and a red zone at 85% where AI hallucinations begin.

This means a developer using Claude with a 200K token context window should aim to keep total input context below 120K tokens - and in practice, far below that for most tasks. The gap between focused and unfocused context is not marginal. Research compiled from the LongMemEval benchmark found that a focused 300-token context outperformed an unfocused 113,000-token context on the same task. The model with less context - but the right context - produced better results than the model with a context window 377 times larger. This is the defining empirical argument for context stack discipline: more is not better, right is better.

A focused 5,000-token context stack will almost always outperform a sprawling 80,000-token dump.

This validates what the Stanford research documented empirically. A model does not become more capable because you fed it more text. It becomes more distracted.

The ordering rule: the "lost in the middle" effect has a direct practical corollary for how you assemble your stack. Always put your highest-priority context at the beginning or end of your input, never in the middle. Specifically:

  • Opening block: task definition and constraints (what you want, what it must and must not do)
  • Middle blocks: project knowledge and supporting files (where degraded attention is acceptable because this context is background, not critical)
  • Closing block: task-specific data (the diff, the function, the document to be modified) - put this as close to your actual question as possible

This ordering ensures the model's strongest attention lands on the most critical inputs: the instructions and the specific target.

A practical token budget framework for common tasks:

TaskTask definitionProject contextTask-specific dataTotal target
PR description~200 tokens~500 tokens~1,500 tokens (diff + commits)~2,500
Code review~200 tokens~300 tokens~2,000 tokens (file + tests)~2,500
Feature spec draft~300 tokens~800 tokens~500 tokens (similar prior specs)~1,600
Bug investigation~200 tokens~400 tokens~1,500 tokens (error + relevant code)~2,100

Note that even the most context-heavy of these tasks stays well under 3,000 tokens. That is not a constraint - it is the point. Focused, high-signal context at 3,000 tokens will produce better output than unfocused context at 30,000 tokens, because the signal-to-noise ratio is dramatically higher.

Structuring for the model: why XML context outperforms plain text

How you format your context stack matters as much as what you put in it. A plain text dump - even a well-curated one - forces the model to spend attention inferring boundaries between components. Explicit structure eliminates that overhead.

Anthropic's own prompt engineering documentation recommends XML tags for structuring complex inputs, specifically because Claude parses them more accurately than other delimiter formats. The guidance applies broadly: XML-style tags help any model parse multi-component inputs by providing semantic clarity about what each block contains and why it is there.

Here is the contrast in practice. This is a typical unstructured paste:

I need to write a PR description. Here is the diff: [paste of git diff] Here is the spec: [paste of Notion content] Here is the previous PR description format we use: [paste of template]

Here is the same content as a structured context stack:

Split-screen comparison showing a chaotic plain text dump of developer code and specs on the left, contrasted with a cleanly structured LLM context stack on the right using XML tags for task, output-template, and git-diff.
<task>
  Write a PR description following the format in <output_template>.
  The change should be described from the perspective of the reviewer.
</task>

<output_template>
  ## Summary
  [1-2 sentences describing what changed and why]

  ## Changes
  - [bullet per meaningful change]

  ## Testing
  [what was tested and how]

  Closes #[issue number]
</output_template>

<feature_spec>
  [relevant 2 paragraphs from the Notion spec]
</feature_spec>

<git_diff>
  [scoped diff of changed files]
</git_diff>

The structured version does several things the unstructured version cannot. It tells the model the role of each block before the model reads it. It separates the instruction from the data from the output contract. It eliminates ambiguity about where one component ends and another begins.

Research cited by FlowHunt found that strategic use of descriptive XML tags can improve AI response quality by up to 40% for complex, multi-component tasks. The gain is largest when prompts include multiple distinct types of information, which is exactly the definition of a context stack.

The naming of your tags matters. <feature_spec> is clearer than <doc>. <git_diff> is clearer than <code>. <output_template> is clearer than <format>. The model reasons better when tag names communicate the purpose of the content, not just its type.

The privacy layer: what to scrub before your context leaves your machine

A context stack pulled from Notion, Git, and local files will frequently contain material that should not be sent to a third-party API: API keys, authentication tokens, database credentials, internal employee names, customer email addresses, financial data, or proprietary architecture details.

The risk is not hypothetical. As context stacks grow richer and more comprehensive, the surface area for accidental data exposure grows with them. A .env file pasted alongside application code, a Notion page that includes a vendor contract summary, a git diff that touches a config file - each of these is a potential privacy incident waiting to happen.

Before any context stack leaves your machine, run a brief mental checklist:

Strip these automatically:

  • Environment variables and .env file content
  • API keys, tokens, secrets, and credentials (even if they look expired)
  • Database connection strings
  • Internal IP addresses and hostnames in infrastructure files

Review these contextually:

  • Customer or user data (names, emails, IDs, transaction records)
  • Employee personal information
  • Vendor contracts, pricing, and commercial terms
  • Internal financial data or projections

Safe to include:

  • Anonymized code and logic
  • Technical architecture descriptions that do not reveal the security posture
  • Public documentation and specs
  • Your own notes and task descriptions

A practical rule: if the data would require a legal review before you shared it with a contractor, it should not go into an LLM context without explicit approval. Most LLM API providers retain input data for safety and moderation purposes, and enterprise compliance teams are increasingly auditing AI usage for exactly this category of exposure.

The privacy layer is not an obstacle to building a rich context stack - it is what makes it safe to build a rich one. When your scrubbing step is fast and systematic, you can be more aggressive about including genuinely useful context.

Before/after: the same task, two ways

The best way to understand the value of a context stack is to see it applied to a real task. Below is a PR description workflow - one of the highest-frequency developer tasks - executed with and without a structured context stack.

The task: write a PR description for a bug fix in a GitHub service that was returning stale cached data after a user revoked a token.

Split-screen developer UI comparing a brief, generic AI response without a context stack to a highly detailed, structured PR description generated using a comprehensive LLM context stack.

Without a context stack (the typical approach)

The developer opens Claude, types:

"Write a PR description for my bug fix. It was a caching issue in the GitHub service."

Output (representative):

## Summary
Fix caching issue in GitHub service.

## Changes
- Fixed bug with caching
- Updated cache logic

## Testing
Tested locally.

This is the output that makes developers distrust AI writing assistance. It is generic, it contains no specific information, and it would be rejected in any serious code review. The developer then spends 10 minutes adding the specifics manually - at which point they might as well have written it from scratch.

With a structured context stack

The developer takes 3 minutes to assemble the following:

<task>
  Write a PR description for a bug fix. Follow the output_template format.
  The reviewer is a senior backend engineer who will want to understand
  the root cause and verify the fix is complete.
</task>

<output_template>
  ## Summary
  [1-2 sentences: what changed and why]

  ## Root cause
  [Brief technical explanation of what was broken]

  ## Fix
  [What the fix does and why it works]

  ## Testing
  [How it was tested]

  Closes #[issue number]
</output_template>

<feature_context>
  GitHub service caches API responses to reduce rate limit exposure.
  Cache TTL is 5 minutes. Token revocation is handled by the auth service
  and emits a `token.revoked` event on the internal event bus.
</feature_context>

<git_diff>
  diff --git a/services/github_service.py b/services/github_service.py
  -    def get_repos(self, user_id: str) -> list[Repo]:
  -        cached = self.cache.get(f"repos:{user_id}")
  -        if cached:
  -            return cached
  +    def get_repos(self, user_id: str) -> list[Repo]:
  +        token = self.auth.get_token(user_id)
  +        if token is None:
  +            self.cache.invalidate(f"repos:{user_id}")
  +            raise TokenRevokedError(user_id)
  +        cached = self.cache.get(f"repos:{user_id}")
  +        if cached:
  +            return cached
</git_diff>

<recent_commits>
  abc1234 Add token.revoked event listener to auth service
  def5678 Implement cache invalidation interface on CacheService
</recent_commits>

Output:

## Summary
Fix stale cache returns in GitHub service after token revocation, where
previously revoked users could still receive cached repo data for up to
5 minutes post-revocation.

## Root cause
The `get_repos` method checked the cache before validating the user's
token state. When a token was revoked via the auth service, the cache
was not invalidated, allowing the stale response to be served until
the TTL expired.

## Fix
Added a token validity check before the cache lookup. If the token is
absent (revoked), the cache entry is immediately invalidated and a
`TokenRevokedError` is raised. This ensures revoked users receive a
correct error response rather than stale data, regardless of cache TTL.

## Testing
- Verified cache invalidation fires on `token.revoked` event (unit test)
- Confirmed `TokenRevokedError` raised for revoked user (integration test)
- Tested TTL expiry path still returns fresh data for valid tokens

Closes #412

This output requires no editing. It accurately describes the root cause from the diff, correctly characterizes the fix, and reflects the testing from the commit history. The total assembly time was under 3 minutes; the editing time was zero.

The difference is not the model. The same model produced both outputs. The difference is the context stack.

This is exactly what HiveTrail Mesh was built to do. Mesh assembles your context stack automatically - pulling from Notion, Git, and local files into structured XML, with privacy scanning built in. It's in beta and free to try. Join the beta →

What to read next

Frequently Asked Questions

What is an LLM context stack?

An LLM context stack is the curated set of information sources you assemble before an LLM session. A deliberate process of selecting, structuring, and ordering the context you feed into the model for a specific task. It includes task definition, project background, task-specific data, and output format instructions. Unlike the context window (the model's working memory), the context stack is what you build to fill that window intelligently.

How much of the context window should I fill?

Keep total context utilization below 60% of the available window for reliable results. Based on production data from Anthropic and Manus systems: at 70% utilization, precision begins to drop; at 85%, hallucination rates increase. For Claude Sonnet 4.6 with its 200K token window, this means keeping input context below approximately 120K tokens, and in practice, most single tasks can be handled in 2,000–5,000 tokens of focused context.

Why does LLM output quality degrade with more context?

The primary mechanism is attention dilution combined with the "lost in the middle" effect. As context grows longer, the transformer's attention mechanism computes more pairwise token relationships, and the signal-to-noise ratio falls. Chroma's 2025 study found that every one of 18 tested frontier models degraded as input length increased. This is an architectural property, not a bug that model updates will fix. Stanford's "Lost in the Middle" research showed an additional effect: models attend poorly to information in the middle third of long prompts, with accuracy drops exceeding 20 percentage points compared to content at the beginning or end.

What should I include in my LLM context?

In priority order:

  1. Task goal and hard constraints - always include this first.
  2. Output format or template - so the model knows the target structure before reading the data.
  3. Project background - the minimal architecture/spec context needed to understand your specific situation.
  4. task-specific data - the actual code, diff, document, or record you want the model to work with. Include only what is directly relevant to the current task. When in doubt, leave it out.

Is XML better than plain text for LLM context?

For multi-component contexts, anything with more than one distinct type of information, yes. Anthropic explicitly recommends XML tags for structuring complex Claude inputs. XML tags communicate the role of each block before the model reads the content, reducing the attention overhead spent on inferring boundaries. The benefit scales with task complexity: a simple one-paragraph question gains nothing from XML. A four-component context stack gains substantially.

What's the difference between a context stack and RAG?

RAG (Retrieval-Augmented Generation) is an automated, infrastructure-level pattern: a system automatically retrieves relevant content from a vector database or search index and injects it into each model call at runtime.

A context stack is a curated, human-in-the-loop assembly process: you select and structure the relevant sources for a specific task.

RAG is appropriate for production applications that need to handle many different queries automatically. A context stack is appropriate for individual developer workflows where you have direct knowledge of what context is most relevant. They solve overlapping but distinct problems.

How do I prevent PII leaks when using LLMs with company data?

Before assembling your context stack, run a systematic scrub:

  • remove all environment variables, API keys, tokens, and credentials.
  • remove or anonymize customer-identifying data (names, emails, IDs).
  • flag any vendor contract or financial information for review.

The safest approach is a rule-based scanner that runs on all context before it is exported, catching secret patterns, email formats, and credential-like strings automatically.

For teams, consider making this a required step in any LLM workflow that touches production data or customer records.

Like this post? Share it:

Related Posts

A split-screen comparison showing how to fix Claude Code context rot. The left side shows a broken funnel overflowing with raw inputs like Notion docs and Git logs, pushing the context window gauge into the red. The right side shows a clean, structured XML capsule keeping the context window gauge in the calm green zone.

Claude Code Context Window Rot: Why Sessions Get Dumber (And How to Fix It)

Claude Code sessions degrade silently - not from bugs, but from context rot. Here's the science, the symptoms to spot early, and the fix that works upstream.

Read more about Claude Code Context Window Rot: Why Sessions Get Dumber (And How to Fix It)
A developer at a desk points to three glowing document cards  labelled spec, diff, and issue — selected from a large pile of  chaotic, urgent-tagged documents on the left — while an AI chat  window on the right begins generating a response.

Context Engineering for Developers: A Practical Guide (2026)

Master context engineering for developers. Learn the step-by-step workflow to assemble LLM context stacks, avoid context rot, and get better AI outputs.

Read more about Context Engineering for Developers: A Practical Guide (2026)