Ben
How two independent PR generation benchmarks pointed to the same conclusion about context quality - and why your model choice matters less than you think.
Here's a finding that should change how you think about AI tooling: in two independent experiments using real production code, a "budget" model fed rich context consistently outperformed flagship models operating on shallow git summaries. The budget model didn't just win. It won by a landslide, unanimously, against models that cost significantly more per token.
This isn't a post about which model is best. It's about why the question itself might be the wrong one to ask.
HiveTrail Mesh is a context assembly tool. One of its core features is PR Brief - it scans a git branch against a base branch, reads every changed file in full, assembles all diffs and commit metadata into a structured XML document, and hands it to an LLM. The output is typically a 100K–380K token document containing everything an LLM needs to write a comprehensive PR description.
We used this workflow as the basis for both experiments. The prompt in each case was deliberately simple:
Based on the staged changes / recent commits, write me a PR title and description.
No elaborate prompting. No chain-of-thought instructions. Just the raw context and a task.
The first experiment ran on the Git Tools feature - a substantial new addition to HiveTrail Mesh covering 27 commits across 32 files, with async XML generation, state management, UI components, and 41 new tests.
We ran three conditions:
Condition A - Claude Code (Sonnet 4.6), native git context. Claude Code ran `git log main..HEAD --oneline` and `git diff main...HEAD --stat` - the standard abbreviated approach. Generated in about 25 seconds.
Condition B - Haiku 4.5, Mesh context. Mesh assembled a 380KB XML file (~106K tokens) covering every changed file, diff, and commit. Haiku 4.5 received this in full.
Condition C - Sonnet 4.6, Mesh context. Same Mesh XML, same prompt, given to Sonnet 4.6.
Gemini 3 Pro evaluated all three as a senior software developer and product manager.
The verdict was unambiguous. The Mesh-fed PRs were called "significantly stronger" across every dimension: product context, workflow clarity, architectural structure, technical depth, and testing visibility. The Claude Code version was characterised as reading like "a rough draft or a quick brain dump before hitting Create Pull Request."
This wasn't a knock on Sonnet 4.6. It was a knock on what Sonnet 4.6 was given to work with.
Claude Code, like most agentic coding tools, acts like a developer who skims the commit titles and says "looks good to me." It reads summaries: which files changed, roughly how many lines, and what the commit subjects say. HiveTrail Mesh acts like the reviewer who actually pulls down the branch and reads every single file. The difference in output reflects that difference in reading.
Haiku 4.5 with full context outperformed Sonnet 4.6 with shallow context. A cheaper, faster model, given the complete picture, wrote a better PR than a more capable model working from a summary.
But here's the part that should really give you pause: Haiku 4.5 didn't just beat Sonnet 4.6's native shallow context - it beat Sonnet 4.6 when both were fed the exact same Mesh XML. The budget model outperformed the flagship on a level playing field.
Final ranking: Haiku 4.5 + Mesh first, Sonnet 4.6 + Mesh second, Claude Code with native git context third.
Several months later, we ran a second experiment on a completely different feature - the GitHub API integration for HiveTrail Mesh, covering 24 files and 22 commits.
The framing this time was sharper. The question wasn't "which model is best" - it was "can an agentic tool using native git context compete with the same model family when context is properly assembled?"
Gemini CLI was the subject under test. It has its own git tooling, can run shell commands, and is built by the same team behind the models it would be competing against. If any tool could close the context gap through smart native tool use, Gemini CLI was the candidate.
We set it against six Gemini models - ranging from Gemini 3 Fast to Gemini 3.1 Pro with high thinking - all fed via HiveTrail Mesh. We also added ChatGPT and Haiku 4.5 via Mesh as external reference points, since Haiku had won Experiment 1.
Three independent judges - Gemini 3 Pro, Opus 4.6, and ChatGPT - evaluated all nine PR texts blind, without knowing which model produced which.
Scoring: 9 points for 1st place, 1 point for last. Maximum possible: 27.
| Rank | Model | Gemini 3 Pro | Opus 4.6 | ChatGPT | Total |
|---|---|---|---|---|---|
| 1 | Haiku 4.5 + Mesh | 9 | 9 | 9 | 27 |
| 2 | Gemini Flash 3 preview (Thinking Low) + Mesh | 8 | 7 | 8 | 23 |
| 3 | Gemini 3 Fast + Mesh | 7 | 6 | 4 | 17 |
| 4 | Gemini 3.1 Pro preview (Thinking High) + Mesh | 2 | 8 | 6 | 16 |
| 5 (tie) | ChatGPT + Mesh | 6 | 1 | 7 | 14 |
| 5 (tie) | Gemini Flash 3 preview (Thinking High) + Mesh | 5 | 4 | 5 | 14 |
| 7 | Gemini 3.1 Flash Light preview (Thinking High) + Mesh | 3 | 5 | 3 | 11 |
| 8 | Gemini 3 Pro + Mesh | 4 | 3 | 2 | 9 |
| 9 | Gemini CLI (native context) | 1 | 2 | 1 | 4 |
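The point totals in the table follow directly from the per-judge ranks: with nine entries, first place earns 9 points and last earns 1, so each judge contributes `10 - rank`. A minimal sketch of that arithmetic:

```python
def total_points(ranks_by_judge, n_entries=9):
    """Sum the points an entry earns across judges.

    ranks_by_judge: the rank (1 = best) each judge assigned to this entry.
    With n_entries competitors, a rank of r is worth n_entries + 1 - r points.
    """
    return sum(n_entries + 1 - r for r in ranks_by_judge)

# Haiku 4.5 + Mesh was ranked first by all three judges:
print(total_points([1, 1, 1]))  # -> 27, the maximum possible

# Gemini CLI's per-judge points of 1, 2, 1 correspond to ranks 9, 8, 9:
print(total_points([9, 8, 9]))  # -> 4
```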
Two results stand out.
First, Haiku 4.5 received a perfect score - 9 from every judge, unanimously, with a 4-point gap over second place. All three judges independently placed it first for the same reasons: dedicated test coverage sections, specific method names and API behaviors called out by name, explicit reasoning behind architectural decisions, and reviewer notes that no other entry included. Opus 4.6 called it "the most complete and production-grade PR description" of the nine.
Second, and more telling: Gemini CLI finished last. Not second to last - last, with 4 points, behind every Mesh-fed entry, including smaller, cheaper Gemini variants. Its own model family, given better context by a different tool, beat it at every position in the table.
The reason is the same as Experiment 1. Gemini CLI ran `git log -n 10 --stat` and a few shell commands. Fast, low-cost, reasonable for most tasks - but it produced the same shallow picture. The resulting PR covered the surface of the changes without the architectural reasoning, edge case handling, or quantified test results that the Mesh-fed models could draw on because they had actually read the code.
It's worth noting that the Mesh PR Brief isn't just raw file content dumped into a prompt. It's structured XML - commits organized chronologically, files grouped by change type, diffs nested within their commit context. That structure helps LLMs navigate 100K+ token documents more efficiently than a flat wall of text would. So "full context" here means both more information and better-organized information. Both matter.
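The nesting described above - commits in chronological order, with each diff living inside its commit's context - can be sketched in a few lines. The tag names below are illustrative assumptions; the actual Mesh PR Brief schema isn't public.

```python
import xml.etree.ElementTree as ET

def build_brief(commits):
    """Build a toy PR-brief document from commit data.

    commits: chronologically ordered list of dicts with keys
    'sha', 'subject', and 'files' (a mapping of path -> diff text).
    Tag names are hypothetical, not Mesh's real schema.
    """
    root = ET.Element("pr_brief")
    commits_el = ET.SubElement(root, "commits")
    for c in commits:  # input order is preserved, keeping the history chronological
        commit_el = ET.SubElement(commits_el, "commit", sha=c["sha"])
        ET.SubElement(commit_el, "subject").text = c["subject"]
        files_el = ET.SubElement(commit_el, "files")
        for path, diff in c["files"].items():
            file_el = ET.SubElement(files_el, "file", path=path)
            # The diff is nested inside its commit, so the model sees
            # each change alongside the commit that introduced it.
            ET.SubElement(file_el, "diff").text = diff
    return ET.tostring(root, encoding="unicode")

sample = [{"sha": "abc123", "subject": "Add async XML generation",
           "files": {"src/gen.py": "+async def generate(): ..."}}]
print(build_brief(sample))
```

The point of the nesting is that a model scanning the document never has to reconstruct which diff belongs to which commit - the hierarchy already encodes it.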
After the main competition, we ran Claude Code on the same feature - not as a competitor, but as a consistency check. Same pattern as Experiment 1: a short, surface-level PR based on abbreviated git output. The shallow-context behavior isn't specific to any one tool or vendor. It's structural - it's what happens when speed is optimized over depth of reading.
Context quality sets the ceiling. Model choice determines where within that ceiling you land.
Run both experiments side by side, and the picture is hard to argue with.
Experiment 1 tested the context delivery method with the same model family. Mesh-assembled context won over native git context regardless of model tier - and the budget model beat the flagship even on a level playing field.
Experiment 2 tested whether a sophisticated agentic tool could close that gap through smart native tool use. It couldn't - and it finished last against its own model family.
Different features. Different PR Briefs. Different competitive sets. Different judges. The only constant was the relationship between context quality and output quality.
When an AI tool reads a few lines of git log to write a PR, it isn't producing a poor result because it's a bad model. It's producing a poor result because it has been given a poor picture of what changed and why. Give any capable model the full picture - every file, every diff, every commit, structured and organized - and the output improves dramatically.
The implication runs both ways. A "budget" model with rich context outperforms a flagship with shallow context. And a flagship with shallow context produces flagship-priced shallow output.
If you're using AI tools for PR descriptions today, the most impactful change probably isn't switching models - it's changing what the model gets to read.
Agentic coding tools are optimized for speed and low token cost - they read summaries, not full file content. That's the right tradeoff for interactive coding tasks, where you want fast feedback and low latency. For a PR covering 20+ files and weeks of work, summary-level context produces summary-level output.
The alternative is deliberate context assembly before you prompt: read every changed file in full, preserve the diff structure, organize commits chronologically, package everything in a format the LLM can navigate. You could build a script to do this - pull every changed file, run the diffs, and format it into structured XML. It's achievable engineering. It's also a few days of work to do properly, and more to maintain as your codebase evolves.
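A rough sketch of such a script, split into a pure formatting step and a git-gathering step. This is not Mesh's implementation - just the same three ingredients (full file contents, per-file diffs, chronological commit log) wrapped in an assumed XML-ish layout:

```python
import subprocess

def format_context(commit_log, files):
    """Pure formatting step: files is a list of (path, full_content, diff)."""
    parts = ["<context>", "<commits>", commit_log, "</commits>"]
    for path, content, diff in files:
        parts += [f'<file path="{path}">',
                  "<content>", content, "</content>",  # full file, not a summary
                  "<diff>", diff, "</diff>",
                  "</file>"]
    parts.append("</context>")
    return "\n".join(parts)

def assemble_context(base="main"):
    """Gathering step: must run inside a git repository."""
    def git(*args):
        return subprocess.run(["git", *args], capture_output=True,
                              text=True, check=True).stdout
    # --reverse gives oldest-first, i.e. chronological order
    log = git("log", f"{base}..HEAD", "--reverse", "--format=%H %s")
    files = []
    for path in git("diff", "--name-only", f"{base}...HEAD").splitlines():
        diff = git("diff", f"{base}...HEAD", "--", path)
        try:
            content = open(path, encoding="utf-8").read()
        except OSError:
            content = ""  # deleted or binary file
        files.append((path, content, diff))
    return format_context(log, files)
```

Even this toy version shows where the maintenance cost comes from: binary files, renames, token budgets, and monorepo layouts all need handling that isn't shown here.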
That's exactly why we built HiveTrail Mesh's PR Brief. Point it at a branch and within seconds it has scanned every changed file, assembled the diffs, and produced a structured 100,000+ token XML document - faster than most agentic tools complete their own context gathering. The remaining time in the workflow is just the LLM responding, which varies by model (a few seconds for smaller models, up to ~30 seconds for the larger ones). The total end-to-end time is competitive with agentic coding tools - with dramatically better output to show for it. Use any LLM you prefer: Claude, Gemini, ChatGPT, whatever fits your workflow. The model choice, as these experiments suggest, matters less than you might expect.
For teams where PRs serve as living documentation, get reviewed by multiple people, or feed downstream into release notes, the tradeoff is straightforward. For a solo developer pushing a two-file fix, probably not worth it.
In the spirit of intellectual honesty:
Prompt engineering. Both experiments used a minimal prompt. A carefully crafted prompt might narrow the gap somewhat - though we'd expect the ceiling to remain lower without full file content. It's also worth noting that the Mesh PR Brief's structured XML format is itself a form of context organization: commits are sequenced chronologically, files are grouped by change type, and diffs are nested within their commit context. That structure likely helps LLMs parse large documents more efficiently than flat CLI output would.
Other writing tasks. Both experiments focused on PR descriptions. Commit messages, technical documentation, and code review summaries likely follow the same pattern, but we haven't tested them.
Newer model releases. These experiments used models current at the time of testing. Rankings will shift as new models release - though the underlying dynamic (context quality determines ceiling) should hold.
Cost efficiency. Haiku 4.5 is significantly cheaper per token than most of the models it beat. The cost-per-quality-point story is compelling, but token pricing changes frequently enough that any number we published here would be stale quickly.
The most useful takeaway from two experiments isn't a model recommendation. It's a workflow question worth asking before you prompt: what does the model actually see?
If the answer is "a handful of commit subject lines and a diffstat," you've already constrained the output - regardless of which model is on the other end.
The models are good enough. The context is usually the bottleneck.
HiveTrail Mesh is a context assembly tool for developers and product teams. PR Brief assembles a token-optimized, structured XML document from your git branch - ready to paste into any LLM. Try the beta.
Not necessarily. Our experiments show that model tier is less determinative than context quality. A budget model like Claude Haiku 4.5 fed a complete, structured context document consistently outperformed flagship models working from abbreviated git summaries - and in one test, outperformed a more expensive model even when both received identical context. The ceiling on output quality is set by what the model is given to read, not by the model's capability alone.
Context assembly is the process of gathering, structuring, and formatting all relevant information before passing it to an LLM. For PR descriptions, this means reading every changed file in full, collecting all diffs, organizing commit metadata chronologically, and packaging it into a structured format that the model can navigate efficiently. Most agentic coding tools skip this step. They read git summaries rather than full file content. The difference in output quality between summary-level context and fully assembled context is significant and consistent across multiple experiments and model families.
They can produce reasonable PR descriptions, but they operate under a structural constraint: they're optimized for speed and low token cost, so they read abbreviated git output rather than full file content. In both experiments documented here, agentic tools using native git context finished behind every model that received a fully assembled context, including smaller, cheaper models from the same family. For a two-file fix, the difference may not matter. For a PR covering 20+ files and multiple weeks of work, the gap in output quality is substantial.
Structure helps LLMs navigate large documents more efficiently. When context is organized - commits sequenced chronologically, files grouped by change type, diffs nested within their commit context - the model can locate and reason about related information without having to reconstruct relationships from a flat wall of text. At 100,000+ tokens, the organization isn't cosmetic. It affects how accurately the model can synthesize architectural decisions, test coverage, and cross-file dependencies into a coherent PR description.
The core approach involves three steps: read every changed file in full (not just the diffstat), preserve the diff structure with enough surrounding context to understand intent, and organize commits chronologically rather than presenting them as a flat list. This can be scripted - pull changed files via git diff --name-only, read each one, format everything into a structured document. Done properly, it takes a few days of engineering work and ongoing maintenance as your codebase evolves. HiveTrail Mesh automates this workflow: point it at a branch, and it assembles a structured XML document covering all changed files, diffs, and commit metadata in seconds, ready to pass to any LLM.