<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Context Engineering on Roland Huß</title><link>https://ro14nd.de/tags/context-engineering/</link><description>Recent content in Context Engineering on Roland Huß</description><generator>Hugo -- gohugo.io</generator><language>en</language><copyright>© 2026 Roland Huß</copyright><lastBuildDate>Sat, 18 Apr 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://ro14nd.de/tags/context-engineering/index.xml" rel="self" type="application/rss+xml"/><item><title>Context engineering 101</title><link>https://ro14nd.de/context-engineering-101/</link><pubDate>Sat, 18 Apr 2026 00:00:00 +0000</pubDate><guid>https://ro14nd.de/context-engineering-101/</guid><description>&lt;p&gt;Everybody talks about prompt engineering. Write better prompts, get better results. That framing was useful once, but it misses the point for coding agents. The prompt is maybe five percent of what determines whether a session goes well or falls apart. The rest is context: what the agent sees when it starts working, how that context evolves over the session, and what happens when it grows too large for the model to track.&lt;/p&gt;
&lt;p&gt;Andrej Karpathy &lt;a href="https://x.com/karpathy/status/1937902205765607626" target="_blank" rel="noreferrer"&gt;named this&lt;/a&gt; &amp;ldquo;context engineering&amp;rdquo; in mid-2025, and the term stuck because it describes something real. You&amp;rsquo;re not just writing prompts. You&amp;rsquo;re engineering the entire information environment the agent operates in. That includes your project structure, your &lt;code&gt;AGENTS.md&lt;/code&gt; files, the git state, the conversation history, and everything the agent discovers as it works. Get this right and the agent feels like a capable collaborator. Get it wrong and you&amp;rsquo;ll spend more time correcting it than doing the work yourself.&lt;/p&gt;
&lt;p&gt;What follows are lessons from months of daily Claude Code use. While the examples are Claude Code-specific, the principles apply to any coding agent. Not all of them will apply directly to your setup, but I wanted to share what I&amp;rsquo;ve learned so far. Maybe some of it saves you a few wrong turns.&lt;/p&gt;

&lt;h2 class="relative group"&gt;Context fatigue is real
 &lt;div id="context-fatigue-is-real" class="anchor"&gt;&lt;/div&gt;
 
 &lt;span
 class="absolute top-0 w-6 transition-opacity opacity-0 -start-6 not-prose group-hover:opacity-100 select-none"&gt;
 &lt;a class="text-primary-300 dark:text-neutral-700 !no-underline" href="#context-fatigue-is-real" aria-label="Anchor"&gt;#&lt;/a&gt;
 &lt;/span&gt;
 
&lt;/h2&gt;
&lt;p&gt;Large context windows are both a gift and a trap. Claude Code gives you a million tokens of context, and you&amp;rsquo;d think you&amp;rsquo;d never run out. In practice, model performance degrades long before you hit the limit. It&amp;rsquo;s not a cliff, more like a gradual fog. Responses get less precise, the agent starts forgetting instructions from earlier in the conversation, and it suggests approaches you already tried and rejected fifty messages ago.&lt;/p&gt;
&lt;p&gt;Give it some rest. A &lt;code&gt;/clear&lt;/code&gt; from time to time is the equivalent of a good night&amp;rsquo;s sleep. Start a fresh session when you switch to a different task, or when you notice the agent struggling with things it handled well earlier. I used to resist this because starting over felt wasteful: all that accumulated context, all those decisions, gone. But a fresh session with a clear problem statement almost always outperforms a tired session with a thousand messages of history.&lt;/p&gt;
&lt;p&gt;This matters more with larger context windows, not less. When the window was small, sessions naturally ended before quality degraded too far. With a million tokens, you can keep going well past the point where the agent is still being useful. The context window tells you how much the model can hold. It doesn&amp;rsquo;t tell you how much it can hold while still being sharp.&lt;/p&gt;
&lt;p&gt;There&amp;rsquo;s a related pattern worth calling out: when you&amp;rsquo;re stuck on a problem, the worst thing you can do is keep hammering at it in the same session. The agent has already committed to an approach. It&amp;rsquo;s anchored on assumptions from earlier in the conversation that might be wrong. Open a new session, describe the problem from scratch, and don&amp;rsquo;t mention what you&amp;rsquo;ve already tried. A fresh session often finds a solution in minutes that the previous one couldn&amp;rsquo;t find in an hour, not because the model got smarter but because you removed the accumulated bias. If I&amp;rsquo;ve spent more than three back-and-forth exchanges on something without progress, I start over. The cognitive cost of re-explaining the problem is much lower than continuing down a dead end.&lt;/p&gt;

&lt;h2 class="relative group"&gt;Understand before you apply
 &lt;div id="understand-before-you-apply" class="anchor"&gt;&lt;/div&gt;
 
 &lt;span
 class="absolute top-0 w-6 transition-opacity opacity-0 -start-6 not-prose group-hover:opacity-100 select-none"&gt;
 &lt;a class="text-primary-300 dark:text-neutral-700 !no-underline" href="#understand-before-you-apply" aria-label="Anchor"&gt;#&lt;/a&gt;
 &lt;/span&gt;
 
&lt;/h2&gt;
&lt;p&gt;This one bit me early. Claude suggests something that sounds sophisticated, maybe a design pattern you haven&amp;rsquo;t seen before, or an architectural decision that looks reasonable. Your instinct is to accept it and move on, which makes it especially dangerous: in a typical session you accept 80 to 90 percent of proposals unchanged, so unfamiliar patterns slip through easily.&lt;/p&gt;
&lt;p&gt;If you can&amp;rsquo;t explain why a particular approach is the right one, you can&amp;rsquo;t debug it when it breaks. And it will break, usually at the worst possible time, in a part of the codebase you haven&amp;rsquo;t touched in weeks. Claude isn&amp;rsquo;t trying to mislead you. It&amp;rsquo;s pattern-matching against its training data and producing what statistically looks correct. Sometimes that&amp;rsquo;s exactly right. Sometimes it&amp;rsquo;s a plausible-looking solution to a problem you don&amp;rsquo;t actually have.&lt;/p&gt;
&lt;p&gt;The fix is simple: before applying something you don&amp;rsquo;t fully understand, ask Claude to explain the reasoning. Not &amp;ldquo;what does this do&amp;rdquo; but &amp;ldquo;why this approach over the alternatives.&amp;rdquo; The explanations are often excellent, and they sometimes reveal that the suggestion doesn&amp;rsquo;t fit your situation at all. Five minutes of questioning can save you an afternoon of debugging code you never understood in the first place.&lt;/p&gt;

&lt;h2 class="relative group"&gt;Hardening rounds are not optional
 &lt;div id="hardening-rounds-are-not-optional" class="anchor"&gt;&lt;/div&gt;
 
 &lt;span
 class="absolute top-0 w-6 transition-opacity opacity-0 -start-6 not-prose group-hover:opacity-100 select-none"&gt;
 &lt;a class="text-primary-300 dark:text-neutral-700 !no-underline" href="#hardening-rounds-are-not-optional" aria-label="Anchor"&gt;#&lt;/a&gt;
 &lt;/span&gt;
 
&lt;/h2&gt;
&lt;p&gt;The first version that works is not the version you should ship. This is true for human-written code too, but with coding agents the gap between &amp;ldquo;works&amp;rdquo; and &amp;ldquo;works reliably&amp;rdquo; is wider because the agent optimizes for getting something functional as fast as possible. That&amp;rsquo;s exactly what you asked for, and it&amp;rsquo;s exactly what will bite you later.&lt;/p&gt;
&lt;p&gt;After the initial implementation, always add explicit hardening rounds. What that means concretely: for every error you encounter, have the agent write a test. Not &amp;ldquo;make this robust&amp;rdquo; (that&amp;rsquo;s far too vague and the agent will just sprinkle some error handling around), but &amp;ldquo;this specific thing broke, write a test that catches it, then fix it.&amp;rdquo; Each bug becomes a regression test, and over time those tests accumulate into a safety net that actually reflects how the code fails in practice.&lt;/p&gt;
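&lt;p&gt;A minimal sketch of what one such regression cycle can look like, assuming a hypothetical &lt;code&gt;parse_duration&lt;/code&gt; helper that once crashed on empty input (both the function and the bug are invented for illustration):&lt;/p&gt;

```python
# Hypothetical example: parse_duration() crashed on an empty string.
# Instead of "make it robust", the fix ships together with a regression
# test that pins down the exact failure that happened in practice.

def parse_duration(value):
    """Parse '90s' / '5m' / '2h' into seconds; return None for bad input."""
    if not value or len(value) == 1:   # the original crashed here on ""
        return None
    unit, amount = value[-1], value[:-1]
    if not amount.isdigit():
        return None
    factor = {"s": 1, "m": 60, "h": 3600}.get(unit)
    return None if factor is None else int(amount) * factor

def test_empty_string_regression():
    # This exact input broke once; the test keeps it fixed forever.
    assert parse_duration("") is None

def test_golden_path_still_works():
    assert parse_duration("5m") == 300
```

&lt;p&gt;Each round adds one such pair: the observed failure as a test, then the fix. The suite ends up documenting how the code actually failed, not how you imagined it might.&lt;/p&gt;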
&lt;p&gt;Separately, add a refactoring round where you ask the agent to clean up the implementation without changing behavior. Hardening and refactoring are distinct concerns, and mixing them leads to worse results in both.&lt;/p&gt;
&lt;p&gt;For UI work especially, expect many rounds. I&amp;rsquo;ve been building &lt;a href="https://cc-deck.github.io" target="_blank" rel="noreferrer"&gt;cc-deck&lt;/a&gt; (a TUI plugin for Claude Code) for some time, and this is where the pattern shows most clearly. The initial implementation of a feature usually works for the golden path. But the agent isn&amp;rsquo;t good at keeping things consistent and concise across UI states, so hardening rounds catch the cases where terminal sizes change, sessions disappear mid-update, or multiple events arrive in the same tick. Without those extra passes, users would hit bugs within minutes of real use.&lt;/p&gt;

&lt;h2 class="relative group"&gt;Debug systematically, not by guessing
 &lt;div id="debug-systematically-not-by-guessing" class="anchor"&gt;&lt;/div&gt;
 
 &lt;span
 class="absolute top-0 w-6 transition-opacity opacity-0 -start-6 not-prose group-hover:opacity-100 select-none"&gt;
 &lt;a class="text-primary-300 dark:text-neutral-700 !no-underline" href="#debug-systematically-not-by-guessing" aria-label="Anchor"&gt;#&lt;/a&gt;
 &lt;/span&gt;
 
&lt;/h2&gt;
&lt;p&gt;For complex UI flows, whether TUI or web, there&amp;rsquo;s a technique I keep coming back to. Instead of describing a bug and hoping Claude guesses the right fix, give it actual data.&lt;/p&gt;
&lt;p&gt;The process: have Claude add debug logging for event and mode handling. Navigate to the problem area in the running application. Ask Claude to take a baseline of the current logs. Then exercise the bug: trigger whatever goes wrong. Finally, ask Claude to analyze the new events that flowed in after the baseline.&lt;/p&gt;
&lt;p&gt;This turns debugging from &amp;ldquo;here&amp;rsquo;s a vague description of what went wrong&amp;rdquo; into &amp;ldquo;here&amp;rsquo;s the exact sequence of events that led to the unexpected state.&amp;rdquo; The agent can trace through the event log, identify where the state diverged from what was expected, and propose a fix based on actual data rather than speculation.&lt;/p&gt;
&lt;p&gt;I developed this pattern while working on cc-deck&amp;rsquo;s session state machine. The sidebar plugin tracks session activity through Init, Working, Done, Idle, and Paused states, with time-based transitions between them. When a state transition fired at the wrong time, describing the problem in English was almost useless. Showing Claude the timestamped event log and saying &amp;ldquo;the transition from Working to Done happened here, but it shouldn&amp;rsquo;t have&amp;rdquo; led to accurate fixes every time.&lt;/p&gt;
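&lt;p&gt;For illustration, the tracked states can be sketched as a small state machine. The state names come from cc-deck as described above; the transition table and the idle threshold are invented, not cc-deck&amp;rsquo;s actual rules:&lt;/p&gt;

```python
import time

# Minimal sketch of a session state machine in the spirit of the one
# described above. State names match the post; the allowed transitions
# and the 30-second decay threshold are invented for illustration.

IDLE_AFTER = 30.0  # seconds without events before Done decays to Idle

ALLOWED = {
    "Init": {"Working"},
    "Working": {"Done", "Paused"},
    "Done": {"Idle", "Working"},
    "Idle": {"Working"},
    "Paused": {"Working"},
}

class Session:
    def __init__(self, now=None):
        self.state = "Init"
        self.last_event = now if now is not None else time.monotonic()

    def transition(self, target, now=None):
        if target not in ALLOWED[self.state]:
            raise ValueError(f"illegal transition {self.state} to {target}")
        self.state = target
        self.last_event = now if now is not None else time.monotonic()

    def tick(self, now=None):
        """Time-based decay: Done drops to Idle after IDLE_AFTER seconds."""
        now = now if now is not None else time.monotonic()
        if self.state == "Done" and now - self.last_event >= IDLE_AFTER:
            self.transition("Idle", now=now)
```

&lt;p&gt;With explicit transitions like these, a timestamped event log maps directly onto the code: you can point at the log line where &lt;code&gt;tick&lt;/code&gt; fired and ask why the threshold was already exceeded.&lt;/p&gt;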

&lt;h2 class="relative group"&gt;When the output feels vague, the input was vague
 &lt;div id="when-the-output-feels-vague-the-input-was-vague" class="anchor"&gt;&lt;/div&gt;
 
 &lt;span
 class="absolute top-0 w-6 transition-opacity opacity-0 -start-6 not-prose group-hover:opacity-100 select-none"&gt;
 &lt;a class="text-primary-300 dark:text-neutral-700 !no-underline" href="#when-the-output-feels-vague-the-input-was-vague" aria-label="Anchor"&gt;#&lt;/a&gt;
 &lt;/span&gt;
 
&lt;/h2&gt;
&lt;p&gt;So far, these tips have been about managing your session. But there&amp;rsquo;s also a real-time signal worth paying attention to: the quality of what the agent gives you right now.&lt;/p&gt;
&lt;p&gt;You can spot it. When Claude starts hedging, producing long explanations with multiple &amp;ldquo;alternatively&amp;rdquo; branches, or suggesting solutions that feel vague rather than specific to your codebase, it&amp;rsquo;s guessing. It doesn&amp;rsquo;t have enough context to give you a confident answer, so it&amp;rsquo;s giving you a probability distribution instead.&lt;/p&gt;
&lt;p&gt;This is your signal to provide more context, not to pick the option that sounds best. Show it the relevant code. Point it at the test that&amp;rsquo;s failing. Give it the error message, the full one, not a paraphrase.&lt;/p&gt;
&lt;p&gt;Think of context like a budget. Every piece of information in the context window has a cost, not in tokens but in attention. The more noise in the conversation, the harder it is for the model to focus on what matters. Be deliberate: point the agent at specific files, quote the relevant error, describe what you&amp;rsquo;ve already ruled out. My best sessions start with a minute of framing the problem, while the worst start with a one-line description and an expectation that the agent will figure out the rest.&lt;/p&gt;
&lt;p&gt;This also means cleaning up after yourself. If you tried an approach that didn&amp;rsquo;t work, say so explicitly. &amp;ldquo;I tried X and it failed because Y, so don&amp;rsquo;t go down that path&amp;rdquo; is more valuable than hoping the agent noticed from the conversation history. And if you&amp;rsquo;ve been going back and forth without progress, remember the context fatigue lesson: a &lt;code&gt;/clear&lt;/code&gt; or a fresh session beats piling more confusion onto an already muddled conversation.&lt;/p&gt;
&lt;p&gt;The &lt;a href="https://ro14nd.de/cc-skill-patterns/" &gt;skill patterns post&lt;/a&gt; covers this from the plugin author&amp;rsquo;s side: how to structure skills and hooks so the agent has the right context at the right time. From the user&amp;rsquo;s side, the principle is simpler. The old programmer&amp;rsquo;s wisdom &amp;ldquo;garbage in, garbage out&amp;rdquo; applies with full force here. Sloppy context produces sloppy output, and no amount of prompt cleverness will compensate.&lt;/p&gt;

&lt;h2 class="relative group"&gt;Your project IS the context
 &lt;div id="your-project-is-the-context" class="anchor"&gt;&lt;/div&gt;
 
 &lt;span
 class="absolute top-0 w-6 transition-opacity opacity-0 -start-6 not-prose group-hover:opacity-100 select-none"&gt;
 &lt;a class="text-primary-300 dark:text-neutral-700 !no-underline" href="#your-project-is-the-context" aria-label="Anchor"&gt;#&lt;/a&gt;
 &lt;/span&gt;
 
&lt;/h2&gt;
&lt;p&gt;Beyond individual sessions, there&amp;rsquo;s a layer of context engineering that pays off before you even start a conversation: your project structure itself.&lt;/p&gt;
&lt;p&gt;Everything the agent sees when it starts a session shapes how it works. Your instruction files, your directory structure, your naming conventions, your test patterns. If these are messy, the agent&amp;rsquo;s output will reflect that mess.&lt;/p&gt;
&lt;p&gt;This cuts both ways. A well-structured project with clear conventions makes the agent noticeably better. A project instruction file that explains your architecture, coding standards, and common patterns gives the agent a running start on every session. You write it once and it pays off hundreds of times. A cluttered codebase with inconsistent naming, dead code, and no documentation gives the agent bad examples to learn from, and it will faithfully reproduce those bad patterns in new code.&lt;/p&gt;
&lt;p&gt;Most coding agents support some form of project-level instructions: &lt;code&gt;CLAUDE.md&lt;/code&gt; for Claude Code, &lt;code&gt;.cursorrules&lt;/code&gt; for Cursor, &lt;code&gt;.github/copilot-instructions.md&lt;/code&gt; for Copilot. The emerging cross-tool standard is &lt;a href="https://github.com/agentsmd/agents.md" target="_blank" rel="noreferrer"&gt;&lt;code&gt;AGENTS.md&lt;/code&gt;&lt;/a&gt;, a vendor-neutral format that any agent can read. Claude Code picks it up too, so you can maintain one file instead of several.&lt;/p&gt;
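&lt;p&gt;To make this concrete, here is a hedged sketch of what such a file can contain. The sections, commands, and project details are illustrative, not a prescribed format:&lt;/p&gt;

```markdown
# AGENTS.md (example structure; contents are illustrative)

## Project overview
CLI tool written in Go; source in cmd/ and internal/, docs in docs/.

## Conventions
- Run gofmt before every commit
- Tests live next to the code they cover, table-driven where possible

## Commands
- `make test` runs the full suite
- `make lint` runs the linters the CI also runs

## Things agents get wrong here
- Do not edit generated files under gen/
- Error messages are lowercase, no trailing period
```

&lt;p&gt;The last section is often the highest-value one: it encodes exactly the corrections you would otherwise repeat in every session.&lt;/p&gt;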
&lt;p&gt;For projects where you want to go further, &lt;a href="https://en.wikipedia.org/wiki/Spec-driven_development" target="_blank" rel="noreferrer"&gt;Spec-Driven Development&lt;/a&gt; (SDD) takes this idea to its logical conclusion. SDD starts from the premise that coding agents can only produce reliable output when you structure the context properly. Formal specifications define &lt;em&gt;what&lt;/em&gt; to build, a project &amp;ldquo;constitution&amp;rdquo; captures architectural decisions and coding standards, and the agent implements from those artifacts rather than from ad-hoc conversation. SDD is applied context engineering with a defined workflow. Tools like &lt;a href="https://github.com/github/spec-kit" target="_blank" rel="noreferrer"&gt;spec-kit&lt;/a&gt; provide the scaffolding for this approach. It&amp;rsquo;s more overhead than a simple &lt;code&gt;AGENTS.md&lt;/code&gt;, but for larger projects the structure pays for itself quickly. More on SDD in a future post.&lt;/p&gt;
&lt;p&gt;Whatever format you use, treat it as living documentation. If the project evolves and the instructions don&amp;rsquo;t, the agent will drift.&lt;/p&gt;

&lt;h2 class="relative group"&gt;These aren&amp;rsquo;t permanent rules
 &lt;div id="these-arent-permanent-rules" class="anchor"&gt;&lt;/div&gt;
 
 &lt;span
 class="absolute top-0 w-6 transition-opacity opacity-0 -start-6 not-prose group-hover:opacity-100 select-none"&gt;
 &lt;a class="text-primary-300 dark:text-neutral-700 !no-underline" href="#these-arent-permanent-rules" aria-label="Anchor"&gt;#&lt;/a&gt;
 &lt;/span&gt;
 
&lt;/h2&gt;
&lt;p&gt;Context engineering is a young discipline. But the core principle will last: working well with coding agents is about engineering the context, not crafting the prompt. The prompt is the last five percent. Everything else is what makes that five percent effective.&lt;/p&gt;
&lt;p&gt;If you&amp;rsquo;ve found your own patterns for keeping coding agent sessions productive, I&amp;rsquo;d love to hear about them. Leave a comment below. What works for you? What did you have to learn the hard way?&lt;/p&gt;
&lt;div class="ai-attribution"&gt;
&lt;p&gt;Author: Roland Huß &lt;a href="https://aiattribution.github.io/statements/AIA-Ph-SeNc-Hin-R-?model=Claude%20Opus%204.6-v1.0" target="_blank" rel="noreferrer"&gt;AIA Ph SeNc Hin R Claude Opus 4.6 v1.0&lt;/a&gt;&lt;/p&gt;
&lt;/div&gt;</description></item><item><title>Claude Code skill patterns</title><link>https://ro14nd.de/cc-skill-patterns/</link><pubDate>Thu, 09 Apr 2026 00:00:00 +0000</pubDate><guid>https://ro14nd.de/cc-skill-patterns/</guid><description>&lt;p&gt;Claude Code skills let you extend a coding agent with custom workflows, specialist knowledge, and automation. You write a &lt;code&gt;SKILL.md&lt;/code&gt; file with instructions, and the agent follows them. At least, that&amp;rsquo;s the idea.&lt;/p&gt;
&lt;p&gt;In practice, Claude treats skill content as advice, not as instructions. A skill that says &amp;ldquo;always use spec-kit to create the specification&amp;rdquo; might get followed, or Claude might decide it already has enough context from the brainstorming phase to write the spec directly. It&amp;rsquo;s being helpful, but it&amp;rsquo;s also wrong. This post describes patterns for dealing with that challenge, from scripts that enforce consistency to hooks that block shortcuts before they happen.&lt;/p&gt;
&lt;p&gt;Stronger wording does not help. I tried &amp;ldquo;MUST&amp;rdquo;, &amp;ldquo;ALWAYS&amp;rdquo;, &amp;ldquo;CRITICAL&amp;rdquo;, bold, uppercase. None of it changes the underlying behavior: the model reads your skill, considers it, and then makes its own judgment call. Compare that to hook-injected context (system reminders), which Claude treats as ground truth. This difference in compliance is not a bug, it&amp;rsquo;s how the architecture works. Understanding it early saves you from writing increasingly aggressive skill instructions that still get ignored.&lt;/p&gt;
&lt;p&gt;The patterns in this post come from building &lt;a href="https://github.com/rhuss/cc-prose" target="_blank" rel="noreferrer"&gt;cc-prose&lt;/a&gt; (voice-consistent writing), &lt;a href="https://github.com/rhuss/cc-spex" target="_blank" rel="noreferrer"&gt;cc-spex&lt;/a&gt; (spec-driven development on top of &lt;a href="https://github.com/github/spec-kit" target="_blank" rel="noreferrer"&gt;spec-kit&lt;/a&gt;), and a few others.&lt;/p&gt;

&lt;h2 class="relative group"&gt;The shortcut race
 &lt;div id="the-shortcut-race" class="anchor"&gt;&lt;/div&gt;
 
 &lt;span
 class="absolute top-0 w-6 transition-opacity opacity-0 -start-6 not-prose group-hover:opacity-100 select-none"&gt;
 &lt;a class="text-primary-300 dark:text-neutral-700 !no-underline" href="#the-shortcut-race" aria-label="Anchor"&gt;#&lt;/a&gt;
 &lt;/span&gt;
 
&lt;/h2&gt;
&lt;p&gt;This shortcutting behavior keeps coming back and deserves its own section.&lt;/p&gt;
&lt;p&gt;You fix one shortcut in your skill instructions, Claude finds another path around it, you patch that one, and it discovers a third. It&amp;rsquo;s a classic &lt;a href="https://en.wikipedia.org/wiki/The_Hare_and_the_Hedgehog" target="_blank" rel="noreferrer"&gt;hare-and-hedgehog&lt;/a&gt; race, and as the skill author you will always be the hedgehog.&lt;/p&gt;
&lt;p&gt;In cc-spex, the brainstorming phase collects requirements, explores alternatives, and validates assumptions before the specification phase formalizes everything into a proper spec using spec-kit. But Claude regularly decides that the brainstorming output is &amp;ldquo;structured enough&amp;rdquo; and writes the spec directly. From the model&amp;rsquo;s perspective, this is a reasonable optimization. From the workflow&amp;rsquo;s perspective, it&amp;rsquo;s skipping the entire point.&lt;/p&gt;
&lt;p&gt;What makes this particularly frustrating is that Claude usually knows it did the wrong thing. Point out the shortcut and it will apologize, acknowledge that it should have followed the workflow, and promise to do better next time, without that actually preventing the next occurrence.&lt;/p&gt;
&lt;p&gt;A useful debugging technique is to open a fresh session, show Claude what happened, and ask: &amp;ldquo;Why did you do this, and how can we prevent it in the future?&amp;rdquo; The answers are surprisingly insightful, and many of the patterns in this post came directly from those conversations. Then I let Claude fix the skill right there in the same session, while the reasoning is still fresh. Each iteration makes the skill more reliable.&lt;/p&gt;
&lt;p&gt;Can you win this race? Not entirely. But you can make shortcuts increasingly expensive for the model by moving critical logic out of skills and into scripts and hooks, where it stops being a suggestion and becomes a constraint.&lt;/p&gt;

&lt;h2 class="relative group"&gt;Skills vs scripts
 &lt;div id="skills-vs-scripts" class="anchor"&gt;&lt;/div&gt;
 
 &lt;span
 class="absolute top-0 w-6 transition-opacity opacity-0 -start-6 not-prose group-hover:opacity-100 select-none"&gt;
 &lt;a class="text-primary-300 dark:text-neutral-700 !no-underline" href="#skills-vs-scripts" aria-label="Anchor"&gt;#&lt;/a&gt;
 &lt;/span&gt;
 
&lt;/h2&gt;
&lt;p&gt;Skills can call out to scripts during execution. This is the escape hatch from the shortcut race: instead of relying on Claude to follow a multi-step procedure described in English, you delegate the deterministic parts to a script that the skill invokes at the right moment. The question then becomes: what belongs in the skill, and what belongs in the script?&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Skills&lt;/strong&gt; give you flexibility, error-forgiveness, and adaptability. The LLM can handle ambiguous input, recover from unexpected situations, and apply judgment, so use them when the outcome benefits from interpretation.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Scripts&lt;/strong&gt; give you determinism, repeatability, and predictability because a Python script does the same thing every time. Use them when the outcome must be consistent across runs.&lt;/p&gt;
&lt;p&gt;One of my plugins has a workflow for processing external PDF reviews: iterate over reviewer comments, map them to source files, assess complexity, and generate fix proposals. This follows a fixed procedure with multiple stages. Early versions used skills for the whole flow, and the result was a different experience every time: sometimes it would skip comments, sometimes it would process them in the wrong order, sometimes it would invent stages that didn&amp;rsquo;t exist. A &lt;code&gt;session_manager.py&lt;/code&gt; script now handles the state management, tracks the interaction state, and stores results. The skill handles the judgment calls within each stage, like assessing whether a reviewer&amp;rsquo;s suggestion actually improves clarity. The script handles everything else.&lt;/p&gt;
&lt;p&gt;The evolution path is predictable: you start with a simple prompt-only skill (just English instructions, fragile by nature) and gradually move state management into scripts as you hit consistency problems. And you don&amp;rsquo;t write those scripts by hand, either. You tell Claude &amp;ldquo;let&amp;rsquo;s create a script to make this more predictable&amp;rdquo; and let it build the script and wire it into the skill at the right checkpoints. The rule of thumb: every state change should go through a script. Let the LLM do the thinking, and let scripts do the bookkeeping.&lt;/p&gt;
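&lt;p&gt;A minimal sketch of that bookkeeping idea: a tiny state manager a skill calls between stages. The real &lt;code&gt;session_manager.py&lt;/code&gt; mentioned above is more involved; the stage names and file layout here are invented for illustration:&lt;/p&gt;

```python
import json, sys
from pathlib import Path

# Illustrative sketch of "scripts do the bookkeeping": a skill invokes
# this between stages, so stage order and results are tracked
# deterministically instead of living only in the conversation.

STAGES = ["collect", "map", "assess", "propose"]
STATE_FILE = Path(".review-session.json")

def load_state():
    if STATE_FILE.exists():
        return json.loads(STATE_FILE.read_text())
    return {"stage": STAGES[0], "results": {}}

def save_state(state):
    STATE_FILE.write_text(json.dumps(state, indent=2))

def advance(state, result):
    """Record the current stage's result and move to the next stage."""
    idx = STAGES.index(state["stage"])
    state["results"][state["stage"]] = result
    state["stage"] = "done" if idx + 1 == len(STAGES) else STAGES[idx + 1]
    save_state(state)
    return state["stage"]

if __name__ == "__main__":
    # Usage from a skill: python session_manager.py advance "summary text"
    state = load_state()
    if sys.argv[1:2] == ["advance"]:
        print(advance(state, " ".join(sys.argv[2:])))
    else:
        print(state["stage"])
```

&lt;p&gt;The skill stays in charge of judgment (what goes into &lt;code&gt;result&lt;/code&gt;), while stage order and persistence are no longer up for negotiation.&lt;/p&gt;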

&lt;h2 class="relative group"&gt;Hooks over hopes
 &lt;div id="hooks-over-hopes" class="anchor"&gt;&lt;/div&gt;
 
 &lt;span
 class="absolute top-0 w-6 transition-opacity opacity-0 -start-6 not-prose group-hover:opacity-100 select-none"&gt;
 &lt;a class="text-primary-300 dark:text-neutral-700 !no-underline" href="#hooks-over-hopes" aria-label="Anchor"&gt;#&lt;/a&gt;
 &lt;/span&gt;
 
&lt;/h2&gt;
&lt;p&gt;If skills are advisory and scripts are deterministic, hooks are the bridge. Claude Code&amp;rsquo;s hook system lets you inject context at specific points in the interaction, and that context arrives as a system reminder, which Claude treats as ground truth. Better skill instructions won&amp;rsquo;t fix the compliance gap, but hooks can enforce what instructions can&amp;rsquo;t.&lt;/p&gt;
&lt;p&gt;cc-spex uses a &lt;code&gt;PreToolUse&lt;/code&gt; hook as a guardrail layer. This hook fires before every tool call and can either allow it, deny it with an error message, or inject additional context. When the plugin runs a multi-stage ship workflow (brainstorm, specify, plan, implement, review, verify), Claude regularly tries to skip stages. A skill instruction saying &amp;ldquo;complete each stage in order&amp;rdquo; gets interpreted as a suggestion. The hook makes it a constraint: a state file on the filesystem tracks the current pipeline stage, and the hook checks it before every tool call. If Claude tries to jump from stage 2 (spec review) to stage 6 (implementation), the hook blocks the call and lists every stage it needs to complete first:&lt;/p&gt;
&lt;div class="highlight-wrapper"&gt;&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-text" data-lang="text"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;PIPELINE DISCIPLINE: You are trying to invoke speckit-implement 
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;(stage 6: implement) but the pipeline is at stage 2: review-spec. 
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;You MUST complete these stages first, in order:
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; 2. review-spec (spex:review-spec)
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; 3. plan (speckit-plan)
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; 4. tasks (speckit-tasks)
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; 5. review-plan (spex:review-plan)
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;Do NOT skip stages. Do NOT shortcut.&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;The same hook catches hallucinated commands (Claude inventing &lt;code&gt;/spex:specify&lt;/code&gt;, which doesn&amp;rsquo;t exist) and reminds Claude to run verification before git commits. A companion &lt;code&gt;UserPromptSubmit&lt;/code&gt; hook injects plugin state as a structured system reminder on every prompt, including resolved script paths, project configuration, and session context, so that skills always have accurate state without needing to rediscover it.&lt;/p&gt;
&lt;p&gt;Instead of telling Claude what to do and hoping it complies, you build guardrails that make non-compliance impossible. The hooks don&amp;rsquo;t ask Claude to follow the process, they enforce it at the tool level before Claude&amp;rsquo;s output reaches the user.&lt;/p&gt;
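&lt;p&gt;A stripped-down sketch of such a guardrail. Claude Code delivers hook input as JSON on stdin, and per its hook convention an exit code of 2 blocks the call and feeds stderr back to the model. The stage names match the pipeline above; the state-file path and the &lt;code&gt;stage&lt;/code&gt; field checked in &lt;code&gt;tool_input&lt;/code&gt; are invented for illustration:&lt;/p&gt;

```python
import json, sys
from pathlib import Path

# Sketch of a PreToolUse guardrail in the spirit of the hook described
# above. Stage names match the post; the state-file path and the
# "stage" field in tool_input are invented for illustration.

STAGES = ["brainstorm", "specify", "review-spec", "plan", "tasks",
          "review-plan", "implement"]

def check(requested_stage, current_stage):
    """Return (exit_code, message). 0 allows the call, 2 blocks it."""
    if requested_stage not in STAGES:
        return 0, ""
    cur, req = STAGES.index(current_stage), STAGES.index(requested_stage)
    if req > cur:
        missing = ", ".join(STAGES[cur:req])
        return 2, (f"PIPELINE DISCIPLINE: pipeline is at {current_stage}. "
                   f"Complete these stages first, in order: {missing}. "
                   "Do NOT skip stages.")
    return 0, ""

if __name__ == "__main__":
    # Claude Code passes hook input as JSON on stdin; exit code 2
    # blocks the tool call and feeds stderr back to the model.
    raw = sys.stdin.read() if not sys.stdin.isatty() else ""
    if raw.strip():
        event = json.loads(raw)
        requested = event.get("tool_input", {}).get("stage", "")
        current = Path(".pipeline-stage").read_text().strip()
        code, msg = check(requested, current)
        if code != 0:
            print(msg, file=sys.stderr)
        sys.exit(code)
```

&lt;p&gt;The key property: nothing here asks the model nicely. The check runs outside the model, so it cannot be talked out of the answer.&lt;/p&gt;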
&lt;div class="diagram-themed"&gt;
&lt;img class="diagram-light" src="https://ro14nd.de/images/cc-skill-patterns/enforcement-architecture-light.svg" alt="Enforcement architecture: hooks inject state and block shortcuts, skills provide advisory guidance, scripts handle deterministic state management" /&gt;
&lt;img class="diagram-dark" src="https://ro14nd.de/images/cc-skill-patterns/enforcement-architecture-dark.svg" alt="Enforcement architecture: hooks inject state and block shortcuts, skills provide advisory guidance, scripts handle deterministic state management" /&gt;
&lt;/div&gt;

&lt;h2 class="relative group"&gt;More patterns
 &lt;div id="more-patterns" class="anchor"&gt;&lt;/div&gt;
 
 &lt;span
 class="absolute top-0 w-6 transition-opacity opacity-0 -start-6 not-prose group-hover:opacity-100 select-none"&gt;
 &lt;a class="text-primary-300 dark:text-neutral-700 !no-underline" href="#more-patterns" aria-label="Anchor"&gt;#&lt;/a&gt;
 &lt;/span&gt;
 
&lt;/h2&gt;
&lt;p&gt;A few more patterns that came up across my skills.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Multi-specialist over monolith.&lt;/strong&gt; Instead of one massive skill that tries to do everything, split into independent specialist skills with their own activation rules. One of my plugins uses 12 specialists for different editing concerns: style, consistency, flow, references, links, examples, and more. This keeps individual skills focused and within context budget, because one monolithic skill will lose track of its own instructions.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Prevention over correction.&lt;/strong&gt; cc-prose generates clean content upfront using voice profiles and AI pattern avoidance, rather than generating text and then running a fixer pass. It&amp;rsquo;s cheaper to prevent bad output than to detect and fix it after the fact. Not to mention that fixing AI-generated text while using AI tends to be an exercise in whack-a-mole.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Keep skills small.&lt;/strong&gt; Every line of skill content fills the context window, so if Claude &amp;ldquo;helpfully&amp;rdquo; adds optional features to your skill output, remove them. Periodic cleanup during development is important because skill code accumulates cruft faster than regular code.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Configuration hierarchy.&lt;/strong&gt; Skill defaults in &lt;code&gt;knowledge-base/&lt;/code&gt; get overridden by global user config, which gets overridden by project config. cc-prose uses this to layer project-specific voice profiles on top of general writing defaults. Three layers is enough.&lt;/p&gt;
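&lt;p&gt;The lookup itself is just a layered dictionary merge. A sketch, with invented key names:&lt;/p&gt;

```python
# Sketch of the three-layer lookup described above: skill defaults,
# overridden by global user config, overridden by project config.
# The key names are invented for illustration.

def effective_config(defaults, user_cfg, project_cfg):
    merged = {}
    for layer in (defaults, user_cfg, project_cfg):
        merged.update(layer)  # later layers win on conflicting keys
    return merged

defaults = {"voice": "neutral", "max_sentence_len": 30}
user_cfg = {"voice": "conversational"}
project_cfg = {"voice": "technical-blog"}

cfg = effective_config(defaults, user_cfg, project_cfg)
```

&lt;p&gt;Here &lt;code&gt;voice&lt;/code&gt; resolves from the project layer while &lt;code&gt;max_sentence_len&lt;/code&gt; falls through from the defaults. Anything deeper than this tends to become its own debugging problem.&lt;/p&gt;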
&lt;p&gt;&lt;strong&gt;Building on a moving platform.&lt;/strong&gt; Claude Code is transitioning from &lt;code&gt;commands/&lt;/code&gt; to user-invocable &lt;code&gt;skills/&lt;/code&gt;, but plugin skills installed from marketplaces still don&amp;rsquo;t appear in the &lt;code&gt;/&lt;/code&gt; slash command autocomplete (&lt;a href="https://github.com/anthropics/claude-code/issues/18949" target="_blank" rel="noreferrer"&gt;anthropics/claude-code#18949&lt;/a&gt;, open since January 2026). Build with &lt;code&gt;skills/&lt;/code&gt; as the primary structure but keep thin &lt;code&gt;commands/*.md&lt;/code&gt; stubs as a bridge until autocomplete catches up.&lt;/p&gt;
&lt;p&gt;The core tension between flexibility and determinism, between skills and scripts, won&amp;rsquo;t go away with better models. If anything, more capable models will just make the shortcut race more creative.&lt;/p&gt;
&lt;p&gt;If you&amp;rsquo;re building Claude Code skills and have found patterns of your own, I&amp;rsquo;d like to hear about them.&lt;/p&gt;
&lt;div class="ai-attribution"&gt;
&lt;p&gt;Author: Roland Huß &lt;a href="https://aiattribution.github.io/statements/AIA-Ph-SeNc-Hin-R-?model=Claude%20Opus%204.6-v1.0" target="_blank" rel="noreferrer"&gt;AIA Ph SeNc Hin R Claude Opus 4.6 v1.0&lt;/a&gt;&lt;/p&gt;
&lt;/div&gt;</description></item></channel></rss>