Most people use AI coding assistants like a search engine with attitude. They open a chat, dump in a vague question, get a vague answer, and close the tab. They are leaving 80% of the capacity on the table. The developers who get 10-50x more value from the same subscription have figured out something fundamental: the context window is a computer, and you need to program it.
This is a practitioner's field guide to context engineering for AI coding assistants. It is based on running 50+ AI sessions per week across multiple concurrent projects, tracking every token, and iterating on the system for months. The patterns work with Claude Code, Cursor, Copilot, Windsurf, Aider, or any tool that gives you a context window to manage.
What you will learn:
- Why "just chatting" with AI wastes most of your context window, and what to do instead
- The OPEN-WORK-CLOSE session pattern that creates project continuity across conversations
- How to measure and budget tokens so you use 100% of your subscription capacity
- A 3-layer context architecture that cuts session startup cost by 60%+
- How to run two AI workstreams in parallel without them colliding
- Persistent memory systems that survive /clear and make every session smarter
- The train system for batching work into focused sprints with measurable outcomes
- When to use sub-agents vs direct tools and how agent isolation protects your context
- How to treat AI subscriptions as employees with ROI targets
- The Context Window Problem
- Token Economics: What You Are Actually Paying For
- Session-Based Workflows
- Token Budgeting
- Context Layering: The 3-Layer Architecture
- Two-Lane Concurrency
- Capacity Pacing
- Persistent Memory Systems
- The Train System
- Tool and Agent Design
- Comparisons
- Anti-Patterns
- References
Every AI coding assistant has a context window --- a fixed-size buffer of tokens that represents everything the model can "see" at once. For Claude, this is 200K tokens (about 150K words). For GPT-4o, it is 128K. For Gemini, it is up to 1M but with significant degradation at the long end.
Here is the problem: most users fill this window with garbage.
A typical unstructured AI coding session looks like this:
- Open a new conversation
- Paste in 3 files (8K tokens)
- Ask a question (50 tokens)
- Get a response (2K tokens)
- Ask a follow-up that contradicts the first question (80 tokens)
- Paste in 2 more files because the AI "doesn't understand" (6K tokens)
- Get a response that hallucinated because the context is now a mess (3K tokens)
- Give up and start a new conversation
Total tokens consumed: ~20K. Useful work accomplished: maybe 2K tokens of actual insight, buried in 18K tokens of noise. And the next session starts from zero because nothing was persisted.
Compare this to a structured session:
- AI opens, reads a 2K-token memory file that contains project state, decisions, and priorities
- AI reads a 1K-token priority list from a GitHub issue
- AI identifies the goal, reads exactly the files it needs (5K tokens)
- AI executes the work, committing code, updating issues, logging progress
- AI closes, writing 500 tokens of state back to memory for the next session
Total tokens consumed: ~15K. Useful work: a shipped feature, updated documentation, and persistent state that makes the next session even more efficient.
Key insight: Context engineering is not about cramming more into the window. It is about curating what goes in so that every token earns its place.
Why Existing Approaches Fall Short
"Just chat with it" --- No persistence, no structure, no continuity. Each conversation starts from scratch. You re-explain your project, your preferences, your architecture every single time.
"Paste in everything" --- Context windows have a quality problem, not just a size problem. Research shows that model attention degrades as context grows --- a phenomenon Anthropic calls "context rot." Pasting in your entire codebase does not help; it hurts.
"Use the biggest model" --- A 1M-token context window does not solve the curation problem. It just means you can be sloppy with more tokens before things break. The retrieval accuracy at the edges of a 1M window is measurably worse than a well-curated 50K window.
"Memory features will save me" --- ChatGPT's memory, Windsurf's Cascade memories, and similar features store facts but not workflows. They cannot encode "when you open a dev session, read the priority list, check for conflicts with the other lane, and pick a goal." That requires structure.
What Changes If You Get This Right
The difference between unstructured and structured AI usage is not incremental. It is categorical:
| Metric | Unstructured | Structured |
|---|---|---|
| Sessions before useful output | 3-5 | 1 |
| Context wasted on re-explanation | 40-60% | <5% |
| Cross-session continuity | None | Full |
| Concurrent workstreams | 1 | 2+ |
| Weekly capacity utilization | 20-40% | 90-100% |
| Subscription ROI (API-equivalent) | 3-5x | 15-60x |
Those are real numbers from tracked sessions. On a single day running structured sessions, one developer logged $91 in API-equivalent value on a $200/month subscription --- roughly a 14x daily ROI. On a day of unstructured chatting, the same developer might generate $5-10 of value.
Before you can manage context efficiently, you need to understand what tokens cost and where they go.
Token Pricing (Claude API, March 2026)
| Model | Input (per 1M tokens) | Output (per 1M tokens) | Context Window |
|---|---|---|---|
| Claude Opus 4.6 | $5.00 | $25.00 | 200K (1M extended) |
| Claude Sonnet 4.6 | $3.00 | $15.00 | 200K |
| Claude Haiku 4.5 | $1.00 | $5.00 | 200K |
| GPT-4o | $2.50 | $10.00 | 128K |
| GPT-4o mini | $0.15 | $0.60 | 128K |
Key insight: On the Claude models above, output tokens cost 5x as much as input tokens (4x on the GPT-4o models). At the 5x ratio, a 2K-token response costs the same as a 10K-token input. This means verbose AI responses are disproportionately expensive, and extended thinking (which bills as output) can dominate your budget.
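To make the input/output asymmetry concrete, here is a minimal sketch that prices a single request using the rates in the table above. The `PRICES_PER_MILLION` map, its model keys, and the `requestCostUsd` helper are hypothetical names chosen for illustration; the rates are copied from the table rather than fetched from any API.

```typescript
// Per-1M-token prices in USD, copied from the table above. Illustrative only.
type ModelPrice = { inputPerM: number; outputPerM: number };

// The keys are hypothetical labels, not official model identifiers.
const PRICES_PER_MILLION: Record<string, ModelPrice> = {
  "claude-opus-4.6": { inputPerM: 5.0, outputPerM: 25.0 },
  "claude-sonnet-4.6": { inputPerM: 3.0, outputPerM: 15.0 },
  "claude-haiku-4.5": { inputPerM: 1.0, outputPerM: 5.0 },
  "gpt-4o": { inputPerM: 2.5, outputPerM: 10.0 },
  "gpt-4o-mini": { inputPerM: 0.15, outputPerM: 0.6 },
};

// Cost in USD of one request with the given input and output token counts.
function requestCostUsd(model: string, inputTokens: number, outputTokens: number): number {
  const price = PRICES_PER_MILLION[model];
  if (price === undefined) {
    throw new Error(`Unknown model: ${model}`);
  }
  return (inputTokens * price.inputPerM + outputTokens * price.outputPerM) / 1_000_000;
}

// Compare a 2K-token response with a 10K-token input on the same model.
const outputOnlyCost = requestCostUsd("claude-sonnet-4.6", 0, 2_000);
const inputOnlyCost = requestCostUsd("claude-sonnet-4.6", 10_000, 0);
console.log(outputOnlyCost.toFixed(3), inputOnlyCost.toFixed(3));
```

Both calls come out to the same amount on the Sonnet row, which is the 5x ratio at work.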
Subscription vs API Economics
Most individual developers use subscriptions, not API billing:
| Plan | Monthly Cost | Approximate Weekly Token Budget | API-Equivalent Value |
|---|---|---|---|
| Claude Pro | $20 | ~1M output tokens | ~$100-200/week |
| Claude Max 5x | $100 | ~5M output tokens | ~$500-1000/week |
| Claude Max 20x | $200 | ~20M output tokens | ~$2000-4000/week |
| Cursor Pro | $20 | Varies (500 requests/mo) | ~$100-300/week |
On a subscription plan, you are not paying per token --- you are paying for a weekly capacity bucket. This fundamentally changes the optimization target: you want to use 100% of your capacity, not minimize usage. An unused subscription day is wasted money.
Where Tokens Actually Go
Here is a real breakdown from a tracked dev session:
Session: Train 6 S1 - 21 articles published
API-equivalent cost: ~$45
Duration: ~3 hours
Token breakdown (approximate):
  System prompt + CLAUDE.md:      4,000 tokens (input, cached)
  Memory file (MEMORY.md):        3,500 tokens (input)
  GitHub issues (#12, #13, #30):  2,500 tokens (input)
  Playbook (CLAUDE-main.md):      2,000 tokens (input)
  Code files read during work:   35,000 tokens (input)
  Tool calls + responses:        20,000 tokens (input/output)
  AI reasoning + output:         80,000 tokens (output)
  Extended thinking:             50,000 tokens (output, billed as output)
  ─────────────────────────────────────────────
  Total:                       ~197,000 tokens
The session startup cost (everything before actual work begins) was about 12K tokens. That is the βtaxβ you pay to restore context. The goal of context engineering is to minimize this tax while maximizing the signal it delivers.
The Input Token Tax
Every message you send to the AI includes the entire conversation history as input tokens. If your conversation is 50K tokens deep, every new message costs 50K input tokens plus your new message plus the response. This means:
// The cost of message N in a conversation
interface MessageCost {
// Everything before this message
conversationHistory: number; // grows with every exchange
// The new content
newInput: number; // your message + any file reads
// The response
output: number; // AI's response + thinking
// Total input = conversationHistory + newInput
// This is why long conversations get expensive fast
}
// Example: 20-message conversation
// Message 1:  input=2K,  output=1K → total=3K
// Message 5:  input=15K, output=2K → total=17K
// Message 10: input=40K, output=3K → total=43K
// Message 20: input=90K, output=2K → total=92K
// By message 20, you're paying 90K tokens just to maintain history
This is why Claude Code has auto-compaction at ~95% of the context window, and why /clear between unrelated tasks is not optional --- it is essential.
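The growth described above can be put in numbers with a small sketch. It assumes every message re-sends the full history, with no prompt caching and no compaction; the `Exchange` type and `totalInputTokens` function are hypothetical names, and the 2K-in, 2K-out exchange size is an arbitrary example.

```typescript
// Total input tokens billed over a conversation in which every message
// re-sends the full history. Assumes no prompt caching and no compaction.
interface Exchange {
  newInput: number; // tokens in the user's message plus any file reads
  output: number;   // tokens in the model's reply
}

function totalInputTokens(exchanges: Exchange[]): number {
  let historyTokens = 0; // everything sent and received so far
  let billedInput = 0;
  for (const ex of exchanges) {
    billedInput += historyTokens + ex.newInput; // history is re-sent each turn
    historyTokens += ex.newInput + ex.output;   // the reply joins the history
  }
  return billedInput;
}

// Twenty identical exchanges: 2K tokens in, 2K tokens out.
const twentyTurns: Exchange[] = [];
for (let i = 0; i < 20; i++) {
  twentyTurns.push({ newInput: 2_000, output: 2_000 });
}

const oneLongConversation = totalInputTokens(twentyTurns);
// The same work split into four 5-turn sessions with /clear in between.
const fourShortSessions = 4 * totalInputTokens(twentyTurns.slice(0, 5));
console.log({ oneLongConversation, fourShortSessions });
```

Under these assumptions the single conversation bills 800,000 input tokens against 200,000 for the four short sessions, before counting the OPEN cost that each fresh session pays.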
The most impactful pattern in this entire guide is treating AI interactions as sessions rather than conversations. A session has a lifecycle: it opens, does work, and closes. Each phase has specific responsibilities.
The Session Lifecycle
┌──────────┐     ┌──────────┐     ┌──────────┐
│   OPEN   │────▶│   WORK   │────▶│  CLOSE   │
│          │     │          │     │          │
│ • Load   │     │ • Execute│     │ • Persist│
│   state  │     │   goal   │     │   state  │
│ • Brief  │     │ • Log    │     │ • Update │
│ • Pick   │     │  progress│     │   memory │
│   goal   │     │ • Stay   │     │ • Log    │
│          │     │  focused │     │   cost   │
└──────────┘     └──────────┘     └──────────┘
The Session Interface
interface Session {
type: "dev" | "planning" | "ops" | "architecture" | "monetization";
goal: string; // One sentence. If you can't state it, don't start.
lane: "main" | "aux"; // Which concurrency lane (more on this later)
open(): Promise<SessionContext>;
work(): Promise<WorkResult>;
close(): Promise<void>;
}
interface SessionContext {
memory: MemoryFile; // Persistent state from last session
priorities: PriorityList; // What needs doing, ranked
sessionLog: LogEntry[]; // Recent session history (detect conflicts)
planningBook: Signal[]; // Strategic signals to process
capacityBudget: TokenBudget; // How much output you can afford today
}
interface WorkResult {
shipped: string[]; // Concrete deliverables
decisions: string[]; // Choices made and why
insights: string[]; // Things that change how you think
blockers: string[]; // What went wrong
}
OPEN: Restoring Context Efficiently
The OPEN phase is where context engineering matters most. You need to reload enough state for the AI to be productive without wasting tokens on information it does not need.
Here is a real OPEN sequence from a production system:
1. Read MEMORY.md (status, issue index, decisions)
2. Run capacity check (weekly token budget → report pace)
3. Read session log (scan for active parallel sessions)
4. Read priorities (pick a goal)
5. Read planning book (is strategic planning overdue?)
Then: Self-brief in 3-5 lines. State where we are, what was done
last, what is next. If a parallel session is active, note it.
Each step is intentional:
- MEMORY.md provides cross-session state (2-3K tokens vs re-reading 20 files)
- Capacity check prevents burning your weekly budget by Tuesday
- Session log detects if another AI instance is working concurrently
- Priorities ensure you work on the most important thing, not whatever feels interesting
- Planning book surfaces strategic signals before they rot
Key insight: The OPEN sequence is your session's "boot sequence." Like a computer BIOS, it should be fast, deterministic, and load only what is needed. A bloated OPEN wastes tokens on every single session.
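One way to keep the boot sequence deterministic and bounded is to run the steps in a fixed order and stop as soon as the startup budget is exceeded. The sketch below is an illustration, not the production system described above: `OpenStep`, `estimateTokens`, and `runOpenSequence` are hypothetical names, and the four-characters-per-token estimate is a rough assumption rather than a real tokenizer.

```typescript
// A hypothetical OPEN runner: steps run in a fixed order, and loading stops
// with an error once the running total exceeds the startup budget.
interface OpenStep {
  label: string;
  load: () => string; // returns the text that would be placed in context
}

interface OpenReportLine {
  label: string;
  tokens: number;
}

// Rough heuristic (an assumption): about 4 characters per token for English text.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

function runOpenSequence(steps: OpenStep[], budgetTokens: number): OpenReportLine[] {
  const report: OpenReportLine[] = [];
  let total = 0;
  for (const step of steps) {
    const tokens = estimateTokens(step.load());
    total += tokens;
    if (total > budgetTokens) {
      throw new Error(`OPEN exceeded ${budgetTokens} tokens at step "${step.label}"`);
    }
    report.push({ label: step.label, tokens });
  }
  return report;
}

// Usage with stand-in loaders; real loaders would read files or issues.
const sampleOpenReport = runOpenSequence(
  [
    { label: "MEMORY.md", load: () => "status: shipping train 6\n".repeat(10) },
    { label: "priorities", load: () => "1. wire custom domains\n".repeat(5) },
  ],
  12_000,
);
console.log(sampleOpenReport);
```

Failing loudly on an over-budget OPEN is a design choice in this sketch; a softer variant could log a warning and continue.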
WORK: Staying Focused
The WORK phase has one rule: do the thing you said you would do at OPEN.
Rules:
- One group, one goal. Don't wander.
- If strategy is wrong, that IS the finding: surface it, stop building.
- Log progress as you go (comments on issues, commit messages)
- If you notice something strategic, capture it on the planning book
  immediately; don't try to solve it now.
- Seek close context: after the main goal, look for small wins in
  the same area. Don't start new groups.
The "capture, don't execute" pattern is critical. During a dev session, you will inevitably notice architectural issues, missing features, or strategic opportunities. Without discipline, these become rabbit holes that consume your entire session. Instead:
// Helper functions below (fixLoginBug, captureToPlanning, ...) are placeholders.

// Anti-pattern: Scope creep during execution
async function devSessionWithScopeCreep() {
  await fixLoginBug(); // The goal
  await refactorAuthModule(); // "While I'm here..."
  await addOAuth2Support(); // "This would be easy to add..."
  await redesignUserModel(); // "Actually the data model is wrong..."
  // Result: nothing shipped, everything half-done
}

// Pattern: Capture and continue
async function devSessionFocused() {
  await fixLoginBug(); // The goal
  // Noticed auth module needs refactoring
  await captureToPlanning("🏗️ ARCHITECTURE: auth module needs refactoring - see commit abc123");
  // Noticed OAuth would be valuable
  await captureToPlanning("💡 OPPORTUNITY: OAuth2 support - 3 customers asked for it");
  // Done. One thing shipped.
}
CLOSE: Persisting State
CLOSE is the most underrated phase. It is where you convert a session's ephemeral knowledge into persistent state that compounds across sessions.
1. Update priorities: check off done items
2. Update work issues: description current, has progress comments, close if done
3. Capture to planning book: anything strategic not yet captured
4. Log the session: type, summary, what shipped
5. Update daily log: insights, connections, blockers
6. Update MEMORY.md, but only your lane's section
7. Log session cost: track API-equivalent spend
8. Check capacity: report pace vs weekly budget
9. Say "Session closed. Safe to /clear."
The final line is important. /clear destroys the conversation context, but everything important has been persisted to files and issues. The next session can reconstruct full context in ~12K tokens instead of the 100K+ tokens the conversation consumed.
interface CloseChecklist {
  // State persistence (required)
  memoryUpdated: boolean;        // MEMORY.md reflects current state
  prioritiesUpdated: boolean;    // Done items checked off
  issuesCurrent: boolean;        // Work issues have progress comments
  sessionLogged: boolean;        // #12 has CLOSE entry
  // Knowledge capture (required)
  planningBookCaptured: boolean; // Strategic signals preserved
  dailyLogWritten: boolean;      // Insights, decisions, connections
  // Analytics (required)
  costLogged: boolean;           // API-equivalent spend tracked
  capacityChecked: boolean;      // Weekly budget pace reported
}

// Gate: an interface cannot carry a method body, so the check is a function.
function safeToClose(c: CloseChecklist): boolean {
  return c.memoryUpdated
    && c.sessionLogged
    && c.costLogged;
}
Session Types
Not all sessions are the same. Different work requires different context loads and different OPEN sequences:
type SessionType =
| "dev" // Execute one goal. Ship code.
| "planning" // Evaluate past work. Launch next sprint.
| "ops" // Build automation. Reduce interactive time.
| "architecture" // Process ideas. Design systems.
| "monetization"; // Find the shortest path to money.
// Each type loads different context at OPEN:
const contextLoads: Record<SessionType, string[]> = {
dev: ["MEMORY.md", "#13 priorities", "#12 session log", "#30 planning book"],
planning: ["MEMORY.md", "#30 planning book (ALL comments)", "#13 priorities", "#12 session log"],
ops: ["MEMORY.md", "#49 ops log", "#12 session log"],
architecture: ["MEMORY.md", "Ideas/Inbox/*", "#33 pipeline", "#13 priorities"],
monetization: ["MEMORY.md", "#50 monetization log", "#23 revenue loop", "Brands/*.md"],
};
Planning sessions load more context than dev sessions because they need to see the full picture. Dev sessions load the minimum needed to execute a single goal. This is intentional --- every token of context load is a tax, and you should only pay for what you need.
If you cannot measure it, you cannot manage it. Token budgeting means tracking what costs tokens, auditing your session startup cost, and cutting bloat.
Measuring Session Cost
Here is a lightweight cost-tracking script that runs at session CLOSE:
#!/bin/bash
# Log the API-equivalent cost of the session that just ended.
# Arguments: [TYPE] [DESCRIPTION]
TYPE="${1:-W}"
DESC="${2:-session}"
TODAY=$(date +"%Y-%m-%d")
STATE_FILE=".session-cost-state"
LOG_FILE="session-costs.md"

# Today's running total from ccusage (falls back to the latest day reported)
CURRENT=$(npx ccusage daily --limit 1 --json 2>/dev/null | python3 -c "
import sys, json
d = json.load(sys.stdin)
days = d.get('daily', [])
today = '$TODAY'
for day in days:
    if day.get('date', '') == today:
        print(f'{day.get(\"totalCost\", 0):.2f}')
        sys.exit()
if days:
    print(f'{days[-1].get(\"totalCost\", 0):.2f}')
" 2>/dev/null)
CURRENT="${CURRENT:-0}"  # no data from ccusage: treat as zero so bc still works

# Day total recorded at the previous CLOSE (reset when the date changes)
PREV_COST="0"
if [ -f "$STATE_FILE" ]; then
  PREV_DATE=$(head -1 "$STATE_FILE")
  PREV_COST=$(tail -1 "$STATE_FILE")
  [ "$PREV_DATE" != "$TODAY" ] && PREV_COST="0"
fi

# This session's cost is the growth in the day total since the last CLOSE
DELTA=$(echo "$CURRENT - $PREV_COST" | bc 2>/dev/null)
echo "$TODAY" > "$STATE_FILE"
echo "$CURRENT" >> "$STATE_FILE"
echo "| $TODAY | [$TYPE] | ~\$$DELTA | ~\$$CURRENT | $DESC |" >> "$LOG_FILE"
echo "Logged: [$TYPE] $TODAY - session: ~\$$DELTA (day total: ~\$$CURRENT)"
Real Session Cost Data
Here is actual session cost data from one day of tracked sessions:
| Date | Type | Session | Day Total | Description |
|------------|------|----------|-----------|------------------------------------------------|
| 2026-03-12 | [A] | ~$4 | ~$4 | Social intelligence + vault tooling |
| 2026-03-12 | [P] | ~$3 | ~$7 | Train system established, Train 4 launched |
| 2026-03-12 | [A] | ~$2 | ~$9 | Two-lane concurrency + session type design |
| 2026-03-12 | [D] | ~$4 | ~$13 | Affiliate blocks live on 33 articles |
| 2026-03-12 | [O] | ~$0.18 | ~$13 | Fixed cost tracking, created smoke tests |
| 2026-03-12 | [O] | ~$1.82 | ~$15 | Stale issue detection + planning prep tools |
| 2026-03-12 | [D] | ~$4.91 | ~$20 | Affiliate directory (12 tools) |
| 2026-03-12 | [D] | ~$20 | ~$40 | 1Password CLI + domain portfolio audit |
| 2026-03-12 | [D] | ~$6 | ~$48 | Custom domain wiring, standard pages |
| 2026-03-12 | [O] | ~$7 | ~$55 | Twitter API proxy + MCP tools |
| 2026-03-12 | [O] | ~$3.50 | ~$59 | MCP server consolidation (3→1, 21 tools) |
| 2026-03-12 | [D] | ~$4 | ~$71 | Standard pages, content gen triggered |
| 2026-03-12 | [P] | ~$2 | ~$73 | Train 4 evaluated, Train 5 launched |
| 2026-03-12 | [O] | ~$11 | ~$85 | YouTube intelligence via API proxy |
| 2026-03-12 | [D] | ~$3 | ~$89 | SM→GatherFeed intelligence wiring |
| 2026-03-12 | [P] | ~$2 | ~$91 | Train 5 evaluated, Train 6 launched |
That day's log held 22 sessions, producing $91 in API-equivalent value on a $200/month subscription. The table lists 16 of them; the jumps in the Day Total column (for example ~$59 to ~$71) appear to correspond to the rows left out. The subscription works out to ~$6.60/day, so this represents a 13.8x ROI on that single day.
Notice the variance: ops sessions cost $0.18-$11, dev sessions $1-$45 (in this table, $3-$20), planning sessions $2-3. This variance is useful --- it tells you which session types are token-heavy and where to optimize.
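The ROI figure is simple arithmetic, sketched here so it can be recomputed for other plans. The `dailyRoi` function is a hypothetical helper, and the 30-day month is an assumption; with it the result is about 13.65x, slightly below the 13.8x quoted above, which uses ~$6.60/day.

```typescript
// API-equivalent return on one day of a flat-rate subscription.
// daysInMonth = 30 is an assumption made for this sketch.
function dailyRoi(apiEquivalentUsd: number, monthlyPriceUsd: number, daysInMonth: number = 30): number {
  if (monthlyPriceUsd <= 0 || daysInMonth <= 0) {
    throw new Error("price and days must be positive");
  }
  const dailyPriceUsd = monthlyPriceUsd / daysInMonth;
  return apiEquivalentUsd / dailyPriceUsd;
}

// $91 of API-equivalent value on a $200/month plan.
const roiOnTrackedDay = dailyRoi(91, 200);
console.log(`${roiOnTrackedDay.toFixed(2)}x`);
```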
Auditing Your OPEN Cost
The session startup cost is the tax you pay before any real work begins. Here is how to audit it:
// Before optimization: 42K token OPEN sequence
interface BloatedOpen {
claudeMd: 4_000; // System instructions
memoryMd: 8_000; // Entire memory file, including stale sections
issueView12: 12_000; // Full session log (all 50+ comments)
issueView13: 6_000; // Full priorities (with all comments)
issueView30: 8_000; // Full planning book (all comments)
playbook: 4_000; // Full playbook file
// Total: ~42,000 tokens BEFORE any work begins
}
// After optimization: 12K token OPEN sequence
interface OptimizedOpen {
  claudeMd: 2_000;         // Trimmed to essentials, moved details to skills
  memoryMd: 3_500;         // Structured with lane sections, only read your lane
  sessionContextSh: 2_500; // Lightweight fetcher: last 5 comments only
  prioritiesBody: 1_500;   // Body only, skip comment history
  playbook: 2_000;         // Lane-specific playbook, not both
  capacityCheck: 500;      // Compact capacity report
  // Total: ~12,000 tokens → 71% reduction
}
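The audit itself is just a sum and a ratio. The sketch below recomputes both totals from the per-item figures in the two interfaces above; `OpenItem`, `openCost`, and `reductionPercent` are hypothetical names, and the source labels are shortened for readability.

```typescript
// One line of an OPEN audit: what was loaded and roughly how many tokens it cost.
interface OpenItem {
  source: string;
  tokens: number;
}

function openCost(items: OpenItem[]): number {
  return items.reduce((sum, item) => sum + item.tokens, 0);
}

function reductionPercent(before: number, after: number): number {
  return ((before - after) / before) * 100;
}

// Figures copied from the two interfaces above.
const bloatedOpen: OpenItem[] = [
  { source: "CLAUDE.md", tokens: 4_000 },
  { source: "MEMORY.md (entire file)", tokens: 8_000 },
  { source: "issue #12 (full)", tokens: 12_000 },
  { source: "issue #13 (full)", tokens: 6_000 },
  { source: "issue #30 (full)", tokens: 8_000 },
  { source: "playbook (full)", tokens: 4_000 },
];

const optimizedOpen: OpenItem[] = [
  { source: "CLAUDE.md (trimmed)", tokens: 2_000 },
  { source: "MEMORY.md (own lane)", tokens: 3_500 },
  { source: "session context fetcher", tokens: 2_500 },
  { source: "priorities body", tokens: 1_500 },
  { source: "lane playbook", tokens: 2_000 },
  { source: "capacity check", tokens: 500 },
];

const beforeTokens = openCost(bloatedOpen);
const afterTokens = openCost(optimizedOpen);
console.log(beforeTokens, afterTokens, reductionPercent(beforeTokens, afterTokens).toFixed(1) + "%");
```

Keeping the audit as data makes it easy to rerun whenever a file in the OPEN sequence grows.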
The key optimization was the lightweight fetcher pattern --- a shell script that replaces expensive gh issue view calls with targeted API queries:
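#!/bin/bash
# Lightweight context fetcher: print only the slices of each issue a session needs.
REPO="your-org/your-repo"
TARGET="${1:-all}"

# Print the issue body only (no comment thread), capped at 80 lines.
fetch_issue_body() {
  local num=$1
  local title=$2
  echo "--- #$num $title ---"
  gh api "repos/$REPO/issues/$num" --jq '.body' 2>/dev/null | head -80
  echo ""
}

# Print the last N comments, oldest first, each capped at 500 characters.
# The REST comments endpoint for a single issue returns comments oldest-first
# and takes no sort/direction parameters, so this asks GraphQL for the last N.
fetch_recent_comments() {
  local num=$1
  local title=$2
  local count=${3:-5}
  echo "--- #$num $title (last $count comments) ---"
  gh api graphql \
    -f owner="${REPO%%/*}" -f name="${REPO##*/}" -F num="$num" -F count="$count" \
    -f query='
      query($owner: String!, $name: String!, $num: Int!, $count: Int!) {
        repository(owner: $owner, name: $name) {
          issue(number: $num) {
            comments(last: $count) { nodes { createdAt body } }
          }
        }
      }' \
    --jq '.data.repository.issue.comments.nodes[] | "--- " + .createdAt[0:10] + " ---\n" + .body[0:500]' 2>/dev/null
  echo ""
}

if [ "$TARGET" = "12" ] || [ "$TARGET" = "all" ]; then
  fetch_recent_comments 12 "Session Log" 5
fi
if [ "$TARGET" = "13" ] || [ "$TARGET" = "all" ]; then
  fetch_issue_body 13 "Priorities"
fi
if [ "$TARGET" = "30" ] || [ "$TARGET" = "all" ]; then
  fetch_recent_comments 30 "Planning Book" 5
fi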
This script saves ~21K tokens per session by fetching only what is needed:
- Session log (#12): Last 5 comments instead of full history (saves ~10K)
- Priorities (#13): Body only, skip the comment thread (saves ~5K)
- Planning book (#30): Last 5 comments instead of full history (saves ~6K)
- Truncation: Comments capped at 500 characters, body at 80 lines
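The selection and truncation rules in the list above can also be expressed as a pure function, which makes them easy to test away from the GitHub API. This is a sketch under assumptions: `IssueComment` and `trimComments` are hypothetical names, and timestamps are assumed to be ISO 8601 strings in a single time zone so that string order matches time order.

```typescript
interface IssueComment {
  createdAt: string; // ISO 8601 timestamp, e.g. "2026-03-12T09:00:00Z"
  body: string;
}

// Keep only the newest `count` comments, oldest first, each capped at `maxChars`.
function trimComments(comments: IssueComment[], count: number = 5, maxChars: number = 500): string[] {
  if (count <= 0) {
    return []; // slice(-0) would return everything, so handle zero explicitly
  }
  return [...comments]
    .sort((a, b) => (a.createdAt < b.createdAt ? -1 : a.createdAt > b.createdAt ? 1 : 0))
    .slice(-count)
    .map((c) => `--- ${c.createdAt.slice(0, 10)} ---\n${c.body.slice(0, maxChars)}`);
}

// Usage: seven long comments in, five short ones out.
const demoComments: IssueComment[] = [];
for (let day = 1; day <= 7; day++) {
  demoComments.push({ createdAt: `2026-03-0${day}T10:00:00Z`, body: "progress note ".repeat(60) });
}
console.log(trimComments(demoComments).length);
```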
Token Budget by Session Type
Different session types should have different token budgets:
interface TokenBudget {
sessionType: SessionType;
openCost: number; // Target startup cost
workBudget: number; // Expected working tokens
totalTarget: number; // Target total session cost
}
const budgets: TokenBudget[] = [
{ sessionType: "dev", openCost: 12_000, workBudget: 150_000, totalTarget: 180_000 },
{ sessionType: "planning", openCost: 20_000, workBudget: 40_000, totalTarget: 60_000 },
{ sessionType: "ops", openCost: 8_000, workBudget: 80_000, totalTarget: 100_000 },
{ sessionType: "architecture", openCost: 15_000, workBudget: 60_000, totalTarget: 80_000 },
{ sessionType: "monetization", openCost: 18_000, workBudget: 30_000, totalTarget: 50_000 },
];
// Planning is expensive to OPEN but cheap to execute (mostly analysis, not code)
// Dev is moderate to OPEN but expensive to execute (reading + writing code)
// Ops varies wildly (quick scripts vs complex infrastructure)
The most effective context management pattern is a 3-layer architecture where information is loaded progressively based on need.
Layer 1: Always-Loaded (Tiny Index)
This layer is loaded into every session, regardless of type. It must be small --- under 200 lines, ideally under 4K tokens.
Brief project description. One paragraph max.
Directory layout. Just the top-level names.
- Dev [D]: Execute one goal. Ship code.
- Planning [P]: Evaluate and plan.
- Ops [O]: Build automation.
- Architecture [A]: Process ideas.
- Monetization [M]: Find revenue paths.
At OPEN, read your lane's playbook:
- Dev or Planning → Read CLAUDE-main.md
- Others → Read CLAUDE-aux.md
1. Reference issues by number; don't re-explain
2. Don't re-read files already in conversation
3. Grep/Glob directly; no agent for simple searches
List of available shell tools with one-line descriptions
This file tells the AI what kind of thing it is, what kinds of work exist, and where to find more detailed instructions. It does NOT contain the detailed instructions themselves --- those live in Layer 2.
Key insight: Layer 1 is a routing table, not a manual. It tells the AI where to look, not what to do. This keeps the always-loaded cost under 4K tokens.
Layer 2: Session-Loaded (Playbooks, Memory)
This layer is loaded based on the session type determined at OPEN. Different sessions load different playbooks.
// Layer 2 files and when they load
interface Layer2 {
// Loaded by Dev and Planning sessions only
"CLAUDE-main.md": {
content: "Full OPEN/WORK/CLOSE playbook for dev and planning";
tokens: 2_000;
loadedBy: ["dev", "planning"];
};
// Loaded by Ops, Architecture, and Monetization sessions only
"CLAUDE-aux.md": {
content: "Full OPEN/WORK/CLOSE playbook for aux lane sessions";
tokens: 3_000;
loadedBy: ["ops", "architecture", "monetization"];
};
// Loaded by all sessions
"MEMORY.md": {
content: "Persistent state: project status, issue index, decisions";
tokens: 3_500;
loadedBy: ["all"];
sections: {
"MAIN LANE STATUS": "Written by dev/planning only",
"AUX LANE STATUS": "Written by ops/arch only",
"SHARED SECTIONS": "Both lanes read, additive only",
};
};
}
The key design decision is splitting playbooks by lane. A dev session never loads the aux playbook, and vice versa. This saves 2-3K tokens per session and prevents the AI from confusing session types.
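The routing decision is small enough to write down directly. In this sketch the lane membership follows the text (dev and planning are main, the rest are aux); `SessionKind`, `LANE_OF`, and `layer2Files` are hypothetical names.

```typescript
type SessionKind = "dev" | "planning" | "ops" | "architecture" | "monetization";
type LaneName = "main" | "aux";

// Lane membership as described in the text.
const LANE_OF: Record<SessionKind, LaneName> = {
  dev: "main",
  planning: "main",
  ops: "aux",
  architecture: "aux",
  monetization: "aux",
};

// Files to read at OPEN: the shared memory file plus exactly one playbook.
function layer2Files(kind: SessionKind): string[] {
  const playbook = LANE_OF[kind] === "main" ? "CLAUDE-main.md" : "CLAUDE-aux.md";
  return ["MEMORY.md", playbook];
}

console.log(layer2Files("dev"));
console.log(layer2Files("ops"));
```

Because the function returns a single playbook, a session cannot load both by accident.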
Here is what a lane-specific playbook looks like:
Read this at OPEN for Dev or Planning sessions.
Do NOT read CLAUDE-aux.md (save tokens).
---
### OPEN (refresh → brief → goal → log)
1. Read MEMORY.md (status, issue index, decisions)
2. Run capacity check (weekly budget → report pace)
3. Read session log (scan for active parallel sessions)
4. Read priorities (pick a group to work on)
5. Read planning book (is planning overdue? 5+ unprocessed entries?)
### WORK (one goal, tight context, finish it)
- One group, one goal. Don't wander.
- If strategy is wrong, that IS the finding: surface it, stop.
- Log progress as you go on the work issue.
### CLOSE
1. Update priorities: check off done items
2. Update work issues: current descriptions, close if done
3. Capture to planning book: strategic signals
4. Log session to #12
5. Update MEMORY.md (your lane's section only)
6. Log session cost
7. Check capacity
8. "Session closed. Safe to /clear."
Layer 3: On-Demand (Full Files, APIs)
This layer is never pre-loaded. The AI fetches it during the WORK phase when it needs specific information.
// Layer 3: fetched on demand during work
type OnDemandSource =
| { type: "file"; path: string } // Read a specific source file
| { type: "api"; endpoint: string } // Call an API
| { type: "search"; query: string } // Search the codebase
| { type: "issue"; number: number } // Read a GitHub issue
| { type: "subagent"; agent: string } // Delegate to a specialized agent
| { type: "web"; url: string }; // Fetch documentation
// Examples of on-demand fetches:
// - Reading a specific source file the AI needs to modify
// - Querying a database for keyword research data
// - Running a codebase search for usage patterns
// - Delegating a multi-file scan to a sub-agent
// - Fetching external documentation for a library
The key principle: Layer 3 content should NEVER be loaded speculatively. If the AI might need a file, it should not pre-load it "just in case." It should fetch it when the need is confirmed.
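A minimal sketch of that principle: nothing is fetched until it is asked for, and a repeated request is served from what was already read. The `OnDemandLoader` class and the example path are hypothetical, and the fetcher passed in stands for whatever actually reads a file, issue, or URL.

```typescript
// A hypothetical on-demand loader: no fetch happens until get() is called,
// and a repeated request for the same key does not fetch again.
class OnDemandLoader {
  private readonly cache = new Map<string, string>();
  private readonly fetcher: (key: string) => string;
  fetchCount = 0;

  constructor(fetcher: (key: string) => string) {
    this.fetcher = fetcher;
  }

  get(key: string): string {
    const cached = this.cache.get(key);
    if (cached !== undefined) {
      return cached; // already in context, do not re-read
    }
    const value = this.fetcher(key);
    this.fetchCount += 1;
    this.cache.set(key, value);
    return value;
  }
}

// Usage with a stand-in fetcher.
const demoLoader = new OnDemandLoader((path) => `contents of ${path}`);
demoLoader.get("src/auth/login.ts");
demoLoader.get("src/auth/login.ts"); // second call is served from the cache
console.log(demoLoader.fetchCount);
```

The cache also covers the Layer 1 rule about not re-reading files that are already in the conversation.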
The Full Architecture
┌─────────────────────────────────────────┐
│             Context Window              │
│                                         │
│ ┌─ Layer 1: Always-Loaded (~4K) ──────┐ │
│ │ CLAUDE.md (routing table)           │ │
│ │ System prompt                       │ │
│ └─────────────────────────────────────┘ │
│                                         │
│ ┌─ Layer 2: Session-Loaded (~8K) ─────┐ │
│ │ Lane playbook (main OR aux)         │ │
│ │ MEMORY.md (persistent state)        │ │
│ │ Priority list (body only)           │ │
│ │ Session log (last 5 comments)       │ │
│ └─────────────────────────────────────┘ │
│                                         │
│ ┌─ Layer 3: On-Demand (varies) ───────┐ │
│ │ Source files (as needed)            │ │
│ │ API responses (as needed)           │ │
│ │ Sub-agent results (summarized)      │ │
│ │ Documentation (fetched when used)   │ │
│ └─────────────────────────────────────┘ │
│                                         │
│ ┌─ Conversation History (grows) ──────┐ │
│ │ Previous messages + responses       │ │
│ │ (auto-compacted at ~95% capacity)   │ │
│ └─────────────────────────────────────┘ │
└─────────────────────────────────────────┘
Budget allocation:
Layer 1: ~4K tokens (2% of 200K window)
Layer 2: ~8K tokens (4% of 200K window)
Layer 3: ~50K tokens (25% of 200K window, variable)
History: ~120K tokens (60% of 200K window, grows)
Headroom: ~18K tokens (9% buffer for compaction)
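The allocation above can be kept as data and checked mechanically, so a change to one layer cannot silently overfill the window. The names `WindowSlice`, `plannedTokens`, `sharePercent`, and `assertPlanFits` are hypothetical; the figures are the ones listed above.

```typescript
interface WindowSlice {
  part: string;
  tokens: number;
}

const CONTEXT_WINDOW_TOKENS = 200_000;

// Budget allocation copied from the list above.
const windowPlan: WindowSlice[] = [
  { part: "Layer 1", tokens: 4_000 },
  { part: "Layer 2", tokens: 8_000 },
  { part: "Layer 3", tokens: 50_000 },
  { part: "History", tokens: 120_000 },
  { part: "Headroom", tokens: 18_000 },
];

function plannedTokens(plan: WindowSlice[]): number {
  return plan.reduce((sum, slice) => sum + slice.tokens, 0);
}

function sharePercent(slice: WindowSlice, windowTokens: number): number {
  return (slice.tokens / windowTokens) * 100;
}

// Throws if the plan asks for more tokens than the window holds.
function assertPlanFits(plan: WindowSlice[], windowTokens: number): void {
  const total = plannedTokens(plan);
  if (total > windowTokens) {
    throw new Error(`Plan needs ${total} tokens but the window holds ${windowTokens}`);
  }
}

assertPlanFits(windowPlan, CONTEXT_WINDOW_TOKENS);
for (const slice of windowPlan) {
  console.log(`${slice.part}: ${sharePercent(slice, CONTEXT_WINDOW_TOKENS).toFixed(0)}%`);
}
```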
Designing Each Layer
Layer 1 design rules:
- Maximum 200 lines
- No detailed instructions (those go in Layer 2 playbooks)
- Include file paths, not file contents
- Use references ("see MEMORY.md") not inline content
- Must work for ALL session types
Layer 2 design rules:
- Split by session type or lane
- Include the full OPEN/WORK/CLOSE checklist
- Load only the relevant playbook, never both
- Memory file structured with clear section ownership
- Maximum 5K tokens per playbook
Layer 3 design rules:
- Never pre-load
- Fetch the minimum needed (body only, last N comments, specific function)
- Use lightweight fetchers instead of expensive full reads
- Delegate bulk processing to sub-agents
- Discard after use if possible (summarize, then let it compact)
Once you have session-based workflows working, you unlock a powerful capability: running two AI sessions simultaneously on different workstreams without them colliding.
The Problem
If two AI sessions are both modifying the same files, reading the same issues, and updating the same state, you get conflicts:
- Session A updates MEMORY.md while Session B is reading it
- Session A rewrites the priority list while Session B is working from the old one
- Both sessions push commits that conflict
- Both sessions comment on the same issue, creating confusing interleaved state
The Solution: Lane-Based Partitioning
Partition your work into two lanes, each with its own owned resources:
interface Lane {
name: "main" | "aux";
sessionTypes: SessionType[];
ownedResources: Resource[];
sharedResources: Resource[];
}
const mainLane: Lane = {
name: "main",
sessionTypes: ["dev", "planning"],
ownedResources: [
"#13 priorities", // Main lane restructures freely
"work issues", // Dev work tracked here
"CLAUDE-main.md", // Main lane playbook
"MEMORY.md ## MAIN STATUS", // Main lane's memory section
],
sharedResources: [
"#12 session log", // Both lanes append
"#30 planning book", // Both lanes append
"MEMORY.md ## SHARED", // Both lanes read, additive only
],
};
const auxLane: Lane = {
name: "aux",
sessionTypes: ["ops", "architecture", "monetization"],
ownedResources: [
"#33 ideas pipeline", // Architecture tracking
"#49 ops tracking", // Ops tracking
"#50 monetization tracking", // Monetization tracking
"CLAUDE-aux.md", // Aux lane playbook
"MEMORY.md ## AUX STATUS", // Aux lane's memory section
],
sharedResources: [
"#12 session log",
"#30 planning book",
"MEMORY.md ## SHARED",
],
};
Conflict Prevention Rules
1. **Comments are safe, edits are dangerous.**
Prefer additive comments on shared issues over overwriting
descriptions or files.
2. **Each lane owns different resources.**
Main lane owns #13, work issues, CLAUDE-main.md.
Aux lane owns #33/#49/#50, CLAUDE-aux.md.
Shared: #12, #30, MEMORY.md.
3. **OPEN: detect the other lane.**
Scan recent session log comments for unmatched OPEN markers.
If a sibling is active, note it.
4. **CLOSE: write to your lane's MEMORY section.**
MEMORY.md has ## MAIN LANE STATUS and ## AUX LANE STATUS.
Each lane writes only to its own section.
Shared sections are additive only.
5. **Priorities: main lane writes, aux lane appends.**
Dev/Planning restructure #13 freely.
Architecture/Ops only ADD items, never reorder.
6. **Planning book: entry-count trigger.**
5+ unprocessed entries since last CLEARED.
Either lane can comment; only Planning clears.
7. **File edits: coordinate by area.**
Main lane edits code and project notes.
Aux lane edits Ideas/, Standards/, agents, study decks.
If both need the same file, aux lane defers.
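Rules 1, 2, and 5 can be condensed into a single permission check. This is a sketch, not the system's actual enforcement: `RESOURCE_OWNERS` and `mayWrite` are hypothetical names, the resource labels are taken from the lane definitions above, and refusing unknown resources is a choice made here for safety.

```typescript
type LaneId = "main" | "aux";
type WriteMode = "append" | "overwrite";

// Ownership taken from the lane definitions above.
const RESOURCE_OWNERS: Record<string, LaneId | "shared"> = {
  "#13 priorities": "main",
  "CLAUDE-main.md": "main",
  "#33 ideas pipeline": "aux",
  "#49 ops tracking": "aux",
  "#50 monetization tracking": "aux",
  "CLAUDE-aux.md": "aux",
  "#12 session log": "shared",
  "#30 planning book": "shared",
};

function mayWrite(lane: LaneId, resource: string, mode: WriteMode): boolean {
  const owner = RESOURCE_OWNERS[resource];
  if (owner === undefined) {
    return false; // unknown resource: refuse by default (an assumption of this sketch)
  }
  if (owner === lane) {
    return true; // owners may append or overwrite
  }
  if (owner === "shared") {
    return mode === "append"; // shared resources are additive only
  }
  // Rule 5: the aux lane may add items to the priorities but never restructure them.
  return resource === "#13 priorities" && lane === "aux" && mode === "append";
}

console.log(mayWrite("aux", "#13 priorities", "append"));
console.log(mayWrite("aux", "#13 priorities", "overwrite"));
```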
Detecting Concurrent Sessions
At OPEN, each session checks for siblings:
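A minimal sketch of that check, assuming session log comments carry markers of the form `OPEN [D]` and `CLOSE [D]` as in the example timeline in this guide. The function name `findActiveSessions` is hypothetical, and comments are assumed to arrive oldest first.

```typescript
// Returns the session type codes (for example "D" or "O") that have an OPEN
// marker with no later CLOSE marker. Comments must be ordered oldest first.
function findActiveSessions(logComments: string[]): string[] {
  const openCounts: Record<string, number> = {};
  const marker = /\b(OPEN|CLOSE) \[([A-Z])\]/;
  for (const comment of logComments) {
    const match = marker.exec(comment);
    if (match === null) {
      continue; // not a session marker
    }
    const action = match[1] ?? "";
    const code = match[2] ?? "";
    const current = openCounts[code] ?? 0;
    openCounts[code] = action === "OPEN" ? current + 1 : Math.max(current - 1, 0);
  }
  return Object.keys(openCounts).filter((code) => (openCounts[code] ?? 0) > 0);
}

// Usage: the dev session has closed, the ops session is still open.
const recentLog = [
  "OPEN [D] - Dev session: wire custom domains",
  "OPEN [O] - Ops session: build MCP tools",
  "CLOSE [D] - Domains wired",
];
console.log(findActiveSessions(recentLog));
```

If the result is non-empty, the opening session notes the sibling in its self-brief and avoids that lane's resources.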
Valid Concurrent Combinations
Main lane: Dev ←→ Planning (they alternate, never concurrent)
Aux lane: One of: Ops, Architecture, Monetization (never overlap)
Valid combos:
  Dev + Ops ✅
  Dev + Architecture ✅
  Dev + Monetization ✅
  Planning + Ops ✅
  Planning + Architecture ✅
Invalid combos:
  Dev + Planning ❌ (both main lane)
  Ops + Architecture ❌ (both aux lane)
  Dev + Dev ❌ (same type)
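The combinations above reduce to one rule: two sessions may run together only if they sit in different lanes. A sketch of that rule follows, with `SessionKind`, `LANE_BY_KIND`, and `canRunTogether` as hypothetical names. Note that the rule also permits Planning + Monetization, which the list above does not mention either way.

```typescript
type SessionKind = "dev" | "planning" | "ops" | "architecture" | "monetization";

const LANE_BY_KIND: Record<SessionKind, "main" | "aux"> = {
  dev: "main",
  planning: "main",
  ops: "aux",
  architecture: "aux",
  monetization: "aux",
};

// Two sessions may run concurrently only when they are in different lanes.
function canRunTogether(a: SessionKind, b: SessionKind): boolean {
  return LANE_BY_KIND[a] !== LANE_BY_KIND[b];
}

console.log(canRunTogether("dev", "ops"));
console.log(canRunTogether("dev", "planning"));
```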
Real-World Example
Here is how two concurrent sessions worked on the same day:
09:00 🟢 OPEN [D] - Dev session: wire custom domains for 10 sites
      (main lane reads MEMORY.md, #13, picks goal)
09:15 🟢 OPEN [O] - Ops session: build YouTube intelligence MCP tools
      (aux lane reads MEMORY.md, #49, detects dev session active)
      (avoids editing any code the dev session is touching)
10:30 🔴 CLOSE [D] - Domains wired, standard pages deployed
      (updates MEMORY.md ## MAIN LANE STATUS)
10:45 🔴 CLOSE [O] - 5 YouTube MCP tools built, 14/14 smoke tests pass
      (updates MEMORY.md ## AUX LANE STATUS)
      (appends YouTube tools to #13 priorities, does not reorder)
Both sessions produced value without conflicts because they operated on partitioned resources. The dev session edited production code; the ops session built new tooling in a separate directory.
Capacity Pacing
Subscription plans have weekly token budgets. Capacity pacing means tracking your usage against the weekly cap and adjusting your session frequency to use 100% of your budget without hitting the wall.
The Capacity Check Script
#!/bin/bash
# Usage: capacity-check.sh [open|close]
MODE="${1:-open}"
WEEKLY_CAP="${CLAUDE_WEEKLY_CAP:-5000000}"
DATA=$(npx ccusage weekly --limit 1 --json 2>/dev/null)
# Hand values to Python through the environment instead of splicing them into the source
MODE="$MODE" CAP="$WEEKLY_CAP" DOW="$(date +%u)" DATA="$DATA" python3 -c "
import json, os, sys
mode = os.environ['MODE']      # open or close; not used in the output yet
cap = int(os.environ['CAP'])
dow = int(os.environ['DOW'])   # 1=Mon 7=Sun
try:
    d = json.loads(os.environ['DATA'])
except ValueError:
    d = {}                     # ccusage printed nothing or invalid JSON
weeks = d.get('weekly', [])
if not weeks:
    print('No weekly data')
    sys.exit()
w = weeks[-1]
output = w.get('outputTokens', 0)
ideal_pct = dow / 7 * 100
actual_pct = min(output / cap * 100, 999) if cap > 0 else 0
diff = actual_pct - ideal_pct  # percentage points ahead of (+) or behind (-) pace
remaining = max(cap - output, 0)
days_left = 7 - dow + 1        # includes today
budget_per_day = remaining / days_left
if diff > 15:
    status = 'AHEAD - Throttle'
elif diff > 5:
    status = 'SLIGHTLY AHEAD - Moderate pace'
elif diff > -10:
    status = 'ON PACE - Good'
elif diff > -25:
    status = 'BEHIND - Room to push'
else:
    status = 'FAR BEHIND - Push harder'
bar_w = 30
filled = min(int(actual_pct / 100 * bar_w), bar_w)
bar = '#' * filled + '-' * (bar_w - filled)
print(f'Capacity: [{bar}] {actual_pct:.0f}% | Day {dow}/7 | {status}')
print(f'Output: {output:,} / {cap:,} | Remaining: {remaining:,}')
print(f'Budget: {budget_per_day:,.0f} tokens/day for {days_left} remaining days')
"
Pacing Strategy
interface PacingStrategy {
  // Weekly budget (output tokens)
  weeklyBudget: number;
  // Current state
  dayOfWeek: number;       // 1-7
  tokensUsed: number;
  tokensRemaining: number;
  // Derived
  idealPace: number;       // weeklyBudget * (dayOfWeek / 7)
  actualPace: number;      // tokensUsed
  dailyBudget: number;     // tokensRemaining / daysLeft
}
// Decision: relative deviation from the ideal pace, in percent.
// Note: the capacity check script compares percentage points instead,
// so the two can disagree early in the week.
function recommendation(p: PacingStrategy): string {
  const diff = (p.actualPace / p.idealPace - 1) * 100;
  if (diff > 15) return "THROTTLE: Run fewer sessions, use lighter models";
  if (diff > 5) return "MODERATE: Normal session count, avoid heavy ops";
  if (diff > -10) return "ON PACE: Keep going, this is ideal";
  if (diff > -25) return "PUSH: Run extra sessions today, use heavier models";
  return "SPRINT: Maximum sessions, large-scope work, use Opus freely";
}
Adjusting Session Weight
When you are behind pace, increase session weight (use Opus, tackle larger goals, run more sessions). When ahead, decrease it (use Sonnet, tackle smaller goals, skip optional sessions).
Week pacing example (5M weekly cap):
Mon: Used 400K (8%) → Ideal: 14% → FAR BEHIND
  → Run 3-4 heavy dev sessions, use Opus
Tue: Used 1.2M (24%) → Ideal: 29% → BEHIND
  → Run 2-3 sessions, normal weight
Wed: Used 2.1M (42%) → Ideal: 43% → ON PACE
  → Normal operations
Thu: Used 3.0M (60%) → Ideal: 57% → SLIGHTLY AHEAD
  → Moderate pace, prefer Sonnet
Fri: Used 3.8M (76%) → Ideal: 71% → AHEAD
  → Lighter sessions, small scope
Sat: Used 4.3M (86%) → Ideal: 86% → ON PACE
  → Normal wind-down
Sun: Used 4.9M (98%) → Ideal: 100% → GOOD
  → One final session to close out the week
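To check the arithmetic behind a week like this one, here is a minimal Python sketch of the relative-deviation rule from the pacing strategy. The function names are illustrative, and the labels are the short forms of the recommendations above:

```python
def ideal_tokens(weekly_cap: int, day_of_week: int) -> float:
    """Tokens you would have used by the end of this day at a perfectly even pace."""
    return weekly_cap * day_of_week / 7


def deviation_pct(used: int, weekly_cap: int, day_of_week: int) -> float:
    """Relative deviation from the ideal pace, in percent (negative = behind)."""
    return (used / ideal_tokens(weekly_cap, day_of_week) - 1) * 100


def recommendation(diff: float) -> str:
    """Same thresholds as the pacing strategy."""
    if diff > 15:
        return "THROTTLE"
    if diff > 5:
        return "MODERATE"
    if diff > -10:
        return "ON PACE"
    if diff > -25:
        return "PUSH"
    return "SPRINT"


CAP = 5_000_000
for day, (name, used) in enumerate(
    [("Mon", 400_000), ("Tue", 1_200_000), ("Wed", 2_100_000)], start=1
):
    diff = deviation_pct(used, CAP, day)
    print(f"{name}: {diff:+.0f}% vs ideal -> {recommendation(diff)}")
```

Because the deviation is relative, small absolute gaps early in the week produce strong signals: 400K on Monday is only about six percentage points short of ideal, but 44% below the even-pace target.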
The Usage Report
A weekly usage report provides the big picture:
#!/bin/bash
echo "Daily breakdown (last 14 days):"
npx ccusage daily --limit 14 2>/dev/null
echo ""
echo "Weekly breakdown:"
npx ccusage weekly --limit 4 2>/dev/null
echo ""
echo "How to read:"
echo "Cost = API-equivalent (what tokens would cost without subscription)."
echo "Budget: \$6.60/day on Max plan."
echo "Most work days show \$100-400 = 15-60x ROI."
echo "Low days (<\$10) = subscription underused."
echo "Goal: use Claude Code every day, no zero days."
Key insight: On a Max subscription ($200/month), a day with $0 in API-equivalent usage is a $6.60 loss. The goal is not to minimize spend --- it is to maximize the value extracted from a fixed-cost resource. Zero days are the enemy.
Persistent Memory Systems
The fundamental problem with AI coding sessions is that /clear destroys everything. Persistent memory systems ensure that critical state survives conversation resets.
The Memory File
MEMORY.md is the backbone of cross-session persistence. It is loaded at Layer 2 (every session reads it) and updated at CLOSE (every session writes to it).
## MAIN LANE STATUS
_Written by Dev [D] and Planning [P] sessions only._
- **Current train**: Train 6 - "Ship Content"
  - S1 done: 21 articles published on llctax.co
  - S2 next: ugc.marketing (20+ articles)
- **Active brands**: llctax.co (score 149), ugc.marketing (score 114)
- **GatherFeed**: 18 RSS feeds, 185+ posts, cron every 6h
- **Blockers for DRR > $0**: AdSense ID (human), affiliate signups (human)
## AUX LANE STATUS
_Written by Ops [O] and Architecture [A] sessions only._
- **Last ops session**: YouTube MCP tools built, 14/14 smoke tests
- **API Mom providers**: brave, perplexity, dataforseo, twitter, youtube
- **Automation**: smoke test cron (4h), 6 shell tools, 5 sub-agents
- **Monetization**: No sessions yet. DRR = $0 across all brands.
## SHARED SECTIONS
_Both lanes read. Additive edits only._
- Scramjet: pipeline engine (D1+R2+DO+Workflows+Queues)
- Scalable Media: autonomous brand operator
- GatherFeed: async research engine
- Pages-plus: multi-site SSR
- API Mom: managed keys, cost attribution, cache
### Active (revenue path)
- #17 Brand Engine · #23 Revenue loop · #62 Batch content gen
### Open (background)
- #1 Monorepo · #32 Observability · #65 Bookmarks ingest
## KEY DECISIONS
- Event-driven: all inter-service via Queues
- DRR is the KPI: daily recurring revenue per brand
- Always wrangler.jsonc, never .toml
Memory File Design Principles
interface MemoryFileDesign {
  // Structure
  sections: {
    laneOwned: string[]; // Written by one lane only
    shared: string[];    // Both lanes read, additive edits
  };
  // Sizing
  maxTokens: 4_000; // Must fit comfortably in Layer 2 budget
  maxLines: 200;    // Enforced by periodic curation
  // Content rules
  rules: [
    "State, not history. Current status, not what happened.",
    "References, not content. Issue numbers, not issue descriptions.",
    "Decisions, not debates. What was decided, not why alternatives were rejected.",
    "Lane separation. Each lane writes only to its own section.",
    "Shared sections are additive only. Never delete another lane's additions.",
  ];
  // Anti-patterns
  avoid: [
    "Session transcripts (use session log issue instead)",
    "Full code blocks (use file paths instead)",
    "Stale information (prune at Planning sessions)",
    "Duplicated content (reference once, don't repeat)",
  ];
}
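The sizing limits are easy to check mechanically. A rough Python sketch, assuming about four characters per token; that ratio is a common rule of thumb, not an exact tokenizer count, and `memory_report` is an illustrative name:

```python
def memory_report(text: str, max_tokens: int = 4_000, max_lines: int = 200) -> dict:
    """Estimate whether a memory file fits the sizing limits above."""
    lines = len(text.splitlines())
    est_tokens = len(text) // 4  # ASSUMPTION: roughly 4 characters per token
    return {
        "lines": lines,
        "est_tokens": est_tokens,
        "within_budget": lines <= max_lines and est_tokens <= max_tokens,
    }


sample = "- **Current train**: Train 6\n" * 50
print(memory_report(sample))
```

In practice you would pass in the contents of MEMORY.md at CLOSE and flag the file for curation once `within_budget` turns false.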
GitHub Issues as Knowledge Base
Beyond MEMORY.md, GitHub issues serve as a persistent knowledge base with specific roles:
interface KnowledgeBase {
  // System issues (always exist, never close)
  "#12": {
    name: "Session Log";
    purpose: "Chronicle of all sessions: OPEN/CLOSE markers, summaries";
    pattern: "Append-only comments. Never edit previous comments.";
    readAt: "OPEN (last 5 comments for sibling detection)";
    writeAt: "OPEN (🟢 marker) and CLOSE (🔴 marker)";
  };
  "#13": {
    name: "Priorities";
    purpose: "What needs doing, ranked by importance";
    pattern: "Body = current priorities. Main lane rewrites freely.";
    readAt: "OPEN (body only)";
    writeAt: "CLOSE (check off done items)";
  };
  "#30": {
    name: "Planning Book";
    purpose: "Strategic signals captured during work";
    pattern: "Append-only. Cleared only by Planning sessions.";
    readAt: "OPEN (last 5 comments, check if 5+ unprocessed)";
    writeAt: "During WORK (when noticing strategic signals)";
    tags: [
      "SEQUENCE: something should happen in a different order",
      "MISSING: a prerequisite we did not account for",
      "OPPORTUNITY: a new revenue or efficiency opportunity",
      "ARCHITECTURE: a system design consideration",
      "PRIORITY: something should be ranked differently",
      "PROCESS: a workflow improvement",
      "OPS: an automation opportunity",
    ];
  };
}
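The planning book trigger (5+ unprocessed entries since the last CLEARED comment) can be sketched in a few lines of Python. This assumes the #30 comments are already available as a list of strings, oldest first, and that a clearing comment starts with the word CLEARED; the function names are illustrative:

```python
def unprocessed_entries(comments: list[str]) -> int:
    """Count planning-book comments written after the most recent CLEARED marker."""
    count = 0
    for text in reversed(comments):  # walk back from the newest comment
        if text.lstrip().startswith("CLEARED"):
            break
        count += 1
    return count


def planning_due(comments: list[str], threshold: int = 5) -> bool:
    """True once the number of unprocessed entries reaches the trigger."""
    return unprocessed_entries(comments) >= threshold


book = [
    "MISSING: keyword storage",
    "CLEARED - processed 1 entries",
    "OPS: automate smoke tests",
    "PROCESS: shorten CLOSE checklist",
]
print(unprocessed_entries(book), planning_due(book))
```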
File-Based vs Database Memory
// File-based memory (what we use)
// Pros: readable, versionable, works offline, AI can read/write directly
// Cons: no query engine, manual curation needed, concurrent writes need rules
interface FileMemory {
  "MEMORY.md": "Cross-session state, updated at CLOSE";
  "session-costs.md": "Cost tracking log, append-only";
  "Log/Daily/*.md": "Daily logs with insights, connections, decisions";
  "Standards/*.md": "Institutional knowledge, updated at Planning";
}
// Database memory (what tools like ChatGPT use)
// Pros: queryable, automatic, no manual curation
// Cons: opaque, not versionable, AI can't control what's stored
interface DatabaseMemory {
  chatgpt: "Automatic fact extraction, stored in profile";
  windsurf: "Cascade memories: auto-generated during conversation";
  copilot: "Copilot Spaces: scoped to repos, not sessions";
}
Key insight: File-based memory is more work to maintain but gives you full control. Database memory is easier but opaque. For serious AI workflows, you want both: file-based memory for structured state, and the tool's built-in memory for ad hoc facts.
The Curation Problem
Memory files grow stale. Old status entries, completed issues still listed, decisions that have been superseded --- all of these waste tokens on every session.
The solution is periodic curation, triggered by the planning cycle:
interface CurationProcess {
  trigger: "Planning session (every 5-10 dev sessions)";
  tasks: [
    "Remove completed items from MAIN/AUX STATUS sections",
    "Archive closed issues from the index",
    "Update SHARED SECTIONS with current infrastructure state",
    "Prune KEY DECISIONS that are now obvious/established",
    "Verify all referenced file paths still exist",
    "Check MEMORY.md token count: must stay under 4K",
  ];
  // Automated curation via sub-agent
  agent: "vault-curator";
  model: "sonnet"; // Cheap model for curation
  output: "Report of changes made, stale items removed, token count";
}
Daily Logs as Institutional Knowledge
Each session produces a daily log entry. Over time, these become searchable institutional knowledge:
## Sessions
- [A] Social intelligence + vault tooling (4 session types, canvas diagrams)
- [P] Train system established, Trains 1-3 evaluated, Train 4 launched
- [D] Affiliate blocks live on bankstatementtoexcel (33 articles, 6 tools)
- [O] Smoke test automation, vault-curator validation
- [D] 1Password CLI + domain portfolio audit (181 domains)
- [O] Tailscale mesh: 4 endpoints with SSH
## Insights
- 1Password CLI + service account tokens = no human needed for secrets
- Domain portfolio strategy: near-zero marginal cost per site means buy aggressively
- Smoke tests catch regressions that would otherwise go unnoticed for days
## Connections
- [[domain-portfolio-strategy]] informed by [[monetization-vocabulary]]
- [[api-mom]] proxy pattern enables [[twitter]] and [[youtube]] intelligence
- [[affiliate-blocks]] pattern from [[composable-archetypes]]
## Decisions
- Train system adopted: planning + N dev sessions, max 5 per train
- Two-lane concurrency: main lane (dev/planning) + aux lane (ops/arch/monetization)
- Revenue items go FIRST in every train plan, not last
The Train System
Sessions need to roll up into something larger. Without a macro-level structure, you get "session soup" --- productive individual sessions that do not add up to a coherent outcome.
What is a Train?
A train is a planning cycle: one planning session (the locomotive) followed by N execution sessions (the cars). The planning session sets the destination. The execution sessions get there. The next planning session evaluates results and launches a new train.
interface Train {
  number: number;
  name: string;               // Goal-oriented, e.g., "Ship Content"
  plannedAt: Date;
  sessions: PlannedSession[]; // Max 5 dev sessions
  successCriteria: string[];  // Measurable, revenue-linked
  // Metrics (filled during/after execution)
  planCompletion: number;     // items done / items planned
  workArchRatio: number;      // % of sessions that shipped work (target >80%)
  driftSessions: number;      // sessions that went off-plan (target: 0)
  drrDelta: number;           // change in daily recurring revenue
  issuesNet: number;          // created - closed (target: <=0)
}
interface PlannedSession {
  number: number;
  goal: string;
  revenueTest: string;        // "Does this make money? How?"
  status: "planned" | "complete" | "skipped";
}
Train Rules
These rules were learned from tracking 6 trains over several weeks:
1. **Revenue-closing items go FIRST**, not last.
   Train 1-3 pattern: build infra first, monetize later.
   Result: $0 revenue after 15 sessions.
   Fix: put the money-making work at the front of every train.
2. **No architecture sessions during a train.**
   Architecture is exploration. Trains are execution.
   If you notice an architecture issue, capture it on #30.
   Don't solve it now.
3. **Max 5 dev sessions per train.**
   Longer trains drift. Shorter trains force focus.
   If you can't accomplish the goal in 5 sessions,
   the goal is too big, so split it.
4. **Every session passes the revenue test.**
   "Does this make money?" If the answer is "eventually"
   or "it enables something that will," be suspicious.
   Prefer direct revenue impact.
5. **Close >= create for issues per train.**
   Issue count is a leading indicator of complexity creep.
   If a train creates more issues than it closes,
   something is wrong.
6. **No side quests.**
   Capture on #30, don't execute.
   Side quests are the #1 killer of train productivity.
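Several of these rules are mechanical enough to check at the end of a train. A minimal Python sketch, assuming you record a handful of counts per train; the `TrainSummary` fields and the `rule_violations` helper are illustrative, not part of the train file format:

```python
from dataclasses import dataclass


@dataclass
class TrainSummary:
    """Hypothetical per-train counts gathered at the next planning session."""
    dev_sessions: int
    architecture_sessions: int
    issues_created: int
    issues_closed: int
    first_item_is_revenue: bool


def rule_violations(train: TrainSummary) -> list[str]:
    """Return the train rules this train broke (empty list means clean)."""
    problems = []
    if not train.first_item_is_revenue:
        problems.append("rule 1: revenue-closing item was not first")
    if train.architecture_sessions > 0:
        problems.append("rule 2: architecture session during the train")
    if train.dev_sessions > 5:
        problems.append("rule 3: more than 5 dev sessions")
    if train.issues_closed < train.issues_created:
        problems.append("rule 5: created more issues than closed")
    return problems


drifting = TrainSummary(dev_sessions=7, architecture_sessions=0,
                        issues_created=4, issues_closed=2,
                        first_item_is_revenue=True)
for problem in rule_violations(drifting):
    print(problem)
```

Rules 4 and 6 (the revenue test and side quests) need human judgment, so they are left out of the sketch.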
Evolution Through Trains
Here is how the train system evolved through actual use:
Train 1-3: Architecture drift
  Problem: Spent all sessions building infra, no revenue
  Pattern: "Let me just set up one more thing before monetizing"
  Lesson: Revenue items go FIRST
Train 4: Human-blocked
  Problem: Plan required human actions (AdSense signup, domain transfers)
           that didn't happen, blocking the whole train
  Lesson: Human-gated items must complete BEFORE train launches
Train 5: Prerequisite discovery
  Problem: Plan assumed infra existed that didn't
           (keyword research assumed GatherFeed had keyword storage; it didn't)
  Lesson: Identify ALL prerequisites at planning time.
          If infra doesn't exist, that IS the session.
Train 6: Pure execution
  Result: 21 articles published in Session 1 alone
  Pattern: All infra in place, all human gates cleared,
           sessions are pure output
  Lesson: Front-load the boring work so trains are fast
Key insight: Trains get better. Train 1 produced $0 of value in 5 sessions. Train 6 produced 21 published articles in session 1. The system learns from itself because each planning session evaluates the previous train and adjusts.
Planning Session Structure
Step 0 - Evaluate previous train
  - Plan completion % (items done / planned)
  - Work/Arch ratio (target >80%)
  - Drift sessions (target: 0)
  - DRR delta
  - Issues net (created - closed)
  Write findings in the train file. Identify what caused drift.
Step 1 - Gather
  Read all #30 comments since last CLEARED.
  Group by theme.
Step 2 - Assess
  For each theme: real signal or noise?
  Changes sequence/strategy/groups?
Step 3 - Standards audit
  Spawn standards-auditor agent.
  Pick 1-2 gaps for next train.
Step 4 - Restructure
  Rewrite #13. Update tracking issues.
  Create/close issues. Revenue items first.
Step 5 - Launch next train
  Create train file with:
  - 5 planned dev sessions with revenue test justification
  - Architecture decisions made HERE (not deferred)
  - Measurable success criteria
Step 6 - Clear planning book
  Comment "CLEARED - processed N entries"
Tool and Agent Design
AI coding assistants have tools: file read/write, bash execution, web search, and increasingly, sub-agents. How you design and use these tools has a massive impact on context efficiency.
Sub-Agents: When and Why
Sub-agents are isolated AI instances that receive a focused task, execute it, and return a summary. The key benefit is context isolation: the sub-agent's working context (file reads, API responses, intermediate reasoning) stays in the sub-agent. Only the final summary returns to the main conversation.
interface SubAgentDesign {
  // When to use a sub-agent
  when: [
    "Expensive scanning: reads many files across many projects",
    "Repeatable process: same analysis runs regularly",
    "Bulk data processing: API responses too large for main context",
    "Specialized domain: needs a focused system prompt",
  ];
  // When NOT to use a sub-agent
  whenNot: [
    "A simple Grep/Glob answers the question",
    "The task is one-off and won't recur",
    "The output is small enough for the main conversation",
  ];
  // Design rules
  rules: [
    "Read-only by default. Agents audit and report; no code modification.",
    "Concise output. Keep reports under 100 lines.",
    "Use Sonnet unless the task requires Opus-level reasoning.",
    "Track in a registry. Every agent gets a row in the registry table.",
  ];
}
Agent Registry
Here is a real agent registry from a production system:
| Agent | Model | Trigger | Purpose |
|----------------------|--------|----------------------------|----------------------------------|
| standards-auditor | sonnet | Planning (mandatory) | Scan projects against standards |
| seo-keyword-analyzer | sonnet | On-demand | Process bulk keyword data |
| article-writer | opus | On-demand | Write comprehensive articles |
| vault-curator | sonnet | Planning, on-demand | Maintain vault health |
| competitor-analyzer | sonnet | Dev (content), monetization| Analyze competitor websites |
| ideas-pipeline-sync | sonnet | Architecture | Validate Ideas pipeline state |
Context Isolation Math
// Without sub-agent: everything in main context
async function auditWithoutAgent() {
  // Read 10 project configs (10 * 2K = 20K tokens)
  const configs = await readAllProjectConfigs();
  // Read 4 standards (4 * 3K = 12K tokens)
  const standards = await readAllStandards();
  // Process (reasoning: ~10K tokens)
  const report = analyzeCompliance(configs, standards);
  // Total added to main context: ~42K tokens
  // This stays in context for the rest of the session
  return report;
}
// With sub-agent: isolated context
async function auditWithAgent() {
  // Spawn sub-agent with focused system prompt
  const report = await runAgent("standards-auditor", {
    task: "Audit all active projects against current standards",
    // Sub-agent reads 42K tokens internally
    // Only the summary comes back
  });
  // Total added to main context: ~2K tokens (the summary)
  // 40K tokens of working context stayed in the sub-agent
  return report; // 95% context savings
}
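The arithmetic in those comments can be written out in Python. The token figures are the illustrative estimates from the example above, not measurements:

```python
# Estimated token costs from the audit example (illustrative figures).
CONFIG_TOKENS = 10 * 2_000     # 10 project configs at ~2K each
STANDARDS_TOKENS = 4 * 3_000   # 4 standards at ~3K each
REASONING_TOKENS = 10_000      # intermediate reasoning
SUMMARY_TOKENS = 2_000         # what the sub-agent returns


def savings_fraction(direct: int, via_agent: int) -> float:
    """Share of main-context tokens avoided by delegating to a sub-agent."""
    return (direct - via_agent) / direct


direct_cost = CONFIG_TOKENS + STANDARDS_TOKENS + REASONING_TOKENS
saved = savings_fraction(direct_cost, SUMMARY_TOKENS)
print(f"direct: {direct_cost:,} tokens, via agent: {SUMMARY_TOKENS:,} tokens")
print(f"main-context savings: {saved:.0%}")
```

Note that the sub-agent still spends the full amount internally; the savings apply only to what stays in the main conversation.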
Tool Selection Hierarchy
Not every task needs the most powerful tool. Use the lightest tool that gets the job done:
// From cheapest to most expensive context-wise:
type ToolHierarchy = [
  // 1. Direct tools (0 context overhead beyond the response)
  "Grep",          // Search file contents: returns only matches
  "Glob",          // Find files by pattern: returns only paths
  "Read",          // Read a specific file: returns content
  // 2. Shell commands (small overhead from command + output)
  "Bash",          // Run a command: returns stdout/stderr
  // 3. API calls (moderate overhead from request + response)
  "gh api",        // GitHub API: targeted queries
  "gh issue view", // Full issue view: can be expensive
  // 4. Sub-agents (large internal cost, small returned cost)
  "Agent",         // Spawns isolated instance: returns summary only
  // 5. Web tools (variable, often large)
  "WebSearch",     // Search results: moderate tokens
  "WebFetch",      // Full page content: can be very large
];
// Rule: Always try the cheapest tool first
// Don't use Agent for something Grep can solve
// Don't use WebFetch if you just need a URL
// Don't use gh issue view if you just need the body
Hooks for Context Preprocessing
Claude Code hooks can preprocess data before the AI sees it, dramatically reducing context consumption:
{
  "hooks": {
    "PreToolUse": [
      {
        "matcher": "Bash",
        "hooks": [
          {
            "type": "command",
            "command": "~/.claude/hooks/filter-test-output.sh"
          }
        ]
      }
    ]
  }
}
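#!/bin/bash
# PreToolUse hook: rewrite test commands so only failures reach the model
input=$(cat)
cmd=$(echo "$input" | jq -r '.tool_input.command')
# Keep the regex in a variable so the spaces in it are not word-split
test_re='^(npm test|pytest|go test)'
if [[ "$cmd" =~ $test_re ]]; then
  # Filter to show only failures: reduces 10K lines to ~50
  filtered_cmd="$cmd 2>&1 | grep -A 5 -E '(FAIL|ERROR|error:)' | head -100"
  # Build the JSON with jq so quotes inside the command are escaped correctly
  jq -n --arg cmd "$filtered_cmd" \
    '{hookSpecificOutput: {hookEventName: "PreToolUse", permissionDecision: "allow", updatedInput: {command: $cmd}}}'
else
  echo "{}"
fi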
This single hook can save 5-10K tokens per test run by filtering out passing tests and only showing failures.
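The same response can be built in Python, where `json.dumps` takes care of escaping any quotes inside the command. This is a sketch only: the response shape is copied from the shell hook above rather than checked against the hook documentation, and `hook_response` is an illustrative name:

```python
import json
import re

# Same command prefixes the shell hook matches.
TEST_COMMAND = re.compile(r"^(npm test|pytest|go test)")


def hook_response(command: str) -> str:
    """Return the JSON a PreToolUse hook would print for this Bash command."""
    if not TEST_COMMAND.match(command):
        return "{}"  # leave non-test commands untouched
    filtered = f"{command} 2>&1 | grep -A 5 -E '(FAIL|ERROR|error:)' | head -100"
    return json.dumps({
        "hookSpecificOutput": {
            "hookEventName": "PreToolUse",
            "permissionDecision": "allow",
            "updatedInput": {"command": filtered},
        }
    })


print(hook_response('pytest -k "slow"'))
print(hook_response("ls -la"))
```

A real hook script would read the tool call from stdin with `json.load(sys.stdin)` and pass `tool_input["command"]` to this function.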
Skills vs CLAUDE.md
Claude Code supports skills --- on-demand knowledge packs that load only when invoked, keeping them out of the always-loaded context:
// Anti-pattern: everything in CLAUDE.md (always loaded)
//   CLAUDE.md = 800 lines, 6K tokens loaded EVERY session
//   Including: PR review guidelines, database migration steps,
//   deployment checklist, testing standards...
//   Most sessions use 0-1 of these sections.
// Pattern: CLAUDE.md as index, skills as content
//   CLAUDE.md = 150 lines, 2K tokens (routing table)
//   Skills loaded on-demand:
//     "pr-review" skill - 2K tokens, loaded only during PR reviews
//     "db-migration" skill - 1.5K tokens, loaded only during migrations
//     "deploy" skill - 1K tokens, loaded only during deploys
//     "testing" skill - 1.5K tokens, loaded only during test work
// Savings: 4K tokens per session that doesn't need specialized knowledge
// That's 4K * 20 sessions/week = 80K tokens/week saved
Comparisons
AI Coding Assistants: Context Management Features
| Feature | Claude Code | Cursor | Copilot CLI | Windsurf | Aider | Continue.dev |
|---|---|---|---|---|---|---|
| Config file | CLAUDE.md | .cursor/rules/*.mdc | .github/copilot-instructions.md | Global + workspace rules | CONVENTIONS.md | config.yaml |
| Config location | Project root, ~/.claude/ | .cursor/ dir | .github/ dir | Settings + project | Project root, ~/.aider.conf.yml | ~/.continue/ |
| Hierarchical config | Yes (project + user + memory) | Yes (global + project rules) | Yes (org + repo) | Yes (global + workspace) | Yes (home + repo + cwd) | Yes (global + workspace) |
| Persistent memory | MEMORY.md (file-based, user-controlled) | Limited (chat history) | Copilot Spaces | Cascade Memories (auto-generated) | Chat history files | Custom context providers |
| Session lifecycle | Manual (OPEN/WORK/CLOSE via CLAUDE.md) | None (continuous chat) | None | None | None (stateless by design) | None |
| Context window | 200K tokens | ~128K (model-dependent) | Model-dependent | Model-dependent | Model-dependent | Model-dependent |
| Auto-compaction | Yes (at ~95%) | No (starts new chat) | No | No | No (repo map compression) | No |
| Sub-agents | Yes (isolated context) | No | Agent teams (preview) | No | No | No |
| Hooks/preprocessing | Yes (PreToolUse, PostToolUse) | No | No | No | No | Custom context providers |
| Cost tracking | /cost, /stats, ccusage | Limited | GitHub billing | Limited | Token counting per model | No |
| Concurrent sessions | Yes (with lane partitioning) | No (single workspace) | Possible (multiple terminals) | No | Possible (multiple terminals) | No |
| Model switching | /model mid-session | Per-chat model selection | Limited | Model selection | --model flag | Per-provider config |
Detailed Comparison: Context Persistence Approaches
| Approach | How It Works | Strengths | Weaknesses |
|---|---|---|---|
| Claude Code CLAUDE.md + MEMORY.md | Markdown files in project root. AI reads at session start, writes at session end. User controls structure and content. | Full control, version-controlled, structured sections, lane partitioning, works offline | Manual maintenance required, needs discipline to update, can grow stale |
| ChatGPT Custom Instructions + Memory | System-level instructions (1500 chars). Auto-extracted memories from conversations. | Zero effort, always active, works across all chats | Tiny instruction limit, opaque memory, no file-based persistence, no session structure |
| Cursor .cursor/rules | MDC files with frontmatter (description, globs, alwaysApply). Activated based on file patterns or always. | Pattern-based activation (load rules only when editing matching files), reduces noise | No memory system, no session lifecycle, no cross-session persistence beyond rules |
| Aider CONVENTIONS.md | Markdown file loaded as context. Repo map for compressed codebase overview. Model-agnostic. | Simple, model-agnostic, repo map is clever, works with any LLM | No session structure, no memory, no sub-agents, no compaction |
| Windsurf Rules + Memories | Global/workspace rules (saved prompts). Cascade memories (auto-generated during conversation). | Auto-generated memories, separates rules from memories | Opaque memory management, no session lifecycle, memories are per-project only |
| GitHub Copilot Spaces | Collaborative containers with repos, issues, docs, instructions. Scoped context across sessions. | Team-oriented, pulls in multiple repos, persistent across sessions | Not session-based, no lifecycle management, no token budgeting, no cost tracking |
| Continue.dev Context Providers | YAML config with custom context providers. HTTP, MCP, and built-in providers. @ mentions to load context. | Extensible, MCP integration, works with any model | Manual context selection, no auto-loading, no session lifecycle |
When to Use What
// Decision tree for choosing a context management approach
function chooseApproach(needs: ProjectNeeds): string {
  if (needs.multiSessionContinuity && needs.concurrentWorkstreams) {
    // You need MEMORY.md, lane partitioning, train system
    return "Claude Code with full session system";
  }
  if (needs.multiSessionContinuity && !needs.concurrentWorkstreams) {
    // MEMORY.md + single-lane OPEN/WORK/CLOSE
    return "Claude Code with simplified sessions";
  }
  if (needs.teamConsistency && !needs.sessionLifecycle) {
    // Rules/instructions for conventions, no session management
    return "Cursor rules or Copilot instructions";
  }
  if (needs.modelAgnostic) {
    // Works with any model, simple and portable
    return "Aider CONVENTIONS.md";
  }
  if (needs.autoMemory && !needs.control) {
    // Let the tool handle memory automatically
    return "Windsurf or ChatGPT memory";
  }
  return "Start with CLAUDE.md/rules, add layers as needed";
}
Anti-Patterns
| Don't | Do Instead | Why |
|---|---|---|
| Dump entire codebase into context | Use 3-layer architecture: index → playbook → on-demand | Context rot degrades model attention. 50K of well-curated tokens beats 200K of everything. |
| Start every session from scratch | Use MEMORY.md with OPEN sequence | Re-explaining your project costs 5-10K tokens per session. Memory costs 3K once. |
| Keep one long conversation going forever | OPEN/WORK/CLOSE with /clear between sessions | Long conversations compound input token costs. Message 20 costs 90K+ input tokens. |
| Put everything in CLAUDE.md | Split into CLAUDE.md (index) + playbooks (session-loaded) + skills (on-demand) | CLAUDE.md loads every session. 800 lines = 6K tokens taxed on every interaction. |
| Use gh issue view for large issues | Use lightweight fetcher (body only, last N comments, truncated) | A 50-comment issue can be 15K tokens. Last 5 comments is 2K. |
| Let both lanes edit the same files | Partition resources by lane with clear ownership rules | Concurrent edits create conflicts. Comments are safe; edits are dangerous. |
| Architecture rabbit holes during execution | Capture to planning book, continue with the goal | Side quests are the #1 killer of session productivity. Capture, don't execute. |
| Ignore weekly capacity budget | Run capacity check at OPEN and CLOSE | Burning 80% of your budget by Wednesday means Thursday-Sunday are throttled. |
| Use Opus for everything | Use Sonnet for most work, Opus for complex reasoning | Opus costs 5x more (output) and uses more extended thinking tokens. |
| Run sub-agents for simple tasks | Use Grep/Glob/Read directly | Sub-agents have overhead (spawn, system prompt, return). A simple Grep is instant. |
| Store session transcripts in memory | Store state and decisions, not history | "We discussed X" wastes tokens. "Decision: X, because Y" is actionable. |
| Treat AI subscription as free | Track API-equivalent cost, target ROI | $200/month subscription generating $10/day value = 1.5x ROI. Target 15-60x. |
| Plan trains with >5 sessions | Split into smaller trains with focused goals | Longer trains drift. 5-session trains force prioritization and completion. |
| Skip the CLOSE phase | Always run the full CLOSE checklist | Skipping CLOSE means the next session starts from scratch. 2 minutes of CLOSE saves 10 minutes of re-orientation. |
| Load both playbooks at OPEN | Load only your lane's playbook | Each playbook is 2-3K tokens. Loading both wastes 2-3K per session. |
| Use generic prompts ("make it better") | Specific prompts ("add input validation to auth.ts login function") | Vague prompts trigger broad scanning. Specific prompts trigger targeted work. |
The most important shift in thinking is not about any specific technique. It is about treating your AI subscription as an employee with an ROI target.
The ROI Framework
interface AIEmployeeMetrics {
  // Cost
  monthlyCost: number;          // Subscription fee ($20-$200/month)
  weeklyBudget: number;         // monthlyCost / 4.3
  // Value (API-equivalent)
  weeklyApiEquivalent: number;  // What tokens would cost at API rates
  roi: number;                  // weeklyApiEquivalent / weeklyBudget
  // Utilization
  sessionsPerWeek: number;      // How many sessions you ran
  zeroDays: number;             // Days with no AI usage (wasted capacity)
  capacityUtilization: number;  // % of weekly token budget used
  // Effectiveness
  shippedPerSession: number;    // Concrete deliverables per session
  driftRate: number;            // % of sessions that went off-plan
  revenueImpact: number;        // DRR delta attributable to AI work
}
Real Numbers
Weekly report for a Max subscriber ($200/month):
Subscription cost: $46/week
API-equivalent value: $400-700/week (typical)
ROI: 8-15x
Sessions: 20-30/week
Zero days: 0-1 (target: 0)
Capacity utilization: 85-95%
Shipped per session: 1.2 concrete deliverables
Drift rate: <10%
Comparison:
- A junior developer costs $1200-2000/week
- AI at $46/week doing 50% of that work = massive ROI
- But ONLY if you manage context properly
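The ROI figure is just API-equivalent value divided by the weekly share of the subscription. A short Python sketch using the 4.3 weeks-per-month divisor from the framework above; `weekly_roi` is an illustrative name:

```python
WEEKS_PER_MONTH = 4.3  # same divisor as weeklyBudget in the ROI framework


def weekly_roi(monthly_cost: float, weekly_api_equivalent: float) -> float:
    """API-equivalent value per dollar of subscription, on a weekly basis."""
    weekly_budget = monthly_cost / WEEKS_PER_MONTH
    return weekly_api_equivalent / weekly_budget


for value in (400, 700):
    print(f"${value}/week of usage on a $200/month plan: {weekly_roi(200, value):.1f}x")
```

At $400 to $700 of weekly API-equivalent usage this lands between roughly 8.6x and 15x, which is where the 8-15x range comes from.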
The Revenue Test
Every session should pass the revenue test: "Does this make money?"
type RevenueTest = {
  direct: "This session directly produces revenue (publish articles, deploy ads)";
  enabling: "This session enables revenue in the next 1-2 sessions (build pipeline, fix bug)";
  infrastructure: "This session builds infrastructure that reduces future cost per session";
  none: "This session produces no revenue impact. WHY are we doing it?";
};
// Train 1-3 pattern (bad): 15 sessions of infrastructure, $0 revenue
// Train 6 pattern (good): 21 articles published in session 1
// The fix: revenue-closing items go FIRST in every plan
// Infrastructure is a means, not an end
Subscription Selection
function chooseSubscription(usage: UsagePattern): string {
  const weeklyOutputTokens = usage.estimateWeeklyOutputTokens();
  const sessionsPerWeek = usage.sessionsPerWeek;
  // For teams: checked first, otherwise one of the individual tiers always matches
  if (usage.isTeam) {
    // Average $100-200/developer/month with Sonnet
    // Set per-user TPM limits based on team size
    return "Claude API with team billing";
  }
  if (sessionsPerWeek < 5 && weeklyOutputTokens < 500_000) {
    // Light usage, occasional sessions
    return "Claude Pro ($20/month)";
  }
  if (sessionsPerWeek < 15 && weeklyOutputTokens < 2_000_000) {
    // Regular daily usage, 2-3 sessions/day
    return "Claude Max 5x ($100/month)";
  }
  // Heavy usage (15+ sessions/week or 2M+ output tokens), 3-5+ sessions/day
  // ROI: if each session produces $20+ in value, 15+ sessions/week = $300+/week
  // $200/month = $46/week → 6.5x ROI at minimum
  return "Claude Max 20x ($200/month)";
}
The Zero-Day Problem
A zero day is a day where your AI subscription sits unused.
On a $200/month Max plan:
Daily cost: $6.60
A zero day = $6.60 wasted
If you have 2 zero days per week:
Weekly waste: $13.20
Monthly waste: $57.20
Annual waste: $686
That's enough for 3.4 months of the subscription.
Fix: capacity pacing. Check your pace at OPEN.
If behind, push more sessions. If ahead, lighter sessions.
Target: 0 zero days per week.
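The waste figures above follow from simple multiplication. A Python sketch that reproduces them, taking the rounded $6.60 daily cost as given; `zero_day_waste` is an illustrative name:

```python
def zero_day_waste(daily_cost: float, zero_days_per_week: int, monthly_cost: float) -> dict:
    """Project the cost of unused days over a week, a month, and a year."""
    weekly = daily_cost * zero_days_per_week
    annual = weekly * 52
    return {
        "weekly": weekly,
        "monthly": annual / 12,
        "annual": annual,
        "months_of_subscription": annual / monthly_cost,
    }


waste = zero_day_waste(daily_cost=6.60, zero_days_per_week=2, monthly_cost=200)
for key, value in waste.items():
    print(f"{key}: {value:.2f}")
```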
Stacking AI Tools
The meta-game includes stacking multiple AI tools for different purposes:
interface AIToolStack {
// Primary: Claude Code for coding sessions
claudeCode: {
cost: "$200/month";
use: "Dev, planning, ops, architecture sessions";
contextSystem: "Full OPEN/WORK/CLOSE with MEMORY.md";
};
// Secondary: ChatGPT for quick questions, brainstorming
chatGPT: {
cost: "$20/month";
use: "Quick questions, brainstorming, research";
contextSystem: "Custom instructions + auto memory";
};
// Tertiary: Perplexity for search-heavy research
perplexity: {
cost: "$20/month";
use: "Research, fact-checking, competitive analysis";
contextSystem: "None (stateless search)";
};
// Total: $240/month for a full AI productivity stack
// ROI target: 10x minimum = $2400/month in value
// Realistic: $5000-10000/month for a skilled practitioner
}
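The stack total and ROI target can be derived from the per-tool prices instead of being maintained by hand. A minimal sketch, assuming the prices stay in the "$N/month" string form used above; `stackCosts` and `parseMonthlyCost` are hypothetical names introduced for illustration.

```typescript
// Monthly prices copied from the stack above.
const stackCosts: Record<string, string> = {
  claudeCode: "$200/month",
  chatGPT: "$20/month",
  perplexity: "$20/month",
};

// Extracts the dollar amount from a "$N/month" string.
function parseMonthlyCost(cost: string): number {
  const match = /^\$(\d+(?:\.\d+)?)\/month$/.exec(cost);
  if (match === null) {
    throw new Error(`Unrecognized cost format: ${cost}`);
  }
  return Number(match[1]);
}

const totalMonthly = Object.values(stackCosts)
  .map(parseMonthlyCost)
  .reduce((sum, cost) => sum + cost, 0);

const ROI_TARGET_MULTIPLE = 10; // the 10x minimum stated above
const targetMonthlyValue = totalMonthly * ROI_TARGET_MULTIPLE;

console.log(`Stack: $${totalMonthly}/month, target value: $${targetMonthlyValue}/month`);
```

Adding or dropping a tool then updates the target automatically, which keeps the ROI review honest.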
Here is what a fully optimized AI working day looks like:
07:00 bash capacity-check.sh open
→ 🟢 ON PACE: 42% used, day 4/7, 320K tokens/day budget
07:05 🟢 OPEN [D] - Train 6, Session 2: publish 20 articles on ugc.marketing
(reads MEMORY.md: 3.5K tokens)
(reads session-context.sh output: 2.5K tokens)
(reads CLAUDE-main.md playbook: 2K tokens)
OPEN cost: 12K tokens total
07:10 WORK - Generates articles, publishes via pipeline
(reads template files, competitor analysis, keyword data on demand)
09:30 🔴 CLOSE [D] - 22 articles published, 3 competitor archetypes added
(updates MEMORY.md, logs cost, checks capacity)
API-equivalent: ~$35
09:35 🟢 OPEN [O] - Ops: build analytics verification tool
(reads MEMORY.md, #49 ops tracking)
OPEN cost: 8K tokens
10:00 WORK - Builds analytics verification script
10:45 🔴 CLOSE [O] - Analytics tool built, 3 brands verified
API-equivalent: ~$7
10:50 🟢 OPEN [D] - Train 6, Session 3: publish 15 articles on vibemarketing.pro
(note: different domain, same workflow)
OPEN cost: 12K tokens
12:30 🔴 CLOSE [D] - 15 articles published
API-equivalent: ~$25
Day total: ~$67 API-equivalent, 3 sessions
Subscription cost: ~$6.60/day
ROI: 10.2x
bash capacity-check.sh close
→ 🟢 ON PACE: 56% used, day 4/7, good
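The day's ROI line is just the summed API-equivalent cost divided by the daily subscription cost. A short sketch of that calculation, assuming a hypothetical `SessionLog` record and `dayRoi` helper; the dollar figures are the ones logged above.

```typescript
interface SessionLog {
  kind: "D" | "O"; // the [D] and [O] session tags from the log above
  apiEquivalentUsd: number;
}

// Ratio of API-equivalent value produced to what the subscription cost for the day.
function dayRoi(sessions: SessionLog[], dailySubscriptionUsd: number): number {
  const total = sessions.reduce((sum, s) => sum + s.apiEquivalentUsd, 0);
  return total / dailySubscriptionUsd;
}

const today: SessionLog[] = [
  { kind: "D", apiEquivalentUsd: 35 },
  { kind: "O", apiEquivalentUsd: 7 },
  { kind: "D", apiEquivalentUsd: 25 },
];

const roi = dayRoi(today, 6.6);
console.log(`ROI: ${roi.toFixed(1)}x`);
```

For the three sessions above this gives 67 / 6.60, which rounds to the 10.2x shown in the log.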
Implementation Checklist
If you want to implement this system, here is the minimal viable version:
Week 1: Foundation
- Create CLAUDE.md with project structure and session types
- Create MEMORY.md with initial project state
- Define your OPEN/WORK/CLOSE checklist
- Track session costs manually for one week
Week 2: Optimization
- Split CLAUDE.md into index + playbooks
- Build a lightweight context fetcher for your knowledge base
- Start capacity pacing (check pace at OPEN and CLOSE)
- Identify your top 3 token-wasting patterns
Week 3: Automation
- Build session cost tracking script
- Build capacity check script
- Create your first sub-agent for a repeatable task
- Launch your first train (planning + 5 dev sessions)
Week 4: Scale
- Add two-lane concurrency if running enough sessions
- Build hooks for context preprocessing
- Evaluate train 1, adjust the system, launch train 2
- Review ROI --- are you getting 10x+ from your subscription?
The Minimal Session System
If you do nothing else from this guide, implement this:
[One paragraph description]
1. At OPEN: read MEMORY.md, pick a goal
2. At WORK: execute the goal, don't wander
3. At CLOSE: update MEMORY.md, say "safe to /clear"
./MEMORY.md: read it first, update it last
[What's done, what's in progress, what's next]
[Choices made and why, so you don't re-debate them]
[What's stuck and what's needed to unstick it]
Last updated: [date] by [session type]
That is 20 lines of configuration. It takes 5 minutes to set up. And it will immediately improve every AI session you run because the AI starts with context instead of starting from zero.
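If you would rather generate the memory file at CLOSE than edit it by hand, the template above fits in one function. This is a sketch under assumptions: the section titles ("Current State", "Decisions", "Blockers"), the `MemoryState` shape, and `renderMemory` are hypothetical names; the guide only specifies what each section should contain.

```typescript
interface MemoryState {
  status: string; // what's done, what's in progress, what's next
  decisions: string; // choices made and why
  blockers: string; // what's stuck and what's needed to unstick it
  updatedOn: string; // date of the closing session, e.g. "2026-01-15"
  sessionType: string; // e.g. "dev" or "ops"
}

// Renders the plain-text body of MEMORY.md. Section titles are assumed names.
function renderMemory(state: MemoryState): string {
  return [
    "Current State",
    state.status,
    "",
    "Decisions",
    state.decisions,
    "",
    "Blockers",
    state.blockers,
    "",
    `Last updated: ${state.updatedOn} by ${state.sessionType}`,
  ].join("\n");
}

const memory = renderMemory({
  status: "Pipeline live; 22 articles published; next: 15 more",
  decisions: "Revenue-closing items go first in every plan",
  blockers: "None",
  updatedOn: "2026-01-15",
  sessionType: "dev",
});
console.log(memory);
```

Writing the returned string to `./MEMORY.md` as the last step of CLOSE keeps the "update it last" rule mechanical instead of relying on memory.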
Official Documentation
- Claude Code Best Practices --- Anthropic's official guide to effective Claude Code usage, including CLAUDE.md patterns
- Manage Costs Effectively - Claude Code Docs --- Token tracking, cost management, compaction, hooks, and model selection strategies
- Claude API Pricing --- Current per-token pricing for all Claude models including prompt caching and extended context
- Using Claude Code with Pro or Max Plan --- Subscription plan details, usage patterns, and rate limits
- Cursor Rules for AI --- Official Cursor documentation on .cursor/rules configuration
- Continue.dev Context Providers --- Custom context provider configuration for Continue.dev
- Windsurf Cascade Memories --- Windsurf's persistent memory system documentation
- GitHub Copilot CLI - Enhanced Agents and Context Management --- GitHub's approach to persistent context across sessions
Research and Analysis
- Effective Context Engineering for AI Agents - Anthropic Engineering --- Anthropic's comprehensive guide to context engineering, including compression strategies, sub-agent architectures, and the concept of "context rot"
- Effective Harnesses for Long-Running Agents - Anthropic Engineering --- Managing multi-context-window agents over extended tasks
- Context Engineering for Coding Agents - Martin Fowler --- In-depth analysis of context engineering principles applied to coding agents
- Context Engineering vs Prompt Engineering - Neo4j --- Why teams are moving from prompt engineering to context engineering for agentic AI
- The Evolution of Prompt Engineering to Context Design - SDG Group --- The shift from prompt engineering to context engineering in 2025-2026
Practitioner Guides
- CLAUDE.md Best Practices - UX Planet --- 10 sections to include in your CLAUDE.md for maximum effectiveness
- How I Use Every Claude Code Feature - Shrivu Shankar --- Comprehensive walkthrough of Claude Code capabilities including context management
- 7 Claude Code Best Practices for 2026 - Eesel.ai --- Practical tips from real project experience
- Claude Code Context Management - SFEIR Institute --- Optimization guide for Claude Code context management including compaction strategies
- Claude Code Compaction - Steve Kinney --- Deep dive into compaction mechanics and how to preserve critical context
- 7 Ways to Cut Claude Code Token Usage - DEV Community --- Practical token reduction techniques
- Claude Code Hooks Mastery - GitHub --- Reference implementation for hooks including PreToolUse and PreCompact patterns
Tool-Specific References
- Mastering Context Management in Cursor - Steve Kinney --- Context management strategies specific to Cursor editor
- Mastering Cursor Rules - cursorrules.org --- Ultimate guide to Cursor's rule system
- CONVENTIONS.md Guide for Aider - Claude MD Editor --- How to configure Aider's model-agnostic conventions
- Windsurf Rules and Workflows - Paul Duvall --- Best practices for Windsurf's rule and memory system
- Context Management Strategies for Windsurf - Iceberg Lakehouse --- Complete guide to Windsurf's context management
Configuration and Templates
- Opinionated Defaults for Claude Code - Trail of Bits --- Production-quality Claude Code configuration from a security-focused engineering team
- Claude Code Best Practice - shanraisshan --- Community-curated Claude Code best practices repository
- Awesome Cursorrules - PatrickJS --- Community collection of .cursorrules configurations for different project types
- AGENTS.md Specification --- The open, tool-agnostic specification for AI context files (Linux Foundation project)
Cost and Capacity
- Claude Code Token Limits - Faros AI --- Comprehensive guide to Claude Code token limits by plan
- Claude Code Limits Explained - TrueFoundry --- Quotas, rate limits, and strategies for working within them
- Claude Pro and Max Weekly Rate Limits Guide - Hypereal --- Detailed breakdown of weekly rate limits and pacing strategies
- Extra Usage for Paid Claude Plans - Claude Help Center --- How to purchase additional capacity when hitting limits
This article is based on running 50+ AI coding sessions per week across 10+ concurrent projects, tracking every token, and iterating on the system through 6 trains of development. The patterns are tool-agnostic but use Claude Code as the primary example because that is where the deepest experience lies. Every number cited is from real session cost logs, not estimates.