Context Engineering > Prompt Engineering: What 20+ Sources Agree On in 2026

A research synthesis on agent orchestration and context engineering best practices from Google, OpenAI, GitHub, and independent production teams. Includes a concrete step-by-step guide to implementing these patterns in a VS Code workspace.

Context Engineering > Prompt Engineering

Gartner said it. Andrej Karpathy said it. Google’s ADK team built their entire multi-agent framework around it. And after synthesizing 20+ sources across Google, OpenAI, the VS Code Copilot ecosystem, and independent production teams, I’m convinced:

Context engineering has replaced prompt engineering as the primary discipline for building reliable AI agents.

This isn’t a branding exercise. It reflects a real shift in where the hard problems live. Writing a good prompt is necessary but insufficient. The real challenge is what information reaches the model, when, and in what form—and that’s a systems engineering problem, not a copywriting one.

“Context engineering—the delicate art and science of filling the context window with just the right information for the next step.” — Andrej Karpathy

The Two Context Bloat Problems

Before getting into solutions, it helps to understand what we’re solving. Agenteer’s research from January 2026 identified two distinct failure modes:

  1. Startup bloat: Tool definitions consume context before any conversation begins. Connect enough MCP servers and two-thirds of a 200K token window can be consumed by tool definitions alone—before a single user message.

  2. Runtime bloat: Intermediate tool results accumulate during execution. Every file read, search result, and API response stays in the conversation, crowding out the information that actually matters for the current step.

Each requires a different solution. Startup bloat needs lazy/deferred tool loading. Runtime bloat needs compaction and summarization. Throwing a bigger context window at the problem doesn’t work—quality degrades long before the window fills.
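To make the startup-bloat fix concrete, here is a minimal sketch of deferred tool loading in TypeScript. All names are illustrative (this is not a real MCP API): only a one-line summary per tool enters the startup context, and the full schema is loaded when the agent actually reaches for the tool.

```typescript
// Sketch of lazy/deferred tool loading. Illustrative names, not a real
// MCP client API: the point is that full tool definitions stay out of
// the context window until a tool is actually requested.

interface ToolSummary { name: string; summary: string }
interface ToolDefinition extends ToolSummary { schema: Record<string, unknown> }

class LazyToolRegistry {
  private loaders = new Map<string, () => ToolDefinition>();
  private summaries: ToolSummary[] = [];

  register(summary: ToolSummary, load: () => ToolDefinition): void {
    this.summaries.push(summary);
    this.loaders.set(summary.name, load);
  }

  // What the model sees at startup: a few tokens per tool.
  startupContext(): string {
    return this.summaries.map(t => `${t.name}: ${t.summary}`).join("\n");
  }

  // The full definition is resolved only when the agent asks for it.
  resolve(name: string): ToolDefinition {
    const load = this.loaders.get(name);
    if (!load) throw new Error(`unknown tool: ${name}`);
    return load();
  }
}
```

With fifty connected tools, the startup cost is fifty one-liners rather than fifty JSON schemas.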

The “God Agent” Anti-Pattern

The most common mistake I see teams make is building what I call the “God Agent”: one powerful model with a massive tool set and a do-everything system prompt. It feels productive at first. It falls apart in production for three reasons:

  1. Untestability—you can’t meaningfully test a system that does everything
  2. Context pollution—irrelevant tools and instructions crowd out the actual task
  3. Unpredictable routing—the model picks wrong tools when too many are available

The fix is decomposition. Which brings us to the first universal principle.

Universal Principle: The Orchestrator Must Never Execute

Mikhail Rogov ran 40+ production workflows across 3 providers and 6 agent topologies and arrived at a crisp rule: the orchestrator decomposes, delegates, validates, and escalates—it never writes code, runs tests, or modifies files.

This boundary held across every provider, every topology, and every failure mode. Systems that violate it collapse under their own context weight.

Think of it like a project manager who also tries to write all the code. They might be a great coder, but the moment they’re deep in an implementation, they lose track of the bigger picture. Agents have the same problem—except their “bigger picture” is literally their context window.

Google’s Tiered Context Architecture

Google’s ADK team published the most rigorous framework for thinking about context in multi-agent systems. Their design thesis: context is a compiled view over a richer stateful system, not a mutable string buffer.

They define four tiers:

| Tier | Purpose | Lifetime |
|------|---------|----------|
| Working Context | The immediate prompt for this model call | Ephemeral—recomputed each invocation |
| Session | Durable log of interaction events | Structured, model-agnostic |
| Memory | Long-lived, searchable knowledge | Outlives sessions |
| Artifacts | Large binary/text data addressed by name + version | Never pasted into prompt |

The key principles that emerge:

  • Separate storage from presentation—evolve schemas and prompt formats independently
  • Explicit transformations—context is built through named, ordered processors (a “compiler pipeline”), not ad-hoc string concatenation
  • Scope by default—every model call sees minimum context required; agents must “reach” for more via tools

When context grows too large, the ADK uses LLM-driven summarization over a sliding window. The summary is written back as a new event; raw events are pruned. This is compaction, not truncation—the meaning is preserved while the token count drops.
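The compaction step can be sketched in a few lines. This is not the ADK implementation, just the shape of the idea: when the event log exceeds a budget, the oldest window is summarized (in practice by an LLM call, here an injected function) and replaced by a single summary event, while recent events stay verbatim.

```typescript
// Sliding-window compaction sketch. Names are illustrative, not the ADK
// API. The summarizer would be an LLM call in a real system.

interface AgentEvent { role: string; text: string }
type Summarizer = (events: AgentEvent[]) => string;

function compact(
  events: AgentEvent[],
  maxEvents: number,
  keepRecent: number,
  summarize: Summarizer
): AgentEvent[] {
  if (events.length <= maxEvents) return events;
  const cut = events.length - keepRecent;
  const summary = summarize(events.slice(0, cut));
  // Compaction, not truncation: the pruned events survive as one
  // summary event, so meaning is preserved while token count drops.
  return [
    { role: "system", text: `Summary of earlier turns: ${summary}` },
    ...events.slice(cut),
  ];
}
```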

Multi-Agent Context Scoping

For multi-agent systems, Google offers two patterns:

  1. Agents as Tools—the root agent calls a specialized agent as a function. The callee sees only specific instructions and necessary artifacts, no conversation history.
  2. Agent Transfer (Hierarchy)—control is handed off, and the sub-agent inherits a scoped view that’s configurable: full, none, or custom.

Critically, when transferring between agents, ADK translates the conversation—prior assistant messages are re-cast as narrative context so the new agent doesn’t hallucinate that it performed those actions.
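The translation step is simple to sketch. The following is an assumption about the shape of that transformation, not ADK's actual code: prior assistant turns are re-cast in the third person so the receiving agent reads them as history, not as its own output.

```typescript
// Handoff translation sketch (illustrative, not the ADK implementation).
// Assistant turns become third-person narrative so the receiving agent
// doesn't believe it performed those actions itself.

interface Turn { role: "user" | "assistant"; text: string }

function toNarrative(history: Turn[], priorAgent: string): string {
  return history
    .map(t =>
      t.role === "assistant"
        ? `${priorAgent} responded: ${t.text}`
        : `The user said: ${t.text}`
    )
    .join("\n");
}
```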

OpenAI’s Harness Engineering: A Million Lines, Zero Manual Code

Ryan Lopopolo at OpenAI shared one of the most striking production case studies I’ve seen. His team built a product with zero manually-written lines over 5 months—roughly 1 million lines of code across 1,500 PRs, all agent-generated.

The key insights aren’t about model capabilities. They’re about environment design:

AGENTS.md as Table of Contents, Not Encyclopedia

Keep it to ~100 lines. It should be a map with pointers to deeper docs, not a monolithic instruction dump:

AGENTS.md (~100 lines)
  └─→ points to docs/ directory
       ├── architecture.md
       ├── patterns.md
       ├── conventions.md
       └── domain-specific/*.md

The agent reads AGENTS.md first and navigates to relevant docs only when needed. This is progressive disclosure—the agent reaches for detail rather than being flooded with it.

“When everything is marked important, nothing is.”

Mechanical Invariant Enforcement

Documentation alone isn’t enough. Custom linters and structural tests beat written conventions because:

  • Lint error messages inject remediation instructions directly into agent context
  • Every violation becomes a learning opportunity
  • With agents, strict linting is a multiplier—it applies everywhere at once
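A minimal sketch of the first bullet, with an invented rule name and message: the lint finding carries its own remediation text, so every violation feeds fix instructions straight back into the agent's context.

```typescript
// Sketch of a lint rule whose error message carries its remediation.
// The rule name and messages are illustrative, not a real ESLint rule.

interface LintFinding {
  rule: string;
  line: number;
  message: string;
  remediation: string; // injected into agent context on violation
}

function checkNoStringThrow(source: string): LintFinding[] {
  const findings: LintFinding[] = [];
  source.split("\n").forEach((line, i) => {
    if (/throw\s+["'`]/.test(line)) {
      findings.push({
        rule: "no-string-throw",
        line: i + 1,
        message: "throwing a raw string loses the stack trace",
        remediation:
          "wrap the message in a custom Error subclass, e.g. throw new ValidationError(...)",
      });
    }
  });
  return findings;
}
```

The agent that triggered the violation now has the fix in its context, which is exactly the "every violation becomes a learning opportunity" effect.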

The Repository as System of Record

All team knowledge must live in versioned, co-located artifacts in the repo. Slack discussions, Google Docs, and tacit knowledge are all invisible to agents.

If it isn’t discoverable in the repo, it effectively doesn’t exist.

When Agents Struggle, Don’t Try Harder—Fix the Environment

This is the most counterintuitive insight. When an agent fails at a task, the instinct is to write a better prompt or try a different model. The Harness team’s approach: diagnose what’s missing—tools, guardrails, abstractions, or documentation—and have the agent itself build the missing capability into the repo.

Each encoded capability compounds. The repo gets smarter over time.

VS Code Copilot: The Six-Layer Stack

The VS Code ecosystem has converged on a practical implementation of these principles through a six-layer context stack:

┌─────────────────────────────────────────────────┐
│ Custom Agents (.agent.md)         ← Personas    │
├─────────────────────────────────────────────────┤
│ Always-On Instructions            ← Identity    │
│   ├── copilot-instructions.md                   │
│   └── AGENTS.md                                 │
├─────────────────────────────────────────────────┤
│ File-Based Instructions (.instructions.md)      │
├─────────────────────────────────────────────────┤
│ Agent Skills (SKILL.md)           ← On-demand   │
├─────────────────────────────────────────────────┤
│ MCP Servers (mcp.json)            ← External    │
├─────────────────────────────────────────────────┤
│ Prompt Files (.prompt.md)         ← Workflows   │
└─────────────────────────────────────────────────┘

Each layer serves a different purpose in the progressive disclosure hierarchy:

  • Always-on instructions load on every interaction—keep them under 100 lines
  • File-based instructions scope with applyTo patterns—language-specific or folder-specific
  • Skills load only when their description matches the current task—this is the key to avoiding context bloat for heavy content
  • MCP Servers provide external capabilities but consume startup context
  • Prompt files are reusable workflow templates invoked explicitly

The official VS Code guidance reinforces the same themes: limit enabled tools, scope instructions tightly, keep instruction files concise, and focus on information the AI can't infer from code.

10 Rules for Eliminating Context Bloat

Synthesizing across all sources, here are the rules that every team building with agents should follow:

  1. Always-on files must be tiny—copilot-instructions.md ≤100 lines, AGENTS.md ≤100 lines. These load on every interaction.

  2. Use progressive disclosure—entry points route to detail. Agents reach for context; they aren’t flooded with it.

  3. The orchestrator never executes—planning and routing are separate from implementation.

  4. Separate agents by concern—each agent gets its own tools, instructions, and minimal context.

  5. Skills over instructions for heavy content—anything over ~50 lines should be a skill that loads on demand, not an always-on instruction.

  6. Limit active tools—every unused tool definition wastes context tokens. Group and restrict with tool sets.

  7. Repository = system of record—all conventions, decisions, and domain knowledge in versioned files.

  8. Enforce invariants mechanically—linters and structural tests beat documentation.

  9. Automate context hygiene—compaction for long sessions, garbage collection for stale patterns.

  10. When agents struggle, fix the environment—encode missing capabilities as skills, instructions, or tools. Don’t try harder; make the repo smarter.

Self-Improving Agents: The Lesson Loop

Most AI agents are stateless. Each session starts fresh, the same errors get repeated, and engineers correct the same mistakes over and over. The self-improving agent pattern—popularized by OpenClaw’s community (108K+ downloads for the self-improvement skill alone)—breaks this cycle.

The core loop:

  1. Reflection: Agent examines its actions, reasoning, and outputs
  2. Evaluation: Objective assessment against success criteria
  3. Correction: Generate specific fixes—prompt updates, rule changes, new skills
  4. Execution: Apply corrections, verify improvement
  5. Promotion: Durable learnings move to permanent memory/instructions

In practice, this means maintaining a .learnings/ directory:

| Situation | Action |
|-----------|--------|
| Command/operation fails | Log to .learnings/ERRORS.md |
| User corrects you | Log to .learnings/LEARNINGS.md (category: correction) |
| Knowledge was outdated | Log to .learnings/LEARNINGS.md (category: knowledge_gap) |
| User wants missing feature | Log to .learnings/FEATURE_REQUESTS.md |

Learnings accumulate, get reviewed, and the best ones get promoted to permanent instructions in AGENTS.md or copilot-instructions.md.
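The routing side of this loop is mechanical enough to sketch. The file names come from the article; the situation labels and row format are assumptions for illustration.

```typescript
// Sketch of learning-loop routing. File names are from the article's
// .learnings/ convention; situation labels and row format are assumed.

type Situation = "command_failed" | "correction" | "knowledge_gap" | "feature_request";

function learningFile(situation: Situation): string {
  switch (situation) {
    case "command_failed":
      return ".learnings/ERRORS.md";
    case "feature_request":
      return ".learnings/FEATURE_REQUESTS.md";
    default:
      return ".learnings/LEARNINGS.md";
  }
}

// One markdown table row per incident, appended to the chosen file.
function learningRow(date: string, category: string, learning: string, status = "new"): string {
  return `| ${date} | ${category} | ${learning} | ${status} |`;
}
```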

But there’s a critical caveat: self-improvement only works reliably where outcomes are verifiable. Three independent research teams converged on this in 2025–2026. Code compilation, test results, API responses—these have objective pass/fail signals. For subjective tasks like writing quality or design choices, self-improvement loops risk reward hacking—the agent optimizes for the metric rather than the actual goal. Use human-in-the-loop checkpoints for subjective improvements.

The VIGIL framework (arXiv, December 2025) takes this further with a supervisor agent that monitors a sibling agent’s behavioral logs, maintains a persistent “emotional bank” tracking behavioral health with decay functions, and generates guarded prompt updates. VIGIL achieved a 24%+ improvement in task success rates through self-healing—without human intervention.

Agent Handoff Protocols

Once you decompose the God Agent into specialists, you need reliable handoff patterns. The production-proven approach uses a task lifecycle with explicit stages:

Task Lifecycle: inbox → spec → build → review → done

Roles:
  Orchestrator — Route tasks, track state, report results
  Builder      — Execute work, produce artifacts
  Reviewer     — Quality gates, feedback loops

Key rules for reliable handoffs:

  • Spawn sub-agents with scoped context, not full parent context
  • File-based communication—pass artifacts through workspace files, not chat message history
  • Quality gates between stages—a reviewer can reject work, looping it back to the builder
  • Depth limits: workers don’t spawn more workers (maxSpawnDepth: 2)
  • Concurrency caps: prevent runaway fan-out (maxChildrenPerAgent: 5)
  • Model tiering: cheaper models for sub-agents, powerful models for orchestration and synthesis
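The depth and concurrency rules above can be sketched as a small guard. The config keys (maxSpawnDepth, maxChildrenPerAgent) are from the article; the surrounding orchestrator API is an assumption.

```typescript
// Spawn-guard sketch for agent handoffs. Config keys mirror the article
// (maxSpawnDepth, maxChildrenPerAgent); the rest is illustrative.

interface SpawnPolicy {
  maxSpawnDepth: number;       // workers don't spawn more workers
  maxChildrenPerAgent: number; // prevent runaway fan-out
}

class SpawnGuard {
  private children = new Map<string, number>();

  constructor(private policy: SpawnPolicy) {}

  canSpawn(parentId: string, parentDepth: number): boolean {
    const count = this.children.get(parentId) ?? 0;
    return (
      parentDepth < this.policy.maxSpawnDepth &&
      count < this.policy.maxChildrenPerAgent
    );
  }

  recordSpawn(parentId: string): void {
    this.children.set(parentId, (this.children.get(parentId) ?? 0) + 1);
  }
}
```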

The framework you choose should match your workflow shape:

  • Looks like a flowchart with loops → LangGraph (native cycle support, typed state, best observability via LangSmith)
  • Looks like a conversation thread → AutoGen (low learning curve for chat patterns)
  • Looks like a job description board → CrewAI (lowest barrier, best for linear workflows)

One practitioner’s hard-won insight: “I spent more time debugging agent pipelines than building them. That ratio is embarrassing but true, and it makes observability the thing I’d weight most heavily before committing to a framework.”

MCP vs A2A: The New Infrastructure Layer

Two protocols are becoming infrastructure-level standards, and understanding their complementary roles matters:

| Protocol | Purpose | Analogy |
|----------|---------|---------|
| MCP (Model Context Protocol) | Agent → Tool communication | USB-C for AI |
| A2A (Agent-to-Agent) | Agent → Agent communication | HTTP for AI agents |

MCP standardizes how an agent talks to tools—databases, APIs, file systems. It handles tool discovery and invocation.

A2A standardizes how agents talk to each other—agent discovery (via Agent Cards), task delegation, streaming updates, and multi-turn collaboration.

In production, they layer together:

User Request

Orchestrator Agent (A2A)
    ├── Research Agent (A2A handoff)
    │   └── Web Search Tool (MCP)
    │   └── Database Tool (MCP)
    ├── Analysis Agent (A2A handoff)
    │   └── Code Execution Tool (MCP)
    └── Writer Agent (A2A handoff)
        └── File System Tool (MCP)

In December 2025, the Linux Foundation created the Agentic AI Foundation (AAIF), bringing Anthropic’s MCP, Block’s Goose framework, and OpenAI’s AGENTS.md under shared, vendor-neutral governance. These protocols are becoming infrastructure—not product differentiators.

Token Budget Management

Context engineering isn’t just architecture—it’s accounting. Every skill, every tool, every workspace file adds to the per-turn token cost. Here are production-proven techniques with their measured savings:

| Strategy | Token Savings |
|----------|---------------|
| Session compaction (summarize long conversations) | 40–60% |
| Model tiering (cheaper models for sub-agents) | 50–70% cost reduction |
| Prompt caching (stable prefixes reused) | 30–50% of reads |
| Trim large tool outputs early in pipeline | 20–30% |
| Reduce image dimensions for vision tasks | 15–25% |
| Concise skill descriptions (~97 chars/skill) | ~170 tokens per 5 skills |

The system prompt gets rebuilt each run—tool lists, skill descriptions, workspace file contents, runtime metadata. That means pruning is not optional. Every tool you don’t need, every skill with a bloated description, every always-on instruction file costs you on every single turn.
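The accounting itself can be sketched in a few lines. This assumes a crude chars/4 token heuristic; real systems use the provider's tokenizer. Required parts (identity instructions) always go in; optional parts (skills, reference docs) are admitted only while the budget holds.

```typescript
// Per-turn context budget sketch. The chars/4 estimate is a stand-in
// for a real tokenizer; part names and the greedy policy are assumed.

interface ContextPart {
  name: string;
  text: string;
  required: boolean; // always-on instructions vs optional extras
}

const estimateTokens = (text: string): number => Math.ceil(text.length / 4);

function assembleContext(parts: ContextPart[], budget: number): ContextPart[] {
  const kept: ContextPart[] = [];
  let used = 0;
  // Required parts are always included and charged first.
  for (const part of parts.filter(p => p.required)) {
    kept.push(part);
    used += estimateTokens(part.text);
  }
  // Optional parts are admitted greedily until the budget is spent.
  for (const part of parts.filter(p => !p.required)) {
    const cost = estimateTokens(part.text);
    if (used + cost <= budget) {
      kept.push(part);
      used += cost;
    }
  }
  return kept;
}
```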

Security: The Skill Marketplace Risk

A quick but important note: 1,184 malicious skills were discovered on major skill marketplaces in early 2026. A coordinated campaign exfiltrated API keys, crypto wallets, and browser credentials through innocuous-looking community skills.

Mitigation is straightforward:

  • Review skill code before installing
  • Pin versions
  • Don’t auto-install from untrusted sources
  • Use proper secret management (environment variables, vaults)—never store credentials in workspace files

The workspace-first, file-readable design that makes agents powerful also means everything is visible—including secrets, if you’re careless.

The Context Lifecycle

Different techniques apply at different phases of the agent lifecycle:

| Phase | Technique |
|-------|-----------|
| Startup | Deferred/lazy tool loading; minimal always-on instructions |
| Per-invocation | Compile working context from session + memory + artifacts |
| Mid-session | Compaction via iterative summarization |
| Long-running | Memory service (searchable, not pinned to context) |
| Cross-agent | Scoped handoffs with narrative translation |
| Cleanup | Automated garbage collection agents |

Concrete Implementation: Setting Up a VS Code Repository

Theory is useful. Shipping is better. Here’s a step-by-step guide to implementing these patterns in a real VS Code workspace. You can apply this to a new repo or retrofit an existing one.

Step 1: Create the Directory Structure

mkdir -p .github/instructions .github/agents .github/skills .github/prompts
mkdir -p .vscode
mkdir -p .learnings
mkdir -p docs

This gives you the progressive disclosure hierarchy: always-on files at the root, scoped instructions and agents under .github/, deep reference docs under docs/.

Step 2: Write a Lean copilot-instructions.md

Create .github/copilot-instructions.md — keep it under 100 lines. This loads on every Copilot interaction, so every line costs tokens:

# Project Instructions

## Build & Dev Commands
npm install          # Install dependencies
npm run dev          # Start dev server
npm run build        # Production build
npm run test         # Run tests
npm run lint         # Lint

## Architecture
- [Brief 2-3 sentence description of your project]
- Key directories: src/, tests/, docs/

## Conventions
- TypeScript strict mode
- [Your formatting rules]
- [Your naming conventions]

## Self-Improvement
Log learnings to `.learnings/` for continuous improvement.
See docs/ for detailed architecture and patterns.

Notice: it points to docs/ for detail. It doesn’t try to contain everything.

Step 3: Write AGENTS.md as a Table of Contents

Create AGENTS.md at the repo root — also under 100 lines:

# Agent Instructions

## Quick Reference
- Architecture: docs/architecture.md
- Patterns: docs/patterns.md
- Conventions: docs/conventions.md

## Agent Roster
- **reviewer** (.github/agents/reviewer.agent.md) — Code review
- **writer** (.github/agents/writer.agent.md) — Documentation

## Skills
- **deploy** (.github/skills/deploy/) — Deployment workflow
- **test-coverage** (.github/skills/test-coverage/) — Coverage analysis

## Key Rules
1. Run `npm run build` before pushing
2. All PRs need tests
3. No hardcoded secrets

Step 4: Create Scoped Instructions with applyTo

Create .github/instructions/typescript.instructions.md:

---
applyTo: "**/*.ts,**/*.tsx"
---
# TypeScript Conventions
- Use explicit return types on exported functions
- Prefer `interface` over `type` for object shapes
- Use `readonly` for properties that shouldn't change
- Error handling: use custom error classes, not string throws

This only loads when the agent is working with TypeScript files—zero cost for all other interactions.

Step 5: Create Your First Agent

Create .github/agents/reviewer.agent.md:

---
description: Reviews code changes for quality, security, and convention compliance
tools:
  - codebase
  - terminal
---
# Code Reviewer

You review code for:
1. Security vulnerabilities (OWASP Top 10)
2. Convention compliance (see docs/conventions.md)
3. Test coverage gaps
4. Performance concerns

Provide specific, actionable feedback. Reference line numbers.
Do NOT suggest style changes that a formatter would handle.

The tools field restricts which tools this agent can use—fewer tools means less context consumed.

Step 6: Create Your First Skill

Create .github/skills/deploy/SKILL.md:

---
description: "Handles deployment workflows including build verification,
  staging deployment, and production release"
---
# Deploy Skill

## Pre-deployment Checklist
1. Run `npm run lint` — fix any errors
2. Run `npm run test` — all tests must pass
3. Run `npm run build` — verify clean build
4. Check for uncommitted changes

## Staging Deploy
[Your staging deployment steps]

## Production Deploy
[Your production deployment steps]

This only loads when the agent detects a deployment-related task—based on the description field matching.
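How that matching works is runtime-specific and not documented here, but a keyword-overlap stand-in conveys the idea, and why concise, specific description fields matter:

```typescript
// Keyword-overlap sketch of description-based skill selection. This is
// an illustration of the idea, NOT how any particular runtime matches
// skills; real systems likely use embeddings or the model itself.

function scoreSkill(description: string, task: string): number {
  const words = (s: string) => new Set(s.toLowerCase().match(/[a-z]+/g) ?? []);
  const d = words(description);
  const t = words(task);
  let overlap = 0;
  for (const w of t) if (d.has(w)) overlap++;
  return overlap / Math.max(t.size, 1); // fraction of task words covered
}
```

Either way, a vague description field means the skill loads for the wrong tasks, or not at all for the right ones.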

Step 7: Configure Tool Sets

Create .vscode/tool-sets.jsonc:

{
  // Minimal tool set for quick questions
  "reading": {
    "codebase": true,
    "fetch": false,
    "terminal": false
  },
  // Full tool set for implementation work
  "development": {
    "codebase": true,
    "terminal": true,
    "fetch": true
  }
}

Step 8: Set Up the Self-Improvement Loop

Create three files in .learnings/:

.learnings/ERRORS.md:

# Errors Log
| Date | Error | Resolution | Status |
|------|-------|------------|--------|

.learnings/LEARNINGS.md:

# Learnings
| Date | Category | Learning | Status |
|------|----------|----------|--------|

.learnings/FEATURE_REQUESTS.md:

# Feature Requests
| Date | Request | Priority | Status |
|------|---------|----------|--------|

Add this to your copilot-instructions.md:

## Self-Improvement
| Situation | Log to |
|-----------|--------|
| Command fails | .learnings/ERRORS.md |
| User corrects you | .learnings/LEARNINGS.md (correction) |
| Knowledge outdated | .learnings/LEARNINGS.md (knowledge_gap) |
| Better approach found | .learnings/LEARNINGS.md (best_practice) |
| Missing feature | .learnings/FEATURE_REQUESTS.md |

When a learning is broadly applicable, promote it to copilot-instructions.md.

Step 9: Configure MCP Servers (Sparingly)

Create .vscode/mcp.json — only add servers you actually use:

{
  "servers": {
    "context7": {
      "type": "stdio",
      "command": "npx",
      "args": ["-y", "@context7/mcp"]
    }
  }
}

Every MCP server adds tool definitions to startup context. Start with one or two and add more only when you hit a real need.

Step 10: Write Reference Docs for Deep Context

Create docs/architecture.md, docs/conventions.md, and docs/patterns.md with the detailed information that would bloat your always-on files. The agent navigates here from AGENTS.md when it needs depth.

The Final Structure

project-root/
├── AGENTS.md                          # ≤100 lines — table of contents
├── .github/
│   ├── copilot-instructions.md        # ≤100 lines — identity + routing
│   ├── instructions/
│   │   └── typescript.instructions.md # Scoped by applyTo
│   ├── agents/
│   │   └── reviewer.agent.md          # Role with restricted tools
│   ├── skills/
│   │   └── deploy/SKILL.md            # On-demand heavy content
│   └── prompts/
│       └── pr-review.prompt.md        # Reusable workflow templates
├── .vscode/
│   ├── mcp.json                       # External tool servers
│   └── tool-sets.jsonc                # Grouped tool bundles
├── .learnings/
│   ├── ERRORS.md                      # Self-improvement: errors
│   ├── LEARNINGS.md                   # Self-improvement: corrections
│   └── FEATURE_REQUESTS.md            # Self-improvement: gaps
├── docs/
│   ├── architecture.md                # Deep reference
│   ├── conventions.md                 # Detailed conventions
│   └── patterns.md                    # Reusable patterns
└── src/                               # Your actual code

Every file has a purpose. Always-on files are tiny. Detail lives in docs and skills that load on demand. Agents are scoped by concern with restricted tools. The self-improvement loop captures institutional knowledge over time.

What This Means for You

If you’re building with AI agents today—whether that’s configuring VS Code Copilot for your team, building multi-agent pipelines, or shipping agent-powered products—the takeaway is the same:

Stop optimizing prompts. Start engineering context.

Design your repository structure, instruction hierarchy, and tool configuration so the right information reaches the right agent at the right time. Enforce conventions mechanically. Decompose agents by concern. Build in self-improvement loops. Budget your tokens like you budget your compute. And when something breaks, resist the urge to write a cleverer prompt—instead, make the environment smarter.

The teams that get this right will build agents that compound in capability over time. The ones that don’t will keep fighting context bloat, unpredictable routing, and the God Agent anti-pattern.

Context engineering isn’t just a new buzzword. It’s the discipline that makes everything else work.


Sources: This post synthesizes research from Google ADK (Dec 2025), OpenAI Harness Engineering (Feb 2026), Mikhail Rogov / Towards AI (Mar 2026), Agenteer (Jan 2026), OpenClaw architecture docs, VIGIL (arXiv:2512.07094), VS Code official docs, GitHub Blog, CPI Consulting, DEV Community, Linux Foundation AAIF, and practitioner reports across LangGraph, AutoGen, and CrewAI. Full source list available on request.