Reflect on GitHub Copilot Sessions: A Practical Workflow for Finding Weak Prompts, Tool Churn, and Better Agent Patterns

How to review GitHub Copilot Chat and CLI sessions like an engineer: inspect logs, find poor techniques, use the newest VS Code and GitHub features, and turn repeated mistakes into better agent workflows.

Reflect on GitHub Copilot Sessions

Most teams are still using GitHub Copilot like a fast autocomplete engine with a chat box attached. Ask a question, get a response, move on.

That leaves a lot of performance on the table.

If you’re using agent mode in VS Code Insiders or GitHub Copilot CLI, every meaningful session generates a trail: plans, tool calls, loaded instructions, context expansion, retries, terminal commands, and sometimes background session logs. That trail is the raw material for reflection.

Reflection, in this context, means reviewing how the agent approached a task so you can find:

  • weak prompts
  • unnecessary context
  • bad tool choices
  • missing guardrails
  • opportunities to turn repeated fixes into instructions, skills, or memory

The goal is not to judge whether the model sounded smart. The goal is to understand why the workflow was efficient or inefficient.

With the March 2026 wave of Copilot and VS Code updates, this became much easier.

Why This Matters

When an agent session goes badly, the cause is usually not “the model is bad.”

It’s usually one of these:

  1. The task was underspecified.
  2. The wrong tools or too many tools were available.
  3. The agent pulled in too much irrelevant context.
  4. The session ran long enough that runtime bloat started degrading quality.
  5. Validation happened too late.
  6. A useful lesson from a previous session was never promoted into durable guidance.

Those are engineering problems. Which means you can inspect them, name them, and improve them.

The Core Idea: Review Sessions Like a Performance Engineer

Don’t ask, “Did Copilot help?”

Ask questions like these instead:

  • How many tool calls did it take before the first useful insight?
  • Which instructions or prompt files were loaded, and were they relevant?
  • Did the agent reread or rediscover the same facts multiple times?
  • Did the session improve after a clarification, or should that clarification have been in the original request?
  • Did the session end with validation, or just a plausible-looking answer?
  • What correction did I repeat that should become a reusable rule?

Once you start looking at agent sessions that way, patterns show up fast.
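The first of those questions is easy to script. As a minimal sketch, assuming you can export a session as a JSONL event stream (the `type` field names here are hypothetical, not a real Copilot schema), counting tool calls before the first useful finding looks like this:

```python
import json

def tool_calls_before_first_insight(log_lines, insight_marker="insight"):
    """Count tool-call events that occur before the first event tagged
    as a useful finding. Field names are assumptions, not a real schema."""
    calls = 0
    for line in log_lines:
        event = json.loads(line)
        if event.get("type") == "tool_call":
            calls += 1
        elif event.get("type") == insight_marker:
            return calls
    return calls  # no insight event found: every call was pre-insight

# A fabricated three-event session for illustration:
events = [
    '{"type": "tool_call", "tool": "search"}',
    '{"type": "tool_call", "tool": "read_file"}',
    '{"type": "insight", "note": "found the failing module"}',
]
print(tool_calls_before_first_insight(events))  # prints 2
```

Even a rough count like this is enough to compare two sessions on the same task.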

What Changed Recently That Makes This Easier

The February 2026 VS Code release, published on March 4 as version 1.110, shipped the best reflection toolkit VS Code has had so far.

1. Agent Debug Panel

This is the biggest improvement.

The Agent Debug panel gives you real-time visibility into:

  • chronological agent events
  • tool calls
  • loaded customizations
  • agent flow across sub-agents

If a session feels inefficient, this is where you confirm whether the problem was prompt quality, context loading, or tool churn.

2. Session Memory

Session memory lets plans and guidance persist across turns. That matters because many failures only become visible in longer sessions.

Without session memory, every task looks isolated. With it, you can see whether the workflow actually compounds knowledge or just repeats the same setup work with new wording.

3. Context Compaction

Runtime bloat is real. Every search result, file read, and tool response competes for the model’s working memory.

Context compaction gives you a controlled way to reduce the active history and test a useful question:

Is the agent getting worse because the task is hard, or because the session is carrying too much stale baggage?

4. Fork Chat Session

Forking a chat session is ideal for experimentation.

Start from the same context, then try two approaches:

  • one branch with your original prompt
  • one branch with tighter acceptance criteria or a different agent

That makes it much easier to compare workflows without restarting from zero.

5. Create Agent Customizations From Chat

This feature matters because reflection is only useful if it changes future behavior.

If you discover that the same correction keeps happening, you can convert that lesson into:

  • a prompt file
  • an instruction file
  • a custom agent
  • a skill

That is the bridge from retrospective to durable improvement.

GitHub-Side Features That Help Too

The GitHub changelog added several useful session-review capabilities in early 2026.

Agents Tab in the Repository

GitHub’s new Agents tab centralizes session tracking in the repo itself. That gives you a cleaner place to review work across sessions, not just inside one live editor tab.

Redesigned Session Logs

This is one of the most useful updates for reflection.

GitHub’s redesigned session logs now make it easier to inspect:

  • similar tool calls grouped together
  • inline tool output previews
  • file diffs with expand-on-demand views
  • the bash commands that were run

If you want to spot repetitive or low-signal behavior, grouped logs are much more useful than a flat stream of noisy events.
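The grouping idea is easy to reproduce on anything you can export yourself. A small sketch, assuming you have already reduced a log to a list of (tool, argument) pairs by hand:

```python
from collections import Counter

def flag_repetition(tool_calls, threshold=3):
    """Return tool/argument pairs invoked at least `threshold` times.
    Repeated identical calls are a classic low-signal pattern."""
    counts = Counter(tool_calls)
    return {call: n for call, n in counts.items() if n >= threshold}

# Fabricated call list: the same grep runs four times.
calls = [
    ("grep", "getUser"), ("grep", "getUser"), ("grep", "getUser"),
    ("read_file", "auth.ts"), ("grep", "getUser"),
]
print(flag_repetition(calls))  # {('grep', 'getUser'): 4}
```

Pairs that cross the threshold are exactly the kind of low-signal repetition the grouped logs are designed to surface.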

Copilot Metrics Including Plan Mode

If you’re in an organization that has access to Copilot metrics, plan mode telemetry now has its own reporting path.

That will not tell you whether a specific session was good. But it will help answer a broader question:

Are we actually using planning features that reduce wasted execution?

What Copilot CLI Adds to the Reflection Workflow

GitHub Copilot CLI reached general availability on February 25, 2026. For this kind of work, the most important capabilities are not the flashy ones. They are the ones that make investigation behavior visible and steerable.

Plan Mode

Plan mode is the best feature for catching poor task framing before implementation starts.

If the CLI asks three rounds of clarification before it can build a reasonable plan, your request was probably too vague.

If the plan is obviously wrong and you approve it anyway, the problem is not the model. The problem is review discipline.

Built-In Specialized Agents

The built-in agents are useful because they let you separate failure modes.

  • Explore is good for codebase discovery without bloating the main session.
  • Plan is good for decomposition.
  • Task is good for execution and validation.
  • Code-review is good for finding regressions and weak changes.

If one agent consistently works better for a task than the default flow, that’s a signal to change your workflow rather than keep retrying the same general prompt.

Background Sessions and /resume

These are useful when you want to compare local and delegated workflows or inspect a longer-running session later.

The key benefit is not convenience. It’s separation. A background session gives you a cleaner artifact to review because the task boundaries are clearer.

Where to Look When You Review a Session

Here are the best artifacts to inspect, in order.

  • Agent Debug panel: loaded customizations, tool calls, agent flow. Best for finding context and orchestration problems.
  • Chat Debug view: exact request and response shape, prompt composition. Best for inspecting prompt and context payloads.
  • GitHub session logs: session history, grouped tool calls, diffs, bash commands. Best for reviewing background coding-agent work.
  • Copilot CLI plan mode: quality of task framing and decomposition. Best for catching ambiguity before execution.
  • Copilot extension logs and Chat Diagnostics: extension and environment behavior. Best for distinguishing workflow issues from tooling issues.

One caveat: not every surface is equally persistent. The Agent Debug panel is excellent for live investigation, but it is not the same thing as a clean long-term export system. Treat it as a diagnostic view, not a permanent archive.

The Reflection Rubric I Would Actually Use

After a meaningful agent session, score it quickly across six dimensions.

Use a 0 to 2 scale:

  • 0 = poor
  • 1 = acceptable
  • 2 = strong

1. Objective Clarity

Did the task have a clear goal, constraints, and stop condition?

2. Context Quality

Did the agent have the right context, without obvious irrelevant baggage?

3. Tool Efficiency

Did the agent reach useful evidence quickly, or did it thrash through search and read loops?

4. Validation Quality

Did the session end with proof, such as a test, build, diff review, or direct verification?

5. Correction Repetition

Did you have to repeat guidance you’ve given before?

6. Reuse Yield

Did the session produce a rule, prompt, skill, or memory worth keeping?

Then answer five short questions:

  1. What wasted the most tokens or time?
  2. What did I have to restate manually?
  3. What should become durable guidance?
  4. What should move out of always-on context?
  5. What should I try differently next time?

That is enough to produce useful learning without turning every session into paperwork.
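If you want those scores somewhere more durable than your head, the rubric fits in a small record type. A sketch (the field names simply mirror the six dimensions above; this is not an official Copilot artifact):

```python
from dataclasses import dataclass, fields

@dataclass
class SessionScore:
    """Six rubric dimensions, each scored 0 (poor) to 2 (strong)."""
    objective_clarity: int
    context_quality: int
    tool_efficiency: int
    validation_quality: int
    correction_repetition: int
    reuse_yield: int

    def total(self):
        # Sum all six dimensions; 12 is a perfect session.
        return sum(getattr(self, f.name) for f in fields(self))

    def weakest(self):
        # The lowest-scoring dimension is the one to attack first.
        return min(fields(self), key=lambda f: getattr(self, f.name)).name

score = SessionScore(2, 1, 0, 2, 1, 1)
print(score.total())    # prints 7 (out of a possible 12)
print(score.weakest())  # prints tool_efficiency
```

Tracking `weakest()` across a few weeks of sessions tells you which dimension deserves attention next.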

What Poor Technique Looks Like in Practice

When you inspect Copilot sessions, these are the patterns to flag.

Scope Collapse

One prompt asks for diagnosis, architecture, implementation, validation, and documentation all at once.

That sounds efficient. In practice, it makes failure analysis harder because the agent can go wrong in five different ways and the stop condition is fuzzy.

Context Flooding

You paste logs, stack traces, product requirements, and half the architecture into one opening message.

That often feels helpful to the human. It is often harmful to the agent.

Wrong Tool Density

If the session has too many available tools, the agent starts choosing between marginally relevant options instead of making fast progress.

This is one reason specialized agents and scoped customizations matter.

Validation at the End Instead of the Middle

If the agent edits code for fifteen minutes before checking the key test or build constraint, the workflow is backwards.

Human Rescue as a Habit

If you often intervene with corrections like these:

  • use the exact file path
  • stop searching and open this file
  • don’t rewrite unrelated code
  • run the targeted test first

those are not random misses. They are candidates for workflow changes.

What I Would Try First

If you want to adopt this technique without turning it into a process burden, start with three experiments.

Experiment 1: Forked Prompt Comparison

Take one real task and fork the session.

In one branch, use your normal prompt.

In the other, start with:

Before making changes, build a plan, identify the exact files likely involved,
and define the minimum validation step that will prove success.

Compare:

  • number of tool calls
  • time to first useful insight
  • amount of irrelevant exploration
  • quality of final verification
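If you record those numbers for each branch by hand, the comparison itself is trivial to automate. A sketch with made-up metric names and values (lower is better for all three here):

```python
def compare_branches(a, b):
    """Per-metric difference between two forked branches.
    Each argument is a dict of metric name -> number; negative deltas
    mean the second branch improved on the first."""
    return {metric: b[metric] - a[metric] for metric in a}

# Fabricated measurements for one task run two ways:
baseline = {"tool_calls": 31, "minutes_to_insight": 9, "offtopic_reads": 6}
planned  = {"tool_calls": 14, "minutes_to_insight": 4, "offtopic_reads": 1}
print(compare_branches(baseline, planned))
# {'tool_calls': -17, 'minutes_to_insight': -5, 'offtopic_reads': -5}
```

Even two or three comparisons like this make it obvious whether the tighter prompt is actually paying for itself.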

Experiment 2: Live Review With Agent Debug

For the next non-trivial session, keep the Agent Debug panel open the whole time.

Watch for:

  • irrelevant instruction files loading
  • repeated searches for the same symbol
  • too many file reads before the first hypothesis
  • wrong sub-agent routing

If you spot a repeated pattern twice, write it down.

Experiment 3: Promote One Rule Per Week

Don’t try to perfect everything at once.

Each week, take the single most repeated correction and promote it into one of these:

  • copilot-instructions.md if it is broadly applicable
  • a scoped instruction file if it only applies to certain files
  • a prompt file if it is a repeatable workflow
  • a custom agent if it is a stable role or protocol
  • memory if it is temporary or highly specific

This avoids turning always-on context into a junk drawer.
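The promotion decision above can be written down as a small priority function, which also makes the ordering explicit: the most specific home wins. The category flags are illustrative, not a Copilot feature:

```python
def promotion_target(broad, file_scoped, workflow, role, temporary):
    """Map a repeated correction to a durable home, most specific first.
    The categories mirror the promotion list above; this is a reading
    aid, not an official Copilot concept."""
    if temporary:
        return "memory"
    if role:
        return "custom agent"
    if workflow:
        return "prompt file"
    if file_scoped:
        return "scoped instruction file"
    if broad:
        return "copilot-instructions.md"
    return "not worth promoting yet"

print(promotion_target(broad=True, file_scoped=False, workflow=False,
                       role=False, temporary=False))
# prints copilot-instructions.md
```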

A Simple Workflow for Teams

If you want a lightweight operating rhythm, use this.

During the Session

  1. Start in plan mode for any task that is not obviously mechanical.
  2. Keep Agent Debug open for non-trivial tasks.
  3. Compact context when the session starts to feel noisy.
  4. Fork instead of restarting when trying a different approach.

After the Session

  1. Review the logs for tool churn and context problems.
  2. Score the session using the six-part rubric.
  3. Extract one failure pattern and one success pattern.
  4. Promote only the durable lesson.

That’s enough to improve the system without creating a bureaucracy around the system.

The Main Insight

Prompt engineering is still useful, but it is too small a frame for agentic workflows.

The better frame is this:

A Copilot session is an executable workflow with observable behavior.

Once you treat it that way, you stop asking whether the model was impressive and start asking whether the workflow was efficient, inspectable, and reusable.

That shift matters.

Because the teams that get the most out of Copilot won’t be the ones writing the cleverest prompts. They’ll be the ones who can inspect failures, reduce noise, promote durable guidance, and steadily improve the way their agents work.

What to Try This Week

If you want a practical starting point, do these three things:

  1. Use plan mode for the next complex Copilot CLI task.
  2. Keep the Agent Debug panel open for your next multi-step VS Code session.
  3. Fork one session and compare the default prompt against a version with explicit validation criteria.

If those three steps reveal even one repeated pattern, you already have the raw material for a better agent workflow.