The AI Model Team You Didn't Hire: A Field Guide to Your Robot Coworkers

A brutally honest and mildly humorous breakdown of Claude Opus 4.6, Claude Sonnet 4.6, Haiku, GPT 5.4, GPT 5.3 Codex, and Gemini 3.1 Pro — their strengths, quirks, and the one that implements your feature before you finish asking.

Human Commentary

I’ve used all of these models extensively for real work. This post is affectionate ribbing born from genuine daily use. If any model vendors are reading this: I still pay for all of you. Yes, all of you.

The AI Model Team You Didn’t Hire

Obligatory Legal-ish Disclaimer: These are entirely my own views and do not reflect those of my employer. I used AI models to help draft this because — well, that’s the whole point of this post. I make no claims of objectivity. I’m biased toward whichever model most recently didn’t delete my entire file.

You know how startups have that slide — “The Team” — with headshots and titles like Chief Visionary Officer and Head of Synergy?

This is that slide, except every team member is a large language model, nobody has a face, and at least one of them will rewrite your codebase while you’re in the bathroom.

Here’s your 2026 AI engineering team, based on months of daily use across real projects.


Claude Opus 4.6 — The Principal Architect

Role: The one you call when things are actually hard. Salary: $5.00 / $25.00 per million tokens (input/output). Yes, the expensive one.

Human Commentary: Interesting... this pricing looked wrong, so I went to dig through my chat debug logs and see how it was calculated. Thirty minutes later...

Opus is the senior engineer who shows up, says very little, and then drops a solution so elegant you feel personally attacked. It handles complex multi-file refactors, nuanced architectural decisions, and long-context reasoning with the kind of quiet competence that makes you wonder why you ever mass-hired junior devs.

Pros:

  • Best-in-class reasoning for genuinely complex tasks
  • Follows instructions with almost unsettling fidelity — you say “don’t add comments,” it doesn’t add comments
  • Handles ambiguity gracefully instead of guessing wildly and hoping you won’t notice
  • Extended thinking that actually thinks, not just burns tokens pretending to deliberate

Cons:

  • Slower than everything else on this list. If you’re impatient, you’ll Alt-Tab and forget you asked
  • The pricing makes your CFO send you Slack messages with just “?”
  • Sometimes over-respects your constraints and produces minimal output when you actually wanted it to elaborate

Vibe: The architect who reads the entire RFC before responding to your two-sentence question, then answers in exactly the right number of words. You resent the invoice and respect the output.


Claude Sonnet 4.6 — The Staff Engineer

Role: The daily driver. The one that actually ships your features. Salary: $3.00 / $15.00 per million tokens.

Sonnet is the staff engineer who’s in every standup, every code review, and every Slack thread. Not as flashy as Opus, nowhere near as expensive, and somehow responsible for 80% of what actually ships. It strikes the best balance of “smart enough” and “fast enough” that exists in the current model landscape.

Pros:

  • Fast enough for interactive use — no staring at a spinner wondering if it crashed
  • Excellent at code generation, editing, and following Astro/TypeScript/Rust patterns (not a coincidence I use it daily)
  • Instruction-following that respects your system prompts instead of treating them as polite suggestions
  • Great agentic performance — tool use, multi-step reasoning, file editing loops

Cons:

  • On truly novel architecture problems, it’ll give you a B+ answer where Opus gives an A
  • Occasionally “helpful” in that specific way where it adds error handling for scenarios that don’t exist
  • Context window is generous but not Gemini-level absurd

Vibe: Your most reliable team member. Shows up, does the work, doesn’t break prod. You take them for granted and you shouldn’t.


Claude Haiku 4.5 — The Intern (Surprisingly Competent)

Role: Triage, classification, quick lookups, and tasks you’d be embarrassed to bill Opus rates for. Salary: $1.00 / $5.00 per million tokens. The budget option.

Haiku is the intern who turned out to be way better than expected. You hired it to sort emails and now it’s writing your commit messages, classifying support tickets, and generating test scaffolding faster than you can review it.

Pros:

  • Absurdly fast. Responses come back before you’ve finished reading your own prompt
  • Cheap enough to run in loops, batch jobs, and pipelines without your finance team staging an intervention
  • Surprisingly good at structured output and following JSON schemas
  • Perfect for the 60% of tasks that don’t actually need a genius model
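The schema-following trick only pays off if you validate before trusting the output. Here's a minimal sketch of the pattern; the prompt wording, label taxonomy, and the sample reply are all my own inventions for illustration, not any vendor's API:

```python
import json

# Illustrative taxonomy — swap in your own labels.
VALID_LABELS = {"bug", "billing", "feature-request", "spam"}

def build_prompt(ticket: str) -> str:
    # Ask for a single JSON object so a cheap, fast model can't ramble.
    return (
        "Classify the support ticket below. Respond with ONLY a JSON object "
        'like {"label": "...", "confidence": 0.0}.\n'
        f"Allowed labels: {sorted(VALID_LABELS)}\n\nTicket: {ticket}"
    )

def parse_response(raw: str) -> dict:
    # Cheap models are fast, not infallible: validate before trusting.
    data = json.loads(raw)
    if data.get("label") not in VALID_LABELS:
        raise ValueError(f"unexpected label: {data.get('label')!r}")
    if not 0.0 <= float(data.get("confidence", -1)) <= 1.0:
        raise ValueError("confidence out of range")
    return data

# Hypothetical model reply, for illustration only:
reply = '{"label": "billing", "confidence": 0.92}'
print(parse_response(reply))  # {'label': 'billing', 'confidence': 0.92}
```

The validation step is what makes the "close enough" model safe to run in a loop: a bad answer raises instead of silently flowing downstream.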

Cons:

  • Complex reasoning falls apart. Don’t ask it to plan your distributed system architecture
  • Will cheerfully produce plausible-sounding nonsense on edge cases
  • Nuance is not its strong suit — it’s the “close enough” model

Vibe: The intern who speedruns every task you give them. Occasionally backwards. But you can’t argue with the throughput.


GPT 5.4 — The Rogue Implementer

Role: The one who implements your feature before you finish describing it. Whether you wanted it or not. Salary: $2.50 / $15.00 per million tokens.

GPT 5.4 is the contractor who reads your Jira ticket, says “yeah I get it,” and then builds something 40% more complex than what you asked for because they “anticipated your next requirement.” Sometimes this is brilliant. Sometimes you spend an hour undoing the anticipation.

The Reddit threads tell the story. “GPT5 is horrible” (6.6K upvotes). “GPT5 is a mess” (1.8K upvotes). “I finally got GPT 5.4 to be tolerable” (a title that says everything). The community relationship with GPT-5.x is that of a couple in therapy: potential is clearly there, trust has been damaged, and everyone has opinions.

Pros:

  • Genuinely impressive raw capability — when it locks onto your intent, the output is top-tier
  • Multimodal strength: vision, code, analysis, and tool use in one model
  • The gpt-5.4-mini and gpt-5.4-nano variants are incredible for cost-sensitive batch work (Simon Willison described processing 76,000 photos for $52 with nano)
  • When it works, it works — fast, confident, comprehensive

Cons:

  • The implementing-anyway problem is real. Tell it “just explain the approach,” and there’s a solid chance you get 400 lines of production code with tests you didn’t ask for. It’s like hiring a consultant who bills by the deliverable
  • OpenAI’s model routing means you sometimes aren’t even sure which model you’re actually talking to
  • Guardrails can be aggressive — it’ll refuse things that Claude handles without blinking
  • The “helpful” tone can curdle into sycophantic agreement that makes you question whether your idea was actually good or the model just doesn’t want to hurt your feelings

Vibe: The contractor who over-delivers, over-engineers, and over-invoices, but damn if the demo doesn’t look good. You just wish they’d listen when you say “keep it simple.”


GPT 5.3 Codex — The Async Build Machine

Role: The overnight batch processor. Fire, forget, review in the morning. Salary: Codex pricing via ChatGPT Pro / API. Cost-effective for large jobs.

Codex is the build engineer who works the night shift. You don’t pair-program with Codex — you hand it a task, a repo, and a test suite, and you come back later to see if the PR is green. OpenAI positioned it as an asynchronous coding agent, and that framing is accurate. It excels at well-scoped, testable tasks where the definition of “done” is clear.

The community relationship with Codex is… complicated. One OpenAI forum user in January called it “the best model I’ve used” for enterprise software, saying “claude just… does… it explains” — implying Codex acts while Claude pontificates. But the same forums are littered with tales of Codex going full rogue agent. And OpenAI shipped GPT-5.3 Codex the minute Opus 4.6 dropped; the thread about the timing hit 911 upvotes, with the community agreeing it was… purely coincidental.

Simon Willison’s JustHTML porting story remains the canonical Codex success case: he told Codex CLI to port an entire HTML5 parser from Python to JavaScript, said “OK do the rest, commit and push often,” then left to buy and decorate a Christmas tree and watch Knives Out. He came back to 9,000 lines of tested JavaScript across 43 commits, passing 9,200 tests. Total human prompts: eight. That’s the dream.

The nightmare is the other stuff.

Pros:

  • Excellent for batch refactoring, migrations, and “apply this pattern to 47 files” work
  • Sandboxed execution means it can run tests during generation — actual verification, not vibes
  • Good at following existing code patterns when given sufficient context
  • When you wake up and the PR passes CI, there is no better feeling
  • Simon Willison ported a full HTML5 parser while watching Netflix. Eight prompts. The Christmas tree got decorated too

Cons:

  • Can go full rogue agent when the task is under-specified, doing work nobody asked for
  • Will take 10 minutes on a task that should take 30 seconds
  • Works unsupervised by design, which means unsupervised mistakes compound overnight

Vibe: The offshore team that delivers your sprint tickets overnight, but also reorganized your filing cabinet “while they were at it.” When the handoff doc is clear, it’s magic. When it isn’t, you’re doing archaeology on someone else’s assumptions at 9 AM — and discovering they deleted your assumptions too.


Gemini 3.1 Pro — The Context Hoarder

Role: The one you send entire repositories to. Salary: $2.00 / $12.00 per million tokens. And it’ll eat all of them.

Gemini 3.1 Pro is the engineer who reads the entire monorepo before answering your question. Not because it needs to. Because it can. Google’s flagship model competes aggressively on price and offers a context window so large you can fit entire codebases, documentation sites, and possibly a small novel in a single prompt.

Pros:

  • Context window measured in millions of tokens — dump your whole project and ask questions about it
  • Competitive pricing, especially at the Flash-Lite tier for high-volume work
  • Strong multimodal capabilities — video, audio, images, code, all in one context
  • Google’s infrastructure means it’s fast even with enormous inputs
  • Grounding with Google Search is genuinely useful for up-to-date information

Cons:

  • Instruction following can be… creative. It’ll do approximately what you asked, in the spirit of what you meant, with a few artistic liberties
  • Code generation quality is a step below Claude and GPT for complex tasks — fine for most work, noticeable on hard problems
  • The model sometimes feels like it’s optimizing for “plausible and comprehensive” rather than “precisely correct”
  • Long-context retrieval is impressive but not perfect — needle-in-a-haystack performance degrades with truly massive inputs

Vibe: The team member who memorized every document in Confluence, including the ones that are wrong. Incredibly helpful when you need breadth. Occasionally cites the outdated wiki page instead of the current spec.


The Org Chart

| Model | Role | Best At | Will Annoy You By |
| --- | --- | --- | --- |
| Claude Opus 4.6 | Principal Architect | Hard problems, nuanced reasoning | Billing you $25/M output tokens |
| Claude Sonnet 4.6 | Staff Engineer | Everything, daily | Being boringly reliable (that’s a compliment) |
| Claude Haiku 4.5 | Speedrun Intern | Classification, quick tasks, batch jobs | Confidently wrong on complex questions |
| GPT 5.4 | Rogue Implementer | Raw capability, multimodal | Implementing features you didn’t ask for |
| GPT 5.3 Codex | Overnight Build Bot | Async refactors, migrations | Taking 10 minutes for a 30-second task |
| Gemini 3.1 Pro | Context Hoarder | Huge codebases, search-grounded Q&A | Giving you approximately correct answers |

How I Actually Use Them

Here’s the dirty truth: I use almost all of them, daily, for different things:

  • Sonnet 4.6 is my default in VS Code with Copilot. It handles 80% of my coding work. I trust it with my Astro components and Rust modules.
  • Opus 4.6 comes out for architecture decisions, tricky refactors, and anything where I need the model to think before it types.
  • Haiku runs in my automation pipelines — classifying content, generating summaries, pre-processing data.
  • GPT 5.4 I use when I want a second opinion or when its multimodal capabilities are specifically better for the task. I’ve learned to be very explicit in my prompts. “Only explain. Do not implement.” (It sometimes implements anyway.)
  • Codex gets the grunt work: “apply this linting rule across 30 files” or “migrate these test files to the new pattern.”
  • Gemini 3.1 Pro is my go-to when I need to analyze something huge — entire repos, long documents, or when I need search-grounded answers about recent developments.
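That whole workflow boils down to a dispatch table. A sketch, assuming hypothetical model ID strings and task categories of my own naming:

```python
# Task-to-model routing mirroring the workflow described above.
# Model ID strings and category names are illustrative, not official.
ROUTES = {
    "daily-coding":   "claude-sonnet-4.6",
    "architecture":   "claude-opus-4.6",
    "classification": "claude-haiku-4.5",
    "second-opinion": "gpt-5.4",
    "batch-refactor": "gpt-5.3-codex",
    "huge-context":   "gemini-3.1-pro",
}

def pick_model(task: str) -> str:
    # Fall back to the workhorse when the task doesn't fit a bucket.
    return ROUTES.get(task, "claude-sonnet-4.6")

print(pick_model("architecture"))  # claude-opus-4.6
print(pick_model("yak-shaving"))   # claude-sonnet-4.6
```

The default matters: when in doubt, route to the model you'd trust with 80% of your work anyway.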

The Real Lesson

The model that’s “best” depends entirely on the task, the context window you need, the latency you’ll tolerate, and how much you’re willing to spend. Anyone who tells you one model is categorically better than all others is either selling something or hasn’t used the others.

Build your prompts to be model-portable. Keep your context engineering tight. Use the cheap models for cheap tasks and the expensive ones for expensive problems. And when GPT 5.4 implements something you didn’t ask for, just remember: at least it ships.
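"Cheap models for cheap tasks" is easy to quantify with the per-million-token prices quoted in this post. A back-of-envelope sketch (the model name strings are my own shorthand):

```python
# (input, output) prices per million tokens, as quoted in this post.
PRICING = {
    "claude-opus-4.6":   (5.00, 25.00),
    "claude-sonnet-4.6": (3.00, 15.00),
    "claude-haiku-4.5":  (1.00, 5.00),
    "gpt-5.4":           (2.50, 15.00),
    "gemini-3.1-pro":    (2.00, 12.00),
}

def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one request, given token counts."""
    inp, out = PRICING[model]
    return (input_tokens * inp + output_tokens * out) / 1_000_000

# 100k tokens in, 20k out: the Opus/Haiku gap is roughly 5x per request.
print(f"{estimate_cost('claude-opus-4.6', 100_000, 20_000):.2f}")   # 1.00
print(f"{estimate_cost('claude-haiku-4.5', 100_000, 20_000):.2f}")  # 0.20
```

Eighty cents per request sounds trivial until the request runs ten thousand times in a pipeline, which is exactly why the triage tier exists.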


If you’re interested in how I set up these models in my daily workflow, check out my post on context engineering and agent orchestration or my VS Code Copilot context engineering guide.