Migrating a Production Agent to GPT-5.5: What the Numbers Actually Show

OpenAI shipped GPT-5.5 yesterday. Within 12 hours I had it running Adrian, our main orchestrator agent, through an overnight shift of real product-factory work. Here’s what the numbers look like.

The setup

Adrian runs every 30 minutes via OpenClaw cron. Each cycle he reads his inbox, scores new demand signals from Scout, dispatches build specs to Builder, reviews deliverables from parallel reviewers (designer, copywriter, cmo, legal, seo, release-engineer, qa-engineer), and writes decisions to a workspace. A cycle takes 3-5 minutes of wall-clock on average.

Before the swap, Adrian was on gpt-5.3-codex-spark. Spark is free via ChatGPT Pro, which is why we picked it, but it has a weekly usage cap. When it hits that cap, Adrian falls back to gpt-5.4 (also ChatGPT Pro, but against a different cap), then to ollama-cloud/glm-5.1 (flat-rate $20/mo subscription), then to openrouter/deepseek-v3.2.
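The fallback order can be pictured as an ordered list that is walked until one tier is usable. This is a minimal sketch, not OpenClaw's actual failover logic: `pick_model` and the `unavailable` set are hypothetical names, and the model ids are written as they appear in this post.

```python
# Fallback order as described in the post; the selection logic is an
# illustrative assumption, not OpenClaw's real implementation.
FALLBACK_CHAIN = [
    "gpt-5.3-codex-spark",       # ChatGPT Pro, weekly cap
    "gpt-5.4",                   # ChatGPT Pro, separate cap
    "ollama-cloud/glm-5.1",      # flat-rate subscription
    "openrouter/deepseek-v3.2",  # last tier
]


def pick_model(chain, unavailable):
    """Return the first model in `chain` that is not in `unavailable`."""
    for model in chain:
        if model not in unavailable:
            return model
    raise RuntimeError("every tier in the fallback chain is unavailable")


# With spark capped, the next tier is selected.
print(pick_model(FALLBACK_CHAIN, unavailable={"gpt-5.3-codex-spark"}))
```

Keeping the chain as plain data makes it easy to reorder tiers without touching the selection code.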

Yesterday, spark hit 0% remaining and locked until today at 17:33 CEST. That was actually perfect timing — it let me measure 5.5 against 5.4 cleanly for 11 hours.

The migration (surgical, not wholesale)

I didn’t move all 21 agents to 5.5. The cost math doesn’t work: API-priced 5.5 is $5/M input and $30/M output — 2× the 5.4 price. Worker agents like Builder, QA, Release Engineer do grunt work. They’re fine on GLM-5.1 at $0 marginal cost through our Ollama Cloud Pro subscription.
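To make the cost math concrete, here is a rough sketch that prices the overnight 5.5 volume from the results table further down (1.68M fresh input tokens, 107K output tokens) at the API rates quoted above. It assumes the 5.4 price is exactly half the 5.5 price, and it leaves cached tokens out because this post does not quote a cached-token rate. `api_cost` is a hypothetical helper.

```python
# Prices in USD per million tokens. The 5.5 figures are the ones quoted in
# the post; the 5.4 figures are derived from the "2x" statement (assumption).
PRICE_PER_M = {
    "gpt-5.5": {"input": 5.00, "output": 30.00},
    "gpt-5.4": {"input": 2.50, "output": 15.00},
}


def api_cost(model, input_tokens, output_tokens):
    """Metered cost in USD for fresh input plus output tokens (cached excluded)."""
    price = PRICE_PER_M[model]
    total = input_tokens * price["input"] + output_tokens * price["output"]
    return total / 1_000_000


# Overnight 5.5 totals from the results table.
for model in PRICE_PER_M:
    print(f"{model}: ${api_cost(model, 1_680_000, 107_000):.2f}")
```

Under those assumptions one night of Adrian alone would cost about $11.61 on metered 5.5, before cached tokens, and half that on 5.4.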

Adrian is different. He’s the brain. He makes the strategic decisions (GO/NO-GO scoring of signals against a rubric with a threshold of 60), routes work to agents, and breaks ties between parallel reviewers. Frontier reasoning pays off there. And crucially, through the Codex CLI path, Adrian’s requests go against the ChatGPT Pro weekly cap, not per-token billing. The marginal cost is effectively zero until we hit the cap.

So I edited two cron jobs:

openclaw cron edit adrian-ceo-loop      --model openai-codex/gpt-5.5 --thinking medium
openclaw cron edit adrian-morning-brief --model openai-codex/gpt-5.5 --thinking medium

And nothing else. Workers stayed on GLM-5.1.

What went wrong first (twice)

Problem 1: OpenClaw didn’t know gpt-5.5. Force-runs failed with FailoverError: Unknown model: openai-codex/gpt-5.5. The model registry lives in pi-ai/dist/models.generated.js, so I patched a new entry in there (context: 1M tokens, cost: 0 via the Codex subscription path) and restarted the gateway.

Problem 2: The Codex ACL trap. After registering 5.5, cron jobs STILL didn’t run. It turned out to be a completely unrelated file-permission bug on the Codex native binary that had been silently killing all agent runs for hours, not anything specific to 5.5. That one gets its own blog post. Two commands (chmod a+rx and setfacl -b) fixed it. Adrian’s first real 5.5 cycle fired at 23:07.

Both issues resolved in 45 minutes. Then I let it run overnight.

Overnight results — the headline numbers

Period: 23:07 CEST (2026-04-23) through 09:00 CEST (2026-04-24) — just under 10 hours.

Model	Sessions	Turns	Fresh input tokens	Output tokens	Cached tokens	Errors	Sec/session
gpt-5.3-codex-spark (pre-cap and Telegram-direct fallback attempts)	2	30	972K	7.8K	1.5M	6	2766
gpt-5.4 (also-fallback)	3	50	265K	10K	2.1M	0	195
gpt-5.5	16	394	1.68M	107K	21.7M	2	280

Read that table twice. Sixteen full cycles on 5.5 overnight, 394 turns, and the ChatGPT Pro weekly cap barely moved. It was at 98% when I swapped. It’s at 97% now, and that 1% drop happened during the broken period when 5.5 was failing over to 5.4. There’s been no visible burn on 5.5 itself.
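The per-session and per-turn figures used in the rest of this post can be recomputed directly from the table. The sketch below assumes the Sec/session column is a mean over sessions. `Row` and `derived` are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass
class Row:
    model: str
    sessions: int
    turns: int
    errors: int
    sec_per_session: float  # assumed to be the mean wall-clock per session


# Values copied from the overnight results table.
ROWS = [
    Row("gpt-5.3-codex-spark", 2, 30, 6, 2766),
    Row("gpt-5.4", 3, 50, 0, 195),
    Row("gpt-5.5", 16, 394, 2, 280),
]


def derived(row):
    """Turns per session, seconds per turn, and error rate for one table row."""
    turns_per_session = row.turns / row.sessions
    return {
        "turns_per_session": turns_per_session,
        "sec_per_turn": row.sec_per_session / turns_per_session,
        "error_rate": row.errors / row.turns,
    }


for row in ROWS:
    d = derived(row)
    print(f"{row.model}: {d['turns_per_session']:.1f} turns/session, "
          f"{d['sec_per_turn']:.1f} s/turn, {d['error_rate']:.1%} errors")
```

With only three sessions on 5.4 and two on spark, these ratios are rough indicators rather than benchmarks.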

What’s actually different about 5.5

No slower per turn, despite the size bump

5.4 averaged 11.7 seconds per turn (195s / ~16.7 turns)
5.5 averaged 11.4 seconds per turn (280s / ~24.6 turns)

5.5 is about 3% faster per turn, which is within noise given that 5.4 only ran three sessions. I did not expect that. My prior was “bigger model, slower responses”, and it turned out wrong in this particular comparison. One possible explanation: 5.5 gets the right answer in fewer attempts, so fewer retries offset any extra wall-clock per inference. This run can’t confirm that.

More work per cycle

5.4 cycles averaged 16.7 turns of work each
5.5 cycles averaged 24.6 turns each — 47% more work per cycle

In practice, Adrian’s 5.5 cycles scored more signals, dispatched more specs, and wrote denser STATUS.md snapshots than his 5.4 cycles.

Low error rate

5.4: 0 errors in 50 turns (0%)
5.5: 2 errors in 394 turns (0.5%)

Both within noise. The two 5.5 errors were transient: one context overflow at 00:14 that auto-compacted successfully, and one malformed tool call that was retried. No agent sessions actually failed on 5.5 overnight.

Bigger context actually matters

5.5 ships with a 1M-token context window vs 5.4’s 272K. Adrian reads a lot per cycle: STATUS.md, signal index files, backlog, directives, review queue state. The bigger context means fewer accidental compactions mid-cycle. I saw one compaction last night; 5.4 cycles used to trigger them more often.
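A quick way to see why the window size matters is to compare a cycle's prompt size against each model's budget. This is only a sketch: the 15% headroom and the 300K-token cycle are made-up illustration values, `needs_compaction` is a hypothetical helper, and OpenClaw's real compaction trigger may work differently.

```python
# Context window sizes as stated in the post.
CONTEXT_WINDOW = {"gpt-5.4": 272_000, "gpt-5.5": 1_000_000}


def needs_compaction(model, prompt_tokens, headroom=0.15):
    """True if the prompt exceeds the window minus a reserved headroom fraction."""
    budget = CONTEXT_WINDOW[model] * (1 - headroom)
    return prompt_tokens > budget


cycle_tokens = 300_000  # hypothetical size of one heavy cycle
for model in CONTEXT_WINDOW:
    print(model, needs_compaction(model, cycle_tokens))
```

The same cycle that overflows the smaller window fits comfortably in the larger one.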

The cap question

The one we still can’t measure precisely: how does 5.5 usage count against the ChatGPT Pro weekly cap vs 5.4?

Hypothesis: it counts by request count + approximate token weight. If 5.5 is 2× the token-weight of 5.4 (matching the API pricing), Adrian on 5.5 might burn cap 2× faster than on 5.4. At our overnight volume that would still be fine (~7%/week), but it’s worth tracking across a full weekly cycle.

Concrete plan: I installed an hourly tracker (track-adrian-5-5.py) that logs tokens, turns, and sessions per-model to a CSV. Every 24 hours I check the Codex analytics page and record the %-remaining. By next Thursday I’ll have real numbers.
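The tracker script itself is not shown here. As a rough idea of the logging half, the sketch below appends one per-model row to a CSV on each run. The column names and the `append_snapshot` function are assumptions for illustration, not the contents of track-adrian-5-5.py, and collecting the counts from session logs is left out.

```python
import csv
from datetime import datetime, timezone
from pathlib import Path

# Hypothetical column layout for the usage log.
FIELDS = ["timestamp", "model", "sessions", "turns", "input_tokens", "output_tokens"]


def append_snapshot(csv_path, model, sessions, turns, input_tokens, output_tokens, now=None):
    """Append one per-model usage row, writing the header first if the file is new."""
    path = Path(csv_path)
    is_new = not path.exists()
    now = now or datetime.now(timezone.utc)
    with path.open("a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if is_new:
            writer.writeheader()
        writer.writerow({
            "timestamp": now.isoformat(),
            "model": model,
            "sessions": sessions,
            "turns": turns,
            "input_tokens": input_tokens,
            "output_tokens": output_tokens,
        })
```

An hourly cron entry could call `append_snapshot` once per model. The daily %-remaining reading would still be recorded by hand, since it comes from the analytics page.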

Operational changes that made this work

A few things that weren’t about 5.5 directly but made the migration feel safe:

Per-cron model override, not just defaults. Only Adrian’s crons got 5.5. Everything else stayed put. Easy to revert: one cron edit command per job. No rebuild, no restart.
Logged the chatgpt.com/codex/cloud/settings/analytics#usage page before and after. Baseline: 98% remaining. Post-swap after 11 hours of heavy 5.5 use: 97%. That’s the kind of evidence I can actually quote.
Prevented a recurring class of alert noise first. Before swapping, I tightened the agent health monitor’s grep patterns so it wouldn’t false-positive on 5.5 narrative text containing strings like “timed out.” (It was matching Playwright test narratives and firing Telegram alerts.) Fixed pattern to match only "errorMessage":"terminated" — actual cron-kill markers. Alert storm stopped before the swap was even in place.
Wrote a measurement plan before the swap. If I couldn’t have answered “what would success look like” in advance, I wouldn’t have done it. Success was “zero visible cap movement after 6 cycles.” We beat that by a wide margin.
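The alert-pattern fix from the list above can be illustrated with two regular expressions. The loose pattern and both sample log lines are invented for illustration. Only the strict marker, "errorMessage":"terminated", comes from this post.

```python
import re

# Loose pattern of the kind that false-positives on narrative text (assumed).
LOOSE = re.compile(r"timed out|terminated", re.IGNORECASE)
# Strict pattern: only the structured cron-kill marker quoted in the post.
STRICT = re.compile(r'"errorMessage":"terminated"')


def is_cron_kill(line):
    """True only when the line carries the structured cron-kill marker."""
    return STRICT.search(line) is not None


# Hypothetical log lines.
narrative = "Playwright checkout test timed out on the first try, passed on retry"
kill = '{"status":"error","errorMessage":"terminated"}'

print(bool(LOOSE.search(narrative)), is_cron_kill(narrative))  # loose hit vs strict
print(is_cron_kill(kill))
```

Matching on the structured field rather than free text means the monitor ignores anything the model writes in prose.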

What I’d recommend

If you’re on ChatGPT Pro: 5.5 is effectively free as long as you stay under the weekly cap. Swap your orchestrator agent first and watch the %-remaining for 24 hours. Don’t swap workers yet.
If you’re on the metered API: 5.5 is 2× the price. Only worth it for agents doing real reasoning — orchestration, strategic decisions, tool-heavy workflows. Workers doing content generation, scraping, or code modification are fine on 5.4 or GLM-5.1.
Don’t flip thinking to high on day one. I stayed on medium. Higher reasoning settings can multiply the token burn 5-10× on complex cycles, and the incremental quality isn’t clear yet. Earn the upgrade.
Pin spark as a capped fallback tier. We moved spark out of the default primary slot (it’s capped for most of the week now anyway) and kept it as an option via per-agent override. If spark un-caps mid-week, nice to have. If it doesn’t, no loss.

Next steps for us

Next week: daily measurement until we have a full-week cap-burn number on 5.5.
If 5.5 stays under 30% weekly burn for two weeks → consider cranking Adrian to high reasoning on complex cycles (opportunity-scoring specifically) and measure the quality delta.
If the numbers keep looking like this, expand 5.5 to Builder too — Builder does code-generation work where frontier reasoning shows up in fewer retries.
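The "under 30% weekly burn for two weeks" gate is simple enough to write down as a check. This is a sketch with hypothetical names (`ready_for_high_reasoning`, `weekly_burn_pct`), and the burn figures in the usage line are made up for illustration.

```python
def ready_for_high_reasoning(weekly_burn_pct, threshold=30.0, weeks_required=2):
    """True if each of the most recent `weeks_required` weekly burns is under `threshold`.

    `weekly_burn_pct` is a list of cap percentages consumed per week, oldest first.
    """
    recent = weekly_burn_pct[-weeks_required:]
    return len(recent) == weeks_required and all(b < threshold for b in recent)


# Hypothetical readings: two consecutive weeks under the threshold.
print(ready_for_high_reasoning([12.0, 18.0]))
```

Requiring a full two weeks of data means a single quiet week cannot trigger the upgrade on its own.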

Building with Jeff & Zachary at buildwithjz.com