
Six Things `openclaw doctor --fix` Doesn't Fix: Notes From the 2026.5.5 Upgrade

2026-05-06 · MoneyMachine

I ran openclaw update this morning to pull 2026.4.29 → 2026.5.5. The update command itself was clean (121s package fetch, 47s doctor pre-check, “Update Result: OK”). Then I ran a casual openclaw plugins install @openclaw/codex — and the gateway started screaming.

Six things broke. The headline is that doctor --fix is genuinely good at the things it knows about, but it has gaps, and those gaps will bite anyone running OpenClaw in production. This post catalogs them so the next person Googling these errors has somewhere to land.

What 2026.5.5 actually changed

Two big shifts under the hood:

1. Brave and codex are now external plugins. Through 2026.4.x they shipped in core. Starting 2026.5.5, your config is invalid unless @openclaw/brave-plugin and @openclaw/codex are installed as separate npm packages. The error message leads with the config key rather than the package name, which is fine if you read each entry through to the install command:

- plugins.entries.brave: plugin not installed: brave — install the official
  external plugin with: openclaw plugins install @openclaw/brave-plugin
- plugins.entries.codex: plugin not installed: codex — install the official
  external plugin with: openclaw plugins install @openclaw/codex

Less fine: openclaw plugins install itself refuses to run while the config is invalid. So if you read the message and try to follow it directly, you get nowhere:

$ openclaw plugins install @openclaw/codex
Invalid config at /home/agentops/.openclaw/openclaw.json:
- tools.web.search.provider: web_search provider is not available: brave
  (install or enable plugin "brave", then run openclaw doctor --fix)

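You have to run openclaw doctor --fix first; that step auto-installs the missing plugins as a side effect of attempting to fix the broader config validation problem. Order matters here, and the two error messages work against each other: one tells you to run openclaw plugins install, and the other is what that command prints when it refuses to run.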
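The order that worked for me can be sketched as below. It uses only commands that appear in this post; the OPENCLAW variable and the repair_then_verify function name are my own scaffolding (hypothetical), there so the sequence can be dry-run with OPENCLAW=echo before touching a real install.

```shell
# Binary to invoke. Override with OPENCLAW=echo for a dry run (hypothetical knob).
OPENCLAW="${OPENCLAW:-openclaw}"

# Repair the config first (the step that auto-installed the missing plugins
# for me), then run doctor again to see which warnings are left.
repair_then_verify() {
  "$OPENCLAW" doctor --fix || return 1
  "$OPENCLAW" doctor
}
```

Call repair_then_verify after openclaw update and before any openclaw plugins install.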

2. Codex model namespace renamed. Models that used to be addressed as openai-codex/gpt-5.5 are now openai/gpt-5.5, paired with a new field agentRuntime.id: "pi" to tell OpenClaw “yes, route this through the codex plugin’s app-server child.” Doctor migrates agents.list[].model and session route state automatically, with a clear summary in its output:

Repaired Codex model routes:
- agents.defaults.model.primary: openai-codex/gpt-5.5 -> openai/gpt-5.5;
  set agentRuntime.id to "pi".
- agents.list.main.model.primary: openai-codex/gpt-5.5 -> openai/gpt-5.5;
  set agentRuntime.id to "pi".
- agents.list.builder.model.primary: openai-codex/gpt-5.5 -> openai/gpt-5.5;
  set agentRuntime.id to "pi".

This part worked. Adrian (our orchestrator), Builder, and the agents.defaults entry all got migrated cleanly. So far so good.
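One way to double-check a migration like this is to scan the file for any string that still uses the old prefix. A minimal sketch, assuming openclaw.json is plain JSON that jq can parse; find_old_namespace is a hypothetical helper name, not an OpenClaw command.

```shell
# List every string value in a JSON file that still starts with the old
# namespace. Prints one "path<TAB>value" line per hit, nothing if clean.
find_old_namespace() {
  jq -r '
    paths(strings | startswith("openai-codex/")) as $p
    | "\($p | map(tostring) | join("."))\t\(getpath($p))"
  ' "$1"
}
```

Run it as find_old_namespace ~/.openclaw/openclaw.json. It takes any JSON file, so it is not tied to the main config.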

The first thing doctor doesn’t fix: cron payload models

Here’s the problem. Each cron job in OpenClaw stores its own model override at payload.model:

{
  "id": "871acd43-c5e7-44fc-b496-7151c1186f9f",
  "agentId": "main",
  "name": "adrian-ceo-loop",
  "payload": {
    "kind": "agentTurn",
    "model": "openai-codex/gpt-5.5",
    "thinking": "high"
  }
}

Doctor does not touch these. The agent definition gets migrated; the cron job that calls that agent every 30 minutes still has the old model name. Three of our crons were affected: adrian-ceo-loop, adrian-morning-brief, and builder-inbox-check. Each one would have failed silently the next time it fired, with the codex plugin probably refusing to route a model name from a namespace that no longer exists.

The fix is one CLI command per job:

openclaw cron edit 871acd43-c5e7-... --model openai/gpt-5.5

But you have to know to look. I found mine with a jq query:

sudo cat /home/agentops/.openclaw/cron/jobs.json | jq '
  [.jobs[] | select(.payload.model // "" | startswith("openai-codex/"))
    | {id, name, model: .payload.model}]'
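To apply the rename to every match in one pass, the query and the edit command can be chained. This is a sketch, not something I have hardened: it assumes the jobs.json layout shown above, migrate_cron_models is a hypothetical function name, and the OPENCLAW variable exists only so you can preview with OPENCLAW=echo.

```shell
# Binary to invoke. Override with OPENCLAW=echo to print the commands instead.
OPENCLAW="${OPENCLAW:-openclaw}"

# For each job whose payload.model is still in the old namespace, swap the
# prefix and pass the new name to `openclaw cron edit <id> --model <model>`.
migrate_cron_models() {
  jq -r '
    .jobs[]
    | select(.payload.model // "" | startswith("openai-codex/"))
    | "\(.id)\t" + "openai/" + (.payload.model | ltrimstr("openai-codex/"))
  ' "$1" |
  while IFS="$(printf '\t')" read -r id model; do
    # </dev/null keeps the CLI from swallowing the rest of the job list.
    "$OPENCLAW" cron edit "$id" --model "$model" </dev/null
  done
}
```

Preview first with OPENCLAW=echo, then run it for real against /home/agentops/.openclaw/cron/jobs.json. If the file is only readable by root, as mine is, the function has to run in a shell that can read it.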

This is the kind of thing that should ship in doctor --fix. After every namespace rename in OpenClaw history (and there have been a few), cron payload models have lagged behind. It tops my feature-request list below.

The second thing: stale gateway processes outlive systemd

This one cost me 15 minutes of confusion.

When you run openclaw update, npm replaces ~/.npm-global/lib/node_modules/openclaw/dist/* under the still-running gateway. The gateway’s Node process has all its bundle imports already resolved in memory — it doesn’t care that the files on disk changed. But OpenClaw uses content-hashed bundle filenames (task-registry.maintenance-B21L7nDu.js, status.summary-DSNNzZuu.js). The new install ships different hashes. So anything the still-running gateway tries to lazy-import now points at filenames that don’t exist anymore.

The doctor’s health check probe hit this:

Health check failed: Error: Cannot find module
  '/home/agentops/.npm-global/lib/node_modules/openclaw/dist/task-registry.maintenance-DuW0FRWY.js'
imported from
  '/home/agentops/.npm-global/lib/node_modules/openclaw/dist/status.summary-D7d6QRTx.js'

Neither of those hashes existed on disk. The new install had B21L7nDu and BbNKO5T1 for that module; DuW0FRWY was a ghost from the old deployment, alive only inside the running process’s import map.
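The underlying condition is simple: the process started before the install directory last changed. A rough sketch of that check, assuming GNU date and stat plus procps ps (true on my Linux box) and assuming the dist/ directory's mtime moves when npm swaps the bundles. Both function names are hypothetical.

```shell
# Succeeds (exit 0) when a process start time is older than the install time.
# Both arguments are Unix timestamps in seconds.
started_before_install() {
  [ "$1" -lt "$2" ]
}

# Live wrapper: $1 is the gateway PID, $2 is the path to dist/.
# Returns 0 if stale, 1 if fresh, 2 if the PID or path cannot be read.
gateway_is_stale() {
  lstart="$(ps -o lstart= -p "$1")"
  [ -n "$lstart" ] || return 2                    # no such process
  started="$(date -d "$lstart" +%s)" || return 2  # parse ps start time
  installed="$(stat -c %Y "$2")" || return 2      # directory mtime
  started_before_install "$started" "$installed"
}
```

For example: gateway_is_stale 2785130 ~/.npm-global/lib/node_modules/openclaw/dist && echo "restart the gateway".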

I ran systemctl restart openclaw-gateway. systemd reported “active (running)”. sudo ss -tlnp | grep 18789 (the -p flag is what shows the owning process) confirmed the port was being held, but by the wrong process:

$ ps -p 2785130 -o pid,etime,cmd
    PID     ELAPSED CMD
2785130       04:10 /usr/bin/node .../openclaw/dist/index.js gateway --port 18789

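The orphan PID 2785130 had a parent of 1079 (some old user session, not init), so it was presumably running outside the openclaw-gateway unit's control group. systemd only stops the processes it tracks for a unit, which is why the orphan survived the restart. The systemd-managed PID 2789143 kept failing to bind: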

Gateway failed to start: gateway already running under systemd; existing
gateway did not become healthy after 30000ms | another gateway instance is
already listening on ws://127.0.0.1:18789 | listen EADDRINUSE

Worse, the systemd unit was in a “failed restart loop” state because of the EADDRINUSE failures. Even after killing the orphan, systemd kept retrying with stale failure backoff. The fix sequence:

sudo systemctl stop openclaw-gateway
sudo kill -KILL 2785130
sudo systemctl reset-failed openclaw-gateway
sudo systemctl start openclaw-gateway

reset-failed was the unobvious one. Without it, systemd treated the unit as “permanently failed”: the EADDRINUSE loop had exhausted the unit’s start rate limit, and reset-failed is what clears that counter along with the failed state.

I’m now adding “audit for orphan gateway PIDs” to my upgrade runbook. I’ve written a separate post about just this footgun.
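The audit itself can be small: list every gateway-looking process and drop the one systemd considers the unit's main PID. A sketch under assumptions: the match pattern is copied from the ps output above, and filter_orphans and audit_orphans are hypothetical names.

```shell
UNIT="openclaw-gateway"
PATTERN='openclaw/dist/index.js gateway'

# Print every PID in the remaining arguments that is not the main PID ($1).
filter_orphans() {
  main_pid="$1"; shift
  for pid in "$@"; do
    if [ "$pid" != "$main_pid" ]; then echo "$pid"; fi
  done
}

# Live wrapper: ask systemd for the unit's main PID, then compare it with
# every process whose command line matches the gateway pattern.
audit_orphans() {
  main="$(systemctl show --property MainPID --value "$UNIT")"
  # Unquoted on purpose: pgrep prints one PID per line, one argument each.
  filter_orphans "$main" $(pgrep -f "$PATTERN")
}
```

Any PID that audit_orphans prints is a candidate for the stop, kill, reset-failed, start sequence above. Check it with ps before killing anything.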

The third thing: commands.ownerAllowFrom is the missing config

Doctor 2026.5.5 raises this warning every run:

No command owner is configured.
A command owner is the human operator account allowed to run owner-only
commands and approve dangerous actions, including /diagnostics,
/export-trajectory, /config, and exec approvals.
DM pairing only lets someone talk to the bot; it does not make that sender
the owner for privileged commands.

This isn’t a 5.5 regression — the config has always been needed — but doctor surfaces it as a warning more aggressively now. If you’ve been running on Telegram pairing-mode-only and never set commands.ownerAllowFrom, your /diagnostics calls have been silently denied.

Fix:

openclaw config set commands.ownerAllowFrom '["telegram:5749900827"]' --strict-json

(That’s my Telegram user ID; substitute your own. Find it by digging in ~/.openclaw/agents/main/sessions/sessions.json for telegram:<digits> patterns.)
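The digging can be done with grep. A minimal sketch, assuming only that sender IDs appear in the file as telegram:<digits>; telegram_ids is a hypothetical helper name.

```shell
# Print each distinct telegram:<digits> sender found in a file,
# most frequently seen first.
telegram_ids() {
  grep -oE 'telegram:[0-9]+' "$1" | sort | uniq -c | sort -rn | awk '{print $2}'
}
```

Run it as telegram_ids ~/.openclaw/agents/main/sessions/sessions.json. On a single-operator setup the first line is likely yours, but confirm it against your Telegram account before making it the command owner.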

The fourth thing: heartbeat directPolicy default flipped

Doctor:

Heartbeat agent "main": heartbeat delivery is configured while
heartbeat.directPolicy for agent "main" is unset. Heartbeat now allows
direct/DM targets by default. Set it explicitly to "allow" or "block" to
pin upgrade behavior.

Translation: the default semantics changed in 5.5. If you don’t pin it explicitly, future upgrades may flip it back. Pin it:

openclaw config set 'agents.list[0].heartbeat.directPolicy' 'allow'

(agents.list[0] is main in our config; verify with openclaw config get 'agents.list[0].id'.)
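Hard-coding [0] breaks if the list order ever changes. A small sketch that looks the index up by agent id instead, assuming openclaw.json is plain JSON that jq can parse and that agents.list is an array of objects with an id field (which is what the config get check above implies). agent_index is a hypothetical helper name.

```shell
# Print the zero-based position of the agent with id $2 in agents.list of
# the config file $1. Prints nothing if no agent has that id.
agent_index() {
  jq -r --arg id "$2" '
    .agents.list | to_entries[] | select(.value.id == $id) | .key
  ' "$1"
}

# Example use (not run here):
#   idx="$(agent_index ~/.openclaw/openclaw.json main)"
#   openclaw config set "agents.list[$idx].heartbeat.directPolicy" 'allow'
```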

For us “allow” is correct — Adrian’s heartbeats DM Jeff every 30 minutes. If your agents shouldn’t have direct delivery, set "block".

The fifth thing: visibleReplies set wrong by doctor itself

This one is amusing. Doctor’s first --fix pass migrates messages.groupChat.visibleReplies to "message_tool":

Set messages.groupChat.visibleReplies to "message_tool" so group/channel
replies use the message tool by default.

Then the second doctor pass warns that the message tool isn’t available for any of our agents:

messages.groupChat.visibleReplies is set to "message_tool", but the message
tool is unavailable for agent "main", agent "scout", and 16 more; OpenClaw
falls back to automatic visible replies, so normal replies may post to the
source chat.

Doctor did the migration, then immediately complained about its own work. Set it back to "automatic" and the warning goes away.
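For completeness, the revert as a command. This is a sketch by analogy with the config set and config get calls elsewhere in this post, so treat the exact quoting as an assumption; the OPENCLAW variable and the function name are hypothetical scaffolding for a dry run.

```shell
# Binary to invoke. Override with OPENCLAW=echo to print instead of run.
OPENCLAW="${OPENCLAW:-openclaw}"

# Put group-chat replies back on automatic, then read the value back.
revert_visible_replies() {
  "$OPENCLAW" config set 'messages.groupChat.visibleReplies' 'automatic' || return 1
  "$OPENCLAW" config get 'messages.groupChat.visibleReplies'
}
```

Run doctor again afterwards to confirm the warning is gone.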

The sixth thing: 44 clobbered configs from May 2

This isn’t from the 5.5 upgrade. It’s from a separate panic loop on May 2 where some validation churn caused OpenClaw to write 44 forensic snapshots in 21 minutes (every config write triggered a snapshot). When I hit the cap, doctor stopped writing forensics:

Config clobber snapshot cap reached for /home/agentops/.openclaw/openclaw.json:
44 existing .clobbered.* files; skipping additional forensic snapshots.

The clobbered files are not a recovery path — .bak and .last-good are. So I moved them all to ~/.openclaw/archive/ to clear the cap. Future panics now have a fresh 44-snapshot budget.
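The move is a short loop. A sketch, assuming the snapshots sit next to the config and are named openclaw.json.clobbered.* (my reading of the doctor message above); archive_clobbered is a hypothetical helper name. It leaves .bak and .last-good alone.

```shell
# Move every openclaw.json.clobbered.* file in state dir $1 into $1/archive/
# and print how many files were moved.
archive_clobbered() {
  state_dir="$1"
  archive="$state_dir/archive"
  mkdir -p "$archive" || return 1
  count=0
  for f in "$state_dir"/openclaw.json.clobbered.*; do
    [ -e "$f" ] || continue   # the glob matched nothing
    mv -- "$f" "$archive/" && count=$((count + 1))
  done
  echo "$count"
}
```

Run it as archive_clobbered ~/.openclaw.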

What I changed in our setup

End state after the cleanup:

  • commands.ownerAllowFrom set to my Telegram ID.
  • agents.list[main].heartbeat.directPolicy: "allow".
  • messages.groupChat.visibleReplies: "automatic".
  • 3 cron jobs migrated to openai/gpt-5.5.
  • Stale gateway PID killed, systemd unit reset and started clean.
  • 44 clobbered configs + 4 numbered .bak.N files moved to ~/.openclaw/archive/.
  • content-writer and site-builder agent dirs (retired in v3, three months ago) finally moved out of ~/.openclaw/agents/ to ~/.openclaw/agents-retired/. Their workspaces too.

The whole rescue took ~25 minutes of focused troubleshooting. Doctor’s running clean now: only advisory warnings remaining (Telegram first-time setup mode, which is intentional for our single-operator setup; “personal Codex CLI assets found”, which is informational; state dir perms, which we keep loose because of an ACL that lets the dashboard read across boundaries).

What I’d file as feature requests

If I were adding to OpenClaw’s roadmap:

  1. Doctor should migrate cron payload.model overrides alongside agent definitions. The current behavior is silent breakage on the next cron firing.
  2. openclaw plugins install should suspend config validation for the duration of its own install, since the config is expected to be invalid until the install completes.
  3. The “module not found” error from a stale gateway should detect that the running process predates the latest dist/ mtime and tell the user to restart, not just spit a Node module-resolution stack.

Filing none of these myself today. The factory’s been down long enough.

Adrian’s heartbeat is back

Confirmed by 16:02 CEST: Adrian’s 30-minute heartbeat ran on the new gateway, the new model namespace, the new runtime — and DM’d me a perfectly normal status update. Worker agents resumed their 4-hour inbox checks. Scout’s HN demand-scan fired at the next hour. Dashboard health bar is green.

The factory is running again. On to actual product work.

