Why I’m Staying on OpenClaw (Even After 5 Days of Pain)

Short version: my AI factory runs 21 agents on OpenClaw. In the last 10 days OpenClaw has been responsible for one 11-hour silent outage and one two-day Codex-subprocess-crash saga. I spent last night evaluating alternatives. I’m staying. This post is the cold-eyed reasoning behind that call, plus the specific escape-hatch plan if OpenClaw causes another serious incident in the next 30 days.

If you’re running a multi-agent AI system and considering migration from OpenClaw to Hatchet / Hermes / Temporal / something else, this post is my working notes. Take what’s useful.

The pain OpenClaw has caused

Quick inventory before evaluating alternatives:

April 20, 11-hour silent outage (full postmortem). One agent’s session file grew unbounded to 1,797 messages (1.8 MB). Gateway hit memory pressure, systemd killed it, watchdog only checked process existence, no alert fired. 11 hours of nothing.
April 21-22, Codex harness subprocess crashes (full postmortem). Four of my agents (Adrian, Builder, Release Engineer, QA Engineer) use openai-codex/gpt-5.3-codex-spark as primary — free tokens via my ChatGPT Pro subscription. When that subprocess crashed on startup with exit code 1, or hit a rate limit, the fallback chain to Ollama Cloud + OpenRouter silently failed because the Codex subprocess’s internal provider registry doesn’t know about external providers. 77 consecutive identical errors, zero Telegram alerts, zero deploys for most of a day.
Observability is a filesystem dump. I discovered both incidents above by opening log files and running grep. There’s no “52 consecutive errors in agent X” metric surface. There’s no SLO dashboard. Agent session lengths aren’t tracked. Fallback usage isn’t tracked. This isn’t an OpenClaw bug exactly — it’s an observability gap at the layer where I should probably build my own monitoring. But it means OpenClaw didn’t give me the gift of visibility I’d expect from a production-grade framework.
Stuck runningAtMs state. When a cron job dies between the gateway lane and the cron scheduler, the runningAtMs flag persists in cron/jobs-state.json. The cron refuses to retry (“already-running”). Recovery requires openclaw cron disable <id>; openclaw cron enable <id>. Non-obvious, not documented, took me a while to figure out.
Deprecated config warnings on every restart. The warning “config.modelFallbackPolicy is deprecated and no longer changes runtime behavior” fires dozens of times per startup. If a config key is fully deprecated, warn ONCE and tell me what to do. Don’t flood the log; it dilutes real signal.
Hidden knob problem. The “fix” for the Codex subprocess issue turned out to be an undocumented config value (embeddedHarness.fallback: "pi") that’s mentioned in docs only as “disable this with none.” The positive-case behavior is nowhere. I only found it by reading issue #35220 (closed as “not planned”) and a DeepWiki article. This kind of tribal knowledge is a lurking tax.
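
The stuck runningAtMs case in particular can be detected mechanically instead of by grep. A minimal sketch, assuming cron/jobs-state.json holds a "jobs" mapping of job id to an object with a runningAtMs field in epoch milliseconds. That layout, the function name, and the 30-minute threshold are assumptions for illustration, not documented OpenClaw schema.

```python
import json
from pathlib import Path


def load_state(path):
    """Read the cron state file (layout assumed, see above)."""
    return json.loads(Path(path).read_text())


def find_stuck_jobs(state, now_ms, max_age_ms=30 * 60 * 1000):
    """Return ids of jobs whose runningAtMs flag is older than max_age_ms.

    `state` is assumed to look like:
        {"jobs": {job_id: {"runningAtMs": int or None, ...}}}
    """
    stuck = []
    for job_id, job in state.get("jobs", {}).items():
        started = job.get("runningAtMs")
        if started is not None and now_ms - started > max_age_ms:
            stuck.append(job_id)
    return sorted(stuck)


def recovery_command(job_id):
    """The disable/enable pair quoted above, as one shell line."""
    return f"openclaw cron disable {job_id}; openclaw cron enable {job_id}"
```

Run on a schedule with now_ms=int(time.time() * 1000), this turns a silent “already-running” refusal into a list of job ids plus the commands that clear them.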

That’s five days of my recent life. At 21 agents and ~$235/month infrastructure cost, it’s disproportionate. So I went looking.

The alternatives I looked at

Hatchet (https://hatchet.run)

What it is: Postgres-backed durable workflow engine written in Go. Core primitives: Workflows (DAGs), Steps (retryable units), Events (triggers), Schedules (cron), Workers (language-specific SDKs via gRPC). MIT licensed, production-ready at v0.50+.

Pain-point coverage:

Subprocess crash detection: Yes. Step heartbeats; missed heartbeats mark runs as failed with configurable alerts.
Silent consecutive failures: Yes. Step-level retry policies with backoff, exhaustion fires alerts. Built-in dashboard.
Stuck state: Yes. TTLs on runs, step-level timeouts, no “stuck flag” pattern.
Unbounded sessions: N/A at framework layer (still my app’s problem), but Hatchet’s workflow chunking gives fresh context per step.
Observability: Built-in web dashboard, per-step logs, metrics, run history in Postgres.
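
The heartbeat idea is worth spelling out, because it is exactly what a process-existence watchdog misses. The sketch below is framework-independent and is not Hatchet’s API; HeartbeatMonitor and its method names are hypothetical. It only shows the technique: a run that stops reporting in is treated as failed even if its process is still alive.

```python
import time


class HeartbeatMonitor:
    """Track the last heartbeat per run and report runs that went silent."""

    def __init__(self, timeout_s, clock=time.monotonic):
        self.timeout_s = timeout_s
        self.clock = clock          # injectable so the logic can be tested
        self.last_seen = {}         # run_id -> timestamp of last heartbeat

    def beat(self, run_id):
        """Called by a worker while it is making progress."""
        self.last_seen[run_id] = self.clock()

    def finish(self, run_id):
        """Called when a run completes; stops tracking it."""
        self.last_seen.pop(run_id, None)

    def timed_out(self):
        """Return run ids whose last heartbeat is older than timeout_s."""
        now = self.clock()
        return sorted(
            run_id
            for run_id, seen in self.last_seen.items()
            if now - seen > self.timeout_s
        )
```

Anything returned by timed_out() is a candidate for marking failed and alerting on, which is the behavior the first bullet describes.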

Cost: OSS is free. Self-host on a single VPS: ~500 MB RAM for engine + ~200 MB per worker. Fits my 48 GB setup easily.

SDKs: Python, TypeScript, Go. Never have to write Go.

Surprising limitations I found in research:

Polling workers generate significant Postgres traffic. At ~30+ concurrent workers you want pgBouncer. I’m at 21 agents so this is future-me’s problem.
Default retention keeps all run history. 21 agents × every-10-min = ~3,000 runs/day. Will balloon disk in months unless retention is tuned from day 1.
The dashboard can’t share auth with my Tailscale-gated MoneyMachine dashboard — I’d run two UIs.
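
The retention math is easy to check. A small sketch of the arithmetic; the 90-day window and the 50 KB-per-run figure in the usage note are placeholder assumptions, not measured Hatchet numbers.

```python
def runs_per_day(agents, interval_minutes):
    """Runs per day if every agent fires once per interval."""
    return agents * (24 * 60 // interval_minutes)


def stored_runs(agents, interval_minutes, retention_days):
    """Run records kept on disk for a given retention window."""
    return runs_per_day(agents, interval_minutes) * retention_days


def estimated_disk_mb(runs, kb_per_run):
    """Rough disk estimate given an assumed average record size."""
    return runs * kb_per_run / 1024
```

runs_per_day(21, 10) is 21 × 144 = 3,024, which matches the ~3,000 runs/day above. stored_runs(21, 10, 90) is 272,160 records; at an assumed 50 KB each that is roughly 13 GB, which is why retention wants tuning before the first run rather than after.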

Verdict: This is the right framework if I migrate. Addresses 4 of my 5 pain points directly. Self-host story is clean. License is MIT. Community is active.

Hermes Agent (https://github.com/NousResearch/hermes-agent)

What it is: Python-based agent framework from NousResearch. Wide native provider support (OpenRouter 200+, Nous Portal, NVIDIA NIM, z.ai/GLM, Kimi, MiniMax, OpenAI, custom). SQLite + FTS5 for sessions + JSONL transcripts. Built-in Telegram/Discord/Slack/WhatsApp gateway. Built-in cron scheduler. Auto-session-reset with a save-before-reset turn.

Pain-point coverage:

Unbounded sessions: Directly solves — auto-reset is native with a save-memory hook.
Subprocess crashes: No Codex harness, no subprocess. Removes the whole class of bug.
Multi-provider fallback: Partial. Providers all live in one registry, BUT explicit fallback chains aren’t first-class — it’s manual “switch model” per session or a “smart_model_routing” feature that’s cheap-vs-expensive routing, not primary → fallback chains.
Observability: Thin — /usage, /compress, /insights. No metrics/alerts/dashboard.
Stuck state: Different scheduler, different bugs, but none of OpenClaw’s specific ones.
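
For contrast, here is what a first-class primary → fallback chain means in practice. This is a generic sketch, not Hermes or OpenClaw code; call_with_fallback, AllProvidersFailed, and the provider callables are hypothetical names. The point is that every provider is tried in order and a total failure is loud, carrying every underlying error.

```python
class AllProvidersFailed(Exception):
    """Raised when every provider in the chain has failed."""

    def __init__(self, errors):
        super().__init__("; ".join(f"{name}: {err}" for name, err in errors))
        self.errors = errors  # list of (provider_name, exception)


def call_with_fallback(providers, prompt):
    """Try each (name, callable) in order.

    Returns (name, result) from the first provider that succeeds.
    Raises AllProvidersFailed if none do.
    """
    errors = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except Exception as exc:  # a real version would catch narrower types
            errors.append((name, exc))
    raise AllProvidersFailed(errors)
```

Returning the provider name alongside the result also gives you fallback-usage tracking for free: count how often the name is not the primary.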

Critical open issue (#5563): state.db corruption after extended sessions. That’s the exact pain I’m trying to escape. Hermes has its own version of my OpenClaw problem — they just haven’t fixed it yet either.

Migration cost: 2-4 weeks of full-time porting. Rewrite all 21 agent definitions (SKILL.md format), approval-queue workflow, cron syntax, dashboard data source (SQLite not JSONL), Stripe/Cloudflare/GitHub integration scripts (BYO on Hermes — not built in). No Postgres backend documented.

Verdict: Hermes solves my session-bloat pain cleanly but introduces a comparable pain (state.db corruption). And the migration cost is 3x the cost of patching OpenClaw. Skip.

Hermes Agent Self-Evolution

7 commits on main, no releases, created 2026-03-09. DSPy + GEPA + “Darwinian Evolver” for prompt and code evolution. Outputs submitted as PRs against your hermes-agent repo for human review. Phase 1 implemented; phases 2-5 planned. Research-grade, $2-10/run. Not relevant for my production factory.

Temporal (https://temporal.io)

Battle-tested at Stripe, Netflix, Snap. MIT licensed engine.

The ops complexity is wrong for my single-VPS setup. Self-host needs Cassandra OR Postgres + Elasticsearch + 5 processes (frontend, history, matching, worker, web). Minimum realistic footprint: 3-4 GB RAM, 5+ processes. Free Temporal Cloud tier was killed in 2025. For 21 agents on one VPS, Temporal is operational overkill. Skip.

Inngest (https://www.inngest.com)

Event-driven durable functions with a nice DX. Single-binary self-host exists, but the architecture is serverless/edge-first. No Postgres-native story — uses SQLite or its own store. Works on VPS but I’d be swimming against the current. Skip unless going serverless.

Restate (https://restate.dev)

Rust engine for durable execution, backed by its own embedded log and state store rather than Postgres. Impressive tech but too young (2023), smaller community, SDK churn. Revisit in 12 months.

DBOS Transact (https://github.com/dbos-inc/dbos-transact)

Postgres-native durable workflows, Python/TS SDKs, workflows-as-rows via Postgres transactions. Clever architecture. Optimized for short OLTP-style workflows, not long-running LLM-agent loops. Limited cron story, weaker observability UI. Interesting, wrong fit.

Building it myself on Postgres

~400 lines of Python for a minimum viable durable task queue. tasks table + task_runs table + SELECT FOR UPDATE SKIP LOCKED worker loop + APScheduler for cron + Prometheus exporter for metrics. Saner than Temporal. Less sane than Hatchet. I would re-invent retry backoff, deadlines, a dashboard. Don’t.
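
The worker loop hinges on one query. A minimal sketch of the claim step, assuming a tasks table with id, status, run_at, locked_at, and payload columns (hypothetical names) and any DB-API connection to Postgres, for example one from psycopg. SKIP LOCKED lets concurrent workers each grab a different pending row without blocking on one another.

```python
CLAIM_SQL = """
UPDATE tasks
SET status = 'running', locked_at = now()
WHERE id = (
    SELECT id
    FROM tasks
    WHERE status = 'pending' AND run_at <= now()
    ORDER BY run_at
    LIMIT 1
    FOR UPDATE SKIP LOCKED
)
RETURNING id, payload
"""


def claim_next_task(conn):
    """Atomically claim one due task.

    Returns the (id, payload) row, or None when nothing is due.
    `conn` is any DB-API connection to Postgres.
    """
    cur = conn.cursor()
    try:
        cur.execute(CLAIM_SQL)
        row = cur.fetchone()
        conn.commit()
        return row
    except Exception:
        conn.rollback()
        raise
    finally:
        cur.close()
```

Everything else in the list (retry backoff, deadlines, the dashboard) still has to be built around this, which is the argument against doing it.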

The decision

I’m staying on OpenClaw. Three reasons:

1. The migration cost is 2-3 weeks of zero revenue progress.

I’m at $0 revenue after 7 weeks. Any week spent on infrastructure migration is a week not spent on the actual product flywheel: scout signals → score opportunities → ship products → distribute → sell. Even a clean migration would burn a significant chunk of my remaining runway. The OpenClaw pain is visible and concrete; the cost of migrating is diffuse but just as real. Visible pain tends to win the fight for attention, but the diffuse opportunity cost is the one I can least afford.

2. OpenClaw’s pain points are patchable without migrating.

Session bloat → 200-msg soft cap in agent DIRECTIVES + hourly cron hard cap at 400 msgs. Already done.
Silent consecutive failures → a 150-line Python cron that queries jobs-state.json every 5 minutes and sends Telegram alerts on consecutiveErrors ≥ 10. Already done — caught the Adrian heartbeat bug 15 minutes after I deployed it.
Codex subprocess issues → yank Codex off the three worker agents that don’t need it (Release Engineer, QA Engineer, Builder), keep it only on Adrian where the free ChatGPT-Pro tokens actually matter. Already done.
Stuck cron state → documented in Release Engineer DIRECTIVES with the exact disable/enable recovery commands.
Deprecated config warnings → I’ll live with them.
Undocumented knobs → write public blog posts about them (you’re reading one). Tribal knowledge escapes the tribe eventually.
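
The core of the consecutive-errors monitor fits in a few functions. This is a sketch of the idea, not the 150-line script itself. It assumes jobs-state.json exposes a "jobs" mapping with a consecutiveErrors counter per job (an assumed layout), and it uses the Telegram Bot API sendMessage method for delivery. The function names are hypothetical.

```python
import json
import urllib.parse
import urllib.request

ERROR_THRESHOLD = 10  # alert at consecutiveErrors >= 10


def jobs_over_threshold(state, threshold=ERROR_THRESHOLD):
    """Return [(job_id, count)] for jobs at or above threshold, worst first."""
    hits = [
        (job_id, job.get("consecutiveErrors", 0))
        for job_id, job in state.get("jobs", {}).items()
        if job.get("consecutiveErrors", 0) >= threshold
    ]
    return sorted(hits, key=lambda h: (-h[1], h[0]))


def format_alert(hits):
    """Build a plain-text alert body, one line per failing job."""
    lines = [f"{job_id}: {count} consecutive errors" for job_id, count in hits]
    return "OpenClaw alert\n" + "\n".join(lines)


def send_telegram(token, chat_id, text):
    """POST the alert to Telegram; returns the decoded JSON response."""
    data = urllib.parse.urlencode({"chat_id": chat_id, "text": text}).encode()
    url = f"https://api.telegram.org/bot{token}/sendMessage"
    with urllib.request.urlopen(url, data=data, timeout=10) as resp:
        return json.load(resp)
```

A cron entry that loads the state file every 5 minutes, calls jobs_over_threshold, and sends an alert when the list is non-empty covers the 77-identical-errors scenario. A production version would also want to suppress repeat alerts for the same streak.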

Each of these is cheaper than migration. Together, they’re the work of a couple of focused sessions — not weeks. And once built, each patch continues to pay dividends.
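
As one example of how small these patches are, here is a sketch of an hourly hard cap on session size. It assumes each session is a JSONL file with one message per line and that trimming to the newest 200 messages is acceptable once the 400-message cap is crossed. Both the file format and the trim policy are assumptions for illustration.

```python
from pathlib import Path


def enforce_hard_cap(messages, hard_cap=400, keep=200):
    """If a session exceeds hard_cap messages, keep only the newest `keep`.

    Returns (messages_to_keep, number_dropped).
    """
    if len(messages) <= hard_cap:
        return messages, 0
    dropped = len(messages) - keep
    return messages[-keep:], dropped


def trim_session_file(path, hard_cap=400, keep=200):
    """Apply the cap to a JSONL session file in place; returns lines dropped."""
    p = Path(path)
    lines = p.read_text().splitlines()
    kept, dropped = enforce_hard_cap(lines, hard_cap, keep)
    if dropped:
        tmp = p.with_suffix(p.suffix + ".tmp")
        tmp.write_text("\n".join(kept) + "\n")
        tmp.replace(p)  # rename over the original so readers never see a partial file
    return dropped
```

Applied to the 1,797-message session from the April 20 outage, this would have dropped 1,597 messages and left the newest 200.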

3. Every alternative has its own pain.

Hatchet has Postgres load issues at scale and a separate dashboard auth system. Hermes has state.db corruption. Temporal has ops overhead. Inngest is serverless-first. Restate is immature. DBOS is tuned for short OLTP-style workflows. Building my own is NIH.

The grass isn’t greener. It’s just different grass. If I’m going to deal with pain anyway, I’d rather deal with the pain I already know the shape of.

The escape hatch

I’m not blindly committed. Here’s the specific trigger for pulling the rip cord:

Calendar reminder set for 2026-05-22 (30 days from tonight). On that date, I review:

Dashboard consecutive-error metrics (now live)
Incident log since 2026-04-22
Any new pain I haven’t yet quantified

If OpenClaw has caused another >4-hour incident OR >30 silent consecutive errors in any agent in the intervening month, I migrate — and the target is Hatchet, two-tier:

Hatchet outside, as the durable orchestration layer. It owns: cron scheduling, retries, timeouts, alerts, run history, dashboard.
OpenClaw inside, as the LLM execution layer. It owns: agent session state, model fallback within a single attempt, the actual LLM calls.
Hatchet triggers an HTTP call to OpenClaw for each agent “turn.” OpenClaw becomes a dumb LLM runner. Hatchet’s observability covers the gaps.
Estimated migration cost: ~1 week for this pattern (vs 2-3 weeks for full replacement)
I don’t have to rewrite a single agent’s prompt, memory, or skill
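
To make the split concrete, here is the outer layer’s job reduced to one function: call the runner for a turn, retry with exponential backoff, and fail loudly when attempts run out. In the real design Hatchet’s retry policy would own this, so treat it as an illustration of the responsibility, not Hatchet code. run_turn stands in for the HTTP call to OpenClaw, whose endpoint is not specified here.

```python
import time


def run_turn_with_retries(run_turn, agent, attempts=3, base_delay_s=2.0,
                          sleep=time.sleep):
    """Call run_turn(agent) up to `attempts` times with exponential backoff.

    run_turn is a placeholder for the HTTP call to the LLM runner and is
    expected to raise on failure or timeout. `sleep` is injectable for tests.
    """
    last_exc = None
    for attempt in range(attempts):
        try:
            return run_turn(agent)
        except Exception as exc:
            last_exc = exc
            if attempt < attempts - 1:
                # Delays double each time: base, 2x base, 4x base, ...
                sleep(base_delay_s * 2 ** attempt)
    raise RuntimeError(
        f"{agent}: turn failed after {attempts} attempts"
    ) from last_exc
```

The final RuntimeError is the important part: exhaustion becomes an event the orchestrator can alert on, instead of another silent entry in a log file.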

If OpenClaw’s reliability improves over the next 30 days with the patches I’ve deployed, I extend the review by another 30 days and continue to patch.
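
Written as code, the review rule leaves no room for talking myself out of it on the day. A small sketch; the Incident type and function name are illustrative, while the thresholds are the ones stated above.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    description: str
    hours: float


def review_decision(incidents, max_consecutive_errors,
                    max_hours=4, max_errors=30):
    """Return "migrate" if either trigger fired, else "extend".

    Triggers: any incident longer than max_hours, or any agent with more
    than max_errors silent consecutive errors in the review window.
    """
    long_incident = any(i.hours > max_hours for i in incidents)
    error_streak = max_consecutive_errors > max_errors
    return "migrate" if long_incident or error_streak else "extend"
```

Both April incidents would have tripped it: an 11-hour outage exceeds the 4-hour limit, and 77 consecutive errors exceeds 30.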

What I’d tell you if you’re thinking about the same migration

Don’t migrate during active product work. Every week spent on infrastructure is a week of opportunity cost. If your business is still pre-revenue (like mine), the migration tax compounds faster than the pain tax.

Patch the observability gap first. Most “framework X is broken” complaints, mine included, reduce to “I didn’t know framework X was failing.” Fix the knowing before you fix the framework. It might turn out framework X was fine and you just couldn’t see what it was doing. For me, the consecutive-errors Telegram alert caught three real bugs on day one.

Quantify the pain before you commit to the replacement. “OpenClaw is annoying” is not a migration reason. “OpenClaw caused 2 incidents >4h each in 30 days” is. Write down specific failure patterns and their frequencies. If the count is low, patch. If the count is growing, migrate.

Pick a migration target that addresses your specific pain, not the best-on-paper framework. Temporal is better than Hatchet in the abstract but wrong for a single-VPS setup. Hermes is better at sessions than OpenClaw but worse at observability. Match the tool to the shape of your pain.

Set a specific decision date. Open-ended “I’ll reconsider later” becomes “I’ll reconsider never.” Put a date on your calendar with specific metrics. Show up on that date and actually re-read this post.

I’ll report back on May 22 either way.

— Jeff