← blog.buildwithjz.com

The Day My AI Factory Silently Died for 11 Hours

2026-04-20 · MoneyMachine

The Day My AI Factory Silently Died for 11 Hours

Short version: my AI-agent-staffed “revenue factory” crashed at 09:52 CEST on Easter Sunday, and nobody noticed until 20:49 — almost 11 hours later. I noticed because I happened to open the dashboard to check on something unrelated. Zero customer impact (we don’t have customers yet). Root cause: one AI agent’s conversation history had grown to 1,797 messages (1.8 MB) and nothing was rotating it.

I’m writing this up publicly because (a) every technical postmortem I read when I’m scared and debugging helps me; (b) transparency on the MoneyMachine project means showing the failures, not just the shipping blog posts; and (c) the mistake I made is one that every person building with long-running AI agents is about to make, if they aren’t already.


Context: what the factory does

Since my v6 unveil post, MoneyMachine is a 21-AI-agent “factory” running on a Contabo VPS. Agents scrape Reddit/HN/ProductHunt for pain-point signals, a “CEO” agent named Adrian scores and dispatches work, specialists (Designer, Copywriter, QA Engineer, Legal, SEO, CMO, Release Engineer) review deliverables in parallel, and approved products get deployed to *.buildwithjz.com via Cloudflare Pages. A distribution loop posts back to the signal source (the Reddit thread, the HN comment) so the audience that revealed the demand is the first one served.

This week was Week 3 of the build. Monday Week 1: Release Engineer + QA Engineer shipped. Tuesday Week 2: Designer, Copywriter, CMO, Legal, SEO. Saturday Week 3: Distribution Strategist, Distribution Drafter, Community Engagement, Blog Writer. The plan was to let the factory run overnight and check in Sunday morning to see what the distribution loop produced.

Sunday 20:49, I opened the dashboard and everything looked slightly wrong. The approval queue had a few new items but nothing had moved. The factory event feed was showing my 4 scout-cloud signal captures from 09:10 morning — and then nothing. Dead air for 11 hours.


The cause, in one sentence

One AI agent’s session file grew to 1.8 megabytes (1,797 messages), every 15-minute heartbeat tried to load the whole thing into context, the model provider started timing out, failover attempts cascaded, memory pressure pushed the gateway past its 4 GB limit, and systemd killed it.

That sentence contains four separate bugs layered on top of each other. Let me unpack.

Bug 1: Adrian never rotated sessions

Adrian (my “CEO” agent) runs on a 15-minute heartbeat loop. Every cycle, he:

  • Reads the signals queue
  • Scores opportunities
  • Writes GO/NO-GO decisions
  • Reviews specialist deliverables
  • Dispatches new work
  • Logs a heartbeat summary

Each cycle appends to his session JSONL file. Over weeks, that file accumulated 1,797 messages totaling 1.8 MB. When a cycle starts, the OpenClaw gateway loads the full session history into the model’s context window so Adrian has “memory” of prior decisions.

The problem: GLM-5.1 has a 200K token context. My session file was comfortably under that in tokens, but the active-memory plugin attempts to keep a compacted summary plus the raw tail. As the raw tail grew, compaction stopped being able to keep the total under the safe threshold during tool loops. The gateway started throwing:

Context overflow: estimated context size exceeds safe threshold during tool loop.

Daily. For a week. I didn’t see it because nothing surfaced it.

Bug 2: Compaction didn’t engage

The error has a compactionAttempts=0 field. Translation: the active-memory plugin detected the overflow but didn’t attempt to compact before failing. This is an OpenClaw behavior — I haven’t dug in deep enough yet to say whether it’s a bug or a feature that needed configuration. I’m filing it upstream.

Work-around: I’ve added a directive to Adrian’s DIRECTIVES.md that says: “If your session exceeds 200 messages, write a one-paragraph summary memo of your state, then request a fresh session.” This is a prompt-level fix for an architectural problem — not ideal, but fast.

Bug 3: Fallback chains hid the degradation

OpenClaw supports fallback model chains. If primary (ChatGPT Pro’s gpt-5.3-codex-spark) fails, it tries Ollama Cloud’s GLM-5.1, then OpenRouter’s Claude. Every Adrian cycle this week was burning through the chain: primary timeout → fallback timeout → eventual success on the last resort, burning extra tokens each time.

This is great engineering in theory — the system stays up under provider outages — but it also masks the degradation. When fallbacks succeed, the factory looks healthy. Nobody notices that each Adrian cycle now takes 45 seconds and costs 3× more tokens than it should.

By week’s end, Adrian’s fallback cascade had burned through my ChatGPT Pro weekly usage cap. When I finally restarted the gateway, the first Adrian run came back with:

⚠️ You have hit your ChatGPT usage limit (pro plan). Try again in ~5550 min.

5550 minutes is 92.5 hours. Agents fall back to Ollama Cloud GLM-5.1 until the cap resets Thursday afternoon. Quality will be slightly lower for three days. Not catastrophic.

Bug 4: My watchdog watched the wrong thing

The watchdog script I wrote in Week 0 does these four things:

  1. Checks if openclaw process exists via pgrep
  2. Checks if agent session files are fresh (modified within 2 hours)
  3. Checks gateway memory usage
  4. Counts errors in the last 100 log lines

Here’s the problem: when the gateway crashed at 09:52, systemd restarted it automatically — that’s what systemd does. The new process failed to initialize cleanly. But the PID existed (systemd’s parent process). pgrep was happy. Session files were technically fresh (09:10 was within the 2-hour window when first checked at 09:10 + 2h = 11:10). Memory was low because the process wasn’t actually running agents.

By the time the 2-hour session freshness check would have tripped at 11:10, the watchdog cron hit the code path but didn’t alert because there were no errors in the current log. The errors were 2 hours old.

I was watching the process. I should have been watching the service contract: “does openclaw cron list come back in under 10 seconds?”


The fix

Immediate (what I did tonight)

  1. Upgraded OpenClaw 2026.4.15-beta.1 → 2026.4.19-beta.2. The 0.04 release fixed a regression where ollama-cloud and openrouter providers sometimes failed to register on startup. That was the top error category in the logs.
  2. Archived the two bloated sessions (1,797 messages and 1,373 messages respectively). Adrian will start fresh on the next cycle.
  3. Restarted the gateway. 77-second startup, all 20+ cron agents registered cleanly, model fallback chains working.
  4. Wrote a proper incident report so this doesn’t get hand-waved away (INCIDENT-2026-04-20-gateway-down.md in the repo).

This week

  1. Gateway responsiveness check. The watchdog will openclaw cron list --timeout 10s instead of pgrep. Three consecutive failures → Telegram urgent.
  2. Context-overflow Telegram alert. First occurrence of “Context overflow: estimated context size exceeds safe threshold” in the gateway log → Telegram, priority high.
  3. Session rotation in Adrian’s DIRECTIVES. 200-message cap before Adrian must summarize and rotate.

Week 4

  1. Session-length metric on dashboard. Per-agent session size. Amber at 500 messages, red at 1,000, link to archive.
  2. Fallback-chain alerts. Every time a fallback fires, log it. If any agent is falling back >5× in a day, alert — quality is degrading even if the factory “works.”
  3. ChatGPT Pro cap monitoring. Revenue Ops tracks cap burn rate; alerts at 50%, 80%, 95%.
  4. Decommission GBrain or fix it. Every day my gateway logs say MCP error -32000: Connection closed on GBrain startup. It’s not blocking anything, but it’s noise that hides signal. Either fix the startup race upstream or swap to Postgres-backed memory.

The lessons

1. AI agent sessions are ticking time bombs

Every long-running agent loop appends. You must bound it or it bounds you. This isn’t a new idea in systems engineering — log rotation solved it in 1970s Unix. But AI-agent frameworks are young, and the “session as infinite context” model is seductive. It works in the demo. It kills you in week three.

Specific guidance: if you’re building with OpenClaw, Codex, or any agent framework, check your session file sizes weekly. If any session has >500 messages, investigate why it hasn’t rotated.

2. Monitor the contract, not the implementation

My watchdog was checking whether a process existed. I should have been checking whether the service delivered the contract it promises to deliver: “ask me for a cron list and I’ll return one in under 10 seconds.”

Process-level checks are fine for “is anything running.” Service-level checks are required for “is the right thing running.” A dead gateway with a zombie PID looked identical to a healthy gateway by my check.

3. Silent outages compound

This outage cost me ~11 hours of factory time. In revenue terms: $0 (we’re pre-revenue). In velocity terms: the distribution loop — the Week 3 work that was supposed to start producing real output Sunday morning — is now 24 hours behind. That’s the real cost of slow detection: the factory falling further behind its plan.

Applied rule: every new subsystem gets a liveness contract defined before it goes live. “This is what alive looks like. Here’s how to test it. Here’s what triggers Telegram.”

4. Fallbacks are dangerous when they work

A fallback that succeeds silently tells you the system is healthy. That’s wrong. A fallback that succeeds tells you the primary failed, you’re running on a backup, quality is probably worse, and you’re burning more tokens. Every successful fallback should log a yellow event, not a green one.

5. “Builder-in-public” has teeth

I’m writing this Sunday evening, 2 hours after I found the outage. My factory is public. People can see it on GitHub. People can see this blog post. Every failure becomes visible. That’s uncomfortable. It’s also the only way to build real trust — with readers, customers, and future agents who’ll be trained on content like this.

Tomorrow the factory goes back to work. Adrian will scan signals, score them, dispatch specialists, and by Tuesday we should have a real distribution-loop product shipped and posted back to its source Reddit thread. I’ll blog about that too.

If you’re running an agent factory and you hit the same wall — session bloat, silent cascade failures, watchdog watching the wrong thing — my DMs are open. Tell me your story. Maybe our bugs are related.

— Jeff


P.S. to Zachary: yes, I’m turning off the laptop now. Yes, I know it’s 9 PM and we’re in Europe. I’ll walk the dogs too.

P.P.S. — to any OpenClaw maintainers reading: filing an issue tomorrow about compactionAttempts=0 during context overflow, and the provider-registration race in 2026.4.15-beta.1. Thanks for the 2026.4.19-beta.2 fix.


Back to index