Day 12: First Night Terror — OOM Kill, Timeouts, and an Agent That Can't See

Date: 2026-04-13
Author: Jeff (written with AI assistance from Claude Opus 4.6)
Phase: Incident Response

Waking Up to 30 Telegram Alerts

I opened my phone this morning to a wall of red. Alerts about Adrian had been piling up in Telegram all night:

10:44 PM: “Adrian (CEO) has been silent for 62 minutes”
12:00 AM: “Silent for 137 minutes”
2:01 AM: “Silent for 259 minutes”
7:01 AM: “Silent for 559 minutes”

Mixed in: Scout session timeouts, Marketer EISDIR errors, more Marketer timeouts. Three separate problems, all hitting at once. Welcome to running a fleet of AI agents.

Incident 1: GBrain Killed the Gateway (OOM)

Root cause: Every time an agent session used GBrain’s MCP tools, a new gbrain serve process spawned — and the old one never died. By evening, 24 orphan processes had accumulated, each using ~320MB of RAM. Total: 10.5GB. The Linux OOM killer nuked the entire gateway cgroup.

Impact: Adrian’s 30-minute heartbeat stopped firing. All cron-triggered agent work halted for ~11 hours until systemd auto-restarted the gateway.
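
The alert texts themselves pin down when the heartbeat stopped: subtracting each reported silence from its alert time gives the implied last heartbeat. A small sketch of that arithmetic follows. The calendar dates are an assumption (evening alerts on the day before this post); only the clock times and minute counts come from the Telegram alerts quoted above.

```python
from datetime import datetime, timedelta

# (alert time, minutes of silence reported) taken from the Telegram alerts.
# The calendar dates are assumed; only clock times and minutes are from the post.
ALERTS = [
    (datetime(2026, 4, 12, 22, 44), 62),
    (datetime(2026, 4, 13, 0, 0), 137),
    (datetime(2026, 4, 13, 2, 1), 259),
    (datetime(2026, 4, 13, 7, 1), 559),
]


def last_heartbeat(alert_time, silent_minutes):
    """Return the time of the last heartbeat implied by one alert."""
    return alert_time - timedelta(minutes=silent_minutes)


implied = [last_heartbeat(t, m) for t, m in ALERTS]
for moment in implied:
    print(moment.strftime("%H:%M"))
```

All four alerts land within a minute of 9:42 PM, which is presumably when the gateway went down.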

Fixes applied:

Upgraded GBrain from v0.9.0 to v1.3.1 (latest)
Created /usr/local/bin/gbrain-cleanup — kills orphan processes when count exceeds 2, runs every 10 minutes via root cron
Added MemoryMax=4G and MemoryHigh=3G to the gateway systemd unit, so the kernel throttles the gateway at 3G and OOM-kills within its cgroup at 4G, before a leak can take down other services
No existing GitHub issue for this — we may be the first to hit it at scale with 6 agents sharing one GBrain MCP server
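
A cleanup job of this kind can be sketched in Python. This is not the actual gbrain-cleanup script: the `ps` parsing, the "gbrain serve" match string, and the choice to spare the two newest processes and terminate the older ones are all assumptions made for illustration.

```python
import os
import signal
import subprocess

MAX_PROCS = 2             # threshold from the post: act when the count exceeds 2
PATTERN = "gbrain serve"  # assumed substring of the leaked processes' command line


def parse_ps(output, pattern=PATTERN):
    """Parse `ps -e -o pid= -o etimes= -o args=` output into PIDs, oldest first."""
    matches = []
    for line in output.splitlines():
        parts = line.split(None, 2)
        if len(parts) == 3 and pattern in parts[2]:
            pid, elapsed_seconds = int(parts[0]), int(parts[1])
            matches.append((elapsed_seconds, pid))
    # Largest elapsed time first, so the longest-running process leads.
    return [pid for _, pid in sorted(matches, reverse=True)]


def pids_to_kill(pids_oldest_first, keep=MAX_PROCS):
    """Return the PIDs to terminate, sparing the `keep` newest ones."""
    excess = len(pids_oldest_first) - keep
    return pids_oldest_first[:excess] if excess > 0 else []


def cleanup():
    """Terminate surplus processes; meant to be run from cron as root."""
    ps = subprocess.run(
        ["ps", "-e", "-o", "pid=", "-o", "etimes=", "-o", "args="],
        capture_output=True, text=True, check=True,
    )
    for pid in pids_to_kill(parse_ps(ps.stdout)):
        try:
            os.kill(pid, signal.SIGTERM)
        except ProcessLookupError:
            pass  # the process exited on its own between listing and killing
```

Keeping the selection logic in `pids_to_kill` separate from the `ps` call makes it testable without touching real processes; a cron-invoked script would only need to call `cleanup()`.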

Lesson: MCP servers that spawn child processes are a ticking time bomb in multi-agent setups. You need process lifecycle management. The “thin harness, fat skills” philosophy is elegant until your fat skills leak memory.
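
One generic way to get that lifecycle management is to tie each child process to a scope that always reaps it. The sketch below is a general Python pattern, not how GBrain or the gateway actually spawn processes; `managed_child` and its `grace_seconds` parameter are hypothetical names.

```python
import subprocess
import sys
from contextlib import contextmanager


@contextmanager
def managed_child(argv, grace_seconds=5.0):
    """Start a child process and guarantee it is gone when the block exits."""
    proc = subprocess.Popen(argv)
    try:
        yield proc
    finally:
        if proc.poll() is None:      # child is still running
            proc.terminate()         # ask politely first (SIGTERM on POSIX)
            try:
                proc.wait(timeout=grace_seconds)
            except subprocess.TimeoutExpired:
                proc.kill()          # escalate if it ignores the request
                proc.wait()          # reap it so no zombie is left behind
```

A session would wrap its server in `with managed_child([...]) as proc:` so that the child is terminated when the session ends, including when the session fails with an exception.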

Incident 2: Scout Deep Scan Timeouts

Root cause: The scout-deep-scan cron (the daily 7 AM comprehensive research sweep) had a 20-minute timeout, but the job takes 17-18 minutes even when everything works. qwen3:8b on the Mac Mini M4 is slow on large prompts (80+ signal files to process), so with only 2-3 minutes of headroom, one slightly longer inference pushes the run past the limit.

Meanwhile the regular scout-demand-scan (every 2 hours) had a 30-minute timeout and never timed out.

Fix: Increased scout-deep-scan timeout from 1200s to 1800s to match the regular scan.
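
The margin before and after the change is simple arithmetic. A quick sketch, assuming the 18-minute upper end of the observed range is the worst case (`headroom` is a hypothetical helper, not part of the cron setup):

```python
def headroom(timeout_s, worst_observed_s):
    """Fraction of spare time above the slowest observed run."""
    return (timeout_s - worst_observed_s) / worst_observed_s


WORST_RUN_S = 18 * 60  # upper end of the 17-18 minute range in the post

old = headroom(1200, WORST_RUN_S)  # previous 20-minute timeout
new = headroom(1800, WORST_RUN_S)  # new 30-minute timeout
print(f"old headroom: {old:.0%}, new headroom: {new:.0%}")
```

Under that assumption, the old timeout left about 11% of slack and the new one leaves about 67%.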

Considered but rejected: Moving Scout to Ollama Cloud (GLM-5.1). The $2/mo savings is negligible, and Scout’s demand signals — Reddit/HN scrapes, opportunity scoring, market gap analysis — are competitive intelligence. Keeping it on-premises means that data never leaves our Tailscale mesh.

Incident 3: Marketer EISDIR Spam

Root cause: The Marketer’s tool profile includes read, write, edit, and web tools — but no exec or ls. Its DIRECTIVES.md tells it to “check inbox/ for marketing specs,” but it can’t list directories. So it passes directory paths to read, gets EISDIR, then spawns subagents that also fail the same way. Cascade of wasted tokens.
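
The failure is easy to reproduce outside the agent. The Marketer's read tool is not shown in this post, so the sketch below uses plain Python file I/O on Linux or macOS as a stand-in; `read_or_error_code` is a hypothetical helper.

```python
import errno
import tempfile


def read_or_error_code(path):
    """Return (text, None) on success or (None, errno name) on an OS error."""
    try:
        with open(path, encoding="utf-8") as handle:
            return handle.read(), None
    except OSError as exc:
        return None, errno.errorcode.get(exc.errno, "UNKNOWN")


with tempfile.TemporaryDirectory() as inbox:
    text, code = read_or_error_code(inbox)  # a directory, not a file
    print(code)
```

On Linux the error code is EISDIR: the operating system refuses to read a directory as if it were a file, and retrying with the same path can never succeed.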

The agent had actually learned this lesson and documented it in its own MEMORY.md. But knowing the problem doesn’t help when the fix requires a capability you don’t have.

Fix: Created /usr/local/bin/marketer-index — generates -index.txt files for each key directory (inbox, content, memory, drafts, ready-for-review). Runs every 4 hours via root cron, 5 minutes before the Marketer’s inbox-check cycle. Updated DIRECTIVES.md to tell the agent: “read the index files, never read directories directly.”

This is safer than giving the Marketer shell access. The index files are deterministic and the agent only needs to know what files exist, not execute arbitrary commands.
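
An index generator along these lines can be sketched in a few lines of Python. This is not the actual marketer-index script: `write_index` is a hypothetical helper, and the caller chooses the index file's name and location, since the real naming scheme is not reproduced here.

```python
from pathlib import Path


def write_index(directory, index_path):
    """List the regular files in `directory`, one name per line, in `index_path`."""
    directory = Path(directory)
    index_path = Path(index_path)
    tmp_path = index_path.with_name(index_path.name + ".tmp")
    # Never list the index itself or its temporary file.
    skip = {index_path.resolve(), tmp_path.resolve()}
    names = sorted(
        entry.name
        for entry in directory.iterdir()
        if entry.is_file() and entry.resolve() not in skip
    )
    # Write to a temporary file, then rename, so a reader never sees a
    # half-written index.
    tmp_path.write_text("".join(f"{name}\n" for name in names), encoding="utf-8")
    tmp_path.replace(index_path)
    return names
```

Sorting the names keeps the output deterministic for a given directory state, and the agent only ever needs its existing read tool to consume the result.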

The Dashboard I Built Yesterday… Immediately Proved Its Value

Yesterday’s v5.1 dashboard update added the infrastructure health bar. This morning, the Health view showed exactly what was wrong — GBrain process count, Ollama Cloud connectivity status, and which agents had error sessions. The new Infrastructure view showed all system components at a glance.

If I’d had this dashboard a week ago, I might have caught the GBrain leak before it OOM’d the gateway.

Cost of the Incidents

GBrain OOM: ~11 hours without an Adrian heartbeat, and no worker agent dispatches during that window. Zero token cost (nothing was running), but lost productivity.
Scout timeouts: 3 failed deep scans. ~0 token cost (timeouts happen before completion).
Marketer EISDIR: The real cost. Across dozens of sessions, the Marketer burned tokens on EISDIR retries, failed subagent spawns, and timeout-aborted work. Hard to quantify exactly, but likely $1-3 in Ollama Cloud tokens wasted.

Running Total

Metric	Value
Day	12 (of 365 target)
Products Live	1 (Scholarship Toolkit)
Revenue	$0
Monthly Infra Cost	~$232
Incidents Resolved	3
GBrain Version	0.9.0 -> 1.3.1
Agent Roster	6 active, 0 retired agents visible