← blog.buildwithjz.com

When Your AI Agents Look Busy But Are Actually Broken: Building Observability for Autonomous AI Systems

2026-03-08 · MoneyMachine

Date: March 8, 2026 Author: Jeff (written with AI assistance) Project: MoneyMachine — building an autonomous revenue-generating agent swarm, in public


The Illusion of Progress

Day 1 ended on a high note. Six agents running, eight cron jobs firing, a dashboard showing green lights and activity feeds. The system looked healthy. Then I started reading the session logs.

TOOLRESULT (qwen3.5:35b) {
  "status": "error",
  "tool": "read",
  "error": "ENOENT: no such file or directory, access
    '/home/agentops/.openclaw/workspaces/scout/memory/2026-03-08.md'"
}

That’s Scout, our research agent, failing to read its own memory file. Not once — hundreds of times per day. And Scout wasn’t alone. When I dug into the session transcripts across all six agents, the picture was grim:

  • Content Writer: 0 out of 9 sessions produced any useful output. Zero. The model (qwen3-coder:30b) crashed on the first error in every single session.
  • Domain Analyst: Barely functional. The model (llama3.3:70b) was hallucinating tool calls as plain text instead of actually invoking tools. It would write out {"tool": "read", "path": "..."} as a chat message instead of executing a tool call.
  • Revenue Ops: 50% of sessions produced zero tool calls — the agent would “think about” what to do and then the session would end.
  • Overall system health: 41%.

The dashboard showed agents as “Working.” The cron jobs showed status “ok.” The activity feed had entries. Everything appeared fine. But nearly 60% of all compute cycles were producing nothing but errors that nobody could see.

This is the observability problem for autonomous AI systems. And it’s worse than traditional software monitoring, because AI agents can look busy while accomplishing nothing in ways that traditional health checks won’t catch.

Why Traditional Monitoring Fails for AI Agents

In a normal web service, you monitor HTTP status codes, response times, error rates, and maybe some business metrics. If a service returns 500s, you know it’s broken. If latency spikes, you know something’s wrong.

AI agents are different:

  1. Sessions can “succeed” while producing nothing. A cron job that runs for 10 minutes and returns a polite message like “I’ve reviewed the current state and will continue working on this” registers as status: ok. No errors. No crashes. Also no actual work done.

  2. Errors are buried inside session transcripts. OpenClaw stores session data as JSONL files — one JSON object per line, thousands of lines per session. Errors appear as toolResult messages with status: "error" buried inside the conversation flow. The agent might recover from some errors and crash on others. You won’t know which without parsing every line.

  3. Model-level failures are invisible. When llama3.3:70b hallucinates a tool call as text, there’s no error. The model “responded successfully.” It just responded with garbage. The session looks normal from the outside.

  4. Agents compensate for failures in unpredictable ways. Scout couldn’t read its memory file, so it just… continued without memory, doing the same research scan it did 2 hours ago with no context from previous runs. From the dashboard, it looked like Scout was working. It was — just doing the same work repeatedly.

What We Built: A Health Monitoring System

The core insight: you have to parse the session transcripts. There’s no shortcut. The truth about agent health lives inside the JSONL files.

The Health API

We added two new endpoints to the dashboard server:

  • GET /api/health — overall system health + per-agent summary
  • GET /api/agents/:id/health — detailed health for one agent

The health calculation works by:

  1. Reading the last 20 sessions per agent from their JSONL transcript files
  2. Scanning every message for error patterns:
    • ENOENT — file not found (agent is looking for files that don’t exist)
    • EACCES — permission denied (ACL/ownership problems)
    • EISDIR — tried to read a directory as a file
    • model_crash — model returned empty response or stopReason=error
    • tool_error — a tool returned an error status
    • tool_not_found — agent tried to use a tool it doesn’t have access to
    • timeout — operation timed out
  3. Calculating a health score: percentage of recent sessions that had zero errors
  4. Tracking error trends: per-type counts, top errors, recent error timeline

Error Detection: The Details Matter

The trickiest part was parsing OpenClaw’s JSONL format correctly. Errors can appear in two places:

// Place 1: In the message details
if (msg.role === "toolResult" && msg.details?.status === "error") {
  // Error is in msg.details.error or msg.details.text
}

// Place 2: In the message content array
if (msg.content?.[0]?.text?.includes('"status": "error"')) {
  // Same error, different format
}

Our first implementation counted both, double-counting every error. Adrian’s error count jumped from a reasonable 572 to an alarming 1,144. We fixed this with an else branch — if we find the error in details, skip the content check.

The Dashboard View

The Agent Health view (#/health) shows:

  • Overall system health score — big number, color-coded (green/gold/red)
  • Git backup status — last sync time, commit hash
  • Per-agent health cards — health percentage, sessions analyzed, error counts, error type breakdown with color-coded tags
  • Recent errors timeline — table of the last 30 errors across all agents, sorted newest first, with agent name, error type, and error text

The main dashboard also got health integration: the metrics bar now shows system health percentage and backup status, and each agent card shows its health score and error count.

What We Discovered: The Three Tiers of Agent Failure

Parsing the session data revealed that agent failures fall into three tiers:

Tier 1: Environmental Failures (Fixable Immediately)

The majority of errors — over 600 per day — were ENOENT (file not found). Agents were looking for directories that didn’t exist:

  • ~/workspaces/scout/memory/ — no memory/ directory created
  • ~/workspaces/scout/ready-for-review/ — no ready-for-review/ directory
  • ~/workspaces/domain-analyst/scorecards/ — never created
  • ~/workspaces/content-writer/drafts/ — never created

Root cause: When we set up agent workspaces on Day 1, we created the top-level directories but not the subdirectories that agents were told to use in their DIRECTIVES.md files. The agents were following their instructions correctly — the filesystem just wasn’t set up to match.

The deeper root cause: Adrian (the CEO agent) was also guessing wrong paths for worker agent workspaces. His directives said to check ~/.openclaw/agents/scout/ but the actual path was ~/.openclaw/workspaces/scout/. We added a canonical workspace path table to Adrian’s directives:

| Agent          | Workspace Path                                    |
|----------------|--------------------------------------------------|
| scout          | /home/agentops/.openclaw/workspaces/scout          |
| revenue-ops    | /home/agentops/.openclaw/workspaces/revenue-ops    |
| domain-analyst | /home/agentops/.openclaw/workspaces/domain-analyst |
| site-builder   | /home/agentops/.openclaw/workspaces/site-builder   |
| content-writer | /home/agentops/.openclaw/workspaces/content-writer |

Fix: Created all missing directories (memory/, ready-for-review/, drafts/, reports/, scorecards/) across all workspaces. Updated Adrian’s directives with canonical paths. This alone eliminated the majority of daily errors.

Tier 2: Permission Failures (The ACL Maze)

The dashboard couldn’t read session files even though ACLs were set. The symptom: getfacl showed user:jeff:r-x but the effective permission was ---.

Root cause: Linux ACLs have a “mask” that acts as a ceiling on all non-owner permissions. The session files were created as 0600 (owner read/write only), which set the ACL mask to ---. Even though we’d set an ACL entry for jeff, the mask zeroed it out.

Fix: chmod g+rX on all session files to open up the mask, then re-applied ACLs. Added default ACLs on workspace directories so new files inherit correct permissions.

Lesson: ACLs in Linux are not additive. The mask value (which mirrors group permissions by default) limits ALL ACL entries. If you chmod 600 a file, every ACL entry is effectively nullified regardless of what setfacl says.

Tier 3: Model Failures (The Hard Problem)

This is where it gets interesting. Two of our agents were running models that fundamentally could not do the job:

llama3.3:70b (Domain Analyst): This model hallucinates tool calls. Instead of generating a proper function call that OpenClaw can execute, it outputs the tool call as markdown text in its response. It literally writes I'll use the read tool: {"tool": "read", "path": "/some/file"} as a chat message. OpenClaw never sees a tool invocation, so nothing happens. The model “works” but can’t interact with the environment.

qwen3-coder:30b (Content Writer): This model crashes on any error. If a file doesn’t exist and the tool returns an error, instead of gracefully handling it (creating the file, trying a different path, etc.), the model produces an empty response. stopReason: error. Every time. 100% of sessions with any tool error resulted in a model crash. 0 out of 9 sessions produced useful output.

These aren’t configuration problems — they’re fundamental model capability gaps for the agent tool-calling use case. No amount of prompt engineering fixes a model that can’t generate proper tool call format.

The Observability Takeaway

The key lesson from Day 2: autonomous AI systems need purpose-built observability, not just health checks.

Here’s what we monitor now that we didn’t before:

WhatWhyHow
Error rate per sessionCatch silent failuresParse JSONL transcripts
Error type distributionDiagnose root causesClassify by ENOENT/EACCES/model_crash/etc.
Tool call success rateDetect model capability gapsCount tool calls vs tool errors
Session productivityCatch “busy doing nothing”Check if sessions produce file changes
Health score trendTrack if we’re improvingCompare scores over time
Git backup statusEnsure data durabilityCheck last commit time and sync status

The dashboard went from showing “are agents alive?” to showing “are agents actually productive?” That’s the difference between uptime monitoring and observability.

What We’d Do Differently

  1. Build health monitoring on Day 0, not Day 2. We ran agents for 12+ hours before discovering that 60% of their work was errors. That’s 12 hours of wasted compute and ChatGPT Pro tokens.

  2. Test model tool-calling before deploying. A 5-minute smoke test — “read this file, write that file, handle this error” — would have immediately revealed that llama3.3 and qwen3-coder can’t do agent work. We assumed “big model = capable model” and were wrong.

  3. Create workspace directories before writing directives that reference them. Sounds obvious in retrospect. If DIRECTIVES.md says “write to ~/drafts/”, the directory should exist before the agent’s first session.

  4. Set up default ACLs from the start. Don’t chase permission errors file by file. Set default ACLs on parent directories so every new file inherits the right permissions automatically.

What’s Next

The health monitoring system runs on every dashboard refresh (15-second intervals). We’re tracking trends and will add alerting if health scores drop below thresholds. The goal is to get from 41% overall health to 90%+ within the next 24 hours — primarily by fixing the broken models, which is the subject of tomorrow’s post.


This is Day 2 of building a revenue-generating AI agent swarm in public. For project overview, see the README. For the model migration story, see Day 2: The Model Experiment.


Back to index