Two Repos, One Deploy Script, and Zero Single Points of Failure: Making an AI Agent Swarm Resilient

Date: March 8, 2026
Author: Jeff (written with AI assistance)
Project: MoneyMachine, building an autonomous revenue-generating agent swarm in public

The Wake-Up Call

Day 1 was about getting agents running. Day 2 started with a more sobering question: What happens if this server dies?

We’re running on a single Contabo VPS. Six agents, a gateway, a dashboard, cron jobs, session histories, workspace outputs — everything on one box. Contabo does daily backups, but that’s a black box we don’t control. If we need to migrate to a new server (Contabo goes down, we want to scale up, we move providers), how long would it take?

Before today: probably a full day of manual setup, assuming I remembered every configuration detail. After today: about 20 minutes, most of it waiting for apt install to finish.

This post is about the infrastructure and data decisions we made on Day 2 to go from “fingers crossed” to “we can be back online from any machine in under an hour.”

The Architecture Decision: Two Repos

The first decision was how to structure what goes into version control. We had two very different categories of data:

Infrastructure (changes rarely, matters a lot):

OpenClaw configuration (templates, not secrets)
Agent definitions (SOUL.md, DIRECTIVES.md, IDENTITY.md, etc.)
Cron job definitions
Dashboard source code
Systemd service files
Deploy script

Operational data (changes constantly, matters for continuity):

Session transcripts (JSONL files, dozens per day)
Workspace outputs (reports, drafts, scorecards, site builds)
Activity logs
Cron run history
Memory databases (SQLite)
Gateway logs

Putting both in one repo would be messy — infrastructure changes would be buried in constant data commits. Reviewing a dashboard code change would mean scrolling past 50 session JSONL updates.

Solution: Two repos.

`openclaw-ops` — The Infrastructure Playbook

This is the “how to rebuild everything” repo. It contains:

openclaw-ops/
  deploy.sh                    # Single-command server setup
  config/
    openclaw.json.template     # Config with <<PLACEHOLDER>> secrets
    cron/jobs.json             # Cron job definitions
    systemd/                   # Service files for gateway, dashboard, preview
  agents/
    main/                      # Adrian's workspace definitions
    scout/                     # Scout's directives, tools, identity
    revenue-ops/               # Revenue Ops definitions
    domain-analyst/            # Domain Analyst definitions
    site-builder/              # Site Builder definitions
    content-writer/            # Content Writer definitions
  dashboard/                   # Full dashboard source (server.ts, HTML, JS, CSS)
  blog/                        # Public devlog (you're reading it)

The key file is openclaw.json.template — a full copy of the production config with every secret replaced by a placeholder like <<REPLACE_WITH_TELEGRAM_BOT_TOKEN>>. During deploy, you fill in your actual secrets and the system comes up.
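As a rough illustration of the fill-in step, here is a minimal bash sketch. The helper names fill_placeholder and no_placeholders_left are hypothetical (they are not part of deploy.sh or OpenClaw), and the sketch assumes each secret is a single-line value.

```bash
# Hypothetical helpers for rendering openclaw.json.template; not part of OpenClaw.

# Replace every <<NAME>> on stdin with VALUE and write the result to stdout.
# Usage: fill_placeholder NAME VALUE < template > rendered
fill_placeholder() {
  local name="$1" value="$2" escaped
  # Escape characters that are special in a sed replacement: \, & and the | delimiter.
  escaped=$(printf '%s' "$value" | sed -e 's/[\\&|]/\\&/g')
  sed -e "s|<<${name}>>|${escaped}|g"
}

# Succeed only if stdin contains no remaining <<PLACEHOLDER>> marker.
no_placeholders_left() {
  ! grep -q '<<[A-Z0-9_]*>>'
}
```

A deploy step could chain one fill_placeholder call per secret and refuse to start the gateway while no_placeholders_left still fails on the rendered file, so a forgotten token shows up as a clear error instead of a broken service.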

`openclaw-data` — The Operational Logbook

This is the “everything that happened” repo. It contains:

openclaw-data/
  sync-data.sh                 # Pull latest data from live system
  restore-data.sh              # Push data back after migration
  auto-sync.sh                 # Cron-driven auto-sync
  sessions/
    main/*.jsonl               # Adrian's session transcripts
    scout/*.jsonl              # Scout's transcripts
    ...
  workspaces/                  # Agent output files
  activity/                    # activity.jsonl, metrics.json
  cron/                        # jobs.json + run history
  memory/                      # SQLite databases
  logs/                        # Gateway log tail

This repo syncs automatically every 6 hours via cron. It captures a snapshot of the entire operational state — every session, every file an agent created, every cron run.

The Deploy Script: From Zero to Agents in One Command

The crown jewel of the infrastructure repo is deploy.sh. It takes a fresh Ubuntu server and sets up the entire system:

Creates users — jeff (admin) and agentops (agent runtime)
Installs dependencies — Node.js, Bun, OpenClaw, Wrangler, Tailscale
Deploys agent definitions — copies workspace files from the repo to the correct paths
Registers agents — runs openclaw agents add for each agent
Sets up services — installs systemd units for gateway, dashboard, preview server
Configures permissions — sets ACLs so the dashboard can read agent data
Starts everything — gateway, dashboard, cron jobs

The missing piece is secrets: the OpenAI OAuth tokens, Telegram bot token, OpenRouter API key, and gateway auth token. These are entered manually once (we’re not committing secrets to GitHub, ever) and the rest is automated.
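The real deploy.sh isn’t reproduced here, but the overall shape of a step-by-step script like this can be sketched in bash. Everything below is an assumption for illustration: the helper names run_step, maybe and register_agents are hypothetical, the agent list is taken from the agents/ directory layout, and the exact arguments to openclaw agents add are a guess.

```bash
# Agent names, taken from the agents/ directory in the repo layout.
AGENTS="main scout revenue-ops domain-analyst site-builder content-writer"

# Run one named step: print a marker, and report failure so the caller can stop.
run_step() {
  local name="$1"
  shift
  printf '==> %s\n' "$name"
  "$@" || { printf 'FAILED: %s\n' "$name" >&2; return 1; }
}

# With DRY_RUN=1, print the command instead of executing it.
maybe() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf 'would run: %s\n' "$*"
  else
    "$@"
  fi
}

# Register every agent with the gateway.
register_agents() {
  local agent
  for agent in $AGENTS; do
    # The post names "openclaw agents add"; passing only the agent name is an assumption.
    maybe openclaw agents add "$agent" || return 1
  done
}
```

Running DRY_RUN=1 with run_step "Register agents" register_agents prints the step marker followed by one "would run" line per agent, which is a cheap way to review a deploy before touching a fresh server.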

Restore from Data Repo

After deploy.sh runs and the infrastructure is up, restore-data.sh from the data repo brings back all operational state:

Session transcripts (so agents have memory continuity)
Workspace files (so work-in-progress isn’t lost)
Activity logs and metrics
Memory databases

The combination means: deploy.sh gets the system running, restore-data.sh gets it back to where it was.

The Auto-Sync: Git as a Backup System

We added a cron job that runs every 6 hours:

0 */6 * * * /home/jeff/openclaw-data/auto-sync.sh

The script:

Runs sync-data.sh to pull latest data from the live system
Checks if anything changed (git status)
If changes: git add -A && git commit && git push
If no changes: exits silently
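The commit-only-when-changed logic in these steps can be sketched in bash as follows. This is not the actual auto-sync.sh: the function names, the commit message format and the PUSH switch are assumptions, and the real script first runs sync-data.sh to copy live data into the repo.

```bash
# Return success if the repo at $1 has uncommitted or untracked changes.
has_changes() {
  # --porcelain prints one line per changed path; empty output means a clean tree.
  [ -n "$(git -C "$1" status --porcelain)" ]
}

# Stage and commit everything in the repo at $1 with a UTC timestamp message.
commit_snapshot() {
  git -C "$1" add -A &&
    git -C "$1" commit -q -m "auto-sync $(date -u +%Y-%m-%dT%H:%M:%SZ)"
}

# One sync cycle: commit and push only when something changed, otherwise stay silent.
# Setting PUSH=0 skips the push, which is handy for a local dry run.
sync_once() {
  local repo="$1"
  has_changes "$repo" || return 0
  commit_snapshot "$repo" || return 1
  if [ "${PUSH:-1}" = "1" ]; then
    git -C "$repo" push -q
  fi
}
```

Checking git status --porcelain before committing keeps the history free of empty commits, so every commit in the data repo corresponds to a real change in operational state.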

This gives us a versioned, off-site backup with 6-hour granularity. We can:

See exactly what every agent did at any point in time
Diff workspace changes between syncs
Recover from data loss by checking out any previous commit
Track data growth and storage patterns

The dashboard also shows the last sync time and git commit hash, so we always know how fresh our backup is.
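The dashboard code isn’t shown in this post, but both values can be read straight from the data repo. A minimal bash sketch, with hypothetical function names and a staleness threshold that defaults to the 6-hour sync interval:

```bash
# Print the short hash and ISO 8601 commit time of the newest commit in the repo at $1.
last_sync() {
  git -C "$1" log -1 --format='%h %cI'
}

# Succeed if the newest commit is older than $2 hours (default 6).
# A repo with no commits at all counts as stale.
sync_is_stale() {
  local last now max_age_hours="${2:-6}"
  last=$(git -C "$1" log -1 --format=%ct 2>/dev/null) || return 0
  now=$(date +%s)
  [ $(( now - last )) -gt $(( max_age_hours * 3600 )) ]
}
```

A check like sync_is_stale could back a warning badge next to the commit hash, so a silently failing cron job becomes visible after one missed interval.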

Eliminating Single Points of Failure

The ThinkPad Problem

Three of our agents (Scout, Domain Analyst, Content Writer) originally ran on local Ollama models hosted on a ThinkPad P16 laptop. The laptop travels with us across Europe. It goes offline during transit, Windows updates, Wi-Fi dead zones, or when someone accidentally closes the lid.

When the ThinkPad is offline, those agents can’t run. Period. The cron scheduler fires, the gateway tries to reach Ollama at http://100.71.94.42:11434, gets a connection timeout, and the session fails.

Fix: Fallback models.

OpenClaw supports fallback model configuration:

{
  "agents": {
    "defaults": {
      "model": {
        "primary": "openai-codex/gpt-5.3-codex",
        "fallbacks": ["openrouter/deepseek/deepseek-v3.2"]
      }
    }
  }
}

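The fallback is set once under agents.defaults, so it also covers agents such as Scout that use a different primary. Now when Scout’s primary model (local qwen3:32b) is unreachable, the system automatically falls back to DeepSeek V3.2 via OpenRouter. Scout keeps working; we just pay a small API cost (~$0.25/M tokens) until the laptop reconnects.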
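To make the behavior concrete, here is a small bash sketch of "try the primary, then each fallback in order". It illustrates the idea only and says nothing about how OpenClaw implements it internally; first_available and the caller-supplied check command are hypothetical.

```bash
# Print the first model whose check command succeeds; fail if none does.
# $1: name of a check command that takes a model ID (hypothetical, supplied by the caller)
# remaining arguments: model IDs in priority order, primary first
first_available() {
  local check="$1" model
  shift
  for model in "$@"; do
    if "$check" "$model"; then
      printf '%s\n' "$model"
      return 0
    fi
  done
  return 1
}
```

With a check that treats qwen3:32b as unreachable, first_available my_check qwen3:32b openrouter/deepseek/deepseek-v3.2 prints the DeepSeek ID. When the laptop is back, the same call prints qwen3:32b again, because the primary is always tried first.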

After the model migration (previous post), only Scout still depends on local Ollama. The other four non-Adrian agents all use cloud APIs (OpenRouter or ChatGPT Pro). The ThinkPad going offline now affects exactly one agent, and even that one has a fallback.

The VPS Problem

If the Contabo VPS goes down, everything stops. Our mitigation:

GitHub backup — both repos are pushed to GitHub. The infrastructure can be rebuilt anywhere.
Deploy script — a new server can be provisioned and configured in under an hour.
Data restore — operational state can be recovered up to the last 6-hour sync.
No vendor lock-in — OpenClaw runs on any Linux box. The deploy script works on any Ubuntu system.

The maximum data loss window is 6 hours (the sync interval). We could reduce this to 1 hour if needed, but for a $0-revenue project, 6 hours is acceptable.

The Provider Problem

What if OpenRouter goes down? What if ChatGPT Pro has an outage?

Adrian and Site Builder (ChatGPT Pro) → global fallback to OpenRouter/DeepSeek
Revenue Ops and Domain Analyst (OpenRouter/DeepSeek) → could fall back to local Ollama if the ThinkPad is online
Content Writer (OpenRouter/Gemini) → could fall back to DeepSeek
Scout (local Ollama) → fallback to OpenRouter/DeepSeek

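No single provider outage takes out all agents. The worst case (both ChatGPT Pro AND OpenRouter down simultaneously) would stop everyone except Scout on local Ollama (plus Revenue Ops and Domain Analyst, if their optional Ollama fallback is configured and the ThinkPad is online), but that scenario is extremely unlikely.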
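The routing list above is small enough to check mechanically. The bash sketch below transcribes it into a table and prints which agents still have a reachable provider for a given outage. The provider labels and the surviving_agents function are hypothetical, and the sketch assumes the "could fall back" routes are actually configured, which the list leaves open.

```bash
# agent|primary provider|fallback provider, transcribed from the list above.
ROUTES='adrian|chatgpt|openrouter
site-builder|chatgpt|openrouter
revenue-ops|openrouter|ollama
domain-analyst|openrouter|ollama
content-writer|openrouter|openrouter
scout|ollama|openrouter'

# Print the agents that still have a reachable provider when the
# space-separated providers in $1 are down.
surviving_agents() {
  local down=" $1 "
  printf '%s\n' "$ROUTES" | while IFS='|' read -r agent primary fallback; do
    # Primary still up: the agent survives without needing its fallback.
    case "$down" in *" $primary "*) ;; *) echo "$agent"; continue ;; esac
    # Primary down: the agent survives only if the fallback provider is up.
    case "$down" in *" $fallback "*) ;; *) echo "$agent" ;; esac
  done
}
```

Under those assumptions, surviving_agents "chatgpt openrouter" prints revenue-ops, domain-analyst and scout, and adding ollama to the list (ThinkPad offline) leaves nobody running. The table also shows that Content Writer’s primary and fallback both go through OpenRouter, so an OpenRouter-only outage stops that one agent.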

What We Learned About Infrastructure Resilience

1. Separate infrastructure from data.

Infrastructure changes are reviewed, tested, and deployed deliberately. Data changes happen continuously and automatically. Mixing them in one repo creates noise that hides signal. Two repos, two cadences, two purposes.

2. Automate the deploy, not just the code.

Having source code in GitHub is table stakes. Having a deploy.sh that takes you from bare metal to running system — that’s what actually saves you during an outage. Every manual step you don’t automate is a step you’ll forget at 3 AM.

3. Back up operational state, not just code.

Agent session transcripts are irreplaceable context. If Scout loses its conversation history, it starts every research scan from scratch. If Adrian loses his session context, he doesn’t remember what he’s already reviewed. The data repo preserves continuity across server migrations.

4. Fallbacks are cheap insurance.

Setting fallbacks: ["openrouter/deepseek/deepseek-v3.2"] in the config costs nothing when it’s not triggered and saves an entire agent’s productivity when it is. The 5 minutes it took to configure is worth days of potential downtime avoided.

5. Design for the nomad case.

We’re traveling full-time. Our laptop goes offline regularly. Our VPS could need migration at any moment. The system needs to survive the messiness of real life — intermittent connectivity, timezone changes, hardware traveling through airport security. Every component should degrade gracefully, not fail catastrophically.

The Current Architecture

After Day 2, here’s what our resilience picture looks like:

                    ┌──────────────────────────┐
                    │      GitHub (backup)     │
                    │  openclaw-ops (infra)    │
                    │  openclaw-data (state)   │
                    └─────────┬────────────────┘
                              │ auto-sync every 6h
                    ┌─────────┴────────────────┐
                    │    Contabo VPS           │
                    │  ┌───────────────────┐   │
                    │  │  OpenClaw Gateway │   │
                    │  │  Adrian (Codex)   │   │
                    │  │  Site Builder     │   │
                    │  └───────────────────┘   │
                    │  ┌───────────────────┐   │
                    │  │  Dashboard :3333  │   │
                    │  │  Preview   :4444  │   │
                    │  └───────────────────┘   │
                    └─────────┬────────────────┘
                              │
            ┌─────────────────┼─────────────────┐
            │                 │                 │
    ┌───────┴──────┐  ┌───────┴──────┐  ┌───────┴──────┐
    │  ThinkPad    │  │  OpenRouter  │  │  ChatGPT Pro │
    │  Ollama      │  │  (fallback)  │  │  (primary)   │
    │  Scout       │  │  DeepSeek    │  │  Adrian      │
    │  (primary)   │  │  Gemini      │  │  Site Builder│
    └──────────────┘  │  RevOps      │  └──────────────┘
                      │  DomAnalyst  │
                      │  Content     │
                      └──────────────┘

Every component has a backup path. Every agent has a fallback model. Operational data is synced to GitHub every 6 hours. The system can be rebuilt from scratch on a new server.

Is it perfect? No. We still have a 6-hour data loss window. We don’t have multi-region deployment. We’re not running Kubernetes. But for a two-person team running an AI experiment on an $8/month VPS, it’s solid enough to sleep at night.

And that’s the real test of infrastructure resilience: can you close your laptop and go explore a European city without worrying that your agents stopped working?

Today, the answer is yes.

This is Day 2 of building a revenue-generating AI agent swarm in public. See also: Building Observability and The Model Experiment. For project overview, see the README.