Two Repos, One Deploy Script, and Zero Single Points of Failure: Making an AI Agent Swarm Resilient
2026-03-08 · MoneyMachine
Date: March 8, 2026
Author: Jeff (written with AI assistance)
Project: MoneyMachine (building an autonomous revenue-generating agent swarm, in public)
The Wake-Up Call
Day 1 was about getting agents running. Day 2 started with a more sobering question: What happens if this server dies?
We’re running on a single Contabo VPS. Six agents, a gateway, a dashboard, cron jobs, session histories, workspace outputs — everything on one box. Contabo does daily backups, but that’s a black box we don’t control. If we need to migrate to a new server (Contabo goes down, we want to scale up, we move providers), how long would it take?
Before today: probably a full day of manual setup, assuming I remembered every configuration detail. After today: about 20 minutes, most of it waiting for apt install to finish.
This post is about the infrastructure and data decisions we made on Day 2 to go from “fingers crossed” to “we can be back online from any machine in under an hour.”
The Architecture Decision: Two Repos
The first decision was how to structure what goes into version control. We had two very different categories of data:
Infrastructure (changes rarely, matters a lot):
- OpenClaw configuration (templates, not secrets)
- Agent definitions (SOUL.md, DIRECTIVES.md, IDENTITY.md, etc.)
- Cron job definitions
- Dashboard source code
- Systemd service files
- Deploy script
Operational data (changes constantly, matters for continuity):
- Session transcripts (JSONL files, dozens per day)
- Workspace outputs (reports, drafts, scorecards, site builds)
- Activity logs
- Cron run history
- Memory databases (SQLite)
- Gateway logs
Putting both in one repo would be messy — infrastructure changes would be buried in constant data commits. Reviewing a dashboard code change would mean scrolling past 50 session JSONL updates.
Solution: Two repos.
openclaw-ops — The Infrastructure Playbook
This is the “how to rebuild everything” repo. It contains:
openclaw-ops/
  deploy.sh                     # Single-command server setup
  config/
    openclaw.json.template      # Config with <<PLACEHOLDER>> secrets
    cron/jobs.json              # Cron job definitions
    systemd/                    # Service files for gateway, dashboard, preview
  agents/
    main/                       # Adrian's workspace definitions
    scout/                      # Scout's directives, tools, identity
    revenue-ops/                # Revenue Ops definitions
    domain-analyst/             # Domain Analyst definitions
    site-builder/               # Site Builder definitions
    content-writer/             # Content Writer definitions
  dashboard/                    # Full dashboard source (server.ts, HTML, JS, CSS)
  blog/                         # Public devlog (you're reading it)
The key file is openclaw.json.template — a full copy of the production config with every secret replaced by a placeholder like <<REPLACE_WITH_TELEGRAM_BOT_TOKEN>>. During deploy, you fill in your actual secrets and the system comes up.
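To make the placeholder idea concrete, here is a minimal sketch of how the template could be filled in from shell variables. It is not the actual deploy code. The function name render_config and the rule that <<REPLACE_WITH_NAME>> is filled from a variable called NAME are assumptions made for illustration.

```shell
# render_config: read a template on stdin, write the filled-in config to stdout.
# Assumption: a placeholder <<REPLACE_WITH_NAME>> is filled from the shell
# variable NAME. That naming rule is illustrative, not an OpenClaw feature.
render_config() (
  template=$(cat)
  # Collect the distinct placeholder names that appear in the template.
  names=$(printf '%s\n' "$template" |
    grep -o '<<REPLACE_WITH_[A-Z_][A-Z0-9_]*>>' |
    sed 's/^<<REPLACE_WITH_//; s/>>$//' | sort -u)
  for name in $names; do
    eval "value=\${$name-}"
    if [ -z "$value" ]; then
      echo "missing secret: $name" >&2
      exit 1   # leaves the subshell, so the function returns 1
    fi
    # Escape \, & and | so sed treats the secret as literal replacement text.
    escaped=$(printf '%s' "$value" | sed 's/[\\&|]/\\&/g')
    template=$(printf '%s\n' "$template" |
      sed "s|<<REPLACE_WITH_${name}>>|${escaped}|g")
  done
  printf '%s\n' "$template"
)
```

With TELEGRAM_BOT_TOKEN set in the shell, something like render_config < config/openclaw.json.template > openclaw.json would produce a usable config (the output path here is hypothetical). Failing loudly on a missing secret is deliberate: a config that still contains a placeholder should never reach the gateway. Secrets containing double quotes or backslashes would additionally need JSON escaping, which this sketch does not do.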
openclaw-data — The Operational Logbook
This is the “everything that happened” repo. It contains:
openclaw-data/
  sync-data.sh                  # Pull latest data from live system
  restore-data.sh               # Push data back after migration
  auto-sync.sh                  # Cron-driven auto-sync
  sessions/
    main/*.jsonl                # Adrian's session transcripts
    scout/*.jsonl               # Scout's transcripts
    ...
  workspaces/                   # Agent output files
  activity/                     # activity.jsonl, metrics.json
  cron/                         # jobs.json + run history
  memory/                       # SQLite databases
  logs/                         # Gateway log tail
This repo syncs automatically every 6 hours via cron. It captures a snapshot of the entire operational state — every session, every file an agent created, every cron run.
The Deploy Script: From Zero to Agents in One Command
The crown jewel of the infrastructure repo is deploy.sh. It takes a fresh Ubuntu server and sets up the entire system:
- Creates users: jeff (admin) and agentops (agent runtime)
- Installs dependencies: Node.js, Bun, OpenClaw, Wrangler, Tailscale
- Deploys agent definitions: copies workspace files from the repo to the correct paths
- Registers agents: runs openclaw agents add for each agent
- Sets up services: installs systemd units for gateway, dashboard, preview server
- Configures permissions: sets ACLs so the dashboard can read agent data
- Starts everything: gateway, dashboard, cron jobs
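The steps above lend themselves to small, rerunnable functions. The sketch below is not the real deploy.sh. It only shows the shape we mean: a dry-run wrapper, a user-creation step that is safe to repeat, and the agent registration loop. The helper names (run, ensure_user, register_agents) are hypothetical, and passing the agent name as a positional argument to openclaw agents add is an assumption about the CLI.

```shell
# Agent names taken from the agents/ directory of the ops repo.
AGENTS="main scout revenue-ops domain-analyst site-builder content-writer"

# run: execute a command, or only print it when DRY_RUN=1 (the default here).
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# ensure_user: create a user only if it does not exist yet, so reruns are safe.
ensure_user() {
  if id "$1" >/dev/null 2>&1; then
    echo "user $1 already exists"
  else
    run useradd --create-home "$1"
  fi
}

# register_agents: one registration call per agent.
# Assumption: the agent name is passed as a positional argument.
register_agents() {
  for agent in $AGENTS; do
    run openclaw agents add "$agent"
  done
}
```

Calling register_agents with the default DRY_RUN=1 only prints the six commands, which is a cheap way to review a deploy before letting it touch a fresh server. Setting DRY_RUN=0 makes run execute them for real.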
The missing piece is secrets: the OpenAI OAuth tokens, Telegram bot token, OpenRouter API key, and gateway auth token. These are entered manually once (we’re not committing secrets to GitHub, ever) and the rest is automated.
Restore from Data Repo
After deploy.sh runs and the infrastructure is up, restore-data.sh from the data repo brings back all operational state:
- Session transcripts (so agents have memory continuity)
- Workspace files (so work-in-progress isn’t lost)
- Activity logs and metrics
- Memory databases
The combination means: deploy.sh gets the system running, restore-data.sh gets it back to where it was.
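At its core, the restore step is a directory copy from the data repo back into the live tree. A minimal sketch, assuming the destination paths are passed in by the caller; the function name restore_tree is hypothetical, and the real restore-data.sh may do more (ownership fixes, service restarts).

```shell
# restore_tree SRC DEST: copy one directory from the data repo into the live tree.
# Both paths come from the caller; this sketch does not assume the server layout.
restore_tree() (
  src=$1
  dest=$2
  [ -d "$src" ] || { echo "nothing to restore from $src" >&2; exit 1; }
  mkdir -p "$dest"
  # "src/." copies the directory contents, dotfiles included; -p keeps modes and mtimes.
  cp -Rp "$src/." "$dest/"
)
```

A restore would then be one call per category, for example restore_tree sessions "$LIVE_DIR/sessions", where LIVE_DIR stands in for wherever the runtime keeps its state. The SQLite memory databases are best restored while the gateway is stopped, so nothing writes to them mid-copy.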
The Auto-Sync: Git as a Backup System
We added a cron job that runs every 6 hours:
0 */6 * * * /home/jeff/openclaw-data/auto-sync.sh
The script:
- Runs sync-data.sh to pull latest data from the live system
- Checks if anything changed (git status)
- If changes: git add -A && git commit && git push
- If no changes: exits silently
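Those four steps fit in a few lines of shell. This is a sketch of the logic rather than a copy of auto-sync.sh. The commit message format and the explicit origin HEAD push target are our choices for the example.

```shell
# auto_sync REPO_DIR: commit and push only when the working tree actually changed.
auto_sync() (
  cd "$1" || exit 1
  # Step 1: refresh the repo from the live system. Skipped when the script is
  # absent, which keeps this sketch runnable in an empty test repo.
  if [ -x ./sync-data.sh ]; then
    ./sync-data.sh || exit 1
  fi
  # Step 2: --porcelain prints one line per changed or untracked file,
  # and nothing at all when the tree is clean.
  if [ -z "$(git status --porcelain)" ]; then
    exit 0   # no changes: exit silently
  fi
  # Step 3: stage everything, commit with a UTC timestamp, push the current branch.
  git add -A &&
    git commit -q -m "auto-sync $(date -u +%Y-%m-%dT%H:%M:%SZ)" &&
    git push -q origin HEAD
)
```

Using git status --porcelain instead of the human-readable output matters for a cron job: its format is stable across Git versions, so the emptiness check does not depend on message wording.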
This gives us a versioned, off-site backup with 6-hour granularity. We can:
- See exactly what every agent did at any point in time
- Diff workspace changes between syncs
- Recover from data loss by checking out any previous commit
- Track data growth and storage patterns
The dashboard also shows the last sync time and git commit hash, so we always know how fresh our backup is.
Eliminating Single Points of Failure
The ThinkPad Problem
Three of our agents (Scout, Domain Analyst, Content Writer) originally ran on local Ollama models hosted on a ThinkPad P16 laptop. The laptop travels with us across Europe. It goes offline during transit, Windows updates, Wi-Fi dead zones, or when someone accidentally closes the lid.
When the ThinkPad is offline, those agents can’t run. Period. The cron scheduler fires, the gateway tries to reach Ollama at http://100.71.94.42:11434, gets a connection timeout, and the session fails.
Fix: Fallback models.
OpenClaw supports fallback model configuration:
{
  "agents": {
    "defaults": {
      "model": {
        "primary": "openai-codex/gpt-5.3-codex",
        "fallbacks": ["openrouter/deepseek/deepseek-v3.2"]
      }
    }
  }
}
Now when Scout’s primary model (local qwen3:32b) is unreachable, the system automatically falls back to DeepSeek V3.2 via OpenRouter. Scout keeps working — we just pay a small API cost (~$0.25/M tokens) until the laptop reconnects.
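Conceptually, fallback is "try the candidates in order and take the first one that answers". The sketch below illustrates that idea only. It says nothing about how OpenClaw implements it internally. pick_model and the probe callback are hypothetical names, and the ollama/qwen3:32b identifier in the usage note is a guess at how the local model might be written.

```shell
# pick_model PROBE MODEL...: print the first model for which "PROBE MODEL" succeeds.
# PROBE is any command or function name; it stands in for a real reachability check.
pick_model() {
  probe=$1
  shift
  for model in "$@"; do
    if "$probe" "$model"; then
      printf '%s\n' "$model"
      return 0
    fi
  done
  echo "no model reachable" >&2
  return 1
}
```

Called as pick_model my_probe ollama/qwen3:32b openrouter/deepseek/deepseek-v3.2, it returns the local model while the ThinkPad is up and the OpenRouter one otherwise. A probe for the Ollama side could be as simple as curl -fsS --max-time 5 http://100.71.94.42:11434/api/tags, which lists the installed models and fails fast when the laptop is offline.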
After the model migration (previous post), only Scout still depends on local Ollama. The other four non-Adrian agents all use cloud APIs (OpenRouter or ChatGPT Pro). The ThinkPad going offline now affects exactly one agent, and even that one has a fallback.
The VPS Problem
If the Contabo VPS goes down, everything stops. Our mitigation:
- GitHub backup — both repos are pushed to GitHub. The infrastructure can be rebuilt anywhere.
- Deploy script — a new server can be provisioned and configured in under an hour.
- Data restore — operational state can be recovered up to the last 6-hour sync.
- No vendor lock-in — OpenClaw runs on any Linux box. The deploy script works on any Ubuntu system.
The maximum data loss window is 6 hours (the sync interval). We could reduce this to 1 hour if needed, but for a $0-revenue project, 6 hours is acceptable.
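The 6-hour window is easy to check mechanically. A small sketch, assuming the age of the last commit in the data repo is a reasonable proxy for backup freshness. The names MAX_AGE_SECONDS and backup_is_fresh are made up for the example.

```shell
# Maximum acceptable backup age: 6 hours, matching the sync interval.
MAX_AGE_SECONDS=$((6 * 60 * 60))

# backup_is_fresh NOW LAST_SYNC: both arguments are Unix timestamps in seconds.
# Succeeds (exit 0) when the last sync is no older than MAX_AGE_SECONDS.
backup_is_fresh() {
  age=$(( $1 - $2 ))
  [ "$age" -le "$MAX_AGE_SECONDS" ]
}
```

On the server this could run as backup_is_fresh "$(date +%s)" "$(git -C /home/jeff/openclaw-data log -1 --format=%ct)" || echo "backup is stale". One caveat: auto-sync only commits when something changed, so an old commit on a quiet day is not necessarily a stale backup. Shrinking the window to 1 hour would mean changing the cron schedule to 0 * * * * and MAX_AGE_SECONDS to 3600.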
The Provider Problem
What if OpenRouter goes down? What if ChatGPT Pro has an outage?
- Adrian and Site Builder (ChatGPT Pro) → global fallback to OpenRouter/DeepSeek
- Revenue Ops and Domain Analyst (OpenRouter/DeepSeek) → could fall back to local Ollama if the ThinkPad is online
- Content Writer (OpenRouter/Gemini) → could fall back to DeepSeek
- Scout (local Ollama) → fallback to OpenRouter/DeepSeek
No single provider outage takes out all agents. The worst case (both ChatGPT Pro AND OpenRouter down simultaneously) would stop everyone except Scout on local Ollama — but that scenario is extremely unlikely.
What We Learned About Infrastructure Resilience
1. Separate infrastructure from data.
Infrastructure changes are reviewed, tested, and deployed deliberately. Data changes happen continuously and automatically. Mixing them in one repo creates noise that hides signal. Two repos, two cadences, two purposes.
2. Automate the deploy, not just the code.
Having source code in GitHub is table stakes. Having a deploy.sh that takes you from bare metal to running system — that’s what actually saves you during an outage. Every manual step you don’t automate is a step you’ll forget at 3 AM.
3. Back up operational state, not just code.
Agent session transcripts are irreplaceable context. If Scout loses its conversation history, it starts every research scan from scratch. If Adrian loses his session context, he doesn’t remember what he’s already reviewed. The data repo preserves continuity across server migrations.
4. Fallbacks are cheap insurance.
Setting fallbacks: ["openrouter/deepseek/deepseek-v3.2"] in the config costs nothing when it’s not triggered and saves an entire agent’s productivity when it is. The 5 minutes it took to configure is worth days of potential downtime avoided.
5. Design for the nomad case.
We’re traveling full-time. Our laptop goes offline regularly. Our VPS could need migration at any moment. The system needs to survive the messiness of real life — intermittent connectivity, timezone changes, hardware traveling through airport security. Every component should degrade gracefully, not fail catastrophically.
The Current Architecture
After Day 2, here’s what our resilience picture looks like:
                ┌─────────────────────────┐
                │  GitHub (backup)        │
                │  openclaw-ops (infra)   │
                │  openclaw-data (state)  │
                └─────────┬───────────────┘
                          │ auto-sync every 6h
                ┌─────────┴───────────────┐
                │  Contabo VPS            │
                │  ┌──────────────────┐   │
                │  │ OpenClaw Gateway │   │
                │  │ Adrian (Codex)   │   │
                │  │ Site Builder     │   │
                │  └──────────────────┘   │
                │  ┌──────────────────┐   │
                │  │ Dashboard :3333  │   │
                │  │ Preview :4444    │   │
                │  └──────────────────┘   │
                └─────────┬───────────────┘
                          │
        ┌─────────────────┼─────────────────┐
        │                 │                 │
┌───────┴──────┐   ┌──────┴───────┐  ┌──────┴───────┐
│  ThinkPad    │   │  OpenRouter  │  │  ChatGPT Pro │
│  Ollama      │   │  (fallback)  │  │  (primary)   │
│  Scout       │   │  DeepSeek    │  │  Adrian      │
│  (primary)   │   │  Gemini      │  │  Site Builder│
└──────────────┘   │  RevOps      │  └──────────────┘
                   │  DomAnalyst  │
                   │  Content     │
                   └──────────────┘
Every component has a backup path. Every agent has a fallback model. Every byte of data gets synced to GitHub. The system can be rebuilt from scratch on a new server.
Is it perfect? No. We still have a 6-hour data loss window. We don’t have multi-region deployment. We’re not running Kubernetes. But for a two-person team running an AI experiment on an $8/month VPS, it’s solid enough to sleep at night.
And that’s the real test of infrastructure resilience: can you close your laptop and go explore a European city without worrying that your agents stopped working?
Today, the answer is yes.
This is Day 2 of building a revenue-generating AI agent swarm in public. See also: Building Observability and The Model Experiment. For project overview, see the README.