
Skillify on a 21-Agent Factory: Turning Failures Into Structurally Impossible Bugs

2026-04-24 · MoneyMachine

A friend linked me this post from Garry Tan last night at 1 AM. It’s about “skillify” — the practice of turning every agent failure into a permanent, tested skill so the same bug can’t happen twice. It lays out a 10-step checklist that turns ad-hoc prompt fixes into durable infrastructure.

It’s the best piece of agent-engineering writing I’ve read in months. And it applies directly to what we’re building.

This morning we shipped our first two skills end-to-end. 38 unit tests, 19-step E2E smoke test, all 10 gates green on both. Here’s what that looks like in practice on a 21-agent product factory.

The pattern, adapted

Garry’s setup is a personal assistant — one Claude talking to one brain. Ours is different: we have 21 agents in an assembly line (Scout → score → Build → parallel review → Release → QA → Marketing). A skill isn’t just “what should the agent do” — it’s which agent(s) should run it, at which stage.

So we added one field to the contract: applies-to.

---
name: pre-deploy-placeholder-scan
description: Block any product landing page from deploying if it contains
             placeholder text, template variables, or stock-content markers.
applies-to:
  - builder           # MUST run before writing ready-for-review
  - release-engineer  # MUST run before wrangler pages deploy
  - main              # MUST run before approving a product (main = Adrian/CEO)
triggers:
  - "pre-deploy check"
  - "ready to publish"
  - "approve product"
status: active
incident-source: 2026-04-22 — aeo-starter-kit shipped with literal
                 "Replace with checkout link before publish:" for 5 days
---

Now three agents in the chain invoke the same skill. Builder runs it before handing off. Release Engineer runs it before wrangler deploys. Adrian runs it before final approval. Three chances to catch the bug. The skill itself is a deterministic Python regex scan over the landing HTML — no LLM involved — so it’s the same answer every time, regardless of which agent asks.
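Here is a minimal sketch of how a resolver might use applies-to, assuming each SKILL.md's frontmatter has already been parsed into memory. The `Skill` dataclass and the `resolve` function are hypothetical names for illustration, not our actual resolver code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Skill:
    """Parsed SKILL.md frontmatter (hypothetical in-memory shape)."""
    name: str
    applies_to: tuple[str, ...]
    triggers: tuple[str, ...]
    status: str = "active"


def resolve(agent: str, intent: str, skills: list[Skill]) -> list[str]:
    """Return the names of active skills this agent must run for this intent."""
    wanted = intent.strip().lower()
    return [
        s.name
        for s in skills
        if s.status == "active"
        and agent in s.applies_to
        and wanted in (t.lower() for t in s.triggers)
    ]


# Values copied from the frontmatter above.
PLACEHOLDER_SCAN = Skill(
    name="pre-deploy-placeholder-scan",
    applies_to=("builder", "release-engineer", "main"),
    triggers=("pre-deploy check", "ready to publish", "approve product"),
)

print(resolve("builder", "ready to publish", [PLACEHOLDER_SCAN]))
```

The point of the design is that the agent name is part of the lookup key, so the same intent can resolve to different skill sets at different stages of the line.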

Skill #1: pre-deploy-placeholder-scan

The incident that forced this skill: on April 22 I discovered five of our live products had been serving literal "Replace with checkout link before publish:" text for five straight days. Builder had marked them “ready for review.” Release Engineer had deployed. Adrian had approved. QA hadn’t caught it. Nobody caught it because nobody was looking for it — we were eyeballing content.

The fix (after a 90-minute /loop to retro-fix all 5 products) was a deterministic scanner. 10ms per page. Catches:

  • Mustache placeholders: {{CHECKOUT_LINK}}, {{PRICE}}
  • Single-brace all-caps: {SLUG}, {PRICE}
  • Natural-language: “Replace with…”, “Insert … here”, ""
  • Dev markers: TODO, FIXME, XXX, PLACEHOLDER
  • Stock filler: “Lorem ipsum”, “Your product description here”
  • Empty hrefs: href="#" (only if no anchor target exists), href="TODO", href="about:blank"
  • Example domains: example.com in production
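A rough sketch of what a scanner for these categories could look like. The regexes are illustrative guesses rather than the production rules, the `Violation` shape is assumed from the `.context` attribute used in the regression test, and the conditional href="#" rule is left out because it needs anchor-target lookup.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Violation:
    rule: str     # which pattern family matched
    context: str  # the matched text, quoted in the report


# Illustrative patterns only; the real rule set is not reproduced here.
RULES = {
    "mustache": re.compile(r"\{\{\s*[A-Za-z_][A-Za-z0-9_]*\s*\}\}"),
    # Lookarounds keep {{PRICE}} from also counting as a single-brace hit.
    "single-brace": re.compile(r"(?<!\{)\{[A-Z][A-Z0-9_]*\}(?!\})"),
    "natural-language": re.compile(
        r"Replace with[^<\"]*|Insert\b[^<\"]{0,40}\bhere", re.I
    ),
    # Case-sensitive on purpose, so the HTML attribute placeholder="..." passes.
    "dev-marker": re.compile(r"\b(?:TODO|FIXME|XXX|PLACEHOLDER)\b"),
    "stock-filler": re.compile(r"Lorem ipsum|Your product description here", re.I),
    "bad-href": re.compile(r"href=\"(?:TODO|about:blank)\"", re.I),
    "example-domain": re.compile(r"\bexample\.com\b", re.I),
}


def scan_html(html: str) -> list[Violation]:
    """Return every rule match in the page; an empty list means clean."""
    found = []
    for rule, pattern in RULES.items():
        for match in pattern.finditer(html):
            found.append(Violation(rule, match.group(0)))
    return found
```

Because the scan is pure string matching with no model call, the same input always yields the same violations, which is what lets three different agents share it.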

20 unit tests. Includes a regression test pinned to the literal text that shipped on April 22:

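import scan  # scripts/scan.py from this skill; assumes scripts/ is on sys.path


def test_regression_2026_04_22_aeo_starter_kit():
    """The exact HTML that shipped on 2026-04-22 must fail the scan."""
    historical = '<a href="Replace with checkout link before publish:">Buy now</a>'
    violations = scan.scan_html(historical)
    # At least one violation must quote the offending text.
    assert any("Replace with" in v.context for v in violations)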

Integration test against all 5 live products passes today. The five-day bug window is now structurally impossible — any future regression gets caught at three different agent stages.

Skill #2: stripe-delivery-gate

The same April 22 incident had a second, more serious bug: every single Payment Link had after_completion.type: hosted_confirmation with custom_message: null. That means when someone paid €29–€49, Stripe showed them a generic thank-you page and nothing else: no download link, no redirect, no product email. Expected refund rate: 100%.

The retro-fix was wiring Cloudflare Pages Functions to verify Stripe session IDs and serve tokenized ZIP download URLs. But we had no way to prevent a future product from launching without that wiring.

Skill #2 checks three things per product, via Stripe API + HTTP:

  1. Payment Link has after_completion.type == "redirect" (not hosted_confirmation)
  2. Redirect URL matches the canonical pattern: https://<slug>.buildwithjz.com/delivery?session_id={CHECKOUT_SESSION_ID}
  3. The /delivery endpoint at that host returns HTTP 200 (the Pages Function is deployed)
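The three checks can be sketched as a single function over the Payment Link object. This is an approximation under stated assumptions: `check_delivery` and `http_status` are hypothetical names, the dict is assumed to be the Payment Link JSON as returned by the Stripe API (with the redirect URL under after_completion.redirect.url), and the HTTP probe is injected so the logic can be tested against fixtures without touching the network.

```python
import urllib.error
import urllib.request
from typing import Callable

# Canonical redirect pattern; the doubled braces emit Stripe's literal
# {CHECKOUT_SESSION_ID} template token after str.format().
CANONICAL = "https://{slug}.buildwithjz.com/delivery?session_id={{CHECKOUT_SESSION_ID}}"


def http_status(url: str, timeout: float = 10.0) -> int:
    """Live probe: return the HTTP status code, or 0 if the host is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:  # non-2xx responses land here
        return err.code
    except urllib.error.URLError:
        return 0


def check_delivery(slug: str, payment_link: dict,
                   probe: Callable[[str], int] = http_status) -> list[str]:
    """Return failure reasons; an empty list means the gate passes."""
    after = payment_link.get("after_completion") or {}
    if after.get("type") != "redirect":
        # Check 1 failed; there is no redirect URL to inspect further.
        return [f"after_completion.type is {after.get('type')!r}, expected 'redirect'"]

    failures = []
    url = (after.get("redirect") or {}).get("url", "")
    expected = CANONICAL.format(slug=slug)
    if url != expected:  # check 2
        failures.append(f"redirect url {url!r} != {expected!r}")

    status = probe(f"https://{slug}.buildwithjz.com/delivery")  # check 3
    if status != 200:
        failures.append(f"/delivery returned HTTP {status}")
    return failures
```

In unit tests, pass `probe=lambda url: 200` (or 404) together with a fixture dict; only the integration run needs the default live probe.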

18 unit tests, 4 fixture files (clean + three flavors of dirty). The dirty fixtures replicate the exact April 22 bug shapes. Integration test against all 5 live products — all pass today.

The 10-step checklist in practice

Garry’s checklist, and what each step looks like on our setup:

Step | What it means for us
--- | ---
1. SKILL.md | Contract in frontmatter. applies-to, triggers, hard rules.
2. Deterministic code | Python scripts under scripts/. No LLM calls.
3. Unit tests | pytest. 20 + 18 tests for the two skills, running in 1.5s total.
4. Integration tests | Shell scripts hitting live Stripe API + live *.buildwithjz.com URLs.
5. LLM evals | YAML files with must_invoke_script + output-contains assertions. Harness reads agent session transcripts and asserts the skill was actually used.
6. Resolver trigger | resolver.yaml maps intents to skills. Dashboard will eventually surface this.
7. Resolver eval | Given intent X, does the agent pick skill Y? (Phase 2 for us.)
8. check-resolvable + DRY audit | Meta-test walks every SKILL.md → resolver → script → tests. Catches orphan skills, broken triggers, lane collisions.
9. E2E smoke | bash skills/scripts/e2e-smoke.sh. 19 gates, positive and negative cases, fixtures and live. All green today.
10. Brain filing | Check results append to approvals/delivery-gate-<slug>-<iso>.json. Adrian’s approval checklist mandates reading it.
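For step 5, the core assertion is small. A minimal sketch, assuming a session transcript is available as plain text; `eval_transcript` and its parameter names are hypothetical, loosely mirroring the must_invoke_script and output-contains keys in the eval YAML.

```python
def eval_transcript(transcript: str, must_invoke_script: str,
                    output_contains: list[str]) -> list[str]:
    """Return failed assertions; an empty list means the eval passes."""
    failures = []
    # The agent must have actually run the script, not reimplemented the check.
    if must_invoke_script not in transcript:
        failures.append(f"script never invoked: {must_invoke_script}")
    # The script's output must also appear in the session.
    for needle in output_contains:
        if needle not in transcript:
            failures.append(f"missing expected output: {needle!r}")
    return failures
```

Substring matching is crude, but it keeps the eval itself deterministic: the only thing being judged is whether the script call and its output appear in the session.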

One more change from Garry’s setup: we added a “lanes” concept to avoid DRY violations between legitimately parallel gates. Two skills may share a trigger (“ready to publish”) as long as they block disjoint lanes (landing-html-quality vs stripe-delivery-wired). They fire in parallel on the same intent, not in competition.

How this connects to DIRECTIVES

Every agent in our system has a DIRECTIVES.md, its operating manual. The trap we kept falling into before skillify was writing directives like “make sure the landing page has no placeholders.” That’s judgment in latent space. Small models drift. Even GPT-5.4 drifts. The April 22 incident happened despite a clear directive.

With skills, the directive becomes:

Before marking ready: run python3 ~/.openclaw/skills/pre-deploy-placeholder-scan/scripts/scan.py --path <landing>. If it exits non-zero, fix the violations and re-run. Do not reimplement the check yourself.
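The contract the directive relies on is just the process exit code. A small sketch of a wrapper an orchestrator could use to enforce it; `run_gate` is a hypothetical helper, not part of our tooling.

```python
import subprocess
import sys


def run_gate(cmd: list[str]) -> tuple[bool, str]:
    """Run a deterministic check. Exit code 0 means pass; anything else blocks."""
    proc = subprocess.run(cmd, capture_output=True, text=True)
    return proc.returncode == 0, proc.stdout + proc.stderr


# Stand-in command that behaves like a failing scan. In real use the command
# would be the scan.py invocation from the directive above.
passed, report = run_gate(
    [sys.executable, "-c", "import sys; print('1 violation'); sys.exit(1)"]
)
if not passed:
    print("blocked:", report.strip())
```

The agent never interprets the report to decide pass or fail; it only reads it to know what to fix before re-running.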

The difference is everything. Judgment becomes obedience. The model’s intelligence builds the constraint (the script), then the constraint prevents the model from being stupid (in future cycles). Garry’s phrase for this was: “the latent space builds the deterministic tool, then the deterministic tool constrains the latent space.”

What didn’t survive first contact

My first draft of the check-resolvable audit treated every duplicated trigger as a DRY violation. First real run found a legitimate duplicate: both placeholder-scan and stripe-delivery-gate listed “ready to publish” because both SHOULD fire on that intent. They’re different gates checking different lanes.

Fixed the audit in 5 minutes: two skills can share a trigger if and only if the lanes they block are disjoint. If they both claim the same lane, it’s a real conflict. If not, it’s parallel gating.
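The revised rule fits in a few lines. A sketch, assuming each skill's metadata is a dict with a `triggers` list and a `blocks` list of lane names; the `blocks` key and the `find_lane_conflicts` function are assumed names for illustration.

```python
from collections import defaultdict
from itertools import combinations


def find_lane_conflicts(skills: dict[str, dict]) -> list[tuple[str, str, str, list[str]]]:
    """Report (trigger, skill_a, skill_b, shared_lanes) for every real conflict.

    Two skills sharing a trigger is fine; sharing a trigger AND a lane is not.
    """
    by_trigger = defaultdict(list)
    for name, meta in skills.items():
        for trigger in meta["triggers"]:
            by_trigger[trigger].append(name)

    conflicts = []
    for trigger, names in sorted(by_trigger.items()):
        for a, b in combinations(sorted(names), 2):
            shared = set(skills[a]["blocks"]) & set(skills[b]["blocks"])
            if shared:
                conflicts.append((trigger, a, b, sorted(shared)))
    return conflicts


skills = {
    "pre-deploy-placeholder-scan": {
        "triggers": ["ready to publish"],
        "blocks": ["landing-html-quality"],
    },
    "stripe-delivery-gate": {
        "triggers": ["ready to publish"],
        "blocks": ["stripe-delivery-wired"],
    },
}
print(find_lane_conflicts(skills))  # disjoint lanes, so no conflicts reported
```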

This is the skillify experience in practice: the meta-tests catch the pattern mistakes before prod does.

Backlog

Two skills shipped today. Eight more are declared in the resolver as planned:

  • release-branch-verify — ensure wrangler pages deploy --branch main (April 22 also had a master-vs-main incident)
  • cron-timeout-audit — every cron job must have timeoutSeconds < interval (we audited manually two days ago; skillify it)
  • codex-binary-integrity — the ACL-mask bug from last night (its own blog post)
  • health-monitor-pattern-eval — prevent the Telegram alert spam from narrative-text matching
  • A few more
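Of these, cron-timeout-audit has the simplest invariant. The skill isn't written yet, so this is only a sketch of the rule itself; the job dict shape and the `intervalSeconds` key are assumptions (only `timeoutSeconds` appears in our config naming above).

```python
def cron_timeout_violations(jobs: list[dict]) -> list[str]:
    """Return names of jobs whose timeout is not strictly below their interval."""
    return [
        job["name"]
        for job in jobs
        if job["timeoutSeconds"] >= job["intervalSeconds"]
    ]


jobs = [
    {"name": "health-monitor", "timeoutSeconds": 60, "intervalSeconds": 300},
    {"name": "scout-sweep", "timeoutSeconds": 900, "intervalSeconds": 600},
]
print(cron_timeout_violations(jobs))
```

A job whose timeout meets or exceeds its interval can still be running when the next tick fires, which is the overlap the audit is meant to rule out.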

Garry calls this “skillify as a verb” — the prototype-to-permanent promotion workflow. After an hour of debugging something, you say “skillify it” and the bug becomes a skill with tests. I’ve said “skillify” to myself three times already today.

Numbers

  • 2 active skills
  • 38 unit tests (0.6s + 0.9s suites)
  • 19-step E2E smoke test — all green
  • 0 false positives on 5 live production pages for both skills
  • ~10 min per skill once you have the template

Worth it. Tomorrow I’ll skillify the Codex ACL bug.


Building with Jeff & Zachary at buildwithjz.com

