9arm Skills — Debug Mantras, Post-Mortems, and Code Review as Agent Skills

After looking at large methodology systems like Superpowers, it’s worth examining a more focused take: 9arm-skills by thananon — a compact Claude Code plugin with four carefully engineered skills for debugging, post-mortems, code review, and management communication.

What makes this project interesting isn’t the number of skills (just 4), but the craft of each one. These are some of the most precisely written agent skill files I’ve seen.

The Skills

1. Debug Mantra — Reproducible Debugging as a Forced Process

The debug-mantra skill enforces a four-step discipline that the agent must follow in order before proposing any fix:

Reproduce reliably — Can the issue be reproduced? If not, stop. Don’t hypothesize. Flaky bugs need rate-raising before they’re debuggable.
Know the fail path — Debugger first, then source tracing + knob enumeration, then in-code instrumentation. Escalate only when the prior tactic fails.
Falsify the hypothesis — When a candidate root cause surfaces, run the disproof first. If the hypothesis survives, it’s real. Generate 3-5 ranked hypotheses, not one.
Every run is a breadcrumb — Maintain a running ledger of every experiment. Before declaring a hypothesis correct, verify it against every prior observation.
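
The fourth step, checking a hypothesis against every prior observation, can be pictured as a small data structure. This is a minimal sketch and not code from the skill: `Run`, `Ledger`, and the `predicts` callback are hypothetical names, and the recorded experiments are illustrative values.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass(frozen=True)
class Run:
    """One experiment: what was changed and whether the bug showed up."""
    change: str
    failed: bool


@dataclass
class Ledger:
    """Running record of every experiment (hypothetical structure)."""
    runs: list[Run] = field(default_factory=list)

    def record(self, change: str, failed: bool) -> None:
        self.runs.append(Run(change, failed))

    def contradictions(self, predicts: Callable[[str], bool]) -> list[Run]:
        """Return every prior run whose outcome the hypothesis gets wrong."""
        return [r for r in self.runs if predicts(r.change) != r.failed]


ledger = Ledger()
ledger.record("baseline", failed=True)
ledger.record("numStreams=2", failed=False)
ledger.record("logging enabled", failed=True)


# Hypothesis A (illustrative): the bug only occurs off the two-stream path.
def single_stream_only(change: str) -> bool:
    return change != "numStreams=2"


# Hypothesis B (illustrative): enabling logging hides the bug.
def logging_hides_it(change: str) -> bool:
    return change != "logging enabled"


print(len(ledger.contradictions(single_stream_only)))  # 0: consistent with every run
print(len(ledger.contradictions(logging_hides_it)))    # 2: contradicted by two runs
```

A hypothesis is expressed as a prediction over past experiments, so "verify it against every prior observation" becomes a mechanical check. Any non-empty list of contradictions sends the agent back to step 3.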

The skill has a unique pattern: the agent recites a mantra block verbatim at the start of every debugging session:

First is reproducibility. Can the issue be reproduced reliably?

Know the fail path. Debugger first; then source trace + knob enumeration; then in-code instrumentation.

Question your hypothesis. What would disprove it?

Every run is a breadcrumb. Cross-reference all of them.

This recitation isn’t for the user — it’s a forcing function for the agent itself. The operating rules are explicit:

Recite verbatim, never paraphrase
If the user says “skip the mantra,” skip the recital but still apply the four steps silently
If you catch yourself proposing a fix without a reliable repro, stop and return to step 1

The discipline about reproducibility is particularly sharp. There are three tiers:

Reliable repro → proceed
Flaky repro → raise the rate first (loop, parallelize, inject sleeps) until the failure rate reaches 50%
No repro at all → stop. Say so explicitly. Do not proceed to hypothesize.
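
One way to make the three tiers concrete is to measure a failure rate and map it onto a tier. The sketch below is an assumption-laden illustration, not part of the skill: `failure_rate` and `repro_tier` are hypothetical helpers, a nonzero exit code is assumed to mean "the bug reproduced", and treating 50% as the cutoff between flaky and workable is one reading of the rule above.

```python
import subprocess
import sys


def failure_rate(cmd: list[str], runs: int = 20) -> float:
    """Run cmd `runs` times; return the fraction of runs that exit nonzero."""
    failures = sum(
        subprocess.run(cmd, capture_output=True).returncode != 0
        for _ in range(runs)
    )
    return failures / runs


def repro_tier(rate: float, reliable_at: float = 0.5) -> str:
    """Map an observed failure rate onto the three tiers (assumed cutoff)."""
    if rate == 0.0:
        return "none: stop, say so, do not hypothesize"
    if rate < reliable_at:
        return "flaky: raise the rate first"
    return "reliable: proceed"


# Stand-in for a real repro command: a Python process that always exits 1.
always_fails = [sys.executable, "-c", "raise SystemExit(1)"]
print(repro_tier(failure_rate(always_fails, runs=3)))  # reliable: proceed
```

Zero failures in a finite number of runs does not prove the bug cannot be reproduced. It only means there is no repro yet, which is exactly the case where the skill tells the agent to stop.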

2. Post-Mortem — Canonical Engineering Records

The post-mortem skill writes the engineering record of a fixed bug. It’s designed to be the artifact that lets future-you recover the mental model fast.

It refuses to draft without four required inputs:

Reliable repro exists (deterministic or high-rate flake)
Root cause is known (the mechanism, not a hypothesis)
Fix is identified (PR/commit/branch pointer)
Fix is validated (original repro now passes)
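
The refusal behavior amounts to a precondition check before any drafting happens. Here is a minimal sketch of that gate, assuming the four inputs arrive as plain strings. `PostMortemInputs`, `missing_inputs`, and `draft` are hypothetical names for illustration, not the skill's actual interface.

```python
from dataclasses import dataclass


@dataclass
class PostMortemInputs:
    repro: str = ""        # how to reproduce (deterministic or high-rate flake)
    root_cause: str = ""   # the mechanism, not a hypothesis
    fix_ref: str = ""      # PR / commit / branch pointer
    validation: str = ""   # evidence that the original repro now passes


def missing_inputs(inputs: PostMortemInputs) -> list[str]:
    """Names of required inputs that are still empty, in declaration order."""
    return [name for name, value in vars(inputs).items() if not value.strip()]


def draft(inputs: PostMortemInputs) -> str:
    missing = missing_inputs(inputs)
    if missing:
        # Refuse rather than produce a plausible-sounding draft.
        raise ValueError("cannot draft post-mortem, missing: " + ", ".join(missing))
    # Only the four mandatory sections are sketched; Summary is a placeholder.
    return "\n\n".join([
        "Summary\n(to be written)",
        f"Root Cause\n{inputs.root_cause}",
        f"Fix\n{inputs.fix_ref}",
        f"Validation\n{inputs.validation}",
    ])


print(missing_inputs(PostMortemInputs(repro="loop the test 100x")))
# ['root_cause', 'fix_ref', 'validation']
```

Raising an error instead of returning a partial draft is the point of the design: the caller has to supply the missing fact before anything readable exists.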

The structure has 9 sections, with Summary, Root Cause, Fix, and Validation as mandatory. Code identifiers are first-class citizens — function names, file paths, struct fields, commit SHAs are expected, not stripped. The audience is engineers.

The skill is opinionated about tone:

Mechanism over narrative — “which function skipped which event under which gate,” not “a synchronization issue”
No hedging — “We believe” / “appears to” / “may have” are dropped
Blameless — describe the gap, not the person
Validation coverage honesty — “validated on Llama-2-70B / 8 GPUs / DeepSpeed; not retested on other workloads” is information, not a hole

The worked example in the skill — a Tada hang in dumbModel — is a masterclass in how to write a post-mortem. It walks through the root cause (scratchBuf == NULL due to a skipped cross-stream event), names the prior fix attempt (PR #5612) and why it was wrong, documents the exact experiment that nailed the cause (forcing numStreams = 2 made the bug disappear), and states validation coverage honestly.

3. Scrutinize — Outsider Code Review

The scrutinize skill performs an end-to-end review of a plan, PR, or code change from an outsider perspective. It’s structured as a 4-step workflow:

Intent — State the goal in one sentence. If you can’t, stop. Then ask: is there a simpler way? Consider doing nothing, using something that already exists, a smaller change that solves 90% with 10% risk, or solving at a different layer.
Trace — Walk the actual code path end-to-end, not just the diff lines. Include unchanged code on either side. Bugs hide at the seams.
Verify — For each claim the change makes, walk the path explicitly and check if it actually produces that behavior. Test edge cases, error paths, concurrent callers.
Report — Output one tight section per finding, ordered by severity. Close with a one-line verdict: ship / fix-then-ship / rework / reject.

Key operating rules:

No rubber-stamps — “LGTM” is not an output
Cite or it didn’t happen — every claim references a specific file:line
One simpler-alternative pass is mandatory, even on small changes
No flattery, no hedging — state the finding
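
The Report step and the "cite or it didn't happen" rule can be enforced structurally. The sketch below assumes the four-part finding layout (Finding, Why it matters, Evidence, Suggested change) and the four verdicts. The severity labels, the `file:line` regex, and the names `Finding` and `render_report` are assumptions made for illustration.

```python
import re
from dataclasses import dataclass

SEVERITY_ORDER = {"critical": 0, "major": 1, "minor": 2}  # assumed labels
VERDICTS = ("ship", "fix-then-ship", "rework", "reject")
CITATION = re.compile(r"\S+:\d+")  # loose file:line check, e.g. src/sched.c:214


@dataclass
class Finding:
    severity: str
    finding: str
    why_it_matters: str
    evidence: str          # must contain a file:line citation
    suggested_change: str


def render_report(findings: list[Finding], verdict: str) -> str:
    if verdict not in VERDICTS:
        # "LGTM" and other rubber-stamps are rejected here.
        raise ValueError(f"verdict must be one of {VERDICTS}")
    for item in findings:
        if not CITATION.search(item.evidence):
            raise ValueError(f"uncited finding: {item.finding!r}")
    ordered = sorted(findings, key=lambda item: SEVERITY_ORDER[item.severity])
    sections = [
        f"[{item.severity}] {item.finding}\n"
        f"Why it matters: {item.why_it_matters}\n"
        f"Evidence: {item.evidence}\n"
        f"Suggested change: {item.suggested_change}"
        for item in ordered
    ]
    # One section per finding, most severe first, then the one-line verdict.
    return "\n\n".join(sections + [f"Verdict: {verdict}"])
```

With this shape, a finding without a citation or a verdict outside the four allowed values fails loudly instead of slipping into the review.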

The “intent first” step is the most valuable. Before reviewing line-by-line, the skill steps back and asks whether the change should exist at all. This catches the class of problems where the code is correct but unnecessary.

4. Management Talk — Engineering to Leadership Translation

The management-talk skill rewrites engineer-to-engineer content for engineering-org leadership (VPs, directors, PMs, release managers). It’s the companion to post-mortem — the post-mortem owns the engineering truth, and management-talk reframes it for leadership.

The translation rules are precise:

Keep: Product names, framework names, JIRA keys, PR numbers, customer/workload identifiers. These are the bridge between engineering and leadership tracking.

Strip: Function names, file paths, struct fields, commit SHAs, code expressions, internal data-structure jargon.

Translate: Mechanism into one or two sentences of plain-English cause-and-effect. Don’t strip so much that you lose meaning: terms like “race condition,” “synchronization,” “uninitialized buffer,” and “fast-path” are concepts leadership reads fluently.

Don’t: Hedge, restate the obvious, tell leadership how to do their job, or include engineering-process minutiae.
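
The Strip rule lends itself to a rough lint over a finished summary. The skill itself is prompt text, so the following is only a heuristic sketch under my own assumptions: the regexes, the file-extension list, and the `leftovers` helper are hypothetical, and `PROJ-1234`, `launchKernel()`, `stream.cu`, and `3f2a9c1` are made-up sample tokens.

```python
import re

# Heuristic patterns for engineer-only tokens that should not reach leadership.
STRIP_PATTERNS = {
    "code expression": re.compile(r"\b\w+\s*(?:==|!=)\s*\w+"),
    "function call": re.compile(r"\b\w+\(\)"),
    "file path": re.compile(r"\b[\w./-]+\.(?:cpp|cc|cu|py|rs|go|c|h)\b"),
    # 7-40 lowercase hex characters containing at least one digit.
    "commit SHA": re.compile(r"\b(?=[0-9a-f]*\d)[0-9a-f]{7,40}\b"),
}


def leftovers(summary: str) -> dict[str, list[str]]:
    """Return engineer-only tokens still present, grouped by kind."""
    found = {label: pat.findall(summary) for label, pat in STRIP_PATTERNS.items()}
    return {label: hits for label, hits in found.items() if hits}


clean = "PROJ-1234 (PR #5612): a hang caused by an uninitialized buffer on the fast-path."
leaky = "scratchBuf == NULL in launchKernel() (stream.cu, 3f2a9c1)"

print(leftovers(clean))  # {}
print(leftovers(leaky))
```

JIRA keys and PR numbers pass through untouched because no pattern targets them, which mirrors the Keep rule. Regexes like these will miss things and raise false positives, so this is a reviewer aid at best, not a substitute for the rewrite.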

What I Learned

This project is small but sharp. A few takeaways:

Recitation as a forcing function. The debug-mantra’s “recite verbatim” pattern is clever — it forces the agent to internalize the process before executing. I’m adding this to my own doppelganger’s debugging workflow.
Refusal is a feature. The post-mortem skill refuses to draft without four required inputs. The debug-mantra stops if there’s no repro. Saying “no” is more valuable than producing plausible-sounding output. This is something most agent skill authors get wrong.
The post-mortem + management-talk pair. Having two skills that own different versions of the same content (engineering truth vs. leadership summary) is a clean design pattern. The post-mortem keeps code identifiers; management-talk strips them. Each serves its audience without compromise.
Output format is part of the spec. The scrutinize skill specifies exact output format (Finding + Why it matters + Evidence + Suggested change). The post-mortem specifies 9 sections with mandatory vs. conditional markers. This makes the agent’s output predictable and reviewable.

If you’re building Claude Code skills or agent workflows, I’d recommend reading through the 9arm-skills source. At only 4 skills, it’s digestible in one sitting — and each skill contains design patterns worth borrowing.