Spec Debt: The costliest new tech debt in the age of AI agents
TL;DR
- ๐นHumans author the rules of the game: specs, policies, evals, and risk tiers.
- ๐นAgents play the game: plan โ scaffold โ implement โ test โ deploy.
- ๐นAny time a human writes implementation code, treat it as spec debt - a signal that a spec, policy, or eval was missing or ambiguous.
Why This Matters Now
Software used to be โhumans write code.โ In AI-native teams, the contract flips: humans define intent, guardrails, and acceptance, while agents translate those decisions into working systems.
When humans keep patching code, you have found a governance gap. That gap is spec debt, and like any debt it accrues interest through slower iteration, inconsistent behavior, and compliance anxiety. Worse, it undercuts agent leverage because people are doing what automated toolchains should already handle.
A Precise Definition
Spec debt is the delta between the behavior you expect a system to produce - codified in executable specs, policies, evals, and risk tiers - and what your agents can reliably deliver. It is the governance gap between intent and automated execution.
You can spot it whenever the rules are missing, ambiguous, or unenforced. It shows up as inconsistent outputs, manual firefights, and an inability to prove conformance with receipts or telemetry. Use these recurring signals as your alarm:
- ๐นA human ships hand-written implementation logic instead of updating the spec, policy, or eval that would let an agent generate it.
- ๐นAn agent requests clarification or produces divergent results between runs because the spec left room for interpretation.
- ๐นYou cannot prove conformance - no receipts, no telemetry - without ad-hoc sleuthing.
The Iron Rule
If a human writes implementation code, log spec debt. The issue is either missing or ambiguous, and every deferment compounds the cost. Frontier R&D and tight performance kernels can be temporary carve-outs, but they must still converge to hardened specs fast.
โIf a human writes implementation code, log spec debt.โ
Taxonomy of Spec Debt
- ๐นMissing spec: No canonical contract exists, so agents guess.
- ๐นAmbiguous spec: Natural language or partial schemas produce multiple valid outputs.
- ๐นNon-executable spec: Docs that cannot run leave agents unable to enforce.
- ๐นUngraded behavior: Without evals, quality is subjective or learned in production.
- ๐นUn-tiered risk: Lacking risk tiers forces trivial and critical changes through the same gates.
- ๐นReceipt gaps: Outputs lack verifiable receipts or telemetry, so drift hides in shadows.
- ๐นStale spec: Behavior evolved while the spec froze, forcing agents to hot patch or fail.
The Interest You Pay
- ๐นCycle-time drag: more clarifications, retries, and manual fixes.
- ๐นEntropy: divergent implementations across teams and agents.
- ๐นCompliance risk: missing proofs for decisions, data paths, and access.
- ๐นAgent plateau: models look weak because the rules are weak or implicit.
- ๐นOnboarding friction: new agents and teammates cannot infer the playbook.
Make It Measurable (Scoreboard)
Turn spec debt into a measurable backlog so teams prioritize the work instead of debating anecdotes, and update the scoreboard every time a gap closes.
- ๐นSCR (Spec Coverage Ratio): # behaviors governed by promoted executable specs / total behaviors; target >0.9 for core surfaces.
- ๐นEC (Eval Coverage): # behaviors with automated evals / total behaviors; target >0.8 for user-visible or risky flows.
- ๐นAAR (Agent-Authorship Rate): % of code lines or changes authored by agents from specs vs humans; a falling AAR signals rising spec debt.
- ๐นMTTS (Mean Time to Spec): Time from first human workaround to committed spec, policy, or eval; keep it in hours or days.
- ๐นRC (Receipt Completeness): % of critical actions producing signed, queryable receipts; no receipts means you are guessing.
- ๐นRCR (Risk Conformance Rate): % of changes routed through the correct risk tiers and gates; low RCR equals governance theater.
Pipe these metrics into dashboards or your existing telemetry so leadership can see debt trending alongside delivery velocity.
The Pay-Down Loop
Run this five-step loop every time spec debt appears so you can replace heroics with predictable execution.
- ๐นCapture a receipt: Reproduce the case and log a verifiable receipt - inputs, steps, outputs, decisions, context - using existing telemetry.
- ๐นExtract the delta: Compare intended versus produced behavior, note the missing or fuzzy rule, and author a minimal failing scenario.
- ๐นAuthor the rule: Update the spec, policy, and evals; assign the correct risk tier so agents know how tightly to gate the change.
- ๐นGenerate & gate: Let agents re-scaffold and implement from the new rules, and require evals plus risk gates to pass before merge.
- ๐นPromote & observe: Promote the spec to source of truth, emit receipts for ongoing drift detection, and mark the Spec-Debt-ID closed.
Rituals That Keep You Honest
- ๐นUpdate the pull request template so it captures the spec or policy change, covering evals, the risk tier, and receipt links; any human-authored code logs a Spec-Debt-ID.
- ๐นHost a ten-minute Spec Standup: review new Spec-Debt-IDs, MTTS performance, top failing evals by risk tier, and the current AAR trend.
- ๐นInvoke spec bankruptcy when SCR or AAR slide for two weeks; pair that freeze with your next MGP Sprint so the loudest debts convert into executable rules.
Anti-Patterns to Eliminate
- ๐นโWe'll spec it later.โ Interest compounds every day the gap persists.
- ๐นComment-as-spec. Notes and wiki pages are not executable; agents cannot enforce them.
- ๐นDoc-driven theater. A pretty wiki without gates is governance cosplay.
- ๐นOne giant risk tier. Treating everything as "medium" is equivalent to "none.""
- ๐นMystery metrics. If evals cannot trace to receipts and deployments, you are guessing.
The Frontier Exception
Some work still needs direct human authorship today, but treat it as an exception rather than an operating model. Frontier research, novel interfaces, or handmade performance kernels may fall outside existing spec languages for a moment.
Even then, log the Spec-Debt-ID, attach receipts, and write the eval that describes the rule you wish existed. Emergency patches to halt an incident qualify, yet they should convert to hardened specs before planning resumes. If you are still patching the same surface two weeks later, you are normalizing debt.
Org Design for a Spec-First World
- ๐นSpec Editors (PM/domain + staff engineering): Own domain contracts, statecharts, and the evolution path.
- ๐นPolicy Authors (risk/compliance/trust): Encode policies, risk tiers, residency rules, and privacy constraints in executable form.
- ๐นEval Engineers (QA/ML ops): Build the test oracles, golden sets, and scenario libraries that agents must pass.
- ๐นAgent Operators (platform): Keep the agent toolchain fast, observable, deterministic, and easy to debug.
- ๐นEveryone: Logs Spec-Debt-IDs whenever they author implementation code; zero shame, high tempo.
A Must-Adopt Checklist
- ๐นExecutable specs for your top ten surfaces: schemas, statecharts, and contracts.
- ๐นA policy engine with tiered gates across low, medium, high, and critical flows.
- ๐นAn eval suite that blocks merges on critical behaviors, not just happy paths.
- ๐นReceipts for proofs: inputs, decisions, outputs, approvers, and versions.
- ๐นA scoreboard that broadcasts SCR, EC, AAR, MTTS, RC, and RCR to the full team.
- ๐นA pull request template wired to specs, policies, evals, receipts, and risk tiers, plus a spec bankruptcy ritual when metrics slide.
A Quick Example
A team ships a pricing endpoint and sprinkles hand-written if/else blocks for promotions and regions. It works until a different agent โfixesโ EU pricing and breaks APAC because the underlying rules live in brittle code. Symptoms pile up: divergent logic, fragile tests, confused engineers, and late-night pages.
The fix is to write pricing_rules.v2 with region-to-promotion precedence, declare pricing as a high risk tier, add evals for EU VAT, APAC rounding, and U.S. promo stacking, regenerate scaffolds, and block merges on failing evals. Every decision emits receipts keyed to rule versions, so agents stay aligned and humans edit rules instead of patching glue code.
Pushback and Responses
- ๐นโThis will slow us down.โ Once specs run, speed increases because agents stop asking clarifying questions and divergent code stops shipping.
- ๐นโOur domain is too nuanced to encode.โ Then encode the nuance, starting with the 20 percent of rules that cause 80 percent of confusion.
- ๐นโWe're creative; specs will box us in.โ Creativity belongs in designing better rules and frontier ideas, not retyping the same logic each quarter.
The Commitment
Hold quarterly SLOs that keep spec debt from creeping back: zero new spec debt for tier-high behaviors, MTTS under three days, SCR at or above 0.9, and AAR trending upward. Track the loop so teams know when the system is healthy versus slipping into fire drills.
In AI-native teams, code is no longer the product - the rules are. Agents can author endless code, but only humans decide which behaviors are allowed, how they are graded, and what โgoodโ means.
What to Do Next
Request a working session or demo to see the stack in action, and reach out via contact us to align OutcomeOS, Receipt Studio, and your governance rituals.
Treat every line of human implementation as a spec alarm.