
Spec Debt: The costliest new tech debt in the age of AI agents

Zach Lendon • October 22, 2025 • 7 min read

TL;DR

  • Humans author the rules of the game: specs, policies, evals, and risk tiers.
  • Agents play the game: plan → scaffold → implement → test → deploy.
  • Any time a human writes implementation code, treat it as spec debt - a signal that a spec, policy, or eval was missing or ambiguous.

Why This Matters Now

Software used to be “humans write code.” In AI-native teams, the contract flips: humans define intent, guardrails, and acceptance, while agents translate those decisions into working systems.

When humans keep patching code, you have found a governance gap. That gap is spec debt, and like any debt it accrues interest through slower iteration, inconsistent behavior, and compliance anxiety. Worse, it undercuts agent leverage because people are doing what automated toolchains should already handle.

A Precise Definition

Spec debt is the delta between the behavior you expect a system to produce - codified in executable specs, policies, evals, and risk tiers - and what your agents can reliably deliver. It is the governance gap between intent and automated execution.

You can spot it whenever the rules are missing, ambiguous, or unenforced. It shows up as inconsistent outputs, manual firefights, and an inability to prove conformance with receipts or telemetry. Use these recurring signals as your alarm:

  • A human ships hand-written implementation logic instead of updating the spec, policy, or eval that would let an agent generate it.
  • An agent requests clarification or produces divergent results between runs because the spec left room for interpretation.
  • You cannot prove conformance - no receipts, no telemetry - without ad-hoc sleuthing.

The Iron Rule

If a human writes implementation code, log spec debt. The rule the agent needed is either missing or ambiguous, and every deferral compounds the cost. Frontier R&D and tight performance kernels can be temporary carve-outs, but they must still converge to hardened specs fast.


Taxonomy of Spec Debt

  1. Missing spec: No canonical contract exists, so agents guess.
  2. Ambiguous spec: Natural language or partial schemas produce multiple valid outputs.
  3. Non-executable spec: Docs that cannot run leave agents unable to enforce them.
  4. Ungraded behavior: Without evals, quality is subjective or learned in production.
  5. Un-tiered risk: Lacking risk tiers forces trivial and critical changes through the same gates.
  6. Receipt gaps: Outputs lack verifiable receipts or telemetry, so drift goes undetected.
  7. Stale spec: Behavior evolved while the spec froze, forcing agents to hot patch or fail.

The Interest You Pay

  • Cycle-time drag: more clarifications, retries, and manual fixes.
  • Entropy: divergent implementations across teams and agents.
  • Compliance risk: missing proofs for decisions, data paths, and access.
  • Agent plateau: models look weak because the rules are weak or implicit.
  • Onboarding friction: new agents and teammates cannot infer the playbook.

Make It Measurable (Scoreboard)

Turn spec debt into a measurable backlog so teams prioritize the work instead of debating anecdotes, and update the scoreboard every time a gap closes.

  • SCR (Spec Coverage Ratio): # behaviors governed by promoted executable specs / total behaviors; target >0.9 for core surfaces.
  • EC (Eval Coverage): # behaviors with automated evals / total behaviors; target >0.8 for user-visible or risky flows.
  • AAR (Agent-Authorship Rate): % of code lines or changes authored by agents from specs vs humans; a falling AAR signals rising spec debt.
  • MTTS (Mean Time to Spec): Time from first human workaround to committed spec, policy, or eval; keep it in hours or days.
  • RC (Receipt Completeness): % of critical actions producing signed, queryable receipts; no receipts means you are guessing.
  • RCR (Risk Conformance Rate): % of changes routed through the correct risk tiers and gates; low RCR equals governance theater.

Pipe these metrics into dashboards or your existing telemetry so leadership can see debt trending alongside delivery velocity.
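
As a rough illustration, the ratio-style metrics above could be computed from a simple inventory of behaviors and change counts. This is a minimal sketch, not a reference implementation: the `Behavior` record, its field names, and the sample numbers are all assumptions made up for the example.

```python
from dataclasses import dataclass


@dataclass
class Behavior:
    """One governed behavior; all field names here are illustrative."""
    name: str
    has_promoted_spec: bool   # covered by a promoted executable spec
    has_automated_eval: bool  # covered by at least one automated eval


def ratio(numerator: int, denominator: int) -> float:
    """Return numerator / denominator, or 0.0 when there is nothing to count."""
    return numerator / denominator if denominator else 0.0


def scoreboard(behaviors, agent_changes, human_changes,
               receipted_actions, critical_actions,
               correctly_routed, total_changes):
    """Compute SCR, EC, AAR, RC, and RCR as fractions between 0 and 1."""
    total = len(behaviors)
    return {
        "SCR": ratio(sum(b.has_promoted_spec for b in behaviors), total),
        "EC": ratio(sum(b.has_automated_eval for b in behaviors), total),
        "AAR": ratio(agent_changes, agent_changes + human_changes),
        "RC": ratio(receipted_actions, critical_actions),
        "RCR": ratio(correctly_routed, total_changes),
    }


# Made-up sample data, only to exercise the function.
behaviors = [
    Behavior("checkout", True, True),
    Behavior("refunds", True, False),
    Behavior("pricing", False, True),
    Behavior("invoicing", True, True),
]
board = scoreboard(behaviors, agent_changes=45, human_changes=5,
                   receipted_actions=18, critical_actions=20,
                   correctly_routed=38, total_changes=40)
print(board)
```

In this sample, SCR comes out at 0.75, below the 0.9 target for core surfaces, which is the kind of gap the scoreboard is meant to surface. MTTS is a duration rather than a ratio, so it is left out of this sketch.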

The Pay-Down Loop

Run this five-step loop every time spec debt appears so you can replace heroics with predictable execution.

  1. Capture a receipt: Reproduce the case and log a verifiable receipt - inputs, steps, outputs, decisions, context - using existing telemetry.
  2. Extract the delta: Compare intended versus produced behavior, note the missing or fuzzy rule, and author a minimal failing scenario.
  3. Author the rule: Update the spec, policy, and evals; assign the correct risk tier so agents know how tightly to gate the change.
  4. Generate & gate: Let agents re-scaffold and implement from the new rules, and require evals plus risk gates to pass before merge.
  5. Promote & observe: Promote the spec to source of truth, emit receipts for ongoing drift detection, and mark the Spec-Debt-ID closed.
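
One way to keep the loop honest is to track each Spec-Debt-ID as a small record that only moves forward one step at a time. The sketch below is an assumption about how that might look: the `SpecDebt` class, the step identifiers, and the choice to stop the MTTS clock at the "author the rule" step are all hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import List, Optional

# The five steps of the loop, in order; the identifiers are illustrative.
STEPS = ["capture_receipt", "extract_delta", "author_rule",
         "generate_and_gate", "promote_and_observe"]


@dataclass
class SpecDebt:
    debt_id: str
    opened_at: datetime  # time of the first human workaround
    completed: List[str] = field(default_factory=list)
    rule_committed_at: Optional[datetime] = None

    @property
    def closed(self) -> bool:
        return len(self.completed) == len(STEPS)

    @property
    def time_to_spec(self) -> Optional[timedelta]:
        """Time from first workaround to committed rule, if one exists yet."""
        if self.rule_committed_at is None:
            return None
        return self.rule_committed_at - self.opened_at

    def advance(self, step: str, at: datetime) -> None:
        """Complete the next step; refuse to skip or reorder steps."""
        if self.closed:
            raise ValueError(f"{self.debt_id} is already closed")
        expected = STEPS[len(self.completed)]
        if step != expected:
            raise ValueError(f"expected {expected!r}, got {step!r}")
        self.completed.append(step)
        if step == "author_rule":
            self.rule_committed_at = at


def mean_time_to_spec(debts) -> Optional[timedelta]:
    """Average time-to-spec over debts that already have a committed rule."""
    durations = [d.time_to_spec for d in debts if d.time_to_spec is not None]
    if not durations:
        return None
    return sum(durations, timedelta()) / len(durations)


# Made-up timeline: one step every six hours after the workaround.
opened = datetime(2025, 10, 20, 9, 0)
debt = SpecDebt("SD-001", opened)
for i, step in enumerate(STEPS, start=1):
    debt.advance(step, opened + timedelta(hours=6 * i))
print(debt.closed, debt.time_to_spec)
```

Refusing out-of-order steps is a deliberate choice here: a debt cannot be marked closed without a receipt and a rule on record first.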

Rituals That Keep You Honest

  • Update the pull request template so it captures the spec or policy change, covering evals, the risk tier, and receipt links; any human-authored code logs a Spec-Debt-ID.
  • Host a ten-minute Spec Standup: review new Spec-Debt-IDs, MTTS performance, top failing evals by risk tier, and the current AAR trend.
  • Invoke spec bankruptcy when SCR or AAR slide for two weeks; pair that freeze with your next MGP Sprint so the loudest debts convert into executable rules.
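
The spec bankruptcy trigger can be made mechanical. The sketch below assumes one metric sample per week and reads "slide for two weeks" as two consecutive week-over-week declines; both the sampling cadence and that interpretation are assumptions, as are the function names.

```python
def is_sliding(weekly_values, weeks=2):
    """True when the metric fell in each of the last `weeks` weekly comparisons."""
    if len(weekly_values) < weeks + 1:
        return False  # not enough history to judge
    recent = weekly_values[-(weeks + 1):]
    return all(later < earlier for earlier, later in zip(recent, recent[1:]))


def should_invoke_spec_bankruptcy(scr_history, aar_history):
    """Trigger when either SCR or AAR has slid for two weeks running."""
    return is_sliding(scr_history) or is_sliding(aar_history)


# Made-up weekly samples: SCR is flat, AAR has dropped two weeks in a row.
print(should_invoke_spec_bankruptcy([0.91, 0.91, 0.91], [0.80, 0.76, 0.71]))
```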

Anti-Patterns to Eliminate

  • “We'll spec it later.” Interest compounds every day the gap persists.
  • Comment-as-spec. Notes and wiki pages are not executable; agents cannot enforce them.
  • Doc-driven theater. A pretty wiki without gates is governance cosplay.
  • One giant risk tier. Treating everything as “medium” is equivalent to “none.”
  • Mystery metrics. If evals cannot trace to receipts and deployments, you are guessing.

The Frontier Exception

Some work still needs direct human authorship today, but treat it as an exception rather than an operating model. Frontier research, novel interfaces, or handmade performance kernels may fall outside existing spec languages for a moment.

Even then, log the Spec-Debt-ID, attach receipts, and write the eval that describes the rule you wish existed. Emergency patches to halt an incident qualify, yet they should convert to hardened specs before planning resumes. If you are still patching the same surface two weeks later, you are normalizing debt.

Org Design for a Spec-First World

  • Spec Editors (PM/domain + staff engineering): Own domain contracts, statecharts, and the evolution path.
  • Policy Authors (risk/compliance/trust): Encode policies, risk tiers, residency rules, and privacy constraints in executable form.
  • Eval Engineers (QA/ML ops): Build the test oracles, golden sets, and scenario libraries that agents must pass.
  • Agent Operators (platform): Keep the agent toolchain fast, observable, deterministic, and easy to debug.
  • Everyone: Logs Spec-Debt-IDs whenever they author implementation code; zero shame, high tempo.

A Must-Adopt Checklist

  • Executable specs for your top ten surfaces: schemas, statecharts, and contracts.
  • A policy engine with tiered gates across low, medium, high, and critical flows.
  • An eval suite that blocks merges on critical behaviors, not just happy paths.
  • Receipts for proofs: inputs, decisions, outputs, approvers, and versions.
  • A scoreboard that broadcasts SCR, EC, AAR, MTTS, RC, and RCR to the full team.
  • A pull request template wired to specs, policies, evals, receipts, and risk tiers, plus a spec bankruptcy ritual when metrics slide.
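
To make the tiered gates and receipts on this checklist concrete, here is a minimal sketch using only the Python standard library. The gate names per tier, the demo key, and the sample change identifier are hypothetical. HMAC-SHA256 stands in for "signed" purely for illustration; a real deployment might prefer asymmetric signatures so verifiers never hold the signing key.

```python
import hashlib
import hmac
import json

# Hypothetical gate requirements for each of the four tiers.
GATES = {
    "low": {"evals"},
    "medium": {"evals", "policy_check"},
    "high": {"evals", "policy_check", "human_approval"},
    "critical": {"evals", "policy_check", "human_approval", "second_approval"},
}


def missing_gates(tier, passed):
    """Return the gates still required before a change in `tier` may merge."""
    return sorted(GATES[tier] - set(passed))


def sign_receipt(receipt: dict, key: bytes) -> str:
    """HMAC-SHA256 over a canonical JSON encoding of the receipt."""
    payload = json.dumps(receipt, sort_keys=True, separators=(",", ":")).encode()
    return hmac.new(key, payload, hashlib.sha256).hexdigest()


def verify_receipt(receipt: dict, signature: str, key: bytes) -> bool:
    """Recompute the signature and compare in constant time."""
    return hmac.compare_digest(sign_receipt(receipt, key), signature)


key = b"demo-key-not-for-production"
blocked_on = missing_gates("high", passed=["evals"])

# Receipt fields follow the checklist: inputs, decisions, outputs,
# approvers, and versions. The values are invented for the example.
receipt = {
    "inputs": {"change": "example-change-1", "tier": "high"},
    "decisions": "blocked" if blocked_on else "merged",
    "outputs": {"missing_gates": blocked_on},
    "approvers": [],
    "versions": {"spec": "pricing_rules.v2"},
}
signature = sign_receipt(receipt, key)
print(blocked_on, verify_receipt(receipt, signature, key))
```

Serializing with sorted keys before signing means the same receipt always produces the same signature, so any later edit to a stored receipt fails verification.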

A Quick Example

A team ships a pricing endpoint and sprinkles hand-written if/else blocks for promotions and regions. It works until a different agent “fixes” EU pricing and breaks APAC because the underlying rules live in brittle code. Symptoms pile up: divergent logic, fragile tests, confused engineers, and late-night pages.

The fix is to write pricing_rules.v2 with region-to-promotion precedence, declare pricing as a high risk tier, add evals for EU VAT, APAC rounding, and U.S. promo stacking, regenerate scaffolds, and block merges on failing evals. Every decision emits receipts keyed to rule versions, so agents stay aligned and humans edit rules instead of patching glue code.
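
A sketch of what that fix might look like as data plus evals follows. Every tax rate, promo code, discount, and rounding unit below is invented for illustration and does not reflect real tax or pricing rules. The rule layout and the `price` and `run_evals` functions are assumptions, not the actual pricing_rules.v2 format.

```python
from decimal import Decimal, ROUND_HALF_UP

# Hypothetical rule data: all rates, promos, and rounding units are made up.
PRICING_RULES_V2 = {
    "version": "pricing_rules.v2",
    "risk_tier": "high",
    "promos": {"SEASONAL": Decimal("0.10"), "LOYALTY": Decimal("0.05")},
    "regions": {
        "EU": {"tax_rate": Decimal("0.20"), "round_to": Decimal("0.01"),
               "promo_precedence": ["SEASONAL", "LOYALTY"], "stack": False},
        "APAC": {"tax_rate": Decimal("0"), "round_to": Decimal("1"),
                 "promo_precedence": ["LOYALTY", "SEASONAL"], "stack": False},
        "US": {"tax_rate": Decimal("0"), "round_to": Decimal("0.01"),
               "promo_precedence": ["SEASONAL", "LOYALTY"], "stack": True},
    },
}


def price(base, region, promos, rules=PRICING_RULES_V2):
    """Price one item from the rules and report which rule version decided."""
    cfg = rules["regions"][region]
    # Order the requested promos by the region's precedence list.
    eligible = [p for p in cfg["promo_precedence"] if p in promos]
    applied = eligible if cfg["stack"] else eligible[:1]
    amount = base
    for promo in applied:
        amount *= 1 - rules["promos"][promo]
    amount *= 1 + cfg["tax_rate"]
    total = amount.quantize(cfg["round_to"], rounding=ROUND_HALF_UP)
    return {"total": total, "applied": applied, "rule_version": rules["version"]}


# Evals: (name, base, region, promos, expected total, expected promos applied).
EVALS = [
    ("EU VAT", Decimal("100"), "EU", ["LOYALTY", "SEASONAL"],
     Decimal("108.00"), ["SEASONAL"]),
    ("APAC rounding", Decimal("99.50"), "APAC", ["LOYALTY"],
     Decimal("95"), ["LOYALTY"]),
    ("US promo stacking", Decimal("100"), "US", ["SEASONAL", "LOYALTY"],
     Decimal("85.50"), ["SEASONAL", "LOYALTY"]),
]


def run_evals(rules=PRICING_RULES_V2):
    """Return the names of failing evals; an empty list means the gate passes."""
    failures = []
    for name, base, region, promos, total, applied in EVALS:
        result = price(base, region, promos, rules)
        if result["total"] != total or result["applied"] != applied:
            failures.append(name)
    return failures


print(run_evals())
```

Because each region's behavior lives in its own rule entry and has its own eval, a change to the EU entry that altered APAC results would fail the APAC eval before merge instead of paging someone at night.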

Pushback and Responses

  • “This will slow us down.” Once specs run, speed increases because agents stop asking clarifying questions and divergent code stops shipping.
  • “Our domain is too nuanced to encode.” Then encode the nuance, starting with the 20 percent of rules that cause 80 percent of confusion.
  • “We're creative; specs will box us in.” Creativity belongs in designing better rules and frontier ideas, not retyping the same logic each quarter.

The Commitment

Hold quarterly SLOs that keep spec debt from creeping back: zero new spec debt for high-tier behaviors, MTTS under three days, SCR at or above 0.9, and AAR trending upward. Track the loop so teams know when the system is healthy versus slipping into fire drills.
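
These SLOs are concrete enough to check automatically. The sketch below is one possible encoding; the function name and inputs are hypothetical, and "trending upward" is assumed to mean the latest AAR sample in the quarter is higher than the first.

```python
from datetime import timedelta


def slo_violations(new_high_tier_debts, mtts, scr, aar_history):
    """Return a list of breached SLOs; an empty list means the quarter is healthy."""
    violations = []
    if new_high_tier_debts > 0:
        violations.append("new spec debt on high-tier behaviors")
    if mtts >= timedelta(days=3):
        violations.append("MTTS at or above three days")
    if scr < 0.9:
        violations.append("SCR below 0.9")
    # Assumption: "trending upward" compares the last sample with the first.
    if len(aar_history) < 2 or aar_history[-1] <= aar_history[0]:
        violations.append("AAR not trending upward")
    return violations


# Made-up quarter: everything is healthy except spec coverage.
print(slo_violations(0, timedelta(days=1, hours=6), 0.86, [0.70, 0.74, 0.79]))
```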

In AI-native teams, code is no longer the product - the rules are. Agents can author endless code, but only humans decide which behaviors are allowed, how they are graded, and what “good” means.

What to Do Next

Request a working session or demo to see the stack in action, and reach out via contact us to align OutcomeOS, Receipt Studio, and your governance rituals.

Treat every line of human implementation as a spec alarm.

Want to learn more about OutcomeOS™?