BlueFolders - AI-Native Venture Studio

TL;DR

🔹Humans author the rules of the game: specs, policies, evals, and risk tiers.
🔹Agents play the game: plan → scaffold → implement → test → deploy.
🔹Any time a human writes implementation code, treat it as spec debt - a signal that a spec, policy, or eval was missing or ambiguous.

Why This Matters Now

Software used to be “humans write code.” In AI-native teams, the contract flips: humans define intent, guardrails, and acceptance, while agents translate those decisions into working systems.

When humans keep patching code, you have found a governance gap. That gap is spec debt, and like any debt it accrues interest through slower iteration, inconsistent behavior, and compliance anxiety. Worse, it undercuts agent leverage because people are doing what automated toolchains should already handle.

A Precise Definition

Spec debt is the delta between the behavior you expect a system to produce - codified in executable specs, policies, evals, and risk tiers - and what your agents can reliably deliver. It is the governance gap between intent and automated execution.

You can spot it whenever the rules are missing, ambiguous, or unenforced. It shows up as inconsistent outputs, manual firefights, and an inability to prove conformance with receipts or telemetry. Use these recurring signals as your alarm:

🔹A human ships hand-written implementation logic instead of updating the spec, policy, or eval that would let an agent generate it.
🔹An agent requests clarification or produces divergent results between runs because the spec left room for interpretation.
🔹You cannot prove conformance - no receipts, no telemetry - without ad-hoc sleuthing.

The Iron Rule

If a human writes implementation code, log spec debt. The issue is either missing or ambiguous, and every deferment compounds the cost. Frontier R&D and tight performance kernels can be temporary carve-outs, but they must still converge to hardened specs fast.

“If a human writes implementation code, log spec debt.”

Taxonomy of Spec Debt

🔹Missing spec: No canonical contract exists, so agents guess.
🔹Ambiguous spec: Natural language or partial schemas produce multiple valid outputs.
🔹Non-executable spec: Docs that cannot run leave agents unable to enforce.
🔹Ungraded behavior: Without evals, quality is subjective or learned in production.
🔹Un-tiered risk: Lacking risk tiers forces trivial and critical changes through the same gates.
🔹Receipt gaps: Outputs lack verifiable receipts or telemetry, so drift hides in shadows.
🔹Stale spec: Behavior evolved while the spec froze, forcing agents to hot patch or fail.

The Interest You Pay

🔹Cycle-time drag: more clarifications, retries, and manual fixes.
🔹Entropy: divergent implementations across teams and agents.
🔹Compliance risk: missing proofs for decisions, data paths, and access.
🔹Agent plateau: models look weak because the rules are weak or implicit.
🔹Onboarding friction: new agents and teammates cannot infer the playbook.

Make It Measurable (Scoreboard)

Turn spec debt into a measurable backlog so teams prioritize the work instead of debating anecdotes, and update the scoreboard every time a gap closes.

🔹SCR (Spec Coverage Ratio): # behaviors governed by promoted executable specs / total behaviors; target >0.9 for core surfaces.
🔹EC (Eval Coverage): # behaviors with automated evals / total behaviors; target >0.8 for user-visible or risky flows.
🔹AAR (Agent-Authorship Rate): % of code lines or changes authored by agents from specs vs humans; a falling AAR signals rising spec debt.
🔹MTTS (Mean Time to Spec): Time from first human workaround to committed spec, policy, or eval; keep it in hours or days.
🔹RC (Receipt Completeness): % of critical actions producing signed, queryable receipts; no receipts means you are guessing.
🔹RCR (Risk Conformance Rate): % of changes routed through the correct risk tiers and gates; low RCR equals governance theater.

Pipe these metrics into dashboards or your existing telemetry so leadership can see debt trending alongside delivery velocity.

The Pay-Down Loop

Run this five-step loop every time spec debt appears so you can replace heroics with predictable execution.

🔹Capture a receipt: Reproduce the case and log a verifiable receipt - inputs, steps, outputs, decisions, context - using existing telemetry.
🔹Extract the delta: Compare intended versus produced behavior, note the missing or fuzzy rule, and author a minimal failing scenario.
🔹Author the rule: Update the spec, policy, and evals; assign the correct risk tier so agents know how tightly to gate the change.
🔹Generate & gate: Let agents re-scaffold and implement from the new rules, and require evals plus risk gates to pass before merge.
🔹Promote & observe: Promote the spec to source of truth, emit receipts for ongoing drift detection, and mark the Spec-Debt-ID closed.

Rituals That Keep You Honest

🔹Update the pull request template so it captures the spec or policy change, covering evals, the risk tier, and receipt links; any human-authored code logs a Spec-Debt-ID.
🔹Host a ten-minute Spec Standup: review new Spec-Debt-IDs, MTTS performance, top failing evals by risk tier, and the current AAR trend.
🔹Invoke spec bankruptcy when SCR or AAR slide for two weeks; pair that freeze with your next MGP Sprint so the loudest debts convert into executable rules.

Anti-Patterns to Eliminate

🔹“We'll spec it later.” Interest compounds every day the gap persists.
🔹Comment-as-spec. Notes and wiki pages are not executable; agents cannot enforce them.
🔹Doc-driven theater. A pretty wiki without gates is governance cosplay.
🔹One giant risk tier. Treating everything as "medium" is equivalent to "none.""
🔹Mystery metrics. If evals cannot trace to receipts and deployments, you are guessing.

The Frontier Exception

Some work still needs direct human authorship today, but treat it as an exception rather than an operating model. Frontier research, novel interfaces, or handmade performance kernels may fall outside existing spec languages for a moment.

Even then, log the Spec-Debt-ID, attach receipts, and write the eval that describes the rule you wish existed. Emergency patches to halt an incident qualify, yet they should convert to hardened specs before planning resumes. If you are still patching the same surface two weeks later, you are normalizing debt.

Org Design for a Spec-First World

🔹Spec Editors (PM/domain + staff engineering): Own domain contracts, statecharts, and the evolution path.
🔹Policy Authors (risk/compliance/trust): Encode policies, risk tiers, residency rules, and privacy constraints in executable form.
🔹Eval Engineers (QA/ML ops): Build the test oracles, golden sets, and scenario libraries that agents must pass.
🔹Agent Operators (platform): Keep the agent toolchain fast, observable, deterministic, and easy to debug.
🔹Everyone: Logs Spec-Debt-IDs whenever they author implementation code; zero shame, high tempo.

A Must-Adopt Checklist

🔹Executable specs for your top ten surfaces: schemas, statecharts, and contracts.
🔹A policy engine with tiered gates across low, medium, high, and critical flows.
🔹An eval suite that blocks merges on critical behaviors, not just happy paths.
🔹Receipts for proofs: inputs, decisions, outputs, approvers, and versions.
🔹A scoreboard that broadcasts SCR, EC, AAR, MTTS, RC, and RCR to the full team.
🔹A pull request template wired to specs, policies, evals, receipts, and risk tiers, plus a spec bankruptcy ritual when metrics slide.

A Quick Example

A team ships a pricing endpoint and sprinkles hand-written if/else blocks for promotions and regions. It works until a different agent “fixes” EU pricing and breaks APAC because the underlying rules live in brittle code. Symptoms pile up: divergent logic, fragile tests, confused engineers, and late-night pages.

The fix is to write pricing_rules.v2 with region-to-promotion precedence, declare pricing as a high risk tier, add evals for EU VAT, APAC rounding, and U.S. promo stacking, regenerate scaffolds, and block merges on failing evals. Every decision emits receipts keyed to rule versions, so agents stay aligned and humans edit rules instead of patching glue code.

Pushback and Responses

🔹“This will slow us down.” Once specs run, speed increases because agents stop asking clarifying questions and divergent code stops shipping.
🔹“Our domain is too nuanced to encode.” Then encode the nuance, starting with the 20 percent of rules that cause 80 percent of confusion.
🔹“We're creative; specs will box us in.” Creativity belongs in designing better rules and frontier ideas, not retyping the same logic each quarter.

The Commitment

Hold quarterly SLOs that keep spec debt from creeping back: zero new spec debt for tier-high behaviors, MTTS under three days, SCR at or above 0.9, and AAR trending upward. Track the loop so teams know when the system is healthy versus slipping into fire drills.

In AI-native teams, code is no longer the product - the rules are. Agents can author endless code, but only humans decide which behaviors are allowed, how they are graded, and what “good” means.

What to Do Next

Request a working session or demo to see the stack in action, and reach out via contact us to align OutcomeOS, Receipt Studio, and your governance rituals.

Treat every line of human implementation as a spec alarm.

Spec Debt: The costliest new tech debt in the age of AI agents