Harnessing AI with Spec-Driven Development
I’ve been working with AI coding assistants for a while now. Copilot, Claude Code, Cursor — you name it, I’ve probably tried it. And if you’ve done the same, you know the pattern: it starts great, then the AI goes off the rails. It forgets what you told it three prompts ago. It hallucinates a function that doesn’t exist. It rewrites half your codebase when you asked for a one-line fix.
The tool isn’t the problem. The structure around it is.
So I built a framework at work to fix this: a spec-driven development toolkit that steers AI assistants toward consistent, high-quality work. It was heavily inspired by Kiro and spec-kit, two projects that showed me the power of structured, spec-driven approaches to AI coding. I can’t share the proprietary details, but I want to walk through the architecture and the patterns that made the biggest difference.
The Problem with Unstructured AI
When you drop an AI into a codebase with no guidance, it does what any new developer would do with no onboarding — it guesses. It guesses your conventions, your architecture, your testing strategy. Sometimes it guesses right. Often it doesn’t.
The real issue is context. AI models have a limited context window, and most of the time we’re wasting it. We dump entire files in, hope the model figures out what matters, and then wonder why the output is inconsistent.
What if instead of hoping the AI understands your project, you told it — in a structured, decomposed way?
Two Layers of Context
The framework is built around a two-layer context model: steering documents and rules.
graph TB
subgraph Steering["Steering Docs (Project-Level)"]
P[Product Vision]
T[Tech Stack]
S[Structure]
A[Architecture]
TS[Testing Strategy]
PR[Principles]
end
subgraph Rules["Rules (Path-Scoped)"]
R1["code-style.md<br/>applies to: src/**"]
R2["testing.md<br/>applies to: tests/**"]
R3["security.md<br/>applies to: api/**"]
end
Steering -->|"loaded every session"| AI[AI Agent]
Rules -->|"loaded conditionally<br/>based on file path"| AI
style P fill:#3498db,color:#fff,stroke:#2980b9
style T fill:#3498db,color:#fff,stroke:#2980b9
style S fill:#3498db,color:#fff,stroke:#2980b9
style A fill:#3498db,color:#fff,stroke:#2980b9
style TS fill:#3498db,color:#fff,stroke:#2980b9
style PR fill:#3498db,color:#fff,stroke:#2980b9
style R1 fill:#5dade2,color:#fff,stroke:#3498db
style R2 fill:#5dade2,color:#fff,stroke:#3498db
style R3 fill:#5dade2,color:#fff,stroke:#3498db
style AI fill:#2c3e50,color:#fff,stroke:#34495e
Steering documents are project-level context files. Instead of one massive README, the framework decomposes your project context into ~6 focused files: product vision, tech stack, directory structure, architecture decisions, testing strategy, and core principles. These get loaded into the AI’s context at the start of every session.
Rules are path-scoped conventions. Different parts of your codebase have different needs — your API layer has different conventions than your frontend components. Rules are loaded conditionally based on which files the AI is working with. This keeps the context window lean and relevant.
The key insight: decompose context so the AI loads only what it needs, when it needs it.
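To make the path-scoping idea concrete, here is a minimal Python sketch of how rules could be selected from the files being touched. It is not the framework’s actual loader. The `RULES` mapping and the `rules_for` helper are hypothetical names, and the three entries simply mirror the diagram above.

```python
from fnmatch import fnmatchcase

# Hypothetical registry: rule file -> glob it applies to.
# The entries mirror the diagram; the mapping format itself is an assumption.
RULES = {
    "code-style.md": "src/**",
    "testing.md": "tests/**",
    "security.md": "api/**",
}


def rules_for(paths):
    """Return the rule files whose glob matches at least one touched path."""
    selected = []
    for rule, pattern in RULES.items():
        # fnmatch's "*" also matches "/", so "api/**" covers nested paths here.
        if any(fnmatchcase(path, pattern) for path in paths):
            selected.append(rule)
    return selected


print(rules_for(["api/users/preferences.py", "tests/test_preferences.py"]))
# -> ['testing.md', 'security.md']
```

Nothing under `src/` was touched, so `code-style.md` never enters the context window.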
The System Prompt Effect
Two of these steering documents — the product vision and the principles — are wired directly into the AI’s system prompt. That means they’re present in every single conversation, not just when you remember to mention them. This has a massive impact on output quality.
Here’s a concrete example. Say you ask the AI: “Add a user preferences endpoint.”
Without steering docs in the system prompt:
The AI has no product context. It might:
- Create a generic REST endpoint with a structure that doesn’t match your existing API patterns
- Store preferences in a new SQLite database when your project uses DynamoDB
- Skip authentication entirely because it doesn’t know your security model
- Name things inconsistently with the rest of your codebase (userPrefs vs. your convention of user_preferences)
You end up spending more time correcting the AI than you would have writing it yourself.
With product.md and principles.md in the system prompt:
The AI knows your product’s domain, your users, your constraints. It knows that your API follows a specific pattern, that all endpoints require auth middleware, that you use a specific data layer. Now the same prompt produces:
- An endpoint that follows your existing route structure and naming conventions
- Data stored in the same database and table patterns as the rest of your features
- Auth middleware applied by default because the principles say “all endpoints require authentication”
- Error handling that matches your established patterns
The difference isn’t subtle — it’s the difference between getting a pull request from a new hire on day one versus a teammate who’s been on the project for six months.
Another example: refactoring. Ask the AI to “refactor the notification service.”
Without context, the AI might extract a clean abstraction that completely breaks your event-driven architecture because it didn’t know that notifications are consumed asynchronously by three other services. With the architecture and principles loaded, it respects the boundaries and refactors within the existing patterns.
To understand why this matters so much, it helps to visualize how an AI coding assistant like Claude Code actually prioritizes context:
block-beta
columns 1
block:L1["1. Base System Prompt & System Prompt Files"]
columns 2
L1A["Agent instructions\nCore capabilities"]
L1B["product.md\nprinciples.md"]
end
block:L2["2. Tool Definitions (~50 tools)"]
columns 1
L2A["Bash, file system, grep, MCP tools, etc."]
end
block:L3["3. User Content"]
columns 1
L3A["CLAUDE.md, memory\nLoaded every session"]
end
block:L4["4. Conversation History"]
columns 2
L4A["Messages, reasoning\ntool calls"]
L4B["tech.md, architecture.md\nrules (loaded by skills)"]
end
block:L5["5. Attachments"]
columns 1
L5A["Per-turn specs, @-mentions, parameters"]
end
block:L6["6. Skills"]
columns 1
L6A["Relevant or user-specified skills appended at the end"]
end
style L1 fill:#2c3e50,color:#fff
style L1B fill:#f39c12,color:#fff,stroke:#f5b041,stroke-width:3px
style L2 fill:#34495e,color:#fff
style L3 fill:#3498db,color:#fff
style L4 fill:#5dade2,color:#fff
style L4B fill:#f5b041,color:#fff,stroke:#f39c12,stroke-width:2px
style L5 fill:#85c1e9,color:#2c3e50
style L6 fill:#aed6f1,color:#2c3e50
As Drew Breunig explains in his excellent breakdown of how Claude Code builds a system prompt, the system prompt alone is assembled dynamically from ~30 components, alongside ~50 tool definitions and about a dozen methods for compacting and summarizing the conversation as it grows.
Not all steering docs are created equal — and the framework is deliberate about where each one lives:
- product.md and principles.md are wired as systemPromptFiles, which places them at layer 1, the system prompt itself. They’re always present, never truncated, and they shape every response the AI gives. These are your “who we are and what we never compromise on” docs.
- CLAUDE.md and memory live at layer 3: loaded every session, but at a slightly lower priority. This is where project-specific instructions and persistent notes go.
- The rest of the steering docs (tech stack, architecture, testing strategy) and path-scoped rules are loaded by skills at runtime and land in layer 4 — conversation history. They’re only brought in when relevant, and they get compacted and summarized as the context window fills up.
This layering is intentional. You don’t need the full testing strategy in every conversation — but you always need the AI to know your product vision and your hard constraints. By placing only the two most critical docs at the highest priority and loading everything else on demand, the framework stays token-efficient while keeping the AI aligned.
This is why a random prompt like “add a user preferences endpoint” produces wildly different results depending on what’s loaded in those top layers. The AI isn’t smarter with the framework — it just has the right context at the right priority level.
The lesson: the system prompt is the most valuable real estate in your AI workflow. Don’t waste it on generic instructions. Fill it with your product’s identity and your team’s hard-won engineering principles.
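The layering described in this section can be approximated in a few lines of Python. This is a rough sketch under stated assumptions, not how Claude Code or the framework actually assembles context: the `Doc` type, the `assemble` function, the four-characters-per-token estimate, and the hard budget cutoff are all illustrative simplifications.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    name: str
    text: str
    always_on: bool = False  # True = pinned to the system prompt layer


def estimate_tokens(text):
    # Crude assumption: roughly 4 characters per token.
    return max(1, len(text) // 4)


def assemble(docs, relevant, budget):
    """Pin always-on docs first, then add relevant docs while the budget allows."""
    system_layer = [d for d in docs if d.always_on]
    used = sum(estimate_tokens(d.text) for d in system_layer)
    on_demand = []
    for d in docs:
        if d.always_on or d.name not in relevant:
            continue
        cost = estimate_tokens(d.text)
        if used + cost <= budget:
            on_demand.append(d)
            used += cost
    return [d.name for d in system_layer], [d.name for d in on_demand], used


docs = [
    Doc("product.md", "x" * 400, always_on=True),
    Doc("principles.md", "x" * 400, always_on=True),
    Doc("tech.md", "x" * 800),
    Doc("architecture.md", "x" * 800),
    Doc("testing.md", "x" * 800),
]
print(assemble(docs, relevant={"architecture.md", "testing.md"}, budget=450))
# -> (['product.md', 'principles.md'], ['architecture.md'], 400)
```

The two pinned docs always make it in. Everything else competes for whatever budget is left, which is the trade-off the layering is meant to manage.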
The Execution Flow
Instead of freeform prompting, the framework defines a structured workflow. Each step is a discrete “skill” — a focused operation the AI knows how to execute with clear inputs and outputs.
graph LR
A["Idea"] --> B["Brainstorm"]
B --> C["Backlog"]
C --> D["Spec"]
D --> E["TODO"]
E --> F["Implement"]
F --> G["Review"]
G --> H["Changelog"]
style A fill:#2c3e50,color:#fff,stroke:#34495e
style B fill:#5dade2,color:#fff,stroke:#3498db
style C fill:#5dade2,color:#fff,stroke:#3498db
style D fill:#3498db,color:#fff,stroke:#2980b9
style E fill:#3498db,color:#fff,stroke:#2980b9
style F fill:#2c3e50,color:#fff,stroke:#34495e
style G fill:#f39c12,color:#fff,stroke:#e67e22
style H fill:#2c3e50,color:#fff,stroke:#34495e
Here’s how it works:
- Brainstorm — Validate an idea against the product vision. Is it worth building? Does it conflict with existing plans?
- Backlog — A living document of discovered issues and feature requests, categorized by priority.
- Spec — The most important step. Every feature gets a formal specification before any code is written.
- TODO — A focused sprint plan pulled from the backlog, scoped to what can be done now.
- Implement — The AI writes code guided by the spec, steering docs, and rules.
- Review — Automated review against principles and conventions. Not just linting — architectural compliance.
- Changelog — Completed work gets archived with context for future reference.
The magic is that each skill has a constrained scope. The brainstorm skill doesn’t write code. The review skill doesn’t add features. This prevents the AI from going off-script.
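One way to picture a constrained scope is as an allow-list of actions per skill. The sketch below is only an illustration. The action names (`write_code`, `write_report`, and so on) and the `authorize` helper are invented for the example and say nothing about how the real framework enforces scope.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Skill:
    name: str
    allowed: frozenset  # actions this skill may perform


# Hypothetical action names; the real framework's permissions are not public.
SKILLS = {
    "brainstorm": Skill("brainstorm", frozenset({"read_docs", "write_notes"})),
    "implement": Skill("implement", frozenset({"read_docs", "write_code", "run_tests"})),
    "review": Skill("review", frozenset({"read_docs", "read_code", "write_report"})),
}


def authorize(skill_name, action):
    """Raise if the skill tries an action outside its declared scope."""
    skill = SKILLS[skill_name]
    if action not in skill.allowed:
        raise PermissionError(f"{skill.name} may not {action}")
    return True


print(authorize("implement", "write_code"))  # -> True
try:
    authorize("brainstorm", "write_code")
except PermissionError as err:
    print(err)  # -> brainstorm may not write_code
```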
The 3-File Spec
This is probably the pattern I’m most proud of. Every feature gets decomposed into three files:
graph LR
subgraph Spec["Feature Spec"]
REQ["requirements.md<br/>WHAT<br/>User story<br/>Acceptance criteria"]
DES["design.md<br/>HOW<br/>Technical approach<br/>Affected files"]
TSK["tasks.md<br/>DO<br/>Ordered checklist<br/>Atomic steps"]
end
REQ --> DES --> TSK
style REQ fill:#f39c12,color:#fff,stroke:#e67e22
style DES fill:#3498db,color:#fff,stroke:#2980b9
style TSK fill:#2c3e50,color:#fff,stroke:#34495e
- requirements.md (WHAT) — The user story and acceptance criteria. What does this feature do? How do we know it’s done?
- design.md (HOW) — The technical approach. Which files are affected? What patterns should be used? What are the trade-offs?
- tasks.md (DO) — An ordered checklist of atomic implementation steps. Each task is small enough to verify independently.
This mirrors how a senior engineer naturally decomposes work. The beauty is that each file serves the AI at a different phase — requirements during planning, design during implementation, tasks during execution and review.
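As a rough illustration, a spec could be scaffolded like this. The three file names come from the pattern above. The `specs/<feature>/` layout, the exact section headings, and the `scaffold_spec` function are assumptions made for the sketch.

```python
from pathlib import Path
import tempfile

# Section names follow the descriptions above; the exact headings are assumptions.
SPEC_TEMPLATE = {
    "requirements.md": ["User Story", "Acceptance Criteria"],
    "design.md": ["Technical Approach", "Affected Files", "Trade-offs"],
    "tasks.md": ["Tasks"],
}


def scaffold_spec(root, feature):
    """Create specs/<feature>/ with the three files and empty section headings."""
    spec_dir = Path(root) / "specs" / feature
    spec_dir.mkdir(parents=True, exist_ok=True)
    for filename, sections in SPEC_TEMPLATE.items():
        body = "\n".join(f"## {section}\n" for section in sections)
        (spec_dir / filename).write_text(body, encoding="utf-8")
    return spec_dir


with tempfile.TemporaryDirectory() as tmp:
    created = scaffold_spec(tmp, "user-preferences")
    print(sorted(p.name for p in created.iterdir()))
    # -> ['design.md', 'requirements.md', 'tasks.md']
```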
Harnessing Patterns That Actually Work
Beyond the structure, there are specific patterns that dramatically improved the quality of AI output:
Deterministic Hooks
Not everything needs an LLM. The framework uses simple shell scripts to enforce preconditions — checking that a changelog was updated before committing, validating that spec files have the right sections, ensuring tests exist for new features. These run at zero token cost and catch issues before the AI even sees them.
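The framework uses shell scripts for this. Purely for illustration, here is the same kind of check sketched in Python: a function that reports which required sections a spec file is missing. The `REQUIRED` headings and the `missing_sections` name are hypothetical.

```python
import re

# Hypothetical required headings per spec file.
REQUIRED = {
    "requirements.md": ["User Story", "Acceptance Criteria"],
    "design.md": ["Technical Approach", "Affected Files"],
    "tasks.md": ["Tasks"],
}


def missing_sections(filename, text):
    """Return required headings that do not appear as markdown headings in text."""
    headings = {
        match.group(1).strip().lower()
        for match in re.finditer(r"^#+[ \t]+(.+)$", text, flags=re.MULTILINE)
    }
    return [s for s in REQUIRED.get(filename, []) if s.lower() not in headings]


draft = "# User Story\nAs a user, I want to save my preferences.\n"
print(missing_sections("requirements.md", draft))
# -> ['Acceptance Criteria']
```

Wired into a pre-commit hook, a non-empty result would exit with a non-zero status and block the commit without spending a single token.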
Dynamic Context Injection
Instead of loading everything into the context window and hoping for the best, the framework figures out which steering docs and rules are relevant to the current task and injects only those. Working on the API layer? You get the architecture doc and the security rules. Working on tests? You get the testing strategy and test conventions. This keeps the context focused and the output consistent.
Assumption Confirmation
This one saved us from countless wrong turns. When the AI hits an ambiguous decision — which module should own this feature? should this be async or sync? — it surfaces the decision to the human with a recommended default instead of silently guessing. The human stays in the loop on decisions that matter, and the AI handles the execution.
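A minimal sketch of the pattern, assuming a simple text prompt as the channel back to the human: the `Decision` type and `resolve` function are invented for this example, and `ask` stands in for whatever the real tool uses to reach you.

```python
from dataclasses import dataclass


@dataclass
class Decision:
    question: str
    options: list
    recommended: str


def resolve(decision, ask):
    """Surface the decision via `ask`; use the recommended default on empty input."""
    prompt = f"{decision.question} {decision.options} [default: {decision.recommended}] "
    answer = ask(prompt).strip()
    if not answer:
        return decision.recommended
    if answer not in decision.options:
        raise ValueError(f"unknown option: {answer}")
    return answer


decision = Decision("Should delivery be async or sync?", ["async", "sync"], "async")
print(resolve(decision, ask=lambda _prompt: ""))      # human just presses Enter -> async
print(resolve(decision, ask=lambda _prompt: "sync"))  # human overrides -> sync
```

In an interactive session you could pass the built-in `input` as `ask`. The point is that the default is a recommendation the human can see and override, not a silent guess.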
Multi-Agent Parallel Execution
For discovery tasks like scanning a codebase for issues, the framework spawns multiple specialized agents in parallel — each focused on a different domain (security, performance, code quality, etc.). Think of it like CI jobs: if two agents don’t touch the same concerns, why serialize them? The results get consolidated into a single prioritized backlog.
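Here is a small sketch of the fan-out and consolidate shape, with stub functions standing in for agents. The agent names, the findings they return, and the `(priority, finding)` tuple format are placeholders made up for the example. Real agents would call an LLM rather than return canned data.

```python
from concurrent.futures import ThreadPoolExecutor


# Stub agents returning (priority, finding) tuples; lower number = more urgent.
# The findings are made-up placeholders, not real scan results.
def security_agent(repo):
    return [(1, "api/: endpoint missing auth middleware")]


def performance_agent(repo):
    return [(2, "src/: N+1 query in list view")]


def quality_agent(repo):
    return [(3, "src/: duplicated validation helper")]


AGENTS = [security_agent, performance_agent, quality_agent]


def scan(repo):
    """Run all agents concurrently and merge findings into one sorted backlog."""
    with ThreadPoolExecutor(max_workers=len(AGENTS)) as pool:
        batches = list(pool.map(lambda agent: agent(repo), AGENTS))
    findings = [item for batch in batches for item in batch]
    return sorted(findings)


for priority, finding in scan("."):
    print(f"P{priority}: {finding}")
```

Because each agent reads the codebase and reports findings without modifying anything, they can run side by side, and the only coordination needed is the final merge.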
Skill-Based Orchestration
Instead of one mega-prompt that tries to do everything, work is broken into discrete skills with focused purposes. Each skill has defined inputs, outputs, and constraints. The scan skill discovers issues. The spec skill creates specifications. The review skill validates changes. This composability is what makes the framework reliable — each piece is small enough to be predictable.
What I Learned
Building this framework changed how I think about AI-assisted development. A few takeaways:
Structure beats prompting. The best prompt in the world won’t save you if the AI doesn’t understand your project’s architecture, conventions, and constraints. Invest in context, not cleverness.
Decomposition is everything. Breaking context into focused documents, breaking workflows into discrete skills, breaking specs into three files — every time I decomposed something, the output quality jumped.
Keep humans in the loop on decisions, not execution. The AI is great at writing code, running tasks, following patterns. It’s terrible at making architectural decisions without context. The assumption confirmation pattern — where the AI surfaces ambiguity instead of guessing — was the single biggest quality improvement.
Deterministic validation saves tokens and sanity. Not every check needs an LLM. A shell script that validates file structure is faster, cheaper, and more reliable than asking the AI to self-check.
The best AI hack isn’t a better prompt — it’s a better process.