July 1, 2026

How Spec-Driven Development Keeps AI-Generated Code Production-Ready: A Conversation with AI R&D Head at Itexus

Interview with Sergey Privalov, Co-Founder and AI R&D Head at Itexus

AI-generated code has changed software delivery. Teams can now produce implementation much faster, but generating code is only part of the process. They also need a reliable way to prove that the generated code still matches product, business, and compliance requirements.

We spoke with Sergey Privalov, Co-Founder and AI R&D Head at Itexus, about spec-driven development (hereafter referred to as SDD), behavior-driven development, and executable specifications in AI-assisted delivery.

Itexus: We see more teams moving from casual AI prompting to structured AI-assisted delivery. Spec-driven development looks like a clear response to vibe coding: first define the spec, then let the agent plan and implement. From your perspective, is spec-driven development actually new, or is it a familiar practice with a new AI layer?

Sergey Privalov: I am still unsure there is a final answer. I have been building specs for a small project lately, and I keep returning to the same question: is this new wave of spec-driven development genuinely new, or is it behavior-driven development with a fresh badge and an agent attached?

Even the people who shaped these practices do not fully agree. Gojko Adzic, who wrote extensively about executable specifications, looked at GitHub’s Spec Kit and questioned whether SDD is waterfall returning through the back door or BDD taken up a level. Dan North, who coined the term BDD, discussed the same uncertainty with him. When the originators are still debating it, I think we should treat this as an open discussion.

Itexus: That uncertainty is useful. In client work, especially in fintech, we rarely see a pure methodology problem. The real issue is whether the process creates enough clarity, traceability, and control for the product being built. What does SDD get right?

Sergey Privalov: SDD is a real step up from prompting an agent and hoping for the best. GitHub Spec Kit, AWS Kiro, and the broader pattern follow a structured flow: specify, plan, break into tasks, then implement. Each phase produces a Markdown artifact that feeds the next.

That helps because it forces clarity on what and why before the agent starts inventing how. Gojko’s interpretation is that SDD formalizes the patterns that already work when people coach an agent through a build.

Itexus: The benefit is structure. But in real delivery, especially in regulated products, a written spec can quickly become outdated. We often see documentation turn into something teams trust less over time. Is that the weak point?

Sergey Privalov: That is exactly what bothers me. The artifacts are prose. Prose drifts away from code the day after you write it, and nothing mechanically ties the two back together.

You can re-audit by spawning more agent runs, but that burns tokens, scales poorly, and gives you another opinion instead of a guarantee. The spec can quietly become dead weight in the repo. If it becomes subtly stale, the agent may still trust it.

Itexus: So the bottleneck moves. Teams used to worry about how fast developers could write code. Now they need to worry about how fast they can verify it.

Sergey Privalov: Yes. The numbers point in that direction. Reporting through 2026 suggests AI now writes around half of new code, while the share of code deleted or rewritten within thirty days has increased sharply. That suggests a churn problem rather than a pure productivity win.

This is also why people like Dave Farley argue that the bottleneck has shifted. If developers can generate thousands of lines of code a day, careful manual review no longer scales. Trust has to come from executable specifications and continuous verification.

Even OpenAI’s Codex team reportedly built over a million lines with no human-written code and concluded that a flat instruction document rots and cannot be mechanically verified.

So SDD gives you a plan. What it does not give you is a way to prove the code still honors that plan.

Itexus: Is this where behavior-driven development (BDD) comes back into the conversation? Many teams still associate it with traditional testing, but it seems useful for a very current AI problem: turning requirements into something both humans and machines can check.

Sergey Privalov: That is why I think BDD deserves another look. I do not see it as a rival to SDD. I see it as the part SDD is missing.

The point of Given-When-Then is that the same artifact is readable by a human and executable by a machine. You describe behavior from the outside, in a black-box and intent-first way. Then a runner turns those scenarios into end-to-end checks.

If a scenario breaks, the build breaks. The specification and the verification become the same object. That is the property prose specs lack.
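To make the "same object" idea concrete, here is a minimal Python sketch. It is an illustration only, not a real BDD framework and not something the interview describes Itexus using. The `Scenario` type, the `run` and `exit_code` helpers, and the withdrawal example are all hypothetical; a real team would more likely rely on an existing Gherkin runner.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Scenario:
    """One Given-When-Then scenario: readable text plus the code that checks it."""
    given: str
    when: str
    then: str
    check: Callable[[], bool]  # returns True when observed behavior matches

    def describe(self) -> str:
        return f"Given {self.given}\nWhen {self.when}\nThen {self.then}"


def run(scenarios: list[Scenario]) -> list[str]:
    """Run every scenario and return the descriptions of the ones that failed."""
    return [s.describe() for s in scenarios if not s.check()]


def exit_code(scenarios: list[Scenario]) -> int:
    """Print failures and return 1 if any scenario failed, else 0."""
    failures = run(scenarios)
    for text in failures:
        print("FAILED:\n" + text)
    return 1 if failures else 0


# Hypothetical system under test: a withdrawal that must never overdraw.
def withdraw(balance: int, amount: int) -> int:
    if amount > balance:
        raise ValueError("insufficient funds")
    return balance - amount


def overdraw_is_rejected() -> bool:
    try:
        withdraw(balance=100, amount=150)
    except ValueError:
        return True
    return False


SCENARIOS = [
    Scenario(
        given="an account with a balance of 100",
        when="the client withdraws 150",
        then="the withdrawal is rejected",
        check=overdraw_is_rejected,
    ),
]
```

In this sketch, a CI step could call `sys.exit(exit_code(SCENARIOS))`, so a broken scenario produces a non-zero exit status and fails the build, while the text a reviewer reads comes from the same `Scenario` object.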

Itexus: What changes when an AI agent works against Given-When-Then scenarios rather than a prose spec?

Sergey Privalov: An agent is good at slicing a high-level requirement into Given-When-Then scenarios and then writing code to satisfy them. In that workflow, scenarios double as a harness the agent can run repeatedly during a session.

The spec stops being a plan the code might respect. It becomes a test the build enforces.
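The "run repeatedly during a session" loop can be pictured roughly as follows. This is a sketch under stated assumptions: `iterate_until_green`, `run_checks`, and `stub_agent` are hypothetical names, and the "agent" is a stub that flips a flag, standing in for whatever coding agent a team actually uses.

```python
from typing import Callable


def iterate_until_green(
    run_checks: Callable[[], list[str]],
    ask_agent_to_fix: Callable[[list[str]], None],
    max_rounds: int = 5,
) -> bool:
    """Re-run the scenarios after each agent attempt; stop when all pass."""
    for _ in range(max_rounds):
        failures = run_checks()
        if not failures:
            return True
        ask_agent_to_fix(failures)  # the failures are the agent's feedback
    return not run_checks()


# Stub "codebase": the rule starts out wrong (sum instead of max).
state = {"rule": "sum"}


def run_checks() -> list[str]:
    paid = 1200 + 1800 if state["rule"] == "sum" else max(1200, 1800)
    if paid == 1800:
        return []
    return ["survivor should receive the higher benefit, not the sum"]


def stub_agent(failures: list[str]) -> None:
    # Stand-in for an agent edit; a real agent would change source code.
    state["rule"] = "max"


print(iterate_until_green(run_checks, stub_agent))  # True
```

The `max_rounds` cap is a design choice in this sketch: it keeps a session from looping indefinitely when the agent cannot satisfy a scenario, at which point a human would need to look.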

Itexus: That also makes the spec useful outside engineering. In fintech projects, business analysts, developers, QA, compliance, and product owners often look at the same feature from different angles. How does BDD help reduce translation loss between those roles?

Sergey Privalov: That is one of the strongest arguments for BDD. Given-When-Then came out of the “three amigos” idea: business, development, and testing working from one scenario.

The same scenario becomes the requirement the BA owns, the build target the developer codes to, and the test case QA verifies. Gojko’s earlier work focused on bridging the communication gap between customers, analysts, developers, and testers. One artifact serves three readers, with less translation loss.

Itexus: In fintech or healthcare, that shared artifact can also become a control point. The cost of wrong behavior is higher: financial loss, legal exposure, or patient harm. Is this where human-readable specs become critical?

Sergey Privalov: That is why the human-readable layer matters. In fintech and healthcare, teams often say they want a human in the loop. But that only means something if the human can actually read and challenge what the agent produced.

Approving twelve thousand lines of generated code a day does not create meaningful oversight. A Given-When-Then scenario gives reviewers a different review surface. A domain expert, compliance reviewer, or QA lead can challenge a line that says the surviving spouse receives the higher of two benefit amounts instead of the sum. They do not need to read the implementation.

The scenario is where their judgment gets exercised, recorded, and made traceable.

Itexus: Regulation is moving in the same direction: human oversight has to be demonstrable, especially for high-risk AI systems. What kind of artifact can show that oversight happened?

Sergey Privalov: The NIST AI Risk Management Framework asks for human oversight on high-risk uses. The EU AI Act’s Article 14 makes demonstrable oversight of high-risk systems a legal expectation. Under the Digital Omnibus adopted in June 2026, the high-risk obligations were deferred to December 2027, but the requirement itself did not change. The industry simply got more runway to build oversight that will actually hold up.

That is where a readable, executable acceptance suite becomes useful. An auditor can review it, a domain expert can challenge it, and the engineering team can run it against the code.

The pitch to a delivery team should avoid “adopt AI to go faster.” I would frame it this way: write the behavior down in a structured form your BA and QA already understand, put it in the repo where the agent can read it, and acceleration comes with an audit trail.

Itexus: Do you have a real project where this approach changed how the team worked?

Sergey Privalov: Yes. We are rebuilding a legacy retirement-income planning tool: a deterministic cashflow projection product with annuity modeling. We had no source code and no documentation, so we treated the existing application as a black box.

We reverse-engineered its behavior from screen recordings and transcripts. Before writing replacement code, we captured what we observed as Given-When-Then scenarios.

One immediate benefit was that domain rules became executable checks instead of assumptions. One scenario captures how the system calculates a simple roll-up income base from deposit, bonus, growth rate, and deferral. Another captures the survivor benefit switching rule: when the client passes away, the surviving spouse receives the higher of the two benefit amounts, and the system does not add both amounts together.

That second rule is easy to implement incorrectly during a reimplementation and expensive to fix after release. As an executable check, it fails the build when an agent drifts. As a sentence, a domain expert can review it in seconds.
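As an illustration, the survivor rule could be written as an executable check along these lines. This is a minimal sketch: the function name `survivor_benefit`, its two-argument signature, and the benefit figures are assumptions made for the example, since the interview does not describe the real product's inputs or code.

```python
def survivor_benefit(own_benefit: float, deceased_benefit: float) -> float:
    """Survivor receives the higher of the two benefit amounts, never the sum."""
    return max(own_benefit, deceased_benefit)


def test_survivor_receives_higher_benefit_not_sum() -> None:
    # Given a couple with benefits of 1,200 and 1,800 (made-up figures)
    survivor_own, deceased = 1200.0, 1800.0
    # When the client passes away
    paid = survivor_benefit(survivor_own, deceased)
    # Then the surviving spouse receives the higher of the two amounts
    assert paid == 1800.0
    # And the system does not add both amounts together
    assert paid != survivor_own + deceased
```

If a reimplementation replaced `max` with a sum, the first assertion would fail, which is the "fails the build" behavior described above; the comments carry the sentence a domain expert would review.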

Itexus: So the same scenario becomes useful for several roles at once: a domain expert can review the rule, developers can implement it, and QA can verify it.

Sergey Privalov: Yes, and the second property is traceability. Every scenario carries a pointer back to the evidence we observed: a timestamp in a recording or a page in a screenshot set.

That is what makes me trust the output. I am not relying on faith in the model. Each requirement is traceable to a source and mechanically verifiable against the code.
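One way such evidence pointers could be represented and enforced is sketched below. The `Evidence` and `TracedScenario` types, the `untraced` helper, and the file name and timestamp are all hypothetical; the interview only says that each scenario points to a timestamp in a recording or a page in a screenshot set.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Evidence:
    source: str   # recording or screenshot-set name (made up here)
    locator: str  # timestamp such as "00:14:32" or a page such as "p. 7"


@dataclass(frozen=True)
class TracedScenario:
    title: str
    evidence: Optional[Evidence]


def untraced(scenarios: list[TracedScenario]) -> list[str]:
    """Return titles of scenarios with no evidence pointer or a blank one."""
    return [
        s.title
        for s in scenarios
        if s.evidence is None
        or not s.evidence.source.strip()
        or not s.evidence.locator.strip()
    ]


scenarios = [
    TracedScenario(
        "Survivor receives the higher benefit",
        Evidence("session-03.mp4", "00:14:32"),
    ),
    TracedScenario("Simple roll-up income base", None),
]

print(untraced(scenarios))  # ['Simple roll-up income base']
```

A check like `untraced` could run alongside the scenarios themselves, so a requirement without a source is flagged in the same place as a requirement the code fails to meet.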

Itexus: This sounds useful, but probably not frictionless. BDD has a history of becoming heavy when teams maintain too many step definitions. Some product people also see Gherkin as too technical. What risks should teams take seriously?

Sergey Privalov: Those objections are real. BAs and product owners sometimes resist Gherkin because they see it as too technical or outside their role. The readability I am describing is not automatic. It is a habit the team has to build.

There is also the inverse risk. Putting someone “in the loop” without giving them a real decision to make is liability dressed up as process.

I also have my own bias here. I have seen Gherkin step-definition layers become a maintenance burden as products grow. I am still questioning whether, in an agent-native setup, that readable layer earns its keep or whether the agent should generate test code directly.

My instinct is that the human-readable contract matters more now because it is the shared surface across roles. But that is an instinct, not a finding.

Itexus: There is also a broader question. Models are improving quickly. Some scaffolding that feels necessary today may become unnecessary later.

Sergey Privalov: Exactly. North and Adzic talk about treating each agent session like onboarding a new developer. Anthropic’s writing on agent harnesses makes a similar point: every piece of scaffolding assumes the model cannot do something yet, and that assumption can expire as models improve.

So how much of this BDD harness is durable, and how much are we building out of habit for a model that may not need it next year? I do not know.

Itexus: If we bring this back to delivery teams building financial, healthcare, or other regulated software, the practical takeaway is clear: AI can speed up implementation, but speed alone does not make generated code production-ready. The missing layer is verification, traceability, and a review surface people can actually use.

Sergey Privalov: My tentative position is that SDD got the instinct right: stop letting the agent improvise the entire system from a one-line prompt. But SDD reaches for something BDD and Specification by Example largely worked out twenty years ago: a spec that can verify itself.

The bonus, and maybe the real point in regulated domains, is that this kind of spec was always meant to be written with the business and QA, not just for the machine. That makes it a credible human-in-the-loop surface in exactly the places where reliability is tied to financial, legal, or safety outcomes.

But I would rather argue about this than assert it. If teams have run BDD-style executable specs with coding agents, especially in fintech, healthtech, or another regulated environment, I want to know whether it felt like the missing part or like forcing two workflows together. And if someone works as a BA or in QA, I would ask: did this feel familiar, or did it feel like a new burden?

That is where the discussion should continue.

Itexus: Thank you, Sergey. We have a feeling this conversation is only the beginning. As AI-assisted development continues to evolve, we’ll return to these questions with new experiments, real projects, and hopefully a few more answers.
