What happens when you stop treating AI as an autocomplete tool and start treating it as an engineering partner — and what your team needs to get there.

The Rules Have Changed
In February 2026, OpenAI published a landmark article called Harness Engineering. Their team had built and shipped a real software product — used daily by hundreds of people — with zero lines of hand-written code. Every line was authored by AI agents. Humans set direction. Agents executed.
The result? An estimated 10x speed improvement. Roughly 3.5 pull requests per engineer per day. A million lines of code in five months with a three-person team.
But here’s the part that matters most for business leaders: the speed didn’t come from a better AI model. It came from a better-designed environment for the AI to work in.
At Itexus, we’ve been proving this out on real client-grade work — most recently, building an autonomous foreign-exchange trading platform called Apex Sentinel. This article shares what we’ve learned about the emerging discipline of harness engineering: what it is, why it matters, and how to apply it to complex, domain-intensive software projects.
What Is Harness Engineering?
Think of it this way: a general-purpose AI coding agent is like a brilliant new hire who knows every programming language but has never seen your codebase, your industry regulations, or your architecture decisions.
Harness engineering is the discipline of building the onboarding materials, guardrails, and feedback loops that turn that brilliant generalist into a productive specialist.
OpenAI identified four core principles:
- Make knowledge discoverable. Everything the agent needs should live in the repository — not in Slack threads, meeting notes, or someone’s head.
- Enforce boundaries, not micromanage. Define the rules of the road (architecture, security, naming conventions) and let the agent figure out how to drive within them.
- Give context progressively. Don’t dump a thousand-page manual on the agent. Give it a map and let it load details on demand.
- Clean up continuously. Treat technical debt like interest on a loan — pay it down daily in small increments, not in painful quarterly sprints.
These principles are sound. But we found that for industries like fintech, healthcare, and enterprise software — where domain expertise, compliance, and architectural rigor are non-negotiable — the general approach needs to be extended with domain-specific customization layers.
Three Pillars That Make AI Agents Actually Useful
1. AI Presets: Teaching the Agent Your Rules
A general AI model knows how to code. It doesn’t know that your trading platform must use decimal-safe math for all monetary calculations, that vendor-specific data formats must never leak into your core business logic, or that your risk engine requires immutable audit trails for every state transition.
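To make the decimal-safe-math rule concrete, here is a minimal Python sketch of the kind of constraint a preset encodes. The function name and rounding policy are our own illustrative choices, not the platform's actual code:

```python
from decimal import Decimal, ROUND_HALF_EVEN

def notional(price: str, quantity: str) -> Decimal:
    """Compute a trade's notional value with decimal-safe math.

    Prices and quantities arrive as strings so they are never parsed
    into binary floats on the way in.
    """
    return (Decimal(price) * Decimal(quantity)).quantize(
        Decimal("0.01"), rounding=ROUND_HALF_EVEN
    )

# Why the rule exists: binary floats drift on ordinary decimal sums,
# while Decimal preserves exact decimal values.
assert 0.1 + 0.2 != 0.3                                    # float drift
assert Decimal("0.1") + Decimal("0.2") == Decimal("0.3")   # exact
```

A preset states this as a hard rule ("never use float for money"), so the agent applies it without being reminded in every prompt.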
We built a library of 18 composable preset files — think of them as specialized briefing documents — organized by concern:
- Architecture presets define how the system is structured: which components can talk to each other, how data flows, where boundaries lie.
- Security presets encode OWASP Top 10 protections, secret management rules, and access control patterns.
- Domain presets capture industry-specific rules: order lifecycle management, risk controls, regulatory precision requirements.
- Integration presets document how to work with specific third-party services while keeping the core system vendor-neutral.
- Workflow presets change the agent’s behavior mode — telling it whether to write code, review code, or clean up technical debt.
The key design decision: presets load on demand, not all at once. Before writing any production code, the agent is required to load at minimum one architecture preset, one tech stack preset, one security preset, and one data preset. If it’s connecting to an external service, vendor isolation rules load automatically.
This directly addresses what OpenAI discovered the hard way: “When everything is ‘important,’ nothing is.” By loading only what’s relevant, the agent stays focused on the constraints that actually matter for the task at hand.
Each preset includes not just positive guidance (“do this”) but also explicitly prohibited patterns (“never do that”). We found prohibited patterns to be dramatically more effective at preventing mistakes, because AI models are trained on vast amounts of code where common anti-patterns appear frequently. An explicit prohibition overrides that training bias.
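A minimal sketch of what on-demand loading with required categories and prohibited patterns might look like. The preset names, categories, and prohibitions below are illustrative placeholders, not the actual 18-file library:

```python
# Hypothetical preset registry; names and rules are illustrative.
REQUIRED_BEFORE_CODING = {"architecture", "tech_stack", "security", "data"}

PRESETS = {
    "architecture/hexagonal": {"category": "architecture",
                               "prohibited": ["domain imports adapter"]},
    "stack/python-services":  {"category": "tech_stack", "prohibited": []},
    "security/owasp-top10":   {"category": "security",
                               "prohibited": ["raw SQL string concatenation"]},
    "data/decimal-money":     {"category": "data",
                               "prohibited": ["float for money"]},
}

def ready_to_code(loaded: set[str]) -> bool:
    """True only if every required category is covered by a loaded preset."""
    covered = {PRESETS[name]["category"] for name in loaded}
    return REQUIRED_BEFORE_CODING <= covered

def prohibited_patterns(loaded: set[str]) -> list[str]:
    """Collect every explicit prohibition from the loaded presets."""
    return [p for name in loaded for p in PRESETS[name]["prohibited"]]
```

The design point is the gate, not the registry: production coding is blocked until the minimum set is covered, and only the prohibitions relevant to the loaded presets occupy the agent's context.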
2. Memory Banking: Giving AI a Long-Term Memory
AI agents are stateless by default. Every new conversation starts from scratch. For any project that lasts more than a single afternoon — which is every real project — this is a critical limitation.
Drawing on the cursor-memory-bank framework’s approach, we maintain a set of structured files that serve as the agent’s persistent working memory:
- Task Board — Every work item tracked with status, ID, and dependencies. Over 100 items on Apex Sentinel. The agent reads this at the start of every session to understand what’s done, what’s in progress, and what to avoid touching.
- Active Context — The current focus: what’s in scope, what’s explicitly out of scope, known risks, and recent decisions. This prevents the agent from “helpfully” refactoring something it shouldn’t.
- Product Context — The product vision, target users, success metrics, and business constraints. This grounds technical decisions in business reality.
- Audit Log — A chronological record of every significant action the agent has taken, with rationale. Essential for regulated industries and invaluable for team onboarding.
All of these files are version-controlled alongside the source code. They evolve with the project and are always available to the agent.
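The session bootstrap can be sketched in a few lines. The file names below mirror the four artifacts described above but are assumptions; the actual memory-bank layout may differ:

```python
from pathlib import Path

# Assumed file names, one per memory-bank artifact described above.
MEMORY_BANK_FILES = [
    "tasks.md",            # task board: status, IDs, dependencies
    "active_context.md",   # current scope, out-of-scope items, risks
    "product_context.md",  # vision, users, success metrics
    "audit_log.md",        # chronological record of agent actions
]

def load_memory_bank(root: Path) -> dict[str, str]:
    """Read every memory-bank file that exists at session start.

    Missing files are skipped so a brand-new project can begin with
    an empty bank and grow it over time.
    """
    bank = {}
    for name in MEMORY_BANK_FILES:
        path = root / name
        if path.exists():
            bank[name] = path.read_text(encoding="utf-8")
    return bank
```

Because these files live in the repository, the same bootstrap works for any agent, any session, and any new human teammate reading along.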
The compounding effect is remarkable. By session 50, the agent has accumulated a rich decision history — why certain architectural patterns were chosen, how naming conventions evolved, which trade-offs were made and why. It respects all of those decisions in every subsequent session without being reminded. Working with the agent in session 50 feels fundamentally different from session 1: it’s the difference between collaborating with a teammate who has context and briefing a contractor who doesn’t.
3. Automated Harnesses: Guardrails That Never Sleep
Documentation and presets guide the agent. Harnesses enforce the rules mechanically.
We built two types:
Architectural fitness checks — An automated script that verifies structural invariants after every coding session. Does any component violate the dependency rules? Has any file grown too large? Are there duplicate data definitions across system layers? Has a security boundary been crossed? The script catches violations instantly — before they can compound into larger problems.
End-to-end testing sandbox — A containerized environment that spins up the entire system (databases, services, browser-based UI tests) in complete isolation. The agent runs tests, captures screenshots and traces, analyzes failures, and fixes issues — all without touching the production environment. A single command orchestrates the entire lifecycle.
These harnesses embody a key insight from the OpenAI article: “Agents are most effective in environments with strict boundaries and predictable structure.” The agent doesn’t need to understand every architectural rationale — it just needs immediate, unambiguous feedback when it produces something that violates the rules.
Since implementing the architectural fitness harness, we’ve had zero architectural regressions — a claim few teams can make even with purely human development.
Human in the Loop: Autonomy with Accountability
Speed and automation are compelling. But for anyone responsible for a production system — especially in a regulated industry — a natural question follows: who’s in charge?
The answer in our approach is unambiguous: the human is always in charge. AI agents operate with significant autonomy on routine tasks, but the system is explicitly designed to escalate to a human decision-maker whenever the stakes are high. We call this the Human-in-the-Loop (HITL) principle, and it’s not a nice-to-have — it’s a hard-coded constraint.
Where the Line Is Drawn
Not every action requires human approval. Asking for permission on every line of code would erase the speed advantage entirely. The key is drawing the line in the right place: routine work is autonomous, consequential work requires consent.
On Apex Sentinel, our control plane preset defines a clear escalation policy:
- Autonomous zone. The agent writes code, runs tests, fixes linter errors, updates documentation, refactors within established patterns. These are high-volume, low-risk activities where the automated harnesses provide sufficient oversight.
- Approval required. Before the agent executes any destructive operation — dropping a database, mass-deleting files, modifying infrastructure configuration, upgrading a major dependency — it must pause, explain its rationale, and wait for explicit human confirmation. The work doesn’t proceed until a person says “yes.”
This isn’t a soft guideline the agent might occasionally forget. It’s a structural constraint enforced by the control plane: the agent is programmed to halt and request approval before executing any action classified as high-risk.
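The gate itself can be expressed very simply. The action names below are assumptions modeled on the policy described above, not the actual control-plane vocabulary:

```python
# Assumed classification of high-risk actions, per the escalation policy.
HIGH_RISK = {"drop_database", "mass_delete", "modify_infrastructure",
             "upgrade_major_dependency"}

def execute(action: str, approved: bool = False) -> str:
    """Run autonomous actions immediately; halt high-risk actions
    until a human has explicitly approved them."""
    if action in HIGH_RISK and not approved:
        return f"HALTED: '{action}' requires human approval"
    return f"EXECUTED: {action}"
```

The essential property is that the halt is the default path: the agent cannot reach a destructive operation without a human having flipped the approval flag.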
The Audit Trail: Every Decision Is Traceable
The HITL principle extends beyond real-time approval gates. Every significant action the agent takes is logged in a chronological audit trail — what was done, which files were affected, what outcome was expected, and why the action was taken.
On Apex Sentinel, this audit log now contains dozens of entries spanning months of development: major refactors, architectural decisions, dependency changes, security-sensitive modifications. Each entry is timestamped and versioned alongside the code.
For regulated industries, this is a critical capability. When an auditor or compliance officer asks “why was this change made and who approved it?” — the answer is documented, traceable, and version-controlled. The agent doesn’t just do the work; it maintains a clear chain of custody for every consequential decision.
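An append-only, JSON-lines audit record is one simple way to get these properties. The field names below are assumptions matching the trail described above (what, which files, why):

```python
import json
from datetime import datetime, timezone

def append_audit_entry(log_path, action: str, files: list[str],
                       rationale: str) -> dict:
    """Append one timestamped audit record as a JSON line and return it.

    Append-only JSON lines keep the log diff-friendly, so it versions
    cleanly alongside the source code.
    """
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "action": action,
        "files": files,
        "rationale": rationale,
    }
    with open(log_path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(entry) + "\n")
    return entry
```

Because every entry carries its rationale, answering an auditor's "why" is a lookup, not an archaeology project.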
Why HITL Matters for Organizational Trust
From a business perspective, the HITL principle addresses the single biggest barrier to AI adoption in serious engineering organizations: trust.
Engineering leaders are rightfully cautious about handing significant autonomy to AI agents. “What if it deletes something important?” “What if it introduces a breaking change to a production system?” “What if it makes a decision that violates a regulatory requirement?”
A well-designed HITL system answers all of these questions:
- Destructive operations can’t happen without human consent. The system is designed so that the worst-case failure mode of the agent acting alone is wasted time — not data loss, not security breaches, not compliance violations.
- The audit trail provides institutional memory. Even when team members change, the rationale behind every major decision is preserved and discoverable.
- Escalation is built into the workflow, not bolted on. The agent doesn’t need to be “caught” doing something wrong. It proactively identifies when a decision is above its authority and asks for guidance.
OpenAI’s harness engineering article puts it well: “Humans always remain in the loop, but work at a different layer of abstraction.” The human’s job is no longer to write every line of code — it’s to set direction, make high-stakes judgment calls, and validate outcomes. That’s a better use of expensive human expertise.
The Practical Balance
In our experience, the HITL gates trigger relatively rarely — perhaps a few times per day during active development. The vast majority of the agent’s work falls within the autonomous zone and flows smoothly through the preset and harness system without interruption. But when those gates do trigger, they’re catching exactly the right things: the moments where human judgment, domain expertise, and accountability genuinely matter.
The goal isn’t to slow the agent down. The goal is to ensure that speed and safety scale together.
Writing Secure Code with AI Agents: Trust but Verify
Speed means nothing if the code is full of vulnerabilities. This is one of the most common — and most legitimate — concerns business leaders raise about AI-generated code. If the agent is writing thousands of lines a day, who’s making sure those lines are secure?
Our answer: security is not a review step at the end. It’s a constraint baked into every line from the start.
The Problem with “Review It Later”
In traditional development, security often enters the picture late — a penetration test before launch, a compliance audit before a release. When AI agents dramatically increase the volume of code being produced, this “check it later” model breaks down. There’s simply too much output for a human security reviewer to examine line by line.
The harness engineering approach flips this: instead of reviewing code for security after it’s written, you prevent insecure code from being written in the first place.
How We Encode Security Into the Agent’s DNA
On Apex Sentinel, we built dedicated security presets that the agent must load before writing any production code. These presets cover the OWASP Top 10 — the industry-standard list of the most critical web application security risks — and translate each one into concrete rules and prohibited patterns the agent understands:
- Injection prevention. The agent is instructed to always use parameterized queries and ORM abstractions — never construct database queries by concatenating user input. In fact, raw database queries of any kind are prohibited across the entire codebase; all data access must go through a managed abstraction layer. This eliminates the single most exploited vulnerability class in software history.
- Secret management. Credentials, API keys, and tokens must never appear in source code, configuration files, or log output. The agent is told to source all secrets from a dedicated secret manager and to mask sensitive fields in any logging or diagnostic output. Our review-mode preset includes an explicit checklist item: “Exposed secrets — hardcoded credentials, token leakage in logs or configs.”
- Access control. The presets enforce a deny-by-default model: no endpoint or resource is accessible unless explicitly permitted. The agent is taught to guard against insecure direct object references — a class of vulnerability where an attacker manipulates an identifier to access someone else’s data.
- Input validation at boundaries. All external data — whether from users, APIs, or third-party services — must be validated and sanitized at the system boundary before it enters the core business logic. Our Anti-Corruption Layer preset enforces this structurally: vendor data is mapped and normalized by dedicated adapter components, so untrusted input never reaches the domain layer in raw form.
- Cryptographic integrity. For our FX trading platform, all broker credentials are encrypted at rest with AES-256. The agent knows this requirement because it’s encoded in the integration preset — not left as a vague best practice, but stated as a mandatory constraint with a specific standard.
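The injection rule is the easiest of these to show side by side. Here is a minimal contrast between the prohibited pattern (string-built SQL) and the required one (parameterized queries), using the stdlib sqlite3 driver purely for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER, owner TEXT)")
conn.execute("INSERT INTO accounts VALUES (1, 'alice')")

def find_account(owner: str):
    # PROHIBITED: f"SELECT id FROM accounts WHERE owner = '{owner}'"
    # would let input like  ' OR '1'='1  match every row.
    # REQUIRED: placeholders keep user input as data, never as SQL.
    return conn.execute(
        "SELECT id FROM accounts WHERE owner = ?", (owner,)
    ).fetchall()
```

With the parameterized form, the classic injection payload is just an odd account name that matches nothing, which is exactly the behavior the preset mandates.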
The Adversarial Reviewer
Beyond presets, we built a dedicated Reviewer Persona — a behavioral mode the agent enters when asked to audit code. In this mode, the agent is explicitly prohibited from writing new features. Its sole job is to act as an adversarial security auditor:
- Cross-reference every code change against the loaded security presets
- Check for missing error handling paths that could leak internal details
- Verify that database operations have proper rollback safety
- Confirm that third-party integrations don’t leak vendor-specific data into the core system
- Flag any finding where a required security preset wasn’t loaded — surfacing gaps in coverage, not just gaps in code
Findings are ranked by severity (critical, high, medium, low), each with a specific violated rule, the business impact, and a concrete fix recommendation. This isn’t a vague “looks good” approval — it’s a structured, traceable audit.
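A structured finding can be as small as a dataclass. The severity scale comes from the text; the field names are our own illustrative choices:

```python
from dataclasses import dataclass

SEVERITIES = ("critical", "high", "medium", "low")  # most severe first

@dataclass
class Finding:
    severity: str        # one of SEVERITIES
    violated_rule: str   # which preset rule was broken
    business_impact: str
    recommendation: str  # concrete fix, not "looks wrong"

def triage(findings: list[Finding]) -> list[Finding]:
    """Order findings most-severe first for the human reviewer."""
    return sorted(findings, key=lambda f: SEVERITIES.index(f.severity))
```

Forcing every finding into this shape is what turns a review from an opinion into an auditable artifact.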
Mechanical Enforcement: The Safety Net Under the Safety Net
Even with security presets and adversarial reviews, we don’t rely on the agent “remembering” to follow the rules. Our automated harnesses mechanically verify security-relevant invariants:
- No raw network-level code in security-sensitive infrastructure components
- No cross-boundary references that could create unauthorized access paths
- Composition root isolation — ensuring the trading engine and the admin interface can’t accidentally expose each other’s internal capabilities
These checks run automatically. They don’t get tired on Friday afternoons. They don’t skip checks when deadlines are tight. They enforce the same standard on every line of code, in every session, at every hour.
Why This Matters for Regulated Industries
In fintech, healthcare, and enterprise software, a security vulnerability isn’t just a bug — it’s a potential regulatory incident, a reputational crisis, or a direct financial loss. The traditional approach of relying on developer discipline and periodic audits was already strained before AI agents entered the picture. With agents producing code at 10x speed, the old model simply doesn’t scale.
The harness engineering approach scales security alongside velocity. Every preset, every prohibited pattern, every harness check applies to every line the agent produces — whether it writes 100 lines that day or 10,000. Security posture doesn’t degrade as throughput increases. That’s a property no manual review process can guarantee.
How It All Works in Practice
We structured our workflow around distinct modes that change the agent’s behavior:
| Mode | Purpose | Agent Behavior |
| --- | --- | --- |
| Plan | Break down work, update task board | Reads product context, creates actionable work items |
| Code | Implement features with test-first discipline | Loads relevant presets, writes tests before code |
| Test | Validate in isolated sandbox | Runs full test suite, analyzes failures, iterates |
| Review | Adversarial quality and security audit | Critiques code against security and architecture rules — cannot write new features |
| Cleanup | Technical debt reduction | Removes dead code, consolidates duplicates — cannot add new abstractions |
The behavioral modes are essential. Without them, AI agents have a well-known tendency to “helpfully” add features when asked to review, or to introduce new abstractions when asked to simplify. Explicit mode constraints prevent this drift.
For high-stakes operations, the HITL principle applies across all modes: destructive actions require human approval, and the audit trail captures every consequential decision regardless of which mode the agent is operating in.
Seven Lessons for Business and Technology Leaders
1. The investment is in the environment, not the model
The difference between mediocre AI-assisted development and exceptional AI-assisted development has almost nothing to do with which AI model you use. It has everything to do with the quality of the presets, memory systems, and harnesses you build around it. This is the new competitive moat.
2. Domain knowledge is the highest-leverage investment
Our fintech trading preset — a concise set of industry-specific rules about order lifecycle, risk controls, and regulatory precision — prevents more defects per line than any other artifact in the project. General-purpose models are weakest in specialized domains. That’s exactly where customization delivers the most value.
3. Security must be a constraint, not a checkpoint
If security is something you “check later,” AI-generated code will overwhelm your review capacity. If security is baked into the presets and harnesses from day one, it scales automatically with throughput. This is the only model that works when agents are writing code at 10x speed.
4. Human-in-the-Loop isn’t optional — it’s the trust foundation
AI agents should be autonomous for routine work and deferential for consequential decisions. A well-designed HITL system with clear escalation boundaries and a traceable audit trail is what makes the difference between “impressive demo” and “production-ready process.” It’s also what makes AI adoption palatable to compliance teams, board members, and enterprise clients.
5. Memory banking transforms multi-week projects
Without persistent memory, every AI session starts from zero. With it, each session builds on the accumulated context of every previous session. For any project longer than a few days, this is the single biggest productivity multiplier.
6. Automated guardrails eliminate entire categories of risk
Architectural drift, security boundary violations, compliance gaps — these are the kinds of problems that are expensive to find in code review and devastating to discover in production. Mechanical enforcement catches them in seconds, every time, without human fatigue.
7. The ROI increases over time
Unlike traditional tooling investments that depreciate, harness engineering assets appreciate: every preset, every prohibited pattern, every harness check makes every future session more productive and more reliable. The system genuinely gets smarter with use.
Where This Is Heading
OpenAI’s harness engineering article described agents working autonomously for six hours at a stretch — often while the humans were sleeping. They reported throughput increasing as the team grew, because the scaffolding scaled independently of headcount.
We see the same trajectory at Itexus. But we also see something the OpenAI post doesn’t fully address: most organizations aren’t building generic products. They’re building trading platforms, clinical systems, supply chain engines — software where domain expertise, regulatory compliance, and architectural rigor matter enormously.
For these organizations, a general AI setup isn’t enough. You need the kind of domain-specific customization layer we’ve described here: composable presets that encode your industry’s rules, memory banking that preserves institutional knowledge across sessions, automated harnesses that enforce the constraints that keep you compliant and secure, and Human-in-the-Loop governance that ensures humans retain authority over the decisions that matter most.
This is the new competitive advantage in software delivery. It’s not about which team has the most developers. It’s about which team has built the best environment for AI agents to do excellent work — with the right safeguards to do it responsibly.
How Itexus Can Help
At Itexus, we’ve spent over a decade building complex software for fintech, healthcare, and enterprise clients. We understand these domains deeply — the regulations, the edge cases, the architectural patterns that work at scale.
Now we’re applying that domain expertise to this new paradigm. We design and implement the presets, memory systems, harnesses, and HITL governance frameworks that turn general-purpose AI agents into domain-specialized engineering partners. Whether you’re launching a new product and want to build with AI agents from day one, or looking to accelerate an existing codebase, we can help you build the scaffolding that makes it possible.
The code writes itself. The environment that makes it possible doesn’t.
Let’s talk about what harness engineering can do for your next project →