Agentic Engineering in Practice: Building Real Software with AI

This post was originally published on O’Reilly Radar and is being republished here with the author’s permission.

This is the first article in a series on agentic engineering and AI-driven development. Look for the next article on O’Reilly Radar.

There’s been a lot of hype about AI and software development, and it comes in two flavors. One says, “We’re all doomed; tools like Claude Code will make software engineering obsolete within a year.” The other says, “Don’t worry, everything’s fine; AI is just another tool in the toolbox.” Neither is honest.

I’ve spent over 20 years writing about software development for practitioners, covering everything from coding and architecture to project management and team dynamics. For the last two years I’ve been focused on AI, training developers to use these tools effectively, writing about what works and what doesn’t in books, articles, and reports. And I kept running into the same problem: I had yet to find anyone with a coherent answer for how experienced developers should actually work with these tools. There are plenty of tips and plenty of hype but very little structure, and very little you could practice, teach, critique, or improve.

I’d been observing developers at work using AI with various levels of success, and I realized we need to start thinking about this as its own discipline. Andrej Karpathy, the former head of AI at Tesla and a founding member of OpenAI, recently proposed the term “agentic engineering” for disciplined development with AI agents, and others like Addy Osmani are getting on board. Osmani’s framing is that AI agents handle implementation but the human owns the architecture, reviews every diff, and tests relentlessly. I think that’s right.

But I’ve spent a lot of the last two years teaching developers how to use tools like Claude Code, agent mode in Copilot, Cursor, and others, and what I keep hearing is that they already know they should be reviewing the AI’s output, maintaining the architecture, writing tests, keeping documentation current, and staying in control of the codebase. They know how to do it in theory. But they get stuck trying to apply it in practice: How do you actually review thousands of lines of AI-generated code? How do you keep the architecture coherent when you’re working across multiple AI tools over weeks? How do you know when the AI is confidently wrong? And it’s not just junior developers who are having trouble with agentic engineering. I’ve talked to senior engineers who struggle with the shift to agentic tools, and intermediate developers who take to it naturally. The difference isn’t necessarily the years of experience; it’s whether they’ve figured out an effective and structured way to work with AI coding tools. That gap between knowing what developers should be doing with agentic engineering and knowing how to integrate it into their day-to-day work is a real source of anxiety for a lot of engineers right now. That’s the gap this series is trying to fill.

Despite what much of the hype about agentic engineering is telling you, this kind of development doesn’t eliminate the need for developer expertise. Just the opposite: working effectively with AI agents actually raises the bar for what developers need to know. I wrote about that experience gap in an earlier O’Reilly Radar piece called “The Cognitive Shortcut Paradox.” The developers who get the most from working with AI coding tools are the ones who already know what good software looks like and can often tell whether the AI has produced it.

The idea that AI tools work best when experienced developers are driving them matched everything I’d observed. It rang true, and I wanted to prove it in a way that other developers would understand: by building software. So I started developing a specific, practical approach to agentic engineering that developers could follow, and then I put it to the test. I used it to build a production system from scratch, with the rule that AI would write all the code. I needed a project that was complex enough to stress-test the approach, and interesting enough to keep me engaged through the hard parts. I wanted to apply everything I’d learned and discover what I still didn’t know. That’s when I came back to Monte Carlo simulations.

The Monte Carlo Connection

I’ve been obsessed with Monte Carlo simulations ever since I was a kid. My dad’s an epidemiologist — his whole career has been about finding patterns in messy population data, which means statistics was always part of our lives (and it also means that I learned SPSS at a very early age). When I was maybe 11 he told me about the drunken sailor problem: A sailor leaves a bar on a pier, taking a random step toward the water or toward his ship each time. Does he fall in or make it home? You can’t know from any single run. But run the simulation a thousand times, and the pattern emerges from the noise. The individual outcome is random; the aggregate is predictable.

I remember writing that simulation in BASIC on my TRS-80 Color Computer 2: a little blocky sailor stumbling across the screen, two steps forward, one step back. The drunken sailor is the “Hello, world” of Monte Carlo simulations. Monte Carlo is a technique for problems you can’t solve analytically: You simulate them hundreds or thousands of times and measure the aggregate results. Each individual run is random, but the statistics converge on the true answer as the sample size grows. It’s one way we model everything from nuclear physics to financial risk to the spread of disease across populations.
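To make the idea concrete, here is a minimal Python sketch of the drunken sailor as a one-dimensional random walk. It is not the original BASIC program; the pier length, the starting position, and the names `sailor_falls_in` and `estimate_fall_probability` are assumptions chosen for illustration.

```python
import random


def sailor_falls_in(rng, start=5, pier_length=10):
    """Simulate one walk. Position 0 is the water, pier_length is the ship.

    Returns True if the sailor reaches the water before the ship.
    The start and pier_length defaults are illustrative assumptions.
    """
    position = start
    while 0 < position < pier_length:
        # Each step is equally likely to go toward the water or the ship.
        position += rng.choice((-1, 1))
    return position == 0


def estimate_fall_probability(runs=10_000, seed=42, start=5, pier_length=10):
    """Run many independent walks and return the fraction that end in the water."""
    rng = random.Random(seed)  # seeded so the whole experiment is reproducible
    falls = sum(sailor_falls_in(rng, start, pier_length) for _ in range(runs))
    return falls / runs


if __name__ == "__main__":
    print(estimate_fall_probability())
```

No single call to `sailor_falls_in` tells you much, but with the sailor starting at the midpoint of the pier the aggregate estimate settles near one half as `runs` grows, and moving `start` closer to the water pushes it higher.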

What if you could run that kind of simulation today by describing it in plain English? Not a toy demo but thousands of iterations with seeded randomness for reproducibility, where the outputs get validated and the results get aggregated into actual statistics you can use. Or a pipeline where an LLM generates content, a second LLM scores it, and anything that doesn’t pass gets sent back for another try.
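The generate, score, and retry loop described above can be sketched in a few lines of Python. This is only an illustration of the pattern, not how Octobatch implements it: `generate` and `score` are hypothetical stand-ins for the two LLM calls (here they just use seeded random numbers), and the threshold and attempt limit are assumed values.

```python
import random
import statistics


def generate(rng):
    """Hypothetical stand-in for the LLM that produces content.

    Returns a random number in [0, 1) in place of real generated text.
    """
    return rng.random()


def score(candidate):
    """Hypothetical stand-in for the second LLM that judges the content."""
    return candidate


def run_unit(rng, threshold=0.5, max_attempts=3):
    """Generate, score, and retry until a candidate passes or attempts run out."""
    for attempt in range(1, max_attempts + 1):
        candidate = generate(rng)
        if score(candidate) >= threshold:
            return {"passed": True, "attempts": attempt}
    return {"passed": False, "attempts": max_attempts}


def run_batch(units=1000, seed=7):
    """Run many units with one seeded generator and aggregate the outcomes."""
    rng = random.Random(seed)  # the same seed reproduces the same batch
    results = [run_unit(rng) for _ in range(units)]
    passed = sum(1 for r in results if r["passed"])
    return {
        "pass_rate": passed / units,
        "mean_attempts": statistics.mean(r["attempts"] for r in results),
    }


if __name__ == "__main__":
    print(run_batch())
```

Seeding a single `random.Random` instance means a rerun with the same seed produces the same batch, which is what makes the aggregated statistics comparable from one run to the next.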

The goal of my experiment was to build that system, which I called Octobatch. Right now, the industry is constantly looking for new real-world end-to-end case studies in agentic engineering, and I wanted Octobatch to be exactly that case study.

AI-Driven Development

I took everything I’d learned from teaching and observing developers working with AI, put it to the test by building a real system from scratch, and turned the lessons into a structured approach to agentic engineering I’m calling AI-driven development, or AIDD. This is the first article in a series about what agentic engineering looks like in practice, what it demands from the developer, and how you can apply it to your own work.

The result is a fully functioning, well-tested application that consists of about 21,000 lines of Python across several dozen files, backed by complete specifications, nearly a thousand automated tests, and quality integration and regression test suites. I used Claude to review all the AI chats from the project, and it turns out that I built the entire application in roughly 75 hours of active development time over seven weeks.

But this series isn’t just about Octobatch. I integrated AI tools at every level: Claude and Gemini collaborating on architecture, Claude Code writing the implementation, LLMs generating the pipelines that run on the system they helped build. This series is about what I learned from that process: the patterns that worked, the failures that taught me the most, and the orchestration mindset that ties it all together. Each article pulls a different lesson from the experiment, from validation architecture to multi-LLM coordination to the values that kept the project on track.

Vibe Coding vs. Structured Agentic Engineering

When most people talk about using AI to write code, they mean one of two things: AI coding assistants like GitHub Copilot, Cursor, or Windsurf, which have evolved well beyond autocomplete into agentic tools that can run multifile editing sessions and define custom agents; or “vibe coding,” where you describe what you want in natural language and accept whatever comes back. These coding assistants are genuinely impressive, and vibe coding can be really productive.

Using these tools effectively on a real project — maintaining architectural coherence across thousands of lines of AI-generated code — is a different problem entirely. AIDD aims to help solve that problem. It’s a structured approach to agentic engineering where AI tools drive substantial portions of the implementation, architecture, and even project management, while you, the human in the loop, decide what gets built and whether it’s any good. By “structure,” I mean a set of practices developers can learn and follow, a way to know whether the AI’s output is actually good, and a way to stay on track across the life of a project. If agentic engineering is the discipline, AIDD is one way to practice it.

In AI-driven development, developers don’t just accept suggestions or hope the output is correct. They assign specific roles to specific tools: one LLM for architecture planning, another for code execution, a coding agent for implementation, and the human for vision, verification, and the decisions that require understanding the whole system.

And the “driven” part is literal. The AI is writing almost all of the code. One of my ground rules for the Octobatch experiment was that I would let AI write all of it. I have high code quality standards, and part of the experiment was seeing whether AIDD could produce a system that meets them. The human decides what gets built, evaluates whether it’s right, and maintains the constraints that keep the system coherent.

The Limits of Fully Autonomous Agents

Not everyone agrees on how much the developer needs to stay in the loop, and the fully autonomous end of the spectrum is already producing cautionary tales. Nicholas Carlini at Anthropic recently tasked 16 Claude instances with building a C compiler in parallel with no human in the loop. After 2,000 sessions and $20,000 in API costs, the agents produced a 100,000-line compiler that can build a Linux kernel but isn’t a drop-in replacement for anything. And when all 16 agents got stuck on the same bug, Carlini had to step back in and partition the work himself.

Even strong advocates of a completely hands-off approach to agentic engineering might call that a step too far. The question isn’t whether to keep humans in the loop — it’s how much human judgment you need to make that code trustworthy, and what specific practices help you apply that judgment effectively. That’s exactly what this series sets out to answer.
