The Future of AI Coding Agents
Where AI App Builders and AI Software Engineers are headed next.
In the past few years we've gone from autocomplete to coding copilots to agents. The next generation won't just be another upgrade; it will be the biggest shift yet, fundamentally different from everything that came before.
The first major AI App Builder to come out was V0 in October 2023, with Replit Agent, StackBlitz Bolt, and Lovable following in late 2024. We've seen an explosion ever since, as more and more code is written by AI. But while these tools have gotten dramatically better over the past year, they haven't evolved into anything structurally new. Just as autocomplete gave way to the coding copilot, the next generation will look unlike anything we've seen.
Phase 1
Before the first AI App Builders, there was AI autocomplete. Tabnine was the first I saw back in 2018, followed by GitHub Copilot and Cursor Tab. These models looked only a few tokens ahead of us.
In 2023, Vercel saw that they could generate more than a few lines: whole components. They weren't trying to build entire applications, just design prototypes. V0 took what would have been a 30-minute process and turned it into 30 seconds, sketching designs from idea to code.
As AI intelligence has accelerated in the past year, so have AI App Builders. Lovable went from landing pages to backends to whole functional apps. Modern AI App Builders manage data, payments, users and more. What used to take a team of engineers is now a prompt away.
And that brings us to today. Right now, the best AI App Builders work based on a synchronous chat — the user puts in a prompt, the AI builds, the user reviews the AI's work, repeat. This works well: as models have gotten progressively smarter, these AI App Builders have gotten better as well.
In 12 months, the AI software engineers of today will look as silly as the tab autocompletes of 2018 look to us.
Phase 2
Soon, AI App Builders will generate a couple of variants for every request before ultimately selecting the best one. Assuming reasonable variance across ideas and generations, this alone will lead to noticeable improvements in app quality.
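A minimal sketch of that loop, assuming hypothetical `generate_app` and `judge` calls standing in for real model APIs (none of these names come from any actual AI App Builder):

```python
import random

def generate_app(prompt: str, temperature: float) -> str:
    # Stand-in for a real LLM call; each "app" here is just a tagged string.
    return f"app(seed={random.randrange(10**6)}, prompt={prompt!r})"

def judge(variant: str, spec: str) -> float:
    # Stand-in for an LLM judge scoring spec compliance; random noise here.
    return random.random()

def best_of_n(prompt: str, n: int = 3) -> str:
    # Sample n variants at non-zero temperature, keep the judge's favorite.
    variants = [generate_app(prompt, temperature=0.9) for _ in range(n)]
    return max(variants, key=lambda v: judge(v, prompt))

print(best_of_n("a todo app with auth"))
```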
It won't stop at a couple of variants. Once we can generate a few, we'll generate a few thousand, then a few dozen thousand. Agents will be used to rank, classify, and re-rank the generations before presenting the favorites to humans for review. Think of this as a kind of memetic Darwinism: instead of survival of the fittest members of a species, AI App Builders will select for the fittest agent response.

We can safely assume AI will be able to handle these classification tasks, because classification is consistently easier than generation. Consider how long ago we had the first face detection technology. Detecting a face in an image is a classification task: all it has to do is decide whether or not a picture contains a face. We had nearly perfect face detection a decade ago, well before we could reliably generate novel faces. The same held for spam detection versus writing emails, and for sentiment analysis versus coherent stories. Picking out the best code should be no different.
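One plausible shape for that funnel, under my own assumptions: a cheap pass/fail filter, a coarse scorer, and an expensive head-to-head re-ranker, all stubbed out here with randomness.

```python
import heapq
import random

def passes_checks(variant: str) -> bool:
    # Cheap pass/fail classification: does it build, do the tests pass?
    # This is the easy half; a small model or a test suite can do it.
    return random.random() > 0.3

def coarse_score(variant: str) -> float:
    # Scalar score from a mid-sized judge model (stubbed with noise).
    return random.random()

def rerank(shortlist: list[str]) -> str:
    # Expensive head-to-head comparisons, reserved for the survivors.
    best = shortlist[0]
    for challenger in shortlist[1:]:
        if random.random() < 0.5:  # stand-in for "judge prefers challenger"
            best = challenger
    return best

def select(variants: list[str], k: int = 10) -> str:
    survivors = [v for v in variants if passes_checks(v)]
    top_k = heapq.nlargest(k, survivors, key=coarse_score)
    return rerank(top_k)

print(select([f"variant-{i}" for i in range(10_000)]))
```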
This will radically improve agent performance. Imagine asking one person to draw a picture. Now imagine three; now three thousand. Think about how much better the best drawing of the 3000 will be compared to the one drawn by a single person. And this won't just happen once — every time you ask an AI to take a nondestructive action, it will try thousands of times and choose the best path.
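The intuition has simple order statistics behind it. If we assume per-attempt quality is roughly normally distributed (my assumption, not a measured fact), a quick simulation shows how the expected best draw grows with the number of attempts:

```python
import random
import statistics

def expected_best(n: int, trials: int = 1_000) -> float:
    # Monte Carlo estimate of E[max of n draws] from a standard normal.
    return statistics.fmean(
        max(random.gauss(0.0, 1.0) for _ in range(n)) for _ in range(trials)
    )

for n in (1, 3, 3000):
    print(n, round(expected_best(n), 2))
# Roughly: 1 -> 0.0, 3 -> 0.85, 3000 -> ~3.5 standard deviations above average.
```

The gains are logarithmic rather than linear, which is exactly why the selection machinery matters more than raw sample count.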
Even if future models have low default variance, there are many unexplored strategies to deliberately expand the variance between generations. As models get smarter, we can push temperatures higher without collapsing coherence, introducing more token-level noise and yielding more diverse outputs. We can prompt the AI to first generate a wide set of ideas, then branch each one into multiple variants, effectively creating a taxonomy of possible paths. Random noise can be injected at specific points in the generation process to force reconstruction of certain sections, producing unexpected alternatives. More advanced systems could even decide dynamically when to branch, spinning off multiple explorations whenever the model senses several promising directions. With the right techniques, variance can be amplified, structured, and exploited. High variance is there for the taking, and harnessing it well will be a defining characteristic of next-gen AI App Builders.
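To make the idea-branching strategy concrete, here is a rough two-stage sketch; the `llm` function is a hypothetical stand-in, not any real provider's API:

```python
import random

def llm(prompt: str, temperature: float) -> str:
    # Hypothetical model call; a real system would hit an LLM endpoint.
    return f"output(t={temperature}, id={random.randrange(10**6)})"

def branch_and_expand(request: str, ideas: int = 5, per_idea: int = 4) -> list[str]:
    # Stage 1: high temperature to maximize diversity of approaches.
    approaches = [llm(f"Propose one approach for: {request}", temperature=1.2)
                  for _ in range(ideas)]
    # Stage 2: moderate temperature to render coherent variants of each approach.
    return [llm(f"Implement this approach: {a}", temperature=0.8)
            for a in approaches for _ in range(per_idea)]

variants = branch_and_expand("a booking app with payments")
print(f"{len(variants)} variants across 5 distinct approaches")
```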
AI App Builders haven't scratched the surface of this yet. When that changes, the teams that master classification and selection will see performance an order of magnitude better than those before them. But I don't think this phase will last long.
Phase 3
The first way we'll use massive parallelism will be through a simple ranking of every generated variant. But once we've figured out how to judge variants against each other, we'll want to merge their best parts into an optimal composite. The real challenge isn't picking the best variant; it's synthesizing the best of thousands of variants into one coherent system. Section C of variant #1355 may be outstanding, but useless if it clashes with Section A of #2687. AI must learn both what makes each piece strong and how to align pieces across architectures. Coherence becomes the benchmark. Future systems may approach this by treating application layers as modular, generating and optimizing them independently before integration, or by ranking components across variants and building a composite library that ensures compatibility. This phase is hard to ideate on; it relies on the innovations of Phase 2 that we haven't even seen yet. With that said, humans struggle with merge conflicts today, and this will multiply those problems by the number of selected variants.
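As a toy illustration of the component-ranking idea: score each section across variants, shortlist the best per slot, and only accept a composite whose sections pass a compatibility check. Every name here is hypothetical, and the real problem is far harder than this sketch suggests:

```python
import itertools
import random

SECTIONS = ("data_model", "api", "frontend")

def section_score(variant: int, section: str) -> float:
    # Stand-in for a judge scoring one section of one variant.
    return random.random()

def compatible(a: tuple[int, str], b: tuple[int, str]) -> bool:
    # Stand-in for a cross-variant interface check (schemas, types, routes).
    return random.random() > 0.2

def best_composite(n_variants: int, top_k: int = 5) -> dict[str, int] | None:
    # Keep the top_k variants per section, then search high-ranked
    # combinations for one whose sections all fit together.
    ranked = {
        s: sorted(range(n_variants), key=lambda v: section_score(v, s),
                  reverse=True)[:top_k]
        for s in SECTIONS
    }
    for combo in itertools.product(*(ranked[s] for s in SECTIONS)):
        parts = list(zip(combo, SECTIONS))
        if all(compatible(p, q) for p, q in itertools.combinations(parts, 2)):
            return dict(zip(SECTIONS, combo))
    return None  # no coherent composite among the top candidates

print(best_composite(3000))
```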
Why I'm Certain
In 2006, AWS launched S3; it took them ten years, until 2016, to make it 85% cheaper. GPT-4 launched in March 2023; 8 months later, OpenAI launched GPT-4 Turbo, a model of comparable intelligence that was 67% cheaper. Only 6 months after that came GPT-4o, again comparable in intelligence and 50% cheaper than Turbo. Then, 3 months later, they cut prices for the GPT-4o API by another 33%. From the original GPT-4 to the current GPT-4o, input tokens have become 92% cheaper and output tokens 83% cheaper in 17 months.
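The arithmetic, using OpenAI's per-million-token list prices as I recall them (worth verifying against current pricing pages):

```python
# USD per 1M tokens (input, output), list prices as published at each launch.
prices = {
    "gpt-4 (Mar 2023)":       (30.00, 60.00),
    "gpt-4-turbo (Nov 2023)": (10.00, 30.00),
    "gpt-4o (May 2024)":      (5.00, 15.00),
    "gpt-4o (Aug 2024)":      (2.50, 10.00),
}

first_in, first_out = prices["gpt-4 (Mar 2023)"]
last_in, last_out = prices["gpt-4o (Aug 2024)"]
print(f"input:  {1 - last_in / first_in:.0%} cheaper")   # 92% cheaper
print(f"output: {1 - last_out / first_out:.0%} cheaper") # 83% cheaper
```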
That's just OpenAI. Gemini 2.5 Flash is generally a smarter, stronger model than GPT-4 while also being 87.5% cheaper. Qwen3-Coder 480B is an open-source alternative that outperforms GPT-4; it's 88% cheaper on some cloud providers. Some estimates put the cost per unit of intelligence closer to 99% cheaper, though I wasn't able to find data to support those numbers.
AI isn't just getting smarter. There has been an exponential reduction in cost per intelligence over the past few years, with no indication that it's slowing down. The first models capable of coding real applications showed up a year ago. In a year they will be 90% cheaper, and in 3 years, they will be at least 99% cheaper — my hunch is both of these will come much sooner than that.
Six months ago, most major AI App Builders tried to keep their token usage under 40,000 per prompt. As many move toward more agentic systems, that number is expanding to 400,000 per prompt. Smarter models let us give agents a longer leash, and soon they'll be going much further for much cheaper.
The Bottlenecks
Assuming AI keeps getting smarter, there are a few more bottlenecks to solve before this vision happens. Massive parallelism means thousands of variants exploring ideas all at once, and each variant will need its own sandbox to build, run, and test in isolation. That creates short, extreme spikes in work that today's clouds don't handle well. If the system can't spin sandboxes up fast, keep results organized, and shut them down cheaply, the “generate 10,000 and pick the best” promise stalls, not for lack of model quality, but because coordination breaks down at scale.

Once the coordination problem is solved, the harder challenge appears: deciding which of those thousands of working variants is actually the best. In today's generation of AI App Builders, the user effectively serves as the QA Agent: reviewing each output, checking for spec compliance, and making judgments on usability and design. That's why more advanced QA Agents haven't emerged yet; humans have been filling that role. This approach collapses under parallelism: humans can't review and rank thousands of variants at the speed these systems demand.

Soon, AI Agents will reliably hit every part of the spec without human review, and QA Agents that only check for spec adherence will become irrelevant. We'll need new Agents capable of discriminating on accessibility, usability, consistency, and design. Enabled by smarter models and better coordination systems, this reviewing and ranking challenge will be the problem to solve to maximize quality.
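A sketch of what that coordination layer might look like, with a hypothetical `Sandbox` API; real providers differ, and the point is only the shape: bounded concurrency, one isolated environment per variant, and aggressive teardown.

```python
import asyncio

class Sandbox:
    # Hypothetical isolated build/run environment; a real system would back
    # this with microVMs or containers that start in milliseconds.
    async def __aenter__(self):
        return self
    async def __aexit__(self, *exc):
        pass  # teardown must be fast and cheap, or the economics break
    async def build_and_test(self, variant: str) -> float:
        await asyncio.sleep(0.01)           # stand-in for build + test time
        return (hash(variant) % 100) / 100  # stand-in quality score

async def evaluate(variant: str, limit: asyncio.Semaphore) -> tuple[str, float]:
    async with limit:                 # cap how many sandboxes run at once
        async with Sandbox() as box:  # one isolated environment per variant
            return variant, await box.build_and_test(variant)

async def main() -> None:
    variants = [f"variant-{i}" for i in range(10_000)]
    limit = asyncio.Semaphore(500)    # the fleet's burst capacity
    results = await asyncio.gather(*(evaluate(v, limit) for v in variants))
    print(max(results, key=lambda r: r[1]))

asyncio.run(main())
```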
The next generation of AI App Builders will only be as strong as the next generation of QA Agents.
— Madhav Jha, CTO of Emergent, a leading AI App Builder
What changes when this is normal?
AI started as autocomplete, an assistant proposing the next few lines. It became a copilot working alongside us. Today's agents make us managers: we set scope and acceptance criteria, and we review PRs. The next generation puts us in a general's role: define objectives, constraints, and budgets; spin up thousands of variant sandboxes; let ranking systems select and synthesize; then approve promotion or rollback based on live signals. Velocity becomes limited only by search budget and judging criteria. It will deliver software with more quality and complexity than we've ever seen.