What It Takes to Run AI Workloads in Production

CoreWeave is a cloud platform purpose-built for AI, combining next-generation infrastructure and intelligent tools to power the world’s most complex AI workloads.

Ryan Donovan: Hello and welcome to the Stack Overflow podcast. I’m Ryan Donovan, once again recording from the floor at HumanX. Today we’re talking about all of the things required to run AI in production. My guest for that is Peter Salanki, CTO and co-founder at CoreWeave. Welcome to the show, Peter.

Peter Salanki: Thank you so much and I’m happy to be here.

Why AI Infrastructure Is Different from Traditional Cloud

RD: You have the full soup-to-nuts picture of running AI in production. What does that take — especially what wouldn’t people think about as part of that stack?

PS: The stack to run AI workloads, both training and inference, looks very different from a traditional hyperscaler architecture. And I think that’s the question we get: why doesn’t Amazon do this? The way they built their clouds for their use cases over the past 20 years is very different from what you need for AI.

Traditional clouds are built around what I like to call the black box model. You get a virtual machine, or some kind of instance, and you don’t really know what’s happening underneath. Everything is nicely abstracted away from you. And that’s great if you run something that’s easily parallelizable, like a website or an API server.

But when you run AI workloads, you suddenly have supercomputer-sized workloads. They’re all synchronous — they all run together — and if any component breaks, your entire job fails. In traditional cloud setups, you throw away half your network bandwidth for redundancy. If I told AI researchers we were going to cut half their network bandwidth so their jobs would never crash, they’d tell me to run out the door. The use cases here are not designed to never fail. They’re designed to be able to fail, and then we identify the failure, isolate that component, and restart without losing progress. Building infrastructure for that is completely different from building one with all the redundancies to never fail.
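As an editorial illustration, the fail, isolate, and restart cycle described above can be sketched as a small simulation. This is a minimal sketch under stated assumptions, not CoreWeave’s implementation: `run_job`, the `failures` fault-injection map, and the node names are all hypothetical, and a real system would typically swap in a spare node rather than simply shrink the set.

```python
def run_job(total_steps, checkpoint_every, failures, nodes):
    """Simulate a synchronous job that rolls back to its last checkpoint on failure.

    failures: hypothetical fault-injection map {step: node_name}; the named
    node fails while the job is computing that step.
    """
    healthy = set(nodes)
    pending = dict(failures)   # copy so the caller's map is not mutated
    checkpoint = 0             # last step that was durably saved
    step = 0                   # last step completed in memory
    restarts = 0
    recomputed = 0             # completed steps lost to rollbacks

    while step < total_steps:
        failed_node = pending.pop(step + 1, None)
        if failed_node is not None:
            healthy.discard(failed_node)      # isolate the faulty component
            recomputed += step - checkpoint   # work since the checkpoint is lost
            step = checkpoint                 # resume from saved progress
            restarts += 1
            continue
        step += 1
        if step % checkpoint_every == 0:
            checkpoint = step                 # persist progress

    return {
        "steps_done": step,
        "restarts": restarts,
        "recomputed_steps": recomputed,
        "healthy_nodes": sorted(healthy),
    }


result = run_job(
    total_steps=10,
    checkpoint_every=4,
    failures={6: "node-b"},
    nodes=["node-a", "node-b", "node-c"],
)
print(result)
```

With a checkpoint every 4 steps and a failure injected at step 6, the job rolls back to step 4, redoes one step, and finishes on the two remaining nodes. The checkpoint interval is the knob that trades checkpointing overhead against recomputed work.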

Networking as the Core Bottleneck

RD: You mentioned the network piece. I’ve been talking to folks who say the network has been one of the big bottlenecks in AI workloads. Has that been your experience?

PS: Yeah, the network is always the hardest part because it’s so interconnected. If we look at the latest-generation Grace Blackwell chips from NVIDIA, you have a compute tray with four GPUs, and out from each GPU we take four 200-gigabit network links going into what’s called a multi-plane architecture to build really large clusters. This makes the network both very large (there are hundreds of thousands of cables and connectors in these clusters) and really complex to manage and operate.

As you scale compute, as you scale your flops, you can do computation faster on a chip, but then you need to synchronize your gradients. Or if you’re doing inference, you can parallelize it or do disaggregated prefill-decode — either way, you need to move all of this data around really fast. The pressure to scale the network as we scale compute is constant, and those two things scale very differently because there’s a physical dimension to it. Over a certain distance, we need to use lasers — we can’t use electrons. Scaling lasers is hard, they run hot, and there’s a completely different physical reality to scaling network versus scaling a chip. So yes, the network is always going to be the ultimate bottleneck. When compute gets faster, people immediately want their synchronization latency down as much as possible.
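To put rough numbers on gradient synchronization, here is a back-of-the-envelope estimate. It assumes a bandwidth-optimal ring all-reduce, in which each GPU sends about 2(N-1)/N times the gradient size, and it ignores latency, overlap with compute, and the faster in-rack interconnect. The per-GPU figure of 4 x 200 Gbit/s comes from the discussion above; the function name and the example model size are illustrative assumptions, not figures from the interview.

```python
LINKS_PER_GPU = 4      # scale-out links per GPU, as described in the interview
GBIT_PER_LINK = 200    # gigabits per second per link


def ring_allreduce_seconds(num_gpus, payload_bytes,
                           gbit_per_s=LINKS_PER_GPU * GBIT_PER_LINK):
    """Bandwidth-only lower bound for one ring all-reduce (assumed algorithm)."""
    if num_gpus < 2:
        return 0.0                                   # nothing to synchronize
    bytes_per_s = gbit_per_s * 1e9 / 8               # gigabits/s -> bytes/s
    traffic = 2 * (num_gpus - 1) / num_gpus * payload_bytes
    return traffic / bytes_per_s


# Hypothetical example: 70 billion parameters with 2-byte gradients on 1,024 GPUs.
gradient_bytes = 70e9 * 2
print(f"{ring_allreduce_seconds(1024, gradient_bytes):.2f} s per synchronization")
```

Under these assumptions a single unoverlapped synchronization takes about 2.8 seconds. Faster chips shrink the compute time per step but leave this term unchanged, which is why the pressure lands on the network.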

RD: Are there ways to work around that — more pipes, bigger pipes, better routing software?

PS: Up to a certain point, yes. There are a lot of creative solutions, and in this space, more is more. But there are different network architectures. TPUs use what’s called a torus architecture, which is neat up to a certain size but doesn’t scale beyond that. Where the industry has landed now, if you look at how the latest NVIDIA chips are designed, is different scaling domains. There’s what we call the scale-up domain — the rack — where you have, say, 72 GPUs communicating efficiently over NVLink, which is all electrical with no fiber involved. That keeps both power usage and failure rates down because there are no optics.

But since we can’t do that over large distances, when we scale out beyond that rack, we go back to traditional optical-based transport and traditional network scaling. There’s a lot of research into doing training over WAN, connecting multiple gigawatt-scale data centers, and some people are using those techniques effectively in production. It’s always a trade-off. You have to adjust a model in ways that might cost you some accuracy, which may be acceptable, but as a model developer or researcher you would rather have unlimited resources and work without bounds. The problem is never going to go away. We’re just working around the constraints every day.
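The two scaling domains can be illustrated with a tiny classifier that decides which fabric a pair of GPUs would use. This is only a sketch: it assumes GPUs are numbered contiguously rack by rack, uses the 72-GPU rack size mentioned above, and the function name and labels are hypothetical.

```python
RACK_SIZE = 72  # GPUs per scale-up (NVLink) domain, per the example above


def transport(gpu_a, gpu_b, rack_size=RACK_SIZE):
    """Classify the path between two GPUs by global index (assumed numbering)."""
    if gpu_a == gpu_b:
        return "local"
    same_rack = gpu_a // rack_size == gpu_b // rack_size
    # Same rack: electrical scale-up fabric. Different racks: optical scale-out.
    return "scale-up" if same_rack else "scale-out"


print(transport(0, 71))   # both in the first rack
print(transport(71, 72))  # adjacent indices, but different racks
```

Communication libraries and schedulers make a similar distinction so that the chattiest traffic stays inside the rack and only the remainder crosses the optical network.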

RD: Unlimited everything means unlimited money, right?

PS: Yes. Some people seem to have that as well in this case.

Memory Bandwidth vs. Memory Size

RD: With AI workloads, you need a lot of memory per instance because you’re holding an entire model in memory. Is that different from traditional cloud workloads?

PS: All of this is different right now. How you interact with memory is very different in an AI workload versus something like a database, where you don’t necessarily hit every memory segment in every computation. In traditional inference, you hit all your GPU memory all the time, which means you’re heavily bottlenecked by memory bandwidth. Most traditional developers don’t worry about memory bandwidth day-to-day. If you’re an AI developer, you worry about it constantly.

We’ve seen some novel techniques come out of the mixture-of-experts approach — it’s been about two years since that really kicked off. It allows you to still load a large model into memory but not activate all of it at once for every request. That reduces the pressure on memory bandwidth significantly. But memory bandwidth is more of a bottleneck than memory size. And while you can scale memory size by connecting more GPUs together, that just pushes the pressure back onto the network. So we’re back at networking again.
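A rough way to see why bandwidth matters more than capacity: during token-by-token generation at batch size 1, every active weight is read from memory once per token, so memory bandwidth sets a floor on time per token. The sketch below uses that simplification and ignores KV-cache reads, batching, and compute time. All numbers are hypothetical round figures, not the specifications of any real chip or model.

```python
def min_seconds_per_token(active_params, bytes_per_param, mem_bandwidth_bytes_per_s):
    """Bandwidth floor on decode latency: bytes read per token / bandwidth."""
    return active_params * bytes_per_param / mem_bandwidth_bytes_per_s


BANDWIDTH = 2e12          # hypothetical 2 TB/s of GPU memory bandwidth
BYTES_PER_PARAM = 2       # 16-bit weights

dense = min_seconds_per_token(100e9, BYTES_PER_PARAM, BANDWIDTH)  # all 100B params active
moe = min_seconds_per_token(10e9, BYTES_PER_PARAM, BANDWIDTH)     # 10B of 100B active

print(f"dense: {dense * 1000:.0f} ms/token, mixture-of-experts: {moe * 1000:.0f} ms/token")
```

Both models occupy the same amount of memory in this example, but the mixture-of-experts model touches a tenth of it per token, so its bandwidth floor is ten times lower.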

GPU Utilization and Cluster Efficiency

RD: One of the other pressures has been GPU speed — everyone chasing the latest Blackwell server, the fastest NVIDIA chip. But I’ve also heard that the issue isn’t as easily reducible to more GPUs, that there’s a GPU efficiency aspect where people aren’t using the power they have efficiently. Would you agree?

PS: There are multiple ways to answer that. First: can you scale clusters to infinite size? There have been some plateaus there, and again that comes down a lot to networking. Those plateaus are usually overcome. Now we can pretty confidently scale clusters to many hundreds of thousands of GPUs. But I don’t think everyone should go build a 100,000-GPU coherent cluster and run 100,000-GPU coherent jobs, because it puts tremendous pressure on infrastructure reliability, lifecycle management, fault management — everything I mentioned earlier. This is something we specialize in. Taking 100,000 chips and a million network connections that are all waiting to cause you trouble, and making that work reliably for customers — it’s hard. In some use cases it’s worth it, particularly for large-scale pre-training where more compute gets you done faster. In many other use cases, you might be better off working a bit smaller and making your life easier.

Then there’s the utilization story, which is a bit different. That’s about how effectively your researchers or your team are actually using the compute. This is a growing question from the enterprise segment: a project requesting 2,000 GPUs — are they actually using them effectively? Are their algorithms a good fit? Are they wasting half their flops? A lot of what we’re working on through our observability and AI software stack is to bridge that gap — giving researchers and project owners visibility into how efficiently their compute is being used and how well the infrastructure is supporting them, rather than just dollars disappearing into a black hole.
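One common way to quantify "are they wasting half their flops" is model FLOPs utilization (MFU): useful model FLOPs per second divided by the hardware's theoretical peak. The sketch below uses the widely used approximation of about 6 FLOPs per parameter per training token for dense transformers. That approximation, the function name, and the example numbers are assumptions for illustration; the interview does not say which metric CoreWeave's tooling reports.

```python
def model_flops_utilization(params, tokens_per_s, num_gpus, peak_flops_per_gpu):
    """MFU = achieved model FLOPs/s divided by aggregate peak FLOPs/s."""
    achieved = 6 * params * tokens_per_s        # ~6 FLOPs per parameter per token
    peak = num_gpus * peak_flops_per_gpu
    return achieved / peak


# Hypothetical project: 2,000 GPUs at 1e15 peak FLOPs/s each,
# training a 70B-parameter model at 2 million tokens per second.
mfu = model_flops_utilization(70e9, 2e6, 2000, 1e15)
print(f"MFU: {mfu:.0%}")
```

In this made-up case the cluster delivers roughly 42% of its theoretical peak, which is the kind of number that gives a project owner something concrete to track over time.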

Scheduling at Scale

RD: You mentioned coordinating 100,000-GPU clusters. At what point does that become a scheduling problem?

PS: 100 — you can do that in your basement. 100,000 is where it gets interesting. And yes, scheduling is absolutely a significant part of the challenge. At that scale, how you allocate workloads, manage failures, restart jobs, and keep utilization high all become deeply intertwined problems. The scheduler has to understand the topology of the cluster, the health of individual components, and the priorities across different teams and jobs simultaneously. It’s one of the areas where the tooling built for traditional cloud workloads simply doesn’t translate, and where purpose-built AI infrastructure systems have to develop their own solutions from the ground up.
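The three inputs named above (topology, component health, and priority) can be combined in a toy placement routine. This is a deliberately simple greedy sketch, not how any production scheduler works: the node dictionaries, field names, and `place_job` / `schedule` functions are all hypothetical.

```python
from collections import defaultdict


def place_job(gpus_needed, nodes):
    """Greedy placement that skips unhealthy nodes and tries to span few racks."""
    by_rack = defaultdict(list)
    for node in nodes:
        if node["healthy"] and node["free_gpus"] > 0:
            by_rack[node["rack"]].append(node)

    # Visit racks with the most free GPUs first so the job stays compact.
    racks = sorted(by_rack, key=lambda r: (-sum(n["free_gpus"] for n in by_rack[r]), r))

    placement, remaining = [], gpus_needed
    for rack in racks:
        for node in sorted(by_rack[rack], key=lambda n: n["name"]):
            if remaining == 0:
                break
            take = min(node["free_gpus"], remaining)
            placement.append((node["name"], take))
            remaining -= take
        if remaining == 0:
            break
    return placement if remaining == 0 else None   # None: job does not fit


def schedule(jobs, nodes):
    """Place jobs in priority order. Mutates free_gpus on the given nodes."""
    by_name = {n["name"]: n for n in nodes}
    results = {}
    for name, _priority, need in sorted(jobs, key=lambda j: -j[1]):
        placement = place_job(need, nodes)
        if placement is not None:
            for node_name, take in placement:
                by_name[node_name]["free_gpus"] -= take
        results[name] = placement
    return results


nodes = [
    {"name": "a1", "rack": "A", "free_gpus": 4, "healthy": True},
    {"name": "a2", "rack": "A", "free_gpus": 4, "healthy": False},
    {"name": "b1", "rack": "B", "free_gpus": 4, "healthy": True},
    {"name": "b2", "rack": "B", "free_gpus": 4, "healthy": True},
]
jobs = [("small", 1, 4), ("big", 5, 8)]   # (name, priority, gpus_needed)
print(schedule(jobs, nodes))
```

Here the higher-priority job is placed first and lands entirely in rack B, the unhealthy node is never used, and the smaller job takes the remaining healthy node in rack A. Real schedulers add preemption, gang scheduling, and far richer topology models on top of this basic shape.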

