Beyond Raw Speed: The Multi-Dimensional Chess Game of AI Inference
Why Jensen Huang's "Speed is the best cost-reduction system" only tells part of the trillion-dollar inference story
Note: I want to thank readers for the valid points they've raised, and I'd like to clarify my stance on a few of them. Some have said I don't give enough credit to the awesome technical work done at big tech. Others have said, and I paraphrase here: please stop making my VP of AI Strategy forward your newsletter with seven question marks in the subject line.
Fair enough! So let me clarify my stance here: I have nothing but respect for the technological marvels companies like Nvidia have unleashed. Without them, we'd all still be training ResNet-50 on gaming laptops and calling it "deep learning." (Not that there is anything wrong with that either)
This newsletter exists for two reasons: to push the industry toward better solutions and to call out polished marketing narratives that don't survive contact with reality. Today's post embodies both missions as I dissect Jensen's recent GTC proclamation that "Speed is the best cost-reduction system" for AI inference. I genuinely revere Jensen – the man has single-handedly transformed computing multiple times over. But even tech visionaries aren't immune to the occasional statement that sounds suspiciously like Intel circa 2010 insisting that nobody would ever need anything but x86. The semiconductor graveyard is littered with companies that confused present market dominance with future-proof technical wisdom.
Welcome to another AI Afterhours newsletter, where I spend entirely too much time obsessing over technical minutiae that will fundamentally reshape computing economics. Today we're diving into the trillion-dollar quandary of inference optimization, where most coverage remains stubbornly superficial: "faster chip good" is not analysis, it's a marketing slogan masquerading as insight.
Jensen Huang made headlines at GTC 2025 with his characteristically blunt assessment: "Speed is the best cost-reduction system." The press dutifully reported this, and a thousand LinkedIn posts were born. But as with most things in AI infrastructure, the reality is more nuanced and, frankly, more interesting.
Let me paint you a picture: The lights are dimmed at the SAP Center, the crowd is hushed, and Jensen, in his trademark leather jacket, is doing back-of-the-envelope math on AI economics that will determine which companies thrive and which become footnotes in business school case studies about AI's Cambrian explosion.
The Inference-Dominated TCO Reality
We've spent two years fixated on training costs. The headlines about hundred-million-dollar training runs make for good copy. But all that training is worthless unless you can put the models to work for something useful, aka inference.
Think about it: you train a model once, but you might run inference billions of times. Your multi-billion parameter model that cost $20M to train? That expense amortizes nicely when distributed across billions of API calls. But the ongoing cost of serving those API calls? That compounds daily and scales with your success. Inference is the silent profit-killer lurking in every AI startup's unit economics.
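To make that asymmetry concrete, here's a toy calculation. The $20M training figure is from above; the per-call serving cost and lifetime call volume are hypothetical numbers I've made up for illustration:

```python
# Toy unit economics: training amortizes, inference compounds.
# Assumed numbers: $20M training run (from the text), a hypothetical
# $0.003 per API call in serving costs, and 10B lifetime calls.
training_cost = 20_000_000
inference_cost_per_call = 0.003
lifetime_calls = 10_000_000_000

# Training cost spread across every call the model ever serves.
training_cost_per_call = training_cost / lifetime_calls

# Inference cost just keeps accumulating with usage.
total_inference_cost = inference_cost_per_call * lifetime_calls

print(f"Training, amortized: ${training_cost_per_call:.4f} per call")  # $0.0020 per call
print(f"Inference, lifetime: ${total_inference_cost:,.0f}")            # $30,000,000
```

With these (invented) numbers, the one-time training run shrinks to a fifth of a cent per call, while serving quietly racks up 1.5x the entire training budget.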
Jensen understands this with crystalline clarity. When he says, "Inference is going to be one of the most important workloads in the next decade," he's not just positioning for the next quarterly earnings call – he's describing a shift in computing economics that will transform the entire semiconductor value chain, cloud provider landscape, and AI application viability models for years to come.
The increase in these inference demands is not incremental but a quantum change. (And no, this isn't the kind of quantum that gets investors excitedly reaching for their checkbooks at pitch meetings—though I'm sure your quantum computing startup is lovely!) According to Huang, "The amount of computation we have to do for inference is dramatically higher than it used to be... 100 times more, easily."
Let that sink in. Not 2x. Not 10x. 100x.
When your inference costs suddenly require two more zeros, speed isn't just nice to have – it's existential.
The Multi-Dimensional Chess Game
But here's where I think Jensen's sound bite, while punchy and correct, undersells the opportunity. Remember when we were all obsessed with increasing CPU clock speeds? Remember how "Intel Inside" stickers adorned every laptop while their marketing department convinced the world that the only path to progress was incrementally faster clock speeds? Remember how Nvidia took over by playing 4D chess through true architectural innovation? The irony here is rich enough to fund a venture round.
Intel spent decades drilling "megahertz matters" into our collective consciousness, only to watch their market dominance evaporate when the paradigm shifted. Their "speed is everything" mantra aged about as well as those Pentium stickers now peeling off ancient ThinkPads in enterprise IT closets. Nvidia didn't win by making slightly faster CPUs; they won by recognizing that the nature of computation itself was changing and building architectures optimized for massively parallel workloads. Sound familiar?
Similarly, reducing inference costs is a multi-dimensional chess game. The winners won't just be those who ship marginally faster chips on a predictable cadence. They'll be the players who recognize that the inference landscape is far more complex and far more malleable than most realize. Checkmate doesn't come from moving the fastest piece; it comes from controlling the board (we're talking circuit board, not chessboard; try to keep up, will you!).
Let me walk you through the dimensions of this game, and why optimization across all of them simultaneously is the only viable long-term strategy:
Dimension 1: Raw Speed through Hardware Architecture Innovation
Speed matters, but the most interesting story isn't Nvidia's incremental improvements to existing GPU architectures. It's how competitors are fundamentally rethinking chip design:
Memory Bandwidth Revolution: Cerebras looked at the memory bottleneck problem and basically said, "What if we just... didn't have one?" Their Wafer-Scale Engine—essentially a chip the size of a dinner plate—delivers a mind-boggling "970 tokens per second" on Llama 3.1-405B. While Nvidia's been arranging deck chairs on the HBM Titanic, Cerebras packed the whole model into on-chip SRAM, casually delivering thousands of times more memory bandwidth. It's like watching someone solve traffic congestion by inventing teleportation.
Specialized Matrix Units: Google's TPU v5p and Trillium chips have all the personality of accounting software, but they're 4.7x more efficient at the matrix multiplications that dominate inference workloads. While Nvidia's been adding more of everything, Google's been asking, "What if we just did the important things really, really well?" It's the computational equivalent of skipping leg day to focus entirely on bicep curls—except in this case, it actually works.
What's fascinating isn't just that these challengers are making faster chips—it's that they're questioning everything we thought we knew about AI computation. They're the architectural equivalent of asking "Why are we still doing this with QWERTY keyboards?"
As we come up with these novel architectures and solve for speed, we need to keep an eye on an important question about diminishing returns: once you reach 50-100 tokens per second (roughly human reading speed), where's the value in pushing to 500+? The answer lies in throughput economics and in enabling machine-to-machine communication, but also in our next dimension.
Dimension 2: Efficiency - The Environmental and Economic Imperative
Raw speed without efficiency is like chugging Red Bull to stay awake during a marathon—impressive short-term results, catastrophic long-term strategy. As AI energy consumption threatens to consume more power than small nations, efficiency isn't just nice-to-have, it's existential:
Memory-First Efficiency: AMD looked at inference workloads and had the stunning revelation that maybe—just maybe—they're memory-bound, not compute-bound. Their MI300X accelerators prioritize memory bandwidth, letting them run Llama 3.1-405B with significantly fewer chips than Nvidia setups. It's like realizing you don't need a sports car if the real problem is finding parking.
Data Movement Minimization: The dirty secret of AI computation isn't the math; it's the energy cost of data transportation. Moving bits between memory and compute is the computational equivalent of flying avocados from Mexico to Maine. Companies are tackling this problem in increasingly clever ways. Let's take a specific example: SambaNova's SN40L chip pairs a radical dataflow architecture with a "three-tier memory subsystem" that's the computational equivalent of urban planning, putting the most frequently accessed data in the most accessible locations. Their chips spend less time waiting for data than a New Yorker spends waiting for the subway, and the results demonstrate that computational organization matters more than brute force: "461 tokens per second" on Llama 3.1-70B with just 16 chips, where competitors require entire racks of hardware.
When your inference costs scale to billions of queries per day, these efficiency improvements aren't just good engineering—they're the difference between profit and bankruptcy. While Nvidia focuses on the computational equivalent of drag racing, these companies are designing the Tesla of AI chips: still impressively fast, but without the existential dread of watching your fuel gauge (or bank account) plummet.
Dimension 3: Model Architecture Innovation
The third dimension is perhaps the most under-appreciated: making the models themselves more efficient at inference time. This is where pure hardware vendors are most vulnerable:
Meta's Architectural Revolution: While hardware companies battle over transistor counts, Meta has been relentlessly optimizing the Llama architecture itself. Their implementation of "grouped query attention (GQA) across both the 8B and 70B sizes" demonstrates how architectural choices within the model can dramatically improve inference efficiency regardless of hardware. Additionally, Meta has been pioneering techniques to reduce Llama 3.1 405B's precision requirements "from 16-bit (BF16) to 8-bit (FP8) numerics" without significant accuracy loss.
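The mechanism behind GQA is simple enough to sketch: many query heads share a smaller set of key/value heads, which shrinks the KV cache that dominates inference memory traffic. Here's a minimal numpy illustration; the shapes and head counts are made up for readability, not Llama's actual configuration, and causal masking is omitted for brevity:

```python
import numpy as np

def grouped_query_attention(x, wq, wk, wv, n_q_heads=8, n_kv_heads=2):
    """GQA sketch: n_q_heads query heads share n_kv_heads key/value heads,
    so the KV cache is n_q_heads / n_kv_heads times smaller than in
    vanilla multi-head attention. (No causal mask, single sequence.)"""
    seq, d_model = x.shape
    head_dim = d_model // n_q_heads
    group = n_q_heads // n_kv_heads            # query heads per KV head

    q = (x @ wq).reshape(seq, n_q_heads, head_dim)
    k = (x @ wk).reshape(seq, n_kv_heads, head_dim)   # fewer K heads
    v = (x @ wv).reshape(seq, n_kv_heads, head_dim)   # fewer V heads

    outs = []
    for h in range(n_q_heads):
        kv = h // group                         # the shared KV head for this query head
        scores = q[:, h] @ k[:, kv].T / np.sqrt(head_dim)
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)
        outs.append(weights @ v[:, kv])
    return np.concatenate(outs, axis=-1)        # (seq, d_model)

rng = np.random.default_rng(0)
seq, d_model, n_q, n_kv = 4, 16, 8, 2
head_dim = d_model // n_q
x  = rng.standard_normal((seq, d_model))
wq = rng.standard_normal((d_model, d_model))
wk = rng.standard_normal((d_model, n_kv * head_dim))
wv = rng.standard_normal((d_model, n_kv * head_dim))

out = grouped_query_attention(x, wq, wk, wv, n_q, n_kv)
print(out.shape)  # (4, 16)
```

Notice that the K and V projections here are 4x smaller than the Q projection; at serving time that same factor applies to the KV cache you must keep in memory per token, which is exactly the traffic the hardware vendors above are fighting over.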
Google's TPU-Optimized Frameworks: Google has developed JetStream, "a new inference engine specifically designed for Large Language Model (LLM) inference" that's custom-built for their TPU architecture. This tight integration between model architecture and hardware represents a fundamentally different approach than Nvidia's more general-purpose strategy.
Microsoft's Small Model Mastery: Microsoft's Phi-3 models demonstrate that careful architectural design and training methodology can yield models that perform comparably to much larger competitors while requiring a fraction of the computational resources. As noted in industry analysis, Microsoft has developed their "Phi-3 Mini model, with ½ as many parameters as Llama 3 8B, yet similar MMLU performance," showing that model architecture innovation can completely change the inference cost equation.
Anthropic's Speculative Decoding: Anthropic employs techniques like speculative decoding, where a smaller, more efficient draft model generates candidate tokens that are then verified by the larger model. This approach dramatically improves inference speed without sacrificing quality. It's a brilliant example of using algorithmic innovation to outperform brute-force hardware approaches.
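The trick, usually called speculative decoding in the literature, is easy to sketch with toy greedy "models" as plain functions. The key property is that the output is identical to what the large model alone would produce; you only save time when the draft guesses right:

```python
def speculative_decode(target, draft, prompt, k=4, max_new=8):
    """Toy greedy speculative decoding: a cheap draft model proposes k
    tokens; the expensive target model verifies them (in a real system,
    in one batched forward pass) and keeps the longest agreeing prefix
    plus one corrected token. Output matches plain target decoding."""
    seq = list(prompt)
    while len(seq) - len(prompt) < max_new:
        proposal = list(seq)
        for _ in range(k):                      # 1) cheap drafting
            proposal.append(draft(proposal))
        for i in range(k):                      # 2) verification
            t = target(seq)
            seq.append(t)                       # target's token is always kept
            if t != proposal[len(seq) - 1]:     # draft diverged: end this round
                break
    return seq[:len(prompt) + max_new]

# Toy deterministic "models": next token = (last token + 1) mod 10.
target_model = lambda s: (s[-1] + 1) % 10
good_draft   = lambda s: (s[-1] + 1) % 10     # always agrees with the target

out = speculative_decode(target_model, good_draft, [0], k=4, max_new=8)
print(out)  # [0, 1, 2, 3, 4, 5, 6, 7, 8]
```

With a draft that always disagrees (say, `lambda s: 0`), each round still commits one correct token, so the output is unchanged; only the speedup disappears. That's the whole bargain: quality from the big model, latency from the small one.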
This dimension is particularly threatening to Nvidia because architectural innovations in the models themselves can deliver performance improvements that no amount of hardware optimization can match. When the model itself requires less computation, even the fastest hardware loses its advantage.
Dimension 4: Accuracy-Efficiency Frontier
The final dimension is perhaps the most critical: making smaller models more performant. As Nvidia and others push raw speed, there's an emerging recognition that inference isn't just about tokens per second, but about correct tokens per second per watt of power expended. Speed is worthless if you're fast and wrong or will destroy the Earth doing it. (Did I hear you ask what I mean by correct? My somewhat loosey-goosey definition of correctness is that it minimizes some objective function relevant to the problem you're trying to solve.)
Nvidia's Nemotron Gambit: Recognizing this frontier, Nvidia recently released "Llama 3.1-Nemotron-51B-Instruct, developed using NAS and knowledge distillation derived from the reference model, Llama-3.1-70B." The goal? To create a smaller model that "yields 2.2x faster inference compared to the reference model while maintaining nearly the same accuracy." This is a fascinating pivot for a hardware company – acknowledging that model optimization may be more important than raw hardware speed.
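The distillation half of that recipe boils down to a single objective: train the small student to match the big teacher's softened output distribution. Here's a generic numpy sketch of that loss; this is the textbook formulation, not Nvidia's actual Nemotron training code, and the logits are random toy data:

```python
import numpy as np

def softmax(z, T=1.0):
    """Temperature-softened softmax over the last axis."""
    z = z / T
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, T=2.0):
    """KL(teacher || student) on temperature-softened distributions:
    the core objective that lets a smaller student mimic a larger
    teacher's behavior, averaged over positions."""
    p = softmax(teacher_logits, T)
    q = softmax(student_logits, T)
    # T^2 scaling keeps gradient magnitudes comparable across temperatures.
    return float((p * (np.log(p) - np.log(q))).sum(axis=-1).mean() * T * T)

rng = np.random.default_rng(0)
teacher_logits = rng.standard_normal((4, 10))   # 4 positions, 10-token vocab
random_student = rng.standard_normal((4, 10))

print(distillation_loss(teacher_logits, teacher_logits))      # 0.0 (perfect mimic)
print(distillation_loss(random_student, teacher_logits) > 0)  # True
```

A student that reproduces the teacher's logits exactly drives the loss to zero; everything about NAS, pruning, and architecture search is then a fight over how small the student can get while keeping that loss low.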
Microsoft's Phi-3 Breakthrough: Microsoft has been pioneering the development of extremely efficient small models. Their Phi-3 series demonstrates that careful attention to training methodology and architecture can yield models that perform surprisingly well despite their small size. In particular, the Phi-3 Mini model achieves performance comparable to much larger competitors while requiring dramatically fewer computational resources, fundamentally changing the cost equation.
Anthropic's Signal Filtering: Anthropic has been remarkably focused on filtering signal from noise in large language models. Their Claude 3.7 Sonnet introduces a hybrid approach that allows choosing between fast responses and "extended thinking" mode that improves performance on complex tasks. As independent benchmarks show, this approach creates "78.2% accuracy on graduate-level reasoning tasks" while maintaining flexibility for simpler queries.
DeepSeek's R1 Efficiency Innovation: Chinese startup DeepSeek made waves with its super-efficient R1 reasoning model. Developed at a fraction of the cost of other models, DeepSeek-R1 uses "reinforcement learning in which an autonomous agent learns to perform a task through trial and error and without any instructions from a human user" to create a highly optimized reasoning system that performs competitively with much larger models from OpenAI and Anthropic.
This dimension may ultimately prove the most disruptive. If smaller, more efficient models can match or exceed the performance of their larger counterparts, the entire premise of Nvidia's "bigger, faster, more powerful" approach becomes questionable. Why buy expensive hardware to run inefficient models when you could run optimized models on much cheaper hardware?
The Autonomous Action Acceleration
What's driving this urgency? Autonomous, action-oriented systems.
When models move from passive responders to systems that execute multi-step operations – planning sequences, dispatching API calls, evaluating outcomes, and recursively refining approaches – inference demands explode. These advanced systems aren't just answering your question; they're running complex computational graphs, maintaining state, consulting external tools, and synthesizing disparate information streams in real-time.
Each of those steps requires inference. Lots of it.
Companies are already seeing this in production. Simple LLM-powered interfaces might use a few dollars of inference per user per month. But autonomous systems that execute complex workflows on users' behalf? The costs can be orders of magnitude higher.
This is why optimization matters so tremendously. When your inference costs are $1 per user, shaving off 10% is nice. When they're $100 per user, that same 10% could be the difference between profitability and bankruptcy.
The Economic Reality
Let's put some numbers to this.
Imagine you're running an AI assistant service with 1 million users. Each generates 100 queries per month, and each query costs $0.01 in inference expenses.
That's $1 million monthly in inference costs.
Now add sophisticated reasoning capabilities and autonomous operations. Each original query spawns 10 computational steps in the reasoning chain, and each step triggers 5 external API calls whose results need processing by the model. That's roughly 60 inference operations where there used to be one. Suddenly, you're looking at $60 million monthly.
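The arithmetic from the scenario above, assuming (as a simplification) that every reasoning step and every API-result processing pass costs the same $0.01 as a plain query:

```python
# Back-of-the-envelope inference economics for the scenario above.
users = 1_000_000
queries_per_user = 100
cost_per_inference = 0.01

# Simple assistant: one inference per query.
simple_monthly = users * queries_per_user * cost_per_inference
print(f"Simple assistant:  ${simple_monthly:,.0f}/month")      # $1,000,000/month

# Autonomous system: 10 reasoning steps per query, each step making
# 5 API calls whose results the model must also process.
steps_per_query = 10
api_calls_per_step = 5
inferences_per_query = steps_per_query * (1 + api_calls_per_step)   # 60

agent_monthly = users * queries_per_user * inferences_per_query * cost_per_inference
print(f"Autonomous system: ${agent_monthly:,.0f}/month")       # $60,000,000/month
```

Note how brutally the multiplier compounds: the per-inference price never changed, only the number of inferences hiding behind each user-visible query.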
This isn't hypothetical – it's the economic reality facing every company building sophisticated multi-stage AI systems today.
In this environment, Huang's statement that "speed is the best cost-reduction system" isn't hyperbole – it's economic survival.
The Multi-Pronged Future
So who wins? Those who recognize that inference optimization isn't just about raw speed. It's fascinating to see how the various players are placing their bets:
Nvidia is betting on raw computational power combined with software optimization like TensorRT, but has started to recognize the importance of model architecture with their Nemotron initiative.
Cerebras is making the most radical hardware bet with their wafer-scale approach that reimagines what a "chip" even is, focusing on memory bandwidth as the key to unlocking inference performance.
AMD is betting on memory bandwidth and compatibility, offering much higher memory capacity per chip than Nvidia which translates directly to more efficient inference for large models.
SambaNova is betting on memory hierarchy optimization and a specialized dataflow architecture that minimizes the movement of data during inference operations.
Google is betting on workload-specific acceleration with TPUs that are specifically designed for the matrix operations that dominate AI inference.
Meta is betting on architectural innovation within model design itself and open-sourcing their approach to drive industry-wide adoption.
Microsoft is betting on smaller, more efficient models that can match the performance of larger ones through careful architecture and training methodology.
Anthropic is betting on MoE architectures and speculative execution to dramatically reduce the computational requirements of inference.
The irony is that Jensen Huang is both right and wrong. Speed absolutely matters for inference cost reduction – but "speed" means different things to different workloads, and raw FLOPS is often the least important type of speed for inference.
And of course there are many others beyond the ones I've mentioned! Some of them are even optimizing for the right metrics! ;) For example, I think the real winners will be those who optimize for a different definition of speed: correct tokens per watt (see above).
What's certain is that we're witnessing a Cambrian explosion of approaches to AI inference optimization. Unlike the previous generation of AI hardware that was dominated by a single player, the inference landscape is fragmenting into specialized solutions optimized for different workloads and constraints.
For enterprises deploying AI at scale, this diversity of approaches means more options, lower costs, and better performance. For Nvidia, it means the comfortable monopoly of the training era might not persist in the inference era.
Speed may be the best cost-reduction system, but speed comes in many forms – and Nvidia doesn't have a monopoly on any of them.
What This Means For You
If you're building AI systems, the implications are clear:
Audit your inference costs relentlessly. This is probably your largest variable expense, and it's only growing.
Don't wait for hardware alone to save you. Architectural innovation and software optimization can deliver massive gains with your existing infrastructure.
Design for inference efficiency from the start. The most elegant model architecture in the world doesn't matter if its inference costs make it commercially nonviable.
Watch the open-source inference optimization space closely. Tools like Dynamo are just the beginning of an inference optimization revolution.
Consider the full-stack implications. If your inference costs are critical, you may need to rethink everything from your choice of cloud provider to your physical infrastructure.
The Philosophical Shift
What I find most fascinating about all of this is the philosophical shift it represents.
For decades, we've thought about computing in terms of writing software that runs on hardware. The economic model was simple: buy hardware, write software, run software, repeat.
The emerging model turns this on its head. As Huang puts it, "Whereas in the past we wrote the software and we ran it on computers, in the future, the computers are going to generate the tokens for the software."
This isn't just a technical shift – it's a complete reimagining of what computing is and how we organize our technical and economic resources around it.
In this new world, inference isn't just a cost center – it's the beating heart of computing itself. And optimizing it isn't just good business – it's the only way to survive.
What do you think? Are there other dimensions to inference optimization I've missed? Other strategies companies are employing? Let me know in the comments, and we'll explore this further in a future issue.
Until next time, Shwetank