The Gigawatt Question: Will Massive New Training Centers Break Through AI's Scaling Ceiling—or Hit a Wall?

Introduction

The AI industry is placing the largest infrastructure bet in the history of computing. In 2025, hyperscalers collectively committed $340–370 billion in capital expenditure, much of it directed at AI infrastructure^[1]. OpenAI and NVIDIA announced a strategic partnership to deploy 10 gigawatts of AI data center capacity^[12]. Microsoft, Google, Amazon, and Meta are each racing to build campuses that would have been unthinkable just three years ago—facilities drawing a gigawatt or more of power, rivaling the output of nuclear power plants. McKinsey estimates the total race to scale data centers could reach $7 trillion^[14].

But behind the staggering numbers lies a question that practitioners, investors, and researchers are actively debating in real time: Will this compute actually translate into proportionally better AI models? Or are we approaching a regime where the returns from simply scaling up pre-training begin to flatten, forcing the industry to find intelligence through other means—better data, smarter post-training, reinforcement learning, and inference-time compute?

This isn't an abstract academic debate. It determines whether the billions being poured into gigawatt-scale facilities will produce transformative AI capabilities or become the most expensive white elephants in technology history. It shapes which companies win, which architectures prevail, and whether the path to artificial general intelligence runs through brute-force scale or through algorithmic ingenuity.

The conversation among experts is nuanced, contentious, and evolving fast. Some see scaling laws as iron-clad empirical regularities that will continue to deliver. Others point to emerging evidence that pre-training scaling is hitting data walls and diminishing returns, while post-training techniques like reinforcement learning are opening entirely new scaling dimensions. Still others argue the entire framing is wrong—that the shift from training-centric to inference-centric compute changes the economics so fundamentally that the gigawatt data center thesis needs to be rewritten.

This article synthesizes what the leading researchers, practitioners, and analysts are actually saying—on X, in papers, and in industry analysis—about where AI scaling stands in mid-2025, what the gigawatt-class training centers will realistically deliver, and which technical methods are expected to keep pushing the frontier forward.

Overview

The Original Scaling Laws: What Chinchilla Told Us (and What It Didn't)

To understand the current debate, you need to understand the foundation it rests on. In 2022, DeepMind's Chinchilla paper established what became the dominant framework for thinking about how to allocate compute when training large language models. The core finding was elegant: for a given compute budget, there's an optimal balance between model size (parameters) and training data (tokens). Specifically, Chinchilla suggested that parameters and data should scale roughly equally—a ratio of approximately 20 tokens per parameter.

pavi2410 @PavitraGolchha Sat, 28 Feb 2026 22:43:33 GMT

Today I learned about the Chinchilla scaling law. Most LLMs are undertrained.

OpenAI & Google added parameters but not enough data.

DeepMind found that for every 2x more compute, you must scale parameters & data equally. Chinchilla (70B) beat GPT-3 (175B) because it had 4x the data.

View on X →

This insight was genuinely revolutionary. It showed that GPT-3's 175 billion parameters were dramatically undertrained relative to their compute-optimal point. Chinchilla, with "only" 70 billion parameters but 4× the training data, outperformed it. The implication was clear: the industry had been building models that were too big and feeding them too little data.

But as Andrej Karpathy has pointed out, practitioners widely misunderstand what Chinchilla actually tells you:

Andrej Karpathy @karpathy Thu, 18 Apr 2024 18:53:55 GMT

no. people misunderstand chinchilla.
chinchilla doesn't tell you the point of convergence.
it tells you the point of compute optimality.
if all you care about is perplexity, for every FLOPs compute budget, how big model on how many tokens should you train?
for reasons not fully intuitively understandable, severely under-trained models seem to be compute optimal.
in many practical settings though, this is not what you care about.
what you care about is what is the best possible model at some model size? (e.g. 8B, that is all that i can fit on my GPU or something)
and the best possible model at that size is the one you continue training ~forever.
you're "wasting" flops and you could have had a much stronger, (but bigger) model with those flops.
but you're getting an increasingly stronger model that fits.
and seemingly this continues to be true without too much diminishing returns for a very long time.

View on X →

This distinction matters enormously for the gigawatt question. Chinchilla-optimal scaling tells you how to get the best loss per FLOP. But in practice, what you often care about is the best model at a given size—because that model needs to run on actual hardware for inference. And for that goal, you train far beyond the Chinchilla-optimal point, "wasting" FLOPs to squeeze more capability into a fixed parameter count.

MosaicML formalized this insight in their "Beyond Chinchilla-Optimal" work, which modified the scaling laws to account for inference costs^[2]. Their finding was striking: when you factor in the cost of actually deploying a model (say, serving a billion requests), you should train models that are smaller and longer-trained than Chinchilla would suggest.

AK @_akhaliq Tue, 02 Jan 2024 04:19:24 GMT

MosaicML announces Beyond Chinchilla-Optimal

Accounting for Inference in Language Model Scaling Laws

paper page: https://t.co/v82ZdcUc8Y

Large language model (LLM) scaling laws are empirical formulas that estimate changes in model quality as a result of increasing parameter count and training data. However, these formulas, including the popular DeepMind Chinchilla scaling laws, neglect to include the cost of inference. We modify the Chinchilla scaling laws to calculate the optimal LLM parameter count and pre-training data size to train and deploy a model of a given quality and inference demand. We conduct our analysis both in terms of a compute budget and real-world costs and find that LLM researchers expecting reasonably large inference demand (~1B requests) should train models smaller and longer than Chinchilla-optimal.

View on X →

This has direct implications for infrastructure planning. If the optimal strategy is to train smaller models for longer on more data, the compute requirements shift in character. You still need enormous amounts of training compute, but the nature of what "optimal" means changes depending on your deployment scenario.

The Data Wall: Pre-Training's Most Fundamental Constraint

The single most important constraint on continued pre-training scaling isn't compute—it's data. This is the finding that has reshaped the entire conversation in 2025.

Aran Komatsuzaki @arankomatsuzaki Fri, 19 Sep 2025 02:47:41 GMT

Pre-training under infinite compute

• Data, not compute, is the new bottleneck
• Standard recipes overfit → fix with strong regularization (30× weight decay)
• Scaling laws: loss decreases monotonically, best measured by asymptote not fixed budget

View on X →

Epoch AI's analysis of whether AI scaling can continue through 2030 identifies data as the binding constraint^[2]. The internet contains a finite amount of high-quality text. Estimates vary, but the consensus is that frontier models are already training on a substantial fraction of all publicly available, high-quality text data. Simply building bigger clusters doesn't help if you've already consumed most of what's available to train on.

This doesn't mean pre-training scaling is dead. It means the nature of the scaling challenge has shifted from "how do we get more FLOPs?" to "how do we get more effective data?" The solutions being pursued include:

Synthetic data generation: Using existing models to generate training data for next-generation models. This works surprisingly well for certain domains (math, code, reasoning) but carries risks of model collapse if not carefully managed.
Data curation and quality filtering: Aggressive filtering of training data to remove low-quality, duplicated, or harmful content. The insight here is that a smaller, higher-quality dataset can outperform a larger, noisier one.
Multi-modal data: Incorporating images, video, audio, and structured data to supplement text. This expands the effective data pool significantly.
Repeated epochs with regularization: Training on the same data multiple times with techniques to prevent overfitting—including dramatically increased weight decay (up to 30× standard values, as recent research suggests).

Kol Tregaskes @koltregaskes Fri, 27 Feb 2026 21:30:06 GMT

AI software progress cuts training compute needs by around 10× per year.

- This efficiency stems from data quality enhancements like synthetic datasets and curation, rather than major algorithmic breakthroughs.
- Scale-dependent innovations, such as shifting from LSTMs to Transformers, amplify gains at higher compute levels - up to 26× at 10^23 FLOP.
- Estimates vary widely, with pre-training at 3× to 20× yearly and post-training at 5× to 30× for specific boosts.
- Progress could shorten AGI timelines to the 2030s but may face compute bottlenecks limiting rapid automation.

Overall, software advancements enable more with existing resources, though data limitations create uncertainty.

View on X →

The claim that AI software progress cuts training compute needs by around 10× per year is aggressive but directionally supported by the evidence. Epoch AI's research suggests that algorithmic improvements have historically contributed roughly as much as hardware improvements to AI progress^[2]. The key nuance is that these gains come primarily from data quality improvements and architectural innovations rather than single breakthrough algorithms.

What Gigawatt-Scale Training Centers Will Actually Deliver

So what happens when you flip the switch on a 1 GW+ training cluster? The honest answer is: it depends entirely on what you're training and how.

NVIDIA's vision for the gigawatt data center age emphasizes that networking, not just raw GPU count, becomes the critical bottleneck at this scale^[11]. A gigawatt facility might house hundreds of thousands of GPUs, but getting them to work coherently on a single training run requires networking infrastructure that can move data between GPUs faster than the GPUs can process it. This is a genuinely hard engineering problem—one that gets exponentially harder as cluster size increases.

Kasehun Abrahem @AbrahemKasehun Thu, 26 Feb 2026 19:59:03 GMT

Lawrence Berkeley Lab estimates:

• Data center power demand doubled (2018–2024)
• Could triple by 2028

AI rack density:

• 30–60 kW per rack
• Legacy racks ~8–12 kW

That’s 3–5× load per rack.

If a hyperscale site runs 100,000 AI racks:

At 40 kW average → 4,000 MW = 4 GW theoretical load envelope.

Even if actual usage is lower, the peak design capacity is massive.

To stabilize that kind of load, operators need:

• Battery buffers
• Fast-response inverters
• Load balancing AI
• On-site redundancy

NextNRG, Inc. (NXXT) microgrid architecture integrates:

• Battery storage
• AI load optimization
• Hybrid generation
• Grid-forming inverter control

That stack becomes essential when load volatility rises.

$AS
$PAC
$OKE

View on X →

The power requirements alone are staggering. At 30–60 kW per AI rack, a facility with 100,000 racks would need 4 GW of theoretical peak capacity. Even more modest configurations push well beyond what conventional commercial power infrastructure can deliver, which is why operators are increasingly co-locating with nuclear facilities and building dedicated power generation.

VadymInvestor @InvestVadi36740 Wed, 04 Mar 2026 07:34:23 GMT

The physical constraints of artificial intelligence scaling are fundamentally forcing a structural reorganization of national energy grids. Regulatory frameworks are currently being expedited to allow the direct co-location of hyperscale data centers with existing nuclear facilities.
This operational shift confirms that the energy requirements for training next-generation cognitive models have surpassed the capacity of conventional commercial power infrastructure. For institutional capital, this development merges the AI infrastructure sector directly with nuclear energy assets. The ability to secure sovereign, uninterrupted baseload power is now established as the primary bottleneck for continuous computational expansion, permanently altering the valuation models for both advanced compute clusters and legacy energy providers.

View on X →

Deloitte's analysis confirms that AI data centers are fundamentally jolting power demand patterns, with data center power demand having doubled between 2018 and 2024 and potentially tripling by 2028^[15]. The build times for gigawatt-scale facilities are measured in years, not months—Epoch AI's data suggests that even with aggressive timelines, most planned gigawatt-class facilities won't be fully operational until 2027–2028^[13].

But here's the critical question: will these facilities primarily be used for pre-training the next generation of frontier models, or will they serve a different purpose entirely?

The Great Pivot: From Pre-Training to Inference

Perhaps the most consequential shift in the AI infrastructure landscape is the growing recognition that the future of AI compute is inference-heavy, not training-heavy. This isn't just a technical observation—it's reshaping hundreds of billions of dollars in investment decisions.

Gavin Baker @GavinSBaker Mon, 24 Feb 2025 20:42:58 GMT

Shifting from a pre-training centric world to an inference centric world is likely positive for compute overall. Intelligence may scale even better with test-time compute (inference) than it does with pre-training per the charts below.

The balance of compute always had to move from pre-training to inference to generate an “ROI on AI.” Just going to happen a lot faster than expected.

And while shifting to a test-time compute, “inference first” world is probably good for compute demand, this shift does change the type of compute. And this has an impact on who wins and who loses from a supplier perspective.

More 50-100 megawatt datacenters geospatially and cost-optimized for inference. More inference “Hondas.”

Fewer 1 gigawatt plus datacenters (which can be anywhere) with the networking, storage, and cooling (which enables density which simplifies networking while increasing potential cluster size) technologies necessary for coherence. Less pre-training “Ferraris.” And the number of companies doing pre-training in a “Ferrari” likely steadily shrinks over time.

Satya explained this in the clearest way possible in his most recent podcast. All the back and forth about the Cowen note vs. Microsoft IR commentary in Australia is missing the forest for the trees - the CEO literally just told you he was going to shift investments away from pre-training focused compute to inference optimized compute, which he noted was different!

Also Grok-3 voice mode is epic.

View on X →

Gavin Baker's analysis captures the emerging consensus among sophisticated infrastructure investors: the balance of compute is shifting from pre-training to inference faster than anyone expected. The implications are profound:

Fewer "Ferrari" training clusters: The number of organizations doing frontier pre-training will shrink. You might need one or two gigawatt-class clusters for training the next GPT-5 or Gemini Ultra, but you need hundreds of geographically distributed inference facilities.

More "Honda" inference centers: 50–100 MW facilities optimized for low-latency, high-throughput inference, located close to users and optimized for cost rather than raw interconnect bandwidth.

Different hardware requirements: Training requires massive GPU-to-GPU bandwidth and tight synchronization. Inference can be more distributed, can use different (often cheaper) accelerators, and benefits from different optimization strategies.

Microsoft CEO Satya Nadella has been explicit about this shift, publicly stating the company's intention to redirect investment from pre-training-focused compute to inference-optimized compute. This isn't a subtle signal—it's the CEO of the world's largest AI infrastructure investor telling you the thesis is changing.

The reason for this shift is partly technical and partly economic. On the technical side, test-time compute scaling—spending more compute during inference to improve answers—has proven remarkably effective. Models like OpenAI's o1 and o3, and DeepSeek's R1, demonstrated that you can get dramatic capability improvements by letting models "think longer" at inference time, without retraining them.

signüll @signulll Sat, 25 Jan 2025 14:55:58 GMT

we’ve now hit an inflection point where both the supply (i.e., scalable, low-cost intelligence models) & demand (interface innovation like operator-level automation) are being disrupted simultaneously at break through pace.

on the supply side, scaling intelligence cheaply breaks old paradigms (r1 proves this). this is an insane moment, bc once you show superintelligence can be scaled, every constraint on growth becomes a joke. training / inference costs, compute limitations—obliterated. hell, you can run it locally on device for zero marginal cost. it reframes what “scale” even means. cheap intelligence fundamentally undercuts industries built on scarcity: consulting, legal, education, even dev work itself. game over on the supply side.

on the demand side, the operator-style automation breakthrough is even more threatening bc it doesn’t just expand productivity—it eliminates the need for humans in the loop for vast swaths of processes. this isn’t optimization—it’s annihilation. the traditional demand chain (users → tools → workflows) implodes when the “operator” is the entire workflow.

combine these: an infinite supply of cheap or free intelligence + self-operating workflows = total industry realignments. this isn’t just “disruption” of verticals; it’s a redefinition of the horizontal stack of how work itself is structured.

if these two breakthroughs stabilize, the fundamental value prop of human labor & cognitive work across most sectors collapses. ripples are understating it—this is fucking tidal.

change to the social contract—more like what is the point of a social contract?

View on X →

On the economic side, the math is straightforward: a model that costs $100 million to train but serves billions of queries needs to generate value at the inference layer. If inference-time compute scaling can substitute for some pre-training compute, the ROI equation shifts dramatically in favor of inference investment.

Reinforcement Learning: The New Scaling Frontier

If pre-training scaling is hitting data walls and the industry is pivoting toward inference, what's actually driving capability improvements in 2025? The answer, increasingly, is reinforcement learning (RL) applied to language models—and it's opening up what appears to be an entirely new scaling dimension.

Meta's landmark paper on scaling RL compute for LLMs represents the most systematic study to date of how RL training scales^[8]. The results are striking: RL exhibits its own scaling laws, analogous to but distinct from pre-training scaling laws.

Deedy @deedydas Fri, 17 Oct 2025 03:43:50 GMT

Meta just dropped this paper that spills the secret sauce of reinforcement learning (RL) on LLMs.

It lays out an RL recipe, uses 400,000 GPU hrs and posits a scaling law for performance with more compute in RL, like the classic pretraining scaling laws.

Must read for AI nerds.

View on X →

Tanishq Mathew Abraham, Ph.D. @iScienceLuvr Thu, 16 Oct 2025 12:03:41 GMT

The Art of Scaling Reinforcement Learning Compute for LLMs

"We present the first large-scale systematic study, amounting to more than 400,000 GPU-hours, that defines a principled framework for analyzing and predicting RL scaling in LLMs."

"we propose a best-practice recipe, ScaleRL, and demonstrate its effectiveness by successfully scaling and predicting validation performance on a single RL run scaled up to 100,000 GPU-hours."

"ScaleRL is an asynchronous RL recipe that uses PipelineRL with 8 steps off-policyness, interruption-based length control for truncation, FP32 computation for logits, and optimizes the JScaleRL(θ) loss. This loss combines prompt-level loss aggregation, batch-level advantage normalization, truncated importance-sampling REINFORCE loss (CISPO) , zero-variance filtering, and no-positive resampling:"

View on X →

The ScaleRL recipe that emerged from this 400,000 GPU-hour study provides a concrete, reproducible framework for scaling RL training. The key innovations include:

Asynchronous RL (PipelineRL): Instead of the traditional synchronous approach where you generate answers, optimize, then update the policy, PipelineRL continuously generates and pushes weight updates immediately. This reaches the same performance ceiling but gets there faster by eliminating idle time.

Better loss functions: CISPO and GSPO substantially outperform DAPO (the loss function used in DeepSeek's R1) in asymptotic performance. This matters because it means the ceiling of what RL can achieve is higher with the right objective function.

Precision management: Computing the language model head operations in FP32 (full precision) rather than lower precision resolves numerical mismatches between the generator and trainer, leading to more stable training.

Smart data filtering: Excluding prompts where all answers are correct or all are incorrect (zero variance filtering) and removing prompts that become too easy (no positive resampling) improves both training efficiency and final performance.

Alex Weers @a_weers Tue, 03 Mar 2026 22:56:47 GMT

Summary of "The Art of Scaling Reinforcement Learning Compute for LLMs"

They use massive compute to run a lot of experiments, ablate many design decisions and study how the scale if you add compute.

As a model they fit a sigmoid to pass rate vs. compute, giving two key numbers: A (the performance ceiling) and B (the max slope of the sigmoid).
This is a nice deviation from the standard "one-scalar" results we see in many papers (a faster-rising but early-plateauing method can be framed as better performing with the right compute budget).

Things they ablate:
- Asynchronous RL setup: Instead of the common "PPO-off-policy" (generate answers, do some optimization steps, then update generation policy), they find that "Pipeline-RL" (continuously generate answers, push weight updates immediately) has a higher B. Both reach the same performance, but Pipeline-RL has less waiting time between steps

- Loss function: CISPO and GSPO both substantially outperform DAPO in final asymptotic performance. They go with CISPO, and its performance rises slightly faster.

- FP32 precision at the LM head: Generator and trainer kernels usually differ, leading to small numerical mismatches. MiniMax identified that computing the LM head operations in FP32 resolves this almost completely, and this paper validated this.

- Advantage normalization: Prompt-/Batch-/No Normalization has no big effect.

- Zero Variance Filtering: if all answers are correct or all are incorrect, there is no learning signal. Instead of sampling more (DAPO, which might be optimal for number of steps), they just exclude those prompts from the optimization (faster training).

- No positive resampling: If a prompt results in more than 90% correct answers, it is excluded from future epochs. Slightly slows down training, but reaches higher asymptotic performance.

Lots of insights and super valuable ablations that most of us can't run (gpu poor).

Great work and insightful contribution by @Devvrit_Khatri, @louvishh, @rish2k1, @rach_it_, @dvsaisurya, Manzil Zaheer, @inderjit_ml, @brandfonbrener, and @agarwl_!

View on X →

The sigmoid model used to characterize RL scaling is particularly informative. Unlike pre-training scaling laws, which show smooth power-law improvements, RL scaling follows an S-curve: slow initial progress, rapid improvement in the middle, and eventual saturation. The two key parameters—A (the performance ceiling) and B (the rate of improvement)—give practitioners concrete tools for predicting how much compute they need to invest in RL to reach a target performance level.

NVIDIA's ProRL v2 work further validates that RL scaling continues to deliver gains with prolonged training^[10]. The key finding is that with the right recipe, you can continue to improve model performance by simply running RL training longer—a form of scaling that doesn't require more data, just more compute applied intelligently.

The Post-Training Revolution

RL is the most dramatic post-training technique, but it's part of a broader revolution in how models are improved after initial pre-training. The state of post-training in 2025 encompasses a rich ecosystem of techniques[^6]:

Supervised Fine-Tuning (SFT) remains the workhorse of post-training. The key advance in 2025 is the quality and specificity of fine-tuning data. Rather than broad instruction-following datasets, leading labs are using carefully curated, domain-specific datasets that target specific capability gaps.

RLHF and its variants continue to evolve. The original RLHF pipeline (train a reward model, then optimize against it) is being supplemented and sometimes replaced by:

Direct Preference Optimization (DPO): Eliminates the separate reward model, directly optimizing the language model on preference data.
Constitutional AI methods: Using AI-generated feedback to scale the alignment process.
Process reward models: Rewarding intermediate reasoning steps rather than just final answers, which produces more reliable reasoning chains.

Distillation has become a critical technique. Training a smaller "student" model to mimic a larger "teacher" model transfers capabilities efficiently. DeepSeek's R1 demonstrated that distilling reasoning capabilities from a large RL-trained model into smaller models can produce remarkably capable systems at a fraction of the inference cost.

The survey of post-training scaling in large language models published at ACL 2025 provides a comprehensive taxonomy of these approaches^[9]. The key finding is that post-training techniques exhibit their own scaling laws—more compute and better data in post-training consistently improve model capabilities, often more efficiently than equivalent investment in pre-training.

Scaling Law Failures and the Limits of Prediction

It would be intellectually dishonest to present scaling laws as settled science. They're not. Recent research has highlighted significant cases where standard scaling laws fail to predict downstream task performance.

Michael Hu @michahu8 Wed, 02 Jul 2025 13:28:16 GMT

📢 today's scaling laws often don't work for predicting downstream task performance. For some pretraining setups, smooth and predictable scaling is the exception, not the rule.

a quick read about scaling law fails:
📜https://arxiv.org/abs/2507.00885

🧵1/5👇

View on X →

This is a critical caveat for anyone planning infrastructure investments based on scaling law extrapolations. The smooth, predictable curves that make scaling laws so appealing for planning purposes may be the exception rather than the rule when it comes to the tasks we actually care about. A model that achieves a lower perplexity (the standard metric scaling laws predict) doesn't necessarily perform better on coding, reasoning, or real-world tasks.

Clever Hans @Der_KlugeHans Sat, 28 Feb 2026 05:07:37 GMT

(1) Scaling pretraining compute Chinchilla-optimally leads to more parameters, which is generally bad for inference scaling (and RL) and these may be the more important drivers currently (these methods lie mostly outside the exact claims in our paper and Epoch's post).

View on X →

The tension between Chinchilla-optimal pre-training and inference/RL scaling is particularly important. Scaling pre-training compute Chinchilla-optimally leads to larger models, which are worse for inference scaling and RL—the very techniques that are driving the most capability improvements in 2025. This creates a genuine strategic dilemma: do you optimize for pre-training efficiency or for downstream post-training and inference potential?

Recent theoretical work is attempting to explain scaling laws from first principles rather than treating them as purely empirical:

Yizhou Liu @YizhouLiu0 Sat, 28 Feb 2026 22:25:40 GMT

💡Neural Scaling Laws Trilogy: Superposition yields 1/width law, averaging yields 1/depth law, and low-entropy universality yields 1/3-time law. At the optimal shape, Chinchilla scaling laws can be explained. Improvements in scaling are hypothesized.
👉https://liuyz0.github.io/blog/2026/NSLT/

View on X →

If these theoretical frameworks hold up, they could provide more reliable predictions about when and how scaling laws will break down—information that would be worth billions to infrastructure planners.

Karpathy's Empirical Approach: Scaling Laws from $100 Experiments

One of the most illuminating contributions to the scaling laws debate in 2025 has been Andrej Karpathy's empirical work training small models to derive scaling relationships from first principles.

Aakash Gupta @aakashgupta Wed, 07 Jan 2026 23:07:03 GMT

I’ve been saying for years that the best LLM education comes from people who can compress complexity into simplicity.

Karpathy just wrote the most elegant explainer of scaling laws I’ve seen.

Let me walk you through why this matters.

The core insight is WILD.

When you double your compute budget, you shouldn’t train a 2x bigger model. You should train a 1.4x bigger model on 1.4x more data.

The math checks out. 1.4 × 1.4 = 2.

But here’s where Karpathy found something that contradicts Chinchilla.

Chinchilla from DeepMind in 2022 said the optimal ratio of tokens to parameters is 20:1. Train a 70B model on 1.4T tokens.

Karpathy’s nanochat experiments found 8:1.

Eight.

That’s a 2.5x difference in how much data you need per parameter.

What could explain this? My read is the Muon optimizer. Or maybe smaller models just prefer to be thicc. Either way, the scaling curves don’t lie.

And the curves here are beautiful.

He trained 11 models from d10 to d20 in 4 hours for $100.

When your architecture and optimization are properly arranged, these curves never intersect. Each depth represents the unique compute-optimal path to a target loss.

The extrapolations get interesting.

To match GPT-2 XL performance (CORE score 0.257), you’d need a d38 model with 2.2B parameters trained on 17.8B tokens. Cost? $546 on 8xH100s.

To match GPT-3 6.7B? That’s d95, 19.3B parameters, 154.6B tokens. 78 days of compute. $45K.

To match full GPT-3 175B? The fit predicts d181, 91.8B parameters, 734B tokens. 1,869 days. A million dollars.

We’re doing a lot of extrapolation here. But the sanity check works. The predicted FLOPs for GPT-3 level is 5.7e23. OpenAI actually used around 3e23.

The reason this is a CORE metric instead of just validation loss is telling.

Karpathy specifically avoided using val loss for comparsions. He called out modded nanogpt for gaming the metric by using batch size 1 with long sequences. That stretches validation batches into one long row which artificially inflates scores.

This is the kind of methodological rigor that makes research trustworthy.

The miniseries v1 is just the foundation.

Goal for v2 is to lift and tilt that line. More bang per buck.

This is what open source AI education looks like when it’s done right.

View on X →

The finding that the optimal token-to-parameter ratio might be 8:1 rather than Chinchilla's 20:1 is potentially significant. If confirmed at larger scales, it would mean models need substantially less data per parameter than previously thought—partially alleviating the data wall problem. The suspected explanation (the Muon optimizer, or different scaling behavior at smaller model sizes) highlights how much uncertainty remains in our understanding of these relationships.

What makes this work valuable for practitioners is its reproducibility and cost. Training 11 models in 4 hours for $100 and deriving scaling relationships that extrapolate reasonably to GPT-2 and GPT-3 scale demonstrates that scaling law research doesn't require gigawatt facilities—it requires careful experimental design.

The Infrastructure Engineering Challenge

Even if the scaling laws hold perfectly, translating theoretical compute requirements into actual training runs at gigawatt scale is an engineering challenge of unprecedented complexity.

Rohan Paul @rohanpaul_ai Fri, 12 Sep 2025 07:12:55 GMT

New MIT paper shows a very simple yet effective trick for scaling LLM pretraining across hundreds of GPUs

Shows that when training across 128 nodes (256 GPUs), the biggest bottleneck is not the network bandwidth for gradient exchange, as many assumed, but rather the way data is prepared and fed to GPUs.

y demonstrate that 2 tricks — pre-tokenizing and shrinking a 2TB dataset down to 25GB a 99% cut.

And copying that processed dataset to each node instead of streaming over the network — were enough to fully saturate GPUs and achieve near-linear scaling.

They train a BERT style encoder on binaries with masking, growing from 1 to 256 GPUs across 128 nodes.

Many of us really did not know that such simple “duplicate and localize” approach works at such scale.

They keep only token IDs and masks on disk, then copy that small set to every node, which removes network hot spots.

Data loading is parallel, but only enough to keep 1 GPU near 100% usage, extra workers add delay.

With data parallel training, speed rises almost linearly to 128 nodes, so the network is not the choke point.

As models grow, memory fills, batches shrink, for example 184 down to 20, which lowers efficiency and pushes toward model parallelism.

----

Paper – arxiv. org/abs/2509.05258

Paper Title: "Scaling Performance of LLM Pretraining"

---

View on X →

MIT's finding that the biggest bottleneck in multi-node training is data preparation, not network bandwidth, is counterintuitive and practically important. The solution—pre-tokenizing data and copying compressed datasets to each node—is almost embarrassingly simple, yet it enables near-linear scaling to 256 GPUs across 128 nodes. This is the kind of practical insight that separates successful large-scale training runs from failed ones.

At gigawatt scale, these engineering challenges multiply. Epoch AI's analysis of data center build times suggests that the physical construction of these facilities is itself a bottleneck^[13]. Even with unlimited capital, you can't instantiate a gigawatt data center overnight. The supply chains for high-bandwidth networking equipment, advanced cooling systems, and power delivery infrastructure are all constrained.

Jarren Feldman @jarrenfeldman Thu, 26 Feb 2026 12:21:10 GMT

A one year national moratorium on new or expanded AI data centers in the US would impose severe economic and strategic costs.

In 2025, hyperscalers committed $340-370 billion in capex, much of it for AI infrastructure. This spending drove 1.1% of H1 GDP growth and up to one-fifth of Q2 expansion in some analyses.

Construction alone hit record rates, supporting jobs, supply chains, and tax revenue exceeding $160 billion sector-wide.

Halting projects would stall $200 billion plus in planned builds, trigger immediate layoffs in construction and equipment sectors, strand investments, and cut GDP growth by an estimated 0.5-2% for the year.

Tech stocks would likely fall sharply, lowering 401K balances across the nation.

Strategically, the US holds over 40% of global data center capacity. A pause cedes ground to China and Gulf states, which face fewer restrictions and continue aggressive builds.

US AI model scaling for training and inference would slow, delaying advances in defense, drug discovery, autonomous systems, and productivity tools projected to add trillions long-term value.

Existing local moratoriums have already blocked or delayed $64 billion in projects due to grid, water, and power concerns. A national version would multiply this, forcing firms to shift overseas and eroding sovereign AI leadership.

Grid relief would prove temporary, as catch-up builds later inflate costs and prices. The net result: lost momentum, exported innovation, and weakened competitiveness with no offsetting gains in a fast-moving global race.

Instead of the anti build rhetoric, we should marshal all resources to improve our energy infrastructure and cut regulations so it’s easier to build new energy.

View on X →

The economic stakes of getting this infrastructure buildout right (or wrong) are enormous. Data center construction is driving measurable GDP growth, and the strategic implications of who builds these facilities—and where—will shape the geopolitical landscape of AI for decades.

The Bear Case: Are We Building Cathedrals to a False God?

Not everyone is convinced that massive infrastructure investment will pay off. The skeptical case deserves serious engagement.

Hedgie @HedgieMarkets Wed, 15 Oct 2025 18:42:00 GMT

🦔 AI companies are burning billions with no path to profitability while Nvidia hits $4.5 trillion in market cap selling shovels to miners who can't find gold. The economics are "brutal" across the industry, from startups to tech giants, with revenues barely registering on balance sheets despite massive infrastructure spending.

The Technical Economics Are Backwards
Companies are using reinforcement learning to fight hallucinations, which drives resource needs up rather than down. Meanwhile, demand is shifting toward even more expensive text-to-video models like Sora 2, creating a cost spiral where solving core problems requires more compute, not less. When your technical improvements make the business model worse, something is fundamentally broken.

The Scale Delusion
Proponents argue that massive data centers will eventually drive down costs and that AI models will teach themselves to improve, but hallucinations continue plaguing even frontier models on simple queries. MIT research shows only 5% of businesses achieve "rapid revenue acceleration" with AI, while data centers age rapidly and require constant hardware upgrades due to obsolete equipment.

The Survival Math
Venture capitalist Vinod Khosla expects only 3% of AI investments to generate 60% of returns, compared to the typical 6% in venture capital. Analysts predict only a handful of companies will survive an AI bust, similar to the dot-com aftermath, but with worse odds than normal startup investing.

My Take
This confirms the impossible unit economics I've been tracking. When scaling up makes the business model worse rather than better, and when technical improvements require exponentially more resources, you're not building a sustainable industry. The comparison to dot-com is apt, but the current bubble has worse fundamentals because at least internet companies had clear paths to reducing marginal costs through scale.

Hedgie🤗

View on X →

The core of the bear argument is that AI's unit economics are fundamentally broken: solving technical problems (like hallucinations) requires more compute, not less, creating a cost spiral rather than the cost deflation that characterized previous technology waves. If scaling up makes the business model worse rather than better, the entire infrastructure thesis collapses.

There's genuine substance to parts of this critique. Hallucinations remain a persistent problem even in frontier models. The MIT finding that only 5% of businesses achieve "rapid revenue acceleration" with AI is sobering^[1]. And the comparison to the dot-com bubble—where the underlying technology was transformative but the business models were mostly wrong—is historically apt.

However, the bear case has significant weaknesses:

It conflates current revenue with future value. The internet in 1999 also had terrible unit economics. The infrastructure built during the bubble became the foundation for the modern digital economy.

It underestimates inference-time scaling. The shift to test-time compute means that models can be made dramatically more capable without retraining, changing the cost-per-useful-output equation.

It ignores the software efficiency gains. If training compute needs are indeed declining by ~10× per year due to algorithmic improvements, the cost of achieving a given capability level is falling rapidly even as absolute spending increases.

It assumes static demand. As models become more capable and cheaper to run, new use cases emerge that weren't previously viable. The demand curve for intelligence is not fixed.

Forbes' AI predictions for 2025 capture the tension: the industry is simultaneously experiencing unprecedented investment and genuine uncertainty about returns^[1]. The resolution likely lies in the middle—not all investment will pay off, but the underlying capability improvements are real and will generate enormous value for the companies and applications that find product-market fit.

What Methods Will Continue to Drive Progress

Synthesizing the expert consensus and the empirical evidence, here's where the continued positive impact is most likely to come from:

Pre-training innovations that will matter:

Better data curation and synthetic data generation (the 10× efficiency gains are primarily here)
New optimizers (Muon and its successors) that change the compute-optimal frontier
Multi-modal pre-training that expands the effective data pool
Stronger regularization techniques for training beyond the data wall
Architecture improvements (though the Transformer remains dominant, modifications like mixture-of-experts continue to improve efficiency)

Post-training techniques with continuing impact:

Reinforcement learning with verified rewards (math, code, and other domains with checkable answers)
Process reward models that improve reasoning reliability
Distillation from larger to smaller models
Domain-specific fine-tuning with high-quality data
Constitutional AI and scalable oversight methods

Inference-time compute scaling:

Chain-of-thought and extended reasoning (the o1/o3/R1 paradigm)
Search and planning at inference time
Adaptive compute allocation (spending more time on harder problems)
Speculative decoding and other efficiency improvements

Infrastructure and systems innovations:

Better parallelism strategies for multi-node training
Improved data loading and preprocessing pipelines
Mixed-precision training with careful numerical management
Asynchronous RL training pipelines (PipelineRL and successors)

The state of LLMs in 2025 reflects a field that has matured past the "just make it bigger" phase into a more nuanced understanding of how different forms of scaling interact^[7]. The most capable models will be those that combine efficient pre-training, sophisticated post-training, and intelligent inference-time compute allocation.

Conclusion

The gigawatt question doesn't have a simple answer because it's actually several questions bundled together. Will massive new training centers produce better models? Yes—but the relationship between compute and capability is no longer the simple power law it appeared to be in 2022. The returns from pure pre-training scale are diminishing as data becomes the binding constraint, and the most dramatic capability improvements are coming from post-training techniques, particularly reinforcement learning, and from inference-time compute scaling.

The infrastructure being built today will be valuable, but perhaps not in the way originally envisioned. The gigawatt "Ferrari" clusters will serve a shrinking number of frontier pre-training runs, while the bulk of AI compute demand will shift to distributed inference facilities. The companies that recognize this shift early—as Microsoft appears to be doing—will be better positioned than those building exclusively for a pre-training-centric world.

For practitioners, the implications are concrete: the models you'll be using in 2026–2027 will be substantially better than today's, but the improvements will come from a combination of sources. Better pre-training data and algorithms will contribute perhaps 3–10× efficiency gains per year. Reinforcement learning will unlock new reasoning capabilities with its own scaling laws. And inference-time compute will allow you to trade cost for quality in a continuous, controllable way.

The scaling laws aren't dead—they're multiplying. Instead of one dimension of scaling (pre-training compute), we now have at least three (pre-training, post-training/RL, and inference-time compute), each with its own empirical regularities and diminishing returns curves. The art of building frontier AI systems in 2025 and beyond lies in understanding how these dimensions interact and where to invest the next marginal dollar of compute.

The gigawatt data centers will get built. The question is whether the intelligence they produce justifies the investment. Based on the evidence—the continuing algorithmic efficiency gains, the new RL scaling laws, the inference-time compute breakthroughs—the answer is cautiously yes. But the path runs through algorithmic ingenuity, not just raw power. The future of AI is not just about how much electricity you can pour into GPUs. It's about how cleverly you use every watt.

Sources

^[1] 10 AI Predictions For 2025 — https://www.forbes.com/sites/robtoews/2024/12/22/10-ai-predictions-for-2025

^[2] Can AI scaling continue through 2030? — https://epoch.ai/blog/can-ai-scaling-continue-through-2030

^[3] How to build AI scaling laws for efficient LLM training and budget maximization — https://news.mit.edu/2025/how-build-ai-scaling-laws-efficient-llm-training-budget-maximization-0916

^[4] AI Trends and Predictions 2025 From Industry Insiders — https://www.itprotoday.com/ai-machine-learning/ai-trends-and-predictions-2025-from-industry-insiders

^[5] Will Scaling Laws Hold? 2025 and the Future of AI — https://higes.substack.com/p/will-scaling-laws-hold-2025-and-the

^[6] The state of post-training in 2025 — https://www.interconnects.ai/p/the-state-of-post-training-2025

^[7] The State Of LLMs 2025: Progress, Problems, and Predictions — https://magazine.sebastianraschka.com/p/state-of-llms-2025

^[8] The Art of Scaling Reinforcement Learning Compute for LLMs — https://arxiv.org/abs/2510.13786

^[9] A Survey of Post-Training Scaling in Large Language Models — https://aclanthology.org/2025.acl-long.140/

^[10] Scaling LLM Reinforcement Learning with Prolonged Training Using ProRL v2 — https://developer.nvidia.com/blog/scaling-llm-reinforcement-learning-with-prolonged-training-using-prorl-v2

^[11] Gearing Up for the Gigawatt Data Center Age — https://blogs.nvidia.com/blog/networking-matters-more-than-ever

^[12] OpenAI and NVIDIA announce strategic partnership to deploy 10 gigawatts of AI datacenters — https://openai.com/index/openai-nvidia-systems-partnership

^[13] Build times for gigawatt-scale data centers — https://epoch.ai/data-insights/data-centers-buildout-speeds

^[14] The cost of compute: A $7 trillion race to scale data centers — https://www.mckinsey.com/industries/technology-media-and-telecommunications/our-insights/the-cost-of-compute-a-7-trillion-dollar-race-to-scale-data-centers

^[15] AI data centers jolt power demand — https://action.deloitte.com/insight/4718/ai-data-centers-jolt-power-demand

Introduction

Overview

The Original Scaling Laws: What Chinchilla Told Us (and What It Didn't)

The Data Wall: Pre-Training's Most Fundamental Constraint

What Gigawatt-Scale Training Centers Will Actually Deliver

The Great Pivot: From Pre-Training to Inference

Reinforcement Learning: The New Scaling Frontier

The Post-Training Revolution

Scaling Law Failures and the Limits of Prediction

Karpathy's Empirical Approach: Scaling Laws from $100 Experiments

The Infrastructure Engineering Challenge

The Bear Case: Are We Building Cathedrals to a False God?

What Methods Will Continue to Drive Progress

Conclusion

Sources

References (15 sources)

Related Guides

Netlify vs Neon: Which Is Best for Rapid Prototyping in 2026?

Meta Llama vs Groq vs Cohere: Which Is Best for Code Review and Debugging in 2026?

What Is PlanetScale? A Complete Guide for 2026

Sprout Social vs Ghost vs Mailchimp: Which Is Best for Customer Support Automation in 2026?

Midjourney vs Adobe Express: Which Is Best for Developer Productivity in 2026?