The Transistor Cliff

Sarah Constantin

Moore’s law may be coming to an end. What happens to AI progress if it does?

How Hardware Affects AI Progress

The biggest AI models are trained on expensive, state-of-the-art microchips, or semiconductors. Only a few organizations, such as Google and OpenAI, have the budgets to train them. For years, improvements in AI performance have been driven by progress in this underlying hardware.

For most of the history of semiconductor manufacturing, steadily and predictably accelerating improvements in performance and reductions in price have been the norm. This pattern has been codified as “Moore’s Law,” Intel co-founder Gordon Moore’s observation that the number of transistors that could be placed on a chip for the same price doubled approximately every two years. That may be coming to an end. Depending on the specific semiconductor performance metric, Moore’s Law has either stalled out already, or is on course to soon hit fundamental physical limits.

So, what could happen “after Moore’s Law?” And how would that affect AI performance?

Let’s zoom in and look at the details.

What Does Scaling Mean?

Scaling laws in AI generally relate the performance of a model to its inputs: training data, model parameters, and compute.

The performance of a model describes its accuracy in choosing the “right” answer on known data. A large language model is trained to predict text completions; the more often it correctly predicts how to complete a text, the better its performance. A close antonym of performance is “loss,” which is a measure of how far off a model’s predictions were from reality; lower loss means better performance.

Training data refers to the size of the dataset a model is trained on. The number of parameters in the model is a measure of its complexity: roughly, the number of adjustable weights on the connections between nodes in the neural network.

Finally, the amount of “compute” used for a model, measured in floating point operations, or flops,¹ is simply the number of computer operations (typically matrix multiplications) that must be performed throughout the model’s training. Compute is therefore influenced by both the amount of data and number of parameters.

The scaling relationship between loss and compute found by OpenAI in 2020 is a power law. If a model has 10 times the compute, its loss will be about 11% lower. This tells us how much “better” models can get from “scaling compute” alone. It’s difficult to say exactly what “11% lower loss” means in terms of how powerful or accurate a model is, but we can use existing models for context.
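To make the 11% figure concrete, here is a minimal sketch of a power law of the form loss ∝ compute^(-α). The exponent α = 0.05 is an assumption, chosen because it reproduces the roughly 11% drop per tenfold increase in compute quoted above; the function name is illustrative only.

```python
def loss_ratio(compute_multiplier, alpha=0.05):
    """Ratio of new loss to old loss when compute is multiplied by the given factor.

    alpha is an assumed power-law exponent, not a value taken from this article.
    """
    return compute_multiplier ** (-alpha)


for factor in (10, 100, 1000):
    reduction = 1 - loss_ratio(factor)
    print(f"{factor:>5}x compute -> loss about {reduction:.0%} lower")
```

Note the diminishing returns: under this assumed exponent, each additional tenfold increase in compute buys a slightly smaller absolute drop in loss than the one before it.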

GPT-2, which OpenAI released in 2019, was trained on 300 million tokens of text data and had 1.5 billion parameters. GPT-3 — the model behind ChatGPT — was trained on 300 billion to 400 billion tokens of text data and had 175 billion parameters. The details of their newest model, GPT-4, have not been made public, but outside estimates of its size range from 400 billion to 1 trillion parameters and around 8 trillion tokens of training data.

In other words, training GPT-3 took about 200,000 times as much compute as GPT-2, and GPT-4 probably took between 60 and 150 times more than GPT-3. In practical terms, GPT-2 could produce coherent sentences, but its output tended to degenerate into repetitive noise after about a paragraph. The much larger GPT-3 can reliably generate on-topic, sensible completions. GPT-4’s performance, on everything from programming problems to the bar exam, is even more impressive.
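One rough way to sanity-check ratios like these is the common rule of thumb that training compute is about 6 × parameters × tokens. That rule is an assumption introduced here, not something the estimates above are stated to rely on, and the function name is hypothetical. The parameter and token counts are the ones quoted in the previous paragraph.

```python
def training_flops(parameters, tokens):
    """Approximate training compute as 6 * N * D (an assumed rule of thumb)."""
    return 6 * parameters * tokens


gpt2 = training_flops(1.5e9, 300e6)
gpt3_low = training_flops(175e9, 300e9)
gpt3_high = training_flops(175e9, 400e9)
gpt4_low = training_flops(400e9, 8e12)   # low end of outside estimates
gpt4_high = training_flops(1e12, 8e12)   # high end of outside estimates

print(f"GPT-3 / GPT-2: {gpt3_low / gpt2:,.0f}x to {gpt3_high / gpt2:,.0f}x")
print(f"GPT-4 / GPT-3: {gpt4_low / gpt3_low:,.0f}x to {gpt4_high / gpt3_low:,.0f}x")
```

Under this approximation the GPT-2 to GPT-3 jump lands between roughly 100,000 and 200,000 times, and the GPT-3 to GPT-4 jump between roughly 60 and 150 times, which is the same ballpark as the figures above.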

Looking at a longer time horizon, Epoch AI estimates that the compute used for training the state-of-the-art machine learning models has increased by about eight orders of magnitude (that is, 100 million times over) between 2012 and 2023.

If the largest AI models continue to grow at their current pace through the end of this decade, that would be the equivalent of three orders of magnitude of compute growth. That’s more than the compute growth between GPT-3 and GPT-4, though less than the compute growth between GPT-2 and GPT-3. As extremely large models have become more compute-intensive, the pace of their growth seems to have slowed.

It’s still possible that the compute devoted to AI models will grow faster than the current trend. Perhaps AI will attract greater investment and resources as the first LLM-driven products are released and become widely popular. But there are some reasons to expect that we may run into fundamental limits to how much compute can go into LLMs by the end of this decade.

Moore’s Law in Relation to AI Progress

In 1965, Gordon Moore posited that the number of transistors in an integrated circuit at the lowest price per transistor doubles about every two years. At least with respect to the number of transistors per chip, this has held true.

Source: Our World in Data

But Moore’s Law looks stagnant if we include Moore’s original criterion of price. The cost per transistor stopped decreasing in 2011 with the 28 nanometer (nm) node (today’s state-of-the-art transistors use 3 nm, with 2 nm likely to be released next year). Since then, transistor costs have increased, rising to $2.16 for the latest 3 nm nodes, a level not seen since around 2005.

Source: International Business Strategies, Inc.

Where money is no object, the transistor density of the best available computer hardware is currently still growing at an exponential rate; but if price matters, transistor density at the best available price stagnated more than a decade ago.

What about state-of-the-art performance? The quantity we care about, for the purposes of predicting AI progress, is the top speed of the hardware, measured by the number of operations it can carry out per second: peak flops.

The “compute budget” of an AI model is given by

c = training time × number of cores × peak flops × utilization rate

In other words, compute (and hence performance) scales with the amount of time devoted to training a model, the number of computers (these days, largely GPUs²) performing computations in parallel, the speed of the GPU when it’s running, and the utilization rate, i.e., the percentage of the time the GPU is actually executing tasks while the model is training.
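As a quick illustration of how the four factors multiply, here is a minimal sketch of the compute budget. The training run is hypothetical: the GPU count, duration, and utilization are made-up inputs, and the 312 teraflops peak is the A100 figure from the first footnote. Peak flops is treated here as operations per second, so time has to be converted to seconds.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def compute_budget(training_days, num_gpus, peak_flops_per_gpu, utilization):
    """Total floating point operations: time x devices x peak speed x utilization."""
    seconds = training_days * SECONDS_PER_DAY
    return seconds * num_gpus * peak_flops_per_gpu * utilization


# Hypothetical run: 1,000 GPUs at 312 teraflops peak, 30 days, 50% utilization.
c = compute_budget(training_days=30, num_gpus=1_000,
                   peak_flops_per_gpu=312e12, utilization=0.5)
print(f"{c:.2e} floating point operations")
```

Because the factors multiply, halving utilization costs exactly as much compute as halving the number of GPUs, which is why idle time matters so much.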

“Wait a minute, why would the GPU be idle?”

Because training an AI model involves more than just multiplying numbers. Critically, it also involves calling memory and communicating between different processors. Even the most efficient models on today’s hardware spend 40% of training time making calls to memory. Empirically, utilization rates seem to be 30-75% at best. Utilization rates also decline with the number of GPU processors used in parallel, since the more processors you use, the more time you’ll have to “waste” sending data between them.

Training time probably can’t scale up much from here; the largest language models are already spending months training, and firms may not find it profitable to spend years training a single model. So, if you assume that OpenAI, DeepMind, Meta, and the other big AI players are time constrained at current margins and not cost constrained, then the growth of compute spent on LLMs should scale with peak flops and the number of cores.

GPU flops have doubled roughly every two years.

The primary drivers of this trend in improved GPU performance are smaller transistors and increased numbers of cores.³ As Moore’s Law makes transistors smaller, each GPU contains more of them, and so each GPU can compute more operations per second.

But there are inherent physical limits to how small you can make a transistor.

Fundamental Physical Limits to Transistor Size

One limit has to do with thermodynamics. As transistors become smaller, it takes less and less energy to flip the gates which control the currents from “on” to “off” and back. Once this “switching energy” drops to the same scale as the energy fluctuations produced by the random molecular movements we call heat, then the transistor will turn on and off at random.

So what is this thermodynamic minimum gate length? One paper from 2015 estimates it to be in the 4-5 nm range, a limit we will likely reach by 2030.⁴ But the thermodynamic minimum gate length depends on the material being used. Moving beyond silicon to newer semiconductor materials which can hold more electrical energy will allow smaller transistors to stay “switched off” despite thermodynamic fluctuations.
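To get a feel for the scale of thermal noise, here is a rough sketch that compares a switching energy to the characteristic thermal energy kT, using a simple Boltzmann factor exp(-E / kT). This is a textbook simplification introduced here, not the model used in the paper mentioned above, and the barrier energies are hypothetical values chosen only to show the trend.

```python
import math

BOLTZMANN_J_PER_K = 1.380649e-23  # Boltzmann constant, exact SI value
JOULES_PER_EV = 1.602176634e-19   # one electronvolt in joules, exact SI value


def thermal_energy_ev(temperature_k=300.0):
    """Characteristic thermal energy kT, expressed in electronvolts."""
    return BOLTZMANN_J_PER_K * temperature_k / JOULES_PER_EV


def random_flip_factor(barrier_ev, temperature_k=300.0):
    """Boltzmann factor exp(-E / kT) for a fluctuation that crosses the barrier."""
    return math.exp(-barrier_ev / thermal_energy_ev(temperature_k))


print(f"kT at room temperature: {thermal_energy_ev():.4f} eV")
for barrier in (1.0, 0.5, 0.1, 0.026):  # hypothetical switching energies in eV
    print(f"barrier {barrier:>5} eV -> flip factor {random_flip_factor(barrier):.2e}")
```

The factor stays negligible while the barrier is many multiples of kT and climbs steeply as the two become comparable, which is the sense in which a material that holds more energy per switch buys more room to shrink.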

Researchers using novel materials like this may be able to produce much smaller transistors, at least in the lab. A team at Tsinghua University in China claims to have fabricated a 0.34 nm transistor made of graphene and molybdenum disulfide. But it’s a long way from fabricating one transistor in the lab to mass-producing hundreds of billions on a chip, and not all materials are amenable to mass production.

Another physical limit to making transistors smaller has to do with light resolution. Currently, circuits are etched into semiconductors in a method known as photolithography. Ultraviolet light is projected through a “mask” to hit a semiconductor wafer in precise geometric patterns, where it reacts with photoactive materials. Then, strong solvents are used to etch away everything the light didn’t touch, leaving a raised pattern that forms one layer of a circuit. But light is a wave, and it’s impossible to resolve features smaller than about half the wavelength of the light, which puts the limit in the range of tens of nanometers.

This is why semiconductor manufacturers have been using higher and higher frequency light at enormous cost, among other tricks. But even so, you simply can’t etch things that are that much smaller than the light wavelength.
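The wavelength limit is usually expressed with the Rayleigh criterion: minimum feature size ≈ k1 × wavelength / numerical aperture. That formula is a standard optics result brought in here for illustration, not something stated in this article. The wavelengths and apertures below are illustrative values in the range of deep-ultraviolet immersion and extreme-ultraviolet systems, and k1 = 0.25 is the theoretical floor for a single exposure.

```python
def min_feature_nm(wavelength_nm, numerical_aperture, k1=0.25):
    """Rayleigh criterion: smallest printable half-pitch = k1 * wavelength / NA."""
    return k1 * wavelength_nm / numerical_aperture


# Illustrative optics, not measurements from any specific machine.
duv = min_feature_nm(wavelength_nm=193.0, numerical_aperture=1.35)
euv = min_feature_nm(wavelength_nm=13.5, numerical_aperture=0.33)
print(f"deep ultraviolet:    about {duv:.0f} nm")
print(f"extreme ultraviolet: about {euv:.0f} nm")
```

Shortening the wavelength is the only term in this sketch with much room left to move, which is why each step to higher-frequency light has been worth its enormous cost.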

Theoretically, it would be possible to use higher-frequency radiation like x-rays. But apart from the extreme cost and the need to develop new technologies and materials, x-rays are ionizing radiation — they interact with everything they touch, scattering electrons and “blurring” the resolution of the image. The smallest x-ray lithography feature sizes produced to date are actually, at 30 nm, larger than what ultraviolet photolithography at its best can provide. Paolo Gargini, the chair of the IEEE International Roadmap for Devices and Systems, the semiconductor industry’s organization for predicting and planning progress in chip manufacturing, predicts that we’ll reach the limits of photolithography around 2029.

For two independent reasons, it seems we have less than a decade left of shrinking transistor sizes.

Beyond Moore’s Law: Alternative Paths

There are several paths to achieve more flops without making transistors smaller.

The first path is to redesign chips. One option is to manufacture 3D chips, where transistors are stacked vertically. 3D stacked complementary metal-oxide semiconductors (a type of semiconductor that uses two types of transistors to allow for efficient switching between “on” and “off” states) can double transistor density. Intel recently announced progress on novel materials which promise a 10x improvement in transistor density by 2030. And two-layer CPU chips, which can improve transistor density by 40%, were released in 2021.

These developments build on 3D memory designs, first commercialized a decade ago. Each release has come with 30-50% more layers. And in principle, layers can be stacked arbitrarily high, allowing transistor density to scale linearly without shrinking the individual transistors. Samsung, for instance, predicts that they’ll reach 1,000 layers by 2030.

The second path is to design special-purpose chips. The best possible flops for a given application may not be achievable on a general purpose computing device, but rather on a special chip architecture designed for the application. There are two main options here.

Application-specific integrated circuits, or ASICs, are rigidly special-purpose, designed for exactly one type of computation. ASICs are widely used in telecommunications equipment, modern vehicles, and medical devices. Field-programmable gate arrays, or FPGAs, are a notch more flexible, allowing users to configure their own logic circuits. AI-specialized “accelerator” chips, which are optimized for training neural networks, are usually built as ASICs. AI accelerators can produce more flops than GPUs at the same or lower transistor density.

Google’s TPU (Tensor Processing Unit), for example, is a custom type of ASIC designed specifically for accelerating machine learning tasks. Third generation TPUs have two to four times the flops of the widely used Nvidia V100 GPUs, despite the GPUs being fabricated on a 12 nm node and the TPUs only being fabricated on a 16 nm node. But despite a zoo of emerging competitors developing special-purpose architectures, as of 2022 Nvidia’s latest generation of GPUs, the H100s, are still the leaders on a standard MLPerf benchmark test.

So, while in principle special-purpose AI chips could get more flops with the same number of transistors, and while they are often cheaper on specific training tasks, they haven’t yet come out firmly ahead of standard GPU architectures at maximum processing speed.

The third path involves replacing transistors with other kinds of switches. There’s no physical law that says computation has to be done with transistors. Alternative models for computation, which include optical computing and memristors, could be faster and scale better, but most of these are still in their infancy.

Optical computing uses light instead of electrons for computation, which consumes less energy and generates less heat. Moreover, photons are about 20 times faster than electrons. One experimental optical switch, developed by IBM researchers, can switch 1,000 times faster than conventional transistors. A more recent result from the University of Arizona found switching speeds for an optical device that are a million times faster than transistors.

But while optical switches may be fast, they can’t be dense; to transmit light, optical waveguides can’t be much narrower than the light’s wavelength, which in this case is hundreds of nanometers. (By contrast, conventional semiconductor transistors can have feature sizes in the tens of nanometers.) So all-optical computing devices remain speculative.

Memristors are another alternative. A memristor is an electronic component whose resistance depends on the accumulated electric charge that has passed through it, in contrast to a transistor, whose conductivity depends on the electric field applied at that moment. It’s possible memristors could scale to be smaller than transistors; the smallest ones produced in the lab are about 1 nm. But like optical computing, the technology remains unproven.

Putting together these three classes of beyond-Moore’s-law innovation, we’re looking at:

Advanced packaging and 3D designs: at least 10 times transistor density improvement by 2030, possibly continued 10 times growth into the 2030s as more layers are stacked

Special-purpose computers: up to 10 times flops speedup depending on whether greater architecture optimizations are possible

Non-transistor computation paradigms: very uncertain, and might not happen at all, but could theoretically improve flops by five to 1000 times.

A moderately likely scenario, for instance, might be “Moore’s Law holds until 2040 with 3D architectures, special-purpose AI accelerators don’t provide any flops improvement, transistors remain the main building block of computation throughout the 2030s.” In such a scenario, the peak flops achieved by a GPU might grow from nearly a petaflop in 2022 to hundreds of exaflops by 2040, or a five-order-of-magnitude increase over nearly two decades.

In the more pessimistic scenario where flops stop growing altogether by 2030, we’ll only see a two-order-of-magnitude increase in peak computation speed by 2030, and no more between 2030 and 2040.

This means that, for the current rate of compute growth of the largest AI models to continue through 2030 (resulting in models three orders of magnitude more compute-intensive), state-of-the-art models would need to use significantly more computer chips and cost far more than they do today. In this case, a jump in model scale comparable to that between GPT-3 and GPT-4 would take until the end of the decade, depending on how easy it is to acquire and train across a vastly increased number of chips.

The Memory Wall

Even if GPUs didn’t improve their peak flops much (or at all), couldn’t an AI company just buy lots and lots of them and run them in parallel, and see linear improvements in “compute?”

No, because training time is an issue. The biggest models today take six months to train, and a significant portion of that time is spent writing model weights to memory and retrieving them again.

Memory sits on a separate device from the chip that does the computation. Typically this is DRAM.⁵ Using clever optimizations you can design the training algorithms to minimize the number of calls to memory, but ultimately these are one-time improvements that don’t scale as models get bigger. The more data points a model is trained on, the more calls to memory there must be to update the weights.⁶ The best utilization rates observed with Nvidia’s A100 GPUs are about 60%.

So, is memory bandwidth improving over time at the same rate as compute? Not so much.

Source: Amir Gholami

Peak DRAM bandwidth has been increasing far slower than flops, at 30 times in the last 20 years (flops have increased 90,000 times in the same period). Today, a model that takes six months to train and has a 60% GPU utilization rate will spend about 2.5 months just transferring data to and from memory.

Let’s say we never want that time to more than double. And let’s say that DRAM bandwidth continues to grow at its current rate. In that case, it doesn’t matter how much compute we have. In order for the training run to be able to use all of its available compute without memory call times ballooning, compute cannot grow by more than 3 times by 2030 and 17 times by 2040. This is a much more conservative bound on AI compute growth than either what Moore’s Law for flops suggests (128 times by 2030) or what recent AI compute trends suggest (631 times by 2030). In a world where memory bandwidth (and time) is the limiting factor, we don’t even get one more order of magnitude of scaling growth in AI compute this decade.
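The arithmetic behind a bound like this can be sketched by annualizing the historical DRAM trend (30 times in 20 years) and compounding it forward. This is a simplified reconstruction, not necessarily the exact calculation behind the numbers above: the base year of 2023 and the function names are assumptions, so the outputs only approximately match.

```python
def annual_growth(total_factor, years):
    """Constant yearly multiplier that compounds to total_factor over the period."""
    return total_factor ** (1 / years)


def projected_growth(total_factor, years, horizon_years):
    """Growth expected after horizon_years if the historical trend continues."""
    return annual_growth(total_factor, years) ** horizon_years


BASE_YEAR = 2023  # assumption: projections start from the year of publication

memory_months = 6 * (1 - 0.60)  # six-month run at 60% utilization
print(f"time spent waiting on memory: about {memory_months:.1f} months")

for target in (2030, 2040):
    growth = projected_growth(30, 20, target - BASE_YEAR)
    print(f"DRAM bandwidth by {target}: about {growth:.1f}x")
```

With these assumptions the trend compounds to a little over 3 times by 2030 and a little under 20 times by 2040, close to the bounds quoted above and far below the flops and AI compute trends.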

That world looks like getting GPT-5 in 2040. Or, it looks like OpenAI CEO Sam Altman’s recent announcement that we’re at the “end of the era where it’s going to be these, like, giant, giant models” and that they will not be training GPT-5 for “quite some time.”

On the other hand, there are some countervailing factors that might make this picture look different.

Memory Bandwidth Improvements

The memory bandwidth (in GB/s) between DRAM memory and the processor depends on factors such as the memory clock speed (how many operations it can perform per second) and the memory bus width (how many bits of data can be transferred per cycle), as well as other aspects of the memory architecture, chip design, and manufacturing quality.

For memory, as for logic, clock speeds increased along with Moore’s Law over many decades.

But while transistors have continued to get smaller, they have stopped getting faster.

Switching speed depends on the width of a transistor’s gate, but gate widths are now a single molecule wide and can’t actually get any narrower. So further shrinking the other dimensions of transistors doesn’t increase their speed.

Clock speeds, in fact, have been flat since 2004.

So we can’t count on clock speed alone to improve memory bandwidth.

What about increasing bus widths?

Stacking multiple layers of DRAM dies, in a format known as High Bandwidth Memory (HBM), can increase memory bandwidth by allowing larger bus widths. There are more independent connections between the HBM and the GPU, allowing faster data flow.
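To see why bus width matters once clock speeds are flat, here is a minimal sketch of the basic relationship: bandwidth in GB/s is roughly bus width in bits times the data rate per pin in gigabits per second, divided by eight. The two configurations are illustrative numbers picked to contrast a narrow, fast interface with a wide, slower one; they are not specifications quoted in this article.

```python
def bandwidth_gb_per_s(bus_width_bits, gbits_per_pin_per_s):
    """Peak bandwidth in GB/s: pins times per-pin rate, converted from bits to bytes."""
    return bus_width_bits * gbits_per_pin_per_s / 8


narrow_fast = bandwidth_gb_per_s(bus_width_bits=32, gbits_per_pin_per_s=14.0)
wide_slow = bandwidth_gb_per_s(bus_width_bits=1024, gbits_per_pin_per_s=2.0)

print(f"narrow, fast interface:       {narrow_fast:.0f} GB/s")
print(f"wide, slower stacked interface: {wide_slow:.0f} GB/s")
```

In this sketch the wide interface wins by a large margin despite running each pin at a fraction of the speed, which is the trade that stacked memory makes.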

However, HBM chips are far more expensive than traditional DRAM — the most advanced HBM costs $120/GB compared to about $3/GB for DRAM.

Moreover, the limiting factor in stacking more and more layers of memory, or packing circuit elements denser, is heat. Memory requires power to store information, even when it’s not “on,” and it dissipates heat. Today, even going beyond 12 layers may be infeasible due to heat constraints. Memory is especially sensitive to heat, because at higher temperatures, thermal noise can degrade stored data. Faster degradation means the data needs to be refreshed more frequently — but refreshing also generates waste heat! So there’s a vicious cycle where overheating leads to even more overheating.

Improving heat dissipation is an active area of research, and so more heat-efficient memory designs may be invented in future years, but scaling up memory bandwidth above its current slow trajectory is likely to continue to be challenging.

AI Model Efficiency Improvements

Another approach is to redesign AI models (or the algorithms for training them) so that they require less memory bandwidth or computational power.

One recent example of progress in memory is FlashAttention,⁷ a method for computing attention (a component of all current state-of-the-art AI architectures, including LLMs) that reduces how frequently the model accesses memory. On GPT-2, FlashAttention led to a tripling of training speed (although this might be an unusually favorable result: attention takes up about 20% of the cost for most LLM training runs, so the effect would typically be less dramatic).
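The caveat about attention’s share of the cost can be made concrete with Amdahl’s law, which caps the overall speedup by the fraction of the work that is actually accelerated. Applying Amdahl’s law here is my framing rather than the paper’s; it treats the 20% cost share as a share of run time, and the attention speedup factors are hypothetical.

```python
def overall_speedup(fraction_accelerated, component_speedup):
    """Amdahl's law: whole-run speedup when only part of the work gets faster."""
    remaining = 1.0 - fraction_accelerated
    return 1.0 / (remaining + fraction_accelerated / component_speedup)


ATTENTION_SHARE = 0.20  # share of training cost attributed to attention

for factor in (2, 3, 10, float("inf")):  # hypothetical attention-only speedups
    total = overall_speedup(ATTENTION_SHARE, factor)
    print(f"attention {factor}x faster -> whole run {total:.2f}x faster")
```

Even an infinitely fast attention step tops out at a 1.25 times overall speedup when attention is a fifth of the run, so a threefold gain implies attention was a much larger share of that particular workload.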

When training is parallelized across many GPUs, avoiding redundancy in memory storage can also reduce a model’s memory footprint, producing eightfold speedups on a billion-parameter (GPT-2-sized) model. If memory bottlenecks loosen by an order of magnitude or more, we might return to our “flops-bottlenecked” scenario where AI models’ computational load can grow two to three orders of magnitude by 2030 and perhaps as many as five orders of magnitude by 2040.

There have also been innovations on the compute front. Compact open-source models like Alpaca, which uses only 7 billion parameters plus fine-tuning on a combination of human-generated and LLM-generated examples, produce similar performance to the much larger GPT-3 (175B parameters), and Alpaca can be fine-tuned in only 3 hours for less than $100. In the same vein, LoRA⁸ is a novel training scheme that allows computationally cheap fine-tuning of large language models to specific tasks, allowing a 25% speedup on fine-tuning of GPT-3, and much easier parallelization to multiple GPUs.
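The reason low-rank adaptation is cheap comes down to counting. Instead of updating a full weight matrix, LoRA trains two thin matrices whose product has the same shape. The sketch below only counts trainable parameters for a single layer; the layer width and rank are hypothetical values chosen for illustration, and the function names are mine.

```python
def full_update_params(d_out, d_in):
    """Trainable parameters when the whole d_out x d_in weight matrix is updated."""
    return d_out * d_in


def lora_update_params(d_out, d_in, rank):
    """Trainable parameters for a low-rank update B @ A,
    with B of shape (d_out, rank) and A of shape (rank, d_in)."""
    return rank * (d_out + d_in)


WIDTH = 12_288  # hypothetical layer width
RANK = 8        # hypothetical adapter rank

full = full_update_params(WIDTH, WIDTH)
lora = lora_update_params(WIDTH, WIDTH, RANK)
print(f"full fine-tuning: {full:,} trainable parameters")
print(f"low-rank update:  {lora:,} trainable parameters")
print(f"reduction:        {full // lora}x fewer")
```

Fewer trainable parameters means less optimizer state to store and move, which is where the savings in memory and the easier parallelization come from.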

The open-source community’s progress in creating compact, cheap-to-train LLMs may have leapfrogged the big AI companies, at least according to a memo allegedly leaked by a Google employee titled “We Have No Moat, and Neither Does OpenAI.” If large language models don’t have to be large to work well, they need not remain the purview of a handful of large tech companies. If smaller models can match GPT-3-like performance, should we expect that performance far beyond the current state of the art will be achievable with less compute than the 2020 and 2022 scaling laws suggest?

It’s not clear. Small, open-source models like Alpaca and Vicuna showed that fine-tuning a small model with a carefully curated dataset of (real and simulated) human-computer interactions can almost match the performance of a larger LLM trained from scratch. We can interpret this as an insight about task-specific training. These models lack the flexibility of their larger counterparts, and can’t compete on tasks that require more robust reasoning skills. A general LLM trained on human text can, among other things, function as a chatbot; but a smaller LLM trained on a smaller dataset of human text plus a targeted dataset of human-chatbot interactions can perform nearly as well, far more cheaply.

Rather than a refutation of scaling laws, or an acceleration of their slope, I think this is more like a move in a different direction altogether, towards a Cambrian explosion of “little AIs” used for different purposes, where getting good performance on a task depends on the quality of your task-specific dataset. That could be consistent with the state of the art continuing to progress steadily along “scaling law” lines for quite some time, but it could also mean the economic incentive towards ever-bigger models would diminish and we’d enter an entirely new era where AI progress would not be driven primarily by semiconductor scaling.

The End of Scaling: Not the End of AI

What happens when we run up against the limits of AI scaling?

Whether they’re training time limits, data availability limits, or limits based on the cost and availability of computer hardware, we might not be far from the end of the era when the most straightforward way to make models better is to make them bigger.

That doesn’t, of course, mean the end of making models better.

Scaling is just the most naive, straightforward way to improve AI models — and it has worked surprisingly well, for a while. In a world where scaling has stalled, progress in AI will look more like innovation in developing new applications and fine-tuned variants of the big foundation models we already have, along with architectural and algorithmic innovation to push out the fundamental capabilities of the big models without using more data or compute.

A post-scaling scenario for AI might look like an “AI winter,” or it might look like an acceleration of AI capabilities — just driven by other, less predictable factors than the steady drumbeat of Moore’s Law.

¹ Floating point operations per second, flops, is a measure of a computer’s performance. Floating point refers to numbers in a computer’s memory, “floating” because the decimal point can move around. Operations refer to basic arithmetic, while per second is the time component. For context, an Apple M1 chip manages 2.6 teraflops of performance, while an Nvidia A100 GPU manages 312 teraflops. Large labs may use clusters of thousands of GPUs to train a model.
² Graphics processing units are computers that were initially designed for image rendering, but are effective for completing many highly parallelizable tasks, including training and inference of neural network models.
³ Epoch AI formalizes this in one of their analyses: Out of six potentially relevant variables, most of the variance in flops was captured (r² = 95%) by only two, the number of cores and the process size.
⁴ TSMC’s latest “3 nm” fab does not actually produce transistors with gate lengths of 3 nanometers. When referring to a semiconductor manufacturing node, “3 nm” is a marketing term that does not refer to the size of any feature. The actual gate length in a 3 nm node is closer to 16-18 nm.
⁵ Dynamic random-access memory, the most rapidly accessible form of memory that’s not directly on the same chip as the compute.
⁶ This is assuming that the batch size, or number of samples the model processes before it updates, remains constant, which it usually does.
⁷ Tri Dao et al., "FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness," ArXiv, June 24, 2022.
⁸ Edward Hu et al., "LoRA: Low-Rank Adaptation of Large Language Models," ArXiv, October 16, 2021.

Sarah Constantin holds a PhD in mathematics from Yale and blogs at Rough Diamonds.

Published June 2023