
AI: It's All About Inference Now

Model inference has become the critical
driver for model performance.

Michael Gschwind

Public discourse about artificial intelligence and neural networks until recently invariably centered on training. First and foremost, training is a prerequisite to anything that follows. Until very recently, a defining question for AI has been whether neural networks can deliver quality results that make them relevant, and how much computational capacity is necessary to train them.

From the 1990s through the 2010s, the works of LeCun et al. (1998),33 Krizhevsky et al. (2012),31 and Sutskever et al. (2014)48 resoundingly answered this question: Neural networks can deliver capabilities beyond those of traditional systems. While how far training can be pushed remains an important question and the focus of much research, a second question has become topical now that models have reached a quality that makes them worth deploying: whether model outputs can be produced affordably, a step known as inference.

Inference is critical to deploying AI models in the real world, and advancements in AI have made inference a key consideration across the entire life cycle of AI models. Inference optimization is critical to delivering a performant, cost-effective infrastructure with optimizations as varied as quantization, a range of dynamic batching schemes, a variety of caching strategies, and more.

Inference is no longer an afterthought: Architecture design now considers and optimizes for inference cost before models are developed (e.g., mixture-of-experts, multi-query attention, multi-head latent attention), and quantization-aware training is an example of adapting the training process to produce models that deliver higher quality during inference.

Finally, with test-time scaling, inference delivers outcomes beyond a single evaluation of a model by refining answers, sampling multiple solutions, and integrating agentic interactions that let models access additional information and interact with their environment.

From a deployment perspective, model inference comes with a distinct set of cost metrics and tradeoffs related to deployment, operations, accuracy, scalability, and sustainability.56 These metrics materially affect model architecture, model accuracy, inference hardware, and the software stack used for inference. Thus, while computationally model inference is a subset of model training, the usability and affordability of models require drastically different tradeoffs from training.

In a nutshell, inference is "easy" only when cost, time to market, deployment parameters, and response time do not matter. While this is true of training and inference for all model types, it is exacerbated for LLMs (large language models) because of their size and a strong dichotomy between training and inference characteristics. This dichotomy is large enough that vastly different implementations of the LLM transformer modules are used to maximize performance for each use case; the popular BetterTransformer implementation with inference optimization delivers speedups of four times or more for many models.4,21

For transformer-based LLMs, the pretraining process is inherently parallel: A model trains on all tokens of an input sequence in parallel, sharing the computation across these multiple tokens and resulting in efficient hardware utilization with high compute intensity (see figure 1).

In figure 1(a), all transformer positions are trained in parallel, with a causal mask ensuring that each position learns to predict the next token of its prefix. In figure 1(b), after prompt processing ("prefill"), LLM inference generates a single next token based on all previously seen tokens. The token is added to the previously seen tokens, and subsequent tokens are generated one at a time, iteratively. Transformer-based models have been wildly successful, at least in part, because of the efficient use of available hardware parallelism during model pretraining. This hardware efficiency makes training affordable and makes it possible to scale up model size and training environments to improve model quality.

For transformer-based LLM inference, the prompt input is highly parallel and can be performed in a single step to compute activations for all input tokens, making this an efficient compute-intensive processing step. The time for processing the prompt defines the TTFT (time-to-first-token), the time when the first output can be expected from the model.

After generating the first token, autoregressive model inference requires each token of a sequence to be generated sequentially, defining the generation throughput (tokens/sec), and leading to reduced speed and efficiency out of the box. Several techniques are described later in this article that have been developed to optimize autoregressive generation. Increasingly, LLM deployment affordability is measured with the new $/token metric, unifying cost and efficiency.

Efficient exploitation of training parallelism has led to an inexorable march toward training more expensive models with more parameters and bigger training corpora, in line with pretraining scaling laws.29 In contrast, until very recently the focus for inference has been cost reduction, improving the affordability of deployments by reducing resource consumption.23 As pretraining scaling may be reaching a point of diminishing returns, recent advancements in test-time compute scaling are laying the foundation for model inference as a facilitator of model quality.

 

Model Compression

Model compression refers to techniques that reduce the size and computational requirements of neural networks without significantly sacrificing performance. This is crucial for deploying models on resource-constrained devices, accelerating inference, and reducing training costs. Three primary methods are commonly used: pruning, distillation, and quantization.

LLMs are often trained and provided in multiple sizes to suit different deployment needs. For example, the Llama-2 (large language model Meta AI) family includes models with 7, 13, and 70 billion parameters, and the Llama-3 family ranges from 1 to 405 billion parameters.19,53 Training multiple large models from scratch is expensive. Starting with a single, large pretrained model, however, and then applying techniques such as pruning, distillation, and quantization can produce smaller, more efficient models for different deployment scales, and offers improvements compared with training smaller models from scratch. This approach is a more cost-effective way to create LLMs for diverse requirements and use cases.

 

Reducing model parameter count

Pruning and distillation reduce a model's complexity (measured by the number of parameters) by creating a new, derived model for inference. The goal of both techniques is to derive models with smaller sizes and lower computational cost than training a new model from scratch, while also preserving the larger model's learned knowledge and quality. Muralidharan et al.35 suggest a set of best practices for LLM compression and retraining based on their empirical findings: using specific importance estimation techniques for different axes (width, depth), preferring width pruning for smaller models, and leveraging distillation with varying loss components depending on the pruning strategy.

Pruning selectively removes redundant or less-important parts of a neural network to reduce its size and computational complexity.8 Recently, the "lottery ticket hypothesis" provided a theoretical underpinning for the higher importance of a select subgraph of a more complex network relative to the remaining network.15 Weight pruning removes individual weights in the network that have a small magnitude or contribute minimally to the network's output. Unstructured pruning removes individual weights regardless of their location in the network. This often leads to sparse weight matrices, which can be challenging to accelerate on standard hardware.27 Structured pruning removes entire neurons, filters, or channels, resulting in smaller, denser networks that are easier to accelerate and easier to run on edge devices.20

Pruning can also be performed iteratively, where the network is pruned, retrained to recover accuracy, and pruned again. This process can be repeated until the desired level of compression is achieved.26

Distillation trains a smaller "student" model to mimic the behavior and knowledge of a larger "teacher" model, effectively transferring its learned capabilities into a more compact form. Distillation can also be used to improve the robustness and generalization ability of student models.7 The teacher model is typically a pretrained, high-performing model and provides soft targets (probability distributions over classes) for the student model to learn from. This allows the student model to learn more nuanced information than it would from hard targets (one-hot encoded class labels).28
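To make the soft-target idea concrete, the following Python sketch shows a classic distillation loss in the style of Hinton et al.,28 combining a temperature-softened KL term against the teacher's distribution with the usual hard-label cross-entropy. The temperature, weighting, and tensor shapes are illustrative assumptions rather than settings from any particular system.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Soft-target term: student matches the teacher's temperature-softened distribution.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)                               # rescale gradient magnitude (Hinton et al.)
    # Hard-target term: ordinary cross-entropy against the one-hot labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

student_logits = torch.randn(4, 10, requires_grad=True)   # toy batch, 10 classes
teacher_logits = torch.randn(4, 10)
labels = torch.randint(0, 10, (4,))
print(distillation_loss(student_logits, teacher_logits, labels))
```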

Muralidharan et al.35 explore combining these techniques for creating smaller and more efficient LLMs by leveraging pruning and knowledge-distillation techniques with retraining to create "Minitron" models with 4 and 8 billion parameters derived from the Nemotron-4 15B model. They find that pruning a large LLM and then retraining it with a smaller dataset can be a viable alternative to training smaller models from scratch and can significantly reduce training costs. Combining different pruning techniques with knowledge distillation during retraining leads to better results compared with using a single technique. During this process, width pruning (attention heads, neurons, embedding channels) is generally more effective than depth pruning (layers) for the model sizes studied (up to 15 billion parameters).

 

Quantization

Quantization reduces the precision of the weights and activations of a neural network while leaving the architectural structure of the model unchanged. Quantization can be applied to the gamut of value types (e.g., weights, activations, and key-value caches storing activations across iterations) used in a model. By reducing the size of the different value types in models, quantization allows models to take better advantage of a given device.

The benefits from quantization accrue from several factors, affecting both the cost and speed of inference:

• Memory footprint reduction. Reducing the size of weights for parameters and activations enables larger models to fit into a given device, such as an accelerator's limited onboard HBM (high-bandwidth memory) or edge-device system memory. In server applications, this will reduce the number of accelerators needed and increase peak throughput while reducing deployment cost. For edge and mobile on-device inference, memory footprint is the primary determinant of which models can be used on-device.

• Data bandwidth reduction. Reducing the data-type size of processed values also implies that more data can be accessed with a given memory or network bandwidth. This is particularly relevant because arithmetic intensity (as measured in ops/byte), a key metric for computational efficiency, can be as low as 1.0 or less for LLM inference.

• Peak performance per data type. Many more operations can be performed simultaneously per time step on smaller data types on many hardware devices such as GPU accelerators and CPU vector units.22,24,36

During quantization, data widths may be reduced from 32-bit or 16-bit floating-point numbers (FP32, FP16, BF16) to lower-precision formats such as 16-bit, 8-bit, or 4-bit floating-point (FP16, BF16, FP8, FP4), or integer (int16, int8, int4), or even lower. Torchchat provides LLM quantization for a broad set of LLMs to optimize LLM inference from servers to on-device AI.51

Quantization down to 16-bit floating-point types is most often performed by simply rounding values to the new floating-point type.

Quantization beyond 16-bit floating-point types most often requires scaling factors to represent the original value range. For computing scaling factors, a matrix multiplication can be viewed as a sequence of independent inner products along the shared dimension such that C_i,j = a_i · b_j. The computation of a scaling factor for a given vector can then be defined as s(V), and the quantization of the vector as Q(V):

 

equation 1:   s(V) = max(|V|) / (2^(b-1) - 1),   Q(V) = round(V / s(V)),   where b is the bit width of the quantized data type

 

This yields a set of dot products in the quantized domain Q(a_i) · Q(b_j), scaled by the product of scaling factors s(a_i) · s(b_j).

 

equation 2:   C_i,j ≈ s(a_i) · s(b_j) · (Q(a_i) · Q(b_j))

 

This formulation is specifically known as vector-wise (or channel-wise) quantization, where each vector of a dot product has a single scaling factor.
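The following PyTorch sketch illustrates vector-wise symmetric ("absmax") quantization of a matrix multiplication: one scale per row of A and per column of B, int8 storage, and rescaling of the accumulated dot products. The helper names, the int8 choice, and the float-simulated accumulation (real kernels accumulate in int32) are illustrative assumptions.

```python
import torch

def absmax_scale(v: torch.Tensor, dim: int, bits: int = 8) -> torch.Tensor:
    # s(V): one scaling factor per vector along `dim` (symmetric/absmax).
    qmax = 2 ** (bits - 1) - 1                      # 127 for int8
    return v.abs().amax(dim=dim, keepdim=True).clamp(min=1e-8) / qmax

def quantize(v: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    # Q(V): round to the nearest representable integer and store as int8.
    return torch.round(v / scale).to(torch.int8)

def quantized_matmul(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    sa = absmax_scale(a, dim=1)                     # shape (m, 1): one scale per row a_i
    sb = absmax_scale(b, dim=0)                     # shape (1, k): one scale per column b_j
    qa, qb = quantize(a, sa), quantize(b, sb)
    # C[i, j] ≈ s(a_i) * s(b_j) * (Q(a_i) · Q(b_j)); accumulation simulated in FP32 here.
    return (qa.float() @ qb.float()) * sa * sb

a, b = torch.randn(64, 256), torch.randn(256, 32)
print((quantized_matmul(a, b) - a @ b).abs().max())  # small quantization error
```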

As the size of the vectors increases, the larger number of values in vectors a_i and b_j leads to a progressively wider range of numbers being mapped onto the same quantization levels, reducing the effective resolution of the quantized values. Groupwise quantization addresses this degradation by chunking the inner-product terms into groups of n values, defining per-segment scaling and quantization functions s_n^(g) and Q_n^(g) over the k = card(V)/n segments, each segment sharing a single scale:

 

equation 3:   s_n^(g)(V) = s(V^(g)),   g = 1, ..., k,   where V^(g) is the g-th segment of n consecutive values of V

 

and

 

equation 4:   Q_n^(g)(V) = Q(V^(g)) = round(V^(g) / s_n^(g)(V))

 

such that

 

equation 5:   C_i,j ≈ Σ_{g=1..k} s_n^(g)(a_i) · s_n^(g)(b_j) · (Q_n^(g)(a_i) · Q_n^(g)(b_j))
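A minimal sketch of the groupwise scheme in equations 3-5, again in PyTorch: the last dimension of a tensor is split into segments of n values, and each segment gets its own absmax scale. The group size of 32 and the int8 target are illustrative assumptions.

```python
import torch

def groupwise_quantize(v: torch.Tensor, n: int, bits: int = 8):
    # Split the last dimension into k = card(V)/n segments of n values,
    # each quantized with its own shared absmax scale.
    *lead, d = v.shape
    assert d % n == 0, "dimension must be divisible by the group size"
    groups = v.reshape(*lead, d // n, n)
    qmax = 2 ** (bits - 1) - 1
    scales = groups.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / qmax
    return torch.round(groups / scales).to(torch.int8), scales

def groupwise_dequantize(q: torch.Tensor, scales: torch.Tensor) -> torch.Tensor:
    groups = q.float() * scales                     # undo the per-group scaling
    return groups.reshape(*q.shape[:-2], -1)

w = torch.randn(16, 256)
q, s = groupwise_quantize(w, n=32)                  # 256/32 = 8 scales per row
print((groupwise_dequantize(q, s) - w).abs().max()) # small quantization error
```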

 

Quantization implementations offer a wide range of flexibility in deciding when to perform data conversions between a stored quantized representation and the result in a wider data type. The previous quantization example uses symmetric ("absmax") quantization, which quantizes 0 in the input space to 0 in the output space. Asymmetric quantization projects the input values of each vector (or each vector chunk) to a range defined by the minimum and maximum value of each vector (or vector chunk) to make better use of the available encoding space. GPTQ is a one-shot weight quantization method utilizing approximate second-order information to achieve high accuracy and efficiency for generative AI.16 GPTQ enables the quantization of large GPT models with billions of parameters, reducing the bit width to 3 or 4 bits per weight with minimal accuracy loss.

The most common way to quantize a model is with PTQ (post-training quantization).2,46 This method quantizes the data types of a pretrained model without any further training. It is a simple and fast way to reduce model size and improve inference speed, but it can sometimes lead to a drop in accuracy, especially with aggressive quantization levels. To increase result fidelity for quantized models, QAT (quantization-aware training) can simulate and account for the rounding in quantized models during training.37

In many instances, full QAT may not be feasible because a model has already been trained (e.g., pretrained LLMs that are expected to operate in a broad range of scenarios). Quantization-aware fine-tuning offers an alternative to training a model from scratch with quantization-aware training.50

A final dimension in the quantization space is whether to derive the scaling factor when preparing the model for inference or on the fly during inference. In static quantization, a fixed scaling factor is computed ahead of inference time during model preparation and remains constant regardless of model inputs. In dynamic quantization, the scaling factor is computed at model runtime for each model input. This allows for better handling of input distributions by adjusting scaling factors to model inputs but introduces a small computational overhead. For matrix multiplications the overhead is O(nm+mk) relative to O(nmk) for the matrix multiply.

While static quantization is conceptually simpler, computing scaling factors during model preparation for values that are only available at runtime requires a calibration step to estimate the expected range of inputs. The complexity of collecting a calibration data set and calibrating scaling factors for activations and key-value caches negates much of this conceptual simplicity. In comparison, for large models with numerically expensive operators such as matrix multiplications, dynamic quantization allows models to adapt their quantization range to a particular input at runtime.

Weights are constant throughout the life of a model, independent of inputs, so a calibration step is unnecessary. Dynamic quantization offers no advantage here either in terms of adapting to model inputs or by obviating the need for a calibration step during model preparation. Using dynamic quantization for activations and key-value caches and static quantization for weights offers a good balance between accuracy and performance.

Model Inference Optimization for LLMs

In the LLM inference flow for autoregressive text generation, the first pass of inference is performed with the input prompt (in a step known as "prefill"); the output token is then appended to the input, and the newly expanded sequence becomes the new input for generating the next token (figure 1b). As a result, the same input features are repeatedly projected into the attention domain, with only a single row or column added to the matrix for the latest token. Figure 2 shows the computation of scaled dot-product attention during autoregressive text generation. The gray matrix corresponds to values that are cached in a key-value (KV) cache, if one is used.

To compute the interaction with prior tokens and avoid recomputing the full feature-to-key and feature-to-value projections, the prior tokens' key and value vectors can be stored in a KV cache. This effectively reduces the context-length dimension to a single element for the input and output projections, the scaled dot-product computation, and the feed-forward block, shrinking the per-step computation to a minuscule fraction (roughly 1/context length) of the original.

The grayed-out values in figure 2 have been previously computed, and new computation is limited to the features of the latest token and its interaction with prior tokens. The KV cache is "primed" with a prefill operation that evaluates the key and value positions for all input tokens, usually in a single step.
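The following PyTorch sketch shows one single-head decode step with a KV cache: only the newest token is projected, its key and value are appended to the cache, and its query attends over the full history. It is a minimal illustration (no multi-head reshaping, batched prefill, or output projection), and all names and dimensions are assumptions.

```python
import torch
import torch.nn.functional as F

def decode_step(x_new, wq, wk, wv, k_cache, v_cache):
    # Project only the newest token's features (batch, 1, d).
    q = x_new @ wq
    k = x_new @ wk
    v = x_new @ wv
    # Append the new key/value to the cache instead of recomputing past tokens.
    k_cache = torch.cat([k_cache, k], dim=1)        # (batch, s+1, d)
    v_cache = torch.cat([v_cache, v], dim=1)
    # The single new query attends over all cached keys and values.
    out = F.scaled_dot_product_attention(q, k_cache, v_cache)
    return out, k_cache, v_cache

d = 64
wq, wk, wv = (torch.randn(d, d) / d**0.5 for _ in range(3))
k_cache = torch.zeros(1, 0, d)                      # empty cache before prefill
v_cache = torch.zeros(1, 0, d)
for _ in range(4):                                  # a few toy decode steps
    x_new = torch.randn(1, 1, d)
    out, k_cache, v_cache = decode_step(x_new, wq, wk, wv, k_cache, v_cache)
print(k_cache.shape)                                # torch.Size([1, 4, 64])
```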

Two popular open source inference frameworks are vLLM and SGLang.32,59 Both address the optimization of LLM inference. vLLM prioritizes high throughput through efficient memory management and kernel optimization, while SGLang extends this focus with a higher-level language abstraction that facilitates structured output generation and complex workflow design, potentially achieving comparable or superior performance through a co-designed front end and back end. Both frameworks can also leverage FlashInfer, a library specializing in high-performance GPU kernels, particularly for attention mechanisms, to further accelerate inference.57

 

Key-value cache

Efficient management of the KV cache is crucial for optimizing LLM inference, especially on resource-constrained devices such as GPUs. The KV cache stores the attention keys and values computed for the full context length (i.e., all previous tokens), enabling efficient computation of attention scores for subsequent tokens. Storing keys and values for context length S, number of heads H, head dimension D, number of layers L, and data-type width T yields a KV-cache size of 2 * S * H * D * L * T (the factor of 2 accounts for storing both keys and values), further scaled by batch size B when batching is in use. Since the KV cache can consume a large fraction of a model's memory usage during inference, research has focused on reducing its memory footprint.
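As a back-of-the-envelope illustration, the short Python sketch below applies this formula to assumed dimensions loosely resembling a 7B-class model with full multi-head attention; the numbers are illustrative, not official model specifications.

```python
# KV-cache size = 2 (K and V) * S * H * D * L * T, scaled by batch size B.
# Illustrative, assumed dimensions for a 7B-class model without GQA.
S, H, D, L = 4096, 32, 128, 32        # context length, heads, head dim, layers
T, B = 2, 8                           # bytes per element (FP16/BF16), batch size
bytes_per_seq = 2 * S * H * D * L * T
print(f"per sequence: {bytes_per_seq / 2**30:.1f} GiB")       # 2.0 GiB
print(f"for batch {B}: {B * bytes_per_seq / 2**30:.1f} GiB")  # 16.0 GiB
```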

MQA (multi-query attention), GQA (group query attention), and MLA (multi-head latent attention) optimize transformer models by improving attention computation and KV-cache size. MQA increases efficiency by sharing keys (K) and values (V) across attention heads, reducing memory usage and computational cost, especially in large models. This leads to a smaller KV-cache size, making autoregressive inference more efficient.43 GQA balances between MQA and full MHA (multi-head attention) by grouping multiple queries per key-value set, improving efficiency while preserving more model expressiveness than MQA.1 This reduces KV-cache requirements compared with MHA while maintaining better performance than MQA in complex tasks. MLA further optimizes attention by using a shared, lower-dimensional latent space across heads, reducing complexity and accelerating inference. These techniques reduce KV-cache storage, leading to faster processing and better scalability while largely preserving model quality.12
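A minimal PyTorch sketch of the key/value-head sharing behind GQA follows: H_kv cached key/value heads serve H_q query heads, shrinking the KV cache by H_q/H_kv, and MQA is the special case H_kv = 1. The dimensions are made up, and the explicit repeat_interleave expansion is just one simple way to feed a standard attention kernel.

```python
import torch
import torch.nn.functional as F

# Grouped-query attention: H_q query heads share H_kv < H_q key/value heads,
# so only H_kv heads need to be kept in the KV cache. Dimensions are illustrative.
B, S, H_q, H_kv, D = 1, 16, 32, 8, 128
q = torch.randn(B, H_q, S, D)
k = torch.randn(B, H_kv, S, D)               # cached keys: H_kv heads instead of H_q
v = torch.randn(B, H_kv, S, D)               # cached values

group = H_q // H_kv                          # query heads per shared key/value head
k_exp = k.repeat_interleave(group, dim=1)    # expand for a standard attention kernel
v_exp = v.repeat_interleave(group, dim=1)
out = F.scaled_dot_product_attention(q, k_exp, v_exp, is_causal=True)
print(out.shape)                             # torch.Size([1, 32, 16, 128])
print(f"KV-cache reduction vs. MHA: {H_q // H_kv}x")
```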

Several promising approaches have emerged to address the size of KV caches. KV-cache sparsification techniques aim to selectively store only the most important key-value pairs, discarding less relevant ones. Quantization reduces the precision of the stored keys and values, trading off a small amount of accuracy for significant memory savings.6 In addition, quantized KV caches align well with model quantization for model compression and computational efficiency. Chunking divides the sequence into smaller segments, caching only the keys and values for the current segment and evicting older segments as needed. Chang et al. introduce a post-training KV-cache compression framework that leverages low-rank projection to reduce the hidden dimension of KV caches, offering an additional and orthogonal compression dimension to existing quantization and token eviction methods.9

Multilevel caching hierarchies are also being explored, using different levels of storage with varying speeds and capacities. For example, a small, fast on-chip cache could store the most recently accessed keys and values, while a larger, slower off-chip memory could hold the rest. This hierarchical approach aims to minimize access latency by keeping frequently used data readily available. Beyond these core techniques, other optimizations include efficient data structures for storing and retrieving KV pairs, and specialized hardware accelerators designed specifically for attention computation and KV cache management.

The ongoing research into KV-cache optimization is essential for deploying LLMs effectively in diverse applications. By reducing memory requirements and improving inference speed, these techniques pave the way for handling longer sequences, increasing batch sizes, and enabling realtime interactive experiences for chat and media generation applications. Continued exploration of novel methods, including adaptive caching strategies, dynamic quantization, and integration with emerging memory technologies, promises further substantial improvements in the efficiency of LLM inference.

 

Increasing arithmetic intensity of LLM inference

As described, the KV cache reduces the unnecessary recomputation of tokens' key and value representations. Using a KV cache with autoregressive decoding, however, implies that all layers of the model are repeatedly evaluated for a single token at a time, making it woefully inefficient. In supercomputing terms—and artificial intelligence (training and inference) is undoubtedly the ultimate supercomputing application21—the arithmetic intensity of the matrix multiplications that dominate the evaluation of LLMs is dramatically reduced.

A popular way to reason about the performance of applications is the roofline model.55 This is a performance-analysis tool that provides an intuitive upper bound on the achievable performance of a computation on a given hardware platform. It plots performance, typically in operations per second, against arithmetic intensity (operations per byte accessed).

As shown in figure 3, the roofline model constructs a "roof" with two key ceilings: the "compute roof," representing the peak computational throughput of the processor (limited by factors such as clock speed and number of cores/ALUs); and the "memory roof," representing the peak memory bandwidth (limited by the speed of memory interfaces). A given computation's performance is then limited by the lower of these two ceilings for its specific arithmetic intensity. Computations with low arithmetic intensity are memory-bound (performance limited by data bandwidth), while computations with high arithmetic intensity are compute-bound (performance limited by the processor's computational capabilities). This visualization helps identify performance bottlenecks and guide optimization efforts toward either improving memory-access patterns or maximizing computational efficiency.
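The roofline bound is easy to compute directly. The Python sketch below uses assumed hardware numbers (400 TFLOP/s peak compute and 2 TB/s of memory bandwidth, chosen only for illustration) to show how attainable performance grows linearly with arithmetic intensity until the compute roof is reached.

```python
# Roofline model: attainable perf = min(compute roof, arithmetic intensity * memory roof).
# Hardware numbers are illustrative assumptions, not a specific accelerator.
peak_flops = 400e12        # compute roof: 400 TFLOP/s
peak_bw    = 2e12          # memory roof: 2 TB/s of HBM bandwidth

def attainable(arith_intensity_flops_per_byte: float) -> float:
    return min(peak_flops, arith_intensity_flops_per_byte * peak_bw)

for ai in (1, 2, 16, 200, 1000):            # FLOPs per byte moved
    print(f"AI={ai:5d} flop/byte -> {attainable(ai) / 1e12:6.1f} TFLOP/s")
# Single-token decode sits near AI of 1-2 (memory-bound); batched prefill and
# training reach high AI and approach the compute roof.
```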

 

Quantization

Quantization (described in more detail in the previous section on model compression) improves arithmetic intensity and thereby enables exploiting more of the system's computational capability: In roofline terms, increased arithmetic intensity moves memory-bound inference closer to the compute roof. At the same time, on GPUs and many modern vector-oriented CPUs, the shorter data formats also come with increased peak performance. This does not diminish the gains achieved through the increased arithmetic intensity of quantized evaluation; rather, raising the peak ops-per-second ceiling offers an opportunity to grow performance even further.

 

Speculative decoding

Speculative decoding reduces the sequential nature of autoregressive decoding in transformers. Instead of generating one token at a time with a large model, it uses two models: a small "predictor" (or draft) model; and a large "verifier" (or target) model.10 The predictor model produces a predictive sequence consisting of several tokens, which is then checked in parallel with higher arithmetic intensity by the more complex verifier model. In effect, the predictor/verifier paradigm allows the large model to run more efficiently with higher arithmetic intensity.

The speculative predictor model is an autoregressive model producing one token per iteration until a targeted token sequence has been computed. This may be a fixed length, some adaptive threshold such as aggregate confidence at the last token, or prior prediction success. After a token sequence has been proposed by the draft model, this sequence is evaluated in its entirety by the target model in parallel (like training and prefill). Similar to how training and prefill are performed, the target model computes features for all proposed tokens of the sequence and checks whether the target model agrees with the proposed tokens. At the first divergence, the target model's proposed next token is adopted, and then this sequence is used to restart the draft model to produce the next output sequence (see figure 4).

In speculative decoding, the draft and target models take turns. The draft model proposes a sequence of tokens produced sequentially using autoregressive generation. The target model then verifies the multiple proposed tokens in parallel, accepting proposed tokens that match the target's computed next token. At the first divergence, the draft model is reset to consume the target model's generated token, discarding all further predictions. The draft model then produces a next set of proposed tokens until the end-of-sequence <EOS> is reached.

Advantageously, this enables the numerically more complex model to execute with higher arithmetic intensity because it no longer works on a single token at a time; in effect, the model works on the entire proposal in parallel. Often, working on a short sequence is so much more efficient than working on a single token that processing the sequence takes about the same time as processing a single token. Thus, to produce n tokens, you can either run the target model n times, for a runtime of n · t_target, or run a draft model to produce n tokens in n · t_draft time and verify them in t_target, for a total time of n · t_draft + t_target to propose and test the sequence.

The potential of this approach depends on how much faster the draft model is and how many of the proposed tokens are accepted by the target model, which in turn reflects factors such as model quality, the complexity of the sequence, and how closely the two models agree. The verifier model identifies the first token that diverges, rewinds the draft model to that point, and feeds it the verifier's token as its next autoregressive input to repeat the process. The achievable improvement in secs/token therefore depends on how well the models agree. Speedup factors of at least two times are common: For t_draft ≪ t_target, even correctly predicting sequences of just one token at a time yields two tokens per verification (the correctly predicted token plus the target's next token), resulting in a time per token of

 

equation 6:   t_token ≈ (t_draft + t_target) / 2 ≈ t_target / 2   for t_draft ≪ t_target
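The following Python sketch captures the draft/verify loop with greedy acceptance. The model interface (a callable returning next-token logits for every position) and the toy bigram-style "models" are assumptions for illustration; production implementations accept tokens probabilistically (rejection sampling) rather than by exact argmax match.

```python
import torch

@torch.no_grad()
def speculative_decode(draft, target, tokens, n_draft=4, max_len=32):
    # `draft` and `target` map a token sequence to next-token logits per position.
    while len(tokens) < max_len:
        # 1. Draft proposes n_draft tokens autoregressively (cheap, sequential).
        proposal = list(tokens)
        for _ in range(n_draft):
            proposal.append(int(draft(proposal)[-1].argmax()))
        # 2. Target scores the whole proposal in one parallel pass (high intensity).
        target_next = target(proposal[:-1]).argmax(dim=-1)   # prediction per prefix
        # 3. Accept proposals until the first disagreement, then adopt the target's token.
        accepted = list(tokens)
        for pos in range(len(tokens), len(proposal)):
            token = int(target_next[pos - 1])
            accepted.append(token)
            if token != proposal[pos]:
                break
        tokens = accepted
    return tokens[:max_len]

# Toy bigram-style "models" over a 16-token vocabulary (illustration only).
vocab = 16
torch.manual_seed(0)
table_target = torch.randn(vocab, vocab)
table_draft = table_target + 0.1 * torch.randn(vocab, vocab)  # draft approximates target
target = lambda seq: table_target[torch.tensor(seq)]
draft = lambda seq: table_draft[torch.tensor(seq)]
print(speculative_decode(draft, target, [1]))
```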

 

MTP (multi-token prediction)

MTP involves forecasting multiple tokens in parallel, a technique that improves both training efficiency and inference speed. Stern et al. introduced this approach with blockwise parallel decoding, where a model predicts multiple tokens simultaneously rather than one at a time, reducing the number of sequential steps.45

Gloeckle et al. extended the concept by focusing on training LLMs to predict multiple tokens at once, rather than relying solely on next-token prediction.18 Their work demonstrated that training models on multi-token prediction tasks enhances sample efficiency and improves performance on specific benchmarks such as code generation. DeepSeek V3 and R1 adopt MTP to generate predictions for speculative decoding.

 

LayerSkip

LayerSkip combines speculative decoding with early exiting from an LLM.14 Unlike traditional methods, it uses the same LLM for both drafting and verification. Early exit mechanisms allow the model to generate drafts using fewer layers, then refine them with deeper layers, reducing latency, improving memory efficiency, and eliminating the need for a separate model.

LayerSkip capitalizes on the fact that early layers capture basic semantic information, while deeper layers refine nuances. By exiting early, it generates drafts quickly and then feeds them back for deeper layers to assess and improve iteratively. Performance estimates show significant speedups (1.34 to 2.16 times) compared with traditional autoregressive decoding.

 

Dynamic continuous batching

Batching allows taking advantage of more parallel compute resources in GPU cores by adding a batch dimension. Increasing the number of floating-point operations used in a single compute kernel enables processing elements to more efficiently parallelize kernels and take better advantage of available parallel floating-point units. In addition, batched inputs share the weights of operators, resulting in improved arithmetic intensity and more efficient use of parallel resources—in turn, this is critical to lowering the cost per result (e.g., $/token). Unlike batching during training where the inputs are known a priori, inference operates under latency and efficiency constraints, and must balance these two competing demands.

Dynamic (cross-request) batching optimizes LLM inference efficiency by grouping incoming requests into batches for concurrent processing.25 Unlike static batching, which uses fixed batch sizes, dynamic batching runs batches once a batch is full or once a maximum time has elapsed, improving latency versus static batching while maintaining throughput in high-traffic periods. This flexibility allows for efficient use of hardware, minimizing idle time while ensuring acceptable latency for individual requests.

Continuous batching takes this one step further, reshaping a batch as generation for one batch element comes to an end and new requests arrive. Rather than creating a new dynamic batch, continuous batching removes batch elements that have indicated a stopping point and adds new batch elements as they are received.58 A particular consideration is processing the "prefill" stage when new batch elements are added: Prefill processes the prompt as the first step of generation, and prefill inference time scales linearly with the number of prompt tokens. Several schemes, such as chunking the prefill step or precomputing prefill separately, have been proposed to handle the distinct characteristics of prefill in the context of continuous batching.
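The scheduling idea can be sketched in a few lines of Python. Here each request is modeled only by the number of tokens it still needs (an assumption purely for illustration): finished sequences leave the batch at every step, and waiting requests join as soon as a slot frees up rather than waiting for the whole batch to drain.

```python
from collections import deque
import random

def continuous_batching(remaining_tokens, max_batch=4):
    waiting = deque(remaining_tokens)          # incoming requests (prefill not modeled)
    running, steps, completed = [], 0, 0
    while waiting or running:
        while waiting and len(running) < max_batch:   # admit new requests immediately
            running.append(waiting.popleft())
        running = [r - 1 for r in running]            # one decode step for every sequence
        completed += sum(r == 0 for r in running)
        running = [r for r in running if r > 0]       # evict finished sequences
        steps += 1
    return steps, completed

random.seed(0)
lengths = [random.randint(4, 32) for _ in range(16)]  # tokens still to generate per request
steps, completed = continuous_batching(lengths)
print(f"{completed} requests served in {steps} decode steps "
      f"(static batching would pad every batch to its longest request)")
```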

 

Very large language model inference

Generative inference on very large language models becomes achievable through techniques such as distributed inference and MoE (mixture of experts). Pope et al. distribute large transformer models across multiple accelerators using model parallelism.38 They explore several partitioning schemes to reduce bandwidth, latency, and computation cost, making it feasible to run models with hundreds of billions of parameters. This approach allows for significant scaling of inference, lowering the computational overhead and speeding up execution, thus making large-scale deployments more practical.

Recent DeepSeek models—such as V3 and R1—improve generative inference efficiency by using sparse activation strategies with MoE gating.12,13 Instead of activating all experts (submodels) for every input, MoE gating activates only a subset of experts, significantly reducing the computational load. This makes it possible to scale model size by adding more experts without proportionally increasing resource requirements, thus allowing models with billions of parameters to remain computationally affordable.

 

Test-time Compute Scaling

Most recently, inference has also become topical as a new frontier for model response quality. Motivated by concerns about an end of pretraining scaling because of a lack of bigger corpora for training larger models,47 test-time scaling allows models to improve result quality by scaling up inference.

Test-time compute scaling—that is, inference scaling—offers one of the most promising approaches to continue improving AI result quality. Test-time compute scaling represents a significant shift in machine-learning inference, moving beyond fixed computational budgets to dynamically allocated resources for enhanced performance, particularly relevant to LLMs. This paradigm recognizes that inference, like training, can benefit from increased computation, strategically investing computational resources during inference to achieve more accurate and versatile LLM results.44

Consider OpenAI's o1 and o3 models as a practical example, where increased test-time compute leads to improved performance on a range of complex problems. Beeching et al. explore the potential of scaling test-time compute with a hands-on code-based approach using open source models.3 They demonstrate how to significantly improve LLM performance, particularly on complex tasks and even for comparatively small models.

Post-training to enhance accuracy in reasoning and adapt to human preferences, but at a much lower cost than pretraining, is emerging as a core component of the end-to-end training pipeline in LLMs. Implementation of RL (reinforcement learning) techniques such as GRPO (Group Relative Policy Optimization) is critical for advanced reasoning at test time and simultaneously depends on efficient inference techniques to ensure the exploration efficiency of policy models.13,42 In turn, using RL for training chain of thought requires performing inference to determine the output of the thinking steps, making inference accuracy and efficiency performance critical for post-training scaling.

Ensemble methods offer another avenue for test-time compute scaling. By aggregating the predictions of multiple models or multiple instances of the same model, these techniques leverage the "wisdom of crowds" to improve robustness and accuracy. Self-consistency generates multiple candidate outputs and selects the most consistent one, mitigating the impact of stochasticity in the model's output.54 This echoes the concept of ensemble methods but is achieved within a single model execution. Similarly, Monte Carlo dropout applies dropout at inference time to generate diverse outputs, effectively creating an ensemble from a single model.17 Input perturbation, a related technique, introduces small variations to the input and generates outputs for each perturbed input, subsequently combining the results to enhance robustness against input noise.49
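A minimal Python sketch of self-consistency follows: sample several answers from the same model and return the majority vote. The sample_answer callable stands in for "run the LLM with temperature > 0 and extract its final answer," and the noisy toy solver is an assumption for illustration.

```python
import random
from collections import Counter

def self_consistency(sample_answer, prompt, n_samples=8):
    # Draw several candidate answers and keep the most frequent one.
    votes = Counter(sample_answer(prompt) for _ in range(n_samples))
    answer, count = votes.most_common(1)[0]
    return answer, count / n_samples          # majority answer and its agreement rate

random.seed(0)
noisy_solver = lambda prompt: random.choice(["42", "42", "42", "41", "43"])
print(self_consistency(noisy_solver, "What is 6 x 7?"))   # most frequent answer wins
```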

Iterative refinement constitutes a core approach, where models generate initial outputs and subsequently refine them through multiple computational steps. Iterative decoding, particularly prevalent in sequence-generation tasks, refines the generated output through feedback mechanisms or additional processing at each step, akin to dynamic programming.30 Tree search methods, such as beam search and MCTS (Monte Carlo Tree Search), explore the space of possible outputs by constructing a search tree and evaluating different branches.39 A key consideration with search-based methods is the computational cost, which can become prohibitive for complex tasks or realtime applications.

Adaptive computation introduces a dynamic element to test-time compute scaling. Rather than fixing the computational budget a priori, these methods adjust the resources allocated based on the input characteristics or the model's confidence. Conditional computation uses different parts of the model or varying amounts of computation depending on the input.5 Early exiting strategies terminate computation early if the model achieves sufficient confidence in its prediction.52 Defining appropriate confidence metrics and exit criteria is a key challenge in adaptive computation.

Agentic systems, whether built around a single agent or composed of multiple agent models, allow inference to tackle problems that are not part of a model's training set. The integration of external resources represents a powerful form of test-time compute scaling. Retrieval augmentation leverages external databases or knowledge graphs to provide contextual information to LLMs and improve their output for complex, knowledge-rich tasks.34 Code execution empowers models to perform calculations, access APIs, or interact with external systems.11

LangChain provides a modular framework for LLM application development, abstracting complex processes such as prompt engineering, data retrieval, and agent orchestration to facilitate the creation of scalable, context-aware applications. SGLang is a structured generation language designed to simplify and optimize interaction with LLMs by offering a high-level abstraction for controlling their output and execution. It allows the explicit definition of output structures, enabling precise control over LLM responses and facilitating the development of agentic AI systems that can reliably parse and use generated information for subsequent actions.

Schick et al. explore how LLM-based agents can self-learn to invoke external tools and APIs dynamically, enhancing their problem-solving capabilities.40 Seo et al. discuss how LLMs can act as multi-agent systems using cooperative methods, leveraging the relevance of information and plan validation to improve dynamic collaboration.41

Conclusion

Model training has long dominated the discussion around artificial intelligence by providing the yardstick for model quality performance. Application deployment and deployment efficiency have long been the domain of inference. Inference metrics such as cost per token guide the development of more efficient models, pushing the boundaries of what's possible within resource limitations.

Inference optimization opens doors to a multitude of benefits that enhance model utility, quality, and efficiency. By optimizing models for inference, we address critical issues such as cost, scalability, response time, and sustainability. This allows for the deployment of powerful models on resource-constrained devices, facilitating AI applications at the edge. Additionally, efficient inference techniques such as pruning, quantization, KV caches, speculative decoding, and dynamic continuous batching significantly reduce the computational burden and improve efficiency, making models more accessible and affordable to implement.

As the scaling of pretraining is reaching a plateau of diminishing returns, model inference is quickly becoming an important driver for model performance. Today, test-time compute scaling offers a new, exciting avenue to increase model performance beyond what can be achieved with training, and test-time compute techniques cover a fertile area for many more breakthroughs in AI. Innovations using ensemble methods, iterative refinement, repeated sampling, retrieval augmentation, chain-of-thought reasoning, search, and agentic ensembles are already yielding improvements in model quality performance and offer additional opportunities for future growth.

 

Acknowledgments

The author would like to thank Rich James (Google) and Katharina Gschwind (Meta Platforms) for their feedback and suggestions on drafts of this paper.

 

References

1. Ainslie, J., Lee-Thorp, J., de Jong, M., Zemlyanskiy, Y., Lebrón, F., Sanghai, S. 2023. GQA: training generalized multi-query transformer models from multi-head checkpoints. arXiv:2305.13245; https://arxiv.org/abs/2305.13245.

2. Anderson, M., et al. 2021. First-generation inference accelerator deployment at Facebook. arXiv:2107.04140; https://arxiv.org/abs/2107.04140.

3. Beeching, E., Tunstall, L., Rush, S. 2024. Scaling test time compute with open models: tutorial and experiments to outperform Llama 3.1 70B on MATH-500 with a 3B Model. Hugging Face blog; https://huggingface.co/spaces/HuggingFaceH4/blogpost-scaling-test-time-compute.

4. Belkada, Y., Marty, F., Benayoun, M., Han, E., Shojanazeri, H., Puhrsch, C., Guessous, D., Gschwind, M., Chauhan, G. 2022. BetterTransformer, out of the box performance for Hugging Face transformers; https://medium.com/pytorch/bettertransformer-out-of-the-box-performance-for-huggingface-transformers-3fbe27d50ab2.

5. Bengio, Y., Léonard, N., Courville, A. 2013. Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv:1308.3432; https://arxiv.org/abs/1308.3432.

6. Bondarenko, Y., Nagel, M., Blankevoort, T. 2021. Understanding and overcoming the challenges of efficient transformer quantization. arXiv:2109.12948; https://arxiv.org/abs/2109.12948.

7. Buciluă, C., Caruana, R., Niculescu-Mizil, A. 2006. Model compression. Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 535–541; https://dl.acm.org/doi/10.1145/1150402.1150464.

8. Cao, Y., Xu, W.-J., Shen, Y., Shi, W., Chan, C.-M., Xu, J. 2025. PIP: Perturbation-based Iterative Pruning for large language models. arXiv:2501.15278; https://arxiv.org/abs/2501.15278.

9. Chang, C.-C., Lin, W.-C., Lin, C.-Y., Chen, C.-Y., Hu, Y.-F., Wang, P.-S., Huang, N.-C., Ceze, L., Abdelfattah, M. S., Wu, K.-C. 2024. Palu: compressing KV-Cache with low-rank projection. arXiv:2407.21118; https://arxiv.org/abs/2407.21118.

10. Chen, C., et al. 2023. Accelerating large language model inference with speculative sampling. arXiv:2302.01318; https://arxiv.org/abs/2302.01318.

11. Chen, M., Tworek, J., et al. 2021. Evaluating large language models trained on code. arXiv:2107.03374; https://arxiv.org/abs/2107.03374.

12. DeepSeek-AI. 2024. DeepSeek-V3 technical report. arXiv:2412.19437; https://arxiv.org/abs/2412.19437.

13. DeepSeek-AI. 2025. DeepSeek-R1: incentivizing reasoning capability in LLMs via reinforcement learning. arXiv:2501.12948; https://arxiv.org/abs/2501.12948.

14. Elhoushi, M., Shrivastava, A., Liskovich, D., Hosmer, B., Wasti, B., Lai, L., Mahmoud, A., Acun, B., et al. 2024. LayerSkip: enabling early exit inference and self-speculative decoding. arXiv:2404.16710; https://arxiv.org/abs/2404.16710.

15. Frankle, J., Carbin, M. 2018. The lottery ticket hypothesis: finding sparse, trainable neural networks. International Conference on Learning Representations. arXiv:1803.03635; https://arxiv.org/abs/1803.03635.

16. Frantar, E., Ashkboos, S., Hoefler, T., Alistarh, D. 2022. GPTQ: accurate post-training quantization for generative pre-trained Transformers. arXiv:2210.17323; https://arxiv.org/abs/2210.17323.

17. Gal, Y., Ghahramani, Z. 2016. Dropout as a Bayesian approximation: representing model uncertainty in deep learning. Proceedings of the 33rd International Conference on Machine Learning, 1050–1059; https://dl.acm.org/doi/10.5555/3045390.3045502.

18. Gloeckle, F., Idrissi, B. Y., Rozière, B., Lopez-Paz, D., Synnaeve, G. 2024. Better & faster large language models via multi-token prediction. Proceedings of the 41st International Conference on Machine Learning, 15706-15734; https://dl.acm.org/doi/10.5555/3692070.3692699.

19. Grattafiori, A., et al. 2024. The Llama 3 Herd of Models. arXiv:2407.21783; https://arxiv.org/abs/2407.21783.

20. Gschwind, K. 2021. Model compression and AutoML for efficient click-through rate prediction. MEng. thesis, MIT; https://dspace.mit.edu/bitstream/handle/1721.1/139253/Gschwind-gschwind-meng-eecs-2021-thesis.pdf.

21. Gschwind, M. 2024. LLMs everywhere: acceleration from servers to mobile devices in the age of generative AI. Keynote speech at the International Conference on Supercomputing; https://ics2024.github.io/keynote.html.

22. Gschwind, M. 2016. Workload acceleration with the IBM POWER vector-scalar architecture, IBM Journal of Research and Development 60(2-3); https://ieeexplore.ieee.org/document/7442604.

23. Gschwind, M., Han, E., Wolchok, S., Zhu, R., Puhrsch, C. 2022. A better transformer for fast transformer inference. PyTorch blog; https://pytorch.org/blog/a-better-transformer-for-fast-transformer-encoder-inference/.

24. Gschwind, M., Hofstee, P., Flachs, B., Hopkins, M., Watanabe, Y., Yamazaki, T. 2006. Synergistic processing in Cell's multicore architecture, IEEE Micro 26(2), 10-24; https://ieeexplore.ieee.org/document/1624323.

25. Gupta, N., Gschwind, M., Husa, D., Dewan, C., Khabsa, M. 2023. MultiRay: optimizing efficiency for large-scale AI models. Meta AI blog. https://ai.meta.com/blog/multiray-large-scale-AI-models/.

26. Han, S., Mao, H., Dally, W. J. 2016. Deep compression: compressing deep neural networks with pruning, trained quantization, and Huffman coding. International Conference on Learning Representations; https://arxiv.org/abs/1510.00149.

27. Han, S., Pool, J., Tran, J., Dally, W. 2015. Learning both weights and connections for efficient neural networks. Proceedings of the 29th International Conference on Neural Information Processing Systems, volume 1, 1135–1143; https://dl.acm.org/doi/10.5555/2969239.2969366.

28. Hinton, G., Vinyals, O., Dean, J. 2015. Distilling the knowledge in a neural network. NIPS 2014 Deep Learning Workshop. arXiv:1503.02531; https://arxiv.org/abs/1503.02531.

29. Hoffmann, J., Borgeaud, S., Mensch, A., Buchatskaya, E., Cai, T., Rutherford, E., de Las Casas, D., Hendricks, L. A., Welbl, J., Clark, A., et al. 2022. Training compute-optimal large language models. Proceedings of the 36th International Conference on Neural Information Processing Systems, 30016-30030; https://dl.acm.org/doi/10.5555/3600270.3602446.

30. Jelinek, F. 1969. A fast sequential decoding algorithm using a stack. IBM Journal of Research and Development 13(6), 675–685; https://dl.acm.org/doi/abs/10.1147/rd.136.0675.

31. Krizhevsky, A., Sutskever, I., Hinton, G. 2012. ImageNet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems, 1097–1105; https://papers.nips.cc/paper_files/paper/2012/hash/c399862d3b9d6b76c8436e924a68c45b-Abstract.html.

32. Kwon, W., Li, Z., Zhuang, S., Sheng, Y., Zheng, L., Yu, C. H., Gonzalez, J. E., Zhang, H., Stoica, I. 2023. Efficient memory management for large language model serving with paged attention. arXiv:2309.06180; https://arxiv.org/abs/2309.06180.

33. LeCun, Y., Bottou, L., Bengio, Y., Haffner, P. 1998. Gradient-based learning applied to document recognition. Proceedings of the IEEE 86(11), 2278–2324; https://ieeexplore.ieee.org/document/726791.

34. Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., et al. 2020. Retrieval-augmented generation for knowledge-intensive NLP tasks. Proceedings of the 34th International Conference on Neural Information Processing Systems, 9459–9474; https://dl.acm.org/doi/abs/10.5555/3495724.3496517.

35. Muralidharan, S., Sreenivas, S. T., Joshi, R., Chochowski, M., Patwary, M., Shoeybi, M., Catanzaro, B., Kautz, J., Molchanov, P. 2024. Compact language models via pruning and knowledge distillation. Advances in Neural Information Processing Systems 37; https://papers.nips.cc/paper_files/paper/2024/hash/4822991365c962105b1b95b1107d30e5-Abstract-Conference.html.

36. NVIDIA. 2024. NVIDIA Blackwell Architecture Technical Brief: powering the new era of generative AI and accelerated computing; https://resources.nvidia.com/en-us-blackwell-architecture.

37. Or, A., Zhang, J., Smothers, E., Khandelwal, K., Rao, S. 2023. Quantization-aware training for large language models with PyTorch. PyTorch blog; https://pytorch.org/blog/quantization-aware-training/.

38. Pope, R., Douglas, S., Chowdhery, A., Devlin, J., Bradbury, J., Levskaya, A., Heek, J., Xiao, K., Agrawal, S., Dean, J. 2023. Efficiently scaling transformer inference. Proceedings of Machine Learning and Systems 5; https://proceedings.mlsys.org/paper_files/paper/2023/hash/c4be71ab8d24cdfb45e3d06dbfca2780-Abstract-mlsys2023.html.

39. Russell, S. J., Norvig, P. 2010. Artificial Intelligence: A Modern Approach. Pearson.

40. Schick, T., Dwivedi-Yu, J., Dessì, R., Raileanu, R., Lomeli, M., Zettlemoyer, L., Cancedda, N., Scialom, T. 2023. Toolformer: language models can teach themselves to use tools. Proceedings of the 37th International Conference on Neural Information Processing Systems, 68539–68551; https://dl.acm.org/doi/10.5555/3666122.3669119.

41. Seo, S., Noh, S., Lee, J., Lim, S., Lee, W. H., Kang, H. 2024. REVECA: adaptive planning and trajectory-based validation in cooperative language agents using information relevance and relative proximity. arXiv:2405.16751; https://arxiv.org/abs/2405.16751.

42. Shao, Z., Wang, P., Zhu, Q., Xu, R., Song, J., Bi, X., Zhang, H., Zhang, M., Li, Y. K., Wu, Y., Guo, D. 2024. DeepSeekMath: pushing the limits of mathematical reasoning in open language models. arXiv:2402.03300; https://arxiv.org/abs/2402.03300.

43. Shazeer, N. 2019. Fast transformer decoding: one write-head is all you need. 33rd International Conference on Neural Information Processing Systems. arXiv:1911.02150; https://arxiv.org/abs/1911.02150.

44. Snell, C., Lee, J., Xu, K., Kumar, A. 2024. Scaling LLM test-time compute optimally can be more effective than scaling LLM parameters. arXiv:2408.03314; https://arxiv.org/abs/2408.03314.

45. Stern, M., Shazeer, N., Uszkoreit, J. 2018. Blockwise parallel decoding for deep autoregressive models. Proceedings of the 32nd International Conference on Neural Information Processing Systems; https://dl.acm.org/doi/10.5555/3327546.3327673.

46. Subramanian, S., Saroufim, M., Zhang, J. 2022. Practical quantization in PyTorch. PyTorch blog; https://pytorch.org/blog/quantization-in-practice/.

47. Sutskever, I. 2024. Sequence to sequence learning with neural networks: what a decade. Test of Time Award Talk at NeurIPS; https://www.youtube.com/watch?v=1yvBqasHLZs.

48. Sutskever, I., Vinyals, O., Le, Q. V. 2014. Sequence to sequence learning with neural networks. Proceedings of the 28th International Conference on Neural Information Processing Systems, volume 2, 3104–3112; https://dl.acm.org/doi/10.5555/2969033.2969173.

49. Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I., Fergus, R. 2013. Intriguing properties of neural networks. arXiv:1312.6199; https://arxiv.org/abs/1312.6199.

50. Team PyTorch. 2023. torchtune: easily fine-tune LLMs using PyTorch. PyTorch blog; https://pytorch.org/blog/torchtune-fine-tune-llms/.

51. Team PyTorch. 2024. Introducing torchchat: accelerating local LLM inference on laptop, desktop, and mobile. PyTorch blog; https://pytorch.org/blog/torchchat-local-llm-inference/.

52. Teerapittayanon, S., McDanel, B., Kung, H.-T. 2016. BranchyNet: fast inference via early exiting from deep neural networks. 25th International Conference on Pattern Recognition, 2464–2469. arXiv:1709.01686; https://arxiv.org/abs/1709.01686.

53. Touvron, H., et al. 2023. Llama 2: open foundation and fine-tuned chat models. arXiv:2307.09288; https://arxiv.org/abs/2307.09288.

54. Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., Zhou, D. 2023. Self-consistency improves chain of thought reasoning in language models. arXiv:2203.11171; https://arxiv.org/abs/2203.11171.

55. Williams, S., Waterman, A., Patterson, D. 2009. Roofline: an insightful visual performance model for multicore architectures. Communications of the ACM 52(4), 65–76; https://dl.acm.org/doi/10.1145/1498765.1498785.

56. Wu, C.-J., Raghavendra, R., Gupta, U., Acun, B., Ardalani, N., Maeng, K., Chang, G., Aga, F., Huang, J., Bai, C., Gschwind, M., Gupta, A., Ott, M., Melnikov, A., Candido, S., Brooks, D., et al. 2022. Sustainable AI: environmental implications, challenges and opportunities. Machine Learning and Systems 4. arXiv:2111.00364; https://arxiv.org/abs/2111.00364.

57. Ye, Z., Chen, L., Lai, R., Lin, W., Zhang, Y., Wang, S., Chen, T., Kasikci, B., Grover, V., Krishnamurthy, A., Ceze, L. 2025. FlashInfer: efficient and customizable attention engine for LLM inference serving. arXiv:2501.01005; https://arxiv.org/abs/2501.01005.

58. Yu, G., Jeong, J., Kim, G., Kim, S., Chun, B. 2022. Orca: a distributed serving system for transformer-based generative models. 16th USENIX Symposium on Operating Systems Design and Implementation; https://www.usenix.org/system/files/osdi22-yu.pdf.

59. Zheng, L., Yin, L., Xie, Z., Sun, C., Huang, J., Yu, C. H., Cao, S., Kozyrakis, C., Stoica, I., Gonzalez, J. E., Barrett, C., Sheng, Y. 2023. SGLang: efficient execution of structured language model programs. arXiv:2312.07104; https://arxiv.org/abs/2312.07104.

 

Dr. Michael Gschwind is a Distinguished Engineer at NVIDIA in DGX Cloud and AI optimization. He previously created and led GPU Inference, the PyTorch generative AI stack for GPU-accelerated AI servers and mobile/edge on-device AI, and AI training at Meta AI. Prior to joining Meta, he was architecture lead for Cell, the first general-purpose programmable GPU, was chief architect for three Top-1 supercomputers (Roadrunner, BlueGene, Summit), and three game-console processors (PlayStation 3, Xbox 360, Wii) at IBM. Dr. Gschwind has also been a faculty member at Technische Universität Wien and Princeton University. He is a Fellow of the IEEE.

Copyright © 2025 held by owner/author. Publication rights licensed to ACM.

Originally published in Queue vol. 23, no. 2









