Optimizing GPU Memory for LLMs: A Deep Dive into Paged Attention
In today’s fast-paced world, it’s no secret that the demand for GPU resources has never been higher. Nvidia, now the world’s most valuable company, stands as a testament to this soaring demand. With every enterprise racing to transform their business using Generative AI, the need for these powerful machines has become a critical priority. For ML Engineers, the pressure from stakeholders to minimize GPU requirements for running Large Language Models (LLMs) is a constant challenge. It's not just about performance - on the business side, it’s important to squeeze the most value out of every GPU to maximize ROI and keep costs in check.
By reducing the memory needed to run these models, businesses can serve more users with their current hardware, or even downsize to smaller, more readily available (and cheaper!) GPUs. Techniques like quantization are effective for shrinking a model’s GPU footprint, but for this explanation, we’ll consider the model size fixed and focus on other strategies to optimize GPU memory utilization. After a quick foray into the importance of optimizing LLM inference, we can dive into a specific technique known as Paged Attention to help you get the most out of your GPUs.
The Importance of Optimizing LLM Inference
Optimizing runtime inference is a crucial step in bringing LLMs into real-world applications. No one will use a chatbot that takes 10 minutes to respond to a simple question, or a code completion tool that lags behind your typing. In the case of a batch processing data pipeline, a slow model can bottleneck the entire process, or even preclude it from running at all. For people to adopt these tools, they need to be fast and responsive; and if the goal is to seamlessly embed Generative AI into daily business operations, performance is key. By increasing throughput—the number of sequences processed in a given time—and reducing latency—the time it takes to get a response—we can serve more users simultaneously and finish tasks in less time. This makes applications more responsive and cheaper to use, and makes serving models more efficient and accessible.
Much of LLM inference optimization focuses on efficient implementations of the attention mechanism in the transformer architecture, as broken down in Fig 1. Before we can optimize this process, we need to discuss how this architecture handles a prompt and generates an output sequence.
Prefilling & Decoding
In a typical interaction with a large language model (LLM), there are two main phases: prefill and decoding. During the prefill phase, the model processes the entire prompt in parallel—calculating hidden states and attention outputs for all prompt tokens simultaneously. This parallelization makes the prefill phase very efficient. Once the prompt is fully processed, the model transitions to the decoding phase. In this phase, tokens are generated one by one, each depending on the prompt and all previously generated tokens. Because each step’s output relies on the prior steps, the decoding process must happen sequentially. The generation continues until a stopping criterion (e.g., maximum token limit or special stop token) is reached, at which point the model terminates its output.
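As a rough sketch of this two-phase flow, here is a self-contained toy in Python - the "model" here is just a random embedding table standing in for a real transformer, and the function names are purely illustrative:

```python
import numpy as np

# Toy stand-in for a language model forward pass over a whole sequence:
# it returns "logits" over a tiny vocabulary for the LAST position only.
# Purely illustrative - not a real model or any real library's API.
VOCAB_SIZE = 16
rng = np.random.default_rng(0)
EMBEDDINGS = rng.normal(size=(VOCAB_SIZE, 8))

def toy_forward(token_ids):
    hidden = EMBEDDINGS[token_ids].mean(axis=0)   # stand-in for attention layers
    return EMBEDDINGS @ hidden                    # next-token logits

def generate(prompt_ids, max_new_tokens, stop_id=0):
    sequence = list(prompt_ids)
    # Prefill: the first forward pass processes every prompt token at once.
    logits = toy_forward(sequence)
    for _ in range(max_new_tokens):
        next_id = int(np.argmax(logits))          # greedy decoding
        if next_id == stop_id:                    # stopping criterion
            break
        sequence.append(next_id)
        # Decode: one token per step. Written naively like this, every step
        # re-processes the entire sequence - the redundancy that the KV Cache
        # (discussed below) removes.
        logits = toy_forward(sequence)
    return sequence[len(prompt_ids):]

print(generate([3, 5, 7], max_new_tokens=5))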
As the name suggests, the prefill phase is run only once for a given sequence, while the decoding phase typically occurs multiple times. After each decode step, the model’s output becomes the input for the subsequent step. To ensure that each token only “attends” to preceding tokens, we employ masking (see Fig. 1). Because of this masking, a token’s representation depends solely on previously generated tokens and since those tokens do not change from one decode step to the next, their representations also remain consistent. This stability creates an opportunity to cache the repeated computations—specifically, the transformer’s “keys” and “values.” We call this the Key-Value (KV) Cache, and it helps avoid redundant calculations during sequential decoding.
If we were to calculate the full-sequence attention at every decode step, the number of computational steps required would scale quadratically¹ in the length of the sequence. To put this in perspective, it would require 100x the FLOPs (floating-point operations, the standard measure of compute) to compute the 100th token vs the 10th token. Because we've stashed the results of our previous decoding steps in GPU memory in our KV Cache, the amount of computation required only scales linearly with our sequence length.
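As a back-of-the-envelope comparison, take one query-key dot product as the unit of work: recomputing full attention at decode step $t$ touches roughly $t^2$ query-key pairs, while a cached step only computes the newest query against the $t$ stored keys. Comparing the 100th decode step to the 10th:

$$\text{no cache: } \frac{100^2}{10^2} = 100\times \qquad\qquad \text{with KV Cache: } \frac{100}{10} = 10\times$$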
The KV Cache
Overview
Leveraging a KV Cache enables us to focus only on the attention of the final token for any sequence when decoding. This means we can work on smaller matrices and achieve the same output with many fewer GPU FLOPs. Fig. 3 below illustrates how this works: our Key and Value matrices are stored from the prefill, and updated with the new values from each decoding iteration. When decoding, we work with single token embeddings and can pull in our cached values to compute an output attention vector.
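Here is a minimal numpy sketch of a single decode step against a cached Key/Value matrix - one attention head, illustrative dimensions, projection matrices omitted (random vectors stand in for the projected embeddings):

```python
import numpy as np

d = 64                                    # head dimension (illustrative)
rng = np.random.default_rng(0)

# Cached keys/values from the prefill and earlier decode steps: [seq_len, d]
k_cache = rng.normal(size=(10, d))
v_cache = rng.normal(size=(10, d))

# The new token's query/key/value vectors (projections omitted for brevity).
q_new = rng.normal(size=(d,))
k_new = rng.normal(size=(d,))
v_new = rng.normal(size=(d,))

# Append the new key/value to the cache...
k_cache = np.vstack([k_cache, k_new])
v_cache = np.vstack([v_cache, v_new])

# ...then attend with ONLY the new token's query against the full cache.
scores = k_cache @ q_new / np.sqrt(d)     # [seq_len + 1]
weights = np.exp(scores - scores.max())
weights /= weights.sum()                  # softmax
attn_out = weights @ v_cache              # [d] attention output for the new token
```

The new token's query attends over every cached key, but nothing for earlier tokens is recomputed.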
While the KV Cache dramatically boosts computational performance, it must be fully stored in GPU memory—a limited and valuable resource. This memory requirement bounds the number of sequences we can process simultaneously; in practice, it becomes the bottleneck that limits the number of sequences LLM serving systems can process in parallel². Below is the formula for calculating the memory footprint of our KV Cache, which can grow to the order of gigabytes per sequence.
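One standard way to estimate it (a sketch under common assumptions: keys and values stored in 16-bit precision, with $n_{\text{kv heads}}$ key-value heads per layer, which is fewer than the number of query heads when grouped-query attention is used):

$$\text{KV Cache size (bytes)} \approx 2 \times n_{\text{layers}} \times n_{\text{kv heads}} \times d_{\text{head}} \times \text{seq\_len} \times \text{batch\_size} \times 2\,\text{bytes}$$

The leading factor of 2 accounts for storing both keys and values. Plugging in illustrative values - 32 layers, 8 key-value heads, head dimension 128 - each token costs 2 × 32 × 8 × 128 × 2 bytes ≈ 128KB, so a single 8192-token sequence occupies roughly 1GB.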
Memory Management
The KV Cache introduces another demand for our precious GPU memory. It's reasonable to consider the weights of the model as the predominant load on the GPU, but as we serve numerous users, the KV Cache will start to dominate memory usage³.
This memory is far harder to manage. Unlike the model weights, which have a fixed, permanent size, the KV Cache grows and shrinks with the number of sequences being processed in parallel. This is a problem because GPU memory is fixed: if we do not bound the cache, it can exceed the space remaining on the GPU. Once we hit an out-of-memory error we can no longer continue decoding (generating tokens), since each step requires additional memory to append new tokens to every in-flight sequence. Without any limit on the KV Cache, a single user could send a request that causes a denial of service for all other users - a serious security vulnerability. To prevent this, we cap the total memory of the KV Cache by setting static limits on the maximum length a sequence can grow to and the maximum number of sequences that can be processed in parallel. In practice this means assigning a rectangular slab of memory on the GPU, as illustrated in Fig 5. A sequence is only pulled into processing if there is an available row, and we are now protected against runtime out-of-memory issues.
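A rough sketch of this static scheme, with illustrative sizes and hypothetical helper functions (no particular serving framework's API is implied):

```python
# Static "rectangular slab" KV Cache reservation - illustrative numbers only.
MAX_BATCH_SIZE = 8                  # maximum sequences processed in parallel
MAX_SEQ_LEN = 8192                  # maximum length any sequence may reach
KV_BYTES_PER_TOKEN = 128 * 1024     # e.g. ~128KB per token in FP16 (illustrative)

# The whole slab is reserved up front, whether sequences ever fill it or not.
reserved = MAX_BATCH_SIZE * MAX_SEQ_LEN * KV_BYTES_PER_TOKEN
print(f"Reserved for KV Cache: {reserved / 2**30:.1f} GiB")   # 8.0 GiB

free_rows = list(range(MAX_BATCH_SIZE))
active = {}                         # request_id -> row index in the slab

def admit(request_id):
    """Pull a request into processing only if a whole row is free."""
    if not free_rows:
        return False                # request must queue; we never over-commit memory
    active[request_id] = free_rows.pop()
    return True

def finish(request_id):
    """Release the row when the sequence terminates, however short it was."""
    free_rows.append(active.pop(request_id))
```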
Researchers at Stanford and Berkeley found that allocating a static KV cache for a worst-case scenario—where every incoming sequence is assumed to grow to its maximum length—can waste up to 80% of GPU memory. While this approach ensures we never trigger catastrophic out-of-memory errors, it also causes inefficient resource usage when most sequences terminate well before the maximum length. See Fig 6, and imagine setting aside 1GB of GPU memory for requests like these, only to use ~25 tokens of the 8192-token reservation.
This inefficiency is illustrated on the left side of Figure 7; for an application where high request concurrency is important, the right side shows a more optimal reorganization of the cache. Instead of changing the total cache size, we simply trade off some maximum sequence length capacity to accommodate more sequences at once, thereby improving throughput.
While the concept of dynamically trading sequence length for batch size is appealing, it becomes complex in practice when the KV cache is pre-allocated. Because reallocating this memory at runtime can be costly and risky, we need a mechanism that can efficiently manage sequences within a fixed GPU memory block. The same paper that highlighted this inefficiency also introduced Paged Attention, a novel approach that addresses these challenges.
Paged Attention
Paged Attention took inspiration from operating systems and how they flexibly manage their memory to allow many programs to run simultaneously. To understand this, we need to first look at the root of the problem that LLM request scheduling shares with operating systems: memory fragmentation.
Memory Fragmentation refers to the condition where available memory becomes broken into numerous small, noncontiguous blocks, making it difficult or impossible to allocate large contiguous memory segments even if sufficient total free space exists. This fragmentation manifests as gaps between occupied memory areas and occurs over time as processes frequently request and release memory. It comes in two main flavours:
1. Internal Memory Fragmentation: occurs when programs are given all the memory they will potentially need before starting. They may or may not use this memory eventually, but until they use it or the program ends, it is reserved and idle.
2. External Memory Fragmentation: when programs are allocated and deallocated in contiguous memory, gaps are left behind as programs are evicted (see Fig 8, left side). The restriction to contiguous memory means that future programs may not fit into the available gaps, despite the total free memory being enough to accommodate them. Because we don't know the lifetime, and therefore the order of termination, of our programs in advance, it is difficult to pack program memory in a way that prevents these gaps from occurring (the toy example below illustrates this).
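A toy illustration of external fragmentation under contiguous allocation - a hypothetical slot-based allocator, not how a real operating system manages memory:

```python
# Toy contiguous allocator: each slot holds a program id, None means free.
memory = [None] * 10

def alloc(size, pid):
    """Find the first run of `size` contiguous free slots, or fail."""
    run = 0
    for i, slot in enumerate(memory):
        run = run + 1 if slot is None else 0
        if run == size:
            start = i - size + 1
            memory[start:i + 1] = [pid] * size
            return start
    return None                     # no contiguous gap large enough

def free(pid):
    for i, slot in enumerate(memory):
        if slot == pid:
            memory[i] = None

alloc(3, "A"); alloc(3, "B"); alloc(3, "C")   # memory: A A A B B B C C C _
free("B")                                      # memory: A A A _ _ _ C C C _
print(memory.count(None))                      # 4 free slots in total...
print(alloc(4, "D"))                           # ...but no contiguous run of 4 -> None
```

The allocator has enough total free memory for the new request, yet the request fails because no single gap is large enough.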
So how does this situation from Operating Systems apply to our KV Cache Management? In our context, the internal fragmentation is all the empty space given to a sequence that it hasn't reached yet in its generation. All of this space could be used by other sequences in the meantime. If we allocate the KV Cache in one single block, the inflexibility of our sequence length is analogous to external memory fragmentation - despite having space available to accommodate a longer sequence, there is no contiguous gap large enough that will accept it (Fig 8).
Bringing it all Together - Implementing Paged Attention with the KV Cache
The Paged Attention algorithm introduces a paging and virtual memory system for KV cache management. By partitioning the KV cache into pages and allowing sequences to be allocated non-contiguously across these pages, the algorithm removes restrictions on batch size and sequence length. As long as there is enough free space overall, sequences can be distributed anywhere within the allocated memory. This design also caps the maximum internal fragmentation at the page size, rather than tying it to a sequence’s entire length. With on-demand allocation, sequences receive precisely the amount of memory they need at any point in time, ensuring more efficient use of GPU memory.
To implement this approach, we reorder the KV cache dimensions to [total_blocks×block_size×hidden_dims]. For each sequence, we construct a Page Table that tracks which blocks in the paged KV cache belong to which sequence. A scheduler then manages these sequence fragments—allocating new blocks as needed—to efficiently handle growing sequences.
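A minimal sketch of that bookkeeping - a block pool, a per-sequence Page Table, and on-demand allocation - with illustrative sizes and hypothetical function names (real engines such as vLLM are considerably more sophisticated):

```python
import numpy as np

TOTAL_BLOCKS = 1024
BLOCK_SIZE = 16              # tokens per block (page)
HIDDEN_DIMS = 4096           # per-token KV width for one layer (illustrative)

# KV cache reordered to [total_blocks, block_size, hidden_dims]
kv_cache = np.zeros((TOTAL_BLOCKS, BLOCK_SIZE, HIDDEN_DIMS), dtype=np.float16)
free_blocks = list(range(TOTAL_BLOCKS))
page_tables = {}             # sequence_id -> list of physical block indices

def append_token(seq_id, token_index, kv_vector):
    """Write one token's KV entry, allocating a new block on demand."""
    table = page_tables.setdefault(seq_id, [])
    block_offset, slot = divmod(token_index, BLOCK_SIZE)
    if block_offset == len(table):          # sequence grew past its last block
        if not free_blocks:
            raise MemoryError("no free blocks: request must wait or be preempted")
        table.append(free_blocks.pop())     # any free block will do - non-contiguous
    kv_cache[table[block_offset], slot] = kv_vector

def release(seq_id):
    """Return a finished sequence's blocks to the pool."""
    free_blocks.extend(page_tables.pop(seq_id, []))
```

With this layout, the most a sequence can waste is the unfilled tail of its final block - at most BLOCK_SIZE - 1 token slots - rather than an entire maximum-length reservation.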
Because the total physical memory for the cache remains statically allocated, we preserve protection against runtime out-of-memory errors. At the same time, sequence length and batch size are dynamic and determined by the current workload. With these optimizations, we reduce GPU memory waste from 80% to under 4%, allowing us to use that reclaimed memory to increase throughput.
Conclusion
Paged Attention illustrates how a fresh perspective on memory management can dramatically transform LLM inference. By taking inspiration from operating systems and enabling sequences to be stored in non-contiguous “pages”, we can dynamically trade sequence length for batch size without risking out-of-memory errors. This means higher concurrency, improved throughput, and, crucially, up to a 20x reduction in wasted GPU memory. By adopting Paged Attention, businesses can handle a wider variety of workloads with fewer resources, giving them the agility to scale faster, serve more users, and keep costs in check.
Footnotes
1. We use quadratic scaling to mean that the time taken to execute the process is proportional to the square of the number of elements; linear scaling means the time taken is proportional to the number of elements.
2. If we were to use a GPU with 24GB of memory (A10s, 4090s, L4s etc.), the weights of a Llama3.1-8B model would use 16GB, leaving us with 8GB for the KV Cache (ignoring space for intermediate activation values). According to the calculation, this gives us space for around 60,000 concurrent tokens - for a sequence length of 8192 our batch size is capped at 7, and we have no ability to access the model’s full context length of 128k tokens.
3. A sequence of 8192 tokens occupies on the order of a gigabyte of memory, and a 7B model in 16-bit floating point takes up 14GB (7B parameters * fp16 size = 7B * 2 bytes = 14GB). Running a LLaMa 7B in production on an H100 with 80GB of VRAM leaves 14GB for the model weights and 66GB for the KV Cache (ignoring space for intermediate activation values), which gives us capacity to serve approximately 66 users in parallel.