Optimising LLM Latency: Why Speed Matters In Generative AI
It's an uncomfortable truth in the generative AI space1 that most applications will not see massive load as soon as they're deployed. A lot of effort in the industry goes into improving the throughput of LLM serving: i.e. how many customers your system can serve at once. But, most of the time, the bottleneck to building an effective product is not how many customers you can serve - it's how fast you can serve each one.
When using an API-based model, such as GPT-4 from OpenAI or Claude from Anthropic, you're (mostly) stuck with the system performance the provider gives you. Your performance hinges on the load the provider is handling at the moment you send your request, so if they're currently experiencing heavy load, you wind up with lower performance. This may sound like an unfair tradeoff for the end user, but the game of inference optimization is figuring out how to trade one resource (latency) for another (throughput). This observation has been made before: when people make flashy benchmarks comparing different inference providers, most of what they're measuring is how few requests each system is serving!2
The alternative to using LLM API providers is self-hosting the model. This gives you full control and access to all the knobs and levers: if your application is latency-sensitive, you can tune for latency. In this post, I will discuss latency, which optimizations make sense for reducing it, and how to implement them in practice.
Optimising for latency
Language models are simple creatures - they take in text, they spit out text. We want them to do that quickly, so our applications can run quickly and deliver a positive user experience.
There are lots of ways to make systems faster, but one persistent, storied, and successful trick is to make sure they never do the same work twice. Caching trades space for time, to make sure that this is the case.
Request caching
The simplest way to apply caching to LLMs is to respond to identical requests with identical responses. However, there's a slight subtlety here - LLM inference samples from a distribution of outcomes.3 If we cache LLM outputs and return them again and again, we're freezing that distribution in a meaningful way: picking one sample as representative of the whole distribution.
You'll have to think carefully about whether this makes sense for your application - for a lot of applications it does! This simple feature has a real impact on latency, and saves costs too, by never re-generating an answer to a question you've already answered.
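As a sketch, request caching can be as simple as keying responses on the full request - prompt plus sampling parameters. (The `generate` function here is a hypothetical stand-in for whatever calls your model or API; a production cache would also want an eviction policy and perhaps a TTL.)

```python
import hashlib
import json

def generate(prompt: str, **params) -> str:
    # Placeholder for the expensive call to your model or API provider.
    ...

_cache: dict[str, str] = {}

def cached_generate(prompt: str, **params) -> str:
    # Key on the prompt *and* the sampling parameters: the same prompt at a
    # different temperature is a different request.
    key = hashlib.sha256(
        json.dumps({"prompt": prompt, **params}, sort_keys=True).encode()
    ).hexdigest()
    if key not in _cache:
        _cache[key] = generate(prompt, **params)
    return _cache[key]
```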
The KV cache
Another way to apply caching for LLM inference is inside the architecture itself. Autoregressive decoding reuses the same intermediate results again and again as it proceeds: feeding information from previous tokens into the next step, and the next step, and the next step... Specifically, in the attention mechanism, these values are the keys and the values.
This KV caching mechanism should be included in any sensible LLM serving system: but if you're writing your own model definitions, make sure you pay attention.
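As a toy illustration of the idea - a single attention head, ignoring multiple heads, masking, and batching details - each decode step only computes the projections for the new token and appends them to the cache:

```python
import torch

def attend_with_kv_cache(q, k_new, v_new, cache):
    """One decode step of single-head attention, reusing cached keys/values.

    q, k_new, v_new: (batch, 1, d) projections for the *new* token only.
    cache: dict holding the keys/values of all previously seen tokens.
    """
    # Append this step's key/value instead of recomputing projections for
    # every earlier token from scratch.
    cache["k"] = torch.cat([cache["k"], k_new], dim=1)  # (batch, seq, d)
    cache["v"] = torch.cat([cache["v"], v_new], dim=1)
    scores = q @ cache["k"].transpose(1, 2) / cache["k"].shape[-1] ** 0.5
    return torch.softmax(scores, dim=-1) @ cache["v"]   # (batch, 1, d)
```

(Real implementations pre-allocate the cache rather than concatenating - more on that in the static memory allocation section below.)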
Caching the KV cache
The KV cache is used at every step of the decoding process: the KV cache computed in the prefill4 step is used for generating the first token, then the second token, then the third, and so on... So what happens when the sequence ends - should we just throw away all this cached information? Not if we're smart.
When a second sequence enters our inference system, it will also compute a KV cache as it runs through the inference process. In all the places that it shares a prefix with another sequence, the computed KV cache will be the same. Remember the trick - we never want a system to do the same work twice!
However, implementing this idea is not as simple as just storing the KV cache for sequences when we finish processing them because the KV cache lives in GPU memory - the most precious resource for our inference system. Therefore, we need to manage it carefully, since using our GPU memory for caching means we can't use it for processing new sequences5.
You can imagine all the prompts coming into the system as forming a tree, with branches wherever requests diverge from one another. The branches themselves are the tokens, and the nodes of the tree map to chunks of KV cache6. You can find an explanation of how TitanML uses this insight (alongside another technique called paged attention) to do prefix caching without using any extra GPU memory in our inference server.
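To make the tree concrete, here is a toy sketch of the data structure (an illustration only, not the Takeoff implementation): prompts are split into fixed-size chunks of tokens, each chunk maps to a node holding that chunk's KV cache block, and shared prefixes share nodes, so their KV cache is only ever computed once. Eviction, paged attention, and GPU memory management are all omitted.

```python
class Node:
    def __init__(self):
        self.children = {}    # tuple of token ids -> Node
        self.kv_block = None  # KV cache for this chunk, filled on first use

CHUNK = 16  # tokens per cache block (illustrative)
root = Node()

def lookup_or_insert(tokens):
    """Walk the tree along this prompt's chunks. Returns the nodes whose KV
    cache blocks already exist (reusable) and those the prefill step still
    needs to compute."""
    cached, to_fill, node = [], [], root
    for i in range(0, len(tokens) - len(tokens) % CHUNK, CHUNK):
        chunk = tuple(tokens[i:i + CHUNK])
        node = node.children.setdefault(chunk, Node())
        (cached if node.kv_block is not None else to_fill).append(node)
    return cached, to_fill
```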
Caching at decode time
Once we've finished prefilling our prompt, we can start decoding. Decoding is a serial process: we cannot decode the next token until we've finished with the current one. This is a problem for our accelerators: they're geared to do lots of things at once, not one thing very quickly.
There's a trick that we can use to evade this seriality. LLMs can check whether many tokens are correct, faster than they can generate new ones. If we have a source of plausible next tokens, then we can check them all at once. If more than one is correct, then we get more than one token for the cost of one.
This is predicated on the assumption that we have a good source of plausible next tokens, and a lot of work has gone into finding one. The usual choice is a smaller LLM: you use the small LLM as a draft model, whose outputs are then checked at runtime by the teacher model. There are a few tricky aspects to implementing this approach - most importantly, it's hard to find a good draft model! First, it has to be a smaller model that tends to produce similar outputs to the teacher model. It also must use the same tokenizer, so that its outputs are intelligible to the teacher. Finally, it has to be fast: if it's not faster than the teacher model, then we're not saving any time.
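Here is a minimal sketch of one draft-and-verify step, assuming two Hugging Face style causal language models (a large `target` and a small `draft`), a batch size of one, and greedy acceptance for simplicity. Real implementations reuse KV caches on both models and use a rejection-sampling rule so the output distribution exactly matches the target model's.

```python
import torch

@torch.no_grad()
def speculative_step(target, draft, input_ids, k=4):
    # input_ids: (1, seq_len) token ids.
    # 1. The draft model proposes k tokens, one at a time (greedily).
    draft_ids = input_ids
    for _ in range(k):
        logits = draft(draft_ids).logits[:, -1]
        draft_ids = torch.cat([draft_ids, logits.argmax(-1, keepdim=True)], dim=-1)
    proposed = draft_ids[:, input_ids.shape[1]:]                     # (1, k)

    # 2. One forward pass of the target model checks all k proposals at once.
    target_logits = target(draft_ids).logits[:, input_ids.shape[1] - 1 : -1]
    target_choice = target_logits.argmax(-1)                         # (1, k)

    # 3. Keep the longest prefix the target agrees with; at the first
    #    mismatch (if any), take the target's own token, so we always
    #    gain at least one token per target forward pass.
    n_accepted = int((proposed == target_choice).int().cumprod(dim=-1).sum())
    return torch.cat([input_ids, target_choice[:, : n_accepted + 1]], dim=-1)
```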
If we're building our serving system from scratch, we can take advantage of the fact that we've already served lots of sequences before. If we build a cache of previous outputs, the statistics in that cache give a good sense of what the model has said before for the same inputs. Importantly, the input doesn't have to be exactly the same to be useful. For example, if the last 10 tokens match something the model has seen before, then we can use what it said last time (given that previous short context) as plausible next tokens for what it should say this time. The assumption is that if the model said the same thing the last 10 times it saw this chunk of text, it's probably going to say the same thing this time too.
This kind of caching is implemented in the Takeoff inference server. We call it SSD, Space-like Speculative Decoding7.
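As a rough illustration of this kind of output cache (a toy sketch of the general n-gram lookup idea, not the Takeoff/SSD implementation): record which token followed each n-gram in previously served sequences, and use the most common continuations as draft tokens.

```python
from collections import Counter, defaultdict

NGRAM = 10  # length of the context window we match on

class OutputCache:
    def __init__(self):
        self.next_tokens = defaultdict(Counter)  # n-gram -> counts of the following token

    def record(self, tokens):
        """Call on every finished sequence to accumulate statistics."""
        for i in range(len(tokens) - NGRAM):
            self.next_tokens[tuple(tokens[i:i + NGRAM])][tokens[i + NGRAM]] += 1

    def propose(self, tokens, k=4):
        """Greedily extend the current context with the continuations we've
        seen most often before; returns up to k draft tokens to verify."""
        draft = list(tokens)
        for _ in range(k):
            counts = self.next_tokens.get(tuple(draft[-NGRAM:]))
            if not counts:
                break
            draft.append(counts.most_common(1)[0][0])
        return draft[len(tokens):]
```

These draft tokens can then be fed into the same verification step sketched above, with no draft model forward passes at all.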
Model improvements
Static memory allocation
For a long time, Hugging Face used an implementation of the autoregressive loop that triggered big new CUDA memory allocations on each decode step8. The more effective approach is to allocate all the memory up front, and happily, for some models, Hugging Face now does exactly that. Still, there are performance gains to be made above and beyond Hugging Face's implementations by being careful about your model definitions.
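A sketch of what up-front allocation looks like for the KV cache (the shapes here are illustrative, not tied to any particular model): reserve space for the maximum sequence length once, then write into it in place at each decode step, instead of concatenating and reallocating.

```python
import torch

# Illustrative sizes - pick these from your model config and memory budget.
n_layers, batch, n_heads, max_len, head_dim = 32, 1, 32, 4096, 128

k_cache = torch.zeros(n_layers, batch, n_heads, max_len, head_dim,
                      device="cuda", dtype=torch.float16)
v_cache = torch.zeros_like(k_cache)

def write_step(layer, pos, k_new, v_new):
    # In-place writes: no allocations inside the decode loop, and the tensors
    # keep fixed addresses - which is also what CUDA graphs (below) require.
    k_cache[layer, :, :, pos] = k_new  # k_new, v_new: (batch, n_heads, head_dim)
    v_cache[layer, :, :, pos] = v_new
```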
CUDA graphs & torch compile
One performance optimization that's enabled by static memory allocation is CUDA graphs. A lot of the overhead in pytorch applications comes from the time required to dispatch CUDA kernels from python (via pytorch's C++ bindings). If your workload is very large (i.e. large batch sizes), this launch time is negligible, because of the way pytorch kernel dispatch works: CUDA kernels dispatched to the GPU can run at the same time as code on the CPU. Even kernels that depend on one another (think layers in a neural network) can run correctly in this fashion: they're dispatched to a queue (called a CUDA stream) that the GPU works through.
With large workloads, the work in this stream piles up, the python code runs ahead of the GPU, and the kernel launch times can happen in parallel: they no longer affect the overall speed of the system.
However, if you have a smaller workload, this is no longer the case. GPUs are fast! If you don't give them enough to do, they'll get through their work too quickly, and will sit idle waiting for you to give them more stuff to do. If the GPU is sitting idle while you're dispatching new kernels, then the kernel launch time falls into the end-to-end latency of your system. When this is true, we say that the system is overhead bound.
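A quick (and rough) way to see this in practice: time a loop of small kernels with and without waiting for the GPU to finish. If the two numbers are close, the GPU is keeping up with the python dispatch loop, and your latency is dominated by launch overhead.

```python
import time
import torch

x = torch.randn(8, 512, device="cuda")
w = torch.randn(512, 512, device="cuda")

torch.cuda.synchronize()
t0 = time.perf_counter()
for _ in range(1000):
    x = torch.relu(x @ w)        # kernels are queued asynchronously
t1 = time.perf_counter()         # the CPU has finished *dispatching* here...
torch.cuda.synchronize()
t2 = time.perf_counter()         # ...the GPU has actually finished here
print(f"dispatch time: {t1 - t0:.4f}s, end-to-end time: {t2 - t0:.4f}s")
```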
One solution might be to rewrite your kernel launch and management code in a faster language - say, write your model definitions in C++, and use pytorch's libtorch interface. But most people aren't willing to give up the ergonomics of python that quickly. Development time matters too!
Happily, pytorch gives us the ability to get python out of our hot path. One of the big development priorities in torch 2.0 is torch compile: a JIT compiler that can transform python code (containing torch kernel invocations) into a single pytorch graph. Torch compile has several backends that then generate, from this graph representation, machine code that runs your model inference. One such backend is CUDA graphs.
CUDA graphs are a mechanism devised by NVIDIA to reduce the launch overhead of a series of CUDA kernel invocations. Instead of launching kernels one by one, we can transform our model forward pass (via the torch compiled 'graph') into a single CUDA graph invocation: the whole graph of CUDA operations is encoded into a single object that is sent to the GPU all at once. The GPU then runs the entire forward pass before returning the results to the CPU.
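In practice, the easiest route is torch.compile. The sketch below uses gpt2 purely as a stand-in model: `mode="reduce-overhead"` tells the default inductor backend to use CUDA graphs, while `backend="cudagraphs"` applies CUDA graphs directly without codegen. CUDA graphs need static shapes and statically allocated buffers (see the previous section), so real serving systems typically capture graphs per padded batch shape.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").cuda().eval()

# Compile the forward pass; CUDA graphs are captured on the first few calls
# and replayed (with far less python overhead) on subsequent calls.
compiled = torch.compile(model, mode="reduce-overhead")

inputs = tok("The quick brown fox", return_tensors="pt").to("cuda")
with torch.no_grad():
    logits = compiled(**inputs).logits
```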
What's next?
While we have discussed a few of the techniques that an inference system should use to improve the latency of an LLM deployment, we are excited by the many more that have yet to be discovered. The next factor of 10 in LLM inference speed has the potential to be transformative for new applications.
Luckily, here at TitanML we've already done the hard work for you. Every technique discussed above is implemented in the Titan Takeoff Stack, the enterprise standard for self-hosting language models. Just like with the models - we don't want you to do the same work twice!
With Titan Takeoff, enterprises get access to all the knobs and levers, and the ability to tune their inference stack to meet the needs of the business. Whether that's optimising for latency or serving more users for half the cost, if enterprises want to see true, transformative business value from Generative AI, they must take full control.
1 Really true of any new product.
2 This is especially true when new MoE models come out, since they're especially easy to tune to produce useless throughput and minimal latency.
3 Depending on the sampling mechanism. A simple deterministic algorithm for generating text from an LLM takes the most probable token at each timestep: this is deterministic, modulo floating point & accelerator randomness. But nucleus sampling, temperature, etc. give different outcomes for the same prompt.
4 For the uninitiated: the prefill step is the process of running the prompt sequence through the model up to the point where we want to start generating new tokens. Generating the new tokens is often called decoding.
5 This is fine if you're using a small model and your system is completely latency bound: you'll probably have some spare GPU memory left over. But if you do, you're probably better off using it for a larger model, or keeping it free for serving new requests.
6 This data structure goes by the name of a radix tree, with an important modification: instead of storing values at terminal nodes, we accumulate values along the branches of the tree as we traverse it.
7 Because the model's previous outputs travel forwards in time (instantaneously) along a spacelike curve in spacetime, it's Space-like Speculative Decoding.
8 This is somewhat less of a problem because of torch's caching allocator, but it still leads to unpredictable performance.