Takeoff Serverless LoRA: Efficient inference at scale for fine-tuned models
Serverless, Batched LoRA
This blog introduces the Takeoff Serverless LoRA Inference Engine: a LoRA serving framework that lets you efficiently run inference on hundreds of fine-tuned LoRA modules on a single GPU instance. This can reduce the cost of serving a large number of LoRA adapters by orders of magnitude, unlocking wide-scale fine-tuning of self-hosted language models and enabling the deployment of truly differentiated applications at a fraction of the cost.
Delivering Differentiated AI Applications
One of the most remarkable features of language models is that they can be used out-of-the-box on a wide range of natural language tasks, without being specifically trained on each task. The ability of large models in particular to generalize with prompting alone is part of the reason LLMs have grown in popularity as quickly as they have. However, there is a limit to where prompting alone can take you. Fine-tuning is a more involved way to adapt a language model to a given task, and that complexity often puts off teams who are just starting with language models. Broadly, there are two reasons you might look to fine-tune: reducing cost, and improving on the performance of frontier models.
Cost Optimization
If you are happy with the performance of your frontier model of choice, and its answers are largely to your liking, then using that model to construct a synthetic dataset to train a small language model is an attractive option. Large models are often used because they generalize much better across many tasks simultaneously, meaning their out-of-the-box performance on your task is likely to be better. However, these larger models come at a cost: they are far more expensive than smaller models, run much slower, and have tighter rate limits. Much of this cost is unnecessary; you don't need a model that can write poetry in 500 languages to summarize a report. If you take a smaller model and fine-tune it on the outputs of a large model, you can dramatically lower the overall cost of your model serving with minimal performance impact, by 'throwing away' the parts of the large model that you don't need. This lets you serve your applications at greater scale, at a lower cost, and with a better user experience thanks to the smaller model's faster generation speed.
Improving Frontier Models
If you are unlucky, you might be in a field where even the state-of-the-art models cannot effectively solve your task. In this case fine-tuning, alongside tools like retrieval augmented generation (RAG), can be used to create new abilities in language models so they can solve your task. This is a much harder undertaking, as you cannot simply rely on the cheap-to-generate responses of a larger model as training data, and you will probably have to train a larger model yourself to achieve the required performance. However, if done correctly, this sort of fine-tuning lets you deliver truly differentiated applications that competitors relying on API providers cannot mimic.
Difficulties with Fine-tuning
Challenges when Training
The potential benefits of fine-tuning are clear, but the hurdles that prevent its adoption are significant. First, there are the hurdles of actually performing the fine-tuning: collecting data, running the training, and building custom evaluations to make sure the model behaves as you expect. Over time these barriers to entry are slowly dissolving. Collecting synthetic data from other language models is lowering the cost and difficulty of data collection. API providers like OpenAI typically place restrictions on their model outputs that limit their use for fine-tuning, but powerful open-weights language models like DeepSeek-V3 and Llama-3.1-405B are making it easier to collect high-quality, permissively licensed synthetic data to train smaller models on. A growing number of fine-tuning libraries are also decreasing the difficulty of the training itself; notable examples include Unsloth, Llama Factory, and Axolotl. As these tools improve, it is going to become increasingly common for teams and companies to fine-tune language models to benefit from the cost savings and differentiation this offers.
Challenges when Serving
The challenges don't stop when the model is trained. In a large organisation there might be dozens or hundreds of AI-powered applications that could benefit from fine-tuning, and serving those models becomes its own challenge. With a non-fine-tuned model you can share the cost of serving across multiple applications and benefit from batching their requests together. When you fine-tune a model you lose this benefit: naively, each new model demands its own dedicated resources. This is likely why fine-tuning on API services incurs a high additional per-token cost, or forces you to switch to provisioned throughput rather than on-demand pricing. Fine-tuning GPT-4o-mini on OpenAI doubles the input and output token cost, and fine-tuned GPT-4o costs 50% more. On AWS Bedrock, fine-tuning a model forces you to switch from per-token pricing to paying for compute time. Google doesn't allow fine-tuning of its Gemini Pro models, although its Flash models can be fine-tuned at no additional cost.
The animation below highlights that with full fine-tuning you must provision separate serving resources for each fine-tuned model, which can leave those resources underutilised.

Animation showing the load that three applications place on three finetuned models. Each model is given its own dedicated resources. As a result, each dedicated model is underutilised. Dots from each application represent a call to the associated model.
Parameter Efficient Training and Inference with LoRAs
Low-Rank Adapters (LoRAs) emerged as a way to reduce the cost of fine-tuning language models. Rather than updating every weight, a LoRA trains a pair of small low-rank matrices alongside the frozen base weights, so only a small fraction of the total parameters are updated. This allows you to fine-tune a language model at a fraction of the compute cost of full fine-tuning, and in many cases models fine-tuned with LoRAs are just as performant as those that have undergone full fine-tuning.
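To make this concrete, here is a minimal PyTorch sketch of a LoRA-augmented linear layer (an illustrative example, not the implementation used in Takeoff; the `rank` and `alpha` hyperparameters are the usual LoRA ones):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer plus a trainable low-rank update (illustrative sketch)."""

    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)          # base weights stay frozen
        if self.base.bias is not None:
            self.base.bias.requires_grad_(False)
        # Only these low-rank factors are trained: rank * (in + out) parameters.
        self.lora_A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scaling = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # y = x W^T + scaling * (x A^T) B^T
        return self.base(x) + self.scaling * (x @ self.lora_A.T) @ self.lora_B.T
```

For a 4096x4096 projection at rank 8, the adapter adds roughly 65K trainable parameters on top of ~16.8M frozen ones, which is where the training-time savings come from.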
LoRAs and other Parameter Efficient Fine-Tuning (PEFT) methods are primarily popular for lowering the cost of fine-tuning at train time. However, they are also extremely effective at lowering the cost of serving fine-tuned models. At inference time, multiple LoRAs can share the same base language model, allowing you to serve multiple fine-tuned models on the same hardware. This is the key to unlocking the cost savings of fine-tuning at scale.

(Left) Attaching a single LoRA to a model layer is what you do during parameter-efficient training. (Right) Multiple LoRAs can be attached to a single layer, and a user's request specifies which LoRA to use. The Takeoff Batched LoRA Inference Engine allows any combination of LoRAs to be used in parallel within a single batch. Each LoRA represents a fine-tuned language model.
The impact of being able to serve multiple fine tuned models on the same hardware is significant. You get the benefit of cheaper, more performant fine-tuned models, without the cost of dedicated serving infrastructure for each model.
Serverless LoRA Inference Engine
The Takeoff Serverless LoRA Inference Engine is a way of serving hundreds of LoRA-fine-tuned models on a single piece of hardware. It has two key components that allow it to do this:
- a batched LoRA inference engine
- a LoRA hotswapping mechanism
The batched LoRA inference engine facilitates fast inference of multiple LoRAs in parallel, such that any combination of LoRAs can be served efficiently in a single batch. This means that a pool of available LoRAs can be stored on a GPU and users can query any combination of them in a batch. This is in contrast to an alternative inference mechanism, where requests are micro-batched such that all requests in a microbatch use the same LoRA, and the LoRAs are swapped out between microbatches. This is what is done by default in the canonical open-source PEFT library from Hugging Face.

Swapping LoRAs in and out, as is done in some inference libraries. You get the reduction in GPU count, but the overall throughput of the system is much lower, because you do not benefit from batching and resource sharing.

Batched parallel LoRA inference as is done in the Takeoff Inference Engine. You benefit from using fewer GPUs and maximally utilising the available resources by batching requests, even with different LoRAs.

Using parallel LoRAs in a batch means you can serve the same number of downstream fine-tuned applications with a fraction of the GPUs.
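The sketch below illustrates the idea behind batched parallel LoRA inference (a simplified PyTorch reference, not the Takeoff implementation): each request in the batch carries an adapter index, the corresponding low-rank factors are gathered, and the whole mixed batch is processed in one pass over the shared base weights.

```python
import torch
import torch.nn as nn

class BatchedMultiLoRALinear(nn.Module):
    """Shared base layer with a pool of LoRA adapters, selected per request (illustrative sketch)."""

    def __init__(self, base: nn.Linear, num_adapters: int, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        # One stack of low-rank factors per adapter in the pool.
        self.lora_A = nn.Parameter(torch.randn(num_adapters, rank, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(num_adapters, base.out_features, rank))
        self.scaling = alpha / rank

    def forward(self, x: torch.Tensor, adapter_ids: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, in_features); adapter_ids: (batch,) -- one adapter per request.
        A = self.lora_A[adapter_ids]                    # (batch, rank, in_features)
        B = self.lora_B[adapter_ids]                    # (batch, out_features, rank)
        delta = torch.einsum("bsi,bri->bsr", x, A)      # project down to the LoRA rank
        delta = torch.einsum("bsr,bor->bso", delta, B)  # project back up to the output dim
        return self.base(x) + self.scaling * delta

# A single batch can mix requests for different fine-tuned applications, e.g.:
# layer(x, adapter_ids=torch.tensor([0, 3, 3, 17]))
```

Because the base-layer matmul is shared by every request, adding more adapters to the batch costs only the small low-rank terms, rather than a separate forward pass per fine-tuned model.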
Serving Hundreds of LoRAs
As well as this, to reach the scale of serving hundreds of LoRA modules per device, you need the ability to swap LoRA modules on and off accelerator memory. If you have hundreds of fine-tuned adapters, you don't want rarely accessed ones sitting idle in GPU memory, taking up precious VRAM that could instead support larger batches and longer sequence lengths. To manage this, we built a three-level cache in which LoRAs can sit in GPU memory, CPU memory, or on disk. Frequently accessed LoRAs are more likely to be found on the GPU, ready for very low-latency inference. Less frequently accessed LoRAs, such as those used for batch workloads rather than frequently-used real-time applications, go cold and reside in the CPU or disk cache. When a request for them comes in, they are moved into the GPU cache and used for inference. LoRAs are typically very small, often less than 1% the size of the base model, so their cold-start times are much faster than those expected when loading a full model. For Llama 8B, we measure a LoRA cold start of 70ms when the requested LoRA is in the CPU cache.

The Takeoff Serverless LoRA implementation gives you extremely fast LoRA cold starts, with essentially zero inference speed cost over the base model. You can get this with hundreds of LoRA adapters hosted on a single device.
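As a rough illustration of how such a tiered cache might behave (a simplified, hypothetical sketch in Python, not the Takeoff implementation), adapters are promoted to the GPU tier on access and the least-recently-used ones are demoted towards CPU memory, with disk as the backstop:

```python
from collections import OrderedDict

class TieredLoRACache:
    """Simplified three-tier (GPU / CPU / disk) LoRA cache with LRU eviction. Illustrative only."""

    def __init__(self, gpu_slots: int, cpu_slots: int, load_from_disk):
        self.gpu = OrderedDict()              # adapter_id -> weights on the accelerator (hot)
        self.cpu = OrderedDict()              # adapter_id -> weights in host memory (warm)
        self.gpu_slots = gpu_slots
        self.cpu_slots = cpu_slots
        self.load_from_disk = load_from_disk  # callable: adapter_id -> weights (cold path)

    def get(self, adapter_id):
        if adapter_id in self.gpu:            # hot: already on the GPU, no copy needed
            self.gpu.move_to_end(adapter_id)
            return self.gpu[adapter_id]
        if adapter_id in self.cpu:            # warm: fast host-to-device copy (~tens of ms)
            weights = self.cpu.pop(adapter_id)
        else:                                 # cold: read the small adapter file from disk
            weights = self.load_from_disk(adapter_id)
        self._promote(adapter_id, weights)
        return weights

    def _promote(self, adapter_id, weights):
        if len(self.gpu) >= self.gpu_slots:   # demote the least-recently-used adapter
            evicted_id, evicted = self.gpu.popitem(last=False)
            self.cpu[evicted_id] = evicted
            if len(self.cpu) > self.cpu_slots:
                self.cpu.popitem(last=False)  # fall back to disk (adapters persist there anyway)
        self.gpu[adapter_id] = weights
```

In a real engine the promotion step is a device copy of the adapter tensors, which is why cold starts stay in the tens of milliseconds rather than the seconds-to-minutes needed to load a full model.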
LoRA hotswapping on a single machine gives you the ability to serve hundreds of fine-tuned applications from a single instance. However, the Takeoff Inference Stack isn't limited to a single machine. The Takeoff Stack is designed to operate distributed across many devices, and handles everything from auto-scaling (including scale-to-zero) to load balancing. When deployed at scale, the batched LoRA inference engine with multi-layer caching acts like a low-latency serverless inference stack, where applications can quickly scale up and down to meet demand by requesting more devices to serve their LoRAs. This means you can seamlessly serve hundreds of differentiated AI applications on a small GPU cluster without having to worry about scheduling and provisioning enough compute for each application; this is handled entirely by the LoRA serving framework.

Takeoff Serverless LoRA when deployed across a distributed set of devices. The LoRA caching engine means that as applications scale up, more LoRAs are made hot across the cluster and can be quickly inferenced; as applications shut down, those LoRAs go cold again. Even hitting a cold LoRA is not very costly, because of the low-cost cold start.
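One way to picture this cluster-level behaviour is a simple adapter-affinity router (a hypothetical sketch, not the actual Takeoff scheduler) that prefers replicas where the requested LoRA is already hot and otherwise falls back to the least-loaded replica, relying on the cheap cold start:

```python
def route_request(adapter_id: str, replicas: list[dict]) -> dict:
    """Pick a replica for a request. Each replica is a dict like
    {"name": str, "hot_adapters": set, "load": int}. Hypothetical sketch."""
    # Prefer replicas that already hold the adapter in their GPU cache...
    hot = [r for r in replicas if adapter_id in r["hot_adapters"]]
    candidates = hot if hot else replicas
    # ...otherwise any replica will do: loading a small LoRA is a cheap
    # cold start, not a full model load. Pick the least-loaded candidate.
    return min(candidates, key=lambda r: r["load"])

replicas = [
    {"name": "gpu-0", "hot_adapters": {"support-bot", "summariser"}, "load": 3},
    {"name": "gpu-1", "hot_adapters": {"contract-qa"}, "load": 1},
]
route_request("summariser", replicas)   # -> gpu-0 (adapter already hot)
route_request("translator", replicas)   # -> gpu-1 (least loaded; quick cold start)
```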
Getting Started With Takeoff Serverless LoRA Inference
As language models become more ubiquitous in the enterprise, fine-tuning is going to become a standard way to make use of smaller models and to build applications that work even in very niche fields and use cases. Businesses that go down this path will discover that the difficulty of fine-tuning doesn't end with a trained model, but continues into significant serving and deployment challenges. In this blog we have shown that with the Takeoff Inference Stack you can efficiently serve hundreds of fine-tuned language models at a fraction of the cost of deploying them individually. This is done using the Takeoff Serverless LoRA Inference Engine, which can serve hundreds of LoRAs from a single machine and efficiently scale LoRA availability up and down across multiple machines.
Deploying Enterprise-Grade AI in Your Environment?
Unlock unparalleled performance, security, and customization with the TitanML Enterprise Stack