Technology

Best in breed technology powers the TitanML Enterprise Inference Stack

With the TitanML Enterprise Inference Stack, our clients can be confident that they are always using the best inference techniques for enterprise RAG deployments.

Acceleration
Accelerate model inference by 3-12x
01
Response Caching

The TitanML Enterprise Inference Stack gives models access to their previous outputs, so they can fast-forward generation and respond quickly to similar requests, even when those requests aren't identical.
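To illustrate the idea, here is a minimal Python sketch of a response cache. It is not TitanML's implementation: it only reuses outputs for near-identical (whitespace- and case-normalised) prompts, and the `generate` callable is a placeholder for the model.

```python
from collections import OrderedDict

class ResponseCache:
    """Toy LRU cache: reuse previous completions for repeated prompts."""
    def __init__(self, max_entries: int = 1024):
        self._store: OrderedDict[str, str] = OrderedDict()
        self.max_entries = max_entries

    def _key(self, prompt: str) -> str:
        return " ".join(prompt.split()).lower()   # normalise whitespace/case

    def get(self, prompt: str) -> str | None:
        key = self._key(prompt)
        if key in self._store:
            self._store.move_to_end(key)          # mark as recently used
            return self._store[key]
        return None

    def put(self, prompt: str, completion: str) -> None:
        key = self._key(prompt)
        self._store[key] = completion
        self._store.move_to_end(key)
        if len(self._store) > self.max_entries:
            self._store.popitem(last=False)       # evict least recently used

def cached_generate(prompt: str, cache: ResponseCache, generate) -> str:
    """Serve from the cache when possible, otherwise call the model."""
    hit = cache.get(prompt)
    if hit is not None:
        return hit
    completion = generate(prompt)                 # expensive model call
    cache.put(prompt, completion)
    return completion
```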

02
Speculative Decoding

The TitanML Enterprise Inference Stack natively supports speculative decoding: a smaller model drafts responses and a larger model validates and corrects them. Use up to a 10x bigger model with no extra compute resources, blending efficiency with accuracy. Get the best of both worlds: speed and reliability.
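The sketch below shows the draft-and-verify loop in its simplest greedy form. It illustrates the concept only, not the stack's implementation: `draft_model` and `target_model` are placeholder callables assumed to map a `[1, seq]` tensor of token ids to logits of shape `[1, seq, vocab]`.

```python
import torch

def speculative_decode(draft_model, target_model, prompt_ids, k=4, max_new=64):
    """Toy draft-and-verify loop (greedy variant of speculative decoding)."""
    tokens = list(prompt_ids)
    generated = 0
    while generated < max_new:
        # 1. Draft: the small model cheaply proposes k tokens, one at a time.
        draft, ctx = [], list(tokens)
        for _ in range(k):
            logits = draft_model(torch.tensor([ctx]))[0, -1]
            nxt = int(logits.argmax())
            draft.append(nxt)
            ctx.append(nxt)

        # 2. Verify: the large model scores prompt + draft in ONE forward pass.
        target_logits = target_model(torch.tensor([tokens + draft]))[0]

        # 3. Accept the longest prefix the target model agrees with, and at the
        #    first disagreement append the target's own token instead.
        start = len(tokens)
        for i, tok in enumerate(draft):
            target_choice = int(target_logits[start + i - 1].argmax())
            tokens.append(target_choice)
            generated += 1
            if target_choice != tok or generated >= max_new:
                break                      # disagreement: discard the rest
    return tokens
```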

03
Flash attention

Flash attention dramatically improves transformer inference speeds, especially for long input sequences. Looking for quick and efficient model performance? We’ve got you covered. 
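For illustration, PyTorch's fused scaled-dot-product attention can dispatch to a FlashAttention kernel on supported NVIDIA GPUs. This is a generic example of the technique, not the stack's internal code; shapes and sizes are arbitrary.

```python
import torch
import torch.nn.functional as F

# Shapes are (batch, heads, seq, head_dim).
batch, heads, seq, head_dim = 1, 16, 4096, 64
device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32

q = torch.randn(batch, heads, seq, head_dim, device=device, dtype=dtype)
k = torch.randn(batch, heads, seq, head_dim, device=device, dtype=dtype)
v = torch.randn(batch, heads, seq, head_dim, device=device, dtype=dtype)

# Fused attention never materialises the full (seq x seq) score matrix,
# which is where most of the long-sequence speed and memory win comes from.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)  # torch.Size([1, 16, 4096, 64])
```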

04
CUDA graphs

Massively increase language model inference speeds by capturing whole sequences of GPU operations and replaying them in one go, eliminating per-kernel CPU launch overhead.
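A minimal sketch of the pattern using PyTorch's CUDA graph API (requires an NVIDIA GPU). The `Linear` layer stands in for a real model; this is not the stack's internal code.

```python
import torch

device = "cuda"
model = torch.nn.Linear(4096, 4096).to(device).half().eval()
static_input = torch.randn(1, 4096, device=device, dtype=torch.float16)

# Warm-up runs on a side stream are required before capture.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(3):
        with torch.no_grad():
            model(static_input)
torch.cuda.current_stream().wait_stream(s)

# Capture the fixed sequence of kernels once.
graph = torch.cuda.CUDAGraph()
with torch.cuda.graph(graph):
    with torch.no_grad():
        static_output = model(static_input)

# Replay: copy new data into the captured input buffer, then replay the graph
# without launching each kernel from the CPU again.
new_input = torch.randn(1, 4096, device=device, dtype=torch.float16)
static_input.copy_(new_input)
graph.replay()
print(static_output[0, :4])
```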

05
Fused Triton Kernels

Many operations common in LLMs can be fused into a single GPU kernel, making them many times faster. TitanML uses custom kernels written in the Triton DSL to support accelerated inference out of the box, including on non-NVIDIA hardware.
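As an illustration of kernel fusion, here is a small Triton kernel that fuses an elementwise add with a ReLU. It is a generic example (requiring the `triton` package and a supported GPU), not one of TitanML's production kernels.

```python
import torch
import triton
import triton.language as tl

@triton.jit
def fused_add_relu_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK: tl.constexpr):
    # One program instance handles one BLOCK-sized chunk of the tensors.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    # Fusion: the add and the ReLU happen in one kernel, so the intermediate
    # sum never makes a round trip through GPU memory.
    out = tl.maximum(x + y, 0.0)
    tl.store(out_ptr + offsets, out, mask=mask)

def fused_add_relu(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n = x.numel()
    grid = (triton.cdiv(n, 1024),)
    fused_add_relu_kernel[grid](x, y, out, n, BLOCK=1024)
    return out

x = torch.randn(10_000, device="cuda")
y = torch.randn(10_000, device="cuda")
print(torch.allclose(fused_add_relu(x, y), torch.relu(x + y)))
```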

Throughput Optimisations
Confidently scale applications for production
01
Continuous batching

Queued requests are inserted into running batches, minimising the time your requests spend waiting to be processed. This is crucial for keeping GPU utilisation high and making the most of every dollar spent on GPUs.
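A toy scheduler sketch of the idea: `decode_one_token` is a placeholder for a single model decode step, and the real scheduler handles KV-cache memory, prefill, and streaming that this sketch ignores.

```python
import collections

class ContinuousBatcher:
    """Toy scheduler: new requests join the running batch as soon as a slot
    frees up, instead of waiting for the whole batch to finish."""

    def __init__(self, max_batch_size: int = 8):
        self.max_batch_size = max_batch_size
        self.waiting = collections.deque()   # requests not yet scheduled
        self.running = {}                    # request_id -> decoding state

    def submit(self, request_id, prompt_tokens):
        self.waiting.append((request_id, prompt_tokens))

    def step(self, decode_one_token):
        # 1. Fill any free slots from the queue before the next decode step.
        while self.waiting and len(self.running) < self.max_batch_size:
            request_id, prompt_tokens = self.waiting.popleft()
            self.running[request_id] = {"tokens": list(prompt_tokens)}

        # 2. Run one decode step for every active request in the batch.
        finished = []
        for request_id, state in self.running.items():
            token, done = decode_one_token(state["tokens"])
            state["tokens"].append(token)
            if done:
                finished.append(request_id)

        # 3. Retire finished requests; their slots are reused on the next step.
        return {rid: self.running.pop(rid)["tokens"] for rid in finished}
```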

02
Model Sharding For Multi-GPU

Model tensors are sharded across GPUs. This is perfect if you are looking to run large models that don't fit on a single GPU, or to maximise per-token speeds.
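A toy illustration of column-wise tensor sharding across two GPUs (the device names and two-GPU setup are assumptions). Real deployments overlap the compute with collective communication such as all-gather rather than the naive concatenation shown here.

```python
import torch

def column_parallel_matmul(x, weight_shards):
    """Toy tensor parallelism: the weight matrix is split column-wise across
    devices, each shard computes its slice, and the slices are concatenated."""
    partials = [x.to(w.device) @ w for w in weight_shards]
    return torch.cat([p.to(weight_shards[0].device) for p in partials], dim=-1)

hidden, out_features, n_gpus = 4096, 4096, 2
full_weight = torch.randn(hidden, out_features)

# Shard the weight column-wise, one piece per GPU (assumes 2 CUDA devices).
devices = [f"cuda:{i}" for i in range(n_gpus)]
shards = [w.to(d) for w, d in zip(full_weight.chunk(n_gpus, dim=1), devices)]

x = torch.randn(1, hidden)
y = column_parallel_matmul(x, shards)
print(y.shape)  # (1, 4096), same result as x @ full_weight (up to fp error)
```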

03
Multi-threaded Rust server

You never want the server to get in the way of lightning-fast model inference. The TitanML Enterprise Inference Stack's lightweight, multi-threaded Rust server adds minimal overhead and stays fast even under heavy load.

GPU Utilisation
Improve GPU Utilisation when deploying multiple models
01
Multi-model serving

Use the same GPU to serve multiple models, as long as they fit together in GPU memory. Perfect for multi-model applications like RAG, where models are used asynchronously. Never leave your GPUs idle!
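A rough sketch of keeping two models resident on one GPU, as in a RAG pipeline. The `Linear` layers stand in for an embedding model and a generator; none of this is the stack's API.

```python
import torch

device = "cuda"
embedder = torch.nn.Linear(768, 768).to(device).half()      # stand-in embedder
generator = torch.nn.Linear(4096, 4096).to(device).half()   # stand-in generator

# Both models share the card; check how much memory is still free after loading.
free_bytes, total_bytes = torch.cuda.mem_get_info()
print(f"free after loading both models: {free_bytes / 1e9:.1f} GB of {total_bytes / 1e9:.1f} GB")

def rag_step(query_vec: torch.Tensor, prompt_vec: torch.Tensor) -> torch.Tensor:
    """One RAG step: embed the query with model 1, generate with model 2."""
    with torch.no_grad():
        _embedding = embedder(query_vec)   # model 1: retrieval embedding
        return generator(prompt_vec)       # model 2: answer generation
```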

02
Batched LoRA

TitanML enables serving hundreds of fine-tuned models for the cost of just one by deploying the fine-tuned LoRA adapters onto a single Takeoff Server. This dramatically reduces infrastructure requirements, especially in centrally managed deployments.
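A toy sketch of batched LoRA: every request in the batch shares the frozen base weight, and each one indexes its own low-rank adapter. Shapes and names are illustrative, not Takeoff's implementation.

```python
import torch

def batched_lora_forward(x, base_weight, lora_a, lora_b, adapter_ids):
    """All requests share the base weight; each adds its own low-rank delta,
    so hundreds of fine-tunes cost one base model plus tiny adapters."""
    base_out = x @ base_weight                      # shared by every request
    a = lora_a[adapter_ids]                         # (batch, hidden, rank)
    b = lora_b[adapter_ids]                         # (batch, rank, out)
    delta = torch.bmm(torch.bmm(x.unsqueeze(1), a), b).squeeze(1)
    return base_out + delta

hidden, out_dim, rank, n_adapters, batch = 1024, 1024, 8, 100, 4
base_weight = torch.randn(hidden, out_dim)
lora_a = torch.randn(n_adapters, hidden, rank) * 0.01   # one adapter per fine-tune
lora_b = torch.randn(n_adapters, rank, out_dim) * 0.01

x = torch.randn(batch, hidden)
adapter_ids = torch.tensor([3, 17, 3, 42])   # each request picks its own adapter
y = batched_lora_forward(x, base_weight, lora_a, lora_b, adapter_ids)
print(y.shape)  # (4, 1024)
```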

Controller
Minimise unpredictable model outputs with model controllers
01
JSON and Regex controller

Constrain the model's output to fit a fixed JSON or regex schema. Confidently build pipelines around your language model without impacting model latency. Perfect for document extraction workflows.
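A toy sketch of the underlying idea, constrained decoding: at each step, candidate tokens are filtered to those that could still complete to a string matching the schema. It assumes the third-party `regex` package for prefix (partial) matching and is not the stack's implementation.

```python
import regex  # third-party 'regex' package: supports partial (prefix) matches

def allowed_next_tokens(pattern: str, prefix: str, vocab: list[str]) -> list[str]:
    """Keep only tokens whose addition to the output so far could still be
    completed into a string matching the pattern."""
    allowed = []
    for token in vocab:
        candidate = prefix + token
        if regex.fullmatch(pattern, candidate, partial=True) is not None:
            allowed.append(token)          # full match, or a viable prefix
    return allowed

# Example: force the model to emit a tiny JSON object {"score": <digits>}
pattern = r'\{"score": \d+\}'
vocab = ['{"score": ', '1', '23', '}', 'hello', ' ', '"']
print(allowed_next_tokens(pattern, "", vocab))              # ['{"score": ']
print(allowed_next_tokens(pattern, '{"score": 4', vocab))   # ['1', '23', '}']
```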

Compression
Deploy models to smaller and cheaper GPUs with up to 8x model compression
01
Quantisation

Compress models using accuracy-preserving quantisation techniques such as AWQ. Deploy the same model to significantly smaller and cheaper GPUs, or even CPUs.
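A toy sketch of group-wise 4-bit weight quantisation, the general mechanism behind this kind of compression. AWQ additionally rescales salient channels using activation statistics before quantising, which this sketch omits.

```python
import torch

def quantise_int4_groupwise(weight: torch.Tensor, group_size: int = 128):
    """Each group of weights shares one scale; values are rounded to 16 levels."""
    w = weight.reshape(-1, group_size)
    scale = w.abs().amax(dim=1, keepdim=True) / 7.0         # int4 range: -8..7
    q = torch.clamp(torch.round(w / scale), -8, 7)           # 4-bit integers
    return q.to(torch.int8), scale

def dequantise(q: torch.Tensor, scale: torch.Tensor, shape) -> torch.Tensor:
    return (q.float() * scale).reshape(shape)

weight = torch.randn(4096, 4096)
q, scale = quantise_int4_groupwise(weight)
approx = dequantise(q, scale, weight.shape)

# 4-bit weights plus per-group scales take roughly 1/8 the memory of fp32,
# at the cost of a small reconstruction error.
print((weight - approx).abs().mean())
```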