Technology

Best in breed technology powers the TitanML Enterprise Inference Stack

With the TitanML Enterprise Inference Stack, our clients can be confident that they are always using the best inference techniques for enterprise RAG deployments.

Acceleration
Accelerate model inference by 3-12x
01
Response Caching

The TitanML Enterprise Inference Stack gives models access to their previous outputs, so they can fast-forward generation and respond quickly to similar requests, even when those requests aren't identical.
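To illustrate the idea, here is a minimal Python sketch of a response cache. It is not TitanML's implementation: it only reuses outputs for near-identical (whitespace- and case-normalised) prompts, and the `generate` callable is a placeholder for the model.

```python
from collections import OrderedDict

class ResponseCache:
    """Toy LRU cache: reuse previous completions for repeated prompts."""
    def __init__(self, max_entries: int = 1024):
        self._store: OrderedDict[str, str] = OrderedDict()
        self.max_entries = max_entries

    def _key(self, prompt: str) -> str:
        return " ".join(prompt.split()).lower()   # normalise whitespace/case

    def get(self, prompt: str) -> str | None:
        key = self._key(prompt)
        if key in self._store:
            self._store.move_to_end(key)          # mark as recently used
            return self._store[key]
        return None

    def put(self, prompt: str, completion: str) -> None:
        key = self._key(prompt)
        self._store[key] = completion
        self._store.move_to_end(key)
        if len(self._store) > self.max_entries:
            self._store.popitem(last=False)       # evict least recently used

def cached_generate(prompt: str, cache: ResponseCache, generate) -> str:
    """Serve from the cache when possible, otherwise call the model."""
    hit = cache.get(prompt)
    if hit is not None:
        return hit
    completion = generate(prompt)                 # expensive model call
    cache.put(prompt, completion)
    return completion
```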

02
Speculative Decoding

The TitanML Enterprise Inference Stack natively supports speculative decoding: a smaller model drafts responses and a larger model validates and corrects them. Use up to a 10x bigger model with no extra compute resources, blending efficiency with accuracy. Get the best of both worlds: speed and reliability.
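The sketch below shows the draft-and-verify loop in its simplest greedy form. It illustrates the concept only, not the stack's implementation: `draft_model` and `target_model` are placeholder callables assumed to map a `[1, seq]` tensor of token ids to logits of shape `[1, seq, vocab]`.

```python
import torch

def speculative_decode(draft_model, target_model, prompt_ids, k=4, max_new=64):
    """Toy draft-and-verify loop (greedy variant of speculative decoding)."""
    tokens = list(prompt_ids)
    generated = 0
    while generated < max_new:
        # 1. Draft: the small model cheaply proposes k tokens, one at a time.
        draft, ctx = [], list(tokens)
        for _ in range(k):
            logits = draft_model(torch.tensor([ctx]))[0, -1]
            nxt = int(logits.argmax())
            draft.append(nxt)
            ctx.append(nxt)

        # 2. Verify: the large model scores prompt + draft in ONE forward pass.
        target_logits = target_model(torch.tensor([tokens + draft]))[0]

        # 3. Accept the longest prefix the target model agrees with, and at the
        #    first disagreement append the target's own token instead.
        start = len(tokens)
        for i, tok in enumerate(draft):
            target_choice = int(target_logits[start + i - 1].argmax())
            tokens.append(target_choice)
            generated += 1
            if target_choice != tok or generated >= max_new:
                break                      # disagreement: discard the rest
    return tokens
```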

03
Flash attention

Flash attention dramatically improves transformer inference speeds, especially for long input sequences. Looking for quick and efficient model performance? We’ve got you covered. 
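For illustration, PyTorch's fused scaled-dot-product attention can dispatch to a FlashAttention kernel on supported NVIDIA GPUs. This is a generic example of the technique, not the stack's internal code; shapes and sizes are arbitrary.

```python
import torch
import torch.nn.functional as F

# Shapes are (batch, heads, seq, head_dim).
batch, heads, seq, head_dim = 1, 16, 4096, 64
device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32

q = torch.randn(batch, heads, seq, head_dim, device=device, dtype=dtype)
k = torch.randn(batch, heads, seq, head_dim, device=device, dtype=dtype)
v = torch.randn(batch, heads, seq, head_dim, device=device, dtype=dtype)

# Fused attention never materialises the full (seq x seq) score matrix,
# which is where most of the long-sequence speed and memory win comes from.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)  # torch.Size([1, 16, 4096, 64])
```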

04
CUDA graphs

Massively increase language model inference speeds by capturing whole sequences of GPU operations and replaying them in one go, eliminating per-kernel CPU launch overhead.
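A minimal sketch of the pattern using PyTorch's CUDA graph API (requires an NVIDIA GPU). The `Linear` layer stands in for a real model; this is not the stack's internal code.

```python
import torch

device = "cuda"
model = torch.nn.Linear(4096, 4096).to(device).half().eval()
static_input = torch.randn(1, 4096, device=device, dtype=torch.float16)

# Warm-up runs on a side stream are required before capture.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(3):
        with torch.no_grad():
            model(static_input)
torch.cuda.current_stream().wait_stream(s)

# Capture the fixed sequence of kernels once.
graph = torch.cuda.CUDAGraph()
with torch.cuda.graph(graph):
    with torch.no_grad():
        static_output = model(static_input)

# Replay: copy new data into the captured input buffer, then replay the graph
# without launching each kernel from the CPU again.
new_input = torch.randn(1, 4096, device=device, dtype=torch.float16)
static_input.copy_(new_input)
graph.replay()
print(static_output[0, :4])
```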

05
Fused Triton Kernels

Many operations common in LLMs can be fused into a single GPU kernel, making them many times faster. TitanML uses custom kernels written in the Triton DSL to support accelerated inference out of the box, including on non-NVIDIA hardware.
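As an illustration of kernel fusion, here is a small Triton kernel that fuses an elementwise add with a ReLU. It is a generic example (requiring the `triton` package and a supported GPU), not one of TitanML's production kernels.

```python
import torch
import triton
import triton.language as tl

@triton.jit
def fused_add_relu_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK: tl.constexpr):
    # One program instance handles one BLOCK-sized chunk of the tensors.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    # Fusion: the add and the ReLU happen in one kernel, so the intermediate
    # sum never makes a round trip through GPU memory.
    out = tl.maximum(x + y, 0.0)
    tl.store(out_ptr + offsets, out, mask=mask)

def fused_add_relu(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n = x.numel()
    grid = (triton.cdiv(n, 1024),)
    fused_add_relu_kernel[grid](x, y, out, n, BLOCK=1024)
    return out

x = torch.randn(10_000, device="cuda")
y = torch.randn(10_000, device="cuda")
print(torch.allclose(fused_add_relu(x, y), torch.relu(x + y)))
```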

Throughput Optimisations
Confidently scale applications for production
01
Continuous batching

Queued requests are inserted into running batches, minimising the time your requests spend waiting to be processed. This is crucial for keeping GPU utilisation high and making the most of every dollar spent on GPUs.
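A toy scheduler sketch of the idea: `decode_one_token` is a placeholder for a single model decode step, and the real scheduler handles KV-cache memory, prefill, and streaming that this sketch ignores.

```python
import collections

class ContinuousBatcher:
    """Toy scheduler: new requests join the running batch as soon as a slot
    frees up, instead of waiting for the whole batch to finish."""

    def __init__(self, max_batch_size: int = 8):
        self.max_batch_size = max_batch_size
        self.waiting = collections.deque()   # requests not yet scheduled
        self.running = {}                    # request_id -> decoding state

    def submit(self, request_id, prompt_tokens):
        self.waiting.append((request_id, prompt_tokens))

    def step(self, decode_one_token):
        # 1. Fill any free slots from the queue before the next decode step.
        while self.waiting and len(self.running) < self.max_batch_size:
            request_id, prompt_tokens = self.waiting.popleft()
            self.running[request_id] = {"tokens": list(prompt_tokens)}

        # 2. Run one decode step for every active request in the batch.
        finished = []
        for request_id, state in self.running.items():
            token, done = decode_one_token(state["tokens"])
            state["tokens"].append(token)
            if done:
                finished.append(request_id)

        # 3. Retire finished requests; their slots are reused on the next step.
        return {rid: self.running.pop(rid)["tokens"] for rid in finished}
```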

02
Model Sharding For Multi-GPU

Model tensors are sharded across GPUs. This is perfect if you are looking to run large models that don't fit on a single GPU, or to maximise per-token speeds.
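A toy illustration of column-wise tensor sharding across two GPUs (the device names and two-GPU setup are assumptions). Real deployments overlap the compute with collective communication such as all-gather rather than the naive concatenation shown here.

```python
import torch

def column_parallel_matmul(x, weight_shards):
    """Toy tensor parallelism: the weight matrix is split column-wise across
    devices, each shard computes its slice, and the slices are concatenated."""
    partials = [x.to(w.device) @ w for w in weight_shards]
    return torch.cat([p.to(weight_shards[0].device) for p in partials], dim=-1)

hidden, out_features, n_gpus = 4096, 4096, 2
full_weight = torch.randn(hidden, out_features)

# Shard the weight column-wise, one piece per GPU (assumes 2 CUDA devices).
devices = [f"cuda:{i}" for i in range(n_gpus)]
shards = [w.to(d) for w, d in zip(full_weight.chunk(n_gpus, dim=1), devices)]

x = torch.randn(1, hidden)
y = column_parallel_matmul(x, shards)
print(y.shape)  # (1, 4096), same result as x @ full_weight (up to fp error)
```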

03
Multi-threaded Rust server

You never want the server to get in the way of lightning-fast model inference. The TitanML Enterprise Inference Stack's lightweight, multi-threaded Rust server adds minimal overhead and stays fast even under heavy load.

GPU Utilisation
Improve GPU Utilisation when deploying multiple models
01
Multi-model serving

Use the same GPU to serve multiple models, as long as they fit together in GPU memory. Perfect for multi-model applications like RAG, where models are used asynchronously. Never leave your GPUs idle!
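A rough sketch of keeping two models resident on one GPU, as in a RAG pipeline. The `Linear` layers stand in for an embedding model and a generator; none of this is the stack's API.

```python
import torch

device = "cuda"
embedder = torch.nn.Linear(768, 768).to(device).half()      # stand-in embedder
generator = torch.nn.Linear(4096, 4096).to(device).half()   # stand-in generator

# Both models share the card; check how much memory is still free after loading.
free_bytes, total_bytes = torch.cuda.mem_get_info()
print(f"free after loading both models: {free_bytes / 1e9:.1f} GB of {total_bytes / 1e9:.1f} GB")

def rag_step(query_vec: torch.Tensor, prompt_vec: torch.Tensor) -> torch.Tensor:
    """One RAG step: embed the query with model 1, generate with model 2."""
    with torch.no_grad():
        _embedding = embedder(query_vec)   # model 1: retrieval embedding
        return generator(prompt_vec)       # model 2: answer generation
```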

02
Batched LoRA

TitanML enables serving hundreds of fine-tuned models for the cost of just one by deploying the fine-tuned LoRA adapters onto a single Takeoff Server. This dramatically reduces infrastructure requirements, especially in centrally managed deployments.
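A toy sketch of batched LoRA: every request in the batch shares the frozen base weight, and each one indexes its own low-rank adapter. Shapes and names are illustrative, not Takeoff's implementation.

```python
import torch

def batched_lora_forward(x, base_weight, lora_a, lora_b, adapter_ids):
    """All requests share the base weight; each adds its own low-rank delta,
    so hundreds of fine-tunes cost one base model plus tiny adapters."""
    base_out = x @ base_weight                      # shared by every request
    a = lora_a[adapter_ids]                         # (batch, hidden, rank)
    b = lora_b[adapter_ids]                         # (batch, rank, out)
    delta = torch.bmm(torch.bmm(x.unsqueeze(1), a), b).squeeze(1)
    return base_out + delta

hidden, out_dim, rank, n_adapters, batch = 1024, 1024, 8, 100, 4
base_weight = torch.randn(hidden, out_dim)
lora_a = torch.randn(n_adapters, hidden, rank) * 0.01   # one adapter per fine-tune
lora_b = torch.randn(n_adapters, rank, out_dim) * 0.01

x = torch.randn(batch, hidden)
adapter_ids = torch.tensor([3, 17, 3, 42])   # each request picks its own adapter
y = batched_lora_forward(x, base_weight, lora_a, lora_b, adapter_ids)
print(y.shape)  # (4, 1024)
```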

Controller
Minimise unpredictable model outputs with model controllers
01
JSON and Regex controller

Constrain the model's output to fit a fixed JSON or regex schema. Confidently build pipelines around your language model without impacting model latency. Perfect for document extraction workflows.
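A toy sketch of the underlying idea, constrained decoding: at each step, candidate tokens are filtered to those that could still complete to a string matching the schema. It assumes the third-party `regex` package for prefix (partial) matching and is not the stack's implementation.

```python
import regex  # third-party 'regex' package: supports partial (prefix) matches

def allowed_next_tokens(pattern: str, prefix: str, vocab: list[str]) -> list[str]:
    """Keep only tokens whose addition to the output so far could still be
    completed into a string matching the pattern."""
    allowed = []
    for token in vocab:
        candidate = prefix + token
        if regex.fullmatch(pattern, candidate, partial=True) is not None:
            allowed.append(token)          # full match, or a viable prefix
    return allowed

# Example: force the model to emit a tiny JSON object {"score": <digits>}
pattern = r'\{"score": \d+\}'
vocab = ['{"score": ', '1', '23', '}', 'hello', ' ', '"']
print(allowed_next_tokens(pattern, "", vocab))              # ['{"score": ']
print(allowed_next_tokens(pattern, '{"score": 4', vocab))   # ['1', '23', '}']
```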

Compression
Deploy models to smaller and cheaper GPUs with up to 8x model compression
01
Quantisation

Compress models using accuracy-preserving quantisation techniques such as AWQ. Deploy the same model to significantly smaller and cheaper GPUs, or even CPUs.
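A toy sketch of group-wise 4-bit weight quantisation, the general mechanism behind this kind of compression. AWQ additionally rescales salient channels using activation statistics before quantising, which this sketch omits.

```python
import torch

def quantise_int4_groupwise(weight: torch.Tensor, group_size: int = 128):
    """Each group of weights shares one scale; values are rounded to 16 levels."""
    w = weight.reshape(-1, group_size)
    scale = w.abs().amax(dim=1, keepdim=True) / 7.0         # int4 range: -8..7
    q = torch.clamp(torch.round(w / scale), -8, 7)           # 4-bit integers
    return q.to(torch.int8), scale

def dequantise(q: torch.Tensor, scale: torch.Tensor, shape) -> torch.Tensor:
    return (q.float() * scale).reshape(shape)

weight = torch.randn(4096, 4096)
q, scale = quantise_int4_groupwise(weight)
approx = dequantise(q, scale, weight.shape)

# 4-bit weights plus per-group scales take roughly 1/8 the memory of fp32,
# at the cost of a small reconstruction error.
print((weight - approx).abs().mean())
```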