Takeoff 0.16.0: Enterprise RAG with Enhanced Performance and Expanded Capabilities
TitanML is pleased to announce the release of Takeoff 0.16.0, a significant update that brings substantial performance gains, quality-of-life improvements, and expanded model support to our inference stack.
TL;DR:
- Performance:
  - Improved Batch Scheduling and Memory Management - up to 3.5x throughput improvement across most workloads
  - Prefix Caching - clients are already seeing up to an additional 8x latency improvement in long-input, short-output workloads like RAG
- Ecosystem improvements:
  - Chat templates - swap between models easily; the inference stack formats your requests correctly for whichever model is loaded
  - Schema-less JSON and Multi-Schema JSON - new features that allow more complex and flexible tool usage
- Model support:
  - Llama 3.1 - Takeoff supported the new state-of-the-art Llama 3.1 on its release day, and that support now ships officially in this version!
Performance
Improved Batch Scheduling and Memory Management - up to 3.5x better throughput
Takeoff 0.16.0 introduces a step-change in inference throughput, powered by sophisticated batch scheduling and memory management algorithms. Our internal benchmarking has demonstrated up to 3.5x improvements in general settings. This performance leap is initially available for Llama models, with version 0.16.1 slated to extend these optimizations across our entire supported model ecosystem.
Prefix Caching - Optimized Inference for longer context RAG
Takeoff 0.16.0 implements an advanced prefix caching system, significantly accelerating operations on repetitive or similar content. This feature is particularly beneficial for:
- Multi-shot learning paradigms
- Document-based question-answering systems
- Any use case with high token overlap between queries (>2000 tokens)
Prefix caching is configurable via environment variables, allowing fine-tuned control over cache allocation and management. For production environments with consistent workloads, we've introduced the ability to pin specific prompts in the cache, ensuring optimal performance for high-priority operations.
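As a rough illustration, here is how such deployment-time configuration might look. This is a minimal sketch: the environment variable names and image tag below are hypothetical placeholders, not Takeoff's actual configuration keys, so consult the documentation for your version before using them.

```python
# Hedged sketch: launching a Takeoff container with prefix caching enabled.
# Every name marked "hypothetical" below is illustrative, not a real key.
import subprocess

env_flags = {
    "TAKEOFF_PREFIX_CACHE_ENABLED": "true",  # hypothetical: switch the cache on
    "TAKEOFF_PREFIX_CACHE_SIZE_GB": "8",     # hypothetical: memory reserved for cached prefixes
}

cmd = ["docker", "run", "--gpus", "all", "-p", "3000:3000"]
for key, value in env_flags.items():
    cmd += ["-e", f"{key}={value}"]
# ...plus your usual model and device configuration, then the image name
cmd.append("takeoff-image:0.16.0")  # illustrative image tag

subprocess.run(cmd, check=True)
```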
To illustrate the impact of prefix caching, consider a typical RAG (Retrieval-Augmented Generation) scenario. In many real-world applications, we often see long input contexts (around 2000 tokens) paired with relatively short outputs (around 16 tokens). For example:
- Input (approx. 2000 tokens): This could be a lengthy document or a combination of retrieved passages containing detailed information about a specific topic.
- Output (approx. 16 tokens): A concise answer or summary generated based on the input, such as "The main cause was economic instability."
With prefix caching, Takeoff can significantly speed up processing for these types of workloads. By caching the common prefixes of long inputs, subsequent similar queries can be processed much faster, delivering the latency improvements of up to 8x mentioned earlier. This is particularly beneficial for applications dealing with large documents or repetitive queries over similar data.
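To make the pattern concrete, here is a minimal client-side sketch of that workload, assuming a Takeoff deployment that exposes its OpenAI-compatible API locally; the endpoint URL and model identifier are assumptions to adapt to your setup. With prefix caching enabled, the second and later questions over the same document should return noticeably faster, because the document's cached prefix is reused.

```python
# Sketch: one long shared document prefix, several short questions.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:3000/v1", api_key="not-needed")
MODEL = "meta-llama/Meta-Llama-3.1-8B-Instruct"  # illustrative model id

document = "..."  # the ~2000-token retrieved context, identical across queries
questions = [
    "What was the main cause of the decline?",
    "Which year saw the sharpest drop?",
]

for question in questions:
    start = time.perf_counter()
    response = client.chat.completions.create(
        model=MODEL,
        max_tokens=16,  # short answers, matching the workload described above
        messages=[
            {"role": "system", "content": f"Answer using this document:\n{document}"},
            {"role": "user", "content": question},
        ],
    )
    print(f"{time.perf_counter() - start:.2f}s  {response.choices[0].message.content}")
```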
Ecosystem Improvements
Chat Template - Easily swap between models
To address the complexities of prompt formatting for instruction-tuned models, we've introduced a new /chat_template endpoint. This feature allows structured input of messages and roles, similar to leading cloud AI APIs, ensuring optimal prompt formatting for maximum model performance. For seamless integration, our OpenAI-compatible API leverages this endpoint internally, abstracting away the intricacies of prompt engineering while maintaining full compatibility with existing workflows. This makes swapping between different instruction-tuned models significantly easier.
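As a minimal sketch, a request through the OpenAI-compatible API might look like the following. The base URL, placeholder API key, and model identifier are assumptions to adapt to your deployment; the point is that the messages-and-roles structure stays the same whichever model is loaded, because the server applies the right chat template for you.

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:3000/v1",  # assumed local Takeoff endpoint
    api_key="not-needed",                 # placeholder; Takeoff runs in your own environment
)

# Structured messages and roles: the server formats these into the prompt
# layout the loaded model expects, so swapping models needs no client changes.
response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",  # illustrative model id
    messages=[
        {"role": "system", "content": "You are a concise assistant."},
        {"role": "user", "content": "Summarise the key risks in one sentence."},
    ],
)
print(response.choices[0].message.content)
```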
New JSON Features - Improving model tool use
Takeoff 0.16.0 expands its JSON handling capabilities with two key features:
- Schema-less JSON Generation: Enables dynamic schema creation, ideal for applications with evolving data structures or exploratory data analysis.
- OneOf and AnyOf JSON Schema Support: Facilitates complex decision-making scenarios, particularly useful for sophisticated function-calling implementations where multiple potential tools need to be considered (see the sketch after this list).
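As an illustration of the AnyOf case, the schema below constrains a model's output to one of two tool-call shapes. How a schema is supplied to Takeoff depends on your deployment's API, so this sketch only defines the schema and checks a sample completion against it with the `jsonschema` package; the tool names and fields are invented for the example.

```python
from jsonschema import validate

tool_call_schema = {
    "anyOf": [
        {  # branch 1: a weather lookup call
            "type": "object",
            "properties": {
                "tool": {"const": "get_weather"},
                "city": {"type": "string"},
            },
            "required": ["tool", "city"],
        },
        {  # branch 2: a document search call
            "type": "object",
            "properties": {
                "tool": {"const": "search_docs"},
                "query": {"type": "string"},
                "top_k": {"type": "integer", "minimum": 1},
            },
            "required": ["tool", "query"],
        },
    ]
}

# A generated completion is valid if it matches either branch;
# this one matches the document-search shape.
validate({"tool": "search_docs", "query": "Q3 revenue", "top_k": 3}, tool_call_schema)
```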
Llama 3.1 Support
We supported Llama 3.1 on release day! You can read more about that support here. We are fully committed to supporting all of the models our clients care about - the Takeoff inference stack currently supports hundreds of thousands of models, including LLMs, embedders, rerankers, image processing models, and more. Takeoff is a complete stack for every enterprise looking to build document processing and RAG applications.
Conclusions
Takeoff 0.16.0 represents a significant leap forward in Enterprise AI inference technology, offering substantial performance improvements that directly translate to cost savings and enhanced capabilities for enterprises. With up to 3.5x throughput improvements and 8x latency reductions for RAG workloads, organizations can now process more data faster and more efficiently than ever before. The introduction of advanced features like prefix caching and flexible JSON handling, coupled with our chat template system, provides the adaptability needed to tackle complex AI challenges across various use cases.
Takeoff 0.16.0 is not just an update; it's a comprehensive solution designed to meet the evolving needs of enterprise AI.
We invite you to experience the power of Takeoff 0.16.0 for yourself and see how it can transform your AI operations. Whether you're looking to optimize existing workflows or embark on new AI initiatives, our team is ready to demonstrate how Takeoff can accelerate your journey.
Contact us today to start your one-month free trial and unlock the full potential of your AI infrastructure.
Deploying Enterprise-Grade AI in Your Environment?
Unlock unparalleled performance, security, and customization with the TitanML Enterprise Stack