We're excited to announce the release of TitanML's Takeoff Inference v0.11, which includes several new capabilities to improve performance and usability.
Reranking and Classification Endpoints
We've added a new "/classify" endpoint that supports text classification tasks such as sentiment analysis, natural language inference, and reranking. The endpoint lets you use the full sequence representations from encoder models such as BERT and T5 to score document relevance for retrieval.
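As a sketch of how reranking via a classification endpoint fits together: a cross-encoder scores each (query, document) pair as one classification input. The payload field names below are illustrative assumptions, not the documented Takeoff request schema, so check the API reference for the exact format.

```python
import json

def build_rerank_payload(query, documents):
    """Build an illustrative request body for a classification-style reranker.

    NOTE: the "text" field and the pair layout are assumptions for this
    sketch, not the official /classify schema.
    """
    # Cross-encoder rerankers score (query, document) pairs, so each
    # pair becomes one classification input.
    return {"text": [[query, doc] for doc in documents]}

payload = build_rerank_payload(
    "what is the capital of France?",
    ["Paris is the capital of France.", "The Nile is a river in Africa."],
)
body = json.dumps(payload)
# The request itself would then be an HTTP POST to the /classify
# endpoint, e.g. requests.post(f"{server_url}/classify", json=payload),
# where server_url is your Takeoff server's address.
```

The scores returned for each pair can then be used to sort documents by relevance before they are passed to a generation model.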
CUDA Graph Caching
CUDA graphs can accelerate inference but consume additional memory. To optimize this tradeoff, we've implemented an LRU cache that stores a capped number of CUDA graphs. This improves average throughput while reducing the chance of out-of-memory errors on longer sequences.
Smaller Container Image
By refactoring some dependencies, we've significantly reduced the container image size compared to the previous version. This allows installation on more resource-constrained systems without compromising model support.
Contact us if you have any questions or suggestions! We look forward to hearing your feedback and feature requests.
Deploying Enterprise-Grade AI in Your Environment?
Unlock unparalleled performance, security, and customization with the TitanML Enterprise Stack.