API-based large language models

API-based generative AI models (including ChatGPT, Bard, Cohere, Claude, LLaMA and PaLM) are hosted on external servers. Every time the model is called, both the input data and the responses are handled outside the business's secure environment, in the environment where the model is hosted.
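As a rough illustration of the point above, the following Python sketch describes the request an application would send and flags whether its destination lies outside a private network. Both URLs, the JSON payload shape, and the helper names are hypothetical and chosen only for this example; they are not the API of any real provider or of the Titan Takeoff Inference Server. Nothing is actually transmitted.

```python
import ipaddress
import json
from urllib.parse import urlparse

# Hypothetical endpoints, used only for illustration.
EXTERNAL_API_URL = "https://api.example-llm-provider.com/v1/generate"
SELF_HOSTED_URL = "http://10.0.0.12:3000/generate"


def leaves_private_network(url: str) -> bool:
    """Return True unless the URL's host is localhost or a private/loopback IP."""
    host = urlparse(url).hostname
    if host == "localhost":
        return False
    try:
        address = ipaddress.ip_address(host)
    except ValueError:
        # A domain name (or missing host): without resolving it,
        # conservatively treat the destination as external.
        return True
    return not (address.is_private or address.is_loopback)


def build_request(url: str, prompt: str) -> dict:
    """Describe the HTTP request that would be sent; nothing is transmitted."""
    return {
        "url": url,
        "body": json.dumps({"prompt": prompt}),
        "data_leaves_environment": leaves_private_network(url),
    }
```

Calling `build_request(EXTERNAL_API_URL, "Summarise this contract")` yields a request whose prompt would travel to a third-party host, whereas the same call with `SELF_HOSTED_URL` targets an address inside the private network. The check is deliberately conservative: any domain name is treated as external because the sketch does not resolve DNS.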

Whilst this is an effortless process, it is not the most private or secure form of large language model deployment. Self-hosting is instead considered the gold standard for private and secure large language model deployments, but it is typically considered a very complex process. This is why we exist at TitanML: we want enterprises to be able to deploy large language models in the most secure and private environments, effortlessly. The Titan Takeoff Inference Server does just this.

API-based and self-hosted large language model deployments therefore typically differ in where the data is processed and in how much effort they take to set up. The Titan Takeoff Inference Server, however, makes self-hosting as easy as API-based deployment.

Check out our other Terms

Tensor Parallelism

Context Length

Rate Limits

HIPAA

Docker

LLaVA

Containerized

Public Cloud

Virtual Private Cloud (VPC)

Self-hosted models

Compression

Bandwidth

Autoscaling

API-based large language models

API

CI/CD Pipelines

Kubernetes

Node

Inference

Inference Server

Mixture of Expert Models (MoE)

Continuous batching

Multi-GPU inference

Zero-shot learning

Unsupervised learning

Weight

Turing Test

Transformer

Training set

Transfer learning

Training data

TPU

Top P

Top K

Tokenization

Token

Throughput

Titan Takeoff Inference Server

Synthetic data

Supervised learning

Serving

Speculative decoding

Sentiment analysis

Sampling temperature

Rust

Recurrent neural network (RNN)

Repetition penalty

RAG (Retrieval Augmented Generation)

Quantization-aware training

Quantization

Pruning

Prompt engineering

Pretrained model

On-prem

Perplexity

Natural language processing (NLP)

N-gram

Natural language understanding (NLU)

Neural networks

Model serving

Model parallelism

Human in the loop

Model monitoring

Model

Model compilation

Mistral

Machine learning (ML)

LLaMA

Latency

Large language model

Language model

Kernel

Inference optimization

Instruction tuning

Machine learning inference

Hallucination

GPT

GPU