Deep Learning Glossary
AGI
Artificial General Intelligence (AGI), in simple terms, is the point at which artificial intelligence (AI) can perform all human cognitive skills better than the smartest human, including the ability to teach itself.
There are two broad debates within the machine learning field around AGI:
1) How to define the point at which AGI has been reached
2) Whether developing an AGI system is possible
Currently, researchers are using all types of tests (the Turing Test, Steve Wozniak's coffee test, bar exams, CFA exams, medical exams) as ways of measuring whether AI is close to reaching AGI, although no strict criteria exist to determine the point at which AGI will have been achieved. The general consensus remains that AGI has not yet been reached.
Whilst researchers have previously claimed AGI would never be reached, or was 50-100 years away, it is now the openly stated goal of companies including OpenAI, Google DeepMind, and Anthropic. Geoffrey Hinton (one of the godfathers of AI) now believes achieving AGI is likely to be much less than 30 years away, whilst the CEO of Anthropic, Dario Amodei, believes AGI is only 2-3 years away.
Related Articles
API
An API (Application Programming Interface) is a set of rules and protocols which allow different software applications to communicate and interact with each other. It enables developers to access specific features or data from external services, libraries, or platforms, making it useful for building AI-powered applications.
In terms of AI adoption, many enterprises currently rely on API-based model deployments. This is because, historically, proprietary large language models, including GPT-4, have been considered the gold standard, whilst open source models were seen as significantly cheaper but ultimately poor-quality substitutes. Yet in 2023 there were significant improvements in the quality of open source models. In December, Mistral AI’s Mixtral demonstrated significantly better performance than GPT-3.5. As major players, including Middle Eastern nations and Meta, continue to invest heavily in this space, we expect Llama 3 (or an equivalent) to be as good as, if not better than, GPT-4. The point at which open source models are as good as proprietary ones will mark a significant turning point for the industry. It will mean the choice between API-based and self-hosted models is no longer made solely on the basis of model quality, and instead becomes a more complex decision which takes privacy, control, ease of use and cost into account. We therefore expect a significant number of enterprises to move from deploying API-based models to self-hosted ones. Many of our clients have already planned for this eventuality and are now using the Titan Takeoff Inference Server to make the process of self-hosting models as pain-free as possible.
Related Articles
API-based large language models
API-based Generative AI models (including ChatGPT, Bard, Cohere, Claude, LLaMA and PaLM) are hosted in external servers, meaning that every time the model is called, both the data and the responses are sent outside a business' secure environment to the environment where the model is hosted.
Whilst this is an effortless process, it is not the most private and secure form of large language model deployment. Instead, self-hosting is considered the gold standard in terms of private and secure large language model deployments. However, self-hosting is typically considered to be a very complex process. This is why we exist at TitanML: we want enterprises to be able to deploy large language models in the most secure and private environments, effortlessly. The Titan Takeoff Inference Server does just this.
Related Articles
Activation aware quantization (AWQ)
Activation aware quantization (AWQ) is a process for quantizing large language models whilst largely maintaining accuracy, without the memory overhead of quantization aware training. Very large language models are difficult to quantize because of outliers.
Outliers are weights in the network which take on very large values. These large values can skew the distribution of weights at quantization time, making it harder to maintain performance whilst reducing weight precision. AWQ accounts for these outlier values during the quantization process by calculating scale factors to offset them, thereby maintaining model performance.
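The core idea can be illustrated with a heavily simplified sketch. This is not the full AWQ algorithm; the scaling exponent, function names and bit width are illustrative assumptions.

```python
import numpy as np

def awq_style_quantize(W, act_magnitude, alpha=0.5, n_bits=4):
    """Heavily simplified sketch of activation-aware scaling before quantization.

    W: weight matrix of shape (out_features, in_features)
    act_magnitude: average absolute activation per input channel, shape (in_features,)
    """
    # Channels that see large activations get a larger scale, protecting the
    # "salient" weights that interact with them.
    scales = np.clip(act_magnitude, 1e-5, None) ** alpha   # (in_features,)
    W_scaled = W * scales                                   # scale weights up per channel

    # Plain symmetric round-to-nearest quantization of the scaled weights.
    q_max = 2 ** (n_bits - 1) - 1
    step = np.abs(W_scaled).max() / q_max
    W_q = np.round(W_scaled / step).astype(np.int8)
    return W_q, step, scales

def awq_style_matmul(x, W_q, step, scales):
    # At inference the activations are divided by the same per-channel scales,
    # so the overall product x @ W.T is (approximately) preserved.
    return (x / scales) @ (W_q.astype(np.float32) * step).T
```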
Related Articles
Agent
Agents are language models which are given persistent access to tooling and memory in order to solve an open-ended task. Agents might be given access to third-party APIs, code interpreters, or a scratch pad to record previously generated texts and told to use them when appropriate in order to complete a task. An agent operates autonomously (not directly controlled by a human operator).
There are a number of types of agents within machine learning:
1) Reactive: an agent responds to stimuli from its environment to achieve a goal
2) Proactive: an agent takes initiative and plans in advance to achieve a goal
3) Fixed environment: the agent operates within a static set of rules which govern how it should respond
4) Dynamic environment: the rules are constantly changing, requiring the agent to regularly adapt to new circumstances
5) Single agent: a lone agent operating as described above
6) Multi-agent system: where many agents work together to achieve a common goal
Related Articles
Attention
Humans are naturally able to determine the context behind a word which can have multiple meanings (a homonym), for example differentiating between when "spring" means a season or a metal coil. In large language models, this process is handled by "attention". Attention mechanisms are therefore integral to large language models, allowing for their ability to understand and generate natural language.
In machine learning, there are two types of attention typically talked about:
1) Self attention: a core mechanism in models like transformers, it evaluates the importance of elements within the same input sequence (e.g., words in a sentence) by computing attention scores among them. This enables the model to assign varying degrees of importance to each element based on its relationships with others, leading to richer context understanding (a minimal sketch follows this list).
2) Multi head attention: extends self-attention by employing multiple parallel attention mechanisms, or "heads." Each head focuses on different aspects of the input, allowing the model to capture diverse relationships and representations simultaneously. These individual heads are then combined to provide a comprehensive understanding of the data, enhancing model performance in various machine learning tasks.
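A minimal sketch of a single self-attention head, written with plain NumPy; the sequence length and dimensions are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention over one sequence.

    X: (seq_len, d_model) input embeddings
    Wq, Wk, Wv: (d_model, d_head) projection matrices
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv            # project inputs to queries, keys, values
    scores = Q @ K.T / np.sqrt(K.shape[-1])      # how much each token attends to the others
    weights = softmax(scores, axis=-1)           # attention weights sum to 1 per query token
    return weights @ V                           # weighted mix of value vectors

# Multi head attention simply runs several such heads in parallel
# (with different Wq/Wk/Wv) and concatenates their outputs.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 16))                     # 5 tokens, 16-dim embeddings
Wq, Wk, Wv = (rng.normal(size=(16, 8)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)              # shape (5, 8)
```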
Related Articles
Auto regressive model
An auto regressive model generates a sequence one element at a time, with each new element predicted from the elements generated so far. Most modern large language models (LLMs), including the GPT family, are auto regressive: at inference time they repeatedly sample the next token conditioned on the prompt and all previously generated tokens.
Related Articles
Autoscaling
Autoscaling is a cloud computing feature that automatically adjusts the number of computational resources (such as virtual machines) allocated to an application based on its current workload. This ensures optimal performance and cost efficiency as AI workloads fluctuate.
Related Articles
BERT
BERT (Bidirectional Encoder Representations from Transformers) is a language model architecture which is commonly used for classification and embedding tasks. The term is often used interchangeably with encoder-only language models. It was created by machine learning researchers at Google in 2018.
Related Articles
Backpropagation
Backpropagation is arguably the most fundamental building block in a neural network. It was popularized by Rumelhart et al. in a paper entitled "Learning representations by back-propagating errors" [1].
Backpropagation is the process used to calculate the gradients of the weights of a machine learning model from a batch of input data. It is followed by the optimization step, where those gradients are used to update the model.
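A minimal sketch of backpropagation for a tiny two-layer network, with the chain rule written out by hand; in practice, frameworks such as PyTorch compute these gradients automatically. The data and learning rate are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(32, 4))          # batch of inputs
y = rng.normal(size=(32, 1))          # regression targets
W1, W2 = rng.normal(size=(4, 8)), rng.normal(size=(8, 1))

# Forward pass: compute activations and the loss.
h = np.maximum(x @ W1, 0.0)           # hidden layer with ReLU
y_hat = h @ W2                        # output layer
loss = ((y_hat - y) ** 2).mean()

# Backward pass (backpropagation): apply the chain rule layer by layer,
# starting from the loss and working back towards the inputs.
grad_y_hat = 2 * (y_hat - y) / y.size        # dLoss/dy_hat
grad_W2 = h.T @ grad_y_hat                   # dLoss/dW2
grad_h = grad_y_hat @ W2.T                   # dLoss/dh
grad_h[h <= 0] = 0.0                         # gradient through the ReLU
grad_W1 = x.T @ grad_h                       # dLoss/dW1

# Optimization step: use the gradients to update the weights.
lr = 1e-2
W1 -= lr * grad_W1
W2 -= lr * grad_W2
```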
Related Articles
Bandwidth
Bandwidth refers to the maximum data transfer rate of a network or internet connection. In the context of AI, sufficient bandwidth is crucial for transmitting large datasets, facilitating real-time communication with AI models, and ensuring smooth operations.
Related Articles
Batch accumulation
Batch accumulation (also known as gradient accumulation) is a technique for reducing the GPU memory requirements of machine learning training. In a normal training step, the gradients with respect to each parameter of the model are computed for the whole batch and a single update is performed. If the batch is too large, storing the activations and gradients can cause an "out-of-memory" error. With gradient accumulation, you instead accumulate these gradients in place across a set of smaller batches. This trades time for memory, allowing training to proceed as if there were a GPU with more VRAM, at the cost of a longer training time.
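A minimal PyTorch sketch of the idea; the tiny model, random data and hyperparameters are illustrative stand-ins for a real training workload.

```python
import torch
import torch.nn as nn

# Toy model and data standing in for a real workload.
model = nn.Linear(128, 1)
loss_fn = nn.MSELoss()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

accumulation_steps = 8                        # 8 micro-batches ~ 1 "effective" large batch
micro_batches = [(torch.randn(16, 128), torch.randn(16, 1)) for _ in range(32)]

optimizer.zero_grad()
for step, (inputs, targets) in enumerate(micro_batches):
    loss = loss_fn(model(inputs), targets)
    (loss / accumulation_steps).backward()    # gradients accumulate in .grad across micro-batches
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()                      # one weight update per effective batch
        optimizer.zero_grad()
```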
Related Articles
Batch inference
Batch inference is the process of running models with batched inputs to increase throughput.
Related Articles
Batch size
Batch size is simply the number of distinct items in a batch.
Related Articles
Big data
Big data is a term that has been used to describe tools for managing and processing massive data sets. Usually, these tools have to be specifically architected to extract, transform and manage fractions of these huge datasets in an efficient streaming manner.
Related Articles
CI/CD Pipelines
CI/CD pipelines refer to the automated processes of Continuous Integration (CI) and Continuous Delivery or Continuous Deployment (CD). CI involves automatically integrating code changes from multiple contributors into a single software project, usually accompanied by automated testing to ensure code quality. CD extends this by automating the release of the tested changes to a staging or production environment, enabling rapid and reliable software development and deployment.
Related Articles
CPU
The CPU (central processing unit) is the heart of any modern computer. It executes a series of instructions across a small number of processor cores (often fewer than ten).
CPUs can also be used for machine learning inference, but their architecture is less well suited to accelerate massively parallelizable modern machine learning architectures than GPUs. CPUs are, however, significantly cheaper and easier to come by than GPUs, and can be effective for machine learning, especially at inference time, and for small batch sizes.
At TitanML, we offer enterprises the option to utilise CPUs in their AI deployments, following our success in real-time deployment of a state-of-the-art Falcon LLM on a commodity CPU [1].
Related Articles
Chain of thought prompting
Chain of thought prompting is a prompting method which involves asking language models to "think step-by-step" in order to encourage longer outputs with more steps that resemble reasoning about the problem. This method has been shown to improve the output of models for complex reasoning tasks [1].
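A minimal sketch of what such a prompt might look like; the `generate` helper stands in for any LLM call (API client or self-hosted endpoint) and is a hypothetical name.

```python
# Build a chain-of-thought style prompt around a question.
question = "A cafe sells coffee at $3 and tea at $2. I buy 4 coffees and 3 teas. What do I pay?"

cot_prompt = (
    "Answer the question. Think step by step and show your reasoning "
    "before giving the final answer.\n\n"
    f"Question: {question}\n"
    "Reasoning:"
)

answer = generate(cot_prompt)   # hypothetical LLM call
```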
Related Articles
Classification
Classification is the task of placing data into one of a fixed set of buckets.
Related Articles
Classification model
A classification model is a model designed to place inputs into one of a fixed number of buckets.
Related Articles
Cloud computing
Before the advent of cloud computing, companies looking to sell digital services (operating an online store, running a social media platform, etc.) had to buy computing hardware. This hardware required a great deal of expertise to set up, maintain, and troubleshoot. Cloud computing is the name for the paradigm shift the computing industry has undergone over the last few decades: cloud computing providers, usually spun out of large IT-focused companies which had built substantial experience operating computing hardware themselves, began to rent access to this hardware over the network. With cloud computing services, users can pay by the day for virtual machines (VMs) with various capabilities, pay by the gigabyte for storage, or access many other services on-demand. (see on-prem).
Related Articles
Compression
Compression in AI typically refers to model compression. Model compression helps to reduce the size of the neural network, without significant accuracy loss. There are four main types of model compression often used by machine learning engineers:
1) Quantization
2) Pruning
3) Knowledge distillation
4) Low-rank factorization
At TitanML, as part of our Titan Takeoff Inference Server offering, a number of accuracy-preserving compression techniques have been built in, in order to allow for large language model deployment everywhere, including on-prem.
Related Articles
Containerized
Containerization involves encapsulating software in a container with its own operating environment, libraries, and dependencies, ensuring consistency and efficiency across different computing environments. Containers offer lightweight, portable units for application development, deployment, and management, facilitating faster delivery and scalability. This technology isolates applications from the underlying infrastructure, enhancing security, and making it easier for teams to collaborate on and deploy applications regardless of the host system.
Titan Takeoff is a fully containerized solution, making it easy to deploy Generative AI applications in private environments.
Related Articles
Context Length
Context length describes the upper limit of tokens that the model can recall during text generation. A longer context window allows the model to understand long-range dependencies in text better.
Related Articles
Continuous batching
Continuous batching is an algorithm which increases the throughput of large language model (LLM) serving. It allows the size of the batch a machine learning model is working on to grow and shrink dynamically over time. This means responses are served to users more quickly at high load (significantly higher throughput).
Related Articles
Data parallelism
Data parallelism involves running models in parallel on different devices, where each device sees a distinct subset of the data during training or inference. It accelerates the training of deep learning models, reducing training time significantly [1].
It is often confused with task parallelism, but both can be applied in tandem to make the most of available resources, reduce training and deployment times, and optimize the end-to-end machine learning process for better results and faster development. See task parallelism.
Related Articles
Deep learning
Deep learning is the practice of building and training deep neural networks to solve machine learning problems.
Related Articles
Deep neural network (DNN)
A neural network is considered a deep neural network (DNN) when it is made up of multiple layers (typically at least two hidden layers). Neural networks consist of a series of layers. Each layer performs a successive transformation on data which was passed into the model. Layers usually consist of a linear operation, followed by a simple, elementwise nonlinearity. The composition of many such simple operations can be used to build up any data transformation. At the same time, their relative simplicity means they map well to modern computer hardware. (see also deep learning).
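A minimal sketch of such a stack of layers, written with PyTorch's `nn.Sequential`; the layer sizes are illustrative.

```python
import torch.nn as nn

# A small deep neural network: each layer is a linear operation followed
# by a simple elementwise nonlinearity (ReLU).
model = nn.Sequential(
    nn.Linear(784, 256), nn.ReLU(),   # layer 1
    nn.Linear(256, 64),  nn.ReLU(),   # layer 2
    nn.Linear(64, 10),                # output layer (e.g. 10 classes)
)
```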
Related Articles
Distillation
Distillation is the process of using a larger model to train smaller models. This has been shown to be more effective than training small models from scratch [1]. It can involve using intermediate states of the larger model to assist the smaller model, or using large generative models to produce new text from which the smaller model is trained on.
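One common formulation blends the usual hard-label loss with a soft-target term that pushes the student towards the teacher's output distribution. A minimal PyTorch sketch; the temperature and weighting are illustrative.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Blend a KL term (student vs. softened teacher) with ordinary cross-entropy."""
    soft_targets = F.softmax(teacher_logits / T, dim=-1)       # teacher's softened distribution
    soft_preds = F.log_softmax(student_logits / T, dim=-1)     # student's softened log-probs
    kd = F.kl_div(soft_preds, soft_targets, reduction="batchmean") * (T * T)
    ce = F.cross_entropy(student_logits, labels)               # usual hard-label loss
    return alpha * kd + (1 - alpha) * ce
```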
Related Articles
Docker
Docker is an open-source platform that automates the deployment of applications inside software containers, providing an additional layer of abstraction and automation of operating-system-level virtualization on Linux. It enables developers to package applications with all of their dependencies into a standardized unit for software development, ensuring that they work seamlessly in any environment. Docker simplifies the process of creating, deploying, and running applications by using containers. Titan Takeoff is deployed using Docker containers.
Related Articles
Dynamic batching
Dynamic batching is a process of adjusting the batch size run during inference to match the incoming traffic. During times of high traffic, the model runs at large batches to maximize GPU utilization, and during times of low traffic, a lower batch size is used to minimize time spent waiting for additional requests.
Related Articles
Encoder
A machine learning model designed to produce a representation of the input data which can be used for further downstream processing. Encoders are used to populate vector databases, or can be combined with decoder models for generation. They benefit tasks including data compression, anomaly detection, transfer learning, and recommendation systems.
Related Articles
F-score
Accuracy is a common metric for assessing the performance of binary classification models. However, it can sometimes be a difficult metric to interpret properly when the number of examples of different classes is highly unbalanced. This is where the F-score, also known as the F1-score, comes in; it combines precision and recall into a single score to provide a balanced measure of a model's accuracy.
It is calculated as the harmonic mean of precision and recall, and it balances the trade-off between precision (the ratio of true positives to all predicted positives) and recall (the ratio of true positives to all actual positives). This balance is important when dealing with situations where one metric may be favoured over the other.
The formula for calculating the F-score is: F1 = 2 × (precision × recall) / (precision + recall).
The F-score ranges between 0 and 1, with higher values indicating better model performance.
It is particularly useful when you want to strike a balance between precision and recall, such as in information retrieval, medical diagnoses, or fraud detection, where false positives and false negatives have different consequences.
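A minimal sketch of the calculation from raw counts; the numbers are illustrative.

```python
def f1_score(tp, fp, fn):
    """F1 from counts of true positives, false positives and false negatives."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Example: 80 true positives, 20 false positives, 40 false negatives
print(f1_score(80, 20, 40))   # precision 0.8, recall ~0.667, F1 ~0.727
```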
Related Articles
Falcon
Falcon is a family of large language models released by the UAE’s Technology Innovation Institute (TII). Falcon's 40B model was trained on AWS Cloud continuously for two months with 384 GPUs. The pre-training data largely consisted of public web data, with a few data sources taken from research papers and social media conversations.
Offering high performance whilst also being more cost-effective than competitors, Falcon garnered the #1 spot on Hugging Face's open large language model leaderboard at the time of its release, a widely used independent benchmark for open models.
Related Articles
Few shot learning
Few shot learning is the ability of a model to learn new behaviours, having been shown only a "few" examples of the desired behaviour.
It is typically useful for:
1. Scarcity of data: Collecting large amounts of labelled data is impractical and/or costly. Few shot learning makes it feasible to tackle machine learning tasks with limited training examples.
2. Rapid adaptation: Enables models to quickly adapt to new tasks or classes without the need for extensive retraining.
3. Efficient training: Requires less computational resources and time compared to traditional deep learning methods.
4. Generalization: Encourages models to generalize from the limited number of examples available, often leading to more robust and versatile systems that can perform well on other, related tasks.
5. Low resource settings: Particularly helpful in low-resource settings, for example, with certain medical diagnoses, where collecting extensive labelled data is challenging due to privacy concerns and/or the scarcity of experts.
Few shot learning is not to be confused with zero shot learning and one shot learning. See zero shot learning. See one shot learning.
Related Articles
Few shot prompting
Few shot prompting is the ability of a model to learn new behaviours having only been shown a "few" examples of the desired behaviour as part of its input prompt. It can help with a variety of tasks without the need for extensive fine-tuning or training on specific examples. It allows you to provide context, guidance, or examples to steer the model's responses in a desired direction (a short prompt-construction sketch follows the use cases below).
Use cases include:
1) Custom chatbots: For developing custom chatbots and virtual assistants, few shot prompting is helpful when guiding the model's responses in a way that aligns with the specific conversational goals and context.
2) Question answering: In question answering tasks, you can provide a question as the prompt, along with relevant context, to help the model generate accurate answers, even for questions it has not encountered previously.
3) Creative writing and storytelling: Few shot prompting is helpful when used to seed creativity, generate stories, or assist with narrative generation, by providing initial prompts or ideas for the model to then build upon.
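A minimal sketch of building a few-shot prompt for sentiment classification; the example reviews and the `generate` helper (any LLM client) are illustrative assumptions.

```python
# A handful of labelled examples placed directly in the prompt.
examples = [
    ("The delivery was late and the box was damaged.", "negative"),
    ("Fantastic support team, they fixed my issue in minutes.", "positive"),
    ("The product works as described.", "neutral"),
]

new_review = "I love the design, but the battery life is disappointing."

prompt = "Classify the sentiment of each review as positive, negative or neutral.\n\n"
for text, label in examples:
    prompt += f"Review: {text}\nSentiment: {label}\n\n"
prompt += f"Review: {new_review}\nSentiment:"

answer = generate(prompt)   # hypothetical LLM call
```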
Related Articles
Fine tuning
Fine tuning is the process of adapting a pre-trained large language model for a downstream task. It is a useful way to inject new behaviours into a model, like instruction following, reasoning, and question answering.
Related Articles
Foundation model
A foundation model, also known as "general purpose artificial intelligence" or "GPAI", refers to a large-scale, pre-trained machine learning model which serves as a base for further fine tuning and customization. It is usually trained on vast amounts of text data in order to capture general language patterns.
Foundation models form the basis of many applications including:
1) OpenAI’s ChatGPT
2) Microsoft’s Bing
3) Midjourney
4) Adobe Photoshop's generative fill tools
5) Many other chatbots.
Related Articles
GPU
GPUs (graphics processing units) are a type of computing hardware which can be used to perform a series of computations in parallel. This makes them useful in graphics applications, however, they have since become a crucial tool within machine learning. Used in both training and inference, they have significantly accelerated training times for deep learning models, enabling the development of state-of-the-art AI systems.
However, there are a number of challenges associated with GPU usage in machine learning:
1) High costs: High-performance GPUs can be expensive, making them a significant investment for organizations and individuals. This cost can be a barrier for smaller projects or researchers with limited budgets. There have also been NVIDIA GPU shortages, pushing lead times up significantly and increasing GPU prices further
2) Compatibility: GPUs require compatible hardware and software. Ensuring that your machine learning framework and libraries are GPU-accelerated and that your GPU is compatible with these tools can be a challenge
3) Energy usage: GPUs consume a substantial amount of power, leading to increased energy costs for running machine learning workloads on GPU servers or personal machines. This is a concern for both environmental and economic reasons
4) Parallelism: Whilst GPUs excel at parallel processing, not all machine learning algorithms are highly parallelizable. Some tasks may not benefit significantly from GPU acceleration, making it important to choose the right hardware for the job
5) Memory constraints: GPUs have limited memory compared to traditional CPUs. This can become problematic when working with large datasets or deep learning models which require significant memory capacity.
6) Driver and software updates: GPU drivers and software libraries must be kept up to date for optimal performance and compatibility. This maintenance can be time-consuming and therefore costly.
7) Portability: GPUs are typically found in desktop workstations or specialized servers. Deploying GPU-based machine learning models in resource-constrained environments or on edge devices can be challenging due to their size, power consumption, and cost.
8) Vendor lock-ins: Different GPU vendors (e.g., NVIDIA, AMD) may have specific tools and libraries, leading to vendor lock-in concerns. This can limit flexibility and interoperability in the long term.
The Titan Takeoff Inference Server offers solutions to these challenges associated with GPU usage.
Related Articles
Generative AI (GenAI)
Generative models model the whole distribution of the data. Generative AI is the new wave of models which can produce everything from art and language, to music and video. Its ability to produce such diverse and creative content has led to its rapid adoption across industries and revolutionized how humans generate, create and interact with data and media. The likely economic impact of GenAI is considerable, with McKinsey's latest report estimating it could add the equivalent of between $2.6 trillion and $4.4 trillion annually across 63 use cases [1].
Prominent examples of generative AI applications:
1) Text generation: Models like GPT-4 can generate human-like text, making them useful for content creation, chatbots, and automated writing assistance.
2) Image generation: Generative adversarial networks (GANs) can create realistic images, enabling applications in art and design.
3) Music composition: AI algorithms can compose music, mimicking the style of various composers or generating original compositions for multimedia projects.
4) Video synthesis: AI can generate video content, from deepfakes to video game animations - enhancing visual storytelling and entertainment.
5) Drug discovery: AI-driven generative models can suggest new chemical compounds with potential pharmaceutical applications, speeding up drug development.
Related Articles
HIPAA
HIPAA, the Health Insurance Portability and Accountability Act of 1996, is a United States legislation that provides data privacy and security provisions for safeguarding medical information. It sets standards for the protection of sensitive patient health information, ensuring that it is handled with confidentiality and security. HIPAA applies to healthcare providers, health plans, healthcare clearinghouses, and business associates of those entities that process health information.
Related Articles
Hallucination
Hallucination is the tendency of generative models to produce plausible-sounding but ultimately incorrect completions of the prompt.
Machine learning researchers have found a technique called retrieval augmented generation (RAG) to be successful in reducing hallucinations [1]. This is why TitanML has built a plug and play RAG engine into its Titan Takeoff Inference Server.
Related Articles
Human in the loop
Machine learning models can sometimes be unreliable. A human in the loop system is one in which machine learning inferences are assessed continually by a human operator. For example, GitHub's Copilot system has a human in the loop - in that the responses from the code model are accepted or rejected by the human writing the code.
In essence, using a human in the loop approach is valuable in situations where human judgement, expertise, or oversight is essential to enhance the performance, safety, ethics, and overall reliability of AI systems. For example, it is often used for quality assurance, in safety-critical systems, for legal and ethical compliance, in content moderation, anomaly detection and in error recovery. It therefore ensures AI technologies are deployed responsibly and effectively across a wide range of applications and domains.
Related Articles
Inference
Inference in generative AI refers to the process where a trained generative model generates new data samples based on learned patterns and structures. This process involves the model taking input (which can be minimal or even none in some cases) and producing output that aligns with the distribution of the data it was trained on.
For example, in a generative AI model trained on images, inference would be the act of the model creating a new image that resembles the types of images it has seen during training. Similarly, in text-based models like GPT-3 or GPT-4, inference involves generating text that is coherent and contextually appropriate based on the input prompt and the vast amount of text data it was trained on. Inference is the practical application of a generative AI model's learned capabilities, showcasing its ability to create, predict, or simulate data that is new, yet familiar in structure and content to its training set.
Related Articles
Inference Server
Inference servers are the “workhorse” of AI applications: they are the bridge between a trained AI model and real-world, useful applications. An inference server is specialised software that efficiently manages and executes these crucial inference tasks.
The inference server handles requests to process data, runs the model, and returns results. An inference server is deployed on a single ‘node’ (a GPU, or group of GPUs), and it is scaled across nodes for elastic scale through integrations with orchestration tools like Kubernetes. Without an inference server, the model weights and architecture are of little use; it is the inference server that gives us the ability to interact with the model and build it into our application.
Related Articles
Inference optimization
Inference optimization is the process of making machine learning models run quickly at inference time. This might include model compilation, pruning, quantization, or other general purpose code optimizations. The result improves efficiency, speed and resource utilization.
The use of inference optimization matters for several reasons:
1) Efficiency: Optimizing inference ensures predictions are made quickly and with minimal computational resources. This is crucial for applications requiring low latency and real-time responses, such as autonomous vehicles or online recommendation systems.
2) Cost reduction: Efficient inference leads to reduced hardware and operational costs. By using fewer computational resources, organizations can save on infrastructure expenses when deploying machine learning models at scale.
3) Scalability: Optimized inference allows for seamless scalability, enabling models to handle increased workloads and accommodate growing user demands, without sacrificing performance.
4) Energy efficiency: Inference optimization contributes to energy savings, and can lower the operational costs associated with power consumption.
5) Resource compatibility: Models optimized for inference can be deployed on a wide range of hardware, including edge devices with limited computational capabilities, making machine learning more accessible in various contexts.
6) Enhanced user experience: Faster and more efficient inference directly impacts the user experience by reducing waiting times and enabling smoother interactions with AI-powered systems.
7) Deployment flexibility: Optimized models are easier to deploy across various environments, from cloud servers to edge devices, allowing organizations to leverage machine learning in diverse scenarios.
Related Articles
Instruction tuning
Instruction tuning is the process of fine tuning a language model on datasets of instruction-output pairs. The purpose is to make the model more likely to follow instructions given by the user, as opposed to simply continuing the text of the instruction.
Related Articles
Kubernetes
Kubernetes is an open-source platform designed to automate deploying, scaling, and operating application containers. It groups containers that make up an application into logical units for easy management and discovery. Developed by Google, Kubernetes is widely used for cloud-native applications due to its efficiency in managing containerized environments. In AI, Kubernetes is often used to scale Inference Servers over nodes.
Related Articles
https://kubernetes.io/
Language model
A language model is a machine learning model which is trained to be able to model natural language. It learns statistical patterns, relationships, and structures of language by analyzing large datasets of text. This understanding allows it to predict and generate coherent and contextually relevant text. Language models are fundamental components of natural language processing (NLP) systems and are used for various tasks, including language translation, text generation and sentiment analysis.
Related Articles
Large language model
A large language model (LLM) is a specific type of language model characterized by its extensive size, typically measured in terms of the number of parameters (learnable weights) it contains. Large language models have hundreds of millions, and more commonly billions, of parameters. These models are pre-trained on vast amounts of text data to capture a broad and deep understanding of language. Notable examples include GPT-3, BERT, and T5. Large language models are known for their exceptional performance on a wide range of NLP tasks and their ability to generate high-quality text.
Related Articles
Latency
Latency refers to the time taken from when an input is provided to the model until an output is received. Low latency is critical in real-time applications where swift responses are essential as it directly impacts user experience.
Related Articles
LLaVA
LLaVA is a novel end-to-end trained large multimodal model that combines a vision encoder and Vicuna for general-purpose visual and language understanding, achieving impressive chat capabilities mimicking the spirit of the multimodal GPT-4 and setting a new state-of-the-art accuracy on Science QA.
Related Articles
Machine learning (ML)
Machine learning (ML) is a subset of artificial intelligence (AI). It involves the use and development of computer systems which are able to learn and adapt without following explicit instructions, by using algorithms and statistical models to analyze and draw inferences from patterns in data. It encompasses various techniques, including supervised learning, unsupervised learning, reinforcement learning, and deep learning.
Related Articles
Machine learning inference
Machine learning inference is often referred to as "moving a model into production". It is the vital process of using a trained machine learning model to make predictions or classifications on fresh, unprocessed data, thus enabling efficient, cost-effective, and scalable deployments of machine learning solutions.
Related Articles
Mixture of Expert Models (MoE)
Mixture of Expert (MoE) models are a type of conditional computation where parts of the network are activated on a per-example basis. It has been proposed as a way to dramatically increase model capacity without a proportional increase in computation [1].
Related Articles
Model
A model in the context of machine learning is an object which is used to transform input data into insights. Models usually consist of some fixed data encoding their knowledge, and then an algorithm for generating results from the combination of this data, and input data.
There are a number of popular model types. These include:
- Large language models (see large language models definition)
- Deep neural networks (see deep neural networks definition)
- Linear regression
- Logistic regression
- Decision trees
- Linear discriminant analysis
- Naive Bayes
- Support vector machines
- Learning vector quantization
- K-nearest neighbours
- Random forest
Related Articles
Model compilation
Model compilation is an essential step in the deployment of AI models. It is a process applied in some deep learning frameworks to prepare a model for inference. Software frameworks designed to make training machine learning models easy, often leave a lot of inference-time performance on the table because they must be flexible for practitioners to be able to experiment rapidly. Compilation takes the output of a training process and squeezes out this flexibility, leaving only the information required to run the model in inference. In short, it tailors models to the target hardware, which improves efficiency, reduces latency, and enables their use in various devices and applications.
Related Articles
Model monitoring
Model monitoring is an important part of the MLOps pipeline. Observability and monitoring are a key part of building reliable and flexible operations. Model monitoring describes these principles as applied to machine learning models. This might involve subsampling and saving the input data, tracking model performance, producing online accuracy metrics, as well as encompassing techniques from standard DevOps observability best practices.
Model parallelism
Model parallelism is a form of parallelism where a model is divided up to sit across different GPUs. It makes it possible to train and serve models that are too large to fit on a single GPU, and can also increase speed during inference and training.
Not to be confused with data parallelism (see data parallelism definition); both can be applied in tandem to make the most of available resources, reduce training and deployment times, and optimize the end-to-end machine learning process for better results and faster development.
Related Articles
Model serving
Model serving is the process of taking a machine learning model and putting it into a server. A server is a continuously running listener process which waits for requests from end-users, processes them, and then sends responses. This should be distinguished from, for example, batch processing - where a process has a list of data that it churns through on some regular schedule. This paradigm is the foundation of the modern web, and is the main way in which machine learning models are put into production today.
The Titan Takeoff Inference Server is a fast way to run machine learning model inference in a web server.
Related Articles
Multi-GPU inference
Multi-GPU deployments allow the distributed inference of large language models (LLMs) by distributing those models across multiple GPUs. This allows for the inference of larger models and enables larger batch sizes. It is advantageous for applications which require high throughput, reduced latency, and efficient utilization of computational resources.
Related Articles
Natural language processing (NLP)
Natural Language Processing (NLP) is a branch of artificial intelligence (AI) and is the processing of natural language to extract insights from human language, combining the power of computational linguistics and AI.
Typical uses of NLP within AI include:
1) Text analysis: NLP can be used for analyzing and extracting insights from textual data, including sentiment analysis, text summarization and topic modelling.
2) Machine translation: NLP enables automatic translation of text and/or speech from one language into another, benefitting communication across cultures and content localization.
3) Chatbots and virtual assistants: NLP powers chatbots and virtual assistants as it can engage in natural language conversations. This is especially useful for customer support functions, answering questions and automating manual tasks.
4) Search engines: NLP techniques improve the accuracy and relevance of search engine results, thus improving user experience.
5) Speech recognition: NLP has been used in various speech recognition systems, as it enables voice commands and transcription services. It was fundamental in the creation of voice assistants including both Siri and Alexa.
6) Text generation: NLP models' ability to generate human-like text makes them particularly useful for content generation and creative writing tasks.
Related Articles
Natural language understanding (NLU)
Natural language understanding (NLU) refers to tasks which involve processing language without being required to generate new text. Tasks typically include: classification, closed question answering and named entity extraction.
Related Articles
Neural networks
(Artificial) neural networks are a type of machine learning model inspired by biological networks in mammalian brains. These networks of neurons (hence, neural networks) process information by composing trillions of simple computations into massively-connected networks which are capable of performing all kinds of difficult tasks. Artificial neural networks are highly simplified when compared to their biological forebears, but the underlying principle of composing simple computations to produce intelligent results is the same (see deep learning, deep neural networks).
Related Articles
Ngram
An ngram is a set of n adjacent tokens. For example, if each word is a token then "my name is" is a 3-gram. Ngram models, which estimate the probabilities of n-grams of tokens, were once considered state-of-the-art in language modeling, and remain a useful pairing with large language models (LLMs), speculative decoding tools and other applications.
Related Articles
Node
A node refers to a single computer or machine within a larger network of computers that work together. Each node might perform a portion of a larger task in parallel computing. When deploying Generative AI models a node typically refers to a GPU or a defined collection of GPUs.
Related Articles
On-prem
Companies that don't make use of cloud computing services must maintain compute capability themselves, in the form of a large number of networked servers. These servers are housed "on the premises", commonly abbreviated as "on-prem". This is often a result of regulatory or privacy requirements. See Cloud computing.
Related Articles
Perplexity
Perplexity is a measurement of how well a probability distribution or probability model predicts a sample. A low perplexity indicates the model is good at predicting the sample.
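Concretely, perplexity is the exponential of the average negative log-likelihood the model assigns to the observed tokens. A minimal sketch with made-up probabilities:

```python
import numpy as np

# Model probability assigned to each observed token in a short sequence.
token_probs = np.array([0.25, 0.10, 0.60, 0.05])
perplexity = np.exp(-np.mean(np.log(token_probs)))
print(perplexity)   # ~6.0; lower is better
```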
Related Articles
Pretrained model
Pretraining is the process of producing a general purpose, flexible model from a massive corpus of general purpose data. Modern machine learning training (especially in language processing) usually has two phases: pretraining, where the model is taught to understand general language, logic, and conceptual features; and fine tuning, where the model is taught to understand concepts or language specific to a domain, for example finance, construction, or scientific data (see fine tuning).
Related Articles
Prompt engineering
Prompt engineering is the process of writing better prompts to large language models to get the desired output more often. It is a complex language task which requires deep reasoning: it involves closely examining a model's errors, hypothesizing what is missing and/or misleading in the current prompt, and then communicating the task more clearly to the large language model. There are two easy methods for prompting a large language model to improve its output: asking the model to "think step by step", and instructing the model to reflect on its outputs [1].
Pruning
Pruning is a machine learning optimization technique which is applicable to deep neural networks (DNNs). Pruning involves finding weights or neurons in a neural network that do not contribute significantly to the performance of the model, and then removing them. This can improve the processing speed of the model, so long as it is done in a way which can be accelerated by the underlying hardware. It typically results in reduced model size, improved efficiency, better generalization and increased interpretability [1].
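A minimal sketch of one simple variant, unstructured magnitude pruning; the sparsity level and matrix size are illustrative.

```python
import numpy as np

def magnitude_prune(W, sparsity=0.5):
    """Zero out the fraction of weights with the smallest absolute values."""
    threshold = np.quantile(np.abs(W), sparsity)
    mask = np.abs(W) >= threshold
    return W * mask, mask

rng = np.random.default_rng(0)
W = rng.normal(size=(256, 256))
W_pruned, mask = magnitude_prune(W, sparsity=0.9)   # keep only ~10% of weights
print(mask.mean())                                   # fraction of weights kept, ~0.1
```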
Related Articles
Public Cloud
A public cloud is a platform that uses the standard cloud computing model to make resources, such as virtual machines, applications, or storage, available to users over the internet. It's operated by third-party cloud service providers, offering scalability, reliability, and cost-efficiency, where resources are shared among multiple customers and billed on a pay-per-use basis.
Related Articles
Quantization
Quantization is a machine learning optimization technique which is applicable to deep neural networks. Neural networks store a large number of variables (called weights, or parameters) that encode the model's knowledge of the task they are trained to perform. During training, these numbers must be stored at (relatively) high precision to make sure the model learns from the data effectively. At inference time, it is often possible, without a substantial drop in model ability, to decrease the precision with which these weights are stored. This can substantially reduce the amount of space the model takes up in memory, and, with proper optimization, can also speed up the model's processing of incoming data.
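A minimal sketch of symmetric post-training quantization to int8; real implementations are typically per-channel and more careful about outliers.

```python
import numpy as np

def quantize_int8(w):
    """Symmetric round-to-nearest quantization to int8."""
    scale = np.abs(w).max() / 127.0          # map the largest weight to +/-127
    q = np.round(w / scale).astype(np.int8)  # stored at low precision
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale      # approximate reconstruction at runtime

w = np.random.default_rng(0).normal(size=(4, 4)).astype(np.float32)
q, scale = quantize_int8(w)
print(np.abs(w - dequantize(q, scale)).max())   # small quantization error
```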
Related Articles
Quantization aware training
Quantization aware training is an optimization technique for performing quantization without incurring substantial accuracy losses. The goal of quantization aware training is to find the best way to reduce the stored precision of a model with regards to its performance on a data set. To that end, during quantization aware training, quantization proceeds whilst simultaneously attempting to keep the model performance on a fixed dataset the same. (see quantization)
Related Articles
RAG (Retrieval Augmented Generation)
Retrieval Augmented Generation (RAG) is a method for enhancing factuality and groundedness of the outputs of a machine learning model with a corpus. Unconstrained generation from LLMs is prone to hallucination, and finetuning to add capabilities or knowledge to a model can be difficult and error-prone. Allowing access to a corpus of data at model runtime, for example, a company wiki or open source documentation, can add capabilities without requiring finetuning. It can also reduce hallucinations, where information retrieval is preferred by the model over de novo generation. See hallucination.
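A minimal sketch of the pattern; `embed` (any text embedding model) and `generate` (any LLM call) are hypothetical helpers, and the documents are made up.

```python
import numpy as np

documents = ["Takeoff supports GPU and CPU deployment.",
             "The office coffee machine is on the 2nd floor."]

def retrieve(query, documents, k=1):
    """Rank documents by cosine similarity to the query and return the top k."""
    doc_vecs = np.stack([embed(d) for d in documents])   # hypothetical embedding model
    q_vec = embed(query)
    sims = doc_vecs @ q_vec / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q_vec))
    return [documents[i] for i in np.argsort(-sims)[:k]]

query = "Can I run the server on a CPU?"
context = "\n".join(retrieve(query, documents))
prompt = f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {query}"
answer = generate(prompt)   # hypothetical LLM call, now grounded in the retrieved text
```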
Rate Limits
Rate limits in the context of API-accessed Large Language Models (LLMs) like ChatGPT refer to the policies that restrict the number of API requests a user or application can make within a specified time period. These limits are implemented to ensure equitable access, prevent abuse, and maintain the performance and reliability of the service for all users. Exceeding these limits typically results in temporary denial of access until the limit resets. Self-hosted LLMs do not experience the same kind of rate limiting.
Related Articles
Recurrent neural network (RNN)
A recurrent neural network (RNN) processes sequences one-by-one and produces intermediate states which are passed between inferences to maintain a memory of previously seen items in the sequence.
Related Articles
Repetition penalty
Repetition penalty is a factor applied to discourage the model from generating repetitive text or phrases. By adjusting this penalty, users can influence the model's output, reducing the likelihood of it producing redundant or repeated content. A higher repetition penalty generally results in more diverse outputs, whilst a lower value might lead to more repetition.
Related Articles
Rust
Rust is a popular programming language which emphasizes performance, memory safety, and developer productivity. Rust's strong type system and zero-cost abstractions allow developers to write very robust code, whilst its ownership-based memory management (with no garbage collector) means that Rust performance is best-in-class. The Titan Takeoff Inference Server's batching and serving infrastructure are written in Rust.
Related Articles
Sampling temperature
Sampling temperature is a parameter used during the text generation process in large language models (LLMs). It controls the randomness of the model's output. A higher temperature results in more random and diverse outputs, whilst a lower temperature makes the output more deterministic and focused on the most likely predictions.
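A minimal sketch of temperature sampling over a handful of candidate tokens; the logits and temperatures are illustrative.

```python
import numpy as np

def sample_with_temperature(logits, temperature, rng=np.random.default_rng()):
    """Divide logits by the temperature before the softmax.
    T < 1 sharpens the distribution (more deterministic); T > 1 flattens it (more random)."""
    scaled = logits / temperature
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()
    return rng.choice(len(logits), p=probs)

logits = np.array([2.0, 1.0, 0.5, -1.0])      # unnormalized scores over 4 candidate tokens
print(sample_with_temperature(logits, 0.2))   # almost always picks token 0
print(sample_with_temperature(logits, 2.0))   # much more varied choices
```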
Related Articles
Self-hosted models
Self-hosted models are AI models that are run and maintained on a business' own infrastructure rather than relying on third-party providers. It is the most private and secure method of deploying large language models, and often, since there is no reliance on third-party providers, it is significantly cheaper than API-based model deployments at scale.
Typically, self-hosting is considered to be an incredibly complex and time consuming process for machine learning teams to build and maintain. This is why TitanML has built the Titan Takeoff Inference Server - so that machine learning teams can have the benefit of private and secure large language model deployments, effortlessly.
Related Articles
Sentiment analysis
Sentiment analysis, also known as opinion mining, is the process of extracting the sentiment of a body of text. It aims to classify text into different categories or sentiments, such as positive, negative, or neutral, to understand the attitudes, opinions, and emotions conveyed by the author.
Related Articles
Speculative decoding
Speculative decoding is a sampling method for accelerating text generation. It speeds up the process by employing a smaller, cheaper language model to propose candidate tokens; the larger model then verifies these candidates in a single pass and accepts only the tokens it agrees with. Because the larger model checks every accepted token, the output distribution is effectively unchanged; the benefit is speed.
Speculative decoding is typically used to:
1) Reduce the latency of text generation.
2) Increase serving throughput without changing output quality.
3) Make better use of hardware when memory bandwidth, rather than compute, is the bottleneck.
Supervised learning
Supervised learning is machine learning where the training process includes data labelled by some supervisory process, usually a human labeller.
Related Articles
Synthetic data
One of the most expensive and time consuming parts of the MLOps pipeline is the construction of datasets for machine learning training. This is usually performed by human labellers at high cost. The promise of synthetic data is that this data can be constructed by automatic processes. There are various methods for doing so - the most promising of which is to use other machine learning models to construct the data.
Related Articles
Tensor Parallelism
Tensor parallelism is a technique used to distribute a large model across multiple GPUs. For instance, during the multiplication of input tensors with the first weight tensor, the process involves splitting the weight tensor column-wise, multiplying each column block separately with the input, and then concatenating the resulting outputs. These outputs are gathered from the GPUs and combined to produce the final result, as illustrated in the sketch below.
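A minimal sketch of the column-wise split, simulating two "devices" with plain NumPy arrays; the tensor shapes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 16))            # batch of inputs
W = rng.normal(size=(16, 32))           # full weight tensor

W0, W1 = np.split(W, 2, axis=1)         # split the weight column-wise across 2 GPUs
y0 = x @ W0                             # partial result computed on GPU 0
y1 = x @ W1                             # partial result computed on GPU 1
y = np.concatenate([y0, y1], axis=1)    # gather and concatenate the outputs

assert np.allclose(y, x @ W)            # identical to the single-device result
```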
Related Articles
Throughput
Throughput denotes the number of input samples or tasks that a model can process within a specific time frame. It is a measure of the system's capacity and efficiency in handling multiple requests.
Typically, machine learning researchers refer to throughput as being either high or low:
High throughput:
Advantages: High throughput is beneficial when speed, real-time processing, and scalability are critical. It allows AI systems to handle a large number of tasks or data points quickly and efficiently.
Use cases: High throughput is favored in applications such as autonomous vehicles, real-time financial trading, customer support chatbots, content delivery networks, and situations where rapid decision-making is crucial.
Low throughput:
Advantages: Low throughput might be acceptable or even preferable as it may allow for deeper analysis, more complex computations, and a focus on accuracy over speed.
Use cases: Low throughput can be suitable for tasks such as scientific simulations, complex data analysis, research experiments, and applications where precision and thoroughness are prioritized over immediate response times.
Related Articles
Titan Takeoff Inference Server
The Titan Takeoff Inference Server is the flagship product of TitanML. The Titan Takeoff Inference Server is the easiest way to inference self-hosted models locally, applying state-of-the-art techniques in inference optimization and integrations with other software crucial for language models.
Related Articles
Tokenization
Tokenization, in the context of large language models (LLMs), is the process of converting input text into smaller units, or "tokens," which can then be processed by the model. This process is a critical preprocessing step before feeding data to a large language model (LLM), as it ensures the text is in a format the model can understand and process.
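A minimal sketch using the Hugging Face transformers library (assumed to be installed); the checkpoint name is illustrative.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
ids = tokenizer.encode("Tokenization splits text into smaller units.")
print(ids)                                   # integer token ids the model consumes
print(tokenizer.convert_ids_to_tokens(ids))  # e.g. ['[CLS]', 'token', '##ization', ...]
```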
Related Articles
Top K
Top K sampling is a text generation strategy in which the model considers only the top 'K' most likely next tokens for its next word prediction. By restricting the pool of possible tokens, this method ensures the generated content remains coherent and contextually relevant.
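A minimal sketch of top K sampling over a handful of candidate tokens; K and the logits are illustrative.

```python
import numpy as np

def top_k_sample(logits, k, rng=np.random.default_rng()):
    """Keep only the k highest-scoring tokens, renormalize, and sample."""
    top = np.argsort(-logits)[:k]                    # indices of the k most likely tokens
    probs = np.exp(logits[top] - logits[top].max())
    probs /= probs.sum()
    return rng.choice(top, p=probs)

logits = np.array([3.0, 2.5, 0.1, -1.0, -2.0])
print(top_k_sample(logits, k=2))                     # only tokens 0 or 1 can be chosen
```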
Related Articles
Top P
Top P, also known as nucleus sampling, is a method used in large language models (LLMs) to select a subset of possible next tokens where the cumulative probability exceeds a specified threshold 'P'. By sampling from this subset, it ensures a balance between randomness and predictability in a model's outputs. This method offers more dynamic sampling than top K and can lead to more varied and high-quality generated content.
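A minimal sketch of nucleus sampling; the threshold and logits are illustrative.

```python
import numpy as np

def top_p_sample(logits, p=0.9, rng=np.random.default_rng()):
    """Keep the smallest set of tokens whose cumulative probability exceeds p,
    renormalize within that set, and sample."""
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    order = np.argsort(-probs)                              # most likely first
    cutoff = np.searchsorted(np.cumsum(probs[order]), p) + 1
    nucleus = order[:cutoff]                                # the "nucleus" of tokens
    nucleus_probs = probs[nucleus] / probs[nucleus].sum()
    return rng.choice(nucleus, p=nucleus_probs)

logits = np.array([3.0, 2.5, 0.1, -1.0, -2.0])
print(top_p_sample(logits, p=0.9))
```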
Related Articles