What is model specialisation, and why do I keep banging on about it?
The phrase ‘use a sledgehammer to crack a nut’ is a comical one, but it is no exaggeration to say that this is exactly how we are treating Large Language Models (LLMs). In this article, I discuss the more appropriate, sustainable alternative: fine-tuned, specialised NLP models (or, to labour the metaphor, the nutcrackers).
Background
Large language models (LLMs), like the ones that power ChatGPT, have been dominating our timelines for the last six months, alongside seemingly never-ending announcements of new startups and corporate products driven by this technology. However, these models are incredibly computationally expensive. The models powering ChatGPT likely run into the hundreds of billions of parameters, requiring an enormous amount of compute and GPUs to keep the show on the road (hence OpenAI's deal with Microsoft Azure).
This computational complexity is part of the genius behind why ChatGPT is so impressive at seemingly everything. However, it is also the very thing that makes LLMs so difficult to deploy, expensive, and slow at inference. Businesses across various sectors are facing enormous challenges getting even much smaller generalist LLMs into production at a reasonable cost/latency trade-off.
What is model specialisation?
Specialised models are designed for just the task at hand, with deployment and hardware constraints in mind. By contrast, models like GPT-4 are fully general: they can do pretty much any task with impressive performance.
For example, ChatGPT is able to do everything from writing love letters to reviewing contracts, whereas most use cases (especially enterprise ones) are incredibly specific, such as categorising German-language CVs. Using a ChatGPT-scale model for this task is like using a sledgehammer to crack open a walnut.
A specialised model in this case would be a dedicated German-language CV classifier that is rubbish at everything else. It would be significantly smaller while still delivering the performance you need for your task: the metaphorical equivalent of using a nutcracker to open the walnut.
Why model specialisation is awesome
Typically, the more capabilities and performance a model has, the more computationally expensive it is, so specialised models tend to be much less computationally intensive. This resolves a lot of the issues of productionising LLMs: specialised models are significantly smaller, faster, and easier to deploy, even on cheaper legacy hardware.
And if done well, these models are just as accurate as (and sometimes more accurate than) their larger generalist counterparts. For example, Markus Leippold showed fairly convincingly that for a financial sentiment analysis task, a specialised, fine-tuned FinBERT significantly outperformed GPT-3 (link; see results below). FinBERT has approximately 1000x fewer parameters than GPT-3, making the cost of inference pennies on the dollar, for better performance. Seems like a no-brainer, right?
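To make the scale difference concrete, here is a minimal sketch of running that kind of specialised sentiment model locally, assuming the publicly available ProsusAI/finbert checkpoint on the Hugging Face Hub (the example sentence and printed output are purely illustrative):

```python
from transformers import pipeline

# FinBERT is a BERT-base-sized model (~110M parameters) fine-tuned for
# financial sentiment, versus GPT-3's 175B parameters.
sentiment = pipeline("text-classification", model="ProsusAI/finbert")

print(sentiment("The company's quarterly earnings beat analyst expectations."))
# Illustrative output: [{'label': 'positive', 'score': 0.95}]
```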
Now, performance gains for specialised models aren't always as impressive as this FinBERT demonstration. Sometimes we see slight accuracy drops, which is normally commercially fine when you're looking at a 99% cost saving! But there is typically a Pareto front in the performance/latency trade-off, and the best models at the smaller end of the spectrum are all highly specialised ones. For the best performance at a given size, specialised models are the winners.
So how does this impact the way that I build my NLP applications?
Specialised models aren’t always the right choice for your application. For instance, if you’re building a generalist chatbot, very large generalist models may be more appropriate. However, if you are building something relatively domain- or task-specific, you’ll almost always do better with an element of specialisation.
So here is how I would suggest building the best deployment-ready NLP model:
- Define your task very clearly.
- Collect some high-quality fine-tuning data.
- Fine-tune a relevant model from an open-source model repository like Hugging Face (a minimal fine-tuning sketch follows this list).
- Compress the model using techniques like knowledge distillation, neural architecture search, quantisation, and pruning.
- Select the compressed model that best suits your required accuracy/latency trade-off.
- Optimise the model for deployment using a graph compiler or optimised runtime like TensorRT, ONNX Runtime, or TVM, depending on your hardware set-up (a rough export-and-quantisation sketch also follows the list).
- Deploy, monitor, and collect data.
- Fine-tune again as frequently as appropriate.
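To make step 3 concrete, here is a minimal fine-tuning sketch using the Hugging Face Trainer API. The base checkpoint, the cv_categories.csv file with its text/label columns, and the label count are all illustrative assumptions rather than a prescribed setup:

```python
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification,
                          AutoTokenizer, Trainer, TrainingArguments)

# A small multilingual encoder as the base model (illustrative choice).
base = "distilbert-base-multilingual-cased"
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForSequenceClassification.from_pretrained(base, num_labels=5)

# Hypothetical labelled data: a CSV with "text" and integer "label" columns.
dataset = load_dataset("csv", data_files="cv_categories.csv")["train"]
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, padding="max_length"),
    batched=True,
)
dataset = dataset.train_test_split(test_size=0.1)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="specialised-classifier",
                           num_train_epochs=3,
                           per_device_train_batch_size=16),
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
)
trainer.train()
trainer.save_model("specialised-classifier")
```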
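And here is a rough sketch of steps 4 and 6: exporting the fine-tuned classifier to ONNX and applying post-training dynamic quantisation with ONNX Runtime. This is just one of many compression/optimisation routes (distillation, pruning, TensorRT, and TVM are others), and the file paths are again assumptions:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from onnxruntime.quantization import quantize_dynamic, QuantType

model = AutoModelForSequenceClassification.from_pretrained("specialised-classifier")
tokenizer = AutoTokenizer.from_pretrained("specialised-classifier")
model.eval()
model.config.return_dict = False  # export plain tensor outputs for tracing

# Trace and export the graph with dynamic batch/sequence dimensions.
example = tokenizer("example input text", return_tensors="pt")
torch.onnx.export(
    model,
    (example["input_ids"], example["attention_mask"]),
    "classifier.onnx",
    input_names=["input_ids", "attention_mask"],
    output_names=["logits"],
    dynamic_axes={"input_ids": {0: "batch", 1: "sequence"},
                  "attention_mask": {0: "batch", 1: "sequence"},
                  "logits": {0: "batch"}},
    opset_version=17,
)

# Store Linear-layer weights in int8; activations are quantised at runtime.
quantize_dynamic("classifier.onnx", "classifier-int8.onnx",
                 weight_type=QuantType.QInt8)
```

The quantised graph can then be benchmarked against the original to pick the point on the accuracy/latency curve you actually need (step 5).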
This process will result in significantly smaller, cheaper, and faster models specialised just for the task that you care about, meaning you only pay for the performance that you really need.
Now, this process isn’t trivial, especially steps 3–6. Fortunately, this is changing. TitanML was created to automate steps 3–6, allowing businesses to build better specialised models easily, in hours. Great solutions are emerging to make the rest of the steps easy as well, which will make building specialised models trivial.
So why am I so passionate about specialised models?
The latest hype over LLMs and NLP models is incredibly exciting, but it is more than just hype. NLP models are going to change the way that we live and work over the next 5 years and will be deeply integrated into almost everything that we do. We will not be able to support this future using models like ChatGPT — there simply aren’t enough GPUs in the world, not to mention the significant energy requirements in an ever-warming world. I fear that this will lead to LLM technology being limited to a small number of use cases just for those who can afford it.
Fortunately, although we are currently cracking open nuts with sledgehammers, we can use nutcrackers instead. To create this exciting future of lives enabled and enriched by NLP technology, we need to be using the nutcrackers: the specialised, resource-efficient models. These models are not only beneficial to the companies using them (they are cheaper, faster, and easier to deploy); perhaps more importantly, they might just be our best hope of really enabling the AI future for everyone.
About TitanML
TitanML enables machine learning teams to effortlessly and efficiently deploy large language models (LLMs). Its flagship product, the Takeoff Inference Server, is already supercharging the deployments of a number of ML teams.
Founded by Dr. James Dborin, Dr. Fergus Finn and Meryem Arik, and backed by key industry partners including AWS and Intel, TitanML is a team of dedicated deep learning engineers on a mission to supercharge the adoption of enterprise AI.
Written by Meryem Arik, Co-founder of TitanML
Deploying Enterprise-Grade AI in Your Environment?
Unlock unparalleled performance, security, and customization with the TitanML Enterprise Stack