When should I fine-tune my LLM? Low-effort strategies that beat fine-tuning
November 2, 2023


Jamie Dborin

What is finetuning?

If you're working with Large Language Models (LLMs), chances are you'll have heard of finetuning as a technique to improve model quality. Even OpenAI's zero-shot GPT models have been finetuned to reach such high levels of performance. But what exactly is finetuning?

Finetuning is the process of training a model (in this case a foundational LLM) on domain- or task-specific data to improve its performance on a downstream task.
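
To make the cost concrete, here is a heavily simplified sketch of what a finetuning run involves, using the Hugging Face transformers library. The model choice, the single training example, and the hyperparameters are illustrative placeholders - a realistic run needs thousands of examples, careful evaluation, and serious GPU time:

```python
# A heavily simplified finetuning sketch with Hugging Face transformers.
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          Trainer, TrainingArguments)

model = AutoModelForCausalLM.from_pretrained("gpt2")  # a small foundation model
tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Domain-specific training text (in practice, thousands of examples).
enc = tokenizer(["Our API rate limit is 100 requests per minute."],
                truncation=True, return_tensors="pt")
dataset = [{"input_ids": enc["input_ids"][0],
            "attention_mask": enc["attention_mask"][0],
            "labels": enc["input_ids"][0]}]  # causal LM: labels = inputs

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="finetuned-out",
                           num_train_epochs=1,
                           per_device_train_batch_size=1),
    train_dataset=dataset,
)
trainer.train()  # the GPU-hungry step the rest of this article tries to avoid
```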

When might you want to finetune?

Generally speaking, you might want to finetune in the following cases:

  • Knowledge injection (your foundation model doesn't know things it needs to know)
  • Output formatting (you need the model's outputs in a certain format)
  • Tone (you want your model to 'talk' in a certain way)
  • Task finetuning (you want your model to chat rather than fill in the gaps)

Despite the generally good performance of many open-source foundational LLMs, these models may not perform as well on specific tasks. Finetuning often pops up as the first solution in these situations.

Difficulties of finetuning

While finetuning can be very useful, it presents significant challenges:

  • Requires significant GPU resources (alongside the associated cost)
  • Requires collecting and labelling high-quality finetuning data
  • Requires specialist skills and infrastructure
  • Needs to be repeated often if the training data changes frequently

We know how challenging finetuning can be, which is why it should be a last resort rather than the first thing you try. So in this article I'm going to explore some alternatives that you can try instead of finetuning.

Using RAG for knowledge injection

One of the key reasons people decide to finetune is that they want their model to reason about things the base model doesn't know - in other words, they want to teach the model extra pieces of information.

One alternative to finetuning for the purpose of knowledge injection is RAG (retrieval augmented generation). This is when you give your model the ability to 'search' a knowledge store where you keep all the relevant information - the results of this search are then passed into the model as 'context'.

This makes the model significantly more accurate and less likely to hallucinate. Another advantage of using RAG over finetuning is that it allows you to reason about constantly changing information - just by updating the vector database, the model will 'know' about the new information.
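
To make the pattern concrete, here is a minimal sketch of the retrieve-then-generate loop. The bag-of-characters embed function and the in-memory 'vector database' are toy stand-ins for a real embedding model and vector store, and the resulting prompt would be sent to whichever LLM you host:

```python
import numpy as np

# Toy knowledge store: in a real system these would be chunks of your documents.
DOCS = [
    "Titan Takeoff supports regex-constrained generation.",
    "Our refund policy allows returns within 30 days.",
    "The API rate limit is 100 requests per minute.",
]

def embed(text: str) -> np.ndarray:
    """Toy embedding (bag of characters). A real system would use a trained
    embedding model, e.g. a sentence-transformer."""
    vec = np.zeros(256)
    for ch in text.lower():
        vec[ord(ch) % 256] += 1.0
    return vec / (np.linalg.norm(vec) + 1e-9)

# The 'vector database': pre-computed embeddings of every document chunk.
DOC_VECS = np.stack([embed(d) for d in DOCS])

def retrieve(query: str, k: int = 1) -> list[str]:
    """Return the k chunks whose embeddings are most similar to the query."""
    sims = DOC_VECS @ embed(query)  # cosine similarity (vectors are unit-norm)
    return [DOCS[i] for i in np.argsort(sims)[::-1][:k]]

def build_prompt(query: str) -> str:
    """Pass the retrieved text to the LLM as context."""
    context = "\n".join(retrieve(query))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

print(build_prompt("What is the rate limit?"))
```

Updating the model's knowledge is then just a matter of re-embedding the changed documents - no training step is involved.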

Why try it?

  • Less likely to hallucinate (make things up)
  • Provides references to sources
  • Allows you to update the information as often as required through the connected vector database

Downsides?

  • Might still not be accurate enough, in which case finetuning may be needed - but it's a good first pass (and can be used in combination with finetuning)

[Figure: the RAG process]

In our experience at TitanML, RAG performs astonishingly well, especially for enterprise use cases where hallucination is very damaging. The Titan Takeoff RAG Engine (currently in beta with development partners) is our way of making RAG better for users who want to self-host their language models. It is a plug-and-play way to create a RAG application entirely from self-hosted components, so you can build and deploy your RAG application with total privacy and transparency.

Using constrained output for output forming

We often see people wanting to use finetuning for extractive workloads, i.e. when they want to extract information from a document. Typically they want the language model response to be in a predictable JSON format.

Currently there are two ways to do this: you can try prompting, or you can finetune. However, neither is ideal, since neither guarantees that the response will be in your desired format.

For this use case we always prefer constrained output generation over finetuning.
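
To illustrate the technique (this is a toy sketch of the general idea, not the Takeoff implementation): constrained generation emits the fixed parts of the schema directly and, wherever the model does get a choice, masks that choice so only characters consistent with the schema can be produced. Here the 'model' is a random-choice stand-in and the two-field JSON template is hypothetical:

```python
import random
import string

# The output template: literal spans are forced, the model only fills slots.
TEMPLATE = [
    ("literal", '{"name": "'),
    ("letters", 8),   # the model fills exactly 8 lowercase letters
    ("literal", '", "age": '),
    ("digits", 2),    # the model fills exactly 2 digits
    ("literal", "}"),
]

def model_next_char(allowed: str) -> str:
    """Stand-in for one LLM decoding step. In real constrained decoding the
    model's logits are masked so only tokens in `allowed` can be sampled."""
    return random.choice(allowed)

def generate() -> str:
    out = []
    for kind, spec in TEMPLATE:
        if kind == "literal":
            out.append(spec)  # forced text: the schema decides, not the model
        elif kind == "letters":
            out.append("".join(model_next_char(string.ascii_lowercase)
                               for _ in range(spec)))
        elif kind == "digits":
            out.append("".join(model_next_char(string.digits)
                               for _ in range(spec)))
    return "".join(out)

print(generate())  # e.g. {"name": "qzkfwmab", "age": 47} - always valid JSON
```

Because invalid tokens can never be sampled, the output matches the schema by construction rather than by probability.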

Why try it?

  • Much easier - all you need to do is write a JSON schema (or a regex)
  • Guaranteed to adhere to the JSON schema every time rather than just increasing the probabilities
  • You can change the schema whenever you want with no extra training

Downsides?

  • Requires more specific prompting including context
  • Still an active area of research
[GIF: demonstration of constrained output generation with the Titan Takeoff Inference Server]

We have built JSON- and regex-controlled generation into our Titan Takeoff Inference Server, so all of our clients can do this kind of controlled generation in a foolproof and easy way. As you can see in the GIF above, all that needs to be done is to specify a regex string. This is perfect for extractive workloads, and our clients love it!

Using a better model and prompt engineering for tone and task finetuning

As a general rule of thumb, the bigger your model is, the better it is at following instructions. Therefore, you might be able to go a long way with prompt engineering and a better model alone. For example, if I want my model to speak in a pirate voice, it might be much easier to get GPT-4 to do this than a LLaMA-2 7B model.

[Screenshot: GPT-4 speaks like a pirate]
[Screenshot: LLaMA-2 has trouble understanding the instructions]
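
As a quick sketch of the prompt-engineering route, here is the pirate example using the OpenAI Python client. The system prompt does all the work; the same pattern applies to any chat model, including a self-hosted one served behind the Titan Takeoff Inference Server:

```python
from openai import OpenAI  # pip install openai

client = OpenAI()  # assumes OPENAI_API_KEY is set in your environment

response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        # The system prompt sets the tone - no finetuning required.
        {"role": "system", "content": (
            "You are a helpful assistant who always answers "
            "in the voice of an old-timey pirate."
        )},
        {"role": "user", "content": "Explain what finetuning an LLM means."},
    ],
)
print(response.choices[0].message.content)
```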

However, you should also consider any deployment requirements here - for example, if you require inference on CPU, then going to the extra effort of finetuning a smaller model might be worth it.

We've tried to make this process as easy as possible in the Titan Takeoff Inference Server - you can get to inference in just a single line of code, making it easy to try out dozens of different models and prompts.

Why try it?

  • Very easy to try out (especially with the Titan Takeoff Inference Server) - worth a shot!

Downsides?

  • The model you deploy is bigger and more expensive than the finetuned model you would otherwise have deployed
  • Might not work well enough

Conclusion

So, it turns out that there are some alternatives to finetuning! Not all of them are guaranteed to work every time, but they are usually simpler, and they should always be tried as a first pass before you collect all of the data and set up all of the infrastructure that finetuning requires. Happy building!
