1.109B times faster serving of finetuned LLMs (lamini.ai)
2 points by gdiamos on Aug 17, 2023 | 1 comment


This optimization enables you to chain and switch between thousands of finetuned LLMs.

A free notebook that lets you train a model using this optimization: https://colab.research.google.com/drive/1QMeGzR9FnhNJJFmcHtm...

A walkthrough that shows how to finetune a Llama v2 model: https://lamini-ai.github.io/Examples/llama_v2_example

A blog explaining how to support 10,000 finetuned Llama 2 models on one server, and reduce the time needed to switch between them by 1.109 billion times: https://www.lamini.ai/blog/one-billion-times-faster-finetuni...
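For readers unfamiliar with the pattern, here is a minimal sketch of the general idea of keeping one base model resident and swapping lightweight PEFT adapters on top of it, using the Hugging Face peft library. This is not Lamini's actual implementation (see the blog above for that), and the adapter names and paths are hypothetical:

  # Sketch only: one shared base model, many small LoRA adapters attached to it.
  from transformers import AutoModelForCausalLM, AutoTokenizer
  from peft import PeftModel

  base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
  tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

  # Load the base weights once, then attach many adapters (paths are hypothetical).
  model = PeftModel.from_pretrained(base, "adapters/customer-0", adapter_name="customer-0")
  for i in range(1, 100):
      model.load_adapter(f"adapters/customer-{i}", adapter_name=f"customer-{i}")

  # Switching between finetunes becomes an adapter swap, not a full model reload.
  model.set_adapter("customer-42")
  inputs = tokenizer("Hello", return_tensors="pt")
  outputs = model.generate(**inputs, max_new_tokens=20)
  print(tokenizer.decode(outputs[0], skip_special_tokens=True))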

We were motivated to invent this optimization after users trained 5,758 different finetuned models in 3 weeks, and the time needed to switch between so many finetuned models overloaded our servers.

Note that this speeds up inference on servers running multiple finetuned PEFT models. It does not speed up the training phase of finetuning, other than by a modest amount (e.g. 1-3x) from using PEFT. Sorry for the confusing title; this is a new concept, and it is hard to cram this paragraph/blog into a title. Happy to update it if anyone can come up with a clearer title (that isn't a page long).
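For context on the training side, that modest speedup comes from PEFT only training a small fraction of the weights. A minimal sketch of a LoRA-style setup with the Hugging Face peft library (hyperparameters are illustrative, not Lamini's):

  # Sketch only: wrap a base model with a LoRA config so only adapter weights train.
  from transformers import AutoModelForCausalLM
  from peft import LoraConfig, get_peft_model

  base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

  lora_config = LoraConfig(
      r=16,                                  # rank of the low-rank update matrices
      lora_alpha=32,
      target_modules=["q_proj", "v_proj"],   # attach adapters to attention projections
      lora_dropout=0.05,
      task_type="CAUSAL_LM",
  )

  model = get_peft_model(base, lora_config)
  model.print_trainable_parameters()  # only a small fraction of parameters are trainable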



