I'm really excited by the concept of local LLMs; we give too much of our data to the cloud. We should embrace local-first principles with all these new AI tools.
Sure, local inference is harder than in the cloud, but hardware is getting better all the time, and we are still early on the optimisation curve when it comes to LLMs.
I'm looking forward to seeing smaller, less resource-intensive models that are easier to run locally, even on mobile.
Does anyone know of any research into "trimming" or "stripping" less used parts of an LLM, so that you can take trained weights and make them smaller (obviously with some sort of loss)?
This particular example is built on top of llama.cpp. It has a few benefits:
1. It's hopefully easier to install (though still not nearly easy enough)
2. All prompts you send through it - along with their responses - are automatically logged to a SQLite database. This is fantastic for running experiments and figuring out what kinds of things work.
3. The same LLM tool works for other models as well - you can run "llm -m $MODEL $PROMPT" against OpenAI models, Anthropic models, other self-hosted models, models hosted on Replicate - all handled by plugins, which should make it really easy to add support for other models too. See the example workflow after this list.
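To make that concrete, here's roughly what the workflow looks like, going by the llm docs - the plugin and model names here are just examples, so check "llm models list" to see what's actually available on your machine:

    # install the CLI, plus an API key if you want to use hosted models
    pip install llm
    llm keys set openai

    # run a prompt against a hosted model
    llm -m gpt-4 "Ten fun names for a pet pelican"

    # install a plugin for local models, then see what it added
    llm install llm-gpt4all
    llm models list

    # every prompt and response is logged to SQLite - this prints where the database lives
    llm logs path

The nice part is that the logging and the -m switch work the same way regardless of which backend a plugin is talking to.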
My ultimate goal with LLM is that when someone releases a new model it will quickly be supported by an LLM plugin, which should make it MUCH easier to install and run these things without having to figure out a brand new way of doing it every single time.