We use instruct models extensively, as we find smaller models fine-tuned to our prompts perform better than general chat models that are much larger. This lets us run inference that can be 1000x cheaper than GPT-3.5, meaning both cost savings and much better latencies.
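A minimal sketch of what that kind of setup might look like, assuming an OpenAI-style completions endpoint; the fine-tuned model ID and the prompt below are hypothetical placeholders, not the poster's actual setup:

```python
# Sketch: calling a small fine-tuned instruct model via the OpenAI
# completions endpoint (instruct models use completions, not chat).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

resp = client.completions.create(
    model="ft:babbage-002:my-org::abc123",  # hypothetical fine-tuned model ID
    prompt="Classify the sentiment of this review: 'Great product!'\nSentiment:",
    max_tokens=5,   # narrow task, so only a few output tokens needed
    temperature=0,  # deterministic output for a classification-style task
)
print(resp.choices[0].text.strip())
```

Because the model is fine-tuned to one narrow prompt format, the prompt itself can be much shorter than a general chat system prompt, which is part of where the token savings come from.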
This feels like a valid use for LangChain then. Thanks for sharing.
Which models do you use, and for what use cases? 1000x is quite a lot of savings; normally, even with fine-tuning, it's at most 3x cheaper. Any cheaper and we'd need to buy something like $100k of hardware.
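A back-of-envelope sketch of why a few-x, rather than 1000x, is the usual ballpark; all prices and the monthly token volume here are made-up placeholders, not real pricing:

```python
# Hypothetical per-token prices and workload, purely illustrative.
LARGE_MODEL_PRICE = 1.50  # $ per 1M tokens (placeholder)
SMALL_FT_PRICE = 0.40     # $ per 1M tokens (placeholder)

tokens_per_month = 500_000_000  # 500M tokens/month (placeholder workload)

large_cost = tokens_per_month / 1_000_000 * LARGE_MODEL_PRICE
small_cost = tokens_per_month / 1_000_000 * SMALL_FT_PRICE

print(f"large model: ${large_cost:,.0f}/month")
print(f"fine-tuned:  ${small_cost:,.0f}/month")
print(f"savings:     {large_cost / small_cost:.1f}x")  # ~3.8x at these prices
```

Getting to 1000x from numbers like these would require shorter prompts and a drastically cheaper model on top of the per-token price gap, which is presumably what the parent comment is counting.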