An A100 will probably do 2-4k tokens/second on a 20B model with batched inference.

Multiply up the number of A100s as necessary to hit your total throughput target.

Here, you don't really need the RAM. If you can accept fewer tokens/second, you could do it much more cheaply with consumer graphics cards.

Even with an A100, the batching sweet spot is not going to give you 1k tokens/second per process. Of course, you could go up to an H100...



You can only batch if you have distinct chats running in parallel; batching raises aggregate throughput, not the speed of any single chat.
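
To make that concrete, here's a minimal sketch of batching distinct chats with vLLM (just one example of a batched serving engine; the model name, prompts, and sampling settings are placeholders I'm assuming, not anything from the thread):

    # Rough sketch: independent requests get batched into shared forward
    # passes, which is where figures like 2-4k tokens/second come from.
    from vllm import LLM, SamplingParams

    llm = LLM(model="some-org/some-20b-model")   # placeholder 20B checkpoint
    params = SamplingParams(temperature=0.7, max_tokens=256)

    # 20 distinct chats arriving in parallel can be served as one batch.
    prompts = [f"Chat {i}: summarize this ticket for me." for i in range(20)]
    outputs = llm.generate(prompts, params)

    for out in outputs:
        print(out.outputs[0].text[:80])

A single chat can't be batched against itself, because each new token depends on the previous one.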


> > if I want to run 20 concurrent processes, assuming I need 1k tokens/second throughput (on each)
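
Back-of-the-envelope on those numbers, taking the 2-4k tokens/second per A100 figure upthread at face value (these are assumptions, not benchmarks):

    import math

    target = 20 * 1_000                 # 20 concurrent processes x 1k tok/s each
    for per_gpu in (2_000, 4_000):      # assumed A100 range from upthread
        print(per_gpu, "tok/s per A100 ->", math.ceil(target / per_gpu), "A100s")
    # 2k tok/s -> 10 A100s, 4k tok/s -> 5 A100s, so roughly 5-10 cards
    # before doing any real benchmarking.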



