An A100 will probably do 2-4k tokens/second on a 20B model with batched inference.

Multiply up the number of A100s as necessary to hit your total throughput target.

Here, you don't really need the RAM. If you can accept fewer tokens/second, you could do it much more cheaply with consumer graphics cards.

Even with an A100, the batching sweet spot is not going to give you 1k tokens/second per process. Of course, you could go up to an H100...



You can only batch if you have distinct chats running in parallel; batching raises aggregate throughput, not the speed of any single chat.
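
To make that concrete, here's a minimal sketch of batching distinct chats with vLLM (just one example of a batched serving engine; the model name, prompts, and sampling settings are placeholders I'm assuming, not anything from the thread):

    # Rough sketch: independent requests get batched into shared forward
    # passes, which is where figures like 2-4k tokens/second come from.
    from vllm import LLM, SamplingParams

    llm = LLM(model="some-org/some-20b-model")   # placeholder 20B checkpoint
    params = SamplingParams(temperature=0.7, max_tokens=256)

    # 20 distinct chats arriving in parallel can be served as one batch.
    prompts = [f"Chat {i}: summarize this ticket for me." for i in range(20)]
    outputs = llm.generate(prompts, params)

    for out in outputs:
        print(out.outputs[0].text[:80])

A single chat can't be batched against itself, because each new token depends on the previous one.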


> > if I want to run 20 concurrent processes, assuming I need 1k tokens/second throughput (on each)
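
Back-of-the-envelope on those numbers, taking the 2-4k tokens/second per A100 figure upthread at face value (these are assumptions, not benchmarks):

    import math

    target = 20 * 1_000                 # 20 concurrent processes x 1k tok/s each
    for per_gpu in (2_000, 4_000):      # assumed A100 range from upthread
        print(per_gpu, "tok/s per A100 ->", math.ceil(target / per_gpu), "A100s")
    # 2k tok/s -> 10 A100s, 4k tok/s -> 5 A100s, so roughly 5-10 cards
    # before doing any real benchmarking.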



