Part of it is that, but part of it is that people pay for getting from 95% optimal to 99% optimal, and doing that is actually a lot of work. If you peek inside the matrix multiplication library you'll note that it's not just "we have the best algorithm for each of the last 7 GPU microarchitectures" but also 7 implementations for the latest architecture alone, because that's just what it takes to go fast. Kind of like how if you take even an uninformed look at glibc memcpy you'll see there is an AVX2 path and an ERMS path, but it will also switch between algorithms based on the size of the input. You can easily go "yeah my SSE2 code is tiny and gets decent performance" but if you stop there you're leaving performance on the table, and with GPUs it's the same thing but even more extreme.
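To make the size-switching point concrete, here is a rough sketch of what "pick an algorithm based on input size" looks like. This is not glibc's code: the names (`sketch_memcpy`, `choose_path`) and the thresholds are made up for illustration, and the "large" path just defers to libc `memcpy` as a stand-in for a vectorized or `rep movsb` path. Real implementations also dispatch on CPU features, which is left out here.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical thresholds, for illustration only. Real libraries tune
   these per microarchitecture. */
#define SMALL_MAX 16
#define LARGE_MIN 4096

typedef enum { PATH_SMALL, PATH_MEDIUM, PATH_LARGE } copy_path;

/* Decide which strategy to use purely from the byte count. */
copy_path choose_path(size_t n) {
    if (n <= SMALL_MAX) return PATH_SMALL;
    if (n < LARGE_MIN)  return PATH_MEDIUM;
    return PATH_LARGE;
}

/* Tiny copies: a plain byte loop, no setup cost. */
static void copy_small(unsigned char *d, const unsigned char *s, size_t n) {
    for (size_t i = 0; i < n; i++) d[i] = s[i];
}

/* Medium copies: move 8 bytes at a time, then finish the tail bytewise.
   The fixed-size memcpy calls keep unaligned access well-defined. */
static void copy_medium(unsigned char *d, const unsigned char *s, size_t n) {
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        uint64_t w;
        memcpy(&w, s + i, sizeof w);
        memcpy(d + i, &w, sizeof w);
    }
    for (; i < n; i++) d[i] = s[i];
}

/* Large copies: stand-in for a vector or `rep movsb` path (assumption:
   we simply defer to libc here). */
static void copy_large(unsigned char *d, const unsigned char *s, size_t n) {
    memcpy(d, s, n);
}

/* Same contract as memcpy: regions must not overlap; returns dst. */
void *sketch_memcpy(void *dst, const void *src, size_t n) {
    unsigned char *d = dst;
    const unsigned char *s = src;
    switch (choose_path(n)) {
    case PATH_SMALL:  copy_small(d, s, n);  break;
    case PATH_MEDIUM: copy_medium(d, s, n); break;
    case PATH_LARGE:  copy_large(d, s, n);  break;
    }
    return dst;
}
```

Every path produces the same bytes; the only reason to have three is that each one wins in a different size range. Multiply that by the number of CPU features (or GPU microarchitectures) you care about and you get the pile of implementations described above.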