I believe VRAM has massively shot up in price, so that's where a large part of the cost comes from. Besides, I wouldn't be surprised if Nvidia's market share is strong enough that it can effectively tell suppliers not to let others sell high-capacity cards. Especially since memory suppliers might worry about ramping up production too much and then being stuck with an oversupply.
This could well be the reason the rumored RDNA5 will use LPDDR5X instead of GDDR7 memory, at least for the low/mid-range configurations (the top-spec AT0 and enthusiast AT2 configurations will still use GDDR7, it seems).
It's not really clear whether it will be called UDNA or RDNA5; I was just referring to AMD's next-gen graphics architecture, and calling it RDNA5 makes it clearer that I mean the next generation.
I don't really know what I'm talking about (whether about graphics cards or AI inference), but if someone figures out how to cut the compute needed for AI inference significantly, I'd guess demand for graphics cards would suddenly drop?
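To put a rough number on that (a back-of-envelope sketch; the model size and throughput figures here are assumptions, not measurements): per-token decode compute for a dense transformer scales with parameter count, so a genuine algorithmic cut would translate fairly directly into fewer cards for the same traffic.

```python
# Back-of-envelope, not a benchmark: decoding one token of a dense
# transformer costs roughly 2 FLOPs per parameter (a multiply and an
# add per weight).
PARAMS = 70e9             # assumed model size (70B dense), illustrative only
FLOPS_PER_TOKEN = 2 * PARAMS

# Assumed effective throughput of one card during decode; real
# utilization is usually far below peak, since decoding tends to be
# memory-bandwidth bound rather than compute bound.
EFFECTIVE_FLOPS = 100e12  # 100 TFLOP/s, assumed

tokens_per_sec = EFFECTIVE_FLOPS / FLOPS_PER_TOKEN
print(f"~{tokens_per_sec:,.0f} tokens/s per card")

# If an algorithmic advance cut per-token compute 10x, the same hardware
# would serve ~10x the tokens -- the scenario speculated above.
print(f"~{10 * tokens_per_sec:,.0f} tokens/s with a hypothetical 10x cut")
```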
Given how young and volatile this domain still is, it doesn't seem unreasonable to be wary of it. Big players (Google, OpenAI, and the like) are probably pouring tons of money into trying to do exactly that.
I would suspect that for self-hosted LLMs, quality >>> performance, so newer releases will always expand to fill the capacity of available hardware even when efficiency improves.
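A quick illustration of "expands to fill capacity" (a minimal sketch; the model sizes, bytes-per-weight figures, and 20% overhead factor are all assumptions): weight memory is roughly parameters times bytes per weight, so each efficiency gain tends to get spent on a larger model rather than left as headroom.

```python
# Rough VRAM estimate for self-hosted LLM weights at common quantizations.
# Bytes-per-weight and the 20% overhead factor (KV cache, activations,
# buffers) are assumptions for illustration, not measured numbers.
BYTES_PER_WEIGHT = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}
OVERHEAD = 1.2  # assumed headroom for KV cache / activations

def vram_gb(params_billions: float, quant: str) -> float:
    """Approximate VRAM (GB) to hold the weights plus overhead."""
    return params_billions * BYTES_PER_WEIGHT[quant] * OVERHEAD

for params in (8, 32, 70):
    line = ", ".join(f"{q}: {vram_gb(params, q):5.1f} GB"
                     for q in BYTES_PER_WEIGHT)
    print(f"{params:>3}B model -> {line}")

# Under these assumptions, a 32B model at int4 needs about the same VRAM
# as an 8B model at fp16 -- so quantization gains tend to be spent on
# bigger (higher-quality) models, not on spare capacity.
```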