No, it'll certainly be more expensive in any conceivable model that handles all ...

No, it'll certainly be more expensive in any conceivable model that handles all three modalities, but if the model uses an architecture like current autoregressive, token-based multimodal LLMs/VLMs, tokens will make just as much sense as the basis for pricing, and be similarly straightforward, as with text and images.