You're right about local software comparisons, but this is different. If I'm com...

You're right about local software comparisons, but this is different. If I'm comparing two SaaS platforms, wall clock time to achieve a similar task is a fair metric to use. The only caveat is if the service offers some kind of tiered performance pricing, like if we were comapring a task performed on an AWS EC2 instance vs Azure VM instance, but that is not the case with these LLMs.

So yes, it may be that the wall clock time is not reflective of the performance of the model, but it is reflective of the performance of the SaaS offerings.