It's common knowledge not to bother running a deep model like ALBERT, BERT, or XLNet, etc. without at least a GPU. Training and inference with models of this size on CPUs is typically considered to be intractable.

I am not sure if this is common knowledge, and if it is, it is wrong. With read-ahead and sorting batches by length, we can easily reach 100 sentences per second on modern CPUs (8 threads) with BERT base. We use BERT in a multi-task setup, typically annotating 5 layers at the same time. This is many times faster than old HPSG parsers (which typically had exponential time complexity) and as fast as other neural methods used in a pipelined setup.
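As a rough illustration of the batching trick mentioned above, here is a minimal sketch of length-sorted CPU inference. The library choice (Hugging Face transformers with PyTorch), the model name "bert-base-uncased", and the batch size are my own assumptions, not details from the comment:

  import torch
  from transformers import AutoModel, AutoTokenizer

  tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
  model = AutoModel.from_pretrained("bert-base-uncased")
  model.eval()
  torch.set_num_threads(8)  # the comment cites 8 CPU threads

  def encode_sorted(sentences, batch_size=32):
      # Sort indices by token length so each batch pads to a similar
      # length, avoiding wasted CPU time on padding tokens.
      order = sorted(range(len(sentences)),
                     key=lambda i: len(tokenizer.tokenize(sentences[i])))
      outputs = [None] * len(sentences)
      with torch.no_grad():
          for start in range(0, len(order), batch_size):
              idx = order[start:start + batch_size]
              batch = tokenizer([sentences[i] for i in idx],
                                padding=True, truncation=True,
                                return_tensors="pt")
              hidden = model(**batch).last_hidden_state
              # Scatter results back to the original sentence order.
              for pos, i in enumerate(idx):
                  outputs[i] = hidden[pos]
      return outputs

The same idea extends to read-ahead: tokenize the next chunk of input while the current batch is running, so the CPU threads doing the forward pass are never waiting on preprocessing.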


