nl on Jan 9, 2020 | on: ALBERT: A Lite BERT for Self-Supervised Learning o...

They are comparing the time it takes to run 125K training steps, not the time to reach a given accuracy. In section 4.8 they compare accuracy after the same amount of training time for the largest of each model and show that ALBERT is substantially better.