>a (fine-tuned) base Transformer model just trivially blowing everything else out of the water
"Attention is All You Need" was a Transformer model trained specifically for translation, blowing all other translation models out of the water. It was not fine-tuned for tasks other than what the model was trained from scratch for.
GPT-1/BERT were significant because they showed that you can pretrain one base model and use it for "everything".