Phi-3 does well in benchmarks but underperforms IRL; for example, Phi-3-Medium gets beaten badly by Llama-3-8b on the LMSYS Chatbot Arena despite doing better on benchmarks.
Gemma's performance if anything seems understated on benchmarks: the 27b is currently ahead of Llama3-70b on the Chatbot Arena leaderboard.
I suspect Phi-3 is not robust to normal human input like typos and strange grammar since it's only trained on filtered "high quality" tokens and synthetic data. Since it doesn't need to waste a ton of parameters learning how to error correct input, it's much smarter on well curated benchmarks compared to its weight class. However, it can't operate out of distribution at all.
Personally vibe checking Phi-3-Medium is worse in my experience, no matter how well you spell — it just isn't good at all compared to Llama3-8b, despite being significantly larger in param count. I suspect the "high quality tokens" were "high quality" in the sense that they resembled tokens one might encounter in benchmarks, and not "high quality" in the sense of representing human-like input/output.
Gemma's performance if anything seems understated on benchmarks: the 27b is currently ahead of Llama3-70b on the Chatbot Arena leaderboard.