You do not have to context switch the kernel thread when context switching the multiplexed user thread, so the switch should in principle be much faster. There are second-order effects of course: for example, the new user thread might touch cold cache lines, in which case the faster switch might not make much of a difference overall.
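To make the first point concrete, here is a minimal sketch that uses Python generators as stand-ins for user threads. The names `user_thread` and `run_round_robin` are made up for illustration and are not taken from any real M:N runtime. Every `yield` is a user-level switch, and the log shows that the kernel thread underneath never changes.

```python
import threading
from collections import deque


def user_thread(name, steps, log):
    """Hypothetical 'user thread': each yield is a cooperative switch point."""
    for i in range(steps):
        # Record which kernel thread is executing this user thread right now.
        log.append((name, i, threading.get_ident()))
        yield  # hand control back to the user-space scheduler


def run_round_robin(tasks):
    """Multiplex all tasks onto the calling kernel thread, round robin."""
    ready = deque(tasks)
    while ready:
        task = ready.popleft()
        try:
            next(task)          # resume the user thread until its next yield
            ready.append(task)  # still runnable, so requeue it
        except StopIteration:
            pass                # the user thread finished


log = []
run_round_robin([user_thread("a", 3, log), user_thread("b", 3, log)])

# Every entry was logged from the same kernel thread.
kernel_thread_ids = {tid for _, _, tid in log}
```

The switches here are ordinary function returns and resumes inside one process, with no trip through the kernel scheduler. A real runtime would save and restore registers and a stack instead, but the principle is the same.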

Normally in an M:N setup the kernel threads are pinned one per physical hardware thread (i.e. core or SMT thread), so as long as they are the only program running on that core, they are essentially never preempted.
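As a rough illustration of pinning, the sketch below restricts the caller to a single CPU using `os.sched_setaffinity`, which Python only exposes on some platforms (Linux in particular). The helpers `pin_current_thread` and `first_allowed_cpu` are hypothetical names. A real runtime would pin each of its worker threads to a different hardware thread, for example with `pthread_setaffinity_np` in C.

```python
import os


def first_allowed_cpu():
    """Pick a CPU the caller is currently allowed to run on (0 as a fallback)."""
    if hasattr(os, "sched_getaffinity"):
        return min(os.sched_getaffinity(0))
    return 0


def pin_current_thread(cpu):
    """Try to restrict the caller to one CPU.

    Returns the resulting affinity set, or None if pinning is unsupported
    or refused on this system.
    """
    if not hasattr(os, "sched_setaffinity"):
        return None  # not available on this platform
    try:
        os.sched_setaffinity(0, {cpu})  # 0 means "the caller"
    except OSError:
        return None
    return os.sched_getaffinity(0)


cpu = first_allowed_cpu()
affinity = pin_current_thread(cpu)
```

Note that pinning only stops the thread from migrating. It does not stop other programs from being scheduled on the same core, which is why the "only program running on that core" condition matters.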