My team does quite a bit of this. We handle it in two different ways:
For some clusters we carve nodes out of VMware simply using OS templates. For other clusters we use cheap-and-deep blade servers and install the OS on bare metal using PXE. Once the nodes are provisioned we use Ansible to deploy Kubernetes. (Lately it's been RKE2 on top of Rocky Linux.)
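The last step is basically "point a playbook at the new hosts." A minimal Python sketch of that hand-off (the host names, inventory layout, and playbook name here are made up for illustration, not our actual setup):

    import subprocess
    import tempfile

    # Hypothetical host names and playbook; the real inventory would come from
    # whatever tracked the nodes we just provisioned (VMware templates or PXE).
    new_nodes = ["k8s-worker-01.example.internal", "k8s-worker-02.example.internal"]
    playbook = "deploy-rke2.yml"  # assumed playbook name

    # Write a throwaway inventory for the new nodes and run the playbook
    # against just those hosts.
    with tempfile.NamedTemporaryFile("w", suffix=".ini", delete=False) as inv:
        inv.write("[new_workers]\n" + "\n".join(new_nodes) + "\n")

    subprocess.run(
        ["ansible-playbook", "-i", inv.name, playbook, "--limit", "new_workers"],
        check=True,
    )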
Generally speaking, VM-based nodes are extremely reliable and seldom have to be rebuilt. (If we're paying to run VMware, it's because the underlying hardware is high-quality.) Bare-metal nodes, on the other hand, are built on inexpensive hardware and tend to fail in more varied ways. When one fails we cordon it, remove it from the cluster, and put it on a list to be repaired or replaced. (We maintain enough overcapacity to absorb failures as they come.)
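The cordon-and-remove part is just two calls against the node API. A rough sketch with the official kubernetes Python client (the node name and kubeconfig handling are assumptions, and draining the pods first is left out for brevity):

    from kubernetes import client, config

    def retire_node(node_name: str) -> None:
        """Cordon a failed node, then remove it from the cluster."""
        config.load_kube_config()  # assumes an admin kubeconfig is available
        v1 = client.CoreV1Api()

        # Cordon: mark the node unschedulable so nothing new lands on it.
        v1.patch_node(node_name, {"spec": {"unschedulable": True}})

        # Remove the Node object; the hardware itself goes on the
        # repair/replace list outside of Kubernetes.
        v1.delete_node(node_name)

    retire_node("worker-17.example.internal")  # hypothetical node name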
If we need persistence we have to take care that the StatefulSets are configured correctly. Sometimes we use local-disk persistent volumes so our services can benefit from local NVMe performance. Other times we use NFS (when we need persistence but not performance).
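The difference mostly comes down to which storage class the StatefulSet's volumeClaimTemplates ask for. A sketch of that, again with the Python client (the class names "local-path" and "nfs-client" are assumptions; use whatever your provisioners actually expose):

    from kubernetes import client

    def data_volume_claim(storage_class: str, size: str) -> client.V1PersistentVolumeClaim:
        """Build the volumeClaimTemplates entry for a StatefulSet."""
        return client.V1PersistentVolumeClaim(
            metadata=client.V1ObjectMeta(name="data"),
            spec=client.V1PersistentVolumeClaimSpec(
                access_modes=["ReadWriteOnce"],
                storage_class_name=storage_class,
                resources=client.V1ResourceRequirements(requests={"storage": size}),
            ),
        )

    # Local NVMe for performance-sensitive services: pods stay pinned to the
    # node that holds their volume, so losing that node means recovering data.
    local_claim = data_volume_claim("local-path", "100Gi")

    # NFS when we need durability rather than speed: pods can reschedule anywhere.
    nfs_claim = data_volume_claim("nfs-client", "100Gi")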
We monitor cluster node health both from inside Kubernetes and externally with Nagios (shudder).
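The external check can be as dumb as a Nagios plugin that asks the API server whether every node is Ready. A minimal sketch using standard Nagios exit codes (kubeconfig handling assumed):

    #!/usr/bin/env python3
    """Nagios-style check: CRITICAL if any Kubernetes node is not Ready."""
    import sys
    from kubernetes import client, config

    OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3  # standard Nagios exit codes

    def main() -> int:
        try:
            config.load_kube_config()  # or load_incluster_config() inside a pod
            nodes = client.CoreV1Api().list_node().items
        except Exception as exc:
            print(f"UNKNOWN - could not query API server: {exc}")
            return UNKNOWN

        not_ready = []
        for node in nodes:
            ready = any(c.type == "Ready" and c.status == "True"
                        for c in (node.status.conditions or []))
            if not ready:
                not_ready.append(node.metadata.name)

        if not_ready:
            print(f"CRITICAL - nodes not Ready: {', '.join(not_ready)}")
            return CRITICAL
        print(f"OK - all {len(nodes)} nodes Ready")
        return OK

    if __name__ == "__main__":
        sys.exit(main())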
Kubernetes upgrades are a pain in the ass. A lot of the time we'll just stand up a second cluster and migrate workloads over rather than risk an in-place upgrade failing.