
It's rather complex and there's a whole team handling it: we have servers in ~20 data centers distributed globally, large pools run >20k pods each, and each DC has 100TB to 1PB of RAM available.

We have pod affinity rules so that failures don't bring down services (we usually flush entire racks for infra updates).
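
The comment doesn't say which mechanism is used, but in stock Kubernetes this kind of rack-level failure isolation is usually expressed as a topology spread constraint (or pod anti-affinity) keyed on a rack label. A minimal sketch; the rack label, service name, and image below are all hypothetical:

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: example-service            # hypothetical service
    spec:
      replicas: 6
      selector:
        matchLabels:
          app: example-service
      template:
        metadata:
          labels:
            app: example-service
        spec:
          # Spread replicas across racks so flushing one rack never
          # takes down every instance of the service at once.
          topologySpreadConstraints:
            - maxSkew: 1
              topologyKey: topology.example.com/rack   # hypothetical rack label
              whenUnsatisfiable: DoNotSchedule
              labelSelector:
                matchLabels:
                  app: example-service
          containers:
            - name: app
              image: example.com/app:latest            # placeholder image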

Node failure is rather unusual; it's more likely that we either need to flush a rack to update it or that some service has an issue.

We have separate environments with isolated hardware pools for production and testing (they may be colocated in the same DC).
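
The comment doesn't say how that isolation is enforced; one common pattern when prod and test share a DC is node labels plus taints, so test workloads simply can't schedule onto production hardware. A hypothetical sketch (the pool label, taint, and names are all assumptions):

    apiVersion: v1
    kind: Pod
    metadata:
      name: prod-workload              # hypothetical name
    spec:
      # Only land on nodes labeled as part of the production pool...
      nodeSelector:
        pool: production               # hypothetical node label
      # ...and tolerate the taint that keeps everything else off them,
      # assuming nodes are tainted with pool=production:NoSchedule.
      tolerations:
        - key: pool
          value: production
          effect: NoSchedule
      containers:
        - name: app
          image: example.com/app:latest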

Nodes have high-performance NAS available and ephemeral local storage (SSDs) that is wiped on pod restart.
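
In Kubernetes terms, the ephemeral scratch space maps naturally to an emptyDir volume (deleted when the pod leaves the node) and the NAS to a PersistentVolumeClaim. A minimal sketch with hypothetical names:

    apiVersion: v1
    kind: Pod
    metadata:
      name: storage-example            # hypothetical name
    spec:
      containers:
        - name: app
          image: example.com/app:latest
          volumeMounts:
            - name: scratch
              mountPath: /scratch      # wiped when the pod is recreated
            - name: shared
              mountPath: /data         # survives pod restarts
      volumes:
        # Ephemeral local storage: lives on the node's SSDs and is
        # deleted when the pod is removed from the node.
        - name: scratch
          emptyDir: {}
        # NAS-backed storage, bound through a PersistentVolumeClaim.
        - name: shared
          persistentVolumeClaim:
            claimName: shared-data     # hypothetical PVC backed by the NAS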

If a node fails, you remove it from the pool and send someone to replace it when feasible.
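
With kubectl, taking a failed node out of the pool usually looks something like this (the node name is a placeholder):

    # Stop new pods from being scheduled onto the bad node.
    kubectl cordon node-1234

    # Evict its pods so the scheduler recreates them elsewhere;
    # emptyDir contents are ephemeral anyway (see above).
    kubectl drain node-1234 --ignore-daemonsets --delete-emptydir-data

    # Drop it from the cluster until the hardware is replaced.
    kubectl delete node node-1234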

Provisioning depends on the application: you can provision your own pod (if you have the right access), but applications tend to have deployer services that handle provisioning for them.
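
The "right access" part presumably maps to RBAC. A hypothetical sketch of a namespaced role that would let a deployer service (or a person) create pods; every name here is made up:

    apiVersion: rbac.authorization.k8s.io/v1
    kind: Role
    metadata:
      name: pod-provisioner            # hypothetical role
      namespace: team-a                # hypothetical namespace
    rules:
      - apiGroups: [""]
        resources: ["pods"]
        verbs: ["create", "get", "list", "delete"]
    ---
    apiVersion: rbac.authorization.k8s.io/v1
    kind: RoleBinding
    metadata:
      name: pod-provisioner-binding
      namespace: team-a
    subjects:
      - kind: ServiceAccount
        name: deployer                 # hypothetical deployer service account
        namespace: team-a
    roleRef:
      apiGroup: rbac.authorization.k8s.io
      kind: Role
      name: pod-provisioner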


