
IIRC, the lifecycle hook only prevents destruction of the resource if it needs to be replaced (e.g. change an immutable field). If you outright delete the resource declaration in code then it’s destroyed. I may be misremembering though
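The block I mean is roughly this, from memory (the resource here is just a made-up example):

  resource "aws_s3_bucket" "example" {
    bucket = "example-bucket"  # made-up name

    lifecycle {
      # Fails any plan that would destroy this resource, including a
      # forced replacement caused by changing an immutable field.
      prevent_destroy = true
    }
  }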


I find this statement to be technically correct, but practically untrue. Having worked in large terraform deployments using TFE, I can say it's very easy for a resource to get deleted by mistake.

Terraform's provider model is fundamentally broken. You cannot spin up a k8s cluster and then use the Kubernetes provider to configure that cluster in the same workspace. You need a different workspace that imports the outputs. The net result was we had something like 5 workspaces which really should have been one or two.

A seemingly inconsequential change in one of the upstream workspaces could absolutely wreck the resources in the downstream workspaces.

It's very easy in such a scenario to trigger a delete and replace, and for larger changes you have to inspect the plan very, very carefully. The other pain point was that I found most of my colleagues going "IDK, this is what worked in non-prod" whilst plans were actively destroying and recreating things; as long as the plan looked like it would execute and create whatever little thing they were working on, the downstream consequences didn't matter (I realize this is not a shortcoming of the tool itself).


This sounds like an operational issue and/or a lack of expertise with terraform. I use terraform (self-hosted, I guess you'd call it?) and manage not only kubernetes clusters but helm deployments with it, just fine and without the issues you are describing. Honest feedback: I see complaints a lot like this in consulting, where people expect terraform to magically solve their terrible infrastructure and automation decisions. It can't, but it absolutely provides you the tooling to avoid what I think you are describing.

It's fair to complain that terraform requires weird areas of expertise that aren't that intuitive and come with a bit of a learning curve, but it's not really fair to expect it to prevent bad practices and inexperience from causing the issues they typically do.


Terraform explicitly recommends in the Kubernetes provider documentation that the cluster creation itself and everything else related to Kubernetes should live in different states.

https://registry.terraform.io/providers/hashicorp/kubernetes...

> The most reliable way to configure the Kubernetes provider is to ensure that the cluster itself and the Kubernetes provider resources can be managed with separate apply operations. Data-sources can be used to convey values between the two stages as needed.
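In practice that ends up looking something like the sketch below: the second configuration never creates the cluster, it only reads values out of the first configuration's state. (This assumes an EKS cluster and an S3 state backend; all names and output keys here are made up.)

  # Kubernetes-side configuration, applied after the cluster exists.
  data "terraform_remote_state" "cluster" {
    backend = "s3"
    config = {
      bucket = "example-tf-state"
      key    = "clusters/prod.tfstate"
      region = "us-east-1"
    }
  }

  data "aws_eks_cluster_auth" "this" {
    name = data.terraform_remote_state.cluster.outputs.cluster_name
  }

  provider "kubernetes" {
    host                   = data.terraform_remote_state.cluster.outputs.cluster_endpoint
    cluster_ca_certificate = base64decode(data.terraform_remote_state.cluster.outputs.cluster_ca)
    token                  = data.aws_eks_cluster_auth.this.token
  }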


I agree with you (this is something that OpenTofu is trying to fix), but the way I do k8s provisioning in Terraform is to have one module that brings up a cluster, another that outputs the cluster's kubeconfig, and then, finally, another that uses the kubeconfig to provision Kubernetes resources. It's not perfect but it gets the job done most of the time.
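In sketch form (the module names and kubeconfig wiring are invented, and each step is its own apply):

  # Step 1: bring up the cluster.
  module "cluster" {
    source = "./modules/cluster"
  }

  # Step 2: expose the kubeconfig so it can be written to disk,
  # e.g. `terraform output -raw kubeconfig > kubeconfig.yaml`.
  output "kubeconfig" {
    value     = module.cluster.kubeconfig
    sensitive = true
  }

  # Step 3: a separate configuration points the provider at that file
  # and provisions the in-cluster resources.
  provider "kubernetes" {
    config_path = "./kubeconfig.yaml"
  }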


This is best practice. I couldn't imagine doing it any other way and would flatly refuse to.

There are shortcomings in the kubernetes provider as well that make maintaining all of that in one state file a nonstarter for me.


The Google Cloud Terraform provider includes, on Cloud SQL instances, a "deletion_protection" argument that defaults to true. While it's set, the provider will fail to apply any change that would destroy the instance; you first have to apply a change setting the argument to false.
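For reference, it looks roughly like this (settings trimmed down to a minimal sketch; the names are made up):

  resource "google_sql_database_instance" "main" {
    name             = "example-instance"
    database_version = "POSTGRES_15"
    region           = "us-central1"

    settings {
      tier = "db-f1-micro"
    }

    # Defaults to true. While true, any plan that would destroy this
    # instance fails; you have to apply a change flipping it to false
    # before a destroy can go through.
    deletion_protection = true
  }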

That's what I expected lifecycle.prevent_destroy to do when I first saw it, but indeed it does not.


I'm pretty sure you are. I've had it protect me from `terraform destroy`.


I think the previous post is talking about a resource being removed from a configuration file, rather than an invocation explicitly deleting the resource on the command line. Of course, if it's removed from the config file, presumably the lifecycle configuration was as well!


Yeah, that's a legit challenge, and it would be great if there were a better built-in solution for it (I'm fairly sure you can protect against it with policy as code via Sentinel or OPA, but then you're having to maintain a list of protected resources too).

That said, the failure mode is also a bit more than "a badly reviewed PR". It's:

* reviewing and approving a PR that is removing a resource

* approving a run that explicitly states how many resources are going to be destroyed, and lists them

* (or having your runs auto-approve)

I've long theorised the actual problem here is that in 99% of cases everything is fine, and so people develop a form of review fatigue and muscle memory for approving things without actually reviewing them critically.


This is not a terraform problem. This is your problem. In theory, you should be able to recreate the resource with only some downtime or a few services affected. You should centralize/separate state and have stronger protections on it.
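Minimal sketch of the centralized-state part, assuming an S3 backend (bucket and table names are made up):

  terraform {
    backend "s3" {
      bucket         = "example-tf-state"
      key            = "prod/core.tfstate"
      region         = "us-east-1"
      encrypt        = true
      # State locking so two applies can't stomp on each other.
      dynamodb_table = "example-tf-locks"
    }
  }

Versioning and tight access controls on the bucket then give you a way to roll the state back if something does go wrong.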



