20 points by mooreds 6 hours ago | 9 comments
  • MPSimmons 5 hours ago
    Cloud providers in general haven't gone very far toward providing hooks for validation.

    It seems like it would be easy for the cloud provider to implement the equivalent of a dry-run flag in its API calls, validating that the call would succeed (even as a best-effort determination), which tools like Terraform could use during planning and dependency-tree generation.

    Instead, you have platform providers like AzureRM that squint at the supplied objects and guess whether they look valid, which causes a ton of failures upon actual application. For instance, if you try to create storage with a redundancy level not supported by the target region, Terraform will pass the plan stage, but applying the resource will fail.

    There are countless other examples in a similar vein, all of which could be resolved if API providers had a dry-run flag.
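
    EC2 is actually a partial exception: many of its mutating calls accept a DryRun flag that validates permissions and some parameters without executing. A minimal boto3 sketch (ami_id is a placeholder):

      import boto3
      from botocore.exceptions import ClientError

      ec2 = boto3.client("ec2")
      try:
          # DryRun=True never launches anything; it only checks the call
          ec2.run_instances(ImageId=ami_id, InstanceType="t3.micro",
                            MinCount=1, MaxCount=1, DryRun=True)
      except ClientError as e:
          if e.response["Error"]["Code"] != "DryRunOperation":
              raise  # anything other than DryRunOperation means the real call would fail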

  • willi59549879 5 hours ago
    I am not a fan of abbreviations; this article didn't even write out "Terraform" once.
    • parpfish 5 hours ago
      I assumed it was going to be about TensorFlow
  • akersten 6 hours ago
    The most confusing part of Terraform for me is that Terraform's view of the infrastructure is a singleton state file that is often stored in that very infrastructure. You then have to share it with your team somehow and be very careful that no one gets it out of sync.

    Why don't cloud providers have a nice way for tools like TF to query the current state of the infra? Maybe they do and I'm doing IaC wrong?

    • cobolexpert 5 hours ago
      At $WORK we have a Git repo set up by the devops team, where we can manage our junk by creating Terraform resources in our main AWS account.

      The state, however, is always stored in a _separate AWS account_ that only the devops team can manage. I find this a reasonable way of working with TF. I agree it is confusing, though, because one is using $PROVIDER to both create things and manage those things at the same time, but conceptually, from TF’s perspective, those are very different things.

    • raffraffraff 5 hours ago
      There are three things: the code, the recorded state of the infra from when you applied the code, and the actual state at some point in the future (which may have drifted). You store the code in git and the recorded state (which contains unique IDs, ARNs, etc.) in a bucket; the next time you run a plan, you read the "actual state" and detect drift.
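
      As a toy illustration of "read the recorded state, read the actual state, diff them" (assuming a local terraform.tfstate in the v4 JSON format; real backends pull it from a bucket):

        import json, boto3

        state = json.load(open("terraform.tfstate"))  # recorded state
        recorded = {i["attributes"]["id"]
                    for r in state["resources"] if r["type"] == "aws_instance"
                    for i in r["instances"]}
        resp = boto3.client("ec2").describe_instances()  # actual state
        live = {i["InstanceId"] for r in resp["Reservations"]
                for i in r["Instances"]}
        for iid in recorded - live:
            print(f"drift: {iid} is in the state file but gone from AWS")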

      These days people store the state in Terraform Cloud, Spacelift, env0, or whatever. It doesn't have to live in the same infra you deployed.

      If you were a lunatic, you could skip the state backend entirely and let Terraform create state files in the code directory, checking the file into git with all those secrets and unique IDs.

    • don-code 5 hours ago
      > Why don't cloud providers have a nice way for tools like TF to query the current state of the infra? Maybe they do and I'm doing IaC wrong?

      This is technically how Ansible works. Here's an extensive list of modules that deploy resources in various public clouds: https://docs.ansible.com/projects/ansible/2.9/modules/list_o...

      That said, it looks like Ansible has deprecated those modules, and that seems fair - I haven't heard of anyone deploying infrastructure in a public cloud with Ansible in years. It found its niche in image generation and systems management. Almost all modern tools like Terraform, Pulumi, and even CloudFormation (albeit under the hood) keep a state file.

    • mooreds 6 hours ago
      > The most confusing part of terraform for me is that terraform's view of the infrastructure is a singleton config file that is often stored in that very infrastructure.

      These folks also have an article about that: https://newsletter.masterpoint.io/p/how-to-bootstrap-your-st...

      • bigstrat2003 6 hours ago
        That article is way overkill. One should just manually create the backend storage (S3 bucket or whatever you use). No reason to faff about with the steps in the article.
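
        For example, a one-off boto3 sketch of that manual bootstrap (bucket name is a placeholder; versioning lets you roll back a bad state write):

          import boto3

          s3 = boto3.client("s3")
          # works as-is in us-east-1; other regions need CreateBucketConfiguration
          s3.create_bucket(Bucket="my-tf-state")
          s3.put_bucket_versioning(
              Bucket="my-tf-state",
              VersioningConfiguration={"Status": "Enabled"})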
        • catlifeonmars 5 hours ago
          This is excellent advice.

          When you have a hammer… as the expression goes. It’s crazy how many times, even knowing this, I have to catch myself and step back. IaC is a contextually different way of thinking and it’s easy to get lost.

    • colechristensen 5 hours ago
      There are three things:

      * Your terraform code

      * The state terraform holds which is what it thinks your infrastructure state is

      * The actual state of your infrastructure

      >Why don't cloud providers have a nice way for tools like TF to query the current state of the infra?

      A Terraform provider is code that queries the targeted resources through whatever APIs they provide. I guess you could argue these APIs could be better, faster, or more tuned toward infrastructure management... but gathering state from whatever resources it manages is one of the core things Terraform does. I'm not sure what you're asking for.

      • fragmede 5 hours ago
        For the plan file to be updated to the state of the world in a non-confusing way, so that apply does the right thing without a chance it's gonna blow things up.
        • colechristensen 5 hours ago
          This is really up to the writer of the provider (very often the service vendor itself) to make the provider code correctly model how the service works. Very often it doesn't, and it lets you produce an error-free plan for something that will fail during apply.

          It's not an API issue but a Terraform provider issue - missing or incomplete code (e.g. https://github.com/hashicorp/terraform-provider-aws).

    • cyberax 5 hours ago
      > Why don't cloud providers have a nice way for tools like TF to query the current state of the infra?

      They do! In fact, this is my greatest pet peeve with TF: it adds state when it's not needed.

      I was doing infra-as-code on AWS without TF a long time ago. It went like this:

        env_tag = f"{project_name}-{env_name}"
        aws_instances = conn.describe_instances(filter_by_tag={"env_tag": env_tag})
        if not aws_instances:  # launch only when the tagged instance is missing
            conn.launch_aws_instances(tags={"env_tag": env_tag})
      
      AWS has tag-on-create now, making this sort of code reliable. Before that, you could do the same with instance idempotency tokens. GCP also has tags.
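
      A minimal boto3 version of the same pattern with tag-on-create plus an idempotency token (ami_id and the other names are placeholders):

        import boto3

        ec2 = boto3.client("ec2")
        ec2.run_instances(
            ImageId=ami_id, InstanceType="t3.micro", MinCount=1, MaxCount=1,
            ClientToken=env_tag,  # retries with the same token won't double-launch
            TagSpecifications=[{"ResourceType": "instance",
                                "Tags": [{"Key": "env_tag", "Value": env_tag}]}])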
  • jdalsgaard 5 hours ago
    Most tools, frameworks and articles in IT, SaaS in particular, are about spinning up things. It is what people find exciting.

    Work a few years in Ops and you learn that spinning up things is not a big part of your work. It's maintenance, such as deleting stuff.

    Unfortunately, this process is the hardest, and there's very little to help you do it right. Many tools, frameworks, and vendors don't even have proper support for it.

    Some even recommend 'rinse and repeat' instead of adjusting what you have - and this method is not great if you value uptime, nor if you have state that you want to preserve, such as customer data :-)

    Deleting stuff, shutting services down, turning off servers - those are hard tasks in IT.

    • jiggawatts 3 hours ago
      My acid test for provisioning automation products is asking: Can it rename deployed resources?

      Practically none can, even in market segments where this is highly relevant. For example: user identity and access management products. Women get married and change their name all the time!

      The next level up is the ability to rename a container such as an organisational unit or a security group.

      Then, products that can rearrange a hierarchy to accommodate a merger, split, or a new layer of management. This obviously needs to preserve the data. “Immutable infrastructure” where everything is recreated from scratch and the original is dropped is cheating.

      I’ve only ever seen one provisioning tool that can; the rest don’t even begin to approach this level of capability.

  • sshine 4 hours ago
    I love how terraform can describe what I’ve got. Sort of. Assuming I or my colleagues or my noob customers don’t modify resources on the same account.

    I don’t love how unreliable providers are, even for creating resources. Clouds like DigitalOcean will 429-throttle me for making too many plans in a row with only 100+ resources. Sometimes the plan goes through but the apply fails - sometimes halfway through.

    I’d rather use a cloud-specific API, unless I’m certain of the quality of the specific terraform provider.

  • based2 5 hours ago
    Because TF lacks sequential state descriptions in rare cases - e.g., termination protection in AWS.
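
    A concrete case: termination protection has to be switched off in a separate call before a destroy can go through, and TF can't sequence that inside a single apply, so people end up scripting it (a boto3 sketch; instance_id is a placeholder):

      import boto3

      ec2 = boto3.client("ec2")
      # must land before TerminateInstances / terraform destroy will succeed
      ec2.modify_instance_attribute(
          InstanceId=instance_id,
          DisableApiTermination={"Value": False})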
  • dpkirchner 5 hours ago
    Hell, let's talk about why ^c'ing the plan phase sucks.
  • otterley 4 hours ago
    "Because referential integrity is a thing, and if you don't have all dependencies either explicitly declared or implicitly determinable in your plan, your cloud provider is going to enforce it for you."