Terraform Best Practices: Lessons from Someone Who Has Absolutely Never Deleted Production
Three years of Terraform lessons — anchored by a production incident that may or may not have happened, featuring an RDS database that may or may not have been destroyed. Hypothetically.
Let me tell you about the worst afternoon of my professional life.
Or rather — let me tell you about a hypothetical afternoon. One that may or may not have involved a production RDS instance, a wrong Terraform workspace, and the slow, dawning horror of watching a destroy plan execute while my soul quietly left my body.
Did it happen to me? I'm not saying it did. I'm also not saying it didn't. What I will say is: I have been extraordinarily careful with terraform apply for the past three years, and there is a very specific reason for that. Whether that reason is lived trauma or an abundance of professional caution is something we may never resolve.
What I can tell you is this: somewhere in the multiverse, a version of me typed "yes" into a terminal, watched Terraform destroy a production RDS instance with 18 months of user data, and spent the next 4 hours recovering from a snapshot that was 6 hours old. Maybe that person was me. Maybe I was simply too chicken to ever let it get that far.
Either way — in case this ever happens to you, hypothetically — here is everything I know about never letting it happen again.
Directory Structure: Monorepo with Clear Separation
The first thing that enables everything else is a clean directory structure. After trying several approaches, this is what I settled on:
```
infrastructure/
├── modules/
│   ├── vpc/
│   │   ├── main.tf
│   │   ├── variables.tf
│   │   ├── outputs.tf
│   │   └── README.md
│   ├── eks/
│   │   ├── main.tf
│   │   ├── variables.tf
│   │   └── outputs.tf
│   └── rds/
│       ├── main.tf
│       ├── variables.tf
│       └── outputs.tf
└── environments/
    ├── staging/
    │   ├── main.tf
    │   ├── variables.tf
    │   ├── terraform.tfvars
    │   └── backend.tf
    └── production/
        ├── main.tf
        ├── variables.tf
        ├── terraform.tfvars
        └── backend.tf
```

Modules live in modules/ and contain no environment-specific configuration. Each environment directory under environments/ calls those modules with environment-specific values. This separation means that to deploy something to staging, you work in environments/staging/. To deploy to production, you work in environments/production/. There is no workspace switching. The directory you are in is the environment you are affecting. This alone would have prevented my disaster.
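To make the split concrete, here is a sketch of what an environment entry point might look like. The module inputs, CIDR block, and identifier are illustrative, not from a real config:

```hcl
# environments/staging/main.tf -- illustrative sketch; values are hypothetical
module "vpc" {
  source      = "../../modules/vpc"
  environment = "staging"
  cidr_block  = "10.1.0.0/16"
}

module "rds" {
  source              = "../../modules/rds"
  identifier          = "app-staging"
  instance_class      = var.rds_instance_class
  multi_az            = var.rds_multi_az
  deletion_protection = var.rds_deletion_protection
}
```

The environment directory is thin on purpose: it wires modules to values and nothing else.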
Remote State: S3 Backend with DynamoDB Locking
Local state files are a team antipattern. The moment two people try to run terraform apply at the same time against local state, you have a corrupted state file and a bad time.
Store your state remotely. In AWS, the standard setup is an S3 bucket for state storage and a DynamoDB table for state locking.
```hcl
terraform {
  backend "s3" {
    bucket         = "mycompany-terraform-state"
    key            = "environments/production/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true
    dynamodb_table = "terraform-state-lock"
    kms_key_id     = "arn:aws:kms:us-east-1:123456789012:key/your-kms-key-id"
  }

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }

  required_version = ">= 1.6.0"
}
```

The DynamoDB table needs a single attribute: LockID (String), set as the partition key. When Terraform runs, it acquires a lock in this table. If another process tries to run at the same time, it sees the lock and waits or exits. This prevents concurrent apply operations from corrupting state.
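The bucket and lock table have to exist before any backend can use them, so they are typically created once in a small bootstrap configuration. A minimal sketch, with names matching the backend block above; versioning is enabled so a bad state push can be rolled back:

```hcl
# One-time bootstrap for the state backend (hypothetical names)
resource "aws_s3_bucket" "tf_state" {
  bucket = "mycompany-terraform-state"
}

# Versioning lets you recover earlier revisions of the state file
resource "aws_s3_bucket_versioning" "tf_state" {
  bucket = aws_s3_bucket.tf_state.id
  versioning_configuration {
    status = "Enabled"
  }
}

# Lock table: a single string partition key named LockID
resource "aws_dynamodb_table" "tf_lock" {
  name         = "terraform-state-lock"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "LockID"

  attribute {
    name = "LockID"
    type = "S"
  }
}
```

This bootstrap configuration is the one place where local state is tolerable, since it exists only to create the remote backend everything else uses.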
State management rules I enforce strictly:

- Never edit the state file by hand. Use terraform state mv and terraform state rm when state surgery is unavoidable.
- Never commit state files to version control; they can contain secrets in plaintext.
- Keep versioning enabled on the state bucket so a corrupted or mistaken state push can be rolled back.
Module Design: One Module Per Logical Unit
A good module is like a good function: it does one thing, has clearly defined inputs and outputs, and hides its internal complexity.
One module per logical infrastructure unit. The VPC module handles the VPC, subnets, route tables, NAT gateways, and internet gateway. The EKS module handles the cluster, node groups, and IAM roles. The RDS module handles the database, subnet groups, and parameter groups. Nothing leaks across module boundaries.
```hcl
# modules/rds/variables.tf
variable "identifier" {
  description = "Unique identifier for the RDS instance"
  type        = string
}

variable "engine_version" {
  description = "PostgreSQL engine version"
  type        = string
  default     = "15.4"
}

variable "instance_class" {
  description = "RDS instance type"
  type        = string
}

variable "allocated_storage" {
  description = "Initial storage in GB"
  type        = number
  default     = 100
}

variable "multi_az" {
  description = "Enable Multi-AZ for high availability"
  type        = bool
  default     = true
}

variable "deletion_protection" {
  description = "Prevent accidental deletion"
  type        = bool
  default     = true
}
```

Note that deletion_protection defaults to true. Someone added that after an incident. Very professionally. Any module that can destroy data should have deletion protection on by default, with the caller explicitly setting it to false only in non-production environments.
Module versioning: if you publish modules to a private Terraform registry or reference them via Git tags, pin to a specific version. Do not use HEAD or latest. Breaking changes in a module should require a deliberate version bump in the caller.
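A Git-tagged module reference might look like the following; the repository URL and tag are hypothetical:

```hcl
# Pinned to an exact tag: upgrading is a deliberate, reviewable diff
module "rds" {
  source = "git::https://github.com/mycompany/terraform-modules.git//rds?ref=v1.4.2"

  identifier     = "app-production"
  instance_class = "db.r6g.xlarge"
}
```

Bumping `ref` from v1.4.2 to v1.5.0 then shows up in code review like any other change, which is exactly the point.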
Variable Hierarchy: Never Hardcode Anything
Every value that differs between environments belongs in a variable. Every value that differs between runs belongs in a variable. Nothing is hardcoded in main.tf.
```hcl
# environments/production/terraform.tfvars
aws_region              = "us-east-1"
environment             = "production"
eks_cluster_version     = "1.29"
eks_node_instance_types = ["m6i.xlarge", "m6i.2xlarge"]
rds_instance_class      = "db.r6g.xlarge"
rds_multi_az            = true
rds_deletion_protection = true
```

```hcl
# environments/staging/terraform.tfvars
aws_region              = "us-east-1"
environment             = "staging"
eks_cluster_version     = "1.29"
eks_node_instance_types = ["t3.large"]
rds_instance_class      = "db.t3.medium"
rds_multi_az            = false
rds_deletion_protection = false
```

For secrets (database passwords, API keys), do not put them in tfvars files. Reference them from AWS Secrets Manager or SSM Parameter Store using data sources:
```hcl
data "aws_ssm_parameter" "db_password" {
  name            = "/production/rds/master-password"
  with_decryption = true
}
```

Workspace Strategy vs Directory Separation
Terraform workspaces allow a single configuration to manage multiple state files. You switch between them with terraform workspace select production. This is what the hypothetical version of me was incorrectly using when the hypothetical disaster hypothetically occurred.
My current recommendation: use directory separation for major environment boundaries (production, staging, dev), not workspaces. The cognitive overhead of remembering which workspace you are in is too high, and the consequences of being in the wrong workspace are severe.
Use workspaces for minor variations within an environment — for example, if you need to spin up a temporary clone of staging for a load test. In that case, workspaces with a clearly named prefix (loadtest-week23) make sense. But production vs staging should be different directories.
If you do use workspaces, add this to every main.tf that manages production resources:
```hcl
locals {
  is_production = terraform.workspace == "production"
}

resource "aws_db_instance" "main" {
  # ...
  deletion_protection = local.is_production
  multi_az            = local.is_production
}
```

At minimum, make the consequences of being in the wrong workspace visible in the plan output.
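You can go one step further and make a mismatched workspace fail the plan outright. A sketch using a lifecycle precondition (available since Terraform 1.2); it assumes your configuration also defines a var.environment that names the intended environment:

```hcl
# Fails the plan if the selected workspace disagrees with var.environment
resource "aws_db_instance" "main" {
  # ...

  lifecycle {
    precondition {
      condition     = terraform.workspace == var.environment
      error_message = "Selected workspace does not match var.environment. You are probably in the wrong workspace."
    }
  }
}
```

A failed plan with a loud message is a far cheaper lesson than a completed destroy.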
Plan Before Every Apply: Make It Mandatory in CI
terraform plan should run before every terraform apply. Not sometimes. Always. And the plan output should be reviewed — not just "yes it passed," but actually read.
In CI, this means: run terraform plan on every pull request to infrastructure changes, post the plan output as a PR comment, and only allow terraform apply after the PR is merged to main.
Here is the GitHub Actions workflow I use:
```yaml
name: Terraform

on:
  pull_request:
    paths:
      - 'infrastructure/**'
  push:
    branches:
      - main
    paths:
      - 'infrastructure/**'

jobs:
  terraform:
    name: Terraform Plan / Apply
    runs-on: ubuntu-latest
    permissions:
      id-token: write
      contents: read
      pull-requests: write
    steps:
      - uses: actions/checkout@v4

      - name: Configure AWS Credentials
        uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::123456789012:role/TerraformCIRole
          aws-region: us-east-1

      - name: Setup Terraform
        uses: hashicorp/setup-terraform@v3
        with:
          terraform_version: 1.6.6

      - name: Terraform Init
        run: terraform init
        working-directory: infrastructure/environments/production

      - name: Terraform Validate
        run: terraform validate
        working-directory: infrastructure/environments/production

      - name: Terraform Plan
        id: plan
        run: terraform plan -out=tfplan -no-color
        working-directory: infrastructure/environments/production

      - name: Post Plan to PR
        if: github.event_name == 'pull_request'
        uses: actions/github-script@v7
        with:
          script: |
            const output = `#### Terraform Plan
            \`\`\`
            ${{ steps.plan.outputs.stdout }}
            \`\`\``;
            github.rest.issues.createComment({
              issue_number: context.issue.number,
              owner: context.repo.owner,
              repo: context.repo.repo,
              body: output
            })

      - name: Terraform Apply
        if: github.event_name == 'push' && github.ref == 'refs/heads/main'
        run: terraform apply tfplan
        working-directory: infrastructure/environments/production
```

The IAM role used by CI should have only the permissions needed to manage your specific resources. Not AdministratorAccess. Use OIDC-based federation so there are no long-lived access keys to rotate or leak.
Drift Detection
Real infrastructure drifts. Someone makes a manual change in the console "just this once." A vendor updates a managed resource. An automated process modifies a tag. Drift is normal. The problem is when drift accumulates silently.
Run terraform plan -detailed-exitcode on a schedule (nightly or weekly) and alert when exit code is 2 (changes detected). This is your drift detector.
```shell
terraform plan -detailed-exitcode
# Exit code 0: no changes
# Exit code 1: error
# Exit code 2: changes detected
```

Wire this into your alerting. A drift-detected alert at 2am is much better than discovering that your "infrastructure as code" no longer matches reality when you try to recreate an environment.
Common Mistakes I See Constantly
count vs for_each: Use for_each for maps and sets of strings. count creates indexed resources (aws_iam_user.users[0], aws_iam_user.users[1]). If you insert an item at the beginning of the list, all indexes shift and Terraform wants to recreate everything. for_each creates resources keyed by map key or set value, so insertions do not cause unnecessary recreation.
```hcl
# Wrong for dynamic sets — index shifting causes chaos
resource "aws_iam_user" "users" {
  count = length(var.user_names)
  name  = var.user_names[count.index]
}
```

```hcl
# Right — keyed by name, stable through insertions
resource "aws_iam_user" "users" {
  for_each = toset(var.user_names)
  name     = each.value
}
```

Not using data sources: If you need to reference something that exists outside your current configuration — an AMI ID, a Route53 zone, another team's VPC — use a data source. Do not hardcode the ID. Hardcoded IDs break when you change regions, run in a new account, or when the underlying resource is recreated.
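For instance, instead of hardcoding an AMI ID, look it up at plan time. A sketch; the name filter is a guess at a typical Amazon Linux 2023 naming pattern, so verify it against your region:

```hcl
# Resolve the latest matching AMI at plan time instead of pinning an ID
data "aws_ami" "al2023" {
  most_recent = true
  owners      = ["amazon"]

  filter {
    name   = "name"
    values = ["al2023-ami-*-x86_64"]
  }
}

resource "aws_instance" "web" {
  ami           = data.aws_ami.al2023.id
  instance_type = "t3.micro"
}
```

The same lookup works unchanged in any region or account, which is the whole argument against hardcoding.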
Circular dependencies: Terraform builds a dependency graph from your resource references. If resource A references resource B and resource B references resource A, you get a circular dependency error. depends_on cannot fix this — it only adds edges to the graph. The fix is to restructure so the graph becomes acyclic, usually by extracting the mutual reference into a separate resource.
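The classic case is two security groups that each allow traffic from the other. Inline ingress blocks on both groups create a cycle; standalone aws_security_group_rule resources break it. A sketch with hypothetical names and a PostgreSQL port:

```hcl
# Neither group references the other, so no cycle exists here
resource "aws_security_group" "app" {
  name_prefix = "app-"
  vpc_id      = var.vpc_id
}

resource "aws_security_group" "db" {
  name_prefix = "db-"
  vpc_id      = var.vpc_id
}

# The cross-reference lives in its own resource, which depends on
# both groups without either group depending on the other.
resource "aws_security_group_rule" "db_from_app" {
  type                     = "ingress"
  from_port                = 5432
  to_port                  = 5432
  protocol                 = "tcp"
  security_group_id        = aws_security_group.db.id
  source_security_group_id = aws_security_group.app.id
}
```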
The Import and Replace Commands
When you take over an existing AWS account or need to bring manually-created resources under Terraform management, use terraform import. It pulls the existing resource into your state file so Terraform knows about it.
```shell
# Import an existing S3 bucket into Terraform state
terraform import aws_s3_bucket.uploads my-existing-bucket-name

# Import an existing RDS instance
terraform import aws_db_instance.main mydb-production
```

When you need to force-replace a resource (for example, an EC2 instance that is misbehaving and needs to be freshly provisioned), use terraform apply -replace instead of the old terraform taint:

```shell
terraform apply -replace="aws_instance.web_server"
```

This marks a specific resource for replacement in the next plan/apply cycle without destroying and recreating everything else.
What I Would Do Differently. Hypothetically.
After the RDS incident that definitely did or did not happen, I adopted several non-negotiable rules — the kind of rules a person arrives at either through hard experience or through being extremely, almost suspiciously, cautious from day one: directories instead of workspaces for environment boundaries, deletion protection on by default, remote state with locking, and a plan that is actually read before every apply.
Whether these rules came from 4 hours of downtime and a very uncomfortable conversation with a CTO, or from pure foresight and good instincts — I'll leave that to your imagination.
Measure twice. Apply once. And for the love of everything, check your workspace before you type "yes."