Terraform Best Practices: Lessons from Someone Who Has Absolutely Never Deleted Production
Three years of Terraform lessons — anchored by a production incident that may or may not have happened, featuring an RDS database that may or may not have been destroyed. Hypothetically.
Let me tell you about the worst afternoon of my professional life.
Or rather — let me tell you about a hypothetical afternoon. One that may or may not have involved a production RDS instance, a wrong Terraform workspace, and the slow, dawning horror of watching a destroy plan execute while my soul quietly left my body.
Did it happen to me? I'm not saying it did. I'm also not saying it didn't. What I will say is: I have been extraordinarily careful with terraform apply for the past three years, and there is a very specific reason for that. Whether that reason is lived trauma or an abundance of professional caution is something we may never resolve.
What I can tell you is this: somewhere in the multiverse, a version of me typed "yes" into a terminal, watched Terraform destroy a production RDS instance with 18 months of user data, and spent the next 4 hours recovering from a snapshot that was 6 hours old. Maybe that person was me. Maybe I was simply too chicken to ever let it get that far.
Either way — in case this ever happens to you, hypothetically — here is everything I know about never letting it happen again.
Directory Structure: Monorepo with Clear Separation
The first thing that enables everything else is a clean directory structure. After trying several approaches, this is what I settled on:
```
infrastructure/
├── modules/
│   ├── vpc/
│   │   ├── main.tf
│   │   ├── variables.tf
│   │   ├── outputs.tf
│   │   └── README.md
│   ├── eks/
│   │   ├── main.tf
│   │   ├── variables.tf
│   │   └── outputs.tf
│   └── rds/
│       ├── main.tf
│       ├── variables.tf
│       └── outputs.tf
└── environments/
    ├── staging/
    │   ├── main.tf
    │   ├── variables.tf
    │   ├── terraform.tfvars
    │   └── backend.tf
    └── production/
        ├── main.tf
        ├── variables.tf
        ├── terraform.tfvars
        └── backend.tf
```

Modules live in modules/ and contain no environment-specific configuration. Each environment directory under environments/ calls those modules with environment-specific values. This separation means that to deploy something to staging, you work in environments/staging/. To deploy to production, you work in environments/production/. There is no workspace switching. The directory you are in is the environment you are affecting. This alone would have prevented my disaster.
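To make the split concrete, here is a sketch of what an environment entry point might look like. The module inputs, CIDR block, and identifier are illustrative, not from a real config:

```hcl
# environments/staging/main.tf -- illustrative sketch; values are hypothetical
module "vpc" {
  source      = "../../modules/vpc"
  environment = "staging"
  cidr_block  = "10.1.0.0/16"
}

module "rds" {
  source              = "../../modules/rds"
  identifier          = "app-staging"
  instance_class      = var.rds_instance_class
  multi_az            = var.rds_multi_az
  deletion_protection = var.rds_deletion_protection
}
```

The environment directory is thin on purpose: it wires modules to values and nothing else.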
Remote State: S3 Backend with DynamoDB Locking
Local state files are a team antipattern. The moment two people try to run terraform apply at the same time against local state, you have a corrupted state file and a bad time.
Store your state remotely. In AWS, the standard setup is an S3 bucket for state storage and a DynamoDB table for state locking.
```hcl
terraform {
  backend "s3" {
    bucket         = "mycompany-terraform-state"
    key            = "environments/production/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true
    dynamodb_table = "terraform-state-lock"
    kms_key_id     = "arn:aws:kms:us-east-1:123456789012:key/your-kms-key-id"
  }

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }

  required_version = ">= 1.6.0"
}
```

The DynamoDB table needs a single attribute: LockID (String), set as the partition key. When Terraform runs, it acquires a lock in this table. If another process tries to run at the same time, it sees the lock and waits or exits. This prevents concurrent apply operations from corrupting state.
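The bucket and lock table have to exist before any backend can use them, so they are typically created once in a small bootstrap configuration. A minimal sketch, with names matching the backend block above; versioning is enabled so a bad state push can be rolled back:

```hcl
# One-time bootstrap for the state backend (hypothetical names)
resource "aws_s3_bucket" "tf_state" {
  bucket = "mycompany-terraform-state"
}

# Versioning lets you recover earlier revisions of the state file
resource "aws_s3_bucket_versioning" "tf_state" {
  bucket = aws_s3_bucket.tf_state.id
  versioning_configuration {
    status = "Enabled"
  }
}

# Lock table: a single string partition key named LockID
resource "aws_dynamodb_table" "tf_lock" {
  name         = "terraform-state-lock"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "LockID"

  attribute {
    name = "LockID"
    type = "S"
  }
}
```

This bootstrap configuration is the one place where local state is tolerable, since it exists only to create the remote backend everything else uses.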
State management rules I enforce strictly:

- Never edit the state file by hand. Use terraform state mv and terraform state rm when state surgery is unavoidable.
- Never commit state files to version control; they can contain secrets in plaintext.
- Keep versioning enabled on the state bucket so a corrupted or mistaken state push can be rolled back.
Module Design: One Module Per Logical Unit
A good module is like a good function: it does one thing, has clearly defined inputs and outputs, and hides its internal complexity.
One module per logical infrastructure unit. The VPC module handles the VPC, subnets, route tables, NAT gateways, and internet gateway. The EKS module handles the cluster, node groups, and IAM roles. The RDS module handles the database, subnet groups, and parameter groups. Nothing leaks across module boundaries.
```hcl
# modules/rds/variables.tf
variable "identifier" {
  description = "Unique identifier for the RDS instance"
  type        = string
}

variable "engine_version" {
  description = "PostgreSQL engine version"
  type        = string
  default     = "15.4"
}

variable "instance_class" {
  description = "RDS instance type"
  type        = string
}

variable "allocated_storage" {
  description = "Initial storage in GB"
  type        = number
  default     = 100
}

variable "multi_az" {
  description = "Enable Multi-AZ for high availability"
  type        = bool
  default     = true
}

variable "deletion_protection" {
  description = "Prevent accidental deletion"
  type        = bool
  default     = true
}
```

Note that deletion_protection defaults to true. Someone added that after an incident. Very professionally. Any module that can destroy data should have deletion protection on by default, with the caller explicitly setting it to false only in non-production environments.
Module versioning: if you publish modules to a private Terraform registry or reference them via Git tags, pin to a specific version. Do not use HEAD or latest. Breaking changes in a module should require a deliberate version bump in the caller.
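A Git-tagged module reference might look like the following; the repository URL and tag are hypothetical:

```hcl
# Pinned to an exact tag: upgrading is a deliberate, reviewable diff
module "rds" {
  source = "git::https://github.com/mycompany/terraform-modules.git//rds?ref=v1.4.2"

  identifier     = "app-production"
  instance_class = "db.r6g.xlarge"
}
```

Bumping `ref` from v1.4.2 to v1.5.0 then shows up in code review like any other change, which is exactly the point.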
Variable Hierarchy: Never Hardcode Anything
Every value that differs between environments belongs in a variable. Every value that differs between runs belongs in a variable. Nothing is hardcoded in main.tf.
```hcl
# environments/production/terraform.tfvars
aws_region              = "us-east-1"
environment             = "production"
eks_cluster_version     = "1.29"
eks_node_instance_types = ["m6i.xlarge", "m6i.2xlarge"]
rds_instance_class      = "db.r6g.xlarge"
rds_multi_az            = true
rds_deletion_protection = true
```

```hcl
# environments/staging/terraform.tfvars
aws_region              = "us-east-1"
environment             = "staging"
eks_cluster_version     = "1.29"
eks_node_instance_types = ["t3.large"]
rds_instance_class      = "db.t3.medium"
rds_multi_az            = false
rds_deletion_protection = false
```

For secrets (database passwords, API keys), do not put them in tfvars files. Reference them from AWS Secrets Manager or SSM Parameter Store using data sources:
```hcl
data "aws_ssm_parameter" "db_password" {
  name            = "/production/rds/master-password"
  with_decryption = true
}
```

Workspace Strategy vs Directory Separation
Terraform workspaces allow a single configuration to manage multiple state files. You switch between them with terraform workspace select production. This is what the hypothetical version of me was incorrectly using when the hypothetical disaster hypothetically occurred.
My current recommendation: use directory separation for major environment boundaries (production, staging, dev), not workspaces. The cognitive overhead of remembering which workspace you are in is too high, and the consequences of being in the wrong workspace are severe.
Use workspaces for minor variations within an environment — for example, if you need to spin up a temporary clone of staging for a load test. In that case, workspaces with a clearly named prefix (loadtest-week23) make sense. But production vs staging should be different directories.
If you do use workspaces, add this to every main.tf that manages production resources:
```hcl
locals {
  is_production = terraform.workspace == "production"
}

resource "aws_db_instance" "main" {
  # ...
  deletion_protection = local.is_production
  multi_az            = local.is_production
}
```

At minimum, make the consequences of being in the wrong workspace visible in the plan output.
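You can go one step further and make a mismatched workspace fail the plan outright. A sketch using a lifecycle precondition (available since Terraform 1.2); it assumes your configuration also defines a var.environment that names the intended environment:

```hcl
# Fails the plan if the selected workspace disagrees with var.environment
resource "aws_db_instance" "main" {
  # ...

  lifecycle {
    precondition {
      condition     = terraform.workspace == var.environment
      error_message = "Selected workspace does not match var.environment. You are probably in the wrong workspace."
    }
  }
}
```

A failed plan with a loud message is a far cheaper lesson than a completed destroy.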
Plan Before Every Apply: Make It Mandatory in CI
terraform plan should run before every terraform apply. Not sometimes. Always. And the plan output should be reviewed — not just "yes it passed," but actually read.
In CI, this means: run terraform plan on every pull request to infrastructure changes, post the plan output as a PR comment, and only allow terraform apply after the PR is merged to main.
Here is the GitHub Actions workflow I use:
```yaml
name: Terraform

on:
  pull_request:
    paths:
      - 'infrastructure/**'
  push:
    branches:
      - main
    paths:
      - 'infrastructure/**'

jobs:
  terraform:
    name: Terraform Plan / Apply
    runs-on: ubuntu-latest
    permissions:
      id-token: write
      contents: read
      pull-requests: write
    steps:
      - uses: actions/checkout@v4

      - name: Configure AWS Credentials
        uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::123456789012:role/TerraformCIRole
          aws-region: us-east-1

      - name: Setup Terraform
        uses: hashicorp/setup-terraform@v3
        with:
          terraform_version: 1.6.6

      - name: Terraform Init
        run: terraform init
        working-directory: infrastructure/environments/production

      - name: Terraform Validate
        run: terraform validate
        working-directory: infrastructure/environments/production

      - name: Terraform Plan
        id: plan
        run: terraform plan -out=tfplan -no-color
        working-directory: infrastructure/environments/production

      - name: Post Plan to PR
        if: github.event_name == 'pull_request'
        uses: actions/github-script@v7
        with:
          script: |
            const output = `#### Terraform Plan
            \`\`\`
            ${{ steps.plan.outputs.stdout }}
            \`\`\``;
            github.rest.issues.createComment({
              issue_number: context.issue.number,
              owner: context.repo.owner,
              repo: context.repo.repo,
              body: output
            })

      - name: Terraform Apply
        if: github.event_name == 'push' && github.ref == 'refs/heads/main'
        run: terraform apply tfplan
        working-directory: infrastructure/environments/production
```

The IAM role used by CI should have only the permissions needed to manage your specific resources. Not AdministratorAccess. Use OIDC-based federation so there are no long-lived access keys to rotate or leak.
Drift Detection
Real infrastructure drifts. Someone makes a manual change in the console "just this once." A vendor updates a managed resource. An automated process modifies a tag. Drift is normal. The problem is when drift accumulates silently.
Run terraform plan -detailed-exitcode on a schedule (nightly or weekly) and alert when exit code is 2 (changes detected). This is your drift detector.
```shell
terraform plan -detailed-exitcode
# Exit code 0: no changes
# Exit code 1: error
# Exit code 2: changes detected
```

Wire this into your alerting. A drift-detected alert at 2am is much better than discovering that your "infrastructure as code" no longer matches reality when you try to recreate an environment.
Common Mistakes I See Constantly
count vs for_each: Use for_each for maps and sets of strings. count creates indexed resources (aws_iam_user.users[0], aws_iam_user.users[1]). If you insert an item at the beginning of the list, all indexes shift and Terraform wants to recreate everything. for_each creates resources keyed by map key or set value, so insertions do not cause unnecessary recreation.
```hcl
# Wrong for dynamic sets — index shifting causes chaos
resource "aws_iam_user" "users" {
  count = length(var.user_names)
  name  = var.user_names[count.index]
}
```

```hcl
# Right — keyed by name, stable through insertions
resource "aws_iam_user" "users" {
  for_each = toset(var.user_names)
  name     = each.value
}
```

Not using data sources: If you need to reference something that exists outside your current configuration — an AMI ID, a Route53 zone, another team's VPC — use a data source. Do not hardcode the ID. Hardcoded IDs break when you change regions, run in a new account, or when the underlying resource is recreated.
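For instance, instead of hardcoding an AMI ID, look it up at plan time. A sketch; the name filter is a guess at a typical Amazon Linux 2023 naming pattern, so verify it against your region:

```hcl
# Resolve the latest matching AMI at plan time instead of pinning an ID
data "aws_ami" "al2023" {
  most_recent = true
  owners      = ["amazon"]

  filter {
    name   = "name"
    values = ["al2023-ami-*-x86_64"]
  }
}

resource "aws_instance" "web" {
  ami           = data.aws_ami.al2023.id
  instance_type = "t3.micro"
}
```

The same lookup works unchanged in any region or account, which is the whole argument against hardcoding.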
Circular dependencies: Terraform builds a dependency graph from your resource references. If resource A references resource B and resource B references resource A, you get a circular dependency error. depends_on cannot fix this — it only adds edges to the graph. The fix is to restructure so the graph becomes acyclic, usually by extracting the mutual reference into a separate resource.
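The classic case is two security groups that each allow traffic from the other. Inline ingress blocks on both groups create a cycle; standalone aws_security_group_rule resources break it. A sketch with hypothetical names and a PostgreSQL port:

```hcl
# Neither group references the other, so no cycle exists here
resource "aws_security_group" "app" {
  name_prefix = "app-"
  vpc_id      = var.vpc_id
}

resource "aws_security_group" "db" {
  name_prefix = "db-"
  vpc_id      = var.vpc_id
}

# The cross-reference lives in its own resource, which depends on
# both groups without either group depending on the other.
resource "aws_security_group_rule" "db_from_app" {
  type                     = "ingress"
  from_port                = 5432
  to_port                  = 5432
  protocol                 = "tcp"
  security_group_id        = aws_security_group.db.id
  source_security_group_id = aws_security_group.app.id
}
```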
The Import and Replace Commands
When you take over an existing AWS account or need to bring manually-created resources under Terraform management, use terraform import. It pulls the existing resource into your state file so Terraform knows about it.
```shell
# Import an existing S3 bucket into Terraform state
terraform import aws_s3_bucket.uploads my-existing-bucket-name

# Import an existing RDS instance
terraform import aws_db_instance.main mydb-production
```

When you need to force-replace a resource (for example, an EC2 instance that is misbehaving and needs to be freshly provisioned), use terraform apply -replace instead of the old terraform taint:

```shell
terraform apply -replace="aws_instance.web_server"
```

This marks a specific resource for replacement in the next plan/apply cycle without destroying and recreating everything else.
What I Would Do Differently. Hypothetically.
After the RDS incident that definitely did or did not happen, I adopted several non-negotiable rules — the kind of rules a person arrives at either through hard experience or through being extremely, almost suspiciously, cautious from day one: directories instead of workspaces for environment boundaries, deletion protection on by default, remote state with locking, and a plan that is actually read before every apply.
Whether these rules came from 4 hours of downtime and a very uncomfortable conversation with a CTO, or from pure foresight and good instincts — I'll leave that to your imagination.
Measure twice. Apply once. And for the love of everything, check your workspace before you type "yes."