Terraform has become the de facto standard for infrastructure as code. But managing Terraform at scale requires understanding state management, module design, workspace strategy, and remote backends.
1. Remote State Backends
Never use local state in production:
# S3 backend (AWS)
terraform {
  backend "s3" {
    bucket         = "my-company-terraform-state"
    key            = "prod/network/terraform.tfstate"
    region         = "us-east-1"
    dynamodb_table = "terraform-state-lock"
    encrypt        = true
  }
}

# GCS backend (GCP)
terraform {
  backend "gcs" {
    bucket = "my-company-terraform-state"
    prefix = "prod/network"
  }
}
Why remote state:
- Team collaboration (shared state)
- State locking (prevent concurrent modifications)
- Backup and versioning
- Encryption at rest
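Versioning and encryption at rest are properties of the bucket itself, not of the backend block. As a minimal sketch, assuming the state bucket is created by a separate bootstrap configuration (the resource labels here are illustrative, and the KMS choice is an assumption), the bucket could be defined like this:

```hcl
# Bootstrap configuration, applied once from its own directory.
# The bucket name matches the backend example above; adjust it to your own.
resource "aws_s3_bucket" "terraform_state" {
  bucket = "my-company-terraform-state"
}

# Keep every previous version of the state file so a bad write can be rolled back
resource "aws_s3_bucket_versioning" "terraform_state" {
  bucket = aws_s3_bucket.terraform_state.id

  versioning_configuration {
    status = "Enabled"
  }
}

# Encrypt state objects at rest by default
resource "aws_s3_bucket_server_side_encryption_configuration" "terraform_state" {
  bucket = aws_s3_bucket.terraform_state.id

  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm = "aws:kms"
    }
  }
}

# State files often contain secrets, so block all public access
resource "aws_s3_bucket_public_access_block" "terraform_state" {
  bucket                  = aws_s3_bucket.terraform_state.id
  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}
```

This bootstrap configuration has to keep its own state somewhere else (often locally at first), because the bucket it creates does not exist until the first apply.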
DynamoDB locking:
# Create the lock table
aws dynamodb create-table \
  --table-name terraform-state-lock \
  --attribute-definitions AttributeName=LockID,AttributeType=S \
  --key-schema AttributeName=LockID,KeyType=HASH \
  --billing-mode PAY_PER_REQUEST
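If you would rather keep the lock table in code than create it by hand, the same table can be sketched in HCL. This assumes it lives in a bootstrap configuration alongside the state bucket; the resource label is illustrative.

```hcl
# Same table as the CLI command above: string hash key "LockID", on-demand billing
resource "aws_dynamodb_table" "terraform_state_lock" {
  name         = "terraform-state-lock"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "LockID"

  attribute {
    name = "LockID"
    type = "S"
  }
}
```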
2. State File Structure
Organize state files by environment and component:
terraform/
├── environments/
│   ├── prod/
│   │   ├── networking/
│   │   ├── database/
│   │   └── kubernetes/
│   ├── staging/
│   │   ├── networking/
│   │   └── database/
│   └── dev/
│       └── networking/
└── modules/
    ├── vpc/
    ├── ecs-service/
    └── rds/
Each leaf directory has its own state file and backend config. This limits the blast radius: if one component's state is corrupted, the others are unaffected.
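Splitting state by component means components can no longer reference each other's resources directly. One common way to bridge them is the terraform_remote_state data source. The sketch below assumes the networking component exposes private_subnet_ids as a root-level output and uses the S3 backend settings shown earlier; the file path and the subnet group name are hypothetical.

```hcl
# environments/prod/database/main.tf (hypothetical file)

# Read the outputs of the networking component's state (read-only)
data "terraform_remote_state" "networking" {
  backend = "s3"

  config = {
    bucket = "my-company-terraform-state"
    key    = "prod/network/terraform.tfstate"
    region = "us-east-1"
  }
}

# Use a networking output without sharing a state file with it
resource "aws_db_subnet_group" "main" {
  name       = "prod-database" # hypothetical name
  subnet_ids = data.terraform_remote_state.networking.outputs.private_subnet_ids
}
```

Only root-level outputs are visible this way, so anything another component needs has to be exported explicitly.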
3. Module Design
Good module interface:
# modules/vpc/main.tf
variable "vpc_cidr" {
  type        = string
  description = "CIDR block for the VPC"
}

variable "environment" {
  type        = string
  description = "Environment name (dev, staging, prod)"

  validation {
    condition     = contains(["dev", "staging", "prod"], var.environment)
    error_message = "Environment must be dev, staging, or prod."
  }
}

variable "enable_nat_gateway" {
  type        = bool
  default     = true
  description = "Whether to deploy a NAT Gateway"
}

variable "tags" {
  type        = map(string)
  default     = {}
  description = "Additional tags for all resources"
}

# Outputs
output "vpc_id" {
  value       = aws_vpc.main.id
  description = "The ID of the VPC"
}

output "private_subnet_ids" {
  value       = aws_subnet.private[*].id
  description = "IDs of the private subnets"
}
Module design principles:
- One module = one responsibility
- Use description on every variable and output
- Validate inputs with validation blocks
- Return meaningful outputs (IDs, ARNs, DNS names)
- Don't hardcode environment-specific values
- Version your modules (Git tags)
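To make a couple of these principles concrete, here is one possible way the module could validate its CIDR input and avoid hardcoded environment values. It is a sketch, not the module's actual implementation: the common_tags local and the ManagedBy tag are assumptions, and the vpc_cidr variable is a stricter variant of the one shown above.

```hcl
# Stricter variant of vpc_cidr: reject values that are not valid CIDR blocks
variable "vpc_cidr" {
  type        = string
  description = "CIDR block for the VPC"

  validation {
    # cidrhost() errors on malformed input; can() turns that error into false
    condition     = can(cidrhost(var.vpc_cidr, 0))
    error_message = "vpc_cidr must be a valid CIDR block, for example 10.0.0.0/16."
  }
}

variable "environment" {
  type        = string
  description = "Environment name (dev, staging, prod)"
}

variable "tags" {
  type        = map(string)
  default     = {}
  description = "Additional tags for all resources"
}

locals {
  # Caller-supplied tags plus tags derived from inputs; nothing environment-specific is hardcoded
  common_tags = merge(var.tags, {
    Environment = var.environment
    ManagedBy   = "terraform"
  })
}

resource "aws_vpc" "main" {
  cidr_block = var.vpc_cidr
  tags       = merge(local.common_tags, { Name = "${var.environment}-vpc" })
}
```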
Calling the module:
module "vpc" {
  source      = "git::https://github.com/my-org/tf-modules.git//vpc?ref=v1.2.0"
  vpc_cidr    = "10.0.0.0/16"
  environment = "prod" # must be one of the values the module's validation allows
  tags        = { CostCenter = "Platform", Owner = "Infra" }
}

resource "aws_security_group" "web" {
  vpc_id = module.vpc.vpc_id
  # ...
}
4. Workspaces vs Directories
| Approach | Pros | Cons |
|---|---|---|
| Workspaces | Single config, multiple states | Easy to accidentally modify wrong workspace |
| Separate directories | Clear isolation, different backends | More code duplication |
Recommendation: Use separate directories for different environments. Use workspaces only for short-lived feature branches or preview environments.
# Directory-based approach (recommended)
cd terraform/environments/prod/networking
terraform plan
terraform apply
# Workspace approach (for ephemeral environments)
terraform workspace new feature-x
terraform workspace select feature-x
terraform plan
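When workspaces are used for ephemeral environments, resource names usually need to include the workspace name so that two workspaces do not collide. A minimal sketch using the built-in terraform.workspace value follows; the "app" prefix and the bucket are hypothetical.

```hcl
locals {
  # terraform.workspace is "default" until another workspace is selected
  name_prefix = terraform.workspace == "default" ? "app" : "app-${terraform.workspace}"
}

# In workspace "feature-x" this bucket is named "app-feature-x-assets"
resource "aws_s3_bucket" "preview_assets" {
  bucket = "${local.name_prefix}-assets"

  tags = {
    Workspace = terraform.workspace
  }
}
```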
5. CI/CD for Terraform
name: Terraform
on:
  pull_request:
    paths: ['terraform/**']
  push:
    branches: [main]
    paths: ['terraform/**']
jobs:
  plan:
    runs-on: ubuntu-latest
    defaults:
      run:
        # Each component directory has its own state, so run one job per component
        working-directory: terraform/environments/prod/networking
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
        with: { terraform_version: '1.8.0' }
      - name: Terraform Init
        run: terraform init -input=false
      - name: Terraform Format Check
        run: terraform fmt -check -recursive
      - name: Terraform Validate
        run: terraform validate
      - name: Terraform Plan
        run: terraform plan -input=false
  apply:
    needs: plan
    # Only apply after the change has been merged to main
    if: github.ref == 'refs/heads/main'
    runs-on: ubuntu-latest
    defaults:
      run:
        working-directory: terraform/environments/prod/networking
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
        with: { terraform_version: '1.8.0' } # same version as the plan job
      - run: terraform init -input=false
      # Note: this re-plans at apply time rather than applying the reviewed plan
      - run: terraform apply -auto-approve -input=false
6. State Management Commands
# List resources in state
terraform state list
# Show a specific resource
terraform state show aws_instance.web
# Move a resource (rename or move between modules)
terraform state mv aws_instance.web module.web.aws_instance.web
# Remove a resource from state (without destroying it)
terraform state rm aws_instance.legacy
# Replace a failed resource
terraform apply -replace="aws_instance.web"
# Import existing infrastructure
terraform import aws_s3_bucket.my_bucket my-existing-bucket
# View state versions (with S3 backend versioning)
aws s3api list-object-versions --bucket my-company-terraform-state --prefix prod/network/terraform.tfstate
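Several of these commands have declarative counterparts that go through the normal plan and review cycle instead of editing state by hand. The sketch below assumes a recent Terraform release (moved blocks need 1.1 or later, import blocks 1.5 or later, removed blocks 1.7 or later) and reuses the resource addresses from the commands above.

```hcl
# Counterpart of: terraform state mv aws_instance.web module.web.aws_instance.web
moved {
  from = aws_instance.web
  to   = module.web.aws_instance.web
}

# Counterpart of: terraform import aws_s3_bucket.my_bucket my-existing-bucket
# A matching resource "aws_s3_bucket" "my_bucket" block must exist in the configuration.
import {
  to = aws_s3_bucket.my_bucket
  id = "my-existing-bucket"
}

# Counterpart of: terraform state rm aws_instance.legacy
# The resource block for aws_instance.legacy must be deleted from the configuration.
removed {
  from = aws_instance.legacy

  lifecycle {
    destroy = false # forget the resource without destroying it
  }
}
```

Because these blocks live in the configuration, the change shows up in terraform plan and in code review before anything is applied.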
7. Common Terraform Anti-Patterns
1. Hardcoding secrets in variables files:
# BAD
variable "db_password" {
  default = "supersecret123"
}

# GOOD: use a secrets manager
data "aws_secretsmanager_secret_version" "db_password" {
  secret_id = "terraform/db-password"
}
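For completeness, here is a sketch of how the looked-up secret might be consumed. The database settings (identifier, engine, instance class, storage, username) are hypothetical placeholders, and the secret is assumed to hold a plain string.

```hcl
data "aws_secretsmanager_secret_version" "db_password" {
  secret_id = "terraform/db-password"
}

resource "aws_db_instance" "main" {
  identifier        = "prod-database" # hypothetical
  engine            = "postgres"
  instance_class    = "db.t3.micro"
  allocated_storage = 20
  username          = "app"

  # The value never appears in the configuration or in version control
  password = data.aws_secretsmanager_secret_version.db_password.secret_string
}
```

The password is still written to the state file, which is one more reason to keep state in an encrypted, access-controlled remote backend.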
2. Monolithic state file:
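With one state file for everything, every change plans against the entire infrastructure, so plans are slow and a single mistake or corrupted state affects everything. Break it into components (network, database, app, monitoring).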
3. Not pinning provider versions:
terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0" # Pin major version
    }
  }
}
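The same reasoning applies to Terraform itself. A minimal sketch that also constrains the CLI version is shown below; the 1.8 series is an assumption chosen to match the version used in the CI example.

```hcl
terraform {
  # Allow only 1.8.x patch releases of the Terraform CLI
  required_version = "~> 1.8.0"

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0" # any 5.x release, never 6.0
    }
  }
}
```

Committing the .terraform.lock.hcl file that terraform init generates additionally records the exact provider versions selected, so every machine and CI run uses the same ones.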
4. Using count and for_each interchangeably without understanding the difference:
# count: creates a list of resources indexed by number
resource "aws_iam_user" "this" {
  count = length(var.users)
  name  = var.users[count.index]
}
# Removing a user from the MIDDLE of var.users shifts every later index,
# so Terraform plans changes for all users after the removed one.

# for_each: creates a map of resources keyed by each value
resource "aws_iam_user" "this" {
  for_each = toset(var.users)
  name     = each.key
}
# Removing a user only affects that one resource
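Switching an existing resource from count to for_each changes every instance address, which Terraform would otherwise treat as destroy and create. One way to migrate without recreating anything is a moved block per instance. The sketch assumes var.users was ["alice", "bob"]; both names are hypothetical.

```hcl
variable "users" {
  type    = list(string)
  default = ["alice", "bob"] # hypothetical users
}

resource "aws_iam_user" "this" {
  for_each = toset(var.users)
  name     = each.key
}

# Map each old numeric index to its new string key
moved {
  from = aws_iam_user.this[0]
  to   = aws_iam_user.this["alice"]
}

moved {
  from = aws_iam_user.this[1]
  to   = aws_iam_user.this["bob"]
}
```

After this change, terraform plan should report the addresses as moved rather than proposing to destroy and recreate the users.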
8. Best Practices Checklist
- Remote state with locking
- State per environment + component
- Module versioning (Git tags)
- Input validation blocks
- Sensitive variables marked as sensitive = true
- Lifecycle rules (prevent_destroy for critical resources)
- CI/CD with plan/apply pipeline
- Regular terraform validate and fmt in CI
- State backup and versioning
- No hardcoded secrets in config
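Two checklist items, sensitive values and lifecycle rules, can be illustrated with a short sketch. The variable, output, and bucket names are hypothetical.

```hcl
# Redacted in plan and apply output; supply it via TF_VAR_api_token or a secrets manager
variable "api_token" {
  type        = string
  sensitive   = true
  description = "Token for a hypothetical external API"
}

# Outputs derived from sensitive values must be marked sensitive as well
output "api_token" {
  value       = var.api_token
  sensitive   = true
  description = "Token passed through for downstream use"
}

resource "aws_s3_bucket" "audit_logs" {
  bucket = "my-company-audit-logs" # hypothetical

  lifecycle {
    # Any plan that would destroy this bucket fails with an error
    prevent_destroy = true
  }
}
```

Note that sensitive = true only hides the value from CLI output; it is still stored in plain text inside the state file.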
Summary
Terraform at scale is largely about state management. A well-organized state structure with remote backends, module composition, and clear environment separation prevents the infrastructure drift and configuration chaos that plague teams managing cloud resources manually.
Start with one environment (dev), one component (networking), a remote S3 backend, and a few modules. Then expand as your infrastructure grows.