Infrastructure as Code with Terraform

Terraform has become the de facto standard for infrastructure as code. But managing Terraform at scale requires understanding state management, module design, workspace strategy, and remote backends.

1. Remote State Backends

Never use local state in production:

# S3 backend (AWS)
terraform {
  backend "s3" {
    bucket         = "my-company-terraform-state"
    key            = "prod/network/terraform.tfstate"
    region         = "us-east-1"
    dynamodb_table = "terraform-state-lock"
    encrypt        = true
  }
}

# GCS backend (GCP)
terraform {
  backend "gcs" {
    bucket = "my-company-terraform-state"
    prefix = "prod/network"
  }
}

Why remote state:

  • Team collaboration (shared state)
  • State locking (prevent concurrent modifications)
  • Backup and versioning
  • Encryption at rest

DynamoDB locking:

# Create the lock table
aws dynamodb create-table \
  --table-name terraform-state-lock \
  --attribute-definitions AttributeName=LockID,AttributeType=S \
  --key-schema AttributeName=LockID,KeyType=HASH \
  --billing-mode PAY_PER_REQUEST
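
The bucket and lock table can also be managed with Terraform instead of the CLI. The following is a minimal sketch, assuming AWS provider v4 or later (where versioning and encryption are separate resources). It reuses the bucket and table names from the backend block above; the resource labels (tf_state, tf_lock) are illustrative. A bootstrap configuration like this has to start out with local state, since the backend it creates does not exist yet.

```hcl
# State bucket; prevent_destroy guards against accidental deletion
resource "aws_s3_bucket" "tf_state" {
  bucket = "my-company-terraform-state"

  lifecycle {
    prevent_destroy = true
  }
}

# Keep old state versions so a bad write can be rolled back
resource "aws_s3_bucket_versioning" "tf_state" {
  bucket = aws_s3_bucket.tf_state.id

  versioning_configuration {
    status = "Enabled"
  }
}

# Encrypt state at rest
resource "aws_s3_bucket_server_side_encryption_configuration" "tf_state" {
  bucket = aws_s3_bucket.tf_state.id

  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm = "aws:kms"
    }
  }
}

# State often contains secrets; block all public access
resource "aws_s3_bucket_public_access_block" "tf_state" {
  bucket                  = aws_s3_bucket.tf_state.id
  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}

# Lock table equivalent to the CLI command above
resource "aws_dynamodb_table" "tf_lock" {
  name         = "terraform-state-lock"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "LockID"

  attribute {
    name = "LockID"
    type = "S"
  }
}
```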

2. State File Structure

Organize state files by environment and component:

terraform/
├── environments/
│   ├── prod/
│   │   ├── networking/
│   │   ├── database/
│   │   └── kubernetes/
│   ├── staging/
│   │   ├── networking/
│   │   └── database/
│   └── dev/
│       └── networking/
└── modules/
    ├── vpc/
    ├── ecs-service/
    └── rds/

Each leaf directory has its own state file and backend config. This limits the blast radius: if one component's state is corrupted, the others are unaffected.
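
Splitting state means components can no longer reference each other's resources directly. One common way to bridge them is the terraform_remote_state data source. The sketch below assumes the networking component stores its state under the key used in section 1 and exposes a root-level output named private_subnet_ids; the file path and the subnet group name are hypothetical.

```hcl
# environments/prod/database/main.tf (hypothetical file)

# Read the outputs of the networking component's state (read-only)
data "terraform_remote_state" "network" {
  backend = "s3"

  config = {
    bucket = "my-company-terraform-state"
    key    = "prod/network/terraform.tfstate"
    region = "us-east-1"
  }
}

# Use a networking output without sharing a state file with it
resource "aws_db_subnet_group" "main" {
  name       = "prod-database"
  subnet_ids = data.terraform_remote_state.network.outputs.private_subnet_ids
}
```

Only root-level outputs are visible this way, so the networking component has to declare everything other components need as an output.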

3. Module Design

Good module interface:

# modules/vpc/main.tf
variable "vpc_cidr" {
  type        = string
  description = "CIDR block for the VPC"
}

variable "environment" {
  type        = string
  description = "Environment name (dev, staging, prod)"
  validation {
    condition     = contains(["dev", "staging", "prod"], var.environment)
    error_message = "Environment must be dev, staging, or prod."
  }
}

variable "enable_nat_gateway" {
  type        = bool
  default     = true
  description = "Whether to deploy a NAT Gateway"
}

variable "tags" {
  type        = map(string)
  default     = {}
  description = "Additional tags for all resources"
}

# Outputs
output "vpc_id" {
  value       = aws_vpc.main.id
  description = "The ID of the VPC"
}

output "private_subnet_ids" {
  value       = aws_subnet.private[*].id
  description = "IDs of the private subnets"
}

Module design principles:

  • One module = one responsibility
  • Use description on every variable and output
  • Validate inputs with validation blocks
  • Return meaningful outputs (IDs, ARNs, DNS names)
  • Don't hardcode environment-specific values
  • Version your modules (Git tags)
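
To illustrate the "don't hardcode environment-specific values" principle, here is a sketch of how the module body might consume the variables defined above. It assumes AWS provider v5 (for the domain argument on aws_eip); the aws_eip.nat resource and the naming scheme are illustrative and not taken from a real module.

```hcl
# modules/vpc/main.tf (continued, illustrative)

resource "aws_vpc" "main" {
  cidr_block = var.vpc_cidr

  # Caller-supplied tags are merged with tags derived from the inputs,
  # so nothing environment-specific is hardcoded in the module
  tags = merge(var.tags, {
    Name        = "${var.environment}-vpc"
    Environment = var.environment
  })
}

# Optional resources are toggled with count: 1 instance or none
resource "aws_eip" "nat" {
  count  = var.enable_nat_gateway ? 1 : 0
  domain = "vpc"

  tags = merge(var.tags, {
    Name = "${var.environment}-nat-eip"
  })
}
```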

Calling the module:

module "vpc" {
  source      = "git::https://github.com/my-org/tf-modules.git//vpc?ref=v1.2.0"
  vpc_cidr    = "10.0.0.0/16"
  environment = "prod" # must be one of the values the module's validation accepts
  tags        = { CostCenter = "Platform", Owner = "Infra" }
}

resource "aws_security_group" "web" {
  vpc_id = module.vpc.vpc_id
  # ...
}

4. Workspaces vs Directories

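Approach             | Pros                                | Cons
---------------------|-------------------------------------|--------------------------------------------
Workspaces           | Single config, multiple states      | Easy to accidentally modify wrong workspace
Separate directories | Clear isolation, different backends | More code duplication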

Recommendation: Use separate directories for different environments. Use workspaces only for short-lived feature branches or preview environments.

# Directory-based approach (recommended)
cd terraform/environments/prod/networking
terraform plan
terraform apply

# Workspace approach (for ephemeral environments)
terraform workspace new feature-x
terraform workspace select feature-x
terraform plan
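
Inside the configuration, the current workspace name is available as terraform.workspace, which helps keep ephemeral environments from colliding. A minimal sketch, assuming a hypothetical preview bucket and naming scheme:

```hcl
locals {
  # e.g. "preview-feature-x" when the feature-x workspace is selected
  name_prefix = "preview-${terraform.workspace}"
}

resource "aws_s3_bucket" "assets" {
  # Hypothetical bucket; the prefix keeps each workspace's resources distinct
  bucket = "${local.name_prefix}-assets"

  tags = {
    Workspace = terraform.workspace
  }
}
```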

5. CI/CD for Terraform

name: Terraform

on:
  pull_request:
    paths: ['terraform/**']
  push:
    branches: [main]
    paths: ['terraform/**']

jobs:
  plan:
    runs-on: ubuntu-latest
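    defaults:
      run:
        # Point at a leaf directory; each one has its own state and backend config
        working-directory: terraform/environments/prod/networking
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
        with: { terraform_version: '1.8.0' }

      # Cloud credentials must be configured before init (not shown here)

      - name: Terraform Init
        run: terraform init -input=false

      - name: Terraform Format Check
        run: terraform fmt -check -recursive

      - name: Terraform Validate
        run: terraform validate

      - name: Terraform Plan
        run: terraform plan -input=false

  apply:
    needs: plan
    if: github.ref == 'refs/heads/main'
    runs-on: ubuntu-latest
    defaults:
      run:
        working-directory: terraform/environments/prod/networking
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
        with: { terraform_version: '1.8.0' } # same version as the plan job
      - run: terraform init -input=false
      # Re-plans and applies without a prompt; the plan from the plan job is not reused
      - run: terraform apply -auto-approve -input=false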

6. State Management Commands

# List resources in state
terraform state list

# Show a specific resource
terraform state show aws_instance.web

# Move a resource (rename or move between modules)
terraform state mv aws_instance.web module.web.aws_instance.web

# Remove a resource from state (without destroying it)
terraform state rm aws_instance.legacy

# Replace a failed resource
terraform apply -replace="aws_instance.web"

# Import existing infrastructure
terraform import aws_s3_bucket.my_bucket my-existing-bucket

# View state versions (with S3 backend versioning)
aws s3api list-object-versions --bucket my-company-terraform-state --prefix prod/network/terraform.tfstate
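
Several of these state operations also have declarative equivalents that go through the normal plan and review flow instead of being run by hand. The sketch below mirrors the commands above and assumes a recent Terraform: moved blocks need 1.1 or later, import blocks 1.5 or later, and removed blocks 1.7 or later. The three blocks are independent and would normally appear in separate changes.

```hcl
# Equivalent of: terraform state mv aws_instance.web module.web.aws_instance.web
# (assumes a module "web" block that now declares the instance)
moved {
  from = aws_instance.web
  to   = module.web.aws_instance.web
}

# Equivalent of: terraform import aws_s3_bucket.my_bucket my-existing-bucket
# (the matching resource block must exist in the configuration)
import {
  to = aws_s3_bucket.my_bucket
  id = "my-existing-bucket"
}

resource "aws_s3_bucket" "my_bucket" {
  bucket = "my-existing-bucket"
}

# Equivalent of: terraform state rm aws_instance.legacy
# (delete the resource block for aws_instance.legacy in the same change)
removed {
  from = aws_instance.legacy

  lifecycle {
    destroy = false # forget the resource, leave the real instance running
  }
}
```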

7. Common Terraform Anti-Patterns

1. Hardcoding secrets in variables files:

# BAD
variable "db_password" {
  default = "supersecret123"
}

# GOOD: read it from a secrets manager (the value is still stored in state, so keep state encrypted)
data "aws_secretsmanager_secret_version" "db_password" {
  secret_id = "terraform/db-password"
}
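
As a sketch of how the looked-up secret might then be consumed, the example below passes it to a database resource. It assumes the secret is stored as a plain string rather than JSON; the aws_db_instance arguments (identifier, engine, sizes, username) are illustrative values, not taken from the original.

```hcl
data "aws_secretsmanager_secret_version" "db_password" {
  secret_id = "terraform/db-password"
}

resource "aws_db_instance" "main" {
  # Illustrative settings only
  identifier          = "prod-app-db"
  engine              = "postgres"
  instance_class      = "db.t3.micro"
  allocated_storage   = 20
  username            = "app"
  skip_final_snapshot = true

  # The password never appears in the configuration or in version control
  password = data.aws_secretsmanager_secret_version.db_password.secret_string
}
```

If the secret were stored as JSON, the value would need to be decoded first with jsondecode() and the relevant key selected.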

2. Monolithic state file:

With one state file for everything, every change plans and refreshes the entire infrastructure, and a single bad apply can touch all of it. Break the state into components (network, database, app, monitoring).

3. Not pinning provider versions:

terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"  # Pin major version
    }
  }
}
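
The same reasoning applies to the Terraform CLI itself. A minimal sketch, assuming you want local runs to match the 1.8.0 release used in the CI example in section 5:

```hcl
terraform {
  # Allows 1.8.x patch releases, rejects 1.9 and later
  required_version = "~> 1.8.0"

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0" # any 5.x release
    }
  }
}
```

The version constraint only sets a range. The exact provider versions selected by terraform init are recorded in .terraform.lock.hcl, which should be committed so that every run uses the same builds.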

4. Using count and for_each interchangeably without understanding the difference:

# count: creates a list of resources indexed by number
resource "aws_iam_user" "this" {
  count = length(var.users)
  name  = var.users[count.index]
}
# Removing a user from the MIDDLE of var.users will shift all indices!

# for_each: creates a map of resources keyed by each value
resource "aws_iam_user" "this" {
  for_each = toset(var.users)
  name     = each.key
}
# Removing a user only affects that one resource
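
When migrating an existing count-based resource to for_each, the addresses change from numeric indices to string keys, and without further instruction Terraform would plan to destroy and recreate every user. A sketch of how moved blocks (Terraform 1.1 or later) can map old addresses to new ones; the user names and the default list are hypothetical.

```hcl
variable "users" {
  type    = list(string)
  default = ["alice", "bob"] # hypothetical users
}

resource "aws_iam_user" "this" {
  for_each = toset(var.users)
  name     = each.key
}

# Old address (count)          New address (for_each)
# aws_iam_user.this[0]   ->    aws_iam_user.this["alice"]
# aws_iam_user.this[1]   ->    aws_iam_user.this["bob"]
moved {
  from = aws_iam_user.this[0]
  to   = aws_iam_user.this["alice"]
}

moved {
  from = aws_iam_user.this[1]
  to   = aws_iam_user.this["bob"]
}

# for_each resources form a map, so outputs are built with a for expression
output "user_arns" {
  value = { for name, user in aws_iam_user.this : name => user.arn }
}
```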

8. Best Practices Checklist

  • Remote state with locking
  • State per environment + component
  • Module versioning (Git tags)
  • Input validation blocks
  • Sensitive variables marked as sensitive = true
  • Lifecycle rules (prevent_destroy for critical resources)
  • CI/CD with plan/apply pipeline
  • Regular terraform validate and fmt in CI
  • State backup and versioning
  • No hardcoded secrets in config
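
Two of the checklist items, sensitive variables and lifecycle rules, are not shown elsewhere in this article. A minimal sketch of both; the variable and bucket names are hypothetical.

```hcl
# sensitive = true redacts the value in plan and apply output.
# It does not encrypt it: the value is still written to state.
variable "db_password" {
  type        = string
  description = "Master password for the database"
  sensitive   = true
}

# prevent_destroy makes any plan that would delete this resource fail
resource "aws_s3_bucket" "audit_logs" {
  bucket = "my-company-audit-logs"

  lifecycle {
    prevent_destroy = true
  }
}
```

To intentionally delete a protected resource, the prevent_destroy setting has to be removed from the configuration first.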

Summary

Terraform at scale is about state management. A well-organized state structure with remote backends, module composition, and clear environment separation prevents the infrastructure drift and configuration chaos that plague teams managing cloud resources manually.

Start with one environment (dev), one component (networking), a remote S3 backend, and a few modules. Then expand as your infrastructure grows.