Blog Article

Terraform

Apr 17, 2026

7 min read

Day 16: Creating Production-Grade Infrastructure with Terraform

Seven days of building the FastAPI stack. Today: measure it against the production-grade checklist, fill the gaps, and add the automation that makes it safe for a real team to operate.

Production-grade is not a vague aspiration — it is a checklist. Chapter 8 of Terraform: Up & Running defines it as a set of concrete requirements that infrastructure must meet before it can be trusted with real users, real data, and real incidents. Most teams skip half the list and discover the gaps at the worst possible time.

After Days 9–15, the FastAPI stack covers a lot of it. Let's audit what is done, what is missing, and how to close each gap.

The Production-Grade Checklist Audit

Requirement	Status	Notes
Automated, repeatable deployment	YES	`terraform apply` from any machine
Parameterized configuration	YES	`map(object)` env config, no hardcoding
Zero-downtime deployments	YES	Instance refresh (Day 12)
High availability	YES	Multi-AZ ASG, min_size ≥ 2 in prod
Auto-scaling	YES	ASG with max_size > min_size
Secrets management	YES	AWS Secrets Manager + IAM role (Day 13)
Encryption at rest	YES	S3 state bucket (AES256), RDS encrypted
Health checks	YES	ALB + ASG ELB health checks, `/health` endpoint
CloudWatch monitoring	YES	CPU alarm in prod (Day 10)
Centralized logging	YES	CloudWatch Log Group (Day 11)
Multi-region	YES	Provider aliases (Day 14)
Module versioning	YES	Semantic tags on GitHub (Day 9)
State locking	YES	DynamoDB + S3 backend (Day 5)
Alerting	PARTIALLY	Alarm exists but no SNS notification wired up
Network isolation	PARTIALLY	Using default VPC — not private subnets
Automated testing	NO	No tests exist for the module
CI/CD pipeline	NO	`terraform apply` is manual
Module documentation	NO	No READMEs or auto-generated docs
Cost controls	NO	No budget alerts or right-sizing analysis

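The YES items are real. Three of the four NO items (automated testing, CI/CD, and module documentation) are what distinguish "works on my machine" from production-grade. This post addresses those three; cost controls and the two PARTIALLY items stay open for now.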

A YES means the requirement is met and exercised regularly. PARTIALLY means the mechanism exists but is incomplete (an alarm with nowhere to send notifications is not alerting). NO means the gap is unaddressed today.

Gap 1: CI/CD Pipeline

Without this: two engineers running terraform apply from different laptops can overwrite each other's changes, and there is no audit trail of who applied what and when.

Running terraform apply manually is fine for learning. In a team, it is a coordination problem: state locking (Day 5) prevents two applies from running at the same time, but it does not stop an engineer from applying a stale branch and reverting a colleague's change.

The standard pattern:

On pull request: run terraform plan and post the output as a PR comment. The reviewer sees exactly what will change before approving.
On merge to main: run terraform apply automatically.

GitHub Actions workflow

Create .github/workflows/terraform.yml in the infrastructure repo:

name: Terraform

on:
  push:
    branches: [main]
  pull_request:
    branches: [main]

env:
  TF_VERSION: "1.14.9"
  AWS_REGION: "us-east-1"

permissions:
  id-token: write      # required for OIDC auth to AWS
  contents: read
  pull-requests: write # required to post plan output as a PR comment

jobs:
  terraform:
    runs-on: ubuntu-latest

    steps:
      - uses: actions/checkout@v4

      - name: Configure AWS credentials (OIDC)
        uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::${{ secrets.AWS_ACCOUNT_ID }}:role/GitHubActionsRole
          aws-region: ${{ env.AWS_REGION }}

      - name: Set up Terraform
        uses: hashicorp/setup-terraform@v3
        with:
          terraform_version: ${{ env.TF_VERSION }}

      - name: Terraform Init
        run: terraform init
        working-directory: prod/

      - name: Terraform Format Check
        run: terraform fmt -check -recursive
        working-directory: prod/

      - name: Terraform Validate
        run: terraform validate
        working-directory: prod/

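      - name: Terraform Plan
        id: plan
        run: terraform plan -no-color -out=tfplan
        working-directory: prod/
        continue-on-error: true   # post the comment even if plan fails

      - name: Post plan to PR
        if: github.event_name == 'pull_request'
        uses: actions/github-script@v7
        env:
          # Pass the plan through an env var: interpolating it directly into
          # the script breaks when the output contains backticks or ${...}
          PLAN: ${{ steps.plan.outputs.stdout }}
        with:
          script: |
            const output = `#### Terraform Plan
            \`\`\`
            ${process.env.PLAN}
            \`\`\`
            *Pushed by @${{ github.actor }}, Action: \`${{ github.event_name }}\`*`;
            await github.rest.issues.createComment({
              issue_number: context.issue.number,
              owner: context.repo.owner,
              repo: context.repo.repo,
              body: output
            });

      # continue-on-error above would otherwise leave a failed plan green
      - name: Fail the job if plan failed
        if: steps.plan.outcome == 'failure'
        run: exit 1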

      - name: Terraform Apply
        if: github.ref == 'refs/heads/main' && github.event_name == 'push'
        run: terraform apply -auto-approve tfplan
        working-directory: prod/

What this workflow does, in English. On every PR, the workflow checks formatting, validates syntax, runs plan, and posts the plan as a comment so the reviewer sees the exact diff before approving. On merge to main, it runs apply automatically. No one ever runs terraform apply from a laptop again.

Before vs. after, in commands:

Before: cd prod && terraform apply — runs on your laptop, uses your credentials, leaves no record.
After: git push, then CI runs plan, you review the diff in the PR and merge, CI runs apply, and the run is recorded in the GitHub Actions history (logs are retained for 90 days by default).

The OIDC auth detail

OIDC, in plain English. A way for GitHub to prove its identity to AWS so AWS hands it a short-lived credential for that one workflow run — instead of GitHub holding a permanent access key that could leak.

The workflow uses OIDC (OpenID Connect) instead of long-lived AWS access keys stored as GitHub secrets, so there are no static credentials to leak or rotate.

To set it up, create an IAM role with a trust policy that allows the specific GitHub repo to assume it:

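# Assumes aws_iam_openid_connect_provider.github is defined elsewhere with
# url = "https://token.actions.githubusercontent.com" and
# client_id_list = ["sts.amazonaws.com"]
data "aws_iam_policy_document" "github_actions_trust" {
  statement {
    actions = ["sts:AssumeRoleWithWebIdentity"]

    principals {
      type        = "Federated"
      identifiers = [aws_iam_openid_connect_provider.github.arn]
    }

    # The token must have been issued for AWS STS
    condition {
      test     = "StringEquals"
      variable = "token.actions.githubusercontent.com:aud"
      values   = ["sts.amazonaws.com"]
    }

    condition {
      test     = "StringLike"
      variable = "token.actions.githubusercontent.com:sub"
      # Only the specific repo can assume this role, not any GitHub repo
      values   = ["repo:mohamednourdine/terraform-infra:*"]
    }
  }
}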

resource "aws_iam_role" "github_actions" {
  name               = "GitHubActionsRole"
  assume_role_policy = data.aws_iam_policy_document.github_actions_trust.json
}
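
To see why the sub condition matters, here is a minimal Go sketch of the wildcard match it performs. It is a simplified model of StringLike (only * and ? are handled), not AWS's implementation, and stringLike is a hypothetical helper name. The subject strings follow the repo:OWNER/REPO:... shape that GitHub puts in the token's sub claim; the second repo owner is made up for illustration.

```go
package main

import "fmt"

// stringLike reports whether s matches pattern, where '*' matches any run of
// characters (including none) and '?' matches exactly one character.
// Simplified model of an IAM StringLike condition, for illustration only.
func stringLike(pattern, s string) bool {
	p, i := 0, 0
	star, mark := -1, 0 // position of the last '*' and where it started matching
	for i < len(s) {
		switch {
		case p < len(pattern) && pattern[p] == '*':
			star, mark = p, i
			p++
		case p < len(pattern) && (pattern[p] == '?' || pattern[p] == s[i]):
			p++
			i++
		case star >= 0:
			// Backtrack: let the last '*' absorb one more character.
			mark++
			p, i = star+1, mark
		default:
			return false
		}
	}
	// Any trailing '*' in the pattern can match the empty string.
	for p < len(pattern) && pattern[p] == '*' {
		p++
	}
	return p == len(pattern)
}

func main() {
	pattern := "repo:mohamednourdine/terraform-infra:*"

	// A push to main and a pull request from the trusted repo.
	fmt.Println(stringLike(pattern, "repo:mohamednourdine/terraform-infra:ref:refs/heads/main"))
	fmt.Println(stringLike(pattern, "repo:mohamednourdine/terraform-infra:pull_request"))

	// A repo with the same name under a different (hypothetical) owner.
	fmt.Println(stringLike(pattern, "repo:someone-else/terraform-infra:ref:refs/heads/main"))
}
```

The trailing * lets every branch and pull request in the repo assume the role. Tightening the pattern to a single ref, for example the main branch only, is a reasonable next step once the workflow is stable.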

Gap 2: Automated Testing with Terratest

Without this: you discover that a module is broken when it fails in production, not when the change is proposed.

Terratest is a Go library that tests Terraform modules by actually deploying them, running assertions, and tearing them down. It is one of the most widely used tools for module testing in the Terraform ecosystem.

How to read the test below. It deploys the module to a real AWS account, waits for the ALB health check to pass, asserts a 200 response, then tears everything down — even if the test fails. The defer terraform.Destroy line is the safety net that prevents orphaned resources when an assertion blows up.

A test for the web-app module:

// test/web_app_test.go
package test

import (
    "fmt"
    "net/http"
    "testing"
    "time"

    "github.com/gruntwork-io/terratest/modules/terraform"
    "github.com/stretchr/testify/assert"
    "github.com/stretchr/testify/require"
)

func TestWebAppModule(t *testing.T) {
    t.Parallel()

    terraformOptions := &terraform.Options{
        // Path to the module to test
        TerraformDir: "../modules/web-app",

        Vars: map[string]interface{}{
            "environment":               "test",
            "instance_type":             "t2.micro",
            "min_size":                  1,
            "max_size":                  2,
            "server_port":               8000,
            "health_check_path":         "/health",
            "health_check_grace_period": 120,
            "user_data":                 testUserData(),
        },

        // Prevent colour codes from cluttering test output
        NoColor: true,
    }

    // Destroy everything at the end of the test — even if the test fails
    defer terraform.Destroy(t, terraformOptions)

    // Deploy the module
    terraform.InitAndApply(t, terraformOptions)

    // Get the ALB DNS name from outputs
    albDNS := terraform.Output(t, terraformOptions, "alb_dns_name")
    url := fmt.Sprintf("http://%s/health", albDNS)

    // The ALB needs time to finish health checks after apply
    // Retry the request every 10 seconds for up to 5 minutes
    maxRetries := 30
    sleepBetween := 10 * time.Second

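    for i := 0; i < maxRetries; i++ {
        resp, err := http.Get(url)
        if err == nil {
            status := resp.StatusCode
            resp.Body.Close() // release the connection before the next attempt
            if status == http.StatusOK {
                // Health check passed
                assert.Equal(t, http.StatusOK, status)
                return
            }
            // Reachable but not healthy yet: log the status instead of a nil error
            err = fmt.Errorf("unexpected status %d", status)
        }
        t.Logf("Attempt %d/%d: %v; retrying in %s", i+1, maxRetries, err, sleepBetween)
        time.Sleep(sleepBetween)
    }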

    require.Fail(t, "Health check never returned 200 after 5 minutes")
}

func testUserData() string {
    return `#!/bin/bash
set -e
yum install -y python3 python3-pip
pip3 install fastapi "uvicorn[standard]"
mkdir -p /opt/api
cat > /opt/api/main.py << 'EOF'
from fastapi import FastAPI
import socket, datetime
app = FastAPI()
@app.get("/health")
def health():
    return {"status": "healthy", "hostname": socket.gethostname(),
            "timestamp": datetime.datetime.utcnow().isoformat() + "Z"}
EOF
nohup uvicorn main:app --host 0.0.0.0 --port 8000 --app-dir /opt/api &`
}

Run the test:

# Tests deploy real AWS resources — they take 5–10 minutes and incur cost
go test -v -timeout 20m ./test/
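
The retry loop in the test can be pulled into a small helper and exercised without touching AWS. The sketch below uses only the standard library. waitForStatus and flakyServer are hypothetical names, and the httptest server is a stand-in for an ALB whose targets are still warming up.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
	"time"
)

// waitForStatus polls url until it answers with wantStatus or the attempts
// run out. It returns the number of attempts made.
func waitForStatus(url string, wantStatus, maxRetries int, sleepBetween time.Duration) (int, error) {
	var lastErr error
	for attempt := 1; attempt <= maxRetries; attempt++ {
		resp, err := http.Get(url)
		if err == nil {
			status := resp.StatusCode
			resp.Body.Close() // release the connection before the next attempt
			if status == wantStatus {
				return attempt, nil
			}
			err = fmt.Errorf("unexpected status %d", status)
		}
		lastErr = err
		time.Sleep(sleepBetween)
	}
	return maxRetries, fmt.Errorf("gave up after %d attempts: %w", maxRetries, lastErr)
}

// flakyServer answers 503 for the first `failures` requests and 200 after
// that, imitating a load balancer with no healthy targets yet.
func flakyServer(failures int32) *httptest.Server {
	var calls int32
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if atomic.AddInt32(&calls, 1) <= failures {
			w.WriteHeader(http.StatusServiceUnavailable)
			return
		}
		w.WriteHeader(http.StatusOK)
	}))
}

func main() {
	srv := flakyServer(2)
	defer srv.Close()

	attempts, err := waitForStatus(srv.URL+"/health", http.StatusOK, 5, 10*time.Millisecond)
	fmt.Println("attempts:", attempts, "error:", err)
}
```

Terratest also ships an http_helper package with retry helpers of its own, which may be a better fit once the test suite grows.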

What Terratest catches

Module outputs are correct (ALB DNS name is a valid hostname)
The deployed app actually responds to health checks
Security group rules are correct (the ALB can reach the instances)
The create_before_destroy lifecycle doesn't leave orphaned resources

These are things terraform validate and terraform plan cannot check — they only validate syntax and configuration. Terratest validates behaviour.

Keeping test costs down

Real infrastructure tests cost real money. Two practices keep this manageable:

Use t2.micro in tests — the smallest instance that works. Tests don't need production-sized resources.
Always use defer terraform.Destroy — resources are cleaned up even if the test panics or t.Fatal is called.
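
The claim about defer can be checked without deploying anything. In this minimal sketch, runWithCleanup and exitLikeFailNow are hypothetical names: the first runs a function in its own goroutine, much as the testing package runs each test, and the second calls runtime.Goexit, which is what t.FailNow (and therefore t.Fatal) uses to stop a test.

```go
package main

import (
	"fmt"
	"runtime"
)

// runWithCleanup runs body in its own goroutine and reports whether the
// deferred cleanup ran. The cleanup stands in for terraform.Destroy.
func runWithCleanup(body func()) (cleanedUp bool) {
	done := make(chan struct{})
	go func() {
		defer close(done)                   // runs last: signal the caller
		defer func() { _ = recover() }()    // runs second: keep a panic from crashing the demo
		defer func() { cleanedUp = true }() // runs first: the cleanup itself
		body()
	}()
	<-done
	return cleanedUp
}

// exitLikeFailNow stops the current goroutine the way t.FailNow does.
func exitLikeFailNow() {
	runtime.Goexit()
}

func main() {
	fmt.Println("normal return:", runWithCleanup(func() {}))
	fmt.Println("panic:        ", runWithCleanup(func() { panic("assertion blew up") }))
	fmt.Println("Goexit:       ", runWithCleanup(exitLikeFailNow))
}
```

Deferred calls do not run when the process exits abruptly (os.Exit, a killed CI runner), so a generous -timeout and an occasional sweep for leftover test resources are still worth having.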

Gap 3: Module Documentation with `terraform-docs`

Without this: every consumer of the module has to read its source code to figure out the inputs, and the README drifts further from reality with every commit.

A module with no documentation is a module no one else can safely use. terraform-docs generates README content directly from the module's variables, outputs, and required providers — always in sync with the code.

Install:

brew install terraform-docs

Add a README.md to the module with special markers:

# web-app module

Deploys a FastAPI application on EC2 behind an Application Load Balancer,
managed by an Auto Scaling Group with rolling update support.

## Usage

```hcl
module "web_app" {
  source = "git::https://github.com/mohamednourdine/terraform-modules.git//modules/web-app?ref=v1.4.0"

  environment   = "prod"
  instance_type = "t3.small"
  min_size      = 2
  max_size      = 6
  user_data     = file("user_data.sh")
}
```

<!-- BEGIN_TF_DOCS -->
<!-- END_TF_DOCS -->

Generate and inject the docs:

terraform-docs markdown table --output-file README.md modules/web-app/

This replaces everything between the markers with a generated table of all inputs and outputs:

## Inputs

| Name | Description | Type | Default | Required |
|------|-------------|------|---------|----------|
| environment | Deployment environment | `string` | n/a | yes |
| instance_type | EC2 instance type | `string` | `"t2.micro"` | no |
| min_size | Minimum ASG instance count | `number` | `1` | no |
| max_size | Maximum ASG instance count | `number` | `2` | no |
| server_port | Port the app listens on | `number` | `8000` | no |
| health_check_path | ALB health check path | `string` | `"/health"` | no |
| user_data | Instance startup script | `string` | n/a | yes |

## Outputs

| Name | Description |
|------|-------------|
| alb_dns_name | DNS name of the Application Load Balancer |
| alb_zone_id | Hosted zone ID — required for Route 53 alias records |
| asg_name | Name of the Auto Scaling Group |
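
The marker-based injection is easy to picture as a string replacement. The sketch below is a simplified model of that behaviour, not terraform-docs's actual implementation, and injectDocs is a hypothetical name. It keeps everything outside the markers and swaps whatever sits between them.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

const (
	beginMarker = "<!-- BEGIN_TF_DOCS -->"
	endMarker   = "<!-- END_TF_DOCS -->"
)

// injectDocs replaces the text between the two markers with generated,
// leaving the hand-written parts of the README untouched.
func injectDocs(readme, generated string) (string, error) {
	start := strings.Index(readme, beginMarker)
	if start < 0 {
		return "", errors.New("begin marker not found")
	}
	offset := strings.Index(readme[start:], endMarker)
	if offset < 0 {
		return "", errors.New("end marker not found after begin marker")
	}
	end := start + offset

	head := readme[:start+len(beginMarker)]
	tail := readme[end:]
	return head + "\n" + strings.TrimSpace(generated) + "\n" + tail, nil
}

func main() {
	readme := "# web-app module\n\n" + beginMarker + "\nstale table\n" + endMarker + "\n"

	updated, err := injectDocs(readme, "## Inputs\n\n| Name | Description |\n")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Print(updated)
}
```

Because only the marked region changes, running the generator twice with the same input produces the same README, which is what makes a git diff check in CI meaningful.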

Add a terraform-docs step to CI so the README is always current:

# Assumes terraform-docs is already installed on the runner
- name: Check module docs are up to date
  run: |
    terraform-docs markdown table --output-file README.md modules/web-app/
    git diff --exit-code   # fails if the generated docs differ from what's committed

A PR that changes variables.tf without regenerating the README will fail CI.

Module Composability — The Design Principle

The modules built over the past week have grown in scope. The web-app module now handles security groups, the launch template, the ASG, the ALB, CloudWatch alarms, and log groups. That is too much for one module.

Production-grade modules follow the single responsibility principle: each module does one well-defined thing and exposes clean outputs for the next module to consume.

How to read the tree below. Read it top-down. Each module's outputs (listed beneath it) become the inputs to the modules underneath. networking produces a VPC; security-groups consumes it; alb and asg consume both. Nothing reaches sideways — every dependency is explicit and flows down.

The right decomposition for the FastAPI stack:

modules/
├── networking/          # VPC, subnets, route tables, internet gateway
│   └── outputs: vpc_id, public_subnet_ids, private_subnet_ids
│
├── security-groups/     # All security groups for the stack
│   └── outputs: alb_sg_id, instance_sg_id, rds_sg_id
│
├── alb/                 # ALB, target group, listener
│   └── outputs: alb_dns_name, alb_zone_id, target_group_arn
│
├── asg/                 # Launch template + ASG (no ALB logic)
│   └── outputs: asg_name, asg_arn
│
├── rds/                 # RDS instance, subnet group, parameter group
│   └── outputs: endpoint, port
│
└── iam/                 # Instance role, profile, policy
    └── outputs: instance_profile_name, role_arn

Each module is independently versioned, independently testable, and independently replaceable. The root config wires them together:

module "networking" {
  source      = "git::https://github.com/mohamednourdine/terraform-modules.git//modules/networking?ref=v1.0.0"
  environment = var.environment
}

module "security_groups" {
  source      = "git::https://github.com/mohamednourdine/terraform-modules.git//modules/security-groups?ref=v1.0.0"
  environment = var.environment
  vpc_id      = module.networking.vpc_id
}

module "alb" {
  source            = "git::https://github.com/mohamednourdine/terraform-modules.git//modules/alb?ref=v1.0.0"
  environment       = var.environment
  vpc_id            = module.networking.vpc_id
  subnet_ids        = module.networking.public_subnet_ids
  security_group_id = module.security_groups.alb_sg_id
}

module "asg" {
  source            = "git::https://github.com/mohamednourdine/terraform-modules.git//modules/asg?ref=v1.0.0"
  environment       = var.environment
  subnet_ids        = module.networking.private_subnet_ids
  security_group_id = module.security_groups.instance_sg_id
  target_group_arn  = module.alb.target_group_arn
  instance_type     = var.instance_type
  min_size          = var.min_size
  max_size          = var.max_size
  user_data         = local.fastapi_user_data
}

The outputs of one module become the inputs of the next. No module knows about the internals of any other — only its declared outputs.
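
Terraform works out the order of operations itself from these references. To make the "flows down" idea concrete, here is a small Go model of it: the dependency map mirrors the root config above, and applyOrder (a hypothetical helper, not part of Terraform) returns an order in which every module comes after the modules whose outputs it consumes.

```go
package main

import (
	"fmt"
	"sort"
)

// applyOrder returns the modules in an order where each one appears after
// everything it depends on. It returns an error for unknown modules or cycles.
func applyOrder(deps map[string][]string) ([]string, error) {
	unmet := make(map[string]int, len(deps)) // dependencies not yet placed
	dependents := make(map[string][]string)  // reverse edges
	for mod, needs := range deps {
		unmet[mod] = len(needs)
		for _, need := range needs {
			if _, ok := deps[need]; !ok {
				return nil, fmt.Errorf("%s depends on unknown module %s", mod, need)
			}
			dependents[need] = append(dependents[need], mod)
		}
	}

	var ready []string
	for mod, n := range unmet {
		if n == 0 {
			ready = append(ready, mod)
		}
	}

	var order []string
	for len(ready) > 0 {
		sort.Strings(ready) // keep the result deterministic
		mod := ready[0]
		ready = ready[1:]
		order = append(order, mod)
		for _, d := range dependents[mod] {
			unmet[d]--
			if unmet[d] == 0 {
				ready = append(ready, d)
			}
		}
	}

	if len(order) != len(deps) {
		return nil, fmt.Errorf("dependency cycle detected")
	}
	return order, nil
}

func main() {
	// Edges taken from the module blocks in the root config.
	deps := map[string][]string{
		"networking":      {},
		"security_groups": {"networking"},
		"alb":             {"networking", "security_groups"},
		"asg":             {"networking", "security_groups", "alb"},
	}
	order, err := applyOrder(deps)
	fmt.Println(order, err)
}
```

A module that reached sideways into another module's internals would show up here as an extra edge, and two modules that referenced each other would be rejected as a cycle.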

Why this matters for testing

A monolithic web-app module requires a full stack to test. A decomposed alb module can be tested in isolation — just an ALB and a target group, no EC2, no ASG, no RDS. The test is faster, cheaper, and simpler to reason about.

The Remaining Gap: Private Networking

The current setup uses the AWS default VPC with default public subnets. This means EC2 instances have public IP addresses and are directly reachable from the internet (modulo security group rules). That is not acceptable for production.

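The proper network layout puts the ALB in public subnets and the EC2 instances in private subnets, which is what the subnet_ids wiring in the root config above already assumes.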

EC2 instances in private subnets still need outbound internet access to pull packages and call AWS APIs (Secrets Manager, CloudWatch). That goes through a NAT Gateway in the public subnet — outbound only, no inbound from the internet.

NAT Gateway, in plain English. A managed AWS service that lets private instances reach the internet outbound (for yum install, AWS API calls) but blocks all unsolicited inbound traffic from the internet. Think of it as a one-way door.

This networking module is the foundation that all the others build on. It is the piece that converts the setup from "works in a default VPC" to "production network architecture." Day 17 covers this in detail.

Revised Checklist After Today

Requirement	Before Day 16	After Day 16
Automated testing	NO	YES — Terratest validates deployed behaviour
CI/CD pipeline	NO	YES — GitHub Actions: plan on PR, apply on merge
Module documentation	NO	YES — terraform-docs, enforced in CI
Alerting	PARTIALLY — alarm, no notification	Still PARTIALLY — SNS topic needed
Network isolation	PARTIALLY — default VPC	Still PARTIALLY — VPC module needed (Day 17)

The infrastructure is now testable, automatically deployed, and self-documenting. These three things are what separate a module you can share with a team from one that only works because you know all the undocumented assumptions.

The production-grade checklist is not a one-time audit — it is a framework for evaluating every new piece of infrastructure before it goes live. The gaps that remain (private networking, SNS alerting) are real, known, and have a plan. That is a different situation from not knowing they exist.

Next up: building the VPC and private networking module that closes the network isolation gap.

If you only do one thing today, make it CI/CD. Tests and docs are easier to add later; an audit trail of every apply is something you cannot reconstruct after the fact.
