Day 16 added a single Terratest test that deploys the FastAPI module and asserts that /health returns 200. That is a good starting point, but it is not a testing strategy. A single test that takes 8 minutes to run and bills for real AWS resources every time is not something you run on every commit.
A real testing strategy has layers. Fast, cheap tests run first. Slow, expensive tests run only when the fast tests pass. Each layer catches different categories of failure.
The Terraform Testing Pyramid
| Layer | What fails here |
|---|---|
| Static analysis | Wrong syntax, wrong attribute names, known security misconfigurations |
| Unit tests | The wrong resource is produced for a given combination of inputs |
| Integration tests | The deployed resources don't actually work together |
| E2E tests | The application doesn't behave correctly end-to-end |
The layers are cumulative, not independent: each one adds a category of confidence that the layer below cannot provide. A passing unit test is only meaningful if static analysis already passed; a passing integration test is only meaningful if unit tests already passed.
What goes where — the decision rule:
- Static if the rule is true regardless of inputs (e.g., "RDS must be encrypted").
- Unit (terraform test) if the rule depends on input variables (e.g., "prod must have min_size >= 2").
- Integration (Terratest) if the rule depends on AWS behaviour (e.g., "the ALB actually routes traffic to a healthy instance").
- E2E if the rule depends on the application working end-to-end (e.g., "POST /items writes a row to RDS").
Layer 1: Static Analysis
Static analysis runs without a plan or deploy. It reads the .tf files directly and flags problems.
tflint — provider-aware linting
terraform validate catches invalid HCL syntax. tflint goes further: it knows the AWS provider schema and catches semantically wrong configurations that validate accepts.
Install:
brew install tflint
Initialize with the AWS ruleset:
tflint --init
This requires a .tflint.hcl config at the repo root:
# .tflint.hcl
plugin "aws" {
enabled = true
version = "0.31.0"
source = "github.com/terraform-linters/tflint-ruleset-aws"
}
Run against the module:
tflint --recursive
What tflint catches that terraform validate misses:
# validate passes this — the attribute is valid HCL
resource "aws_instance" "web" {
instance_type = "t4.micro" # not a real instance type — tflint catches it
}
# validate passes this too
resource "aws_autoscaling_group" "web" {
health_check_type = "elb" # wrong case — must be "ELB"
}
t4.micro is not a real EC2 instance type (the t4g.* family exists, but no plain t4). tflint catches it because the AWS ruleset ships with a list of valid instance types; terraform validate only checks that instance_type is a string, so it accepts any value.
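To make the difference concrete, here is a minimal Go sketch of the two kinds of check. It is an illustration only, not how tflint is implemented: `knownInstanceTypes` is a tiny hypothetical stand-in for the much longer list a provider-aware linter carries, and both function names are made up for this example.

```go
package main

import "fmt"

// knownInstanceTypes is a tiny, hypothetical stand-in for the instance-type
// list that a provider-aware linter ships with. The real list is far longer.
var knownInstanceTypes = map[string]bool{
	"t2.micro":  true,
	"t3.small":  true,
	"t4g.micro": true,
}

// schemaCheck mimics a type-only check: any string is acceptable.
func schemaCheck(value interface{}) bool {
	_, ok := value.(string)
	return ok
}

// catalogCheck mimics a provider-aware check: the string must also be a
// value that appears in the known list.
func catalogCheck(instanceType string) bool {
	return knownInstanceTypes[instanceType]
}

func main() {
	for _, it := range []string{"t2.micro", "t4.micro"} {
		fmt.Printf("%-9s schema=%v catalog=%v\n", it, schemaCheck(it), catalogCheck(it))
	}
}
```

Both values pass the schema-style check; only the catalog-style check rejects t4.micro, which is the gap tflint fills.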
checkov — security and compliance scanning
checkov evaluates your Terraform configuration against a library of security and compliance rules: CIS AWS Foundations, NIST, PCI-DSS, and others.
Install:
pip3 install checkov
Run:
checkov -d . --framework terraform
Example output on the FastAPI module (check IDs change between checkov releases — the categories are what matter):
FAILED Ensure Instance Metadata Service Version 1 is not enabled
resource: aws_launch_template.web
FAILED Ensure all data stored in EBS is securely encrypted
resource: aws_launch_template.web
PASSED Ensure that EC2 is EBS optimized
resource: aws_launch_template.web
Each failure is a real security gap. The IMDSv1 check is important — enabling IMDSv2-only closes a class of SSRF attack that can steal EC2 instance credentials.
Fix in the launch template:
resource "aws_launch_template" "web" {
# ...
# Require IMDSv2 — disables the older token-optional endpoint
metadata_options {
http_endpoint = "enabled"
http_tokens = "required" # "optional" allows IMDSv1
http_put_response_hop_limit = 1
}
# Encrypt the root volume
block_device_mappings {
device_name = "/dev/xvda"
ebs {
volume_size = 20
encrypted = true
delete_on_termination = true
}
}
}
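Add checkov to CI and fail the pipeline on any HIGH or CRITICAL severity finding (severity filters need a Prisma Cloud API key, because checkov gets its severity metadata from the platform; without a key, list explicit check IDs in --check instead):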
checkov -d . --framework terraform --check HIGH,CRITICAL --compact
Layer 2: Unit Tests with terraform test
Terraform 1.6 introduced a built-in testing framework, and Terraform 1.7 added mock_provider blocks. Unit tests use mock providers — no AWS API calls, no real resources, no cost, and they run in seconds.
Mocking caveat. A bare mock_provider "aws" {} returns a randomly generated value for every computed attribute. Assertions like output.alb_dns_name != "" will always pass with a mock: not because the output is real, but because the mock invented a value. To make output assertions meaningful, supply explicit return values via override_resource / mock_resource blocks. Asserting on attributes that the user set (instance_type, min_size) is reliable; asserting on AWS-computed values (arn, dns_name) is not, with bare mocks.
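The caveat can be shown with a small Go sketch. It only imitates the behaviour described above and is not Terraform code: `mockComputed` and `resolve` are hypothetical names, and the 8-character random string is an assumption about what a bare mock invents.

```go
package main

import (
	"fmt"
	"math/rand"
)

const letters = "abcdefghijklmnopqrstuvwxyz0123456789"

// mockComputed imitates a bare mock provider: a computed string attribute
// gets an invented 8-character value (the length is an assumption).
func mockComputed() string {
	b := make([]byte, 8)
	for i := range b {
		b[i] = letters[rand.Intn(len(letters))]
	}
	return string(b)
}

// resolve returns the override when one is configured for the attribute,
// otherwise an invented value.
func resolve(attr string, overrides map[string]string) string {
	if v, ok := overrides[attr]; ok {
		return v
	}
	return mockComputed()
}

func main() {
	// With a bare mock this comparison can never be false.
	bare := resolve("dns_name", nil)
	fmt.Println(bare != "")

	// With a pinned value, an exact comparison becomes possible.
	pinned := resolve("dns_name", map[string]string{"dns_name": "test-alb.example.com"})
	fmt.Println(pinned == "test-alb.example.com")
}
```

The first check is vacuous because the value is invented on the spot; the second is only meaningful because the override fixed what the attribute returns.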
Test files use the .tftest.hcl extension and live alongside the module:
modules/
└── web-app/
├── main.tf
├── variables.tf
├── outputs.tf
└── tests/
├── unit.tftest.hcl
└── integration.tftest.hcl
Writing a unit test
# modules/web-app/tests/unit.tftest.hcl
# Override the AWS provider with a mock — no real API calls
mock_provider "aws" {}
# Test 1: default variable values produce the expected resource configuration
run "defaults_produce_correct_config" {
command = plan # plan only — no apply
variables {
environment = "test"
instance_type = "t2.micro"
min_size = 1
max_size = 2
user_data = "#!/bin/bash\necho hello"
}
assert {
condition = aws_launch_template.web.instance_type == "t2.micro"
error_message = "Expected instance_type to be t2.micro, got: ${aws_launch_template.web.instance_type}"
}
assert {
condition = aws_autoscaling_group.web.min_size == 1
error_message = "Expected min_size to be 1, got: ${aws_autoscaling_group.web.min_size}"
}
assert {
condition = aws_autoscaling_group.web.health_check_type == "ELB"
error_message = "Health check type must be ELB, not EC2"
}
}
# Test 2: prod environment enforces minimum capacity
run "prod_enforces_min_size" {
command = plan
variables {
environment = "prod"
instance_type = "t3.small"
min_size = 2
max_size = 6
user_data = "#!/bin/bash\necho hello"
}
assert {
condition = aws_autoscaling_group.web.min_size >= 2
error_message = "prod environment must have min_size >= 2"
}
}
# Test 3: monitoring is enabled in prod
run "prod_has_monitoring" {
command = plan
variables {
environment = "prod"
instance_type = "t3.small"
min_size = 2
max_size = 6
user_data = "#!/bin/bash\necho hello"
enable_monitoring = true
}
assert {
condition = length(aws_cloudwatch_metric_alarm.cpu_high) == 1
error_message = "prod environment must have CPU alarm enabled"
}
}
Run the unit tests:
terraform test -filter=tests/unit.tftest.hcl
Output:
web-app/tests/unit.tftest.hcl... pass
run "defaults_produce_correct_config"... pass
run "prod_enforces_min_size"... pass
run "prod_has_monitoring"... pass
These tests run in under 3 seconds. No AWS credentials needed. They validate the module's logic — the relationship between inputs and the resource configuration they produce.
What unit tests catch
- A variable with the wrong default being used by a resource
- A conditional (count = var.enable_monitoring ? 1 : 0) that is inverted
- A for_each that produces more or fewer resources than expected
- An output that references the wrong attribute
What they do not catch: whether the IAM role has the right permissions, whether the security group allows the right traffic, whether the app actually starts on the instance.
Layer 3: Integration Tests with Terratest
Integration tests deploy real AWS resources, verify real behavior, and destroy everything afterward. They catch what unit tests cannot: configuration that is syntactically correct but operationally wrong.
The complete test file for the web-app module:
// test/web_app_test.go
package test
import (
	"context"
	"fmt"
	"net/http"
	"testing"
	"time"

	"github.com/aws/aws-sdk-go-v2/config"
	elbv2 "github.com/aws/aws-sdk-go-v2/service/elasticloadbalancingv2"
	elbv2types "github.com/aws/aws-sdk-go-v2/service/elasticloadbalancingv2/types"
	"github.com/gruntwork-io/terratest/modules/aws"
	"github.com/gruntwork-io/terratest/modules/random"
	"github.com/gruntwork-io/terratest/modules/terraform"
	"github.com/stretchr/testify/assert"
	"github.com/stretchr/testify/require"
)

func TestWebAppModule(t *testing.T) {
	t.Parallel() // run multiple test functions concurrently

	// Unique suffix prevents resource name collisions when tests run in parallel
	uniqueID := random.UniqueId()
	environment := fmt.Sprintf("test-%s", uniqueID)

	// WithDefaultRetryableErrors retries plan/apply on known transient errors
	// (3 attempts, 5 seconds apart). MaxRetries alone has no effect unless
	// RetryableTerraformErrors is also set.
	terraformOptions := terraform.WithDefaultRetryableErrors(t, &terraform.Options{
		TerraformDir: "../modules/web-app",
		Vars: map[string]interface{}{
			"environment":   environment,
			"instance_type": "t2.micro",
			"min_size":      1,
			"max_size":      2,
			"user_data":     buildUserData(),
		},
		NoColor: true,
	})

	// Always destroy, even if an assertion fails
	defer terraform.Destroy(t, terraformOptions)
	terraform.InitAndApply(t, terraformOptions)

	// Test 1: ALB DNS name is a non-empty string
	albDNS := terraform.Output(t, terraformOptions, "alb_dns_name")
	assert.NotEmpty(t, albDNS, "alb_dns_name output should not be empty")

	// Test 2: health endpoint returns 200
	assertHealthEndpoint(t, albDNS, 30, 10*time.Second)

	// Test 3: ASG exists and has the correct desired capacity
	// (argument order is ASG name first, then region)
	asgName := terraform.Output(t, terraformOptions, "asg_name")
	capacity := aws.GetCapacityInfoForAsg(t, asgName, "us-east-1")
	assert.Equal(t, int64(1), capacity.DesiredCapacity,
		"ASG desired capacity should be 1 for test environment")

	// Test 4: all instances in the target group are healthy
	tgArn := terraform.Output(t, terraformOptions, "target_group_arn")
	assertAllTargetsHealthy(t, tgArn, "us-east-1")
}

func assertHealthEndpoint(t *testing.T, albDNS string, maxRetries int, sleep time.Duration) {
	t.Helper()
	url := fmt.Sprintf("http://%s/health", albDNS)
	for i := 0; i < maxRetries; i++ {
		resp, err := http.Get(url)
		if err == nil {
			status := resp.StatusCode
			resp.Body.Close() // release the connection before the next attempt
			if status == http.StatusOK {
				t.Logf("Health check passed on attempt %d", i+1)
				return
			}
			err = fmt.Errorf("unexpected status %d", status)
		}
		t.Logf("Attempt %d/%d failed (%v), retrying in %s", i+1, maxRetries, err, sleep)
		time.Sleep(sleep)
	}
	require.Fail(t, fmt.Sprintf("Health check at %s never returned 200 after %d attempts", url, maxRetries))
}

func assertAllTargetsHealthy(t *testing.T, tgArn string, region string) {
	t.Helper()
	// Target health is read with the AWS SDK for Go v2 directly; this assumes
	// the SDK modules imported above are in the test module's go.mod.
	ctx := context.Background()
	cfg, err := config.LoadDefaultConfig(ctx, config.WithRegion(region))
	require.NoError(t, err)
	client := elbv2.NewFromConfig(cfg)

	// Wait up to 5 minutes (30 attempts, 10 seconds apart) for all targets to become healthy
	for i := 0; i < 30; i++ {
		out, err := client.DescribeTargetHealth(ctx, &elbv2.DescribeTargetHealthInput{
			TargetGroupArn: &tgArn,
		})
		require.NoError(t, err)

		// An empty list means nothing has registered yet, which is not "healthy"
		allHealthy := len(out.TargetHealthDescriptions) > 0
		for _, target := range out.TargetHealthDescriptions {
			if target.TargetHealth == nil || target.TargetHealth.State != elbv2types.TargetHealthStateEnumHealthy {
				allHealthy = false
				break
			}
		}
		if allHealthy {
			t.Log("All targets healthy")
			return
		}
		time.Sleep(10 * time.Second)
	}
	require.Fail(t, "Not all targets became healthy within 5 minutes")
}
func buildUserData() string {
	return `#!/bin/bash
set -e
yum install -y python3 python3-pip
pip3 install fastapi "uvicorn[standard]"
mkdir -p /opt/api
cat > /opt/api/main.py << 'PYEOF'
from fastapi import FastAPI
import socket, datetime

app = FastAPI()

@app.get("/health")
def health():
    return {
        "status": "healthy",
        "hostname": socket.gethostname(),
        "timestamp": datetime.datetime.utcnow().isoformat() + "Z"
    }
PYEOF
cat > /etc/systemd/system/fastapi.service << 'SVCEOF'
[Unit]
Description=FastAPI
After=network.target

[Service]
ExecStart=/usr/local/bin/uvicorn main:app --host 0.0.0.0 --port 8000 --app-dir /opt/api
Restart=always

[Install]
WantedBy=multi-user.target
SVCEOF
systemctl daemon-reload
systemctl enable fastapi
systemctl start fastapi`
}
Heredoc indentation matters. The body of << 'PYEOF' is delivered to the shell verbatim, including any leading whitespace. If the heredoc body is indented to line up with the surrounding Go code, top-level Python lines like from fastapi import FastAPI will fail with IndentationError. Either left-justify the heredoc body (as above) or use <<- heredocs indented with tabs (the <<- form strips only leading tabs, not spaces).
Run the integration test:
cd test
go test -v -run TestWebAppModule -timeout 20m
Test stage separation
When debugging a failing integration test, re-running the full deploy-verify-destroy cycle wastes 8–10 minutes. Use environment variables to skip phases:
func TestWebAppModule(t *testing.T) {
	// Needs "os" added to the import block.
	// SKIP_DEPLOY=true reuses existing infrastructure from a prior run
	skipDeploy := os.Getenv("SKIP_DEPLOY") == "true"
	// SKIP_TEARDOWN=true leaves infrastructure running for inspection
	skipTeardown := os.Getenv("SKIP_TEARDOWN") == "true"

	// terraformOptions is built as in the full test above. While iterating,
	// keep the environment name fixed rather than random so that a re-run
	// that does apply targets the same resources.

	if !skipTeardown {
		defer terraform.Destroy(t, terraformOptions)
	}
	if !skipDeploy {
		terraform.InitAndApply(t, terraformOptions)
	}

	// assertions always run
	assertHealthEndpoint(t, terraform.Output(t, terraformOptions, "alb_dns_name"), 30, 10*time.Second)
}
Typical debugging session:
# First run: deploy, and keep the infrastructure even if an assertion fails
SKIP_TEARDOWN=true go test -v -run TestWebAppModule -timeout 20m
# Test fails at assertAllTargetsHealthy
# Inspect the target group manually in AWS console...
# Fix the bug in the test code
# Re-run: skip deploy and teardown, test against the existing infrastructure
SKIP_DEPLOY=true SKIP_TEARDOWN=true go test -v -run TestWebAppModule -timeout 20m
# If the fix was in the module itself, drop SKIP_DEPLOY so the change is applied
SKIP_TEARDOWN=true go test -v -run TestWebAppModule -timeout 20m
# Fixed: a final run without SKIP_TEARDOWN destroys everything
SKIP_DEPLOY=true go test -v -run TestWebAppModule -timeout 20m
This cuts the iteration cycle from 8 minutes to under 30 seconds for each test-only run.
Caveat. SKIP_DEPLOY=true only works when the second run uses the same TerraformDir as the first: the state file must still be reachable. If your tests use test_structure.CopyTerraformFolderToTemp to isolate runs in temp directories, this pattern needs adapting (persist the temp path to disk between runs, or skip the copy when SKIP_DEPLOY is set).
Layer 4: End-to-End Tests
End-to-end tests deploy the complete stack — networking, app, database — and test the application's behavior through its API. They are the most expensive to run and the closest to a real user interaction.
func TestFullStackE2E(t *testing.T) {
// E2E tests are opt-in — only run when explicitly requested
if os.Getenv("RUN_E2E_TESTS") != "true" {
t.Skip("Skipping E2E test — set RUN_E2E_TESTS=true to run")
}
uniqueID := random.UniqueId()
environment := fmt.Sprintf("e2e-%s", uniqueID)
// Deploy the full root module, not just one sub-module
terraformOptions := &terraform.Options{
TerraformDir: "../", // root config with networking + app + rds
Vars: map[string]interface{}{
"environment": environment,
"instance_type": "t3.small",
"min_size": 1,
"max_size": 2,
"db_instance_class": "db.t3.micro",
},
NoColor: true,
}
defer terraform.Destroy(t, terraformOptions)
terraform.InitAndApply(t, terraformOptions)
albDNS := terraform.Output(t, terraformOptions, "alb_dns_name")
// Test the /items endpoint — requires database connectivity
assertHealthEndpoint(t, albDNS, 30, 10*time.Second)
assertItemsEndpoint(t, albDNS)
}
func assertItemsEndpoint(t *testing.T, albDNS string) {
	t.Helper()
	// Needs "encoding/json" and "strings" added to the import block.

	// Create an item
	createURL := fmt.Sprintf("http://%s/items", albDNS)
	body := strings.NewReader(`{"name": "test-item", "price": 9.99}`)
	resp, err := http.Post(createURL, "application/json", body)
	require.NoError(t, err)
	defer resp.Body.Close()
	require.Equal(t, http.StatusCreated, resp.StatusCode)

	// Read it back
	var created map[string]interface{}
	require.NoError(t, json.NewDecoder(resp.Body).Decode(&created))
	// Checked assertion: fails the test instead of panicking if id is missing or not a string
	itemID, ok := created["id"].(string)
	require.True(t, ok, "create response should contain a string id")

	getURL := fmt.Sprintf("http://%s/items/%s", albDNS, itemID)
	getResp, err := http.Get(getURL)
	require.NoError(t, err)
	defer getResp.Body.Close()
	assert.Equal(t, http.StatusOK, getResp.StatusCode)
}
E2E tests validate the parts of the stack that integration tests skip:
- Database connectivity from the app through the private subnet
- Secrets Manager secret correctly decoded and used for database credentials
- VPC routing between subnets (NAT Gateway for outbound, ALB for inbound)
- IAM role has the right permissions across all three AWS services
Other Testing Approaches
Policy-as-code with OPA
Open Policy Agent (OPA) evaluates Terraform plans against a policy rulebook before apply. Unlike checkov (which reads .tf files), OPA reads the JSON plan output — it catches issues that only become visible after planning. The common wrapper for this in the Terraform ecosystem is conftest, which packages OPA with sensible defaults for plan evaluation; HashiCorp's commercial equivalent is Sentinel (in Terraform Cloud/Enterprise).
# Generate a JSON plan
terraform plan -out=tfplan
terraform show -json tfplan > plan.json
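# Evaluate with conftest (uses OPA underneath).
# conftest looks in package "main" by default, so name the policy's package explicitly.
conftest test --policy policies/ --namespace terraform plan.json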
# Or raw OPA, equivalent
opa eval --data policies/ --input plan.json "data.terraform.deny"
Example OPA policy that blocks unencrypted RDS instances:
# policies/rds.rego
package terraform

# rego.v1 syntax needs OPA 0.59 or newer; the older deny[msg] { ... } form is rejected by OPA 1.0+
import rego.v1

deny contains msg if {
	resource := input.resource_changes[_]
	resource.type == "aws_db_instance"
	resource.change.after.storage_encrypted == false
	msg := sprintf("RDS instance '%s' must have storage_encrypted = true", [resource.address])
}
OPA policies become the machine-readable version of your compliance requirements. They run in milliseconds and fail the CI pipeline with a specific, actionable message.
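To see what the policy is actually matching on, here is a rough Go equivalent of the same rule. It is an illustration of the plan JSON shape, not a replacement for OPA: the `plan` struct covers only the fields the rule reads, `unencryptedRDS` is a hypothetical name, and `samplePlan` is hand-written rather than real `terraform show -json` output.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// plan mirrors only the fields of the JSON plan that the policy reads.
type plan struct {
	ResourceChanges []struct {
		Address string `json:"address"`
		Type    string `json:"type"`
		Change  struct {
			After map[string]interface{} `json:"after"`
		} `json:"change"`
	} `json:"resource_changes"`
}

// unencryptedRDS returns one message per aws_db_instance whose planned
// storage_encrypted value is explicitly false.
func unencryptedRDS(planJSON []byte) ([]string, error) {
	var p plan
	if err := json.Unmarshal(planJSON, &p); err != nil {
		return nil, err
	}
	var msgs []string
	for _, rc := range p.ResourceChanges {
		if rc.Type != "aws_db_instance" {
			continue
		}
		if enc, ok := rc.Change.After["storage_encrypted"].(bool); ok && !enc {
			msgs = append(msgs, fmt.Sprintf("RDS instance '%s' must have storage_encrypted = true", rc.Address))
		}
	}
	return msgs, nil
}

// samplePlan is a hand-written fragment, trimmed to the relevant fields.
const samplePlan = `{"resource_changes":[
 {"address":"aws_db_instance.main","type":"aws_db_instance","change":{"after":{"storage_encrypted":false}}},
 {"address":"aws_db_instance.ok","type":"aws_db_instance","change":{"after":{"storage_encrypted":true}}},
 {"address":"aws_instance.web","type":"aws_instance","change":{"after":{"instance_type":"t2.micro"}}}
]}`

func main() {
	msgs, err := unencryptedRDS([]byte(samplePlan))
	if err != nil {
		panic(err)
	}
	for _, m := range msgs {
		fmt.Println(m)
	}
}
```

Like the Rego rule, this only fires when the planned value is explicitly false; a plan where the attribute is absent or null would need its own rule.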
terraform test for module contracts
One pattern that works well with terraform test is testing the contract between a module's inputs and its outputs — not the internal resource configuration, but what the module promises to callers.
# modules/web-app/tests/contract.tftest.hcl
mock_provider "aws" {}
run "outputs_are_populated" {
command = apply # apply with mock provider — no real resources
variables {
environment = "test"
instance_type = "t2.micro"
min_size = 1
max_size = 2
user_data = "#!/bin/bash\necho hello"
}
assert {
condition = output.alb_dns_name != ""
error_message = "alb_dns_name output must not be empty"
}
assert {
condition = output.alb_zone_id != ""
error_message = "alb_zone_id output must not be empty — required for Route 53 alias records"
}
assert {
condition = output.asg_name != ""
error_message = "asg_name output must not be empty"
}
}
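This runs with a mock provider (apply works because the mock returns fake values for all computed attributes). With a bare mock the != "" checks cannot fail on the values themselves, so what the test really guards is that each output still exists and is still wired to a resource attribute: remove or rename an output and the run fails. That makes it a regression test for the shape of the module's public API; pin values with override_resource if you need to assert on their content.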
Wiring Into CI/CD
Tests run in three stages in GitHub Actions, matching the pyramid:
# .github/workflows/test.yml
name: Terraform Tests
on:
  push:
    branches: ["**"]
  pull_request:
    branches: [main]
  workflow_dispatch: # manual runs; the E2E job depends on this trigger
jobs:
  # Stage 1: static analysis, runs on every push, < 30 seconds
  static:
    name: Static Analysis
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Setup Terraform
        uses: hashicorp/setup-terraform@v3
      - name: terraform fmt check
        run: terraform fmt -recursive -check
        working-directory: modules/web-app
      - name: terraform validate
        run: |
          terraform init -backend=false
          terraform validate
        working-directory: modules/web-app
      - name: Setup tflint
        uses: terraform-linters/setup-tflint@v4
        with:
          tflint_version: v0.50.3
      # Runs from the repo root, where .tflint.hcl lives; the absolute
      # --config path keeps that file in use for every subdirectory
      - name: tflint
        run: |
          tflint --init
          tflint --recursive --config "$(pwd)/.tflint.hcl"
      - name: checkov security scan
        uses: bridgecrewio/checkov-action@v12
        with:
          directory: modules/web-app
          framework: terraform
          check: HIGH,CRITICAL # severity filters need a Prisma Cloud API key
  # Stage 2: unit tests, runs on every push, < 10 seconds
  unit:
    name: Unit Tests
    runs-on: ubuntu-latest
    needs: static
    steps:
      - uses: actions/checkout@v4
      - name: Setup Terraform
        uses: hashicorp/setup-terraform@v3
        with:
          terraform_version: "~> 1.7" # mock_provider requires 1.7 or newer
      - name: terraform test (unit)
        run: |
          terraform init -backend=false
          terraform test -filter=tests/unit.tftest.hcl
        working-directory: modules/web-app
  # Stage 3: integration tests, runs on PRs to main only (costs money)
  integration:
    name: Integration Tests
    runs-on: ubuntu-latest
    needs: unit
    if: github.event_name == 'pull_request' && github.base_ref == 'main'
    permissions:
      id-token: write # required for OIDC
      contents: read
    steps:
      - uses: actions/checkout@v4
      - name: Configure AWS credentials (OIDC)
        uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::${{ secrets.AWS_ACCOUNT_ID }}:role/GitHubActionsRole
          aws-region: us-east-1
      - name: Setup Terraform
        uses: hashicorp/setup-terraform@v3
        with:
          terraform_wrapper: false # the wrapper script interferes with the output Terratest parses
      - name: Setup Go
        uses: actions/setup-go@v5
        with:
          go-version: "1.21"
      - name: Run integration tests
        run: go test -v -run TestWebAppModule -timeout 30m
        working-directory: test
        env:
          AWS_REGION: us-east-1
The three stages are gates: integration tests only run if unit tests pass. Unit tests only run if static analysis passes. A formatting error does not burn EC2 time running Terratest.
E2E tests: manual trigger only
E2E tests are expensive enough that they should never run automatically on every PR:
  e2e:
    name: E2E Tests
    runs-on: ubuntu-latest
    if: github.event_name == 'workflow_dispatch' # manual trigger only
    # ...
    steps:
      - name: Run E2E tests
        run: go test -v -run TestFullStackE2E -timeout 60m
        working-directory: test
        env:
          RUN_E2E_TESTS: "true"
Trigger manually from the GitHub Actions UI before a major release.
Retrying a failed integration run. Because the stages are gated by needs:, re-running the whole workflow after a transient AWS failure in the integration job re-runs static + unit too. Pick "Re-run failed jobs" in the Actions UI to retry only the failed stage. Adding workflow_dispatch to the workflow triggers additionally lets you start a run by hand, which the E2E job requires.
Test Cost Management
Integration tests deploy real AWS resources. Without discipline, this cost accumulates quickly.
Use the smallest viable resources
// In test fixtures, always override to minimum sizes
Vars: map[string]interface{}{
"instance_type": "t2.micro", // not t3.small
"min_size": 1, // not 2
"db_instance_class": "db.t3.micro", // not db.t3.small
"multi_az": false, // never in tests
},
Always defer terraform.Destroy
defer terraform.Destroy(t, terraformOptions)
Place this immediately after creating terraformOptions, before InitAndApply. If the test panics, the deferred function still runs.
Estimate test cost before running
For the FastAPI integration test (approximate us-east-1 on-demand prices):
- 1x t2.micro EC2: ~$0.012/hour
- 1x ALB: ~$0.023/hour base charge, plus a small LCU charge for traffic
- 10-minute test run: ~$0.006 total
Running this test 10 times per day = $0.06/day = $1.80/month. Negligible for the confidence it provides.
The E2E test adds an RDS instance and runs longer (~30 min including DB provisioning): roughly $0.04 per run, plus the one-time RDS create overhead. Call it ~$0.50 per E2E run as a safe budgeting number.
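The arithmetic behind these estimates is just the sum of hourly rates times run length. A minimal Go sketch, where `runCost` is a hypothetical helper and the rates in `main` are assumed example figures to be replaced with current numbers from the AWS pricing pages:

```go
package main

import "fmt"

// runCost returns the dollar cost of keeping resources with the given
// hourly rates alive for `minutes` minutes.
func runCost(hourlyRates []float64, minutes float64) float64 {
	total := 0.0
	for _, r := range hourlyRates {
		total += r
	}
	return total * minutes / 60
}

func main() {
	// Assumed example rates in dollars per hour: one small EC2 instance, one ALB.
	rates := []float64{0.0116, 0.0225}

	perRun := runCost(rates, 10)
	fmt.Printf("per run:   $%.4f\n", perRun)
	fmt.Printf("per month: $%.2f\n", perRun*10*30) // 10 runs a day for 30 days
}
```

Keeping the rates as inputs makes it easy to redo the estimate when the fixture changes, for example when an RDS instance or NAT Gateway is added for the E2E run.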
Test pollution
Parallel tests using random.UniqueId() keep resource names from colliding within a single run, but they don't help if a previous run failed to destroy. Orphaned ALBs, ASGs, and RDS instances from crashed test runs accumulate quickly. Two mitigations:
- Tag every test resource with Purpose = "terratest" and CreatedAt = <timestamp>, then run a scheduled Lambda (or aws-nuke in a sandbox account) to delete anything older than 24 hours.
- Use a dedicated AWS sub-account for tests so a runaway cleanup script can't touch shared resources.
To conclude
The FastAPI module now has a four-layer testing strategy:
| Layer | Tool | When | Cost |
|---|---|---|---|
| Static analysis | tflint + checkov | Every commit | Free |
| Unit tests | terraform test | Every commit | Free |
| Integration tests | Terratest | PR to main | ~$0.006/run |
| E2E tests | Terratest | Manual trigger | ~$0.50/run |
The tests run fastest-first. A formatting error is caught before any AWS resource is created. A bad IAM policy is caught in the integration test before it reaches staging. A broken database integration is caught in the E2E test before it reaches production.
This is the testing pyramid applied to infrastructure: cheap tests cover most of the surface area; expensive tests cover the integrations that cheap tests cannot reach.
The remaining open item from the Day 16 checklist: private networking. The next module — VPC with public, private, and database subnets — is the foundation that all the other modules build on. Every test that currently uses the default VPC will be updated to use this module.
This post is part of a 30-day Terraform learning journey.