Apr 18, 2026

Day 17: Manual Testing of Terraform Infrastructure

Day 16 added CI/CD and Terratest. Day 17 steps back: what is the manual testing workflow that Terratest complements — and that you fall back to when a test fails and you need to debug?

Before building the private networking module (the remaining gap from the Day 16 checklist), it is worth establishing the manual testing discipline that keeps each iteration safe. Terratest gives you automated confidence at the module level. Manual testing gives you confidence at the human level — it answers the question "does this actually do what I think it does?"

The two are not in competition. Manual tests reveal the failure scenarios that become Terratest assertions. Terratest catches regressions that manual tests would miss. Both are part of the workflow.

Why Manual Testing Still Matters

Automated tests run the happy path reliably. They do not always tell you why something failed — only that it did. When a Terratest run fails at the retry loop after 5 minutes, you still need to:

Read CloudWatch logs to see why the instance never passed the health check
Check the Secrets Manager policy to confirm the IAM role has access
Inspect the security group rules to verify the ALB can reach the instances
Read the ASG activity log to understand why instances are cycling

These are manual operations. The faster you can do them, the faster you can diagnose the failure and write the test that catches it next time.

Step 1: Static Checks Before `terraform plan`

Three commands run before every apply, in order:

`terraform fmt`

Formats all .tf files in the current directory to canonical style:

terraform fmt -recursive

The -recursive flag reaches into subdirectories (module directories). Unformatted code passes validation and planning but fails any CI step that runs terraform fmt -check. Running fmt locally avoids a CI failure on something trivial.

What it catches:

Inconsistent indentation (tabs vs spaces)
Misaligned = signs in attribute blocks
Inconsistent spacing inside blocks and around operators

terraform fmt is non-negotiable before a PR. It is fast and has no side effects.

`terraform validate`

Validates the configuration without making any AWS API calls:

terraform validate

What it catches:

Undeclared variable references (var.does_not_exist)
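Type mismatches between a variable's declared type and its default, or in values passed to module inputs (values supplied with -var or .tfvars files are only checked at plan time)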
Missing required arguments on resource blocks
Invalid provider configuration syntax
Some reference cycles (others only surface during plan)

What it does NOT catch:

Whether a security group rule is correct
Whether an IAM policy grants the right permissions
Whether the AMI ID exists in your region

validate is a syntax and schema check, not a semantic check.

`terraform plan`

The most important pre-deploy check:

terraform plan -out=tfplan

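Always write the plan to a file (-out=tfplan) and apply that file with terraform apply tfplan. This ensures that apply executes exactly the plan you reviewed, not a new plan generated at apply time. If the state changes between your plan and apply, for example because another engineer applied first, Terraform rejects the saved plan as stale instead of silently applying something different.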
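The three checks above can be chained so that the first failure stops the sequence. This is a minimal sketch rather than a definitive script: the function name run_static_checks and the PLAN_FILE variable are hypothetical, and it uses terraform fmt -check so that unformatted files are reported instead of rewritten.

```shell
# Run fmt, validate, and plan in order; stop at the first failure.
# run_static_checks and PLAN_FILE are hypothetical names for this sketch.
run_static_checks() {
  plan_file="${PLAN_FILE:-tfplan}"

  # -check exits non-zero if any file would be reformatted
  terraform fmt -check -recursive || {
    echo "fmt: files need formatting" >&2
    return 1
  }

  terraform validate || {
    echo "validate: configuration is invalid" >&2
    return 2
  }

  # Save the plan so the later apply uses exactly what was reviewed
  terraform plan -out="$plan_file" || {
    echo "plan: failed" >&2
    return 3
  }
}
```

Run it from the root module directory, then review the saved plan with terraform show tfplan before applying it.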

Reading Plan Output Correctly

The plan output is dense. Four symbols to know:

+   resource will be created           → review the new attributes
~   resource will be updated in place  → confirm the diff is intended
-   resource will be destroyed         → confirm you want it gone
-/+ resource will be destroyed and recreated → STOP and re-read

The -/+ case is the dangerous one. It means a change to some attribute forces replacement: the old resource is destroyed and a new one is created. This is relevant for EC2 instances (changing the AMI forces replacement), RDS instances (changing storage_encrypted or the master username forces replacement, whereas an engine version change is normally an in-place upgrade), and security groups (changing the name or description forces replacement).

Example plan for a launch template update:

# aws_launch_template.web will be updated in-place
~ resource "aws_launch_template" "web" {
      id      = "lt-0abc123"
      name    = "fastapi-prod-lt"
    ~ latest_version = 1 -> (known after apply)

    ~ image_id = "ami-0abcdef1234567890" -> "ami-0fedcba0987654321"
  }

# aws_autoscaling_group.web will be updated in-place
~ resource "aws_autoscaling_group" "web" {
    ~ launch_template {
        ~ version = "1" -> (known after apply)
      }
  }

The launch template updates in place. The ASG updates in place — a new version attribute pointing to the new template. Neither is destroyed. This is the expected plan for an AMI rotation.

A plan that shows -/+ on the ASG itself is a problem — it would terminate all running instances. That typically happens when name or name_prefix changes. Catch it in the plan, not in prod.
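On a long plan it is easy to scroll past a replacement. The sketch below filters the human-readable plan text for the header comments Terraform prints for destructive actions ("must be replaced" and "will be destroyed"). The function name flag_destructive_changes is hypothetical, and the filter is a reading aid, not a substitute for reviewing the full plan.

```shell
# Read plan text on stdin and print the resource header lines for
# destructive actions. Exit status 1 means at least one was found.
# flag_destructive_changes is a hypothetical helper name.
flag_destructive_changes() {
  if grep -E '^ *# .* (must be replaced|will be destroyed)'; then
    return 1
  fi
  return 0
}

# Example wiring against the saved plan (left commented out because it
# needs a real tfplan file):
# terraform show -no-color tfplan | flag_destructive_changes
```

A non-zero exit status makes the function usable as a gate in a wrapper script: stop and re-read the plan whenever it prints anything.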

The Sandbox Environment Pattern

Never test against a shared environment. Manual testing requires deploying, breaking things, and cleaning up — none of which is safe in prod or staging.

The standard approach: a dedicated sandbox environment with its own Terraform workspace and state.

Workspaces caveat. Workspaces work well for very similar environments (dev/test of the same config), which is what we want for an ephemeral sandbox. For long-lived prod/staging separation, HashiCorp's current guidance is to use separate root configs with their own backend state files — not workspaces — because that gives you separate IAM, separate state buckets, and clearer blast-radius boundaries.

# Create the sandbox workspace (this also switches to it)
terraform workspace new sandbox
# On later runs, switch to the existing workspace
terraform workspace select sandbox

# Confirm which workspace you're on before every apply
terraform workspace show
# → sandbox

With workspaces, the same state file path gets a prefix:

s3://mnourdine-tf-state/terraform.tfstate          # default workspace
s3://mnourdine-tf-state/env:/sandbox/terraform.tfstate  # sandbox workspace

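Resources in the sandbox workspace are tracked in their own state and have their own lifecycle, so an apply or destroy in sandbox never touches the resources recorded in the default workspace. The workspaces still share one backend and one AWS account, so resource names must include the environment to avoid collisions.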
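The habit of confirming the workspace before every apply can be turned into a guard that refuses to continue on the wrong workspace. This is a small sketch under that assumption; require_workspace is a hypothetical helper name, not a Terraform command.

```shell
# Succeed only if the current Terraform workspace matches the expected one.
# require_workspace is a hypothetical helper name.
require_workspace() {
  expected="$1"
  current="$(terraform workspace show)" || return 2

  if [ "$current" != "$expected" ]; then
    echo "refusing to continue: on workspace '$current', expected '$expected'" >&2
    return 1
  fi
}

# Example wiring (left commented out because it needs an initialized config):
# require_workspace sandbox && terraform apply tfplan
```

Chaining the guard with && means a mistaken apply against the default workspace fails before Terraform is ever invoked.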

Variable overrides for sandbox to keep costs down:

terraform plan -out=tfplan \
  -var="environment=sandbox" \
  -var="instance_type=t2.micro" \
  -var="min_size=1" \
  -var="max_size=2" \
  -var="db_instance_class=db.t3.micro"
terraform apply tfplan

No multi-AZ, no prod-sized instances. The sandbox proves the configuration is correct — not that it can handle prod load.

Deploying to Sandbox and Verifying

After terraform apply completes, work through the verification checklist from the outside in.

1. Verify the ALB is reachable

ALB_DNS=$(terraform output -raw alb_dns_name)

# Basic connectivity
curl -i "http://$ALB_DNS/health"

Expected output:

HTTP/1.1 200 OK
Content-Type: application/json

{"status": "healthy", "hostname": "ip-10-0-1-23", "timestamp": "2026-04-23T14:00:00Z"}

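If curl hangs: the most likely cause is that the security group on the ALB is not allowing port 80 from your IP (or from 0.0.0.0/0). Check the security group rules in the AWS console.

If curl returns a 502 or 503: the ALB is up but has no target that can answer. A 502 means a target was reached but gave no valid response; a 503 means the target group has no usable targets. The next step is checking the target group.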

Quick diagnostic table for ALB responses:

Symptom	Most likely cause
`curl` hangs / connection timeout	ALB security group blocks port 80, or DNS not resolved yet
`Connection refused`	Hitting the wrong port, or no ALB at this DNS name
HTTP 502 Bad Gateway	ALB reached the instance but got no valid response — app not listening
HTTP 503 Service Unavailable	Target group has no registered targets, or none the ALB can route to
HTTP 504 Gateway Timeout	App is listening but didn't respond before the idle timeout (default 60s)
HTTP 200 but wrong body	Routing rules sending traffic to the wrong target group
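The table can be encoded as a small lookup so a verification script prints the likely cause next to the raw result. This is a sketch, assuming the standard curl exit codes (6 for an unresolved host, 7 for a failed connection, 28 for a timeout); diagnose_alb_response is a hypothetical helper name, and the messages paraphrase the table rather than any AWS output.

```shell
# Map a curl exit code and HTTP status to the likely cause from the table.
# diagnose_alb_response is a hypothetical helper name.
diagnose_alb_response() {
  curl_exit="$1"
  http_status="$2"

  # curl exit codes: 6 = could not resolve host, 7 = failed to connect,
  # 28 = operation timed out
  case "$curl_exit" in
    0)  ;;
    6)  echo "DNS not resolved yet"; return 0 ;;
    7)  echo "connection refused: wrong port, or no ALB at this DNS name"; return 0 ;;
    28) echo "timeout: ALB security group may be blocking the port"; return 0 ;;
    *)  echo "curl failed with exit code $curl_exit"; return 0 ;;
  esac

  case "$http_status" in
    200) echo "OK: if the body is wrong, check listener routing rules" ;;
    502) echo "bad gateway: app not listening or returned an invalid response" ;;
    503) echo "service unavailable: no usable targets in the target group" ;;
    504) echo "gateway timeout: app did not answer before the idle timeout" ;;
    *)   echo "unexpected HTTP status $http_status" ;;
  esac
}

# Example wiring (left commented out because it needs a real ALB_DNS):
# status=$(curl -s -o /dev/null -m 10 -w '%{http_code}' "http://$ALB_DNS/health")
# diagnose_alb_response "$?" "$status"
```

The -m 10 flag caps the request at ten seconds so a blocked security group shows up as exit code 28 instead of an indefinite hang.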

2. Check the target group health

# Get the target group ARN from state and keep it for the commands below
TG_ARN=$(terraform state show -no-color aws_lb_target_group.web \
  | awk '$1 == "arn" { gsub(/"/, "", $3); print $3 }')

# Check target health via AWS CLI
aws elbv2 describe-target-health \
  --target-group-arn "$TG_ARN" \
  --query 'TargetHealthDescriptions[*].{ID:Target.Id,State:TargetHealth.State,Reason:TargetHealth.Reason}' \
  --output table

Expected output:

--------------------------------------------------
|           DescribeTargetHealth                 |
+----------+-------------------+-----------------+
|    ID    |      Reason       |     State       |
+----------+-------------------+-----------------+
| i-0abc1  |  None             |  healthy        |
+----------+-------------------+-----------------+

If State is anything other than healthy, the Reason field explains why:

Reason: Target.FailedHealthChecks (State unhealthy): the health check requests are failing. The app is not running or not listening on port 8000.
Reason: Elb.InitialHealthChecking (State initial): the instance just joined the target group and is still being checked.
Reason: Target.NotRegistered (State unused): the instance is not registered with the target group, usually because it was deregistered during termination.
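Right after an apply, targets normally sit in the initial state for a while, so a single describe-target-health call is often inconclusive. The sketch below polls until every registered target reports healthy. The function name wait_for_healthy_targets and its default of 20 attempts at 15-second intervals are assumptions for illustration, not values from this stack.

```shell
# Poll a target group until all registered targets are healthy.
# Arguments: target group ARN, max attempts (default 20), delay in seconds
# (default 15). Name and defaults are hypothetical.
wait_for_healthy_targets() {
  tg_arn="$1"
  attempts="${2:-20}"
  delay="${3:-15}"
  i=1

  while [ "$i" -le "$attempts" ]; do
    # Text output is a whitespace-separated list of states
    states="$(aws elbv2 describe-target-health \
      --target-group-arn "$tg_arn" \
      --query 'TargetHealthDescriptions[*].TargetHealth.State' \
      --output text)" || return 2

    total=0
    healthy=0
    for s in $states; do
      total=$((total + 1))
      if [ "$s" = "healthy" ]; then
        healthy=$((healthy + 1))
      fi
    done

    echo "attempt $i: $healthy/$total healthy"
    if [ "$total" -gt 0 ] && [ "$healthy" -eq "$total" ]; then
      return 0
    fi

    i=$((i + 1))
    sleep "$delay"
  done

  return 1
}
```

The AWS CLI also ships a built-in waiter, aws elbv2 wait target-in-service, which covers the same case without the per-attempt progress line.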

3. Verify the app on a specific instance

# Get the instance IDs from the ASG
ASG_NAME=$(terraform output -raw asg_name)

aws autoscaling describe-auto-scaling-groups \
  --auto-scaling-group-names $ASG_NAME \
  --query 'AutoScalingGroups[0].Instances[*].InstanceId' \
  --output text

# Use EC2 Instance Connect to access the instance directly
aws ec2-instance-connect send-ssh-public-key \
  --instance-id i-0abc1 \
  --instance-os-user ec2-user \
  --ssh-public-key file://~/.ssh/id_rsa.pub

# Then connect within ~60 seconds (the pushed key expires fast)
ssh -i ~/.ssh/id_rsa ec2-user@<instance-public-ip>

Heads up for Day 18. This SSH path only works while instances live in public subnets. Once they move to private subnets behind a NAT Gateway, use AWS Systems Manager Session Manager (aws ssm start-session --target i-0abc1) — no public IP, no SSH key, no inbound port required.

Once on the instance, check whether the app is running:

# Is the systemd service running?
systemctl status fastapi

# Is anything listening on port 8000?
ss -tlnp | grep 8000

# What did the startup script output?
journalctl -u cloud-final --no-pager | tail -n 50
# or check cloud-init logs
tail -n 100 /var/log/cloud-init-output.log

The startup script logs are the ground truth for what happened when the instance first booted. If pip install fastapi failed because PyPI was unreachable, it shows up here.

4. Verify Secrets Manager access

# Get the secret name from the output or state
SECRET_NAME="fastapi/sandbox/db-credentials"

# Confirm the secret exists
aws secretsmanager describe-secret --secret-id $SECRET_NAME

# From the instance, verify the IAM role can read it
# (run this on the EC2 instance, not locally)
aws secretsmanager get-secret-value \
  --secret-id $SECRET_NAME \
  --query SecretString \
  --output text | python3 -m json.tool

If this returns AccessDeniedException from the instance, the IAM role attached to the instance profile does not have secretsmanager:GetSecretValue on that ARN. Check the policy and the ARN match exactly — including the region and account ID.
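Comparing ARNs by eye is error-prone, so the region and account fields can be pulled out and compared directly. This is a sketch with hypothetical helper names (arn_region_account, same_region_and_account); the ARNs in the usage comment are made-up placeholders. Note that Secrets Manager appends a hyphen and six random characters to the secret name in the ARN, so a policy written against the bare name will not match exactly either.

```shell
# Print "region account" from an ARN (fields 4 and 5 when split on ":").
# arn_region_account is a hypothetical helper name.
arn_region_account() {
  printf '%s\n' "$1" | awk -F: '{ print $4, $5 }'
}

# Succeed only if two ARNs share the same region and account ID.
same_region_and_account() {
  [ "$(arn_region_account "$1")" = "$(arn_region_account "$2")" ]
}

# Example with placeholder ARNs:
# same_region_and_account \
#   "arn:aws:secretsmanager:us-east-1:111122223333:secret:fastapi/sandbox/db-credentials-AbCdEf" \
#   "arn:aws:secretsmanager:eu-west-1:111122223333:secret:fastapi/sandbox/db-credentials-*" \
#   || echo "policy ARN points at a different region or account"
```

The first argument would typically be the ARN returned by describe-secret and the second the Resource value from the IAM policy.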

5. Check CloudWatch logs

LOG_GROUP="/fastapi/sandbox/app"

# List the most recent log streams
aws logs describe-log-streams \
  --log-group-name $LOG_GROUP \
  --order-by LastEventTime \
  --descending \
  --max-items 3 \
  --query 'logStreams[*].logStreamName' \
  --output text

# Tail a stream (set STREAM_NAME to one of the names listed above)
aws logs get-log-events \
  --log-group-name $LOG_GROUP \
  --log-stream-name $STREAM_NAME \
  --limit 50 \
  --query 'events[*].message' \
  --output text

If no log streams appear, the CloudWatch agent is not running on the instance or the IAM role is missing logs:CreateLogStream and logs:PutLogEvents permissions.

Testing Failure Scenarios

Manual testing is not just verifying the happy path. The value is in deliberately breaking things and confirming the system self-heals.

Scenario 1: Instance failure and ASG recovery

Terminate an instance manually and verify the ASG replaces it:

# Get an instance ID
INSTANCE_ID=$(aws autoscaling describe-auto-scaling-groups \
  --auto-scaling-group-names $ASG_NAME \
  --query 'AutoScalingGroups[0].Instances[0].InstanceId' \
  --output text)

# Terminate it via the ASG (proper lifecycle: triggers replacement,
# fires lifecycle hooks, respects cooldowns)
aws autoscaling terminate-instance-in-auto-scaling-group \
  --instance-id $INSTANCE_ID \
  --no-should-decrement-desired-capacity

# Alternative: aws ec2 terminate-instances --instance-ids $INSTANCE_ID
# This works too — the ASG detects the loss via EC2 health check and
# replaces the instance — but skips ASG lifecycle hooks.

# Watch the ASG activity log
watch -n 5 "aws autoscaling describe-scaling-activities \
  --auto-scaling-group-name $ASG_NAME \
  --max-items 5 \
  --query 'Activities[*].{Status:StatusCode,Cause:Cause}' \
  --output table"

Expected: within 2–3 minutes, the ASG launches a replacement instance. The ASG deregisters the terminating instance from the target group, so the ALB stops sending it new requests right away, and the new instance is added once it passes health checks.

What to check: the time between termination and the new instance becoming healthy. If it exceeds your SLA for recovery, the startup script is probably too slow. If replacements are themselves terminated before they finish booting, health_check_grace_period needs increasing.

Scenario 2: Health check failure

The ALB health check polls /health every 30 seconds. With the AWS provider's default unhealthy_threshold of 3 consecutive failed checks, an instance is marked unhealthy after roughly 90 seconds of failures.
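The detection time is simply the check interval multiplied by the unhealthy threshold, which makes it easy to recompute for whatever values your target group actually uses. A tiny sketch of that arithmetic follows; seconds_until_unhealthy is a hypothetical helper name, and the result is an approximation because the failure can start at any point within a check interval.

```shell
# Approximate seconds before a failing target is marked unhealthy:
# health check interval multiplied by the unhealthy threshold.
# seconds_until_unhealthy is a hypothetical helper name.
seconds_until_unhealthy() {
  interval="$1"
  threshold="$2"
  echo $(( interval * threshold ))
}

seconds_until_unhealthy 30 2   # prints 60
seconds_until_unhealthy 30 3   # prints 90
```

Confirm the real interval and threshold with terraform state show aws_lb_target_group.web before relying on either figure.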

Simulate a health check failure by stopping the app on an instance:

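# Run on the instance (connect as shown in step 3)
sudo systemctl stop fastapi

# Back on your workstation: watch the ALB mark the instance unhealthy
# (~90s with a 30s interval and a threshold of 3), then drain once the
# ASG deregisters it (deregistration_delay defaults to 300s)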
watch -n 10 "aws elbv2 describe-target-health \
  --target-group-arn $TG_ARN \
  --query 'TargetHealthDescriptions[*].{ID:Target.Id,State:TargetHealth.State}' \
  --output table"

You should see the instance transition healthy → unhealthy → draining, then disappear from the output once deregistration completes. The draining step starts when the ASG health check (type = ELB) detects the unhealthy instance and begins replacing it.

This scenario validates the entire health check chain:

App stops responding
ALB marks instance unhealthy after 3 failed checks
ASG health check (ELB type) detects the unhealthy instance
ASG terminates and replaces it
New instance joins the target group and passes health checks

If the ASG health check type is EC2 instead of ELB, step 3 never fires — the ASG only detects failures at the VM level, not the application level. Confirm with:

terraform state show aws_autoscaling_group.web | grep health_check_type
# → health_check_type = "ELB"

Scenario 3: Bad secret value

Update the Secrets Manager secret with a malformed value and confirm the app handles it gracefully:

# Store an invalid JSON value (missing closing brace)
aws secretsmanager put-secret-value \
  --secret-id $SECRET_NAME \
  --secret-string '{"username": "api_user", "password": "testpass"'

# Trigger a new instance (forces a fresh startup script run)
aws autoscaling start-instance-refresh \
  --auto-scaling-group-name $ASG_NAME \
  --preferences '{"MinHealthyPercentage":100}'

If the app crashes on startup because json.loads() raises an exception, the instance never passes the health check and the ALB never routes traffic to it. This is the safe failure mode — a bad secret causes a failed deploy, not a running instance with broken database connectivity.

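Check the CloudWatch logs on the new instance to confirm the error is logged clearly. Then restore a valid secret value with put-secret-value so later tests start from a working configuration.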

Useful State Commands for Debugging

Terraform state is the source of truth for what Terraform knows about your deployed resources. When something looks wrong, read the state.

Safety note. The state-mutating commands (terraform state rm, terraform state mv, terraform import) take a state lock only when the backend has locking configured, and a lock does not protect a saved plan that a colleague has reviewed but not yet applied. Confirm no apply is running or pending against the same backend before mutating state; a botched edit can corrupt state and require restoring from backup.

# List all resources Terraform is managing
terraform state list

# Inspect a specific resource — full attribute map
terraform state show aws_lb.web

# Show all outputs
terraform output

# Show a specific output as raw text (useful for piping to CLI commands)
terraform output -raw alb_dns_name

When a resource exists in AWS but not in state (e.g., manually created or imported from another config):

# Import an existing resource into state (CLI form)
terraform import aws_s3_bucket.logs existing-bucket-name

For Terraform 1.5+, the preferred form is an import block in code — it is reviewable in a PR and tracked in version control, unlike the CLI command which leaves no record:

import {
  to = aws_s3_bucket.logs
  id = "existing-bucket-name"
}

When a resource exists in state but was deleted outside Terraform:

# Remove the stale reference from state without destroying anything in AWS
terraform state rm aws_instance.orphaned

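This is how you clean up state after manual AWS console deletions: remove the stale entry, then let Terraform recreate the resource on the next apply. In most cases a plain terraform plan already detects the deletion during refresh and proposes recreating the resource, so reach for state rm only when refresh fails or you no longer want Terraform to manage the resource.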
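Because a bad state edit may need restoring from backup, it helps to take that backup as part of the same step. The sketch below pulls a copy of the state with terraform state pull, confirms the address is actually in state, and only then removes it. The function name safe_state_rm and the backup file naming are hypothetical.

```shell
# Back up the state, confirm the address exists, then remove it from state.
# safe_state_rm and the backup file name pattern are hypothetical.
safe_state_rm() {
  addr="$1"
  backup="state-backup-$(date +%Y%m%d-%H%M%S).tfstate"

  # terraform state pull prints the current state to stdout
  terraform state pull > "$backup" || {
    echo "backup failed; state left untouched" >&2
    return 2
  }

  # -F: fixed string, -x: whole-line match, -q: quiet
  if ! terraform state list | grep -Fxq -- "$addr"; then
    echo "$addr is not in state; nothing to remove" >&2
    return 1
  fi

  terraform state rm "$addr"
}

# Example (left commented out because it needs an initialized backend):
# safe_state_rm aws_instance.orphaned
```

Keep the backup file out of version control, since state can contain secrets in plain text.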

Cleaning Up After Tests

Every manual test environment must be destroyed when testing is done. Forgotten test environments are one of the largest sources of unnecessary AWS cost.

# Destroy everything in the sandbox workspace
terraform workspace select sandbox
terraform destroy

terraform destroy walks the dependency graph (implicit references plus depends_on) in reverse, so resources that depend on others are destroyed first. For the FastAPI stack, the order is roughly:

ASG (instances terminated)
ALB + target group (no more traffic)
RDS (database deleted)
Security groups (rules removed)
IAM roles and policies (permissions revoked)
CloudWatch log groups (optional — may want to retain logs)

The lifecycle { prevent_destroy = true } on the RDS resource (from Day 13) will block a full destroy in production. prevent_destroy must be a literal boolean — you cannot set it from a variable expression. The pattern that works is to keep prevent_destroy = true in the prod-facing config and use a separate sandbox config (or module instance) without that lifecycle block.

For a quick sandbox cleanup, temporarily comment out the lifecycle block on a branch — do not commit:

# For sandbox cleanup only — do not commit
# lifecycle {
#   prevent_destroy = true
# }

Partial destroy with `-target`

Sometimes you want to destroy a specific resource without tearing down the whole environment — for example, recreating the launch template while keeping the ALB:

# Destroy only the launch template
terraform destroy -target=aws_launch_template.web

# Destroy only the ASG and its instances
terraform destroy -target=aws_autoscaling_group.web

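Use -target carefully. Terraform also destroys everything that depends on the target, so destroying the launch template takes the ASG that references it down as well; read the destroy plan before confirming. If the goal is to recreate a single resource, terraform apply -replace=aws_launch_template.web is usually the safer tool. Always run terraform plan after a partial destroy to see what Terraform thinks needs to be reconciled.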

The cost of forgetting

In us-east-1 on on-demand pricing: an EC2 t3.small running for 30 days costs about $15/month, an RDS db.t3.micro costs about $15/month, and an ALB costs about $18/month. A forgotten sandbox costs roughly $50/month — which adds up across a team. Other regions can be 10–20% higher.

Two practices prevent this:

Set a calendar reminder when you deploy a sandbox. 48 hours is usually enough for manual testing. If you are not done, extend intentionally — do not let it run by default.
Tag all sandbox resources with an expiry date and use AWS Config rules or a weekly Lambda to flag resources older than 7 days. Compute the expiry outside Terraform (e.g., pass it in from CI as -var="expires_on=...") — using timestamp() inside the config causes the value to change on every plan, leaving the resource in perpetual drift:

variable "expires_on" {
  description = "Sandbox expiry date (YYYY-MM-DD), set by CI"
  type        = string
  default     = ""
}

locals {
  sandbox_tags = var.environment == "sandbox" && var.expires_on != "" ? {
    ExpiresOn = var.expires_on
  } : {}

  all_tags = merge(local.base_tags, local.sandbox_tags)
}

If you must compute the expiry inside Terraform, pair timestamp() with lifecycle { ignore_changes = [tags["ExpiresOn"]] } so subsequent plans don't show drift.
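Computing the expiry outside Terraform can be as small as one date call in the wrapper script or CI job. The sketch below prints a UTC date a given number of days ahead in YYYY-MM-DD form; expiry_date is a hypothetical helper name, and the fallback branch assumes BSD date (as on macOS) when the GNU -d syntax is not available.

```shell
# Print the UTC date N days from now as YYYY-MM-DD (default: 2 days).
# expiry_date is a hypothetical helper name.
expiry_date() {
  days="${1:-2}"

  # GNU date (most Linux CI runners)
  if date -u -d "+${days} days" +%Y-%m-%d 2>/dev/null; then
    return 0
  fi

  # BSD date fallback (macOS)
  date -u -v"+${days}d" +%Y-%m-%d
}

# Example wiring (left commented out because it runs a real apply):
# terraform apply \
#   -var="environment=sandbox" \
#   -var="expires_on=$(expiry_date 2)"
```

Because the value is fixed at the moment the command runs, later plans see the same tag value and report no drift until someone deliberately passes a new date.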

The Manual Testing Checklist

Before marking a module change as ready for review:

Static checks

terraform fmt -recursive — no formatting changes
terraform validate — exits with "Success"
terraform plan reviewed — no unexpected -/+ replacements

Deploy verification

ALB DNS name returns HTTP 200 from /health
All instances in target group show healthy
Secrets Manager secret is accessible from the instance role
CloudWatch log group has recent log entries
ASG has the correct desired_capacity for the environment

Failure scenario tests (for significant changes)

Instance termination → ASG replaces it within 3 minutes
App process stop → ALB drains instance → ASG replaces it
Health check type is ELB, not EC2

Cleanup

terraform destroy completes cleanly in sandbox
No orphaned resources in AWS console after destroy
Sandbox workspace deleted (terraform workspace select default, then terraform workspace delete sandbox)

Manual testing is the verification layer between writing Terraform and trusting it. It does not replace Terratest — it informs it. Every failure scenario tested manually becomes a Terratest assertion that prevents the same failure from reaching production.

The verification checklist above is reusable across every module. As the stack grows — private networking, VPC, NAT Gateway — the same workflow applies: deploy to sandbox, verify from the outside in, test the failure scenarios, clean up.

The remaining gap from the Day 16 checklist is private networking. The default VPC has served its purpose for learning, but production infrastructure needs EC2 instances in private subnets, an ALB in public subnets, and a NAT Gateway for outbound-only internet access from the private tier. That is Day 18's module.
