Before building the private networking module (the remaining gap from the Day 16 checklist), it is worth establishing the manual testing discipline that keeps each iteration safe. Terratest gives you automated confidence at the module level. Manual testing gives you confidence at the human level — it answers the question "does this actually do what I think it does?"
The two are not in competition. Manual tests reveal the failure scenarios that become Terratest assertions. Terratest catches regressions that manual tests would miss. Both are part of the workflow.
Why Manual Testing Still Matters
Automated tests run the happy path reliably. They do not always tell you why something failed — only that it did. When a Terratest run fails at the retry loop after 5 minutes, you still need to:
- Read CloudWatch logs to see why the instance never passed the health check
- Check the Secrets Manager policy to confirm the IAM role has access
- Inspect the security group rules to verify the ALB can reach the instances
- Read the ASG activity log to understand why instances are cycling
These are manual operations. The faster you can do them, the faster you can diagnose the failure and write the test that catches it next time.
Step 1: Static Checks Before terraform plan
Three commands run before every apply, in order:
terraform fmt
Formats all .tf files in the current directory to canonical style:
terraform fmt -recursive
The -recursive flag reaches into subdirectories (module directories). Unformatted code passes validation and planning but fails CI diff checks. Running fmt locally avoids a CI failure on something trivial.
What it catches:
- Inconsistent indentation (tabs vs spaces)
- Misaligned = signs in attribute blocks
- Inconsistent spacing inside blocks and around operators
terraform fmt is non-negotiable before a PR. It is fast and has no side effects.
terraform validate
Validates the configuration without making any AWS API calls:
terraform validate
What it catches:
- Undeclared variable references (var.does_not_exist)
- Type mismatches between variables and their assigned values
- Missing required arguments on resource blocks
- Invalid provider configuration syntax
- Some reference cycles (others only surface during plan)
What it does NOT catch:
- Whether a security group rule is correct
- Whether an IAM policy grants the right permissions
- Whether the AMI ID exists in your region
validate is a syntax and schema check, not a semantic check.
terraform plan
The most important pre-deploy check:
terraform plan -out=tfplan
Always write the plan to a file (-out=tfplan). This ensures that terraform apply uses exactly the plan you reviewed — not a new plan generated at apply time (which could differ if another engineer made a change between your plan and apply).
Reading Plan Output Correctly
The plan output is dense. Four symbols to know:
+ resource will be created → review the new attributes
~ resource will be updated in place → confirm the diff is intended
- resource will be destroyed → confirm you want it gone
-/+ resource will be destroyed and recreated → STOP and re-read
The -/+ case is the dangerous one. It means a change to an attribute forces replacement: the old resource is destroyed and a new one is created. This is relevant for EC2 instances (changing the AMI forces replacement), RDS instances (changing storage_encrypted forces replacement, whereas an engine version change is normally applied in place), and security groups (changing name or description forces replacement).
Example plan for a launch template update:
# aws_launch_template.web will be updated in-place
~ resource "aws_launch_template" "web" {
id = "lt-0abc123"
name = "fastapi-prod-lt"
~ latest_version = 1 -> (known after apply)
~ image_id = "ami-0abcdef1234567890" -> "ami-0fedcba0987654321"
}
# aws_autoscaling_group.web will be updated in-place
~ resource "aws_autoscaling_group" "web" {
~ launch_template {
~ version = "1" -> (known after apply)
}
}
The launch template updates in place. The ASG updates in place — a new version attribute pointing to the new template. Neither is destroyed. This is the expected plan for an AMI rotation.
A plan that shows -/+ on the ASG itself is a problem — it would terminate all running instances. That typically happens when name or name_prefix changes. Catch it in the plan, not in prod.
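The -/+ check can also be run mechanically against the saved plan. The sketch below is a minimal illustration, assuming the JSON printed by terraform show -json tfplan, in which each resource_changes entry carries a change.actions list and a replacement appears as both "delete" and "create". The function name and the trimmed sample plan are hypothetical.

```python
import json


def find_replacements(plan: dict) -> list[str]:
    """Return the addresses of resources the plan would destroy and recreate."""
    replaced = []
    for rc in plan.get("resource_changes", []):
        actions = rc.get("change", {}).get("actions", [])
        # A replacement lists both actions, in either order
        # (create-before-destroy reverses them).
        if "delete" in actions and "create" in actions:
            replaced.append(rc["address"])
    return replaced


# Hypothetical, heavily trimmed plan document for illustration only.
# In practice, load the output of `terraform show -json tfplan` instead.
SAMPLE_PLAN = json.loads("""
{
  "resource_changes": [
    {"address": "aws_launch_template.web",
     "change": {"actions": ["update"]}},
    {"address": "aws_autoscaling_group.web",
     "change": {"actions": ["delete", "create"]}}
  ]
}
""")

replacements = find_replacements(SAMPLE_PLAN)
print(replacements)
```

An empty list means the plan contains no replacements. Anything else deserves a second read before apply.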
The Sandbox Environment Pattern
Never test against a shared environment. Manual testing requires deploying, breaking things, and cleaning up — none of which is safe in prod or staging.
The standard approach: a dedicated sandbox environment with its own Terraform workspace and state.
Workspaces caveat. Workspaces work well for very similar environments (dev/test of the same config), which is what we want for an ephemeral sandbox. For long-lived prod/staging separation, HashiCorp's current guidance is to use separate root configs with their own backend state files, not workspaces, because that gives you separate IAM, separate state buckets, and clearer blast-radius boundaries.
# Create the sandbox workspace (new also switches to it)
terraform workspace new sandbox
# In later sessions, switch back to it
terraform workspace select sandbox
# Confirm which workspace you're on before every apply
terraform workspace show
# → sandbox
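The habit of confirming the workspace before every apply can be enforced by a small wrapper. The sketch below assumes terraform is on PATH and that only a workspace named sandbox is safe for manual testing. The function names and the allowed list are illustrative and not part of Terraform.

```python
import subprocess

ALLOWED_WORKSPACES = ("sandbox",)  # assumption: only sandbox is safe to break


def current_workspace() -> str:
    """Ask the Terraform CLI which workspace is currently selected."""
    result = subprocess.run(
        ["terraform", "workspace", "show"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


def ensure_safe_workspace(name: str, allowed=ALLOWED_WORKSPACES) -> str:
    """Raise if `name` is not a workspace we are willing to apply against."""
    if name not in allowed:
        raise RuntimeError(
            f"refusing to continue: workspace is {name!r}, expected one of {allowed}"
        )
    return name


# Typical use before an apply (not executed here, it needs a Terraform project):
#   ensure_safe_workspace(current_workspace())
```

Keeping the check separate from the CLI call makes it easy to exercise without a Terraform project on disk.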
With workspaces, the same state file path gets a prefix:
s3://mnourdine-tf-state/terraform.tfstate # default workspace
s3://mnourdine-tf-state/env:/sandbox/terraform.tfstate # sandbox workspace
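Resources in the sandbox workspace are tracked in their own state, so they are created, changed, and destroyed independently of the default workspace. The isolation is at the state level only: both workspaces share the same backend, credentials, and AWS account, so resource names must include the environment to avoid collisions.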
Variable overrides for sandbox to keep costs down:
terraform apply \
-var="environment=sandbox" \
-var="instance_type=t2.micro" \
-var="min_size=1" \
-var="max_size=2" \
-var="db_instance_class=db.t3.micro"
No multi-AZ, no prod-sized instances. The sandbox proves the configuration is correct — not that it can handle prod load.
Deploying to Sandbox and Verifying
After terraform apply completes, work through the verification checklist from the outside in.
1. Verify the ALB is reachable
ALB_DNS=$(terraform output -raw alb_dns_name)
# Basic connectivity
curl -i "http://$ALB_DNS/health"
Expected output:
HTTP/1.1 200 OK
Content-Type: application/json
{"status": "healthy", "hostname": "ip-10-0-1-23", "timestamp": "2026-04-23T14:00:00Z"}
If curl hangs: the security group on the ALB is not allowing port 80 from 0.0.0.0/0. Check the security group rules in the AWS console.
If curl returns a 502: the ALB is up but the instance did not return a valid response, which usually means the app is not listening. The next step is checking the target group.
Quick diagnostic table for ALB responses:
| Symptom | Most likely cause |
|---|---|
| curl hangs / connection timeout | ALB security group blocks port 80, or DNS not resolved yet |
| Connection refused | Hitting the wrong port, or no ALB at this DNS name |
| HTTP 502 Bad Gateway | ALB reached the instance but got no valid response — app not listening |
| HTTP 503 Service Unavailable | Target group has no registered targets (instances still launching, or all deregistered) |
| HTTP 504 Gateway Timeout | App is listening but didn't respond before the idle timeout (default 60s) |
| HTTP 200 but wrong body | Routing rules sending traffic to the wrong target group |
2. Check the target group health
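# Get the target group ARN from state and keep it for the next command
# ($1 == "arn" skips the arn_suffix attribute that a plain grep would also match)
TG_ARN=$(terraform state show aws_lb_target_group.web | awk '$1 == "arn" {print $3}' | tr -d '"')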
# Check target health via AWS CLI
aws elbv2 describe-target-health \
--target-group-arn $TG_ARN \
--query 'TargetHealthDescriptions[*].{ID:Target.Id,State:TargetHealth.State,Reason:TargetHealth.Reason}' \
--output table
Expected output:
--------------------------------------------------
| DescribeTargetHealth |
+----------+-------------------+-----------------+
| ID | Reason | State |
+----------+-------------------+-----------------+
| i-0abc1 | None | healthy |
+----------+-------------------+-----------------+
If State is not healthy, the Reason column says why:
- Target.FailedHealthChecks (state unhealthy): the instance is failing the health check request. The app is not running or not listening on port 8000.
- Elb.InitialHealthChecking (state initial): the instance just joined the target group and is still being checked.
- Target.NotRegistered (state unused): the instance is not registered with the target group, usually because it was deregistered during a termination.
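With more than a handful of targets, the same triage can be scripted. The sketch below assumes the JSON shape returned by aws elbv2 describe-target-health --output json (a TargetHealthDescriptions list with Target.Id, TargetHealth.State, and TargetHealth.Reason). The hint strings restate the reasons above, and the sample payload is made up.

```python
import json

# Hints restating the reasons discussed above; extend as needed.
HINTS = {
    "Target.FailedHealthChecks": "app not running or not listening on the health check port",
    "Elb.InitialHealthChecking": "just registered, still being checked",
    "Target.NotRegistered": "not registered with the target group",
}


def unhealthy_targets(payload: dict) -> list[tuple[str, str, str]]:
    """Return (target_id, state, hint) for every target that is not healthy."""
    rows = []
    for desc in payload.get("TargetHealthDescriptions", []):
        health = desc.get("TargetHealth", {})
        state = health.get("State", "unknown")
        if state == "healthy":
            continue
        reason = health.get("Reason", "")
        hint = HINTS.get(reason, reason or "no reason given")
        rows.append((desc["Target"]["Id"], state, hint))
    return rows


# Made-up sample payload for illustration only.
SAMPLE = json.loads("""
{
  "TargetHealthDescriptions": [
    {"Target": {"Id": "i-0abc1", "Port": 8000},
     "TargetHealth": {"State": "healthy"}},
    {"Target": {"Id": "i-0abc2", "Port": 8000},
     "TargetHealth": {"State": "unhealthy",
                      "Reason": "Target.FailedHealthChecks"}}
  ]
}
""")

problem_rows = unhealthy_targets(SAMPLE)
for row in problem_rows:
    print(row)
```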
3. Verify the app on a specific instance
# Get the instance IDs from the ASG
ASG_NAME=$(terraform output -raw asg_name)
aws autoscaling describe-auto-scaling-groups \
--auto-scaling-group-names $ASG_NAME \
--query 'AutoScalingGroups[0].Instances[*].InstanceId' \
--output text
# Use EC2 Instance Connect to access the instance directly
aws ec2-instance-connect send-ssh-public-key \
--instance-id i-0abc1 \
--instance-os-user ec2-user \
--ssh-public-key file://~/.ssh/id_rsa.pub
# Then connect within ~60 seconds (the pushed key expires fast)
ssh -i ~/.ssh/id_rsa ec2-user@<instance-public-ip>
Heads up for Day 18. This SSH path only works while instances live in public subnets. Once they move to private subnets behind a NAT Gateway, use AWS Systems Manager Session Manager (aws ssm start-session --target i-0abc1): no public IP, no SSH key, no inbound port required.
Once on the instance, check whether the app is running:
# Is the systemd service running?
systemctl status fastapi
# Is anything listening on port 8000?
ss -tlnp | grep 8000
# What did the startup script output?
journalctl -u cloud-final --no-pager | tail -50
# or check cloud-init logs
tail -n 100 /var/log/cloud-init-output.log
The startup script logs are the ground truth for what happened when the instance first booted. If pip install fastapi failed because PyPI was unreachable, it shows up here.
4. Verify Secrets Manager access
# Get the secret name from the output or state
SECRET_NAME="fastapi/sandbox/db-credentials"
# Confirm the secret exists
aws secretsmanager describe-secret --secret-id $SECRET_NAME
# From the instance, verify the IAM role can read it
# (run this on the EC2 instance, not locally)
aws secretsmanager get-secret-value \
--secret-id $SECRET_NAME \
--query SecretString \
--output text | python3 -m json.tool
If this returns AccessDeniedException from the instance, the IAM role attached to the instance profile does not have secretsmanager:GetSecretValue on that ARN. Check that the ARN in the policy matches the secret exactly, including the region, the account ID, and the six-character random suffix Secrets Manager appends to the secret name.
5. Check CloudWatch logs
LOG_GROUP="/fastapi/sandbox/app"
# List the most recent log streams
aws logs describe-log-streams \
--log-group-name $LOG_GROUP \
--order-by LastEventTime \
--descending \
--max-items 3 \
--query 'logStreams[*].logStreamName' \
--output text
# Tail a stream (set STREAM_NAME to one of the names listed above)
aws logs get-log-events \
--log-group-name $LOG_GROUP \
--log-stream-name $STREAM_NAME \
--limit 50 \
--query 'events[*].message' \
--output text
If no log streams appear, the CloudWatch agent is not running on the instance or the IAM role is missing logs:CreateLogStream and logs:PutLogEvents permissions.
Testing Failure Scenarios
Manual testing is not just verifying the happy path. The value is in deliberately breaking things and confirming the system self-heals.
Scenario 1: Instance failure and ASG recovery
Terminate an instance manually and verify the ASG replaces it:
# Get an instance ID
INSTANCE_ID=$(aws autoscaling describe-auto-scaling-groups \
--auto-scaling-group-names $ASG_NAME \
--query 'AutoScalingGroups[0].Instances[0].InstanceId' \
--output text)
# Terminate it via the ASG (proper lifecycle: triggers replacement,
# fires lifecycle hooks, respects cooldowns)
aws autoscaling terminate-instance-in-auto-scaling-group \
--instance-id $INSTANCE_ID \
--no-should-decrement-desired-capacity
# Alternative: aws ec2 terminate-instances --instance-ids $INSTANCE_ID
# This works too — the ASG detects the loss via EC2 health check and
# replaces the instance — but skips ASG lifecycle hooks.
# Watch the ASG activity log
watch -n 5 "aws autoscaling describe-scaling-activities \
--auto-scaling-group-name $ASG_NAME \
--max-items 5 \
--query 'Activities[*].{Status:StatusCode,Cause:Cause}' \
--output table"
Expected: within 2–3 minutes, the ASG launches a replacement instance. The terminating instance is deregistered from the target group, so the ALB stops sending it new requests, and the new instance receives traffic once it passes health checks.
What to check: the time between termination and the new instance becoming healthy. If it exceeds your recovery target, the startup script is probably too slow. If replacements are terminated before they finish booting, health_check_grace_period is too short and needs increasing.
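Timing that interval by hand is error-prone, so a polling helper can do it. The sketch below is one possible shape: get_states is a caller-supplied, hypothetical function that returns the current TargetHealth.State of each target (for example by wrapping the describe-target-health call), and the clock and sleep parameters exist only so the loop can be exercised without waiting.

```python
import time
from typing import Callable, Iterable


def seconds_until_all_healthy(
    get_states: Callable[[], Iterable[str]],
    expected_count: int,
    timeout_s: float = 600.0,
    poll_s: float = 5.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> float:
    """Poll until `expected_count` targets report healthy; return elapsed seconds."""
    start = clock()
    while True:
        healthy = sum(1 for state in get_states() if state == "healthy")
        if healthy >= expected_count:
            return clock() - start
        if clock() - start >= timeout_s:
            raise TimeoutError(f"only {healthy}/{expected_count} targets healthy")
        sleep(poll_s)


# Demonstration with a fake clock and canned responses (no AWS calls, no waiting).
_now = [0.0]
_responses = iter([["healthy"], ["healthy", "initial"], ["healthy", "healthy"]])

elapsed = seconds_until_all_healthy(
    get_states=lambda: next(_responses),
    expected_count=2,
    clock=lambda: _now[0],
    sleep=lambda s: _now.__setitem__(0, _now[0] + s),
)
print(elapsed)
```

Start the helper immediately after the terminate command and compare the returned value with your recovery target.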
Scenario 2: Health check failure
The ALB health check polls /health every 30 seconds. With unhealthy_threshold at 3 consecutive failed checks (the aws_lb_target_group default), an instance is marked unhealthy after roughly 90 seconds of failures.
Simulate a health check failure by stopping the app on an instance:
# Run on the instance, after connecting over SSH
sudo systemctl stop fastapi
# Watch the ALB mark the instance unhealthy (~90s with these settings),
# then drain it (deregistration_delay defaults to 300s)
watch -n 10 "aws elbv2 describe-target-health \
--target-group-arn $TG_ARN \
--query 'TargetHealthDescriptions[*].{ID:Target.Id,State:TargetHealth.State}' \
--output table"
You should see the instance go from healthy to unhealthy. The ASG health check (type = ELB) then picks up the unhealthy status and terminates the instance, which shows as draining before it disappears from the target group and a replacement registers.
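How long the first transition takes is simple arithmetic on the target group settings. The sketch below assumes detection needs interval × unhealthy_threshold seconds of consecutive failures and ignores request timeouts and polling jitter. The function name is illustrative and the inputs are example values.

```python
def seconds_until_unhealthy(interval_s: int, unhealthy_threshold: int) -> int:
    """Rough time for the ALB to mark a failing target unhealthy."""
    return interval_s * unhealthy_threshold


# 30-second interval with two different thresholds.
with_three = seconds_until_unhealthy(interval_s=30, unhealthy_threshold=3)
with_two = seconds_until_unhealthy(interval_s=30, unhealthy_threshold=2)
print(with_three, with_two)
```

If the watch output takes much longer than this estimate to flip, check which interval and threshold values are actually set on the target group.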
This scenario validates the entire health check chain:
1. App stops responding
2. ALB marks instance unhealthy after 3 failed checks
3. ASG health check (ELB type) detects the unhealthy instance
4. ASG terminates and replaces it
5. New instance joins the target group and passes health checks
If the ASG health check type is EC2 instead of ELB, step 3 never fires — the ASG only detects failures at the VM level, not the application level. Confirm with:
terraform state show aws_autoscaling_group.web | grep health_check_type
# → health_check_type = "ELB"
Scenario 3: Bad secret value
Update the Secrets Manager secret with a malformed value and confirm the app handles it gracefully:
# Store an invalid JSON value (missing closing brace)
aws secretsmanager put-secret-value \
--secret-id $SECRET_NAME \
--secret-string '{"username": "api_user", "password": "testpass"'
# Trigger a new instance (forces a fresh startup script run)
aws autoscaling start-instance-refresh \
--auto-scaling-group-name $ASG_NAME \
--preferences '{"MinHealthyPercentage":100}'
If the app crashes on startup because json.loads() raises an exception, the instance never passes the health check and the ALB never routes traffic to it. This is the safe failure mode — a bad secret causes a failed deploy, not a running instance with broken database connectivity.
Check the CloudWatch logs on the new instance to confirm the error is logged clearly.
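For the error to be logged clearly, the app has to fail fast with a message that names the problem without printing the secret. The sketch below shows one way to do that. It assumes the secret is a JSON object with username and password keys (as in the malformed example above), and the function and exception names are hypothetical, not taken from the actual app.

```python
import json

REQUIRED_KEYS = ("username", "password")  # assumed secret schema


class SecretFormatError(ValueError):
    """Raised when the secret payload cannot be used to configure the app."""


def parse_db_secret(raw: str) -> dict:
    """Parse the SecretString, failing loudly without echoing its contents."""
    try:
        secret = json.loads(raw)
    except json.JSONDecodeError as exc:
        # Report the position only; never log the secret value itself.
        raise SecretFormatError(
            f"secret is not valid JSON (line {exc.lineno}, column {exc.colno})"
        ) from exc
    if not isinstance(secret, dict):
        raise SecretFormatError("secret must be a JSON object")
    missing = [key for key in REQUIRED_KEYS if key not in secret]
    if missing:
        raise SecretFormatError(f"secret is missing keys: {', '.join(missing)}")
    return secret


# The malformed value from the scenario (missing closing brace).
bad_value = '{"username": "api_user", "password": "testpass"'
try:
    parse_db_secret(bad_value)
except SecretFormatError as err:
    print(f"startup aborted: {err}")
```

Letting the exception propagate at startup keeps the process from ever listening on port 8000, which is what makes the health check fail.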
Useful State Commands for Debugging
Terraform state is the source of truth for what Terraform knows about your deployed resources. When something looks wrong, read the state.
Safety note. The state-mutating commands (terraform state rm, terraform state mv, terraform import) take a state lock, but only when the backend supports locking and nobody passes -lock=false. Confirm no apply is running against the same backend before mutating state: a botched concurrent edit can corrupt state and require restoring from backup.
# List all resources Terraform is managing
terraform state list
# Inspect a specific resource — full attribute map
terraform state show aws_lb.web
# Show all outputs
terraform output
# Show a specific output as raw text (useful for piping to CLI commands)
terraform output -raw alb_dns_name
When a resource exists in AWS but not in state (e.g., manually created or imported from another config):
# Import an existing resource into state (CLI form)
terraform import aws_s3_bucket.logs existing-bucket-name
For Terraform 1.5+, the preferred form is an import block in code — it is reviewable in a PR and tracked in version control, unlike the CLI command which leaves no record:
import {
to = aws_s3_bucket.logs
id = "existing-bucket-name"
}
When a resource exists in state but was deleted outside Terraform:
# Remove the stale reference from state without destroying anything in AWS
terraform state rm aws_instance.orphaned
This is how you clean up state after manual AWS console deletions — remove the stale entry, then let Terraform recreate the resource on the next apply.
Cleaning Up After Tests
Every manual test environment must be destroyed when testing is done. Forgotten test environments are one of the largest sources of unnecessary AWS cost.
# Destroy everything in the sandbox workspace
terraform workspace select sandbox
terraform destroy
terraform destroy walks the dependency graph (implicit references plus depends_on) in reverse, so resources that depend on others are destroyed first. For the FastAPI stack, the order is roughly:
1. ASG (instances terminated)
2. ALB + target group (no more traffic)
3. RDS (database deleted)
4. Security groups (rules removed)
5. IAM roles and policies (permissions revoked)
6. CloudWatch log groups (optional: you may want to retain logs)
The lifecycle { prevent_destroy = true } on the RDS resource (from Day 13) will block a full destroy in production. prevent_destroy must be a literal boolean — you cannot set it from a variable expression. The pattern that works is to keep prevent_destroy = true in the prod-facing config and use a separate sandbox config (or module instance) without that lifecycle block.
For a quick sandbox cleanup, temporarily comment out the lifecycle block on a branch — do not commit:
# For sandbox cleanup only — do not commit
# lifecycle {
# prevent_destroy = true
# }
Partial destroy with -target
Sometimes you want to destroy a specific resource without tearing down the whole environment — for example, recreating the launch template while keeping the ALB:
# Destroy only the launch template
terraform destroy -target=aws_launch_template.web
# Destroy only the ASG and its instances
terraform destroy -target=aws_autoscaling_group.web
Use -target carefully. Partial destroys can leave state inconsistent if a destroyed resource had dependents. Always run terraform plan after a partial destroy to see what Terraform thinks needs to be reconciled.
The cost of forgetting
In us-east-1 on on-demand pricing: an EC2 t3.small running for 30 days costs about $15/month, an RDS db.t3.micro costs about $15/month, and an ALB costs about $18/month. A forgotten sandbox costs roughly $50/month — which adds up across a team. Other regions can be 10–20% higher.
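To see how that adds up, the arithmetic can be written out. The sketch below uses only the approximate monthly figures quoted above. The team size and duration in the example call are hypothetical.

```python
# Approximate monthly figures quoted above (us-east-1, on-demand).
MONTHLY_USD = {"ec2_t3_small": 15, "rds_db_t3_micro": 15, "alb": 18}


def forgotten_sandbox_cost(sandboxes: int, months: int) -> int:
    """Total spend if `sandboxes` environments are left running for `months`."""
    return sum(MONTHLY_USD.values()) * sandboxes * months


per_sandbox = sum(MONTHLY_USD.values())
team_total = forgotten_sandbox_cost(sandboxes=5, months=3)  # hypothetical team
print(per_sandbox, team_total)
```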
Two practices prevent this:
- Set a calendar reminder when you deploy a sandbox. 48 hours is usually enough for manual testing. If you are not done, extend intentionally; do not let it run by default.
- Tag all sandbox resources with an expiry date and use AWS Config rules or a weekly Lambda to flag resources older than 7 days. Compute the expiry outside Terraform (e.g., pass it in from CI as -var="expires_on=...") because using timestamp() inside the config causes the value to change on every plan, leaving the resource in perpetual drift:
variable "expires_on" {
description = "Sandbox expiry date (YYYY-MM-DD), set by CI"
type = string
default = ""
}
locals {
sandbox_tags = var.environment == "sandbox" && var.expires_on != "" ? {
ExpiresOn = var.expires_on
} : {}
all_tags = merge(local.base_tags, local.sandbox_tags)
}
If you must compute the expiry inside Terraform, pair timestamp() with lifecycle { ignore_changes = [tags["ExpiresOn"]] } so subsequent plans don't show drift.
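The flagging side of this practice comes down to a date comparison on the ExpiresOn tag. The sketch below shows the core check a weekly job might run. It assumes the YYYY-MM-DD format from the variable description above, and the inventory structure and function names are hypothetical. Fetching the resources from AWS is deliberately left out.

```python
from datetime import date


def is_expired(tags: dict, today: date) -> bool:
    """True when the ExpiresOn tag (YYYY-MM-DD) is strictly before `today`."""
    raw = tags.get("ExpiresOn", "")
    if not raw:
        return False  # untagged resources are not treated as expired here
    return date.fromisoformat(raw) < today


def expired_resources(resources: list[dict], today: date) -> list[str]:
    """Return the ids of resources whose ExpiresOn date has passed."""
    return [r["id"] for r in resources if is_expired(r.get("tags", {}), today)]


# Hypothetical inventory; a real job would build this from the AWS APIs.
inventory = [
    {"id": "i-0abc1", "tags": {"ExpiresOn": "2026-04-20"}},
    {"id": "i-0abc2", "tags": {"ExpiresOn": "2026-05-01"}},
    {"id": "i-0abc3", "tags": {}},
]

flagged = expired_resources(inventory, today=date(2026, 4, 23))
print(flagged)
```

Untagged resources are skipped here on purpose. A stricter job could report them separately, since a sandbox resource without an expiry tag is itself worth flagging.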
The Manual Testing Checklist
Before marking a module change as ready for review:
Static checks
- terraform fmt -recursive: no formatting changes
- terraform validate: exits with "Success"
- terraform plan reviewed: no unexpected -/+ replacements
Deploy verification
- ALB DNS name returns HTTP 200 from /health
- All instances in target group show healthy
- Secrets Manager secret is accessible from the instance role
- CloudWatch log group has recent log entries
- ASG has the correct desired_capacity for the environment
Failure scenario tests (for significant changes)
- Instance termination → ASG replaces it within 3 minutes
- App process stop → ALB drains instance → ASG replaces it
- Health check type is ELB, not EC2
Cleanup
- terraform destroy completes cleanly in sandbox
- No orphaned resources in AWS console after destroy
- Sandbox workspace cleared
Manual testing is the verification layer between writing Terraform and trusting it. It does not replace Terratest — it informs it. Every failure scenario tested manually becomes a Terratest assertion that prevents the same failure from reaching production.
The verification checklist above is reusable across every module. As the stack grows — private networking, VPC, NAT Gateway — the same workflow applies: deploy to sandbox, verify from the outside in, test the failure scenarios, clean up.
The remaining gap from the Day 16 checklist is private networking. The default VPC has served its purpose for learning, but production infrastructure needs EC2 instances in private subnets, an ALB in public subnets, and a NAT Gateway for outbound-only internet access from the private tier. That is Day 18's module.
This post is part of a 30-day Terraform learning journey.