Migrate non-compliant c5.large instance to t3 platform standard #415
base: main
🚀 env0 composed a PR Plan for environment Terraform Example / production:
Plan Details
Terraform used the selected providers to generate the following execution plan. Resource actions are indicated with the following symbols:
+ create
! update in-place
Terraform will perform the following actions:
# module.api_server.aws_cloudwatch_metric_alarm.cpu_credits[0] will be created
+ resource "aws_cloudwatch_metric_alarm" "cpu_credits" {
+ actions_enabled = true
+ alarm_actions = [
+ "arn:aws:sns:eu-west-2:540044833068:api-51c748b4-alerts",
]
+ alarm_description = "CPU credit balance is low"
+ alarm_name = "api-51c748b4-cpu-credits-low"
+ arn = (known after apply)
+ comparison_operator = "LessThanThreshold"
+ dimensions = {
+ "InstanceId" = "i-057105c8b13bee63a"
}
+ evaluate_low_sample_count_percentiles = (known after apply)
+ evaluation_periods = 2
+ id = (known after apply)
+ metric_name = "CPUCreditBalance"
+ namespace = "AWS/EC2"
+ ok_actions = [
+ "arn:aws:sns:eu-west-2:540044833068:api-51c748b4-alerts",
]
+ period = 300
+ statistic = "Average"
+ tags = {
+ "CostCenter" = "engineering"
+ "Environment" = "production"
+ "ManagedBy" = "terraform"
+ "Name" = "api-51c748b4-credits-alarm"
+ "Project" = "api-platform"
+ "Workload" = "cpu-intensive"
}
+ tags_all = {
+ "CostCenter" = "engineering"
+ "Environment" = "production"
+ "ManagedBy" = "terraform"
+ "Name" = "api-51c748b4-credits-alarm"
+ "Project" = "api-platform"
+ "Workload" = "cpu-intensive"
}
+ threshold = 50
+ treat_missing_data = "missing"
}
# module.api_server.aws_instance.api_server[0] will be updated in-place
! resource "aws_instance" "api_server" {
id = "i-057105c8b13bee63a"
! instance_type = "c5.large" -> "t3.large"
! public_dns = "ec2-35-178-211-139.eu-west-2.compute.amazonaws.com" -> (known after apply)
! public_ip = "35.178.211.139" -> (known after apply)
tags = {
"CostCenter" = "engineering"
"Environment" = "production"
"ManagedBy" = "terraform"
"Name" = "api-51c748b4-api-server"
"Project" = "api-platform"
"Workload" = "cpu-intensive"
}
! user_data = "acf40314e678f506b36da3c78022132136664591" -> "53cc44b24699094d69344f1f1ffe1416cd20ba52"
# (29 unchanged attributes hidden)
+ credit_specification {
+ cpu_credits = "standard"
}
# (7 unchanged blocks hidden)
}
# module.heritage[0].aws_rds_cluster.face_database will be updated in-place
! resource "aws_rds_cluster" "face_database" {
id = "facial-recognition-terraform-example"
tags = {}
# (46 unchanged attributes hidden)
# (1 unchanged block hidden)
}
Plan: 1 to add, 2 to change, 0 to destroy.
Open in Overmind ↗
🔴 Change Signals
Routine 🔴

🔥 Risks

T3 standard CPU credit throttling will cause ALB /health timeouts and remove the only target
Current metrics show the instance running at about 70% CPU. On a T3 in standard mode, sustained high CPU will rapidly consume credits and then throttle CPU to its baseline. Once throttled, the /health endpoint will exceed the 5-second timeout or fail to return 200, so the ALB will mark i-057105c8b13bee63a unhealthy after three failed checks and remove it. With no other healthy targets, traffic to the ALB will error, causing a production outage. The new CPUCreditBalance alarm only alerts after depletion and does not mitigate the failure condition.

Burstable T3 in standard credit mode will throttle a CPU‑intensive single‑target API and cause ALB health check failures
As the single target becomes unhealthy, the ALB will have zero healthy targets and will stop serving traffic to the API. The newly added CPUCreditBalance alarm will only notify after depletion; it does not prevent throttling. ENA/network settings are compatible across c5 and t3, so the outage mechanism is CPU throttling leading to failed health checks and loss of availability.

Switching to t3.large with standard credits will throttle a 70% CPU workload, degrading API performance and health checks
Once throttled, the API’s throughput and latency will degrade, causing slow responses and potential timeouts behind the ALB. With the target group’s 5-second health check timeout and this instance registered as a target, health checks can fail and user traffic will be impacted until credits recover, repeating under load.

Migrating CPU‑intensive service to T3 standard will throttle CPU and cause ALB health checks to fail
Once credits are exhausted, throttling will increase response latency and the /health endpoint will fail the 5-second ALB health check repeatedly. With only this instance registered in the target group, the ALB will mark the target unhealthy and stop routing requests, resulting in an outage. Adding a CPU credit alarm does not prevent throttling and will only notify after the condition begins.

Switching to t3.large with standard credits will throttle a CPU‑intensive instance and cause ALB timeouts and failures
When throttling begins, application latency will spike and requests will time out. With the target group’s 5-second health check, the instance will drop out of the load balancer under load, causing user-facing errors and instability. The new CPU credit balance alarm will notify after the problem starts but will not prevent throttling.

Switch to t3.large (cpu_credits: standard) will throttle a CPU‑intensive single target and cause ALB health check failures, removing all capacity
The instance is tagged as cpu-intensive and is currently averaging about 70% CPU. On a T3 in standard mode, sustained load above baseline will quickly exhaust CPU credits and trigger throttling. Once throttled, the HTTP service will slow or time out, causing three consecutive health check failures and the ALB to mark the target unhealthy. With no alternate targets, the ALB stops routing to any backend, resulting in service downtime. The newly added CPU credit alarm only detects depletion and will not prevent this failure mode.

Switching c5.large to t3.large (standard credits) will throttle a CPU‑intensive single‑instance service, risking ALB health check failures
With only one registered target in api-51c748b4-tg and health checks set to /health with a 5s timeout and 30s interval, throttling will slow responses and the instance will begin failing ALB health checks. Once unhealthy, the target group will have zero healthy targets, resulting in user-visible errors and potential outage for the API.

Switching to t3.large with standard CPU credits will throttle under current load and cause ALB /health check failures on port 80
Once throttled, the application will frequently exceed the 5-second health-check timeout on /health, causing the ALB to mark this target unhealthy and stop routing to it. This reduces the target set and resilience of the fronted service, and if this instance is the only or primary target it will result in user-visible downtime until credits recover or the instance is restarted.

🟣 Expected Changes
~ ec2-instance › i-057105c8b13bee63a
--- current
+++ proposed
@@ -13,4 +13,6 @@
threads_per_core: 2
cpu_threads_per_core: 2
+ credit_specification:
+ - cpu_credits: standard
disable_api_stop: false
disable_api_termination: false
@@ -26,5 +28,5 @@
instance_initiated_shutdown_behavior: stop
instance_state: running
- instance_type: c5.large
+ instance_type: t3.large
ipv6_address_count: 0
maintenance_options:
@@ -45,6 +47,6 @@
hostname_type: ip-name
private_ip: 10.0.101.119
- public_dns: ec2-35-178-211-139.eu-west-2.compute.amazonaws.com
- public_ip: 35.178.211.139
+ public_dns: (known after apply)
+ public_ip: (known after apply)
root_block_device:
- delete_on_termination: true
@@ -90,5 +92,5 @@
terraform_name: module.api_server.aws_instance.api_server[0]
timeouts: null
- user_data: acf40314e678f506b36da3c78022132136664591
+ user_data: 53cc44b24699094d69344f1f1ffe1416cd20ba52
user_data_base64: null
user_data_replace_on_change: false
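The credit-exhaustion argument in the risks above can be checked with a little arithmetic. The following is a minimal sketch, not part of the Overmind analysis: it assumes the published t3.large figures (2 vCPUs, 36 CPU credits earned per hour, a maximum balance of 864 credits) and the standard definition that one CPU credit is one vCPU at 100% for one minute. The names `BurstableSpec`, `credits_spent_per_hour`, and `hours_until_depleted` are hypothetical, and the starting balance is an input because the review does not state it.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class BurstableSpec:
    """Credit parameters for a burstable instance type (assumed values)."""
    vcpus: int
    credits_earned_per_hour: float
    max_credit_balance: float


# Assumption: published figures for t3.large, not taken from the PR itself.
T3_LARGE = BurstableSpec(vcpus=2, credits_earned_per_hour=36.0,
                         max_credit_balance=864.0)


def credits_spent_per_hour(spec: BurstableSpec, cpu_utilization: float) -> float:
    """Credits consumed per hour at an average utilization in [0, 1].

    One credit = one vCPU at 100% for one minute, so an hour at full
    utilization costs 60 credits per vCPU.
    """
    return spec.vcpus * cpu_utilization * 60.0


def hours_until_depleted(spec: BurstableSpec, cpu_utilization: float,
                         starting_balance: float) -> Optional[float]:
    """Hours until the balance reaches zero, or None if it never drains."""
    net = spec.credits_earned_per_hour - credits_spent_per_hour(spec, cpu_utilization)
    if net >= 0:
        return None  # at or below baseline: credits accrue or hold steady
    balance = min(starting_balance, spec.max_credit_balance)
    return balance / -net


if __name__ == "__main__":
    spent = credits_spent_per_hour(T3_LARGE, 0.70)
    print(f"spent/hour at 70% CPU: {spent:.1f}")
    print(f"hours from a full balance: {hours_until_depleted(T3_LARGE, 0.70, 864.0):.1f}")
```

Under these assumptions, 70% utilization spends about 84 credits per hour against 36 earned, so even a full balance would drain in roughly 18 hours, and a freshly resized instance with a small balance would throttle much sooner.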
🟠 Unmapped Changes
+ cloudwatch-alarm › module.api_server.aws_cloudwatch_metric_alarm.cpu_credits[0]
--- current
+++ proposed
@@ -0,0 +1,44 @@
+type: cloudwatch-alarm
+id: github.com/overmindtech/terraform-example.cloudwatch-alarm.module.api_server.aws_cloudwatch_metric_alarm.cpu_credits[0]
+attributes:
+ actions_enabled: true
+ alarm_actions:
+ - arn:aws:sns:eu-west-2:540044833068:api-51c748b4-alerts
+ alarm_description: CPU credit balance is low
+ alarm_name: api-51c748b4-cpu-credits-low
+ arn: (known after apply)
+ comparison_operator: LessThanThreshold
+ datapoints_to_alarm: null
+ dimensions:
+ InstanceId: i-057105c8b13bee63a
+ evaluate_low_sample_count_percentiles: (known after apply)
+ evaluation_periods: 2
+ extended_statistic: null
+ id: (known after apply)
+ insufficient_data_actions: null
+ metric_name: CPUCreditBalance
+ namespace: AWS/EC2
+ ok_actions:
+ - arn:aws:sns:eu-west-2:540044833068:api-51c748b4-alerts
+ period: 300
+ statistic: Average
+ tags:
+ CostCenter: engineering
+ Environment: production
+ ManagedBy: terraform
+ Name: api-51c748b4-credits-alarm
+ Project: api-platform
+ Workload: cpu-intensive
+ tags_all:
+ CostCenter: engineering
+ Environment: production
+ ManagedBy: terraform
+ Name: api-51c748b4-credits-alarm
+ Project: api-platform
+ Workload: cpu-intensive
+ terraform_address: module.api_server.aws_cloudwatch_metric_alarm.cpu_credits[0]
+ terraform_name: module.api_server.aws_cloudwatch_metric_alarm.cpu_credits[0]
+ threshold: 50
+ threshold_metric_id: null
+ treat_missing_data: missing
+ unit: null
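To make the alarm's trigger condition concrete, here is a simplified sketch of how the settings above (LessThanThreshold, threshold 50, two evaluation periods of 300 seconds, Average statistic) combine. It is an illustration only, not CloudWatch's actual evaluator: it ignores missing-data handling (`treat_missing_data`) and assumes that, with `datapoints_to_alarm` unset, every one of the last `evaluation_periods` datapoints must breach. The function names are hypothetical.

```python
from typing import Sequence

# Values copied from the planned alarm resource.
THRESHOLD = 50.0
EVALUATION_PERIODS = 2
PERIOD_SECONDS = 300


def alarm_state(period_averages: Sequence[float],
                threshold: float = THRESHOLD,
                evaluation_periods: int = EVALUATION_PERIODS) -> str:
    """Return a simplified alarm state for per-period CPUCreditBalance averages.

    ALARM only when each of the most recent `evaluation_periods` averages is
    strictly below the threshold (LessThanThreshold).
    """
    recent = list(period_averages)[-evaluation_periods:]
    if len(recent) < evaluation_periods:
        return "INSUFFICIENT_DATA"
    if all(value < threshold for value in recent):
        return "ALARM"
    return "OK"


def minimum_breach_seconds(period_seconds: int = PERIOD_SECONDS,
                           evaluation_periods: int = EVALUATION_PERIODS) -> int:
    """Shortest span of breaching data needed before the alarm can fire."""
    return period_seconds * evaluation_periods


if __name__ == "__main__":
    print(alarm_state([120.0, 80.0, 49.0, 40.0]))
    print(minimum_breach_seconds())
```

In this simplified model the balance has to stay below 50 credits for two consecutive 5-minute periods before anyone is notified, which is consistent with the review's point that the alarm detects a low balance rather than preventing throttling.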
⛔ Auto-Blocked
🔴 Decision
Found 1 high risk requiring review
📊 Signals Summary
Routine 🔴 -5
🔥 Risks Summary
High 1 · Medium 1 · Low 0
💥 Blast Radius
Items 213 · Edges 525
⛔ Auto-Blocked
🔴 Decision
Found 6 high risks requiring review
📊 Signals Summary
Routine 🔴 -5
Policies 🔴 -3
🔥 Risks Summary
High 6 · Medium 2 · Low 0
💥 Blast Radius
Items 23 · Edges 58
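The high risks counted in these reviews all end at the same step: the single target failing its ALB health check. A rough sketch of that timing follows, using the values quoted in the risks (30-second interval, 5-second timeout, three consecutive failures). It assumes a new check starts every interval and that a timed-out check is recorded as failed when its timeout expires; real ALB scheduling may differ, and both function names are hypothetical.

```python
from typing import Optional, Sequence


def seconds_until_unhealthy(interval_s: float = 30.0,
                            timeout_s: float = 5.0,
                            unhealthy_threshold: int = 3) -> float:
    """Time from the start of the first failing check until the target is
    marked unhealthy, if every check from then on times out."""
    if unhealthy_threshold < 1:
        raise ValueError("unhealthy_threshold must be at least 1")
    return (unhealthy_threshold - 1) * interval_s + timeout_s


def first_unhealthy_index(response_times_s: Sequence[float],
                          timeout_s: float = 5.0,
                          unhealthy_threshold: int = 3) -> Optional[int]:
    """Index of the check at which the consecutive-failure count reaches the
    threshold, or None if it never does. A check fails when its response
    time exceeds the timeout; any success resets the count."""
    consecutive_failures = 0
    for index, elapsed in enumerate(response_times_s):
        if elapsed > timeout_s:
            consecutive_failures += 1
            if consecutive_failures >= unhealthy_threshold:
                return index
        else:
            consecutive_failures = 0
    return None


if __name__ == "__main__":
    print(seconds_until_unhealthy())
    print(first_unhealthy_index([0.2, 6.0, 7.0, 5.5]))
```

With the quoted settings, a throttled instance that times out on every check would be removed after about 65 seconds under these assumptions, and with a single registered target that leaves the ALB with nothing to route to.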


Migrate non-compliant c5.large instance to t3 platform standard
Instance Details:
• Instance ID: i-0a1b2c3d4e5f67890
• Name: api-prod-server
• Current Type: c5.large (non-compliant)
• Target Type: t3.large (compliant)
• Environment: Production
• Region: eu-west-2
Justification:
Per platform standard PS-2024-003, all EC2 instances should use the t3 instance family to leverage pre-purchased Savings Plans. This instance was flagged during the monthly compliance audit.
✓ This is a Standard Change (SC-0042: Instance Family Migration) and does not require CAB approval. Instance specifications (2 vCPU, 4GB RAM) remain unchanged.