
@jameslaneovermind
Contributor

Migrate non-compliant c5.large instance to t3 platform standard

Open
Low Priority
Standard Change
Description
Automated ticket created by Platform Compliance monitoring.

Instance Details:

• Instance ID: i-0a1b2c3d4e5f67890

• Name: api-prod-server

• Current Type: c5.large (non-compliant)

• Target Type: t3.large (compliant)

• Environment: Production

• Region: eu-west-2

Justification:

Per platform standard PS-2024-003, all EC2 instances should use t3 instance family to leverage pre-purchased Savings Plans. This instance was flagged during monthly compliance audit.

✓ This is a Standard Change (SC-0042: Instance Family Migration) and does not require CAB approval. Instance specifications (2 vCPU, 4GB RAM) remain unchanged.

@env0

env0 bot commented Dec 16, 2025

🚀 env0 has composed a PR Plan for environment Terraform Example / production:

Plan: 1 to add, 2 to change, 0 to destroy.
Plan Details
Terraform used the selected providers to generate the following execution
plan. Resource actions are indicated with the following symbols:
+   create
!   update in-place

Terraform will perform the following actions:


  # module.api_server.aws_cloudwatch_metric_alarm.cpu_credits[0] will be created
+   resource "aws_cloudwatch_metric_alarm" "cpu_credits" {
+       actions_enabled                       = true
+       alarm_actions                         = [
+           "arn:aws:sns:eu-west-2:540044833068:api-51c748b4-alerts",
        ]
+       alarm_description                     = "CPU credit balance is low"
+       alarm_name                            = "api-51c748b4-cpu-credits-low"
+       arn                                   = (known after apply)
+       comparison_operator                   = "LessThanThreshold"
+       dimensions                            = {
+           "InstanceId" = "i-057105c8b13bee63a"
        }
+       evaluate_low_sample_count_percentiles = (known after apply)
+       evaluation_periods                    = 2
+       id                                    = (known after apply)
+       metric_name                           = "CPUCreditBalance"
+       namespace                             = "AWS/EC2"
+       ok_actions                            = [
+           "arn:aws:sns:eu-west-2:540044833068:api-51c748b4-alerts",
        ]
+       period                                = 300
+       statistic                             = "Average"
+       tags                                  = {
+           "CostCenter"  = "engineering"
+           "Environment" = "production"
+           "ManagedBy"   = "terraform"
+           "Name"        = "api-51c748b4-credits-alarm"
+           "Project"     = "api-platform"
+           "Workload"    = "cpu-intensive"
        }
+       tags_all                              = {
+           "CostCenter"  = "engineering"
+           "Environment" = "production"
+           "ManagedBy"   = "terraform"
+           "Name"        = "api-51c748b4-credits-alarm"
+           "Project"     = "api-platform"
+           "Workload"    = "cpu-intensive"
        }
+       threshold                             = 50
+       treat_missing_data                    = "missing"
    }

  # module.api_server.aws_instance.api_server[0] will be updated in-place
!   resource "aws_instance" "api_server" {
        id                                   = "i-057105c8b13bee63a"
!       instance_type                        = "c5.large" -> "t3.large"
!       public_dns                           = "ec2-35-178-211-139.eu-west-2.compute.amazonaws.com" -> (known after apply)
!       public_ip                            = "35.178.211.139" -> (known after apply)
        tags                                 = {
            "CostCenter"  = "engineering"
            "Environment" = "production"
            "ManagedBy"   = "terraform"
            "Name"        = "api-51c748b4-api-server"
            "Project"     = "api-platform"
            "Workload"    = "cpu-intensive"
        }
!       user_data                            = "acf40314e678f506b36da3c78022132136664591" -> "53cc44b24699094d69344f1f1ffe1416cd20ba52"
        # (29 unchanged attributes hidden)

+       credit_specification {
+           cpu_credits = "standard"
        }

        # (7 unchanged blocks hidden)
    }

  # module.heritage[0].aws_rds_cluster.face_database will be updated in-place
!   resource "aws_rds_cluster" "face_database" {
        id                                    = "facial-recognition-terraform-example"
        tags                                  = {}
        # (46 unchanged attributes hidden)

        # (1 unchanged block hidden)
    }

Plan: 1 to add, 2 to change, 0 to destroy.
Failed to calculate cost estimation

Full PR Plan logs on env0

@github-actions

github-actions bot commented Dec 16, 2025

Overmind

Open in Overmind ↗



🔴 Change Signals

Routine 🔴 ▇▅▃▂▁ AWS CloudWatch metric alarms for the API server showing first ever modifications across multiple attributes, which is unusual compared to typical patterns.
Policies 🔴 ▃▂▁ Multiple S3 buckets lack server-side encryption and required tags, while several security groups allow SSH access from anywhere, which is a security risk and may need review.

View signals ↗


🔥 Risks

T3 standard CPU credit throttling will cause ALB /health timeouts and remove the only target ‼️High Open Risk ↗
This change moves the production API instance i-057105c8b13bee63a from c5.large to t3.large with cpu_credits set to standard while it remains the sole target in elbv2 target group api-51c748b4-tg behind the internet-facing ALB api-51c748b4-alb. The target group probes HTTP /health on port 80 with a 5-second timeout and requires HTTP 200 to stay healthy.

Current metrics show the instance running at about 70% CPU. On a T3 in standard mode, sustained high CPU will rapidly consume credits and then throttle CPU to its baseline. Once throttled, the /health endpoint will exceed the 5-second timeout or fail to return 200, so the ALB will mark i-057105c8b13bee63a unhealthy after three failed checks and remove it. With no other healthy targets, traffic to the ALB will error, causing a production outage. The new CPUCreditBalance alarm only alerts after depletion and does not mitigate the failure condition.
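The credit arithmetic behind this risk can be sketched in a few lines. This is an illustration only, not part of the Overmind analysis: the t3.large figures (2 vCPUs, 36 credits earned per hour, 864 maximum balance) are assumed from AWS's published burstable-instance table, the 70% utilization comes from the risk text above, and the starting balance is unknown, so the full balance is used as a best case. `BurstableSpec` and the function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class BurstableSpec:
    """Credit parameters for one burstable instance size (assumed values)."""
    vcpus: int
    credits_earned_per_hour: float
    max_credit_balance: float


# Assumption: figures for t3.large as listed in the AWS documentation.
T3_LARGE = BurstableSpec(vcpus=2, credits_earned_per_hour=36.0, max_credit_balance=864.0)


def baseline_utilization(spec: BurstableSpec) -> float:
    """Per-vCPU utilization at which credits earned equal credits spent.

    One CPU credit is one vCPU running at 100% for one minute.
    """
    return spec.credits_earned_per_hour / (spec.vcpus * 60.0)


def net_credits_per_hour(spec: BurstableSpec, avg_cpu_utilization: float) -> float:
    """Credits gained (positive) or lost (negative) per hour at a steady load."""
    spent = avg_cpu_utilization * spec.vcpus * 60.0
    return spec.credits_earned_per_hour - spent


def hours_until_depleted(
    spec: BurstableSpec, avg_cpu_utilization: float, starting_balance: float
) -> Optional[float]:
    """Hours until the balance reaches zero, or None if it never drains."""
    net = net_credits_per_hour(spec, avg_cpu_utilization)
    if net >= 0:
        return None
    return starting_balance / -net


if __name__ == "__main__":
    print(baseline_utilization(T3_LARGE))
    print(net_credits_per_hour(T3_LARGE, 0.70))
    print(hours_until_depleted(T3_LARGE, 0.70, T3_LARGE.max_credit_balance))
```

Under those assumed figures the baseline is about 30% per vCPU, so a steady 70% load drains roughly 48 credits per hour, and even a full balance would last about 18 hours. A freshly resized instance may start with far fewer credits, which would shorten that window considerably.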

Burstable T3 in standard credit mode will throttle a CPU‑intensive single‑target API and cause ALB health check failures ‼️High Open Risk ↗
Switching the API server from c5.large to t3.large with cpu_credits set to standard will place a sustained CPU‑intensive workload onto a burstable instance that throttles when credits run out. The instance currently averages ~70% CPU and is the only target in elbv2 target group api-51c748b4-tg. When throttling kicks in, request latency will rise and the ALB’s /health probe (5s timeout) will begin to fail.

As the single target becomes unhealthy, the ALB will have zero healthy targets and will stop serving traffic to the API. The newly added CPUCreditBalance alarm will only notify after depletion; it does not prevent throttling. ENA/network settings are compatible across c5 and t3, so the outage mechanism is CPU throttling leading to failed health checks and loss of availability.

Switching to t3.large with standard credits will throttle a 70% CPU workload, degrading API performance and health checks ‼️High Open Risk ↗
The change switches the api-51c748b4 API server from c5.large to a burstable t3.large with credit_specification set to standard. The instance currently sustains about 70% CPU utilization, and the workload is tagged cpu-intensive. After the migration, sustained CPU at this level will rapidly exhaust the CPUCreditBalance and the instance will be throttled to the T3 baseline.

Once throttled, the API’s throughput and latency will degrade, causing slow responses and potential timeouts behind the ALB. With the target group’s 5-second health check timeout and this instance registered as a target, health checks can fail and user traffic will be impacted until credits recover, repeating under load.

Migrating CPU‑intensive service to T3 standard will throttle CPU and cause ALB health checks to fail ‼️High Open Risk ↗
The api-51c748b4 API server behind api-51c748b4-alb is being migrated from c5.large to t3.large with credit_specification set to standard. This workload is currently running at sustained high CPU and is tagged cpu-intensive, while the target group api-51c748b4-tg probes HTTP /health on port 80 with a 5-second timeout. On a T3 in standard mode, sustained CPU above the baseline will deplete credits and throttle the instance.

Once credits are exhausted, throttling will increase response latency and the /health endpoint will fail the 5-second ALB health check repeatedly. With only this instance registered in the target group, the ALB will mark the target unhealthy and stop routing requests, resulting in an outage. Adding a CPU credit alarm does not prevent throttling and will only notify after the condition begins.

Switching to t3.large with standard credits will throttle a CPU‑intensive instance and cause ALB timeouts and failures ‼️High Open Risk ↗
This change switches 540044833068.eu-west-2.ec2-instance.i-057105c8b13bee63a from c5.large to t3.large and explicitly enables standard CPU credits. The instance currently sustains about 70% CPU utilization and is tagged as cpu-intensive. Under standard credit mode, the instance will be throttled to its baseline once credits are consumed, which is below the instance’s current sustained demand.

When throttling begins, application latency will spike and requests will time out. With the target group’s 5-second health check, the instance will drop out of the load balancer under load, causing user-facing errors and instability. The new CPU credit balance alarm will notify after the problem starts but will not prevent throttling.

Switch to t3.large (cpu_credits: standard) will throttle a CPU‑intensive single target and cause ALB health check failures, removing all capacity ‼️High Open Risk ↗
This change moves 540044833068.eu-west-2.ec2-instance.i-057105c8b13bee63a from c5.large to a burstable t3.large with cpu_credits set to standard while it remains the only target in 540044833068.eu-west-2.elbv2-target-group.api-51c748b4-tg. The target group performs HTTP health checks on /health over port 80 with a 5-second timeout and 30-second interval, expecting 200 responses.

The instance is tagged as cpu-intensive and is currently averaging about 70% CPU. On a T3 in standard mode, sustained load above baseline will quickly exhaust CPU credits and trigger throttling. Once throttled, the HTTP service will slow or time out, causing three consecutive health check failures and the ALB to mark the target unhealthy. With no alternate targets, the ALB stops routing to any backend, resulting in service downtime. The newly added CPU credit alarm only detects depletion and will not prevent this failure mode.
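To make the health-check timeline concrete, here is a minimal sketch of the failure counting described above. It uses the values quoted in the risk (5-second timeout, 30-second interval, three consecutive failures) and a simplified timing model in which probes start exactly one interval apart; the real ALB scheduling may differ. The function names and the sample latencies are hypothetical.

```python
from typing import Optional, Sequence


def first_unhealthy_probe(
    latencies_s: Sequence[Optional[float]],
    timeout_s: float = 5.0,
    unhealthy_threshold: int = 3,
) -> Optional[int]:
    """Index of the probe at which the target would be marked unhealthy.

    A probe fails if it gets no response (None) or responds slower than
    the timeout. Any successful probe resets the consecutive-failure count.
    """
    consecutive_failures = 0
    for index, latency in enumerate(latencies_s):
        if latency is None or latency > timeout_s:
            consecutive_failures += 1
        else:
            consecutive_failures = 0
        if consecutive_failures >= unhealthy_threshold:
            return index
    return None


def seconds_until_unhealthy(
    latencies_s: Sequence[Optional[float]],
    interval_s: float = 30.0,
    timeout_s: float = 5.0,
    unhealthy_threshold: int = 3,
) -> Optional[float]:
    """Elapsed time from the first probe until the deciding probe times out.

    Simplified model: probe i starts at i * interval_s and a failure is
    known timeout_s later.
    """
    index = first_unhealthy_probe(latencies_s, timeout_s, unhealthy_threshold)
    if index is None:
        return None
    return index * interval_s + timeout_s


if __name__ == "__main__":
    # Hypothetical /health latencies in seconds as throttling sets in.
    throttled = [0.2, 6.0, 0.3, 7.0, 8.0, None]
    print(first_unhealthy_probe(throttled))
    print(seconds_until_unhealthy(throttled))
```

In this model a target that fails every probe is removed a little over a minute after the first failure, which is well inside the window in which a throttled instance stays throttled.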

Switching c5.large to t3.large (standard credits) will throttle a CPU‑intensive single‑instance service, risking ALB health check failures ❗Medium Open Risk ↗
This change moves the production API server from a c5.large to a t3.large with standard CPU credits while the instance consistently runs at ~70% CPU. On t3.large (standard), sustained load above the baseline will deplete CPU credits and the instance will be throttled to its lower baseline, reducing available compute for the application. The change also adds a CPUCreditBalance alarm, which only alerts after credits are already low and does not prevent throttling.

With only one registered target in api-51c748b4-tg and health checks set to /health with a 5s timeout and 30s interval, throttling will slow responses and the instance will begin failing ALB health checks. Once unhealthy, the target group will have zero healthy targets, resulting in user-visible errors and potential outage for the API.

Switching to t3.large with standard CPU credits will throttle under current load and cause ALB /health check failures on port 80 ❗Medium Open Risk ↗
The EC2 instance 540044833068.eu-west-2.ec2-instance.i-057105c8b13bee63a is being switched from c5.large to t3.large with credit_specification set to standard while remaining behind ALB target group 540044833068.eu-west-2.elbv2-target-group.api-51c748b4-tg, which probes /health on port 80 with a 5-second timeout. Current CloudWatch metrics show ~70% average CPU on the instance; on a t3.large in standard credit mode this steady load will exhaust CPU credits and throttle CPU below current demand, driving up request latency.

Once throttled, the application will frequently exceed the 5-second health-check timeout on /health, causing the ALB to mark this target unhealthy and stop routing to it. This reduces the target set and resilience of the fronted service, and if this instance is the only or primary target it will result in user-visible downtime until credits recover or the instance is restarted.


🟣 Expected Changes

~ ec2-instance › i-057105c8b13bee63a
--- current
+++ proposed
@@ -13,4 +13,6 @@
       threads_per_core: 2
   cpu_threads_per_core: 2
+  credit_specification:
+    - cpu_credits: standard
   disable_api_stop: false
   disable_api_termination: false
@@ -26,5 +28,5 @@
   instance_initiated_shutdown_behavior: stop
   instance_state: running
-  instance_type: c5.large
+  instance_type: t3.large
   ipv6_address_count: 0
   maintenance_options:
@@ -45,6 +47,6 @@
       hostname_type: ip-name
   private_ip: 10.0.101.119
-  public_dns: ec2-35-178-211-139.eu-west-2.compute.amazonaws.com
-  public_ip: 35.178.211.139
+  public_dns: (known after apply)
+  public_ip: (known after apply)
   root_block_device:
     - delete_on_termination: true
@@ -90,5 +92,5 @@
   terraform_name: module.api_server.aws_instance.api_server[0]
   timeouts: null
-  user_data: acf40314e678f506b36da3c78022132136664591
+  user_data: 53cc44b24699094d69344f1f1ffe1416cd20ba52
   user_data_base64: null
   user_data_replace_on_change: false

🟠 Unmapped Changes

+ cloudwatch-alarm › module.api_server.aws_cloudwatch_metric_alarm.cpu_credits[0]
--- current
+++ proposed
@@ -0,0 +1,44 @@
+type: cloudwatch-alarm
+id: github.com/overmindtech/terraform-example.cloudwatch-alarm.module.api_server.aws_cloudwatch_metric_alarm.cpu_credits[0]
+attributes:
+  actions_enabled: true
+  alarm_actions:
+    - arn:aws:sns:eu-west-2:540044833068:api-51c748b4-alerts
+  alarm_description: CPU credit balance is low
+  alarm_name: api-51c748b4-cpu-credits-low
+  arn: (known after apply)
+  comparison_operator: LessThanThreshold
+  datapoints_to_alarm: null
+  dimensions:
+    InstanceId: i-057105c8b13bee63a
+  evaluate_low_sample_count_percentiles: (known after apply)
+  evaluation_periods: 2
+  extended_statistic: null
+  id: (known after apply)
+  insufficient_data_actions: null
+  metric_name: CPUCreditBalance
+  namespace: AWS/EC2
+  ok_actions:
+    - arn:aws:sns:eu-west-2:540044833068:api-51c748b4-alerts
+  period: 300
+  statistic: Average
+  tags:
+    CostCenter: engineering
+    Environment: production
+    ManagedBy: terraform
+    Name: api-51c748b4-credits-alarm
+    Project: api-platform
+    Workload: cpu-intensive
+  tags_all:
+    CostCenter: engineering
+    Environment: production
+    ManagedBy: terraform
+    Name: api-51c748b4-credits-alarm
+    Project: api-platform
+    Workload: cpu-intensive
+  terraform_address: module.api_server.aws_cloudwatch_metric_alarm.cpu_credits[0]
+  terraform_name: module.api_server.aws_cloudwatch_metric_alarm.cpu_credits[0]
+  threshold: 50
+  threshold_metric_id: null
+  treat_missing_data: missing
+  unit: null
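
The alarm definition above (threshold 50, LessThanThreshold, 2 evaluation periods of 300 seconds, Average statistic) can be read as the following rule. This is a simplified sketch, not CloudWatch's actual evaluator: it assumes every period has a datapoint, so missing-data handling is ignored, and it assumes the alarm needs all evaluation periods to breach because datapoints_to_alarm is null. The function names and sample balances are hypothetical.

```python
from typing import Optional, Sequence


def first_alarm_period(
    period_averages: Sequence[float],
    threshold: float = 50.0,
    evaluation_periods: int = 2,
) -> Optional[int]:
    """Index of the period at which the alarm would enter the ALARM state.

    A period breaches when its average is strictly below the threshold
    (LessThanThreshold). The alarm fires once `evaluation_periods`
    consecutive periods have breached.
    """
    consecutive_breaches = 0
    for index, average in enumerate(period_averages):
        if average < threshold:
            consecutive_breaches += 1
        else:
            consecutive_breaches = 0
        if consecutive_breaches >= evaluation_periods:
            return index
    return None


def seconds_until_alarm(
    period_averages: Sequence[float],
    period_s: int = 300,
    threshold: float = 50.0,
    evaluation_periods: int = 2,
) -> Optional[int]:
    """Seconds from the start of the series to the end of the deciding period."""
    index = first_alarm_period(period_averages, threshold, evaluation_periods)
    if index is None:
        return None
    return (index + 1) * period_s


if __name__ == "__main__":
    # Hypothetical 5-minute CPUCreditBalance averages for a draining instance.
    balances = [120.0, 80.0, 49.0, 60.0, 45.0, 30.0]
    print(first_alarm_period(balances))
    print(seconds_until_alarm(balances))
```

The sketch illustrates the point made in the risks: the alarm reacts only once the balance has already been below 50 for two full periods, so it reports depletion rather than preventing it.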

💥 Blast Radius

Items 23

Edges 58


@github-actions github-actions bot left a comment


Overmind

⛔ Auto-Blocked


🔴 Decision

Found 1 high risk requiring review


📊 Signals Summary

Routine 🔴 -5


🔥 Risks Summary

High 1 · Medium 1 · Low 0


💥 Blast Radius

Items 213 · Edges 525


View full analysis in Overmind ↗


@github-actions github-actions bot left a comment


Overmind

⛔ Auto-Blocked


🔴 Decision

Found 6 high risks requiring review


📊 Signals Summary

Routine 🔴 -5

Policies 🔴 -3


🔥 Risks Summary

High 6 · Medium 2 · Low 0


💥 Blast Radius

Items 23 · Edges 58


View full analysis in Overmind ↗
