EC2 Troubleshooting
Summary
- Most EC2 issues fall into a handful of categories: launch failures, connectivity, performance, storage/network, cost, and status check failures.
- Effective troubleshooting starts with a standard checklist (state, network/Security Group/route, logs, CloudWatch metrics) before making configuration changes.
- Strong monitoring and logging (CloudWatch, VPC Flow Logs, Systems Manager) dramatically reduce mean time to resolution.
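The Security Group step of the checklist above can be sketched offline. This is a minimal illustration, not a complete reachability check: `sg_allows` is a hypothetical helper, and it assumes the rules are dictionaries shaped like the `IpPermissions` entries that boto3's `describe_security_groups` returns. It only inspects IPv4 `IpRanges` and ignores IPv6 ranges, prefix lists, and rules that reference other Security Groups.

```python
import ipaddress


def sg_allows(ip_permissions, source_ip, port, protocol="tcp"):
    """Return True if any ingress rule admits source_ip on port/protocol.

    Hypothetical helper: ip_permissions is assumed to look like the
    IpPermissions list of a Security Group (IPv4 CIDR rules only).
    """
    src = ipaddress.ip_address(source_ip)
    for rule in ip_permissions:
        proto = rule.get("IpProtocol")
        if proto not in (protocol, "-1"):  # "-1" means all protocols
            continue
        if proto != "-1":
            low, high = rule.get("FromPort"), rule.get("ToPort")
            if low is None or high is None or not (low <= port <= high):
                continue
        for cidr in rule.get("IpRanges", []):
            # strict=False tolerates CIDRs written with host bits set
            if src in ipaddress.ip_network(cidr["CidrIp"], strict=False):
                return True
    return False


# Example rule set (illustrative data, documentation address ranges)
rules = [
    {
        "IpProtocol": "tcp",
        "FromPort": 22,
        "ToPort": 22,
        "IpRanges": [{"CidrIp": "203.0.113.0/24"}],
    }
]

print(sg_allows(rules, "203.0.113.10", 22))  # True: inside the allowed CIDR
print(sg_allows(rules, "198.51.100.7", 22))  # False: source not allowed
```

A `False` result here points at the Security Group. A `True` result means the next items to check are NACLs, route tables, the public IP, and the key/user.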
Troubleshooting flow
```mermaid
flowchart TD
A[EC2 issue] --> B{Category?}
B --> C[Launch failure]
B --> D[Cannot connect]
B --> E[Performance problem]
B --> F[Storage/Network issue]
B --> G[Cost anomaly]
C --> C1["Check limits, AZ capacity, instance type"]
D --> D1["Check SG/NACL/Route, IP, key/user"]
E --> E1["Check CPU/Mem/IO; right-size/ASG"]
F --> F1["Check EBS type, IOPS, network metrics"]
G --> G1["Use Cost Explorer; find idle resources & data transfer"]
```
Best Practices
- Always verify the basics first: instance state, correct Security Group rules, route tables, NACLs, DNS/IP, and correct SSH/RDP user and key.
- Use CloudWatch metrics and logs to pinpoint root causes (high CPU, I/O bottlenecks, network throttling, status check failures) instead of guessing. Memory and disk-usage metrics are not published by default; they require the CloudWatch agent.
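- For status check failures:
  - System status check failed → problem with the underlying AWS host → stop and start the instance to move it to a new host (EBS-backed instances only; a reboot keeps it on the same host), and check AWS Health.
  - Instance status check failed → OS, application, or network configuration problem inside the instance → inspect system logs and configuration.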
- For performance issues, prioritize right‑sizing, code/DB optimization, appropriate EBS volume and instance type selection, and enhanced networking/placement groups where applicable.
- For cost anomalies, look for always‑on instances, unattached EBS volumes, idle Elastic IPs, old snapshots, and unexpected data transfer patterns.
- Document root causes and fixes, then improve monitoring/alerting or configuration to prevent recurrence.
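The status check guidance above can be expressed as a small triage function. This is a sketch under stated assumptions: `triage_status` is a hypothetical helper, and its input is assumed to be one entry of the `InstanceStatuses` list returned by boto3's `describe_instance_status`, which carries `SystemStatus` and `InstanceStatus` objects with a `Status` field.

```python
def triage_status(entry):
    """Map one describe_instance_status entry to a suggested next step.

    Hypothetical helper; the returned strings are illustrative labels,
    not AWS-defined values.
    """
    system = entry.get("SystemStatus", {}).get("Status")
    instance = entry.get("InstanceStatus", {}).get("Status")

    if system == "impaired":
        # Host-level problem: moving hosts requires stop/start, not reboot
        return "host-issue: stop/start (EBS-backed) and check AWS Health"
    if instance == "impaired":
        # Guest-level problem: OS, application, or network configuration
        return "guest-issue: inspect OS logs and configuration"
    if system == "ok" and instance == "ok":
        return "healthy"
    # e.g. "initializing" or "insufficient-data": checks have not settled yet
    return "pending: wait for checks to complete"


# Illustrative entry with a failed system status check
sample = {
    "InstanceId": "i-0123456789abcdef0",  # made-up ID for the example
    "SystemStatus": {"Status": "impaired"},
    "InstanceStatus": {"Status": "ok"},
}
print(triage_status(sample))
```

The system check is evaluated first because a host problem can also make the instance check fail, so fixing the guest before moving hosts is usually wasted effort.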
Exam Notes
- Many questions describe “cannot SSH/RDP” or “no internet”; the expected approach is to walk through the basic connectivity checklist (state, SG, NACL, route, IP, key/user).
- Know the difference between System status checks (host issues, fixed by stop/start) and Instance status checks (OS/app/network issues, fixed by configuration changes).
- Cost‑related troubleshooting often involves identifying idle resources (running instances, unattached EBS, unused Elastic IPs, old snapshots) and high data transfer patterns.
- Key tools to remember: CloudWatch (metrics/logs), VPC Flow Logs, Systems Manager, EC2 Instance Connect, AWS Health Dashboard, Cost Explorer, AWS Budgets.
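The idle-resource checks mentioned in these notes can be sketched as a filter over resource listings. This is a minimal example, assuming `volumes` and `addresses` are lists shaped like the `Volumes` and `Addresses` entries returned by boto3's `describe_volumes` and `describe_addresses`; `find_idle` and its output keys are hypothetical names chosen for illustration.

```python
def find_idle(volumes, addresses):
    """Return IDs of unattached EBS volumes and unassociated Elastic IPs.

    Hypothetical helper operating on already-fetched listings.
    """
    # A volume in the "available" state is not attached to any instance
    unattached = [v["VolumeId"] for v in volumes if v.get("State") == "available"]

    # An Elastic IP with no association and no instance is sitting idle
    idle_eips = [
        a.get("AllocationId", a.get("PublicIp"))
        for a in addresses
        if not a.get("AssociationId") and not a.get("InstanceId")
    ]
    return {"unattached_volumes": unattached, "unassociated_eips": idle_eips}


# Illustrative data with made-up IDs
volumes = [
    {"VolumeId": "vol-aaa", "State": "in-use"},
    {"VolumeId": "vol-bbb", "State": "available"},
]
addresses = [
    {"AllocationId": "eipalloc-111", "AssociationId": "eipassoc-999"},
    {"AllocationId": "eipalloc-222"},
]

report = find_idle(volumes, addresses)
print(report)
```

Old snapshots and unexpected data transfer are not covered by this sketch; those are better investigated through Cost Explorer.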