Infrastructure · 2026-02-23 · Last verified: February 2026

15-Point Cloud Resilience Checklist for Mid-Market CTOs

J&L Dev Team
Senior Engineering Team

Most cloud outages are not caused by exotic failures. They are caused by basics that were 'temporary workarounds' three years ago. This 15-point checklist helps mid-market CTOs identify the gaps that silently accumulate in AWS, Azure, and GCP environments.

We audit cloud environments for mid-market companies across the Baltics and Nordics. The pattern is remarkably consistent: the infrastructure works fine day-to-day, but when something breaks, the recovery is slower, harder, and more expensive than anyone expected.

The root cause is almost never a sophisticated technical failure. It is the accumulation of small decisions — a 'temporary' workaround that became permanent, a backup that was never tested, a security group that was opened for debugging and never closed.

This checklist covers the 15 areas we evaluate in every cloud resilience assessment. Score yourself honestly.

Architecture and Availability

1. Multi-Region or Multi-AZ Redundancy

Does your production workload run in at least two availability zones? If your entire business depends on a single region never having a bad day, you do not have a resilient architecture — you have a bet.

What to check: Review your load balancer configuration. If all targets are in one AZ, a single data center issue takes down your entire application.
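The AZ check reduces to a set operation over your target list. A minimal sketch (the target records here are illustrative; in practice you would export them from your load balancer's API, e.g. a describe-target-health call):

```python
# Hypothetical target-group export -- in a real audit, pull this from
# your load balancer's API rather than hard-coding it.
targets = [
    {"id": "i-0a1", "az": "eu-north-1a"},
    {"id": "i-0b2", "az": "eu-north-1a"},
    {"id": "i-0c3", "az": "eu-north-1a"},
]

# Count the distinct availability zones actually serving traffic.
zones = {t["az"] for t in targets}
if len(zones) < 2:
    print(f"WARNING: all {len(targets)} targets sit in a single AZ: {zones}")
```

If the warning fires for any production target group, that group is a single-data-center bet.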

2. Defined Recovery Objectives

Do you have documented RTO (Recovery Time Objective) and RPO (Recovery Point Objective) for every critical system? More importantly — have you validated them?

What to check: Ask your team: "If our primary database fails right now, how long until we are back online?" If the answer is a guess, this is your highest-priority gap.

3. Automated Failover Testing

When did you last simulate a failure? Not a planned maintenance window — an unannounced test of your failover procedures.

What to check: Schedule a chaos engineering exercise. Kill a primary database instance during business hours, with the team equipped to respond but not told when. Measure actual recovery time against your documented RTO.

Backup and Recovery

4. Backup Integrity Verification

You have automated backups. When did you last restore from them? We routinely find backups that have been silently failing or saving corrupted data for months.

What to check: Perform a full restore to a staging environment this week. Time the process. Verify the data is complete and consistent.
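"Complete and consistent" can be checked mechanically by comparing checksums of a deterministic export from production and from the restored copy. A minimal sketch with illustrative rows (real audits would export each table, not inline two tuples):

```python
import hashlib

def checksum(rows):
    """Order-independent checksum over a table export."""
    h = hashlib.sha256()
    for row in sorted(rows):
        h.update(repr(row).encode())
    return h.hexdigest()

# Hypothetical exports of the same table from production and from
# the staging environment you restored the backup into.
prod_rows     = [("order-1", 99.0), ("order-2", 15.5)]
restored_rows = [("order-2", 15.5), ("order-1", 99.0)]

ok = checksum(prod_rows) == checksum(restored_rows)
print("restore consistent" if ok else "MISMATCH: restored data differs")
```

Sorting before hashing makes the comparison robust to row ordering, which export tools rarely guarantee.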

5. Cross-Region Backup Replication

Are your backups stored in the same region as your primary infrastructure? If a regional outage takes down both your production systems and your backups, you have zero recoverability.

What to check: Verify that at least one backup copy exists in a different region from your production workload.

6. Database Point-in-Time Recovery

Can you restore your database to any point in the last 7 days, not just the last nightly snapshot? Data corruption is often discovered hours after it occurs.

What to check: Verify that continuous WAL archiving or equivalent is enabled and functioning on all production databases.

Security and Access Control

7. Least Privilege Access Enforcement

Does every engineer have admin access because 'we need to move fast'? A compromised developer token with admin privileges gives an attacker the keys to your entire production environment.

What to check: Run an IAM access report. Flag every user with AdministratorAccess or equivalent. Implement role-based access with the minimum permissions required for each function.
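The flagging step is a simple filter over the exported report. A sketch with hypothetical IAM rows (the field names are illustrative; real data would come from your provider's credential report or access analyzer):

```python
# Hypothetical IAM export: principal name -> attached policies.
users = [
    {"name": "alice",     "policies": ["AdministratorAccess"]},
    {"name": "bob",       "policies": ["ReadOnlyAccess"]},
    {"name": "ci-deploy", "policies": ["AdministratorAccess"]},
]

# Flag every principal holding full admin rights.
admins = [u["name"] for u in users if "AdministratorAccess" in u["policies"]]
print(f"{len(admins)} of {len(users)} principals hold admin rights: {admins}")
```

Every name on that list should either have a documented justification or a scoped-down role.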

8. Secret Rotation and Management

Are your API keys, database passwords, and service tokens stored in environment variables that have not changed since initial deployment? Secrets that never rotate are secrets waiting to be compromised.

What to check: Audit when each production secret was last rotated. Implement automated rotation with AWS Secrets Manager, HashiCorp Vault, or equivalent. Target: no secret older than 90 days.
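The 90-day target is easy to audit once you have last-rotation dates. A sketch with hypothetical secrets and a pinned "today" so the arithmetic is reproducible (real dates would come from your secret manager's metadata):

```python
from datetime import date

MAX_AGE_DAYS = 90  # rotation target from the checklist

# Hypothetical audit data: secret name -> last rotation date.
secrets = {
    "db-password":      date(2025, 3, 1),
    "payments-api-key": date(2026, 1, 15),
}

today = date(2026, 2, 23)  # pinned for reproducibility
stale = {name: (today - rotated).days
         for name, rotated in secrets.items()
         if (today - rotated).days > MAX_AGE_DAYS}
print(f"Secrets past the {MAX_AGE_DAYS}-day target: {stale}")
```

Anything in `stale` goes straight onto the rotation backlog, oldest first.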

9. Network Security Group Hygiene

Security groups get opened for debugging and never closed. Unused rules accumulate, expanding your attack surface with every 'temporary' exception.

What to check: Export all security group rules. Flag any rule allowing 0.0.0.0/0 ingress on non-standard ports. Remove rules that were added more than 30 days ago without documentation.
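The 0.0.0.0/0 flagging logic is a one-line filter over the exported rules. A sketch with illustrative rule records and an assumed definition of "standard ports" (adjust to whatever your environment legitimately exposes):

```python
# Hypothetical security-group export; field names are illustrative.
rules = [
    {"id": "sgr-1", "cidr": "0.0.0.0/0",  "port": 443},
    {"id": "sgr-2", "cidr": "0.0.0.0/0",  "port": 5432},
    {"id": "sgr-3", "cidr": "10.0.0.0/8", "port": 5432},
]

STANDARD_PORTS = {80, 443}  # assumption: only web traffic should be public
flagged = [r["id"] for r in rules
           if r["cidr"] == "0.0.0.0/0" and r["port"] not in STANDARD_PORTS]
print(f"Rules open to the world on non-standard ports: {flagged}")
```

In this example, a database port open to the entire internet is caught while the same port restricted to the internal network is not.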

Monitoring and Observability

10. Centralized Logging with Retention

Can you answer the question: "What happened in our production environment at 3:17 AM last Tuesday?" If your logs are scattered across instances or auto-deleted after 7 days, incident investigation becomes guesswork.

What to check: Verify that all production services ship logs to a centralized platform (CloudWatch, Datadog, ELK). Confirm the retention period meets your compliance requirements; regulations such as NIS2 and sector-specific rules may mandate retention far beyond the short defaults most logging platforms ship with.

11. Actionable Alerting (Not Alert Fatigue)

Do your on-call engineers ignore alerts because 90% of them are noise? Alert fatigue is a leading indicator of missed real incidents.

What to check: Review alert history for the past 30 days. If a significant portion of your alerts consistently require no action, your thresholds need tuning. Best practice (per Google SRE) is that every alert should represent a condition requiring human intervention.
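The noise ratio is worth computing explicitly rather than estimating. A sketch over a hypothetical 30-day alert history, where each entry records whether the alert actually required human action:

```python
# Hypothetical alert history: (alert name, did it require human action?)
history = [
    ("disk-80-percent",   False),
    ("disk-80-percent",   False),
    ("db-replica-lag",    True),
    ("ssl-cert-expiring", True),
    ("disk-80-percent",   False),
]

actionable = sum(1 for _, acted in history if acted)
noise_ratio = 1 - actionable / len(history)
print(f"{noise_ratio:.0%} of alerts required no action")
```

A repeat offender like the disk alert above is the first candidate for threshold tuning or automation.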

12. Infrastructure Drift Detection

Cloud environments drift. Manual console changes, undocumented tweaks, and configuration patches accumulate until nobody knows the true state of the infrastructure.

What to check: Run a Terraform plan (or equivalent) against your production environment. If the diff is not empty, you have drift. The larger the diff, the larger your risk.
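Conceptually, drift detection is a key-by-key diff between declared state and live state. A minimal, tool-agnostic sketch (both dicts are illustrative; Terraform computes the real equivalent from your state file and provider APIs):

```python
# Declared state (from IaC) vs. live state (from a provider API export).
declared = {"instance_type": "t3.large",  "monitoring": True, "port": 443}
live     = {"instance_type": "t3.xlarge", "monitoring": True, "port": 443}

# Any key whose declared and live values disagree is drift.
drift = {k: (declared.get(k), live.get(k))
         for k in declared.keys() | live.keys()
         if declared.get(k) != live.get(k)}
print(f"Drifted keys (declared -> live): {drift}")
```

Here someone has manually resized the instance in the console; the diff surfaces exactly what changed and in which direction.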

Cost and Resource Governance

13. Orphaned Resource Identification

Unattached EBS volumes, idle load balancers, forgotten EC2 instances, unused Elastic IPs — they cost money and expand your attack surface.

What to check: Run AWS Cost Explorer or equivalent. Filter for resources with zero or near-zero utilization over the past 30 days. We typically find 15-25% waste in environments that have not been audited in 6+ months.
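Once utilization is exported, finding orphans is a threshold filter. A sketch with hypothetical 30-day utilization figures and an assumed idle cutoff:

```python
# Hypothetical export: resource id -> average utilization (%) over 30 days.
utilization = {
    "i-web-01":   42.0,
    "i-test-old":  0.1,
    "elb-legacy":  0.0,
}

IDLE_THRESHOLD = 1.0  # assumption: under 1% average utilization counts as idle
orphans = [rid for rid, u in utilization.items() if u < IDLE_THRESHOLD]
print(f"Candidates for decommissioning: {orphans}")
```

Each candidate still needs a human check before deletion, since rarely-used resources (month-end batch jobs, DR standbys) can look idle.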

14. Reserved Capacity Planning

Are you running predictable, long-running workloads on on-demand pricing? This is the single most common source of cloud overspend for mid-market companies.

What to check: Identify instances that have been running continuously for 6+ months. Calculate savings from Reserved Instances or Savings Plans. Typical savings: 30-60% for 1-year commitments, depending on provider and workload type.
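The savings calculation itself is simple arithmetic. A sketch with illustrative hourly rates (check your provider's current rate card; the reserved price below is a hypothetical 1-year, no-upfront figure):

```python
HOURS_PER_MONTH = 730  # common billing approximation

# Illustrative prices per instance-hour -- not real rate-card values.
on_demand_hourly = 0.10
reserved_hourly  = 0.062

monthly_on_demand = on_demand_hourly * HOURS_PER_MONTH
monthly_reserved  = reserved_hourly * HOURS_PER_MONTH
savings_pct = (monthly_on_demand - monthly_reserved) / monthly_on_demand
print(f"Reserved saves {savings_pct:.0%} "
      f"(${monthly_on_demand - monthly_reserved:.2f}/month per instance)")
```

Multiply the per-instance figure by every instance that has run continuously for 6+ months to size the opportunity.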

Operational Readiness

15. Shadow Infrastructure Inventory

That server an engineer spun up in 2023 for a quick test? It is now running a critical background job. It is not monitored. Nobody knows the root password. And it is not in your infrastructure-as-code.

What to check: Compare your IaC-managed resource list against the actual resources in your cloud account. Every resource that exists in your account but not in your IaC is shadow infrastructure — and it is a risk.
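The comparison is a set difference between what your IaC manages and what actually exists. A sketch with illustrative resource IDs (the managed list would come from something like `terraform state list`, the actual list from a provider inventory export):

```python
# Resource IDs under IaC management vs. everything in the account.
managed = {"i-web-01", "i-web-02", "rds-main"}
actual  = {"i-web-01", "i-web-02", "rds-main", "i-quick-test-2023"}

# Anything that exists but is not managed is shadow infrastructure.
shadow = actual - managed
print(f"Shadow infrastructure (exists, but unmanaged): {sorted(shadow)}")
```

The inverse difference (`managed - actual`) is also worth checking: it reveals resources your IaC believes exist but which have been deleted by hand.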

Scoring Your Cloud Resilience

For each of the 15 items above, score yourself:

  • 3 points — Fully implemented, tested within the past 6 months, documented
  • 2 points — Implemented but not recently tested or partially documented
  • 1 point — Partially implemented or planned
  • 0 points — Not implemented or unknown status
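Tallying the score is trivial to automate, which makes it easy to re-run the self-assessment quarterly. A sketch with an illustrative set of item scores:

```python
# Illustrative self-assessment: one 0-3 score per checklist item.
scores = [3, 2, 1, 2, 3, 1, 0, 2, 1, 3, 2, 2, 1, 2, 0]

assert len(scores) == 15 and all(0 <= s <= 3 for s in scores)
total = sum(scores)
zero_items = [i + 1 for i, s in enumerate(scores) if s == 0]
print(f"Resilience score: {total} / 45; zero-score items: {zero_items}")
```

The zero-score item numbers double as your immediate priority list.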

Score interpretation:

  • 40-45: Strong. You are ahead of most mid-market companies.
  • 30-39: Adequate. Address the gaps before they become incidents.
  • 20-29: At risk. Prioritize the zero-score items immediately.
  • Below 20: Critical. Consider a professional assessment before your next incident decides for you.

Most companies we assess score between 22 and 32. The ones that score above 35 have usually experienced a significant outage in the past and learned from it. The goal is to reach that level of resilience without the outage.

If you scored below 30, the most impactful first step is a 2-day cloud resilience assessment with our infrastructure team. We evaluate all 15 areas against your specific environment and deliver a prioritized remediation roadmap.

Cloud Resilience Benchmarks

  • Typical waste found in audits: 15-25% of cloud spend
  • Typical savings from reserved capacity: 30-60%

* Based on J&L Dev cloud assessments across Baltic and Nordic mid-market companies as of February 2026.