
AWS CloudWatch Tutorial: Complete Monitoring Guide for Cloud Engineers (2026)

April 12, 2026 · 21 min read

You've deployed your application to AWS. Now how do you know it's running correctly? How do you get paged at 2am when something goes wrong before your users notice? How do you understand why last Tuesday's traffic spike caused latency issues?

The answer is AWS CloudWatch — the native observability platform that ties together metrics, logs, alarms, and dashboards for everything running in your AWS account.

This guide teaches you CloudWatch from scratch, including production patterns proven at scale and covered on the AWS Certified DevOps Engineer Professional exam.

What is AWS CloudWatch?

CloudWatch is a fully managed monitoring and observability service that collects, stores, and analyzes:

  • Metrics: Numerical time-series data (CPU utilization, request count, latency)
  • Logs: Raw text output from your applications and AWS services
  • Alarms: Triggers based on metric thresholds (page me when CPU > 80%)
  • Dashboards: Visualizations combining metrics and logs
  • Events: Schedule-based and change-based triggers (now called EventBridge)
  • ServiceLens: Distributed tracing with X-Ray integration

Every AWS service automatically sends metrics to CloudWatch. EC2 sends CPU and network metrics. ECS sends task metrics. Lambda sends invocation counts and errors. You get these basic metrics for free (EC2 publishes at 5-minute granularity by default; 1-minute detailed monitoring costs extra).

CloudWatch Metrics

Namespaces and Dimensions

Metrics are organized by namespace (AWS service or custom) and dimensions (filters).

# List all available metrics for EC2
aws cloudwatch list-metrics --namespace AWS/EC2

# Get CPU utilization for a specific instance
aws cloudwatch get-metric-statistics \
  --namespace AWS/EC2 \
  --metric-name CPUUtilization \
  --dimensions Name=InstanceId,Value=i-1234567890abcdef0 \
  --start-time 2026-04-12T00:00:00Z \
  --end-time 2026-04-12T23:59:59Z \
  --period 300 \
  --statistics Average

Key Metrics to Monitor by Service

EC2:

  • CPUUtilization — scale up when > 70% for 5 minutes
  • NetworkIn/NetworkOut — detect traffic anomalies
  • DiskReadOps/DiskWriteOps — identify I/O bottlenecks
  • StatusCheckFailed — hardware/software failures

RDS:

  • CPUUtilization — should stay < 80%
  • DatabaseConnections — alert near max_connections
  • ReadLatency / WriteLatency — target < 10ms for OLTP
  • FreeStorageSpace — alert when < 20% remaining

ECS/Fargate:

  • CPUUtilization — service-level average; drives ECS Service Auto Scaling (the ECS equivalent of a Kubernetes HPA)
  • MemoryUtilization — OOM kills when this hits 100%
  • RunningTaskCount — detect task failures

Lambda:

  • Invocations — volume
  • Errors — count and error rate
  • Duration — P99 latency (set alarms on P99, not average!)
  • Throttles — hitting concurrency limits
  • ConcurrentExecutions — approaching account limits

ALB (Application Load Balancer):

  • RequestCount — traffic volume
  • TargetResponseTime — P99 is critical
  • HTTPCode_ELB_5XX_Count — server errors
  • HTTPCode_Target_5XX_Count — application errors
  • UnHealthyHostCount — targets failing health checks
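
Percentile statistics such as P99 aren't available through --statistics; use --extended-statistics instead. A minimal sketch pulling P99 ALB response time (the LoadBalancer dimension value is a placeholder; use your ALB's arn_suffix):

# Get P99 target response time for an ALB
aws cloudwatch get-metric-statistics \
  --namespace AWS/ApplicationELB \
  --metric-name TargetResponseTime \
  --dimensions Name=LoadBalancer,Value=app/my-alb/50dc6c495c0c9188 \
  --start-time 2026-04-12T00:00:00Z \
  --end-time 2026-04-12T23:59:59Z \
  --period 300 \
  --extended-statistics p99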

Custom Metrics

AWS services send metrics automatically, but your application doesn't. Send custom metrics using the AWS SDK:

import boto3
import time

cloudwatch = boto3.client('cloudwatch', region_name='us-east-1')

def record_order_processed(order_id: str, processing_time_ms: float, success: bool):
    cloudwatch.put_metric_data(
        Namespace='MyApp/Orders',
        MetricData=[
            {
                'MetricName': 'OrderProcessingTime',
                'Value': processing_time_ms,
                'Unit': 'Milliseconds',
                'Dimensions': [
                    {'Name': 'Environment', 'Value': 'production'},
                    {'Name': 'Service', 'Value': 'order-processor'},
                ],
            },
            {
                'MetricName': 'OrdersProcessed',
                'Value': 1,
                'Unit': 'Count',
                'Dimensions': [
                    {'Name': 'Environment', 'Value': 'production'},
                    {'Name': 'Status', 'Value': 'success' if success else 'failure'},
                ],
            },
        ]
    )

Node.js (TypeScript):

import { CloudWatchClient, PutMetricDataCommand } from "@aws-sdk/client-cloudwatch";

const cw = new CloudWatchClient({ region: "us-east-1" });

async function recordApiLatency(endpoint: string, latencyMs: number) {
  await cw.send(new PutMetricDataCommand({
    Namespace: "MyApp/API",
    MetricData: [{
      MetricName: "Latency",
      Value: latencyMs,
      Unit: "Milliseconds",
      Dimensions: [
        { Name: "Endpoint", Value: endpoint },
        { Name: "Environment", Value: process.env.NODE_ENV ?? "production" },
      ],
    }],
  }));
}

Cost note: CloudWatch charges $0.30 per custom metric per month (for the first 10,000 metrics; tiered discounts apply beyond that). Every unique combination of metric name and dimension values bills as its own metric, so keep dimension cardinality low (never put user IDs or request IDs in dimensions).

CloudWatch Alarms

Alarms trigger actions when metric thresholds are crossed.

Creating Alarms with Terraform

# SNS topic for alerts
resource "aws_sns_topic" "alerts" {
  name = "production-alerts"
}

resource "aws_sns_topic_subscription" "pagerduty" {
  topic_arn = aws_sns_topic.alerts.arn
  protocol  = "https"
  endpoint  = var.pagerduty_webhook_url
}

# High CPU alarm for ECS service
resource "aws_cloudwatch_metric_alarm" "high_cpu" {
  alarm_name          = "api-high-cpu"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 2           # Must breach for 2 consecutive periods
  metric_name         = "CPUUtilization"
  namespace           = "AWS/ECS"
  period              = 300         # 5-minute periods
  statistic           = "Average"
  threshold           = 80
  alarm_description   = "ECS API service CPU > 80% for 10 minutes"
  treat_missing_data  = "notBreaching"

  dimensions = {
    ClusterName = aws_ecs_cluster.production.name
    ServiceName = aws_ecs_service.api.name
  }

  alarm_actions = [aws_sns_topic.alerts.arn]
  ok_actions    = [aws_sns_topic.alerts.arn]  # Notify on recovery too
}

# Error rate alarm for Lambda
resource "aws_cloudwatch_metric_alarm" "lambda_errors" {
  alarm_name          = "lambda-high-error-rate"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 1
  threshold           = 5
  alarm_description   = "Lambda error rate > 5 in 1 minute"

  metric_query {
    id          = "error_rate"
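    # MAX() keeps the denominator non-zero in periods with no invocations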
    expression  = "100 * errors / MAX([errors, invocations])"
    label       = "Error Rate (%)"
    return_data = true
  }

  metric_query {
    id = "errors"
    metric {
      metric_name = "Errors"
      namespace   = "AWS/Lambda"
      period      = 60
      stat        = "Sum"
      dimensions  = { FunctionName = aws_lambda_function.api.function_name }
    }
  }

  metric_query {
    id = "invocations"
    metric {
      metric_name = "Invocations"
      namespace   = "AWS/Lambda"
      period      = 60
      stat        = "Sum"
      dimensions  = { FunctionName = aws_lambda_function.api.function_name }
    }
  }

  alarm_actions = [aws_sns_topic.alerts.arn]
}

# RDS free storage alarm
resource "aws_cloudwatch_metric_alarm" "rds_low_storage" {
  alarm_name          = "rds-low-storage"
  comparison_operator = "LessThanThreshold"
  evaluation_periods  = 1
  metric_name         = "FreeStorageSpace"
  namespace           = "AWS/RDS"
  period              = 300
  statistic           = "Average"
  threshold           = 10737418240  # 10 GB in bytes
  alarm_description   = "RDS free storage < 10GB"

  dimensions = {
    DBInstanceIdentifier = aws_db_instance.main.identifier
  }

  alarm_actions = [aws_sns_topic.alerts.arn]
}

# ALB 5xx error rate
resource "aws_cloudwatch_metric_alarm" "alb_5xx" {
  alarm_name          = "alb-5xx-errors"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 2
  threshold           = 10
  alarm_description   = "More than 10 5xx errors in 5 minutes"

  metric_query {
    id          = "error_rate"
    expression  = "100 * errors / requests"
    label       = "5xx Error Rate (%)"
    return_data = true
  }

  metric_query {
    id = "errors"
    metric {
      metric_name = "HTTPCode_Target_5XX_Count"
      namespace   = "AWS/ApplicationELB"
      period      = 300
      stat        = "Sum"
      dimensions  = { LoadBalancer = aws_lb.main.arn_suffix }
    }
  }

  metric_query {
    id = "requests"
    metric {
      metric_name = "RequestCount"
      namespace   = "AWS/ApplicationELB"
      period      = 300
      stat        = "Sum"
      dimensions  = { LoadBalancer = aws_lb.main.arn_suffix }
    }
  }

  alarm_actions = [aws_sns_topic.alerts.arn]
}

Alarm Best Practices

  1. Always alert on P99, not the average. Mean latency hides the pain your worst 1% of users experience.
  2. Use evaluation_periods > 1 for CPU/memory. A single spike shouldn't page you at 3am.
  3. Set ok_actions to get recovery notifications. Without them, you'll forget an alarm was firing.
  4. treat_missing_data = "notBreaching" prevents false alarms when services scale to zero.
  5. Use composite alarms to reduce noise: only page if CPU AND error rate are both high (see the Terraform sketch below).
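
A composite alarm pages only when a boolean rule over other alarms is true. A minimal Terraform sketch combining the CPU and ALB 5xx alarms defined earlier (resource names reuse those examples):

# Page only when BOTH the CPU alarm and the 5xx alarm are firing
resource "aws_cloudwatch_composite_alarm" "cpu_and_errors" {
  alarm_name = "api-cpu-and-5xx"
  alarm_rule = "ALARM(${aws_cloudwatch_metric_alarm.high_cpu.alarm_name}) AND ALARM(${aws_cloudwatch_metric_alarm.alb_5xx.alarm_name})"

  alarm_actions = [aws_sns_topic.alerts.arn]
}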

CloudWatch Logs

Every AWS service logs to CloudWatch Logs. Your applications should too.

Sending Application Logs

ECS/Fargate (add to task definition):

{
  "logConfiguration": {
    "logDriver": "awslogs",
    "options": {
      "awslogs-group": "/ecs/my-app",
      "awslogs-region": "us-east-1",
      "awslogs-stream-prefix": "ecs"
    }
  }
}

Lambda (automatic): Lambda automatically sends stdout/stderr to /aws/lambda/function-name.

EC2 (use CloudWatch Agent):

{
  "logs": {
    "logs_collected": {
      "files": {
        "collect_list": [
          {
            "file_path": "/var/log/my-app/*.log",
            "log_group_name": "/ec2/my-app",
            "log_stream_name": "{instance_id}",
            "timestamp_format": "%Y-%m-%dT%H:%M:%S"
          }
        ]
      }
    }
  }
}
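
The agent only picks up this config once you load it. Assuming the JSON above is saved at /opt/aws/amazon-cloudwatch-agent/etc/config.json (the path is an example), fetch and apply it:

# -a fetch-config loads the file, -s restarts the agent with it
sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl \
  -a fetch-config -m ec2 -s \
  -c file:/opt/aws/amazon-cloudwatch-agent/etc/config.json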

CloudWatch Logs Insights

The most powerful feature for debugging production issues — query your logs with a purpose-built, pipe-based query language:

# Find all errors in the last hour
fields @timestamp, @message
| filter @message like /ERROR/
| sort @timestamp desc
| limit 100

# Average response time by endpoint (last 30 min)
fields @timestamp, endpoint, duration
| filter ispresent(endpoint)
| stats avg(duration) as avgMs, count(*) as requests by endpoint
| sort avgMs desc

# Find users experiencing errors
fields @timestamp, userId, @message
| filter @message like /Exception/
| stats count(*) as errorCount by userId
| sort errorCount desc
| limit 20

# P99 latency by service
fields @timestamp, service, latency
| stats pct(latency, 99) as p99, pct(latency, 95) as p95, avg(latency) as avg by service
| sort p99 desc

# Count requests per status code
fields @timestamp, statusCode
| stats count(*) as count by statusCode
| sort count desc
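
The same queries run headlessly via the CLI: start-query returns an ID you poll with get-query-results. A sketch using the /ecs/my-app log group from earlier (GNU date syntax assumed for the time range):

# Kick off the query and capture its ID
QUERY_ID=$(aws logs start-query \
  --log-group-name /ecs/my-app \
  --start-time $(date -d '1 hour ago' +%s) \
  --end-time $(date +%s) \
  --query-string 'fields @timestamp, @message | filter @message like /ERROR/ | sort @timestamp desc | limit 100' \
  --output text --query queryId)

# Results take a few seconds; re-run until status is Complete
aws logs get-query-results --query-id "$QUERY_ID"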

Metric Filters

Convert log patterns into metrics for alarming:

resource "aws_cloudwatch_log_metric_filter" "error_count" {
  name           = "application-error-count"
  pattern        = "[timestamp, requestId, level=ERROR, ...]"
  log_group_name = "/ecs/my-app"

  metric_transformation {
    name          = "ErrorCount"
    namespace     = "MyApp/Application"
    value         = "1"
    default_value = "0"
  }
}

# Now create an alarm on this custom metric
resource "aws_cloudwatch_metric_alarm" "app_errors" {
  alarm_name          = "application-errors"
  metric_name         = "ErrorCount"
  namespace           = "MyApp/Application"
  threshold           = 5
  period              = 60
  evaluation_periods  = 1
  statistic           = "Sum"
  comparison_operator = "GreaterThanThreshold"
  alarm_actions       = [aws_sns_topic.alerts.arn]
}

CloudWatch Dashboards

Build operational dashboards visible to the whole team:

resource "aws_cloudwatch_dashboard" "production" {
  dashboard_name = "production-overview"
  dashboard_body = jsonencode({
    widgets = [
      {
        type   = "metric"
        x      = 0
        y      = 0
        width  = 12
        height = 6
        properties = {
          title  = "API Request Rate & Error Rate"
          period = 60
          stat   = "Sum"
          metrics = [
            ["AWS/ApplicationELB", "RequestCount", "LoadBalancer", aws_lb.main.arn_suffix],
            ["AWS/ApplicationELB", "HTTPCode_Target_5XX_Count", "LoadBalancer", aws_lb.main.arn_suffix],
          ]
        }
      },
      {
        type   = "metric"
        x      = 12
        y      = 0
        width  = 12
        height = 6
        properties = {
          title  = "ECS CPU & Memory"
          period = 60
          metrics = [
            ["AWS/ECS", "CPUUtilization", "ClusterName", "production", "ServiceName", "api"],
            ["AWS/ECS", "MemoryUtilization", "ClusterName", "production", "ServiceName", "api"],
          ]
        }
      },
      {
        type   = "log"
        x      = 0
        y      = 6
        width  = 24
        height = 6
        properties = {
          title   = "Recent Errors"
          query   = "SOURCE '/ecs/my-app' | fields @timestamp, @message | filter @message like /ERROR/ | sort @timestamp desc | limit 20"
          region  = "us-east-1"
          view    = "table"
        }
      }
    ]
  })
}

Container Insights for ECS

Enable enhanced ECS monitoring with a single cluster setting:

resource "aws_ecs_cluster" "production" {
  name = "production"

  setting {
    name  = "containerInsights"
    value = "enabled"   # Enhanced metrics + logs at task/container level
  }
}

Container Insights adds per-container CPU/memory breakdown, network stats, and storage I/O — critical for diagnosing which container in a multi-container task is the bottleneck.
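
Container Insights also writes raw performance events to a log group (/aws/ecs/containerinsights/<cluster>/performance), so you can slice them with Logs Insights. A sketch, assuming the event fields follow the documented Container Insights schema:

# Busiest containers by average memory, from performance log events
fields @timestamp, ContainerName, MemoryUtilized
| filter Type = "Container"
| stats avg(MemoryUtilized) as avgMemMB by ContainerName
| sort avgMemMB desc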

The Production Monitoring Checklist

Before going live, verify you have:

□ CPU/Memory alarms on all ECS services
□ Error rate alarms on ALB (5xx > 1%)
□ RDS storage alarm (< 10GB free)
□ RDS connection count alarm (> 80% of max_connections)
□ Lambda error rate alarm (> 1%)
□ Lambda timeout alarm (p99 > 80% of timeout)
□ Application logs flowing to CloudWatch Logs
□ Log retention set (default is never expire — expensive!)
□ Dashboard for on-call engineers
□ SNS → PagerDuty/Slack integration tested
□ Runbook linked in alarm description

Log retention is critical for cost control:

resource "aws_cloudwatch_log_group" "app" {
  name              = "/ecs/my-app"
  retention_in_days = 30  # Default is never expire — set this!
}

CloudWatch in the Real World

Cost optimization: CloudWatch can get expensive. Each custom metric costs $0.30/month. Logs cost $0.50/GB ingested. At scale:

  • Use EMF (Embedded Metric Format) to publish metrics through log lines instead of PutMetricData calls (see the sketch after this list)
  • Set log retention on every log group
  • Use S3 export for logs older than 30 days
  • Prefer structured JSON logging for efficient Logs Insights queries
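
EMF works by printing a specially shaped JSON envelope that CloudWatch Logs turns into metrics on ingestion, so one log line replaces a PutMetricData API call. A minimal Python sketch (namespace and field names are illustrative):

import json
import time

def emit_latency_emf(endpoint: str, latency_ms: float) -> None:
    # A single structured log line; CloudWatch extracts the metric
    # definition from the _aws envelope and the values from the root keys.
    print(json.dumps({
        "_aws": {
            "Timestamp": int(time.time() * 1000),  # epoch milliseconds
            "CloudWatchMetrics": [{
                "Namespace": "MyApp/API",
                "Dimensions": [["Endpoint"]],
                "Metrics": [{"Name": "Latency", "Unit": "Milliseconds"}],
            }],
        },
        "Endpoint": endpoint,   # dimension value
        "Latency": latency_ms,  # metric value
    }))

In Lambda this works out of the box since stdout already lands in CloudWatch Logs; on ECS and EC2, route the line through an EMF-aware path such as the CloudWatch agent, or verify your log driver preserves the format.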

Multi-account monitoring: Use AWS Organizations + CloudWatch cross-account observability to aggregate metrics from all accounts into a central monitoring account.
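
With cross-account observability, the monitoring account creates a sink and each source account links to it. A minimal Terraform sketch of the two halves (the sink policy that authorizes source accounts is also required, omitted here; the sink ARN is passed in as a variable):

# Monitoring account: the sink that source accounts attach to
resource "aws_oam_sink" "monitoring" {
  name = "central-monitoring"
}

# Each source account: share metrics and log groups into the sink
resource "aws_oam_link" "to_monitoring" {
  label_template  = "$AccountName"
  resource_types  = ["AWS::CloudWatch::Metric", "AWS::Logs::LogGroup"]
  sink_identifier = var.monitoring_sink_arn
}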


*Phase 3 of CloudPath Academy covers production monitoring in depth — CloudWatch, X-Ray distributed tracing, custom metrics, dashboards, and real incident response. You'll build monitoring for your own ECS deployments as part of the curriculum.*
