Infrastructure metrics coverage across the full stack -- compute, database, network, and serverless -- with alert thresholds calibrated to your services' actual behaviour rather than arbitrary percentages that create alert fatigue. AWS metric collection: Datadog agent on EC2 and ECS task definitions collects CPU, memory, disk I/O, and network metrics at 15-second intervals; CloudWatch native metrics ingested for Lambda (invocation count, duration, error rate, concurrent executions, throttle count), RDS (CPU, IOPS, connection count, replica lag), SQS (queue depth, oldest message age), and EKS nodes. GCP metric collection: Datadog or Grafana Agent deployed to GKE nodes and Cloud Run services with GCP Cloud Monitoring as the supplementary metrics source. Alert calibration process: baseline each metric over a 14-day normal operation window; identify the 95th percentile of normal values; set alert thresholds at statistically meaningful deviations above that baseline; suppress alerts during scheduled maintenance windows. Alert severity routing: critical (page immediately, wake up on-call) for conditions like service down, database connection exhaustion, or disk full within 1 hour; warning (notify via Slack, no page) for conditions approaching thresholds but not yet urgent; informational (log only) for metrics useful in postmortems but not requiring immediate action. PagerDuty integration with on-call schedule rotation so the right engineer is paged without manual escalation; OpsGenie as alternative for teams already using it. Alert-to-runbook linking: every alert configured with a runbook_url annotation pointing to the relevant incident response procedure.