AWS CloudWatch + PagerDuty
Integrate AWS CloudWatch with PagerDuty to Automate Incident Response
Turn CloudWatch alarms into PagerDuty incidents instantly, so your on-call team is always the first to know.


Why integrate AWS CloudWatch and PagerDuty?
AWS CloudWatch and PagerDuty do two complementary jobs that deliver their full value only when connected. CloudWatch watches your AWS infrastructure continuously, tracking metrics, logs, and alarms across EC2, Lambda, RDS, and dozens of other services. PagerDuty makes sure the right engineers are notified and moving the moment something breaks. Without a connection between them, there's a gap between detection and response where incidents go unnoticed and resolution time climbs. Connecting these platforms through tray.ai closes that gap — an automated pipeline from anomaly to alert to fix.
Automate & integrate AWS CloudWatch & PagerDuty
Use case
CloudWatch Alarm to PagerDuty Incident Creation
When a CloudWatch alarm transitions to ALARM state — whether from high CPU utilization, memory pressure, or network anomalies — tray.ai opens a new PagerDuty incident and routes it to the appropriate service and escalation policy. The incident is populated with alarm metadata including the affected resource, metric value, and breached threshold, giving on-call engineers immediate context without needing to log into the AWS console.
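As a rough sketch, the mapping from alarm metadata to an incident might look like the following. The field names follow the standard CloudWatch-to-SNS alarm payload; the routing key, fixed severity, and example alarm are placeholders, not part of any tray.ai schema.

```python
import json

# Placeholder for the PagerDuty Events API v2 integration (routing) key.
ROUTING_KEY = "YOUR_PAGERDUTY_ROUTING_KEY"

def build_trigger_event(alarm: dict) -> dict:
    """Map a CloudWatch alarm state-change payload to a PagerDuty
    Events API v2 'trigger' event."""
    trigger = alarm["Trigger"]
    return {
        "routing_key": ROUTING_KEY,
        "event_action": "trigger",
        # Reusing the alarm ARN as dedup_key ties repeat firings of the
        # same alarm to a single incident.
        "dedup_key": alarm["AlarmArn"],
        "payload": {
            "summary": f"{alarm['AlarmName']}: {alarm['NewStateReason']}",
            "source": trigger["Namespace"],
            "severity": "critical",  # fixed here; real workflows map severity
            "custom_details": {
                "metric": trigger["MetricName"],
                "threshold": trigger["Threshold"],
                "region": alarm["Region"],
            },
        },
    }

# Abridged example of an alarm payload as SNS would deliver it:
alarm = {
    "AlarmName": "HighCPU-web-01",
    "AlarmArn": "arn:aws:cloudwatch:us-east-1:123456789012:alarm:HighCPU-web-01",
    "NewStateReason": "Threshold Crossed: datapoint above 90.0",
    "Region": "US East (N. Virginia)",
    "Trigger": {"Namespace": "AWS/EC2", "MetricName": "CPUUtilization",
                "Threshold": 90.0},
}
event = build_trigger_event(alarm)
print(json.dumps(event, indent=2))
```

In a real workflow the resulting event would be POSTed to the PagerDuty Events API endpoint; the sketch stops at payload construction.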
Use case
Auto-Resolve PagerDuty Incidents When CloudWatch Returns to OK
When a CloudWatch alarm recovers and transitions back to OK state, tray.ai automatically resolves the corresponding PagerDuty incident. Stale alerts stop cluttering dashboards and keeping engineers unnecessarily on edge. PagerDuty always reflects the true health of your AWS infrastructure in real time.
Use case
Severity-Based Incident Routing from CloudWatch Metrics
A Lambda timeout deserves a different response than a full RDS database failure. With tray.ai, you can define conditional logic that maps CloudWatch alarm severity, namespace, or resource type to specific PagerDuty services, urgency levels, and escalation policies. Critical P1 incidents immediately page senior engineers while informational warnings are quietly logged for review.
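A minimal sketch of such a severity matrix, assuming routing on metric namespace; the namespaces shown are real AWS values, but the service IDs and the mapping itself are illustrative.

```python
# Hypothetical severity matrix keyed by CloudWatch metric namespace.
# Service IDs are placeholders for real PagerDuty service identifiers.
SEVERITY_MATRIX = {
    "AWS/RDS":    {"service": "PROD_DB_SERVICE_ID", "urgency": "high"},
    "AWS/Lambda": {"service": "APP_SERVICE_ID",     "urgency": "low"},
}
DEFAULT_ROUTE = {"service": "CATCH_ALL_SERVICE_ID", "urgency": "low"}

def route_alarm(namespace: str) -> dict:
    """Map an alarm's metric namespace to a PagerDuty service and urgency."""
    return SEVERITY_MATRIX.get(namespace, DEFAULT_ROUTE)

print(route_alarm("AWS/RDS"))     # -> {'service': 'PROD_DB_SERVICE_ID', 'urgency': 'high'}
print(route_alarm("AWS/Backup"))  # unmapped namespace falls through to the default
```

A production matrix would typically also key on resource tags or alarm naming conventions, not namespace alone.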
Use case
CloudWatch Log Insights Anomaly Alerting via PagerDuty
CloudWatch Log Insights can surface error spikes, unusual patterns, and application-level failures buried in log streams. By connecting with PagerDuty through tray.ai, teams can trigger incidents when log-based metric filters breach thresholds — a sudden surge in 5xx errors or repeated authentication failures, for example — so application-layer issues get the same incident management treatment as infrastructure alarms.
Use case
Scheduled AWS Health and Budget Alarm Summaries to PagerDuty
Beyond real-time alerting, tray.ai can run scheduled workflows that query CloudWatch for metric trends, billing anomalies, or AWS Health events and push summarized reports as low-urgency PagerDuty incidents or status updates. Operations teams get visibility into slow-burning issues — gradually increasing error rates, cost overruns — before they hit critical thresholds.
Use case
Multi-Region CloudWatch Alarm Aggregation into Unified PagerDuty Incidents
Organizations running workloads across multiple AWS regions often end up with the same underlying issue triggering dozens of redundant alarms. tray.ai can aggregate correlated CloudWatch alarms from multiple regions into a single, deduplicated PagerDuty incident, cutting the noise and helping on-call engineers find the root cause without sifting through hundreds of duplicate notifications.
Use case
Post-Incident CloudWatch Metric Snapshots Attached to PagerDuty
Once a PagerDuty incident is resolved, tray.ai can automatically retrieve historical CloudWatch metric data from the incident window and attach it as notes or links within the PagerDuty incident timeline. Every incident becomes a self-documenting record with the exact metric graphs and log data from the failure period, so post-mortems move faster without engineers manually pulling reports after the fact.
Get started with AWS CloudWatch & PagerDuty integration today
AWS CloudWatch & PagerDuty Challenges
What challenges come up when working with AWS CloudWatch & PagerDuty, and how does Tray.ai help?
Challenge
Alarm State Transitions Generating Duplicate or Redundant Incidents
CloudWatch alarms frequently flap between ALARM and OK states during intermittent issues, flooding PagerDuty with duplicate incident create and resolve events that exhaust on-call engineers and erode trust in the alerting system.
How Tray.ai Can Help:
tray.ai workflows implement deduplication logic using PagerDuty's dedup_key field and state-tracking within the workflow itself, so a flapping alarm maps to a single incident lifecycle rather than generating a flood of redundant notifications.
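One way to derive a stable deduplication key, assuming the alarm ARN uniquely identifies the alarm: hash it, so every firing of the same alarm carries the same `dedup_key` and PagerDuty folds repeats into one incident.

```python
import hashlib

def dedup_key_for(alarm_arn: str) -> str:
    """Derive a stable dedup_key from a CloudWatch alarm ARN.

    A flapping alarm then re-triggers and resolves one incident instead of
    opening a new one each cycle. Hashing keeps the key well under
    PagerDuty's 255-character dedup_key limit regardless of ARN length.
    """
    return hashlib.sha256(alarm_arn.encode()).hexdigest()

key = dedup_key_for(
    "arn:aws:cloudwatch:us-east-1:123456789012:alarm:HighCPU-web-01"
)
print(key)  # same ARN always yields the same key
```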
Challenge
Mapping AWS Resource Context to Actionable PagerDuty Incidents
Raw CloudWatch alarm payloads contain AWS-specific identifiers like ARNs, metric namespaces, and dimension keys that mean something to AWS engineers but leave on-call responders without the plain-language context they need to act quickly.
How Tray.ai Can Help:
tray.ai's data transformation tools let teams parse and enrich CloudWatch payloads — translating resource ARNs into human-readable names, appending runbook links, and formatting metric data into clear incident summaries — before anything reaches PagerDuty.
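A simplified example of that enrichment step. The runbook mapping is hypothetical, and the parser handles only colon-delimited ARNs; resources with slash-delimited paths would need extra handling.

```python
# Hypothetical runbook registry keyed by resource name.
RUNBOOKS = {"orders-api": "https://wiki.example.com/runbooks/orders-api"}

def friendly_name(arn: str) -> str:
    """'arn:aws:lambda:us-east-1:123456789012:function:orders-api'
    -> 'lambda/orders-api'"""
    parts = arn.split(":")
    service, resource = parts[2], parts[-1]
    return f"{service}/{resource}"

name = friendly_name("arn:aws:lambda:us-east-1:123456789012:function:orders-api")
print(name)                                  # -> lambda/orders-api
print(RUNBOOKS.get(name.split("/")[1], ""))  # runbook link, if one is mapped
```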
Challenge
Routing Alarms from Multiple AWS Accounts and Regions
Enterprises running across multiple AWS accounts and regions face a real headache consolidating CloudWatch alarms from fragmented infrastructure into a coherent PagerDuty incident structure without building and maintaining custom routing logic in every account.
How Tray.ai Can Help:
tray.ai acts as a centralized integration layer that receives alarm events from all AWS accounts and regions via a shared SNS endpoint, applies unified routing logic, and maps alarms to the correct PagerDuty services and teams — no per-account Lambda functions required.
Challenge
Keeping PagerDuty Incident State in Sync with CloudWatch Alarm Lifecycle
Without automated lifecycle management, PagerDuty incidents stay open long after a CloudWatch alarm has recovered, creating stale incident backlogs that mislead teams about actual infrastructure health and make on-call handoffs messy.
How Tray.ai Can Help:
tray.ai workflows handle the full alarm lifecycle — creating incidents on ALARM transitions, acknowledging them when engineers accept the alert, and resolving them automatically when CloudWatch returns to OK — so PagerDuty stays synchronized with live AWS infrastructure state.
Challenge
Alert Fatigue from High-Volume CloudWatch Metric Alarms
Large AWS environments with hundreds of monitored resources can generate enormous volumes of CloudWatch alarms, many representing expected transient conditions. Routing all of them directly to PagerDuty overwhelms on-call teams and causes critical alerts to get buried.
How Tray.ai Can Help:
tray.ai filters and aggregates before incidents reach PagerDuty — suppressing known transient alarms, grouping related metric failures into single incidents, and applying time-based rules that reduce off-hours notifications for non-critical services — so engineers only see alerts that actually need a human response.
Start using our pre-built AWS CloudWatch & PagerDuty templates today
Start from scratch or use one of our pre-built AWS CloudWatch & PagerDuty templates to quickly solve your most common use cases.
AWS CloudWatch & PagerDuty Templates
Find pre-built AWS CloudWatch & PagerDuty solutions for common use cases
Template
CloudWatch Alarm → PagerDuty Incident (Real-Time)
Automatically creates a PagerDuty incident whenever a CloudWatch alarm transitions to ALARM state, populating it with the alarm name, affected AWS resource ARN, breached metric value, and a direct link to the CloudWatch console. Resolves the incident automatically when the alarm returns to OK.
Steps:
- Listen for CloudWatch alarm state change events via EventBridge or SNS webhook trigger
- Evaluate alarm state — branch for ALARM, OK, and INSUFFICIENT_DATA transitions
- Create or resolve the corresponding PagerDuty incident with enriched alarm metadata
Connectors Used: AWS CloudWatch, PagerDuty
Template
Severity-Tiered CloudWatch to PagerDuty Routing
Evaluates incoming CloudWatch alarms against a configurable severity matrix and routes them to the appropriate PagerDuty service and urgency level. Critical production alarms trigger high-urgency incidents with immediate escalation, while non-critical alarms create low-urgency incidents without waking on-call staff.
Steps:
- Receive CloudWatch alarm payload and extract namespace, metric name, and alarm description
- Apply conditional logic to classify severity and map to target PagerDuty service and urgency
- Create the PagerDuty incident with severity-appropriate settings and resource context
Connectors Used: AWS CloudWatch, PagerDuty
Template
CloudWatch Log Metric Filter Breach to PagerDuty Alert
Monitors CloudWatch log metric filters for application-level anomalies such as error rate spikes or failed authentication events. When a log-based metric breaches a defined threshold, tray.ai triggers a PagerDuty incident with log query context, so application teams can investigate faster.
Steps:
- Trigger workflow from CloudWatch alarm backed by a log metric filter
- Run a CloudWatch Log Insights query to fetch recent matching log entries
- Create PagerDuty incident with alarm details and attach sample log output as incident notes
Connectors Used: AWS CloudWatch, PagerDuty
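Step two's Logs Insights query could be built along these lines, assuming access logs with a parsed `status` field; adjust the field names to your log schema.

```python
def insights_query(status_prefix: str = "5", limit: int = 20) -> str:
    """Build a CloudWatch Logs Insights query fetching recent log entries
    whose status code starts with the given prefix (e.g. '5' for 5xx)."""
    return (
        "fields @timestamp, @message "
        f"| filter status like /^{status_prefix}/ "
        "| sort @timestamp desc "
        f"| limit {limit}"
    )

query = insights_query()
print(query)
```

In a live workflow this string would be passed to the Logs Insights query API along with the log group and the incident time window.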
Template
Multi-Region Alarm Deduplication and PagerDuty Consolidation
Aggregates CloudWatch alarm events from multiple AWS regions, detects correlated alarms representing the same underlying issue, and creates a single consolidated PagerDuty incident rather than flooding on-call engineers with duplicate notifications. Subsequent correlated alarms are appended as notes on the existing incident.
Steps:
- Collect CloudWatch alarm events from multiple regions via a unified SNS fan-in topic
- Check for existing open PagerDuty incidents with matching deduplication key
- Create a new consolidated incident if none exists, or append alarm details to the existing incident as a note
Connectors Used: AWS CloudWatch, PagerDuty
Template
Post-Incident CloudWatch Metric Report Attachment
When a PagerDuty incident is resolved, automatically queries CloudWatch for metric statistics covering the incident window and attaches a formatted summary — including peak values, anomaly timestamps, and affected resource identifiers — directly to the PagerDuty incident as a post-mortem data artifact.
Steps:
- Trigger on PagerDuty incident resolved webhook event
- Parse incident timestamps and extract the affected CloudWatch resource and metric from incident details
- Query CloudWatch GetMetricStatistics for the incident window and post a formatted summary as a PagerDuty incident note
Connectors Used: AWS CloudWatch, PagerDuty
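The GetMetricStatistics call in step three might be parameterized like this; the ten-minute padding and the choice of statistics are assumptions, and in a real workflow the dict would be passed to boto3's `get_metric_statistics`.

```python
from datetime import datetime, timedelta, timezone

def metric_window_params(namespace, metric, dimensions, started, resolved,
                         pad_minutes=10):
    """Build GetMetricStatistics parameters covering the incident window,
    padded on both sides for context."""
    return {
        "Namespace": namespace,
        "MetricName": metric,
        "Dimensions": dimensions,
        "StartTime": started - timedelta(minutes=pad_minutes),
        "EndTime": resolved + timedelta(minutes=pad_minutes),
        "Period": 60,                     # one datapoint per minute
        "Statistics": ["Average", "Maximum"],
    }

params = metric_window_params(
    "AWS/EC2", "CPUUtilization",
    [{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],
    datetime(2024, 5, 1, 12, 0, tzinfo=timezone.utc),   # incident opened
    datetime(2024, 5, 1, 12, 45, tzinfo=timezone.utc),  # incident resolved
)
print(params["StartTime"], params["EndTime"])
```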
Template
Daily CloudWatch Anomaly Digest to PagerDuty Open Incidents
Runs on a daily schedule to query CloudWatch Anomaly Detector findings and AWS Health events, then creates low-urgency PagerDuty incidents for any new anomalies found. Teams get visibility into gradual degradation without relying solely on threshold-based alarms.
Steps:
- Schedule workflow to run daily and call CloudWatch DescribeAnomalyDetectors and list recent anomalies
- Filter out anomalies that were already reported or are currently covered by an active alarm
- Create low-urgency PagerDuty incidents for each new anomaly with trend context and recommended investigation steps
Connectors Used: AWS CloudWatch, PagerDuty