Grafana + PagerDuty

Connect Grafana and PagerDuty to Automate Incident Response and Alert Management

Bridge your observability data and on-call workflows to resolve incidents faster and cut alert fatigue.

Why integrate Grafana and PagerDuty?

Grafana is the go-to platform for visualizing and analyzing metrics, logs, and traces across your infrastructure stack. PagerDuty handles intelligent incident management, routing critical alerts to the right on-call engineers at the right time. Connecting Grafana with PagerDuty creates a direct pipeline from metric anomaly detection to structured incident response, so no critical threshold breach gets ignored.

Automate & integrate Grafana & PagerDuty

Use case

Automatic Incident Creation from Grafana Alerts

When a Grafana alert fires — say, a 5xx error rate exceeding 2% or p99 latency spiking past its SLA target — tray.ai opens a new PagerDuty incident with the full alert payload, dashboard link, and affected service metadata attached. This closes the gap between detection and escalation that costs teams critical minutes during outages. On-call engineers get a rich, actionable notification rather than a raw metric dump.
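
Under the hood, this workflow is a translation step: the Grafana webhook payload becomes a PagerDuty Events API v2 trigger. Here is a minimal Python sketch of that step, assuming the unified-alerting webhook format, severity labels that already match Events API values, and a placeholder integration key:

```python
# Minimal sketch: turn a Grafana unified-alerting webhook into a PagerDuty
# Events API v2 trigger. ROUTING_KEY is a placeholder for your service's
# integration key; severity labels are assumed to use Events-API-compatible
# values (critical / error / warning / info).
import requests

PAGERDUTY_EVENTS_URL = "https://events.pagerduty.com/v2/enqueue"
ROUTING_KEY = "YOUR_PD_INTEGRATION_KEY"  # placeholder

def trigger_incidents(grafana_webhook: dict) -> None:
    for alert in grafana_webhook.get("alerts", []):
        if alert.get("status") != "firing":
            continue
        event = {
            "routing_key": ROUTING_KEY,
            "event_action": "trigger",
            # Reusing Grafana's fingerprint lets PagerDuty deduplicate
            # repeat notifications for the same alert instance.
            "dedup_key": alert.get("fingerprint"),
            "payload": {
                "summary": alert.get("labels", {}).get("alertname", "Grafana alert"),
                "source": alert.get("labels", {}).get("instance", "grafana"),
                "severity": alert.get("labels", {}).get("severity", "critical"),
                "custom_details": {
                    "labels": alert.get("labels", {}),
                    "annotations": alert.get("annotations", {}),
                },
            },
            # Deep-link back to the originating panel, if Grafana sent one.
            "links": [{"href": alert.get("panelURL") or alert.get("generatorURL", ""),
                       "text": "View in Grafana"}],
        }
        requests.post(PAGERDUTY_EVENTS_URL, json=event, timeout=10).raise_for_status()
```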

Use case

Auto-Resolve PagerDuty Incidents When Grafana Alerts Recover

When a Grafana alert transitions from firing to resolved, tray.ai automatically resolves the corresponding PagerDuty incident, preventing stale incidents from cluttering your queue and confusing on-call responders. This automated status sync keeps both platforms aligned throughout the full alert lifecycle. Teams spend less time manually closing incidents and more time confirming that systems are genuinely stable.
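
Resolution is the mirror image of incident creation: a resolve event sent with the same dedup_key that opened the incident. A minimal sketch, under the same assumptions as above:

```python
# Minimal sketch: when Grafana reports an alert as resolved, send a matching
# "resolve" event. The dedup_key must match the one used at trigger time.
import requests

def resolve_incidents(grafana_webhook: dict, routing_key: str) -> None:
    for alert in grafana_webhook.get("alerts", []):
        if alert.get("status") != "resolved":
            continue
        requests.post(
            "https://events.pagerduty.com/v2/enqueue",
            json={
                "routing_key": routing_key,
                "event_action": "resolve",
                "dedup_key": alert.get("fingerprint"),
            },
            timeout=10,
        ).raise_for_status()
```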

Use case

Escalate High-Severity Grafana Alerts to Specific PagerDuty Services

Not all alerts need the same urgency or team. With tray.ai, you can route Grafana alerts to different PagerDuty services and escalation policies based on alert labels, severity tags, or the originating data source — database alerts go to the DBA on-call team, Kubernetes alerts go to the platform engineering squad, and application errors notify the backend development team. Each team gets only the incidents relevant to their domain, not everything.
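The routing step itself can be as simple as a lookup table keyed on alert labels. An illustrative sketch — the team names, label keys, and integration keys here are hypothetical:

```python
# Illustrative routing table: label values -> PagerDuty integration keys.
# Every name and key below is a placeholder for your own services.
ROUTING_TABLE = {
    "database":    "PD_KEY_DBA_ONCALL",
    "kubernetes":  "PD_KEY_PLATFORM_ENG",
    "application": "PD_KEY_BACKEND_DEV",
}
DEFAULT_KEY = "PD_KEY_CATCH_ALL"

def routing_key_for(alert: dict) -> str:
    # Route on a "team" or "component" label; fall back to a catch-all service
    # so an unlabeled alert still pages someone.
    labels = alert.get("labels", {})
    domain = labels.get("team") or labels.get("component", "")
    return ROUTING_TABLE.get(domain, DEFAULT_KEY)
```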

Use case

Enrich PagerDuty Incidents with Grafana Dashboard Context

A bare alert notification rarely gives an on-call engineer enough to act on immediately. Using tray.ai, when a PagerDuty incident is created, the workflow can simultaneously query Grafana for a rendered dashboard snapshot or a direct deep-link to the relevant panel, then append it to the incident as a note or custom field. Engineers open their PagerDuty mobile notification and immediately see the metric trend that caused the incident — no dashboard hunting during a high-pressure outage.
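
One way to attach that context is a note on the incident via the PagerDuty REST API. A minimal sketch with placeholder credentials — note that the REST API requires a From header identifying the requester:

```python
# Minimal sketch: append a Grafana deep-link to a PagerDuty incident as a
# note. API_TOKEN and FROM_EMAIL are placeholders; the REST API requires a
# "From" header with the email of a valid PagerDuty user.
import requests

API_TOKEN = "YOUR_PD_REST_API_TOKEN"   # placeholder
FROM_EMAIL = "automation@example.com"  # placeholder

def attach_dashboard_link(incident_id: str, panel_url: str) -> None:
    requests.post(
        f"https://api.pagerduty.com/incidents/{incident_id}/notes",
        headers={
            "Authorization": f"Token token={API_TOKEN}",
            "From": FROM_EMAIL,
            "Content-Type": "application/json",
        },
        json={"note": {"content": f"Grafana panel at time of alert: {panel_url}"}},
        timeout=10,
    ).raise_for_status()
```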

Use case

Sync PagerDuty Incident Acknowledgments Back to Grafana Annotations

When an on-call engineer acknowledges or resolves a PagerDuty incident, tray.ai writes a Grafana annotation onto the relevant dashboard panel, marking exactly when the incident was acknowledged and when it was resolved. This puts a visible, time-stamped overlay on your metric graphs that ties human response actions to system behavior. Over time, these annotations build a historical record of operational events directly inside your observability layer.
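
A minimal sketch of the annotation write, assuming a recent Grafana version (older releases target panels by dashboardId rather than dashboardUID) and a placeholder service-account token:

```python
# Minimal sketch: write a time-stamped annotation to a Grafana panel when a
# PagerDuty incident is acknowledged. URL and token are placeholders; in a
# real workflow you would use the event's occurred_at timestamp rather than
# the current time.
import time
import requests

GRAFANA_URL = "https://grafana.example.com"              # placeholder
GRAFANA_TOKEN = "YOUR_GRAFANA_SERVICE_ACCOUNT_TOKEN"     # placeholder

def annotate(dashboard_uid: str, panel_id: int, text: str) -> None:
    requests.post(
        f"{GRAFANA_URL}/api/annotations",
        headers={"Authorization": f"Bearer {GRAFANA_TOKEN}"},
        json={
            "dashboardUID": dashboard_uid,
            "panelId": panel_id,
            "time": int(time.time() * 1000),  # Grafana expects epoch milliseconds
            "tags": ["pagerduty", "acknowledged"],
            "text": text,
        },
        timeout=10,
    ).raise_for_status()
```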

Use case

Suppress PagerDuty Alerts During Grafana-Scheduled Maintenance Windows

Planned maintenance, deployments, or load tests shouldn't flood your PagerDuty queue with spurious incidents. With tray.ai, when a Grafana alert silence is created, an automated workflow simultaneously sets a PagerDuty maintenance window for the affected services, preventing unnecessary pages to on-call engineers. When the Grafana silence expires, the PagerDuty maintenance window lifts automatically. Both systems stay consistent without anyone needing to update them separately.
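
A minimal sketch of the two halves, assuming Grafana unified alerting (which exposes silences through its built-in Alertmanager API) and placeholder PagerDuty credentials and service IDs:

```python
# Minimal sketch: mirror an active Grafana silence as a PagerDuty maintenance
# window. All URLs, tokens, and IDs are placeholders; PagerDuty requires a
# "From" header with a valid user's email when creating maintenance windows.
import requests

GRAFANA_URL = "https://grafana.example.com"  # placeholder

def active_silences(grafana_token: str) -> list:
    # Unified alerting exposes silences through the built-in Alertmanager API.
    resp = requests.get(
        f"{GRAFANA_URL}/api/alertmanager/grafana/api/v2/silences",
        headers={"Authorization": f"Bearer {grafana_token}"},
        timeout=10,
    )
    resp.raise_for_status()
    return [s for s in resp.json() if s["status"]["state"] == "active"]

def create_maintenance_window(silence: dict, pd_token: str, service_id: str) -> None:
    requests.post(
        "https://api.pagerduty.com/maintenance_windows",
        headers={"Authorization": f"Token token={pd_token}",
                 "From": "automation@example.com"},  # placeholder requester
        json={"maintenance_window": {
            "type": "maintenance_window",
            "start_time": silence["startsAt"],  # both APIs use ISO 8601 timestamps
            "end_time": silence["endsAt"],
            "description": f"Mirrors Grafana silence {silence['id']}",
            "services": [{"id": service_id, "type": "service_reference"}],
        }},
        timeout=10,
    ).raise_for_status()
```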

Use case

Generate Weekly Incident and Alert Trend Reports

Combining Grafana's metric history with PagerDuty's incident data gives operations leaders a complete picture of system reliability. tray.ai can run a scheduled workflow that queries PagerDuty for weekly incident counts, MTTR, and responder data, then pushes summary statistics as annotations or data points into a Grafana dashboard or a connected reporting tool. Teams get visibility into alert volume trends, recurring failure patterns, and the operational cost of reliability issues — without manual data extraction.
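
A minimal sketch of the reporting query, approximating time-to-resolve as the gap between incident creation and the last status change; pagination and timezone edge cases are trimmed for brevity:

```python
# Minimal sketch: pull last week's resolved incidents from the PagerDuty REST
# API and compute a simple MTTR. Token is a placeholder; results beyond the
# first page (limit=100) are not fetched here.
from datetime import datetime, timedelta, timezone
import requests

def weekly_summary(pd_token: str) -> dict:
    until = datetime.now(timezone.utc)
    since = until - timedelta(days=7)
    resp = requests.get(
        "https://api.pagerduty.com/incidents",
        headers={"Authorization": f"Token token={pd_token}"},
        params={"since": since.isoformat(), "until": until.isoformat(),
                "statuses[]": "resolved", "limit": 100},
        timeout=10,
    )
    resp.raise_for_status()
    incidents = resp.json()["incidents"]

    def duration(inc: dict) -> float:
        # For resolved incidents, the last status change approximates resolution.
        created = datetime.fromisoformat(inc["created_at"].replace("Z", "+00:00"))
        resolved = datetime.fromisoformat(inc["last_status_change_at"].replace("Z", "+00:00"))
        return (resolved - created).total_seconds()

    mttr = sum(map(duration, incidents)) / len(incidents) if incidents else 0.0
    return {"incident_count": len(incidents), "mttr_seconds": round(mttr)}
```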

Get started with Grafana & PagerDuty integration today

Grafana & PagerDuty Challenges

What challenges arise when working with Grafana & PagerDuty, and how does Tray.ai help?

Challenge

Alert Payload Structure Inconsistency Across Grafana Versions

Grafana's alerting system changed substantially with the unified alerting engine introduced in Grafana 8, and legacy and unified alerting emit very different webhook payload formats. Teams running different Grafana versions, or migrating from legacy to unified alerting, run into broken integrations when payload field names and structures change unexpectedly, causing missed PagerDuty incidents or malformed alert data.

How Tray.ai Can Help:

tray.ai's visual workflow builder lets teams build conditional data transformation logic that detects the incoming payload format and normalizes it to a consistent structure before passing it to PagerDuty. Field mapping and JSONPath expressions can be updated in the tray.ai interface without redeploying code, so adapting to a Grafana version change takes minutes rather than a sprint.
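
A minimal sketch of that normalization step — the unified and legacy field names follow Grafana's documented payload formats, while the normalized shape itself is purely illustrative:

```python
# Minimal sketch: detect whether a webhook came from legacy or unified
# alerting and map both onto one internal shape before calling PagerDuty.
def normalize(payload: dict) -> list[dict]:
    if "alerts" in payload:
        # Unified alerting (Grafana 8+) groups one or more alerts per webhook.
        return [{
            "name": a.get("labels", {}).get("alertname", ""),
            "state": a.get("status", ""),  # "firing" / "resolved"
            "details": a.get("labels", {}),
            "url": a.get("generatorURL", ""),
        } for a in payload["alerts"]]
    # Legacy alerting sends one flat object per notification.
    return [{
        "name": payload.get("ruleName", ""),
        # Legacy states are "alerting", "ok", "no_data"; collapsed to two here.
        "state": "firing" if payload.get("state") == "alerting" else "resolved",
        "details": {m["metric"]: m["value"] for m in payload.get("evalMatches", [])},
        "url": payload.get("ruleUrl", ""),
    }]
```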

Challenge

Deduplicating Alerts to Prevent PagerDuty Incident Storms

When a single infrastructure failure triggers multiple correlated Grafana alerts — a database outage cascading into application errors, latency spikes, and health check failures — each alert can independently create a separate PagerDuty incident, overwhelming on-call engineers with duplicate pages for what is effectively one root cause. Alert storms like this erode trust in the monitoring system and slow incident response.

How Tray.ai Can Help:

tray.ai workflows handle deduplication by using the Grafana alert fingerprint or a shared label value as a PagerDuty dedup_key when calling the Events API, so multiple correlated alerts collapse into a single PagerDuty incident. tray.ai's built-in data store tracks active fingerprints so the workflow updates an existing incident rather than creating a new one.
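
A minimal sketch of deriving a shared dedup_key; the grouping labels used here are assumptions, standing in for whatever label identifies the shared root cause in your environment:

```python
# Minimal sketch: derive one shared dedup_key for correlated alerts so a
# cascade collapses into a single PagerDuty incident. The "service" and "env"
# labels are assumed grouping keys, not a requirement.
import hashlib

def storm_dedup_key(alert: dict) -> str:
    labels = alert.get("labels", {})
    # All alerts carrying the same service + environment share one incident.
    basis = f'{labels.get("service", "unknown")}:{labels.get("env", "prod")}'
    return hashlib.sha256(basis.encode()).hexdigest()[:32]
```

Because PagerDuty deduplicates on this key server-side — a further trigger event carrying the same key updates the open incident rather than opening a new one — the tray.ai data store only needs to remember which key maps to which incident for later lifecycle actions.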

Challenge

Maintaining Bidirectional Lifecycle Sync Without Duplicate Actions

Keeping Grafana alert states and PagerDuty incident statuses synchronized in both directions is genuinely tricky. Resolving an incident in PagerDuty shouldn't re-trigger a Grafana alert, and a Grafana recovery event shouldn't close an incident that was manually escalated by an engineer for further investigation. Without careful state management, bidirectional workflows can enter feedback loops or overwrite deliberate human actions.

How Tray.ai Can Help:

tray.ai's workflow logic supports conditional branching and state checks before taking any action. Workflows can query the current PagerDuty incident status before resolving it, skipping resolution if the incident has been manually escalated or moved to a different status. tray.ai's data store provides lightweight state persistence to track which actions were system-initiated versus human-initiated.
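
A minimal sketch of that guard, with a placeholder token; treating any non-triggered status as "hands off" is a policy choice for this sketch, not a PagerDuty requirement:

```python
# Minimal sketch: check incident state before auto-resolving, so a recovery
# event never overrides a human who has already acknowledged or escalated.
import requests

def safe_auto_resolve(incident_id: str, pd_token: str, resolve_fn) -> bool:
    resp = requests.get(
        f"https://api.pagerduty.com/incidents/{incident_id}",
        headers={"Authorization": f"Token token={pd_token}"},
        timeout=10,
    )
    resp.raise_for_status()
    status = resp.json()["incident"]["status"]  # triggered / acknowledged / resolved
    if status != "triggered":
        return False  # a human has acted (or it is already closed): leave it alone
    resolve_fn(incident_id)  # caller supplies the actual resolve action
    return True
```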

Challenge

Mapping Grafana Alert Severity to PagerDuty Escalation Policies at Scale

Large organizations can have dozens of Grafana alert rules with varying severity labels and hundreds of PagerDuty services with distinct escalation policies. Manually maintaining a mapping between these two systems is error-prone and time-consuming, and a misconfigured routing rule can mean a critical production alert goes to the wrong team or gets assigned the wrong urgency level.

How Tray.ai Can Help:

tray.ai lets teams define routing logic in a centralized workflow using lookup tables, conditional branches, or even a connected Google Sheet or Airtable as a dynamic routing configuration source. When routing rules change, operations teams update the configuration source rather than rebuilding the workflow, making large-scale routing management both scalable and auditable.
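
A minimal sketch of a file-backed routing map; the CSV columns are an assumed layout standing in for whatever sheet or table your team maintains:

```python
# Minimal sketch: load the team/severity -> PagerDuty routing map from an
# external CSV so operations teams edit configuration, not workflow code.
# Assumed file layout: team,severity,integration_key (one rule per row).
import csv

def load_routing_map(path: str) -> dict:
    routing = {}
    with open(path, newline="") as fh:
        for row in csv.DictReader(fh):
            routing[(row["team"], row["severity"])] = row["integration_key"]
    return routing

def route(routing: dict, team: str, severity: str, default_key: str) -> str:
    # Exact match first, then an any-severity ("*") rule for the team,
    # then the catch-all default.
    return (routing.get((team, severity))
            or routing.get((team, "*"))
            or default_key)
```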

Challenge

Handling Grafana API Authentication and Rate Limits in High-Volume Environments

In high-traffic production environments, Grafana webhooks may fire hundreds of alerts per hour during degraded conditions. Programmatically fetching dashboard snapshots or annotation data for each incident can quickly exhaust Grafana API rate limits or run into authentication token expiry, leaving incomplete incident context in PagerDuty.

How Tray.ai Can Help:

tray.ai includes built-in retry logic, configurable request throttling, and error handling steps that gracefully manage API rate limit responses with exponential backoff. Authentication credentials for both Grafana and PagerDuty are stored securely in tray.ai's credential manager and can be rotated without modifying workflow logic, so high-volume alert scenarios don't result in data loss or broken integrations.
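
A minimal sketch of retry-with-backoff around a rate-limited call; both APIs signal throttling with HTTP 429, and honoring a Retry-After header when present is good practice:

```python
# Minimal sketch: retry a rate-limited request with exponential backoff.
# Only numeric Retry-After values are handled here; HTTP-date values and
# jitter are omitted for brevity.
import time
import requests

def request_with_backoff(method: str, url: str, max_retries: int = 5, **kwargs):
    delay = 1.0
    for _ in range(max_retries):
        resp = requests.request(method, url, timeout=10, **kwargs)
        if resp.status_code != 429:
            resp.raise_for_status()
            return resp
        # Prefer the server's hint, otherwise back off exponentially.
        retry_after = float(resp.headers.get("Retry-After", delay))
        time.sleep(retry_after)
        delay *= 2
    raise RuntimeError(f"{url} still rate-limited after {max_retries} retries")
```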

Start using our pre-built Grafana & PagerDuty templates today

Start from scratch or use one of our pre-built Grafana & PagerDuty templates to quickly solve your most common use cases.

Grafana & PagerDuty Templates

Find pre-built Grafana & PagerDuty solutions for common use cases

Browse all templates

Template

Grafana Alert Firing → Create PagerDuty Incident

Monitors an incoming Grafana webhook for alert state changes and automatically creates a structured PagerDuty incident with severity mapping, affected service, alert labels, and a link back to the originating Grafana panel whenever an alert transitions to the Firing state.

Steps:

  • Receive Grafana alert webhook payload via tray.ai trigger
  • Parse alert state, severity label, and panel metadata from the payload
  • Map Grafana severity (critical/warning) to PagerDuty urgency (high/low)
  • Create a new PagerDuty incident via the Events API, attaching alert title, summary, and dashboard deep-link
  • Store the PagerDuty incident ID mapped to the Grafana alert fingerprint for lifecycle tracking

Connectors Used: Grafana, PagerDuty

Template

Grafana Alert Resolved → Auto-Resolve PagerDuty Incident

Listens for Grafana alert resolution events and automatically sends a resolve action to PagerDuty using the stored incident ID, closing the incident and appending a resolution note with the recovery timestamp and Grafana alert name.

Steps:

  • Receive Grafana webhook with alert state set to Resolved
  • Look up the previously stored PagerDuty incident ID using the Grafana alert fingerprint
  • Send a resolve event to PagerDuty Events API to close the incident
  • Append a note to the PagerDuty incident with the Grafana recovery time and metric values at resolution

Connectors Used: Grafana, PagerDuty

Template

Route Grafana Alerts to PagerDuty Services by Label

Inspects Grafana alert labels and annotations to dynamically route incoming alerts to the correct PagerDuty service and escalation policy, supporting multi-team on-call environments where different services own different infrastructure domains.

Steps:

  • Receive Grafana alert webhook and extract alert labels (team, environment, component)
  • Evaluate a routing table in tray.ai to match label values to PagerDuty service integration keys
  • Create the PagerDuty incident against the matched service with full alert context
  • If no matching service is found, fall back to a default catch-all PagerDuty service and notify a Slack channel

Connectors Used: Grafana, PagerDuty

Template

PagerDuty Incident Acknowledged → Write Grafana Annotation

Triggers when a PagerDuty incident is acknowledged or resolved and writes a corresponding time-stamped annotation to the relevant Grafana dashboard panel, creating a persistent operational record overlaid on metric visualizations.

Steps:

  • Receive PagerDuty webhook for incident.acknowledged or incident.resolved event
  • Extract incident title, responder name, and timestamp from the webhook payload
  • Identify the target Grafana dashboard UID and panel ID from incident custom fields
  • POST a new annotation to the Grafana Annotations API with event type, responder, and incident link

Connectors Used: PagerDuty, Grafana

Template

Sync Grafana Silence → PagerDuty Maintenance Window

Detects when a Grafana alert silence is created or updated and automatically creates a matching PagerDuty maintenance window for the specified services, then removes the window when the Grafana silence expires.

Steps:

  • Poll Grafana API on a schedule to detect newly created or updated silence objects
  • Extract silence duration, affected alert matchers, and creator metadata
  • Create a PagerDuty maintenance window for the matched services with the same start and end time
  • On silence expiry, confirm PagerDuty maintenance window is also ended or deleted

Connectors Used: Grafana, PagerDuty

Template

Weekly PagerDuty Incident Summary → Grafana Annotation Dashboard

Runs on a weekly schedule to pull incident counts, MTTR, and top alerting services from PagerDuty, then pushes summary annotations into a designated Grafana operations dashboard so teams can track reliability trends over time.

Steps:

  • Trigger workflow on a weekly schedule (e.g., every Monday at 08:00 UTC)
  • Query PagerDuty API for incidents in the past 7 days, grouped by service and severity
  • Calculate MTTR, total incident count, and top recurring alert names
  • POST weekly summary annotations to a Grafana dashboard, tagging each with the reporting week
  • Optionally send the summary report to a Slack channel or email distribution list

Connectors Used: PagerDuty, Grafana