Grafana + mttrly · Alert response · Approval-gated remediation

Fix incidents after Grafana alerts

Keep Grafana and Prometheus for visibility. Add mttrly when the next step is diagnosis, approval, remediation, and audit history.

Direct answer

Grafana and Prometheus are strong at metrics, dashboards, time-series visibility, alerts, and historical context. After a Grafana alert fires, mttrly adds the incident response action layer: an operator or AI assistant can inspect the affected server with scoped MCP tools, run diagnostics, and choose approval-gated playbooks. A human responder can approve from Telegram or the dashboard, verify the result, and keep an audit log. mttrly complements Grafana; it does not replace it.

For the general concept, read the mttrly incident response action layer overview.

What Grafana is great at

Grafana and Prometheus are excellent parts of an observability stack. They help teams see signals before, during, and after an incident. mttrly should sit next to that stack, not in place of it.

Metrics dashboards that help teams see current service and infrastructure state.

Prometheus-style time series visibility for resource pressure, saturation, error rates, and trends.

Alerting when thresholds, anomalies, or service indicators point to a problem.

Historical context that helps responders compare the current incident with earlier behavior.

What happens after the alert

The hard part often begins after the alert fires. A dashboard can tell the responder that something is unhealthy, but the response still needs investigation, a bounded fix path, approval, and verification.

The alert says CPU, disk, memory, latency, or error rate is bad, but the responder still needs current server status.

The operator needs logs and service reality before deciding whether the issue is capacity, a broken deploy, a noisy process, or a failing dependency.

A fix needs boundaries: read-only checks first, prepared remediation where possible, and human approval before risky state changes.

The team needs verification and an audit trail after the response, not just a notification that something was wrong.

How mttrly completes the loop

mttrly gives the operator or AI assistant a controlled response layer after a Grafana alert. It favors scoped diagnostics and prepared playbooks, and it keeps human approval in front of risky actions.

Outbound server agent

The mttrly agent runs on the server and connects outbound, giving the response layer a controlled path to inspect and act without exposing a broad inbound management surface.

MCP for Claude Code, Cursor, and Codex

AI IDEs can use mttrly MCP tools to list servers, read status, run diagnostics, list playbooks, create pending actions, and review audit history.
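As a rough illustration of what an MCP client sends when an assistant invokes one of these tools, the sketch below builds a JSON-RPC 2.0 `tools/call` request, which is the standard MCP method for calling a tool. The tool name `mttrly_get_server_status` comes from this page; the `server` argument name is an assumption for illustration, not a documented mttrly parameter.

```python
import json
from itertools import count

# Monotonic request ids, as JSON-RPC expects each request to carry its own id.
_request_ids = count(1)


def tools_call_request(tool_name: str, arguments: dict) -> dict:
    """Build an MCP `tools/call` request as a JSON-RPC 2.0 message."""
    return {
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }


# "server" is a hypothetical argument name; check the tool's published
# input schema for the real parameters.
request = tools_call_request("mttrly_get_server_status", {"server": "prod-web-01"})
print(json.dumps(request, indent=2))
```

In practice an AI IDE's MCP client builds and sends this message itself; the sketch only shows the shape of the call.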

Diagnostics before remediation

The assistant or operator can inspect health, alerts, logs, and targeted diagnostics before choosing a remediation path.

Approval-gated playbooks

Prepared playbooks handle common operational fixes. Risky playbooks create pending actions instead of executing silently.
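The approval gate described above can be pictured as a small state machine. The following is a minimal sketch, not mttrly's implementation: the `Playbook`, `Action`, and `Status` types, the `risky` flag, and the playbook names are all hypothetical, and the sketch assumes that non-risky playbooks run without a pending step.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    PENDING = "pending"
    REJECTED = "rejected"
    EXECUTED = "executed"


@dataclass
class Playbook:
    name: str
    risky: bool  # hypothetical flag: True when the playbook changes server state


@dataclass
class Action:
    playbook: Playbook
    server: str
    status: Status


def request_playbook(playbook: Playbook, server: str) -> Action:
    """Risky playbooks become pending actions; others run right away (assumption)."""
    status = Status.PENDING if playbook.risky else Status.EXECUTED
    return Action(playbook, server, status)


def decide(action: Action, approved: bool) -> Action:
    """Apply a human decision; only pending actions can be decided."""
    if action.status is not Status.PENDING:
        raise ValueError("only pending actions can be approved or rejected")
    action.status = Status.EXECUTED if approved else Status.REJECTED
    return action


cleanup = request_playbook(Playbook("disk-cleanup", risky=True), "prod-web-01")
print(cleanup.status)                         # Status.PENDING
print(decide(cleanup, approved=True).status)  # Status.EXECUTED
```

The point of the design is that nothing moves from pending to executed without an explicit human decision, and a decision cannot be applied twice.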

Telegram and mobile approvals

Human review can happen from Telegram, the dashboard, or an IDE confirmation flow, so actions reviewed from a phone go through the same approval gate.

Audit log

mttrly records requested actions, approvals, execution results, diagnostics, and follow-up context for incident review.

Disk alert from Grafana -> mttrly cleanup playbook

This workflow keeps Grafana as the alerting layer and mttrly as the approval-gated action layer after the responder decides to investigate.

1. Grafana/Prometheus alert fires
   Disk is 90% full on prod-web-01.

2. AI or operator asks mttrly to investigate
   The alert is the signal; mttrly is used after the alert to inspect the affected server.

3. mttrly_get_server_status
   Read current CPU, RAM, disk, active alerts, and server health.

4. mttrly_run_diagnostic
   Run a focused diagnostic for disk pressure, large logs, containers, or service symptoms.

5. mttrly_list_playbooks
   Find bounded cleanup or remediation playbooks before considering broader commands.

6. mttrly_run_playbook
   Create a pending cleanup or remediation action when the playbook changes server state.

7. Human approval
   A responder approves or rejects from Telegram, the dashboard, or the IDE confirmation flow.

8. Verification and audit
   mttrly verifies the result where possible and records the action in the audit log.

9. Grafana confirms recovery
   The team watches metrics return toward normal in Grafana or the existing observability stack.
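The ordering of the tool calls in the workflow above can be sketched as a short driver function. This is an illustrative sketch under stated assumptions: the four tool names come from this page, but the argument names (`server`, `diagnostic`, `playbook`), the response shapes, and the `fake_call_tool` stand-in are hypothetical. A real client would route `call_tool` through its MCP connection.

```python
from typing import Any, Callable

ToolCall = Callable[[str, dict], Any]


def respond_to_disk_alert(
    call_tool: ToolCall, server: str, approve: Callable[[Any], bool]
) -> list:
    """Run read-only steps first, then request a playbook and wait for a human decision."""
    trail = []

    def step(tool: str, args: dict) -> Any:
        trail.append(tool)  # keep a local record of what was called, in order
        return call_tool(tool, args)

    step("mttrly_get_server_status", {"server": server})
    step("mttrly_run_diagnostic", {"server": server, "diagnostic": "disk"})
    playbooks = step("mttrly_list_playbooks", {"server": server})
    pending = step("mttrly_run_playbook", {"server": server, "playbook": playbooks[0]})
    trail.append("approved" if approve(pending) else "rejected")
    return trail


def fake_call_tool(tool: str, args: dict) -> Any:
    """Hypothetical stand-in for an MCP client; returns canned responses."""
    if tool == "mttrly_list_playbooks":
        return ["disk-cleanup"]
    if tool == "mttrly_run_playbook":
        return {"status": "pending", "playbook": args["playbook"]}
    return {"ok": True}


trail = respond_to_disk_alert(fake_call_tool, "prod-web-01", approve=lambda pending: True)
print(trail)
```

Passing `call_tool` and `approve` in as parameters keeps the sequence testable without a live server and makes the human decision an explicit input rather than a default.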

Grafana / Prometheus vs mttrly

| Aspect | Grafana / Prometheus | mttrly |
| --- | --- | --- |
| Primary job | Observe metrics, visualize dashboards, and alert when signals indicate a problem. | Diagnose and remediate after an alert, with scoped tools and approval gates. |
| Best at | Time series visibility, dashboards, alert context, and historical comparison. | Server diagnostics, playbooks, approvals, mobile response, and audit trails. |
| Typical user question | "What changed, which metric is unhealthy, and when did it start?" | "What should we safely inspect or fix on this server next?" |
| Data source | Metrics, dashboards, Prometheus-style time series, and alert history. | Connected servers, health state, logs, diagnostics, playbook catalog, pending actions, and audit history. |
| Action model | Read, visualize, alert, route, and provide incident context. | Run diagnostics, request approval-gated remediation, execute approved playbooks, and verify results. |
| Mobile response | Alert notification and dashboard review through the team response process. | Telegram or dashboard approval for pending actions when configured. |
| Audit / approval | Alert history and observability context. | Explicit human approval for risky actions plus an action audit log. |

Recommended stack

The recommended stack is intentionally boring: keep the monitoring system that tells you what changed, then add an action layer that keeps the response safe and under human control.

Teams formalizing on-call response can also review the on-call use case and the mttrly plans.

When Grafana alone is enough

You may not need mttrly for every Grafana setup. Grafana alone can be the right answer when the job stops at visibility, or when your response process is already complete without another action layer.

  • You only need dashboards, metrics, and alert visibility.
  • Your team already has mature runbooks, escalation, approval, and on-call response processes.
  • You do not need remote or mobile remediation after alerts.
  • You do not want any server-side action agent installed on production servers.

FAQ

Does mttrly replace Grafana?

No. mttrly does not replace Grafana or Prometheus. Use Grafana for metrics, dashboards, alerting, and historical context. Use mttrly after the alert when an operator or AI assistant needs scoped diagnostics, approval-gated remediation, and an audit trail.

Can mttrly read Grafana dashboards?

This page does not assume a direct Grafana dashboard integration. The safe pattern is to keep Grafana as the metrics and dashboard source, then use mttrly after the alert to inspect the affected server, run diagnostics, request playbooks, and record the response.

What does mttrly do after a Grafana alert?

After a Grafana alert indicates a problem, mttrly can help inspect server status, run targeted diagnostics, list remediation playbooks, create pending approval-gated actions, execute approved playbooks, verify the result, and store the audit log.

Can Claude Code use mttrly after a Grafana alert?

Yes. Claude Code, Cursor, and Codex can use mttrly through MCP after a Grafana alert. The assistant can investigate with scoped tools and request remediation, but risky actions require explicit human approval before execution.

Can fixes be approved from Telegram?

Yes. When Telegram is configured, pending mttrly actions can be reviewed and approved or rejected from a phone. The same approval-gated model can also work through the dashboard or an IDE confirmation flow.

Is this safer than giving an AI SSH after an alert?

For common incident response, scoped mttrly tools are safer than handing an AI a broad SSH-like terminal by default. mttrly separates read-only diagnostics from risky actions, routes state changes through human approval, and records the result in an audit log.
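One way to picture the difference from a broad terminal is a tool-level policy: a fixed set of named tools, each classified ahead of time, with everything else denied. The sketch below is an assumption-laden illustration, not mttrly's policy engine. The grouping of tools into read-only and approval-gated sets is inferred from the descriptions on this page, and `ssh_exec` is a made-up name standing in for any unlisted capability.

```python
# Classification inferred from this page's tool descriptions (assumption).
READ_ONLY_TOOLS = {
    "mttrly_get_server_status",
    "mttrly_run_diagnostic",
    "mttrly_list_playbooks",
}
APPROVAL_GATED_TOOLS = {"mttrly_run_playbook"}


def classify(tool_name: str) -> str:
    """Return the policy decision for a requested tool."""
    if tool_name in READ_ONLY_TOOLS:
        return "allow"
    if tool_name in APPROVAL_GATED_TOOLS:
        return "needs_approval"
    return "deny"  # anything not explicitly listed is out of scope


for tool in ("mttrly_get_server_status", "mttrly_run_playbook", "ssh_exec"):
    print(tool, "->", classify(tool))
```

A shell has no equivalent of the final `deny` branch: any command the account can run is available, which is why a scoped tool list is easier to review and audit.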

What if the mttrly agent is offline?

If the mttrly agent is offline, mttrly cannot run live diagnostics or remediation on that server. Use Grafana for historical context and restore server or agent connectivity through your provider console, SSH, or recovery process before relying on mttrly actions.

Add an action layer after Grafana alerts

Start with read-only investigation, then add approval-gated playbooks when your team is ready to remediate incidents from an AI IDE or phone.