SKILL·50492D

configure-alerting-rules

Name: configure-alerting-rules
Author: pjt222

Updated 1 month ago

9 views

Meta, ai, automation

About

This skill configures Prometheus Alertmanager to set up actionable incident alerts. It covers routing trees, receivers (such as Slack and PagerDuty), and features that reduce alert fatigue. Use it to implement proactive monitoring, integrate with on-call systems, or migrate to a Prometheus-based alerting stack.

Documentation

Configure Alerting Rules

Set up Prometheus alerting rules and Alertmanager for reliable, actionable incident notifications.

See Extended Examples for complete config files and templates.

When to Use

Implement proactive monitoring with automated incident detection
Route alerts to the correct teams by severity and service ownership
Reduce alert fatigue through intelligent grouping and deduplication
Integrate monitoring with on-call systems (PagerDuty, Opsgenie)
Establish escalation policies for critical production issues
Migrate legacy monitoring to Prometheus-based alerting
Create actionable alerts that guide responders to resolution

Inputs

Required: Prometheus metrics to alert on (error rates, latency, saturation)
Required: On-call rotation + escalation policies
Optional: Existing alert definitions to migrate
Optional: Notification channels (Slack, email, PagerDuty)
Optional: Runbook docs for common alerts

Procedure

Step 1: Deploy Alertmanager

Install + configure Alertmanager to receive alerts from Prometheus.

Docker Compose deployment (basic structure):

version: '3.8'
services:
  alertmanager:
    image: prom/alertmanager:v0.26.0
    ports:
      - "9093:9093"
    volumes:
      - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml
    # ... (see EXAMPLES.md for complete configuration)

Basic Alertmanager config (alertmanager.yml excerpt):

global:
  resolve_timeout: 5m
  slack_api_url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'

route:
  receiver: 'default-receiver'
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h

  routes:
    - match:
        severity: critical
      receiver: pagerduty-critical

# ... (see EXAMPLES.md for complete routing, inhibition rules, and receivers)
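The match key in the excerpt above still works in the v0.26.0 image used here, but Alertmanager has marked match and match_re as deprecated since 0.22 in favor of matchers. As a minimal sketch, assuming the same receiver names as the excerpt, the critical route could be written like this:

```yaml
route:
  receiver: 'default-receiver'
  routes:
    # matchers takes a list of PromQL-style label matchers (=, !=, =~, !~)
    - matchers:
        - severity="critical"
      receiver: pagerduty-critical
```

Both forms select the same alerts; the matchers form also allows negative matchers such as severity!="info".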

Configure Prometheus to use Alertmanager (prometheus.yml):

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093
      timeout: 10s
      api_version: v2

Expected: the Alertmanager UI is reachable at http://localhost:9093, and Prometheus lists the Alertmanager endpoint as active under its Status menu (or in the response of GET /api/v1/alertmanagers).

On failure:

Check Alertmanager logs: docker logs alertmanager
Verify Prometheus can reach Alertmanager: curl http://alertmanager:9093/api/v2/status
Test webhook URLs: curl -X POST <SLACK_WEBHOOK_URL> -d '{"text":"test"}'
Validate the configuration file: amtool check-config alertmanager.yml

Step 2: Define Alerting Rules in Prometheus

Create alerting rules that fire when their conditions are met.

Create alerting rules file (/etc/prometheus/rules/alerts.yml excerpt):

groups:
  - name: instance_alerts
    interval: 30s
    rules:
      - alert: InstanceDown
        expr: up == 0
        for: 5m
        labels:
          severity: critical
          team: infrastructure
        annotations:
          summary: "Instance {{ $labels.instance }} is down"
          description: "{{ $labels.instance }} has been down for >5min."
          runbook_url: "https://wiki.example.com/runbooks/instance-down"

      - alert: HighCPUUsage
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          # ... (see EXAMPLES.md for complete alerts)

Alert design best practices:

for duration: Prevents flapping alerts. Use 5-10 min for most alerts.
Descriptive annotations: Include current value, affected resource, runbook link.
Severity levels: critical (pages on-call), warning (investigate), info (FYI)
Team labels: Enable routing to correct team/channel
Runbook links: Every alert should have runbook URL
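The practices above can be sketched in a single rule. This is an illustrative variant of the HighCPUUsage rule from the excerpt, not the author's complete version: the group name, the team label value, and the runbook URL are assumptions. In Prometheus annotation templates, $value holds the result of the alert expression, which is one way to include the current value.

```yaml
groups:
  - name: example_annotated_alerts   # hypothetical group name
    rules:
      - alert: HighCPUUsage
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 10m                       # prevents flapping on short spikes
        labels:
          severity: warning            # investigate, does not page
          team: infrastructure         # assumed team label, used for routing
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          # $value is the evaluated expression result (CPU usage in percent)
          description: "CPU usage is {{ printf \"%.1f\" $value }}% (threshold 80%)."
          runbook_url: "https://wiki.example.com/runbooks/high-cpu"   # hypothetical URL
```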

Load rules into Prometheus:

# prometheus.yml
rule_files:
  - "rules/*.yml"

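Validate and reload (the reload endpoint only works when Prometheus runs with --web.enable-lifecycle; otherwise send SIGHUP to the process):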

promtool check rules /etc/prometheus/rules/alerts.yml
curl -X POST http://localhost:9090/-/reload

Expected: alerts are visible on the Prometheus "Alerts" page, fire when thresholds are exceeded, and reach Alertmanager once firing.

On failure:

Check Prometheus logs for rule evaluation errors
Validate rule syntax with promtool check rules
Test alert queries independently in the Prometheus UI
Inspect alert state transitions: Inactive → Pending → Firing
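State transitions can also be checked offline with promtool's rule unit tests. The following is a sketch of a test file for the InstanceDown rule, assuming the rules file contains the rule exactly as shown in the excerpt; the test file location, job name, and instance name are hypothetical. Keep the test file outside the rules directory so the rules/*.yml glob in prometheus.yml does not try to load it as a rule file.

```yaml
# /etc/prometheus/tests/alerts_test.yml (hypothetical location)
rule_files:
  - ../rules/alerts.yml          # path is relative to this test file

evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      # up stays at 0 for 11 samples (t=0m through t=10m)
      - series: 'up{job="node", instance="host1:9100"}'
        values: '0+0x10'
    alert_rule_test:
      # At 6m the alert has been pending for longer than "for: 5m", so it fires
      - eval_time: 6m
        alertname: InstanceDown
        exp_alerts:
          - exp_labels:
              severity: critical
              team: infrastructure
              job: node
              instance: host1:9100
            exp_annotations:
              summary: "Instance host1:9100 is down"
              description: "host1:9100 has been down for >5min."
              runbook_url: "https://wiki.example.com/runbooks/instance-down"
```

Run it with: promtool test rules /etc/prometheus/tests/alerts_test.yml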

Step 3: Create Notification Templates

Design readable, actionable notification messages.

Create template file (/etc/alertmanager/templates/default.tmpl excerpt):

{{ define "slack.default.title" }}
[{{ .Status | toUpper }}] {{ .GroupLabels.alertname }}
{{ end }}

{{ define "slack.default.text" }}
{{ range .Alerts }}
*Alert:* {{ .Labels.alertname }}
*Severity:* {{ .Labels.severity }}
*Summary:* {{ .Annotations.summary }}
{{ if .Annotations.runbook_url }}*Runbook:* {{ .Annotations.runbook_url }}{{ end }}
{{ end }}
{{ end }}

# ... (see EXAMPLES.md for complete email and PagerDuty templates)

Use templates in receivers:

receivers:
  - name: 'slack-custom'
    slack_configs:
      - channel: '#alerts'
        title: '{{ template "slack.default.title" . }}'
        text: '{{ template "slack.default.text" . }}'
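Alertmanager only loads template files that are listed under the top-level templates key, so the receiver above needs an entry like the one below in alertmanager.yml. The glob assumes the template directory from this step; with the Docker Compose setup from Step 1, that directory would also have to be mounted into the container, which the excerpt does not show.

```yaml
# alertmanager.yml (top level, next to global/route/receivers)
templates:
  - '/etc/alertmanager/templates/*.tmpl'   # assumed path, matching the template file above
```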

Expected: notifications are consistently formatted, include all relevant context, and are actionable through runbook links.

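On failure:

Test template rendering: amtool template render --template.glob='/etc/alertmanager/templates/*.tmpl' --template.text='{{ template "slack.default.title" . }}'
Check Alertmanager logs for template syntax errors
Dump the template data with {{ printf "%+v" . }} (a Go template built-in) to inspect its structure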

Step 4: Configure Routing + Grouping

Optimize alert delivery with intelligent routing rules.

Advanced routing config (excerpt):

route:
  receiver: 'default-receiver'
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 30s

  routes:
    - match:
        team: platform
      receiver: 'team-platform'
      routes:
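        - match:
            severity: critical
          receiver: 'pagerduty-platform'
          group_wait: 10s
          repeat_interval: 15m
          continue: true   # Keep evaluating the sibling routes below
        # Without a matching sibling, the parent receiver is skipped once a child matches
        - match:
            severity: critical
          receiver: 'team-platform'   # Slack copy of the page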

# ... (see EXAMPLES.md for complete routing with time intervals)
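As a rough sketch of the time-interval routing mentioned above (the complete version lives in EXAMPLES.md and may differ), a route can be muted during a named interval. The interval name and the choice to mute platform warnings on weekends are assumptions for illustration; the top-level time_intervals key requires Alertmanager 0.24 or later.

```yaml
time_intervals:
  - name: weekends                      # hypothetical interval name
    time_intervals:
      - weekdays: ['saturday', 'sunday']

route:
  receiver: 'default-receiver'
  routes:
    - matchers:
        - team="platform"
        - severity="warning"
      receiver: 'team-platform'
      # Notifications for this route are suppressed while the interval is active
      mute_time_intervals: ['weekends']
```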

Grouping strategies:

# Group by alertname: all HighCPUUsage alerts are bundled together
group_by: ['alertname']

# Group by alertname AND cluster: Separate notifications per cluster
group_by: ['alertname', 'cluster']

Expected: alerts are routed to the correct teams, grouped logically, and timed appropriately for their severity.

On failure:

Test routing (labels are passed as positional arguments): amtool config routes test --config.file=alertmanager.yml alertname=HighCPUUsage severity=critical
Check routing tree: amtool config routes show --config.file=alertmanager.yml
Verify that continue: true is set on a route when an alert should also match later sibling routes

Step 5: Implement Inhibition + Silencing

Reduce alert noise with inhibition rules and temporary silences.

Inhibition rules (suppress dependent alerts):

inhibit_rules:
  # Cluster down suppresses all node alerts in that cluster
  - source_match:
      alertname: 'ClusterDown'
      severity: 'critical'
    target_match_re:
      alertname: '(InstanceDown|HighCPUUsage|HighMemory)'
    equal: ['cluster']

  # Service down suppresses latency and error alerts
  - source_match:
      alertname: 'ServiceDown'
    target_match_re:
      alertname: '(HighLatency|HighErrorRate)'
    equal: ['service', 'namespace']

# ... (see EXAMPLES.md for more inhibition patterns)
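The source_match and target_match_re keys are deprecated since Alertmanager 0.22 in favor of source_matchers and target_matchers. A sketch of the first rule in that form follows. Regex matchers are fully anchored, so the pattern uses HighCPUUsage, the alert name defined in Step 2; HighMemory is assumed to be defined elsewhere.

```yaml
inhibit_rules:
  # Cluster down suppresses node-level alerts in the same cluster
  - source_matchers:
      - alertname="ClusterDown"
      - severity="critical"
    target_matchers:
      - alertname=~"InstanceDown|HighCPUUsage|HighMemory"
    # Inhibition applies only when source and target share the same cluster label value
    equal: ['cluster']
```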

Create silences programmatically:

# Silence during maintenance
amtool silence add \
  instance=app-server-1 \
  --author="ops-team" \
  --comment="Scheduled maintenance" \
  --duration=2h

# List and manage silences
amtool silence query
amtool silence expire <SILENCE_ID>
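The amtool commands above need to know where Alertmanager is running, either through --alertmanager.url on each call or through a config file. A minimal sketch of that file, assuming the local port from Step 1 and the author name used above:

```yaml
# $HOME/.config/amtool/config.yml (or /etc/amtool/config.yml)
alertmanager.url: "http://localhost:9093"   # assumed address from the Docker Compose setup
author: ops-team                            # default author for new silences
comment_required: true                      # reject silences created without a comment
```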

Expected: inhibition automatically suppresses cascading alerts, and silences prevent notifications during planned maintenance.

On failure:

Test inhibition logic with live alerts
Check the "Silences" tab in the Alertmanager UI
Verify that silence matchers are exact: every matcher must match the alert's labels for the silence to apply

Step 6: Integrate with External Systems

Connect Alertmanager to PagerDuty, Opsgenie, Jira, etc.

PagerDuty integration (excerpt):

receivers:
  - name: 'pagerduty'
    pagerduty_configs:
      - routing_key: 'YOUR_INTEGRATION_KEY'
        severity: '{{ .CommonLabels.severity }}'
        description: '{{ range .Alerts.Firing }}{{ .Annotations.summary }}{{ end }}'
        details:
          firing: '{{ .Alerts.Firing | len }}'
          alertname: '{{ .GroupLabels.alertname }}'
        # ... (see EXAMPLES.md for complete integration examples)

Webhook for custom integrations:

receivers:
  - name: 'webhook-custom'
    webhook_configs:
      - url: 'https://your-webhook-endpoint.com/alerts'
        send_resolved: true

Expected: alerts create incidents in PagerDuty, appear in team communication channels, and trigger on-call escalations.

On failure:

Verify that API keys and tokens are valid
Check network connectivity to external services
Test webhook endpoints independently with curl
Enable debug logging: start Alertmanager with --log.level=debug

Check

Common Pitfalls

Alert fatigue: Too many low-priority alerts lead responders to ignore critical ones. Set strict thresholds and use inhibition.
Missing for duration: Alerts without a for clause fire on transient spikes. Use a 5-10 minute window for most alerts.
Overly granular grouping: group_by: ['...'] groups by every label, so each alert produces its own notification. Group by a few specific labels instead.
No runbook links: Alerts without runbooks leave responders guessing. Every alert needs a runbook URL.
Incorrect severity: Labeling warnings as critical desensitizes the team. Reserve critical for emergencies.
Forgotten silences: Silences with distant end times can hide real issues. Set the shortest end time that covers the maintenance.
Single route: Sending all alerts to one channel loses context. Use team-specific routing.
No inhibition: Cascading alerts during outages create noise. Implement inhibition rules.

Related Skills

setup-prometheus-monitoring - Define metrics + recording rules feeding alerting rules
define-slo-sli-sla - Generate SLO burn rate alerts for error budget management
write-incident-runbook - Create runbooks linked from alert annotations
build-grafana-dashboards - Visualize alert firing history + silence patterns

GitHub Repository

pjt222/agent-almanac

Path: i18n/caveman-ultra/skills/configure-alerting-rules

agents, agentskills, ai-assisted-development, claude-code, skills, teams

FAQ

Frequently asked questions

What is the configure-alerting-rules skill?

configure-alerting-rules is a Claude Skill by pjt222. Skills package instructions and resources that Claude loads on demand, so Claude can perform configure-alerting-rules-related tasks without extra prompting.

How do I install configure-alerting-rules?

Use the install commands on this page: add configure-alerting-rules to Claude Code as a plugin, or clone its repository into your skills directory, then restart Claude so it picks up the skill.

What category does configure-alerting-rules belong to?

configure-alerting-rules is in the Meta category, tagged ai and automation.

Is configure-alerting-rules free to use?

Yes. configure-alerting-rules is listed on AIMCP and free to install. It runs inside Claude, so no separate service account is required to use the skill itself.
