
correlate-observability-signals

pjt222
Updated 2 days ago

About

This skill unifies metrics, logs, and traces for coherent debugging in distributed systems. It uses exemplars and trace IDs in logs to link metrics and logs to traces, and builds unified dashboards using the RED/USE methods to enable fast root-cause analysis. Use it when investigating complex incidents that involve multiple systems, or when moving from siloed tools to a unified observability platform.

Quick install

Claude Code

Recommended (primary)
npx skills add pjt222/agent-almanac -a claude-code
Plugin command (alternative)
/plugin add https://github.com/pjt222/agent-almanac
Git clone (alternative)
git clone https://github.com/pjt222/agent-almanac.git ~/.claude/skills/correlate-observability-signals

Copy and paste this command into Claude Code to install this skill.

Documentation

Correlate Observability Signals

Connect metrics, logs, and traces for unified debugging across the three pillars of observability.

When to use

  • Investigating complex incidents that span multiple systems
  • Reducing MTTR (mean time to resolution)
  • Building unified observability dashboards
  • Implementing distributed tracing
  • Moving from siloed tools to unified observability

Inputs

  • Required: Prometheus (metrics)
  • Required: Log aggregation system (Loki, Elasticsearch, CloudWatch)
  • Required: Distributed tracing backend (Tempo, Jaeger, Zipkin)
  • Optional: Grafana for unified visualization
  • Optional: OpenTelemetry instrumentation

Procedure

See Extended Examples for complete configuration files and templates.

Step 1: Implement Trace Context Propagation

Add trace IDs to all logs and metrics using OpenTelemetry:

// Go example: Propagate trace context to logs
package main

import (
    "context"
    "log"

    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/trace"
)

func handleRequest(ctx context.Context, userID string) {
    // Extract trace context
    span := trace.SpanFromContext(ctx)
    traceID := span.SpanContext().TraceID().String()

    // Include trace ID in structured logs
    log.Printf("trace_id=%s user_id=%s action=process_request", traceID, userID)

    // Business logic here
    processData(ctx, userID)
}

func processData(ctx context.Context, userID string) {
    tracer := otel.Tracer("my-service")
    ctx, span := tracer.Start(ctx, "processData")
    defer span.End()

    traceID := span.SpanContext().TraceID().String()
    log.Printf("trace_id=%s user_id=%s action=process_data", traceID, userID)

    // More work
}

Python example:

# Python: Flask with OpenTelemetry
import logging

from flask import Flask
from opentelemetry import trace
from opentelemetry.instrumentation.flask import FlaskInstrumentor

app = Flask(__name__)
FlaskInstrumentor().instrument_app(app)


class TraceContextFilter(logging.Filter):
    """Attach the current trace and span IDs to every log record."""

    def filter(self, record):
        ctx = trace.get_current_span().get_span_context()
        record.otelTraceID = format(ctx.trace_id, '032x')
        record.otelSpanID = format(ctx.span_id, '016x')
        return True


logging.basicConfig(
    format='%(asctime)s trace_id=%(otelTraceID)s span_id=%(otelSpanID)s %(message)s',
    level=logging.INFO
)
# Filter on the handlers so records from every logger (including werkzeug)
# carry the fields the format string expects
for handler in logging.getLogger().handlers:
    handler.addFilter(TraceContextFilter())

@app.route('/api/users/<user_id>')
def get_user(user_id):
    logging.info("Fetching user %s", user_id)

    # Business logic
    return {"user_id": user_id}

Expected: All logs include a trace_id field, enabling log-to-trace correlation.

On failure: If trace IDs are missing, check OpenTelemetry SDK initialization and context propagation.
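
One way to check the expected outcome of this step is to scan a sample of log lines for a well-formed trace_id field. The sketch below is only an illustration: it assumes the key=value log format used in the examples above, and the function name and sample lines are hypothetical.

```python
import re

# Assumed log format: key=value pairs, with trace_id as 32 lowercase hex chars.
TRACE_ID_RE = re.compile(r"\btrace_id=([0-9a-f]{32})\b")
ZERO_TRACE_ID = "0" * 32  # all-zero ID, which OpenTelemetry treats as invalid


def find_lines_missing_trace_id(lines):
    """Return the log lines that carry no valid, non-zero trace_id field."""
    missing = []
    for line in lines:
        match = TRACE_ID_RE.search(line)
        if match is None or match.group(1) == ZERO_TRACE_ID:
            missing.append(line)
    return missing


# Hypothetical sample: one good line, one without a trace ID, one with a zero ID.
sample_logs = [
    "trace_id=4bf92f3577b34da6a3ce929d0e0e4736 user_id=42 action=process_request",
    "user_id=42 action=process_data",
    "trace_id=00000000000000000000000000000000 user_id=7 action=process_request",
]
for line in find_lines_missing_trace_id(sample_logs):
    print("missing trace context:", line)
```

An all-zero trace ID usually means the log call ran outside an active span, which points at missing context propagation rather than a logging problem.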

Step 2: Configure Exemplars in Prometheus

Exemplars link metrics to traces:

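# prometheus.yml
# Exemplar storage also requires starting Prometheus with
# --enable-feature=exemplar-storage
global:
  scrape_interval: 15s

# Enable exemplar storage
storage:
  exemplars:
    max_exemplars: 100000  # Size of the in-memory exemplar buffer (all series)

scrape_configs:
  - job_name: 'api-service'
    static_configs:
      - targets: ['api-service:8080']
    # Keep only the request counter and duration histogram used in this guide
    metric_relabel_configs:
      - source_labels: [__name__]
        regex: 'http_requests_total|http_request_duration_seconds.*'
        action: keep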

Instrument application to emit exemplars:

// Go: Emit exemplars with Prometheus histogram
package main

import (
    "context"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promauto"
    "go.opentelemetry.io/otel/trace"
)

var httpDuration = promauto.NewHistogramVec(
    prometheus.HistogramOpts{
        Name:    "http_request_duration_seconds",
        Help:    "HTTP request duration",
        Buckets: prometheus.DefBuckets,
    },
    []string{"method", "endpoint", "status"},
)

func recordRequest(ctx context.Context, method, endpoint, status string, duration float64) {
    // Get trace ID for exemplar
    span := trace.SpanFromContext(ctx)
    traceID := span.SpanContext().TraceID().String()

    // Record metric with exemplar
    observer := httpDuration.WithLabelValues(method, endpoint, status)
    observer.(prometheus.ExemplarObserver).ObserveWithExemplar(
        duration,
        prometheus.Labels{"trace_id": traceID},
    )
}

Query exemplars in Prometheus:

# Histogram with exemplars
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))

In Grafana, exemplars appear as dots on histogram graphs that link to traces.

Expected: Grafana shows exemplars in metric graphs; clicking one opens the corresponding trace.

On failure: Verify that Prometheus is version ≥2.26 (exemplar support) and runs with --enable-feature=exemplar-storage, and check that the Grafana data source config enables exemplars.
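
To confirm that exemplars are actually stored, independently of Grafana, they can be read back through Prometheus's /api/v1/query_exemplars HTTP endpoint. The following is a minimal sketch: it only builds the request URL and parses an already-decoded response. The host name and the sample payload are assumptions; the trace_id label matches the Go instrumentation above.

```python
from urllib.parse import urlencode


def exemplar_query_url(base_url, promql, start, end):
    """Build a URL for Prometheus's /api/v1/query_exemplars endpoint."""
    params = urlencode({"query": promql, "start": start, "end": end})
    return f"{base_url.rstrip('/')}/api/v1/query_exemplars?{params}"


def trace_ids_from_response(payload):
    """Collect trace_id labels from a decoded query_exemplars JSON response."""
    ids = []
    for series in payload.get("data", []):
        for exemplar in series.get("exemplars", []):
            trace_id = exemplar.get("labels", {}).get("trace_id")
            if trace_id:
                ids.append(trace_id)
    return ids


# Hypothetical host and time range (Unix seconds).
url = exemplar_query_url(
    "http://prometheus:9090/",
    "http_request_duration_seconds_bucket",
    1700000000,
    1700000300,
)
print(url)

# Hypothetical response body, shaped like the endpoint's documented output.
sample_payload = {
    "status": "success",
    "data": [
        {
            "seriesLabels": {"__name__": "http_request_duration_seconds_bucket"},
            "exemplars": [
                {
                    "labels": {"trace_id": "4bf92f3577b34da6a3ce929d0e0e4736"},
                    "value": "0.42",
                    "timestamp": 1700000123.0,
                }
            ],
        }
    ],
}
print(trace_ids_from_response(sample_payload))
```

An empty data list for a metric that is being observed with exemplars suggests that exemplar storage is not enabled or that the target is not scraped in OpenMetrics format.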

Step 3: Create a Unified Dashboard with the RED Method

RED Method: Rate, Errors, Duration (for services)

{
  "dashboard": {
    "title": "API Service - RED Dashboard",
    "panels": [
      {
        "title": "Request Rate (req/s)",
        "type": "graph",
        "targets": [
          {
            "expr": "sum(rate(http_requests_total{job=\"api-service\"}[5m])) by (endpoint)",
            "legendFormat": "{{ endpoint }}"
          }
        ],
        "exemplars": true
      },
      {
        "title": "Error Rate (%)",
        "type": "graph",
        "targets": [
          {
            "expr": "sum(rate(http_requests_total{job=\"api-service\", status=~\"5..\"}[5m])) / sum(rate(http_requests_total{job=\"api-service\"}[5m])) * 100",
            "legendFormat": "Error %"
          }
        ],
        "exemplars": true
      },
      {
        "title": "Request Duration (p50, p95, p99)",
        "type": "graph",
        "targets": [
          {
            "expr": "histogram_quantile(0.50, rate(http_request_duration_seconds_bucket{job=\"api-service\"}[5m]))",
            "legendFormat": "p50"
          },
          {
            "expr": "histogram_quantile(0.95, rate(http_request_duration_seconds_bucket{job=\"api-service\"}[5m]))",
            "legendFormat": "p95"
          },
          {
            "expr": "histogram_quantile(0.99, rate(http_request_duration_seconds_bucket{job=\"api-service\"}[5m]))",
            "legendFormat": "p99"
          }
        ],
        "exemplars": true
      },
      {
        "title": "Correlated Logs",
        "type": "logs",
        "datasource": "Loki",
        "targets": [
          {
            "expr": "{job=\"api-service\"} |= \"error\""
          }
        ],
        "options": {
          "showTime": true,
          "enableLogDetails": true
        }
      }
    ]
  }
}

Expected: A single dashboard showing rate, errors, and duration, plus correlated logs.

On failure: If panels show "No Data", verify that metric names match your instrumentation.
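
The arithmetic behind the rate and error-rate panels can be sketched in a few lines. This is a simplified illustration with hypothetical counter samples: it takes two readings of a counter and divides by the window, whereas PromQL's rate() also handles counter resets and extrapolates to the window edges.

```python
def per_second_rate(first_value, last_value, window_seconds):
    """Average per-second increase of a counter over a window (no reset handling)."""
    return (last_value - first_value) / window_seconds


def error_percentage(error_rate, total_rate):
    """Share of requests that failed, in percent; 0.0 when there is no traffic."""
    if total_rate == 0:
        return 0.0
    return 100 * error_rate / total_rate


# Hypothetical http_requests_total readings taken 5 minutes (300 s) apart.
total_rate = per_second_rate(12000, 15000, 300)  # all requests
error_rate = per_second_rate(30, 180, 300)       # requests with status 5xx

print("request rate (req/s):", total_rate)
print("error rate (%):", error_percentage(error_rate, total_rate))
```

The zero-traffic guard mirrors a practical detail of the error-rate panel: dividing by a zero request rate yields no usable value, so a quiet service shows gaps rather than 0%.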

Step 4: Implement the USE Method for Resources

USE Method: Utilization, Saturation, Errors (for resources like CPU, memory, disk)

{
  "dashboard": {
    "title": "Node Resources - USE Dashboard",
    "panels": [
      {
        "title": "CPU Utilization (%)",
        "type": "graph",
        "targets": [
          {
            "expr": "100 - (avg(rate(node_cpu_seconds_total{mode=\"idle\"}[5m])) * 100)",
            "legendFormat": "CPU Usage %"
          }
        ]
      },
      {
        "title": "CPU Saturation (Load Average)",
        "type": "graph",
        "targets": [
          {
            "expr": "node_load1",
            "legendFormat": "1min load"
          },
          {
            "expr": "node_load5",
            "legendFormat": "5min load"
          },
          {
            "expr": "count(node_cpu_seconds_total{mode=\"idle\"})",
            "legendFormat": "CPU cores (threshold)"
          }
        ]
      },
      {
        "title": "Memory Utilization (%)",
        "type": "graph",
        "targets": [
          {
            "expr": "(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100",
            "legendFormat": "Memory Usage %"
          }
        ]
      },
      {
        "title": "Memory Saturation (Page Faults)",
        "type": "graph",
        "targets": [
          {
            "expr": "rate(node_vmstat_pgmajfault[5m])",
            "legendFormat": "Major page faults/s"
          }
        ]
      },
      {
        "title": "Disk Utilization (%)",
        "type": "graph",
        "targets": [
          {
            "expr": "(node_filesystem_size_bytes - node_filesystem_free_bytes) / node_filesystem_size_bytes * 100",
            "legendFormat": "{{ device }}"
          }
        ]
      },
      {
        "title": "Disk Saturation (IO Wait %)",
        "type": "graph",
        "targets": [
          {
            "expr": "rate(node_cpu_seconds_total{mode=\"iowait\"}[5m]) * 100",
            "legendFormat": "IO Wait %"
          }
        ]
      }
    ]
  }
}

Expected: A dashboard showing resource health across all USE dimensions.

On failure: Ensure node_exporter is running and that Prometheus is scraping its system metrics.
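
The panel expressions above reduce to simple formulas, sketched here with hypothetical sample values. The saturation rule (1-minute load above the core count) is the threshold the CPU saturation panel draws, not a universal limit.

```python
def cpu_utilization_percent(idle_fraction):
    """CPU utilization from the average idle fraction (0.0 to 1.0)."""
    return 100 - idle_fraction * 100


def cpu_saturated(load1, cores):
    """True when the 1-minute load average exceeds the number of CPU cores."""
    return load1 > cores


def memory_utilization_percent(total_bytes, available_bytes):
    """Memory utilization, mirroring (MemTotal - MemAvailable) / MemTotal * 100."""
    return (total_bytes - available_bytes) / total_bytes * 100


GIB = 1024 ** 3
# Hypothetical node: 8 cores, 25% idle, load 9.5, 16 GiB RAM with 4 GiB available.
print("CPU utilization (%):", cpu_utilization_percent(0.25))
print("CPU saturated:", cpu_saturated(9.5, 8))
print("Memory utilization (%):", memory_utilization_percent(16 * GIB, 4 * GIB))
```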

Step 5: Link Logs to Traces in Loki

Configure Loki to extract trace IDs:

# loki-config.yml
schema_config:
  configs:
    - from: 2024-01-01
      store: boltdb-shipper
      object_store: s3
      schema: v11
      index:
        prefix: index_
        period: 24h

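# Derived fields for trace linking
# Note: derived fields are a Grafana data source setting rather than a Loki
# server option, so they are listed here as comments only. The intended values:
#   name: TraceID
#   source: trace_id
#   url: 'https://tempo.company.com/trace/${__value.raw}'
#   urlDisplayLabel: 'View Trace'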

In Grafana, configure the Loki data source:

{
  "name": "Loki",
  "type": "loki",
  "url": "http://loki:3100",
  "jsonData": {
    "derivedFields": [
      {
        "datasourceUid": "tempo-uid",
        "matcherRegex": "trace_id=(\\w+)",
        "name": "TraceID",
        "url": "$${__value.raw}"
      }
    ]
  }
}

Expected: Clicking a trace ID in Loki logs opens the corresponding trace in Tempo.

On failure: Verify that the regex matches your log format, and check the Tempo data source UID.
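
Before relying on the derived field, the matcher regex can be tried against real log lines. The sketch below applies the same pattern as matcherRegex in the data source definition above; the sample lines are hypothetical, and Grafana's own regex engine may differ from Python's in edge cases.

```python
import re

# Same pattern as "matcherRegex" in the Loki data source definition.
MATCHER_REGEX = re.compile(r"trace_id=(\w+)")


def extract_trace_id(line):
    """Return the trace ID captured by the matcher regex, or None if no match."""
    match = MATCHER_REGEX.search(line)
    return match.group(1) if match else None


# Hypothetical log lines: key=value format versus JSON format.
kv_line = (
    "2024-05-01 12:00:00 trace_id=4bf92f3577b34da6a3ce929d0e0e4736 "
    "span_id=00f067aa0ba902b7 Fetching user 42"
)
json_line = '{"trace_id": "4bf92f3577b34da6a3ce929d0e0e4736", "msg": "Fetching user 42"}'

print(extract_trace_id(kv_line))
print(extract_trace_id(json_line))
```

The JSON-formatted line does not match because the key and value are separated by quotes and a colon rather than an equals sign, which is a common reason for missing trace links when services log in different formats.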

Step 6: Create a Unified Incident View

Create a dashboard that brings all signals together:

{
  "dashboard": {
    "title": "Incident Investigation",
    "templating": {
      "list": [
        {
# ... (see EXAMPLES.md for complete configuration)

Workflow during an incident:

  1. Alert fires for high error rate
  2. On-call engineer opens Grafana dashboard
  3. Identifies spike in error rate at specific time
  4. Clicks exemplar dot on duration histogram → opens trace
  5. Trace shows slow database query
  6. Clicks "View Logs" on span → opens logs for that trace
  7. Logs reveal specific SQL query causing timeout
  8. Root cause identified in <2 minutes

Expected: A single pane of glass for debugging, with jumps between metrics, logs, and traces.

On failure: If links don't work, check the data source configurations and trace ID propagation.
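
The jump from a trace to its logs in the workflow above amounts to a log query filtered by trace ID within a time window around the event. A minimal sketch follows, assuming the job label and key=value log format used earlier; the function names and the 5-minute padding are hypothetical choices.

```python
import re

HEX_TRACE_ID = re.compile(r"[0-9a-f]{32}")


def logs_for_trace_query(job, trace_id):
    """Build a LogQL query selecting one job's log lines that mention a trace ID."""
    if not HEX_TRACE_ID.fullmatch(trace_id):
        raise ValueError(f"not a 32-char hex trace ID: {trace_id!r}")
    return f'{{job="{job}"}} |= "trace_id={trace_id}"'


def window_around(timestamp_s, padding_s=300):
    """Return (start, end) in Unix seconds, padded around an exemplar's timestamp."""
    return (timestamp_s - padding_s, timestamp_s + padding_s)


# Hypothetical trace ID taken from an exemplar, and the exemplar's timestamp.
query = logs_for_trace_query("api-service", "4bf92f3577b34da6a3ce929d0e0e4736")
window = window_around(1700000000)
print(query)
print(window)
```

Validating the ID before building the query keeps malformed or truncated IDs from silently producing an empty result during an incident.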

Validation

  • Trace IDs present in all application logs
  • Prometheus scraping exemplars
  • Grafana dashboards show exemplar dots on histograms
  • Clicking exemplar opens corresponding trace in Tempo/Jaeger
  • Loki logs have "View Trace" links that work
  • RED dashboard created for key services
  • USE dashboard created for infrastructure
  • Unified incident dashboard tested during a GameDay

Common pitfalls

  • Inconsistent trace ID format: OpenTelemetry uses 32-character hex trace IDs (128-bit), while older Jaeger clients can emit 16-character (64-bit) IDs. Choose one format and use it everywhere.
  • Missing context propagation: If trace IDs don't flow across services, distributed tracing breaks. Use OpenTelemetry auto-instrumentation.
  • Exemplar overload: Too many exemplars (>100k) can slow Prometheus. Sample high-volume metrics.
  • Clock skew: Traces span multiple services. Ensure NTP is configured; clock drift causes trace ordering issues.
  • Data retention mismatch: If traces expire before metrics, correlation breaks. Align retention policies.
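
For the trace ID format pitfall, one common approach is to normalize every ID to the 32-character form before comparing or searching, left-padding 64-bit IDs with zeros. The helper below is a hypothetical sketch of that idea, not part of any library.

```python
HEX_DIGITS = set("0123456789abcdef")


def normalize_trace_id(trace_id):
    """Return a trace ID as 32 lowercase hex chars, zero-padding 16-char IDs."""
    hex_id = trace_id.strip().lower()
    if len(hex_id) not in (16, 32) or not set(hex_id) <= HEX_DIGITS:
        raise ValueError(f"unexpected trace ID format: {trace_id!r}")
    return hex_id.zfill(32)


# A 16-char (64-bit) ID and a 32-char (128-bit) ID, both hypothetical.
print(normalize_trace_id("00F067AA0BA902B7"))
print(normalize_trace_id("4bf92f3577b34da6a3ce929d0e0e4736"))
```

Applying the same normalization when writing logs and when building queries keeps both sides comparable, whichever format a given service emits.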

Related skills

  • setup-prometheus-monitoring - metrics foundation for correlation
  • configure-log-aggregation - logs foundation for correlation
  • instrument-distributed-tracing - traces foundation for correlation
  • build-grafana-dashboards - unified visualization layer

GitHub repository

pjt222/agent-almanac
Path: i18n/de/skills/correlate-observability-signals
