correlate-observability-signals
Über
Diese Fähigkeit vereinheitlicht Metriken, Logs und Traces, um kohärentes Debugging und schnelle Root-Cause-Analyse über Systeme hinweg zu ermöglichen. Sie hilft bei der Implementierung von Log-to-Trace-Verknüpfungen über Exemplare und beim Aufbau einheitlicher Dashboards mit RED/USE-Methoden. Nutzen Sie sie bei der Untersuchung komplexer, systemübergreifender Vorfälle oder beim Wechsel von isolierten Tools zu einer vereinheitlichten Observability-Plattform.
Schnellinstallation
Claude Code
Empfohlennpx skills add pjt222/agent-almanac -a claude-code/plugin add https://github.com/pjt222/agent-almanacgit clone https://github.com/pjt222/agent-almanac.git ~/.claude/skills/correlate-observability-signalsKopieren Sie diesen Befehl und fügen Sie ihn in Claude Code ein, um diese Fähigkeit zu installieren
Dokumentation
Correlate Observability Signals
Connect metrics, logs, and traces for unified debugging across the three pillars of observability.
适用场景
- Investigating complex incidents that span multiple systems
- Reducing MTTR (mean time to resolution)
- Building unified observability dashboards
- Implementing distributed tracing
- Moving from siloed tools to unified observability
输入
- 必需: Prometheus (metrics)
- 必需: Log aggregation system (Loki, Elasticsearch, CloudWatch)
- 必需: Distributed tracing backend (Tempo, Jaeger, Zipkin)
- 可选: Grafana for unified visualization
- 可选: OpenTelemetry instrumentation
步骤
See Extended Examples for complete configuration files and templates.
第 1 步:Implement Trace Context Propagation
Add trace IDs to all logs and metrics using OpenTelemetry:
// Go example: Propagate trace context to logs
package main
import (
"context"
"log"
"go.opentelemetry.io/otel"
"go.opentelemetry.io/otel/trace"
)
func handleRequest(ctx context.Context, userID string) {
// Extract trace context
span := trace.SpanFromContext(ctx)
traceID := span.SpanContext().TraceID().String()
// Include trace ID in structured logs
log.Printf("trace_id=%s user_id=%s action=process_request", traceID, userID)
// Business logic here
processData(ctx, userID)
}
func processData(ctx context.Context, userID string) {
tracer := otel.Tracer("my-service")
ctx, span := tracer.Start(ctx, "processData")
defer span.End()
traceID := span.SpanContext().TraceID().String()
log.Printf("trace_id=%s user_id=%s action=process_data", traceID, userID)
// More work
}
Python example:
# Python: Flask with OpenTelemetry
from flask import Flask, request
from opentelemetry import trace
from opentelemetry.instrumentation.flask import FlaskInstrumentor
import logging
app = Flask(__name__)
FlaskInstrumentor().instrument_app(app)
logging.basicConfig(
format='%(asctime)s trace_id=%(otelTraceID)s span_id=%(otelSpanID)s %(message)s',
level=logging.INFO
)
@app.route('/api/users/<user_id>')
def get_user(user_id):
span = trace.get_current_span()
trace_id = format(span.get_span_context().trace_id, '032x')
logging.info(f"Fetching user {user_id}", extra={
'otelTraceID': trace_id,
'otelSpanID': format(span.get_span_context().span_id, '016x')
})
# Business logic
return {"user_id": user_id}
预期结果: All logs include trace_id field, enabling log-to-trace correlation.
失败处理: If trace IDs missing, check OpenTelemetry SDK initialization and context propagation.
第 2 步:Configure Exemplars in Prometheus
Exemplars link metrics to traces:
# prometheus.yml
global:
scrape_interval: 15s
# Enable exemplar storage
exemplars:
max_exemplars: 100000 # Per TSDB block
scrape_configs:
- job_name: 'api-service'
static_configs:
- targets: ['api-service:8080']
# Scrape exemplars
metric_relabel_configs:
- source_labels: [__name__]
regex: 'http_request_duration_seconds.*'
action: keep
Instrument application to emit exemplars:
// Go: Emit exemplars with Prometheus histogram
package main
import (
"github.com/prometheus/client_golang/prometheus"
"github.com/prometheus/client_golang/prometheus/promauto"
"go.opentelemetry.io/otel/trace"
)
var httpDuration = promauto.NewHistogramVec(
prometheus.HistogramOpts{
Name: "http_request_duration_seconds",
Help: "HTTP request duration",
Buckets: prometheus.DefBuckets,
},
[]string{"method", "endpoint", "status"},
)
func recordRequest(ctx context.Context, method, endpoint, status string, duration float64) {
// Get trace ID for exemplar
span := trace.SpanFromContext(ctx)
traceID := span.SpanContext().TraceID().String()
// Record metric with exemplar
observer := httpDuration.WithLabelValues(method, endpoint, status)
observer.(prometheus.ExemplarObserver).ObserveWithExemplar(
duration,
prometheus.Labels{"trace_id": traceID},
)
}
Query exemplars in Prometheus:
# Histogram with exemplars
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
In Grafana, exemplars appear as dots on histogram graphs that link to traces.
预期结果: Grafana shows exemplars in metric graphs, clicking opens corresponding trace.
失败处理: Verify Prometheus version ≥2.26 (exemplar support), check Grafana data source config enables exemplars.
第 3 步:Build Unified Dashboard with RED Method
RED Method: Rate, Errors, Duration (for services)
{
"dashboard": {
"title": "API Service - RED Dashboard",
"panels": [
{
"title": "Request Rate (req/s)",
"type": "graph",
"targets": [
{
"expr": "sum(rate(http_requests_total{job=\"api-service\"}[5m])) by (endpoint)",
"legendFormat": "{{ endpoint }}"
}
],
"exemplars": true
},
{
"title": "Error Rate (%)",
"type": "graph",
"targets": [
{
"expr": "sum(rate(http_requests_total{job=\"api-service\", status=~\"5..\"}[5m])) / sum(rate(http_requests_total{job=\"api-service\"}[5m])) * 100",
"legendFormat": "Error %"
}
],
"exemplars": true
},
{
"title": "Request Duration (p50, p95, p99)",
"type": "graph",
"targets": [
{
"expr": "histogram_quantile(0.50, rate(http_request_duration_seconds_bucket{job=\"api-service\"}[5m]))",
"legendFormat": "p50"
},
{
"expr": "histogram_quantile(0.95, rate(http_request_duration_seconds_bucket{job=\"api-service\"}[5m]))",
"legendFormat": "p95"
},
{
"expr": "histogram_quantile(0.99, rate(http_request_duration_seconds_bucket{job=\"api-service\"}[5m]))",
"legendFormat": "p99"
}
],
"exemplars": true
},
{
"title": "Correlated Logs",
"type": "logs",
"datasource": "Loki",
"targets": [
{
"expr": "{job=\"api-service\"} |= \"error\""
}
],
"options": {
"showTime": true,
"enableLogDetails": true
}
}
]
}
}
预期结果: Single dashboard showing rate, errors, duration + correlated logs.
失败处理: If panels show "No Data", verify metric names match your instrumentation.
第 4 步:Implement USE Method for Resources
USE Method: Utilization, Saturation, Errors (for resources like CPU, memory, disk)
{
"dashboard": {
"title": "Node Resources - USE Dashboard",
"panels": [
{
"title": "CPU Utilization (%)",
"type": "graph",
"targets": [
{
"expr": "100 - (avg(rate(node_cpu_seconds_total{mode=\"idle\"}[5m])) * 100)",
"legendFormat": "CPU Usage %"
}
]
},
{
"title": "CPU Saturation (Load Average)",
"type": "graph",
"targets": [
{
"expr": "node_load1",
"legendFormat": "1min load"
},
{
"expr": "node_load5",
"legendFormat": "5min load"
},
{
"expr": "count(node_cpu_seconds_total{mode=\"idle\"})",
"legendFormat": "CPU cores (threshold)"
}
]
},
{
"title": "Memory Utilization (%)",
"type": "graph",
"targets": [
{
"expr": "(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100",
"legendFormat": "Memory Usage %"
}
]
},
{
"title": "Memory Saturation (Page Faults)",
"type": "graph",
"targets": [
{
"expr": "rate(node_vmstat_pgmajfault[5m])",
"legendFormat": "Major page faults/s"
}
]
},
{
"title": "Disk Utilization (%)",
"type": "graph",
"targets": [
{
"expr": "(node_filesystem_size_bytes - node_filesystem_free_bytes) / node_filesystem_size_bytes * 100",
"legendFormat": "{{ device }}"
}
]
},
{
"title": "Disk Saturation (IO Wait %)",
"type": "graph",
"targets": [
{
"expr": "rate(node_cpu_seconds_total{mode=\"iowait\"}[5m]) * 100",
"legendFormat": "IO Wait %"
}
]
}
]
}
}
预期结果: Dashboard showing resource health across all USE dimensions.
失败处理: Ensure node_exporter is running and scraping system metrics.
第 5 步:Link Logs to Traces in Loki
Configure Loki to extract trace IDs:
# loki-config.yml
schema_config:
configs:
- from: 2024-01-01
store: boltdb-shipper
object_store: s3
schema: v11
index:
prefix: index_
period: 24h
# Derived fields for trace linking
query_config:
derived_fields:
- name: TraceID
source: trace_id
url: 'https://tempo.company.com/trace/${__value.raw}'
urlDisplayLabel: 'View Trace'
In Grafana, configure Loki data source:
{
"name": "Loki",
"type": "loki",
"url": "http://loki:3100",
"jsonData": {
"derivedFields": [
{
"datasourceUid": "tempo-uid",
"matcherRegex": "trace_id=(\\w+)",
"name": "TraceID",
"url": "$${__value.raw}"
}
]
}
}
预期结果: Clicking trace ID in Loki logs opens corresponding trace in Tempo.
失败处理: Verify regex matches your log format, check Tempo data source UID.
第 6 步:Create Unified Incident View
Build a dashboard that brings all signals together:
{
"dashboard": {
"title": "Incident Investigation",
"templating": {
"list": [
{
# ... (see EXAMPLES.md for complete configuration)
Workflow during incident:
- Alert fires for high error rate
- On-call engineer opens Grafana dashboard
- Identifies spike in error rate at specific time
- Clicks exemplar dot on duration histogram → opens trace
- Trace shows slow database query
- Clicks "View Logs" on span → opens logs for that trace
- Logs reveal specific SQL query causing timeout
- Root cause identified in <2 minutes
预期结果: Single pane of glass for debugging, jumping between metrics/logs/traces.
失败处理: If links don't work, check data source configurations and trace ID propagation.
验证清单
- Trace IDs present in all application logs
- Prometheus scraping exemplars
- Grafana dashboards show exemplar dots on histograms
- Clicking exemplar opens corresponding trace in Tempo/Jaeger
- Loki logs have "View Trace" links that work
- RED dashboard created for key services
- USE dashboard created for infrastructure
- Unified incident dashboard tested during GameDay
常见问题
- Inconsistent trace ID format: OpenTelemetry uses 32-char hex, Jaeger uses 16-char. Choose one.
- Missing context propagation: If trace IDs don't flow across services, distributed tracing breaks. Use OpenTelemetry auto-instrumentation.
- Exemplar overload: Too many exemplars (>100k) can slow Prometheus. Sample high-volume metrics.
- Clock skew: Traces span multiple services. Ensure NTP is configured; clock drift causes trace ordering issues.
- Data retention mismatch: If traces expire before metrics, correlation breaks. Align retention policies.
相关技能
setup-prometheus-monitoring- metrics foundation for correlationconfigure-log-aggregation- logs foundation for correlationinstrument-distributed-tracing- traces foundation for correlationbuild-grafana-dashboards- unified visualization layer
GitHub Repository
Verwandte Skills
content-collections
MetaDiese Skill bietet eine produktionsgetestete Einrichtung für Content Collections – ein TypeScript-first-Tool, das Markdown/MDX-Dateien in typsichere Datensammlungen mit Zod-Validierung umwandelt. Verwenden Sie ihn beim Erstellen von Blogs, Dokumentationsseiten oder inhaltsstarken Vite + React-Anwendungen, um Typsicherheit und automatische Inhaltsvalidierung zu gewährleisten. Er behandelt alles von der Vite-Plugin-Konfiguration und MDX-Kompilierung bis hin zur Deployment-Optimierung und Schema-Validierung.
polymarket
MetaDiese Fähigkeit ermöglicht es Entwicklern, Anwendungen mit der Polymarket-Prognosemärkte-Plattform zu erstellen, einschließlich API-Integration für Handel und Marktdaten. Sie bietet außerdem Echtzeit-Datenstreaming über WebSocket, um Live-Trades und Marktaktivitäten zu überwachen. Nutzen Sie sie zur Implementierung von Handelsstrategien oder zur Erstellung von Tools, die Live-Marktaktualisierungen verarbeiten.
creating-opencode-plugins
MetaDiese Fähigkeit unterstützt Entwickler dabei, OpenCode-Plugins zu erstellen, die in über 25 Ereignistypen wie Befehle, Dateien und LSP-Operationen eingreifen. Sie bietet die Plugin-Struktur, Event-API-Spezifikationen und Implementierungsmuster für JavaScript/TypeScript-Module. Nutzen Sie sie, wenn Sie den Lebenszyklus des OpenCode KI-Assistenten mit benutzerdefinierter ereignisgesteuerter Logik abfangen, überwachen oder erweitern müssen.
sglang
MetaSGLang ist ein hochperformantes LLM-Serving-Framework, das sich auf schnelle, strukturierte Generierung für JSON, Regex und agentenbasierte Workflows unter Verwendung seines RadixAttention-Prefix-Cachings spezialisiert. Es bietet deutlich schnellere Inferenz, insbesondere für Aufgaben mit wiederholten Präfixen, was es ideal für komplexe, strukturierte Ausgaben und Mehrfachdialoge macht. Wählen Sie SGLang gegenüber Alternativen wie vLLM, wenn Sie constrained decoding benötigen oder Anwendungen mit umfangreicher Präfix-Weitergabe entwickeln.
