correlate-observability-signals
About
This skill unifies metrics, logs, and traces to enable cohesive debugging and fast root-cause analysis across systems. It helps link logs and metrics to traces (the latter through exemplars) and build unified dashboards using the RED and USE methods. Use it when investigating complex incidents that span multiple systems or when migrating from siloed tools to a unified observability platform.
Quick install
Claude Code
Recommended: npx skills add pjt222/agent-almanac -a claude-code
/plugin add https://github.com/pjt222/agent-almanac
git clone https://github.com/pjt222/agent-almanac.git ~/.claude/skills/correlate-observability-signals
Copy and paste the command into Claude Code to install this skill.
Documentation
Correlate Observability Signals
Connect metrics, logs, and traces for unified debugging across the three pillars of observability.
When to Use
- Investigating complex incidents that span multiple systems
- Reducing MTTR (mean time to resolution)
- Building unified observability dashboards
- Implementing distributed tracing
- Moving from siloed tools to unified observability
Inputs
- Required: Prometheus (metrics)
- Required: Log aggregation system (Loki, Elasticsearch, CloudWatch)
- Required: Distributed tracing backend (Tempo, Jaeger, Zipkin)
- Optional: Grafana for unified visualization
- Optional: OpenTelemetry instrumentation
Procedure
See Extended Examples for complete configuration files and templates.
Step 1: Implement Trace Context Propagation
Add trace IDs to all logs and metrics using OpenTelemetry:
// Go example: Propagate trace context to logs
package main

import (
	"context"
	"log"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/trace"
)

func handleRequest(ctx context.Context, userID string) {
	// Extract trace context
	span := trace.SpanFromContext(ctx)
	traceID := span.SpanContext().TraceID().String()

	// Include trace ID in structured logs
	log.Printf("trace_id=%s user_id=%s action=process_request", traceID, userID)

	// Business logic here
	processData(ctx, userID)
}

func processData(ctx context.Context, userID string) {
	tracer := otel.Tracer("my-service")
	ctx, span := tracer.Start(ctx, "processData")
	defer span.End()

	traceID := span.SpanContext().TraceID().String()
	log.Printf("trace_id=%s user_id=%s action=process_data", traceID, userID)
	// More work
}
Python example:
# Python: Flask with OpenTelemetry
from flask import Flask
from opentelemetry import trace
from opentelemetry.instrumentation.flask import FlaskInstrumentor
import logging

app = Flask(__name__)
FlaskInstrumentor().instrument_app(app)

# Every record formatted here must carry otelTraceID and otelSpanID.
# They are passed via `extra` below; the opentelemetry-instrumentation-logging
# package can inject them automatically instead.
logging.basicConfig(
    format='%(asctime)s trace_id=%(otelTraceID)s span_id=%(otelSpanID)s %(message)s',
    level=logging.INFO
)

@app.route('/api/users/<user_id>')
def get_user(user_id):
    span_context = trace.get_current_span().get_span_context()
    logging.info(f"Fetching user {user_id}", extra={
        'otelTraceID': format(span_context.trace_id, '032x'),
        'otelSpanID': format(span_context.span_id, '016x')
    })
    # Business logic
    return {"user_id": user_id}
Expected: All logs include a trace_id field, enabling log-to-trace correlation.
On failure: If trace IDs are missing or all zeros, check OpenTelemetry SDK initialization and context propagation.
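A stdlib-only sketch of the same idea, for cases where the trace ID should be attached to every log record instead of being passed through `extra` on each call. The `current_trace_id` context variable and the `trace_demo` logger name are hypothetical stand-ins; a real service would read the ID from the active OpenTelemetry span.

```python
import contextvars
import io
import logging

# Hypothetical per-request storage; a real service would read the active
# OpenTelemetry span instead of this context variable.
current_trace_id = contextvars.ContextVar("current_trace_id", default="0" * 32)


class TraceIdFilter(logging.Filter):
    """Annotate every record that reaches the handler with the current trace ID."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.trace_id = current_trace_id.get()
        return True  # never drop the record, only annotate it


def build_logger(stream: io.StringIO) -> logging.Logger:
    """Return a logger whose output always starts with trace_id=<id>."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("trace_id=%(trace_id)s %(message)s"))
    handler.addFilter(TraceIdFilter())
    logger = logging.getLogger("trace_demo")
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


buffer = io.StringIO()
logger = build_logger(buffer)
current_trace_id.set("4bf92f3577b34da6a3ce929d0e0e4736")
logger.info("action=process_request user_id=42")
output = buffer.getvalue()
```

Because the filter sits on the handler, records from any code path that logs through it get a trace_id field, so the formatter never meets a record without one.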
Step 2: Configure Exemplars in Prometheus
Exemplars link metrics to traces:
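# prometheus.yml
# Exemplar storage must also be enabled at startup with
# --enable-feature=exemplar-storage
global:
  scrape_interval: 15s

# Enable exemplar storage
storage:
  exemplars:
    max_exemplars: 100000  # Size of the in-memory circular buffer shared by all series

scrape_configs:
  - job_name: 'api-service'
    static_configs:
      - targets: ['api-service:8080']
    # Optional: keep only the latency histogram series. This drops every
    # other metric from this job, including http_requests_total used in
    # Step 3; remove the rule or widen the regex if you need them.
    metric_relabel_configs:
      - source_labels: [__name__]
        regex: 'http_request_duration_seconds.*'
        action: keep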
Instrument application to emit exemplars:
// Go: Emit exemplars with Prometheus histogram
package main

import (
	"context"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"go.opentelemetry.io/otel/trace"
)

var httpDuration = promauto.NewHistogramVec(
	prometheus.HistogramOpts{
		Name:    "http_request_duration_seconds",
		Help:    "HTTP request duration",
		Buckets: prometheus.DefBuckets,
	},
	[]string{"method", "endpoint", "status"},
)

func recordRequest(ctx context.Context, method, endpoint, status string, duration float64) {
	// Get trace ID for exemplar
	span := trace.SpanFromContext(ctx)
	traceID := span.SpanContext().TraceID().String()

	// Record metric with exemplar
	observer := httpDuration.WithLabelValues(method, endpoint, status)
	observer.(prometheus.ExemplarObserver).ObserveWithExemplar(
		duration,
		prometheus.Labels{"trace_id": traceID},
	)
}
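To check that the application really emits exemplars, it helps to know what one looks like on the wire. In the OpenMetrics text format, an exemplar follows the sample value after a `#`. The sketch below builds and inspects such a line; the helper names and sample values are illustrative assumptions, not part of any client library.

```python
import re

# Matches the exemplar suffix of an OpenMetrics sample line.
EXEMPLAR_TRACE_RE = re.compile(r'# \{trace_id="(\w+)"\} ')


def bucket_line(name: str, le: str, count: int, trace_id: str, value: float) -> str:
    """Render one histogram bucket sample with a trace_id exemplar attached."""
    return f'{name}_bucket{{le="{le}"}} {count} # {{trace_id="{trace_id}"}} {value}'


def exemplar_trace_id(line: str):
    """Return the exemplar's trace ID, or None if the line carries no exemplar."""
    match = EXEMPLAR_TRACE_RE.search(line)
    return match.group(1) if match else None


line = bucket_line(
    "http_request_duration_seconds", "0.5", 12,
    "4bf92f3577b34da6a3ce929d0e0e4736", 0.43,
)
found = exemplar_trace_id(line)
```

Requesting the service's metrics endpoint with an `Accept: application/openmetrics-text` header and looking for this suffix is a quick way to tell an instrumentation problem from a Prometheus or Grafana configuration problem.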
Query exemplars in Prometheus:
# Histogram with exemplars
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
In Grafana, exemplars appear as dots on histogram graphs that link to traces.
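Expected: Grafana shows exemplars in metric graphs; clicking one opens the corresponding trace.
On failure: Verify that Prometheus is version 2.26 or later and was started with --enable-feature=exemplar-storage, that the application exposes metrics in OpenMetrics format (for the Go client, EnableOpenMetrics in promhttp.HandlerOpts), and that exemplars are enabled in the Grafana Prometheus data source.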
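Exemplars can also be fetched directly, which is useful when the dots do not appear in Grafana. Prometheus serves them from the `/api/v1/query_exemplars` HTTP endpoint. A minimal sketch that only builds the request URL, assuming a Prometheus server at the hypothetical address `http://prometheus:9090`:

```python
from urllib.parse import urlencode


def exemplar_query_url(base_url: str, promql: str, start: int, end: int) -> str:
    """Build a query_exemplars URL for a PromQL selector and a Unix-time range."""
    params = urlencode({"query": promql, "start": start, "end": end})
    return f"{base_url.rstrip('/')}/api/v1/query_exemplars?{params}"


url = exemplar_query_url(
    "http://prometheus:9090",  # hypothetical server address
    "http_request_duration_seconds_bucket",
    1700000000,
    1700000300,
)
```

An empty `data` array in the response suggests that exemplars are not being stored, which points back at the feature flag or the exposition format.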
Step 3: Build Unified Dashboard with RED Method
RED Method: Rate, Errors, Duration (for services)
{
  "dashboard": {
    "title": "API Service - RED Dashboard",
    "panels": [
      {
        "title": "Request Rate (req/s)",
        "type": "graph",
        "targets": [
          {
            "expr": "sum(rate(http_requests_total{job=\"api-service\"}[5m])) by (endpoint)",
            "legendFormat": "{{ endpoint }}"
          }
        ],
        "exemplars": true
      },
      {
        "title": "Error Rate (%)",
        "type": "graph",
        "targets": [
          {
            "expr": "sum(rate(http_requests_total{job=\"api-service\", status=~\"5..\"}[5m])) / sum(rate(http_requests_total{job=\"api-service\"}[5m])) * 100",
            "legendFormat": "Error %"
          }
        ],
        "exemplars": true
      },
      {
        "title": "Request Duration (p50, p95, p99)",
        "type": "graph",
        "targets": [
          {
            "expr": "histogram_quantile(0.50, rate(http_request_duration_seconds_bucket{job=\"api-service\"}[5m]))",
            "legendFormat": "p50"
          },
          {
            "expr": "histogram_quantile(0.95, rate(http_request_duration_seconds_bucket{job=\"api-service\"}[5m]))",
            "legendFormat": "p95"
          },
          {
            "expr": "histogram_quantile(0.99, rate(http_request_duration_seconds_bucket{job=\"api-service\"}[5m]))",
            "legendFormat": "p99"
          }
        ],
        "exemplars": true
      },
      {
        "title": "Correlated Logs",
        "type": "logs",
        "datasource": "Loki",
        "targets": [
          {
            "expr": "{job=\"api-service\"} |= \"error\""
          }
        ],
        "options": {
          "showTime": true,
          "enableLogDetails": true
        }
      }
    ]
  }
}
Expected: A single dashboard showing rate, errors, and duration alongside correlated logs.
On failure: If panels show "No Data", verify that the metric names match your instrumentation.
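The duration panel relies on `histogram_quantile`, which estimates a percentile from cumulative bucket counts by interpolating linearly inside the bucket that contains the target rank. The sketch below mirrors that approach for illustration only; it is a simplified model with made-up bucket data, not Prometheus's actual implementation.

```python
import math


def histogram_quantile(q: float, buckets: list) -> float:
    """Estimate quantile q from (upper_bound, cumulative_count) pairs.

    Buckets must be sorted by upper bound and end with a math.inf bucket.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # The rank falls in the +Inf bucket: report the highest finite bound.
                return buckets[-2][0]
            # Linear interpolation within the bucket that contains the rank.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan


# Illustrative data: 100 requests, 90 of them faster than 0.5 s.
sample_buckets = [(0.1, 50), (0.5, 90), (1.0, 100), (math.inf, 100)]
p95 = histogram_quantile(0.95, sample_buckets)
```

With these buckets the p95 lands between 0.5 s and 1.0 s. The result is only as precise as the bucket boundaries, so bucket layout matters when a p95 or p99 panel is used for alerting.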
Step 4: Implement USE Method for Resources
USE Method: Utilization, Saturation, Errors (for resources like CPU, memory, disk)
{
  "dashboard": {
    "title": "Node Resources - USE Dashboard",
    "panels": [
      {
        "title": "CPU Utilization (%)",
        "type": "graph",
        "targets": [
          {
            "expr": "100 - (avg(rate(node_cpu_seconds_total{mode=\"idle\"}[5m])) * 100)",
            "legendFormat": "CPU Usage %"
          }
        ]
      },
      {
        "title": "CPU Saturation (Load Average)",
        "type": "graph",
        "targets": [
          {
            "expr": "node_load1",
            "legendFormat": "1min load"
          },
          {
            "expr": "node_load5",
            "legendFormat": "5min load"
          },
          {
            "expr": "count(node_cpu_seconds_total{mode=\"idle\"})",
            "legendFormat": "CPU cores (threshold)"
          }
        ]
      },
      {
        "title": "Memory Utilization (%)",
        "type": "graph",
        "targets": [
          {
            "expr": "(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100",
            "legendFormat": "Memory Usage %"
          }
        ]
      },
      {
        "title": "Memory Saturation (Page Faults)",
        "type": "graph",
        "targets": [
          {
            "expr": "rate(node_vmstat_pgmajfault[5m])",
            "legendFormat": "Major page faults/s"
          }
        ]
      },
      {
        "title": "Disk Utilization (%)",
        "type": "graph",
        "targets": [
          {
            "expr": "(node_filesystem_size_bytes - node_filesystem_free_bytes) / node_filesystem_size_bytes * 100",
            "legendFormat": "{{ device }}"
          }
        ]
      },
      {
        "title": "Disk Saturation (IO Wait %)",
        "type": "graph",
        "targets": [
          {
            "expr": "rate(node_cpu_seconds_total{mode=\"iowait\"}[5m]) * 100",
            "legendFormat": "IO Wait %"
          }
        ]
      }
    ]
  }
}
Expected: A dashboard showing resource health across all USE dimensions.
On failure: Ensure node_exporter is running and that Prometheus is scraping its system metrics.
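The CPU utilization expression is easier to sanity-check with the arithmetic written out. `node_cpu_seconds_total{mode="idle"}` is a per-core counter of idle seconds, so its rate is the fraction of time each core sat idle. A worked sketch with made-up counter readings for a two-core node over a 5-minute window:

```python
def cpu_utilization_percent(idle_samples: list, interval_seconds: float) -> float:
    """Compute 100 - avg(rate(idle)) * 100 from per-core counter readings.

    idle_samples holds one (start, end) pair of idle-seconds readings per core.
    """
    idle_rates = [(end - start) / interval_seconds for start, end in idle_samples]
    avg_idle_fraction = sum(idle_rates) / len(idle_rates)
    return 100 - avg_idle_fraction * 100


# Illustrative readings: core 0 idled 240 s and core 1 idled 180 s out of 300 s.
utilization = cpu_utilization_percent([(1000.0, 1240.0), (2000.0, 2180.0)], 300)
# Idle fractions are 0.8 and 0.6, average 0.7, so utilization is roughly 30 percent.
```

If the dashboard shows a value far from what tools such as `top` report on the same host, the usual suspects are a missing `instance` filter (averaging across several nodes) or a rate window shorter than two scrape intervals.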
Step 5: Link Logs to Traces in Loki
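Loki itself needs no trace-specific settings as long as each log line carries trace_id; a typical storage schema looks like this:
# loki-config.yml
schema_config:
  configs:
    - from: 2024-01-01
      store: boltdb-shipper
      object_store: s3
      schema: v11
      index:
        prefix: index_
        period: 24h

# Derived fields for trace linking are configured on the Grafana Loki
# data source, not in Loki itself. The equivalent data source settings
# for an external link are:
#   name: TraceID
#   matcherRegex: trace_id=(\w+)
#   url: 'https://tempo.company.com/trace/${__value.raw}'
#   urlDisplayLabel: 'View Trace'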
In Grafana, configure Loki data source:
{
  "name": "Loki",
  "type": "loki",
  "url": "http://loki:3100",
  "jsonData": {
    "derivedFields": [
      {
        "datasourceUid": "tempo-uid",
        "matcherRegex": "trace_id=(\\w+)",
        "name": "TraceID",
        "url": "$${__value.raw}"
      }
    ]
  }
}
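Expected: Clicking a trace ID in Loki logs opens the corresponding trace in Tempo.
On failure: Verify that the regex matches your log format and that datasourceUid matches the Tempo data source UID. The doubled $$ in url is the escape used in Grafana provisioning files; if the data source is created another way, a single $ may be needed.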
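Before relying on the derived field, the matcher regex can be tested offline against real log lines. The sketch below applies the same pattern, `trace_id=(\w+)`, and builds the link a click would follow; the Tempo host is the placeholder from this guide and the function name is hypothetical.

```python
import re

# Same pattern as the derived field's matcherRegex.
TRACE_ID_RE = re.compile(r"trace_id=(\w+)")


def trace_link(log_line: str, tempo_base: str = "https://tempo.company.com/trace/"):
    """Return the trace URL for a log line, or None if no trace ID is present."""
    match = TRACE_ID_RE.search(log_line)
    return tempo_base + match.group(1) if match else None


link = trace_link(
    "2024-05-01T12:00:00Z trace_id=4bf92f3577b34da6a3ce929d0e0e4736 "
    "user_id=42 action=process_request"
)
missing = trace_link("2024-05-01T12:00:01Z user_id=42 action=healthcheck")
```

Lines that return `None` here will show no trace link in Grafana either, which usually means the log format differs (for example JSON logs with `"trace_id":"..."`) and the regex needs adjusting.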
Step 6: Create Unified Incident View
Build a dashboard that brings all signals together:
{
"dashboard": {
"title": "Incident Investigation",
"templating": {
"list": [
{
# ... (see EXAMPLES.md for complete configuration)
Workflow during incident:
- Alert fires for high error rate
- On-call engineer opens Grafana dashboard
- Identifies spike in error rate at specific time
- Clicks exemplar dot on duration histogram → opens trace
- Trace shows slow database query
- Clicks "View Logs" on span → opens logs for that trace
- Logs reveal specific SQL query causing timeout
- Root cause identified in <2 minutes
Expected: A single pane of glass for debugging, with one-click jumps between metrics, logs, and traces.
On failure: If links don't work, check data source configurations and trace ID propagation.
Validation
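The trace-to-logs jump in the workflow above comes down to a LogQL line filter on the trace ID. A minimal sketch of building that query, assuming logs carry `trace_id=<id>` as in Step 1; the function name is hypothetical:

```python
import re


def logs_for_trace_query(job: str, trace_id: str) -> str:
    """Build a LogQL query selecting all log lines of one trace for a job."""
    if not re.fullmatch(r"[0-9a-f]{32}", trace_id):
        raise ValueError("expected a 32-character lowercase hex trace ID")
    return f'{{job="{job}"}} |= "trace_id={trace_id}"'


query = logs_for_trace_query("api-service", "4bf92f3577b34da6a3ce929d0e0e4736")
```

Validating the ID before interpolating it keeps a malformed or truncated value from silently producing a query that matches nothing.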
- Trace IDs present in all application logs
- Prometheus scraping exemplars
- Grafana dashboards show exemplar dots on histograms
- Clicking exemplar opens corresponding trace in Tempo/Jaeger
- Loki logs have "View Trace" links that work
- RED dashboard created for key services
- USE dashboard created for infrastructure
- Unified incident dashboard tested during GameDay
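Common Pitfalls
- Inconsistent trace ID format: OpenTelemetry and W3C Trace Context use 128-bit IDs (32 hex characters), while older Jaeger and Zipkin clients may emit 64-bit IDs (16 hex characters). Standardize on one format, or zero-pad the shorter IDs, so lookups match across tools.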
- Missing context propagation: If trace IDs don't flow across services, distributed tracing breaks. Use OpenTelemetry auto-instrumentation.
- Exemplar overload: Too many exemplars (>100k) can slow Prometheus. Sample high-volume metrics.
- Clock skew: Traces span multiple services. Ensure NTP is configured; clock drift causes trace ordering issues.
- Data retention mismatch: If traces expire before metrics, correlation breaks. Align retention policies.
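For the trace ID format pitfall, a common remedy is to normalize every ID to 32 lowercase hex characters before it is logged or compared, left-padding shorter 64-bit IDs with zeros. A minimal sketch; the function name is hypothetical:

```python
import re


def normalize_trace_id(trace_id: str) -> str:
    """Return the trace ID as 32 lowercase hex characters, zero-padded on the left."""
    cleaned = trace_id.strip().lower()
    if not re.fullmatch(r"[0-9a-f]{1,32}", cleaned):
        raise ValueError(f"not a hex trace ID: {trace_id!r}")
    if set(cleaned) == {"0"}:
        raise ValueError("an all-zero trace ID is invalid")
    return cleaned.zfill(32)


short_id = normalize_trace_id("A3CE929D0E0E4736")
full_id = normalize_trace_id("4bf92f3577b34da6a3ce929d0e0e4736")
```

Applying the same normalization in log formatting, exemplar labels, and any lookup tooling means one search string finds a trace in every backend.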
Related Skills
- setup-prometheus-monitoring - metrics foundation for correlation
- configure-log-aggregation - logs foundation for correlation
- instrument-distributed-tracing - traces foundation for correlation
- build-grafana-dashboards - unified visualization layer