correlate-observability-signals
Acerca de
Esta habilidad unifica métricas, registros y trazas para una depuración cohesiva, permitiendo un análisis rápido de causas raíz en incidentes complejos y multisistema. Implementa funciones como el enlace de registros a trazas mediante ejemplares y construye paneles unificados utilizando métodos RED/USE. Úsela para reducir el tiempo de resolución al migrar de herramientas aisladas a una plataforma de observabilidad unificada.
Instalación rápida
Claude Code
Recomendadonpx skills add pjt222/agent-almanac -a claude-code/plugin add https://github.com/pjt222/agent-almanacgit clone https://github.com/pjt222/agent-almanac.git ~/.claude/skills/correlate-observability-signalsCopia y pega este comando en Claude Code para instalar esta habilidad
Documentación
Correlate Observability Signals
Connect metrics, logs, traces. Unified debugging across three pillars of observability.
When Use
- Investigating complex incidents spanning many systems
- Cutting MTTR (mean time to resolution)
- Building unified observability dashboards
- Implementing distributed tracing
- Moving from siloed tools to unified observability
Inputs
- Required: Prometheus (metrics)
- Required: Log aggregation system (Loki, Elasticsearch, CloudWatch)
- Required: Distributed tracing backend (Tempo, Jaeger, Zipkin)
- Optional: Grafana for unified visualization
- Optional: OpenTelemetry instrumentation
Steps
See Extended Examples for complete configuration files and templates.
Step 1: Implement Trace Context Propagation
Add trace IDs to all logs and metrics. Use OpenTelemetry.
// Go example: Propagate trace context to logs
package main
import (
"context"
"log"
"go.opentelemetry.io/otel"
"go.opentelemetry.io/otel/trace"
)
func handleRequest(ctx context.Context, userID string) {
// Extract trace context
span := trace.SpanFromContext(ctx)
traceID := span.SpanContext().TraceID().String()
// Include trace ID in structured logs
log.Printf("trace_id=%s user_id=%s action=process_request", traceID, userID)
// Business logic here
processData(ctx, userID)
}
func processData(ctx context.Context, userID string) {
tracer := otel.Tracer("my-service")
ctx, span := tracer.Start(ctx, "processData")
defer span.End()
traceID := span.SpanContext().TraceID().String()
log.Printf("trace_id=%s user_id=%s action=process_data", traceID, userID)
// More work
}
Python example:
# Python: Flask with OpenTelemetry
from flask import Flask, request
from opentelemetry import trace
from opentelemetry.instrumentation.flask import FlaskInstrumentor
import logging
app = Flask(__name__)
FlaskInstrumentor().instrument_app(app)
logging.basicConfig(
format='%(asctime)s trace_id=%(otelTraceID)s span_id=%(otelSpanID)s %(message)s',
level=logging.INFO
)
@app.route('/api/users/<user_id>')
def get_user(user_id):
span = trace.get_current_span()
trace_id = format(span.get_span_context().trace_id, '032x')
logging.info(f"Fetching user {user_id}", extra={
'otelTraceID': trace_id,
'otelSpanID': format(span.get_span_context().span_id, '016x')
})
# Business logic
return {"user_id": user_id}
Got: All logs include trace_id field. Enables log-to-trace correlation.
If fail: Trace IDs missing? Check OpenTelemetry SDK init and context propagation.
Step 2: Configure Exemplars in Prometheus
Exemplars link metrics to traces.
# prometheus.yml
global:
scrape_interval: 15s
# Enable exemplar storage
exemplars:
max_exemplars: 100000 # Per TSDB block
scrape_configs:
- job_name: 'api-service'
static_configs:
- targets: ['api-service:8080']
# Scrape exemplars
metric_relabel_configs:
- source_labels: [__name__]
regex: 'http_request_duration_seconds.*'
action: keep
Instrument app to emit exemplars:
// Go: Emit exemplars with Prometheus histogram
package main
import (
"github.com/prometheus/client_golang/prometheus"
"github.com/prometheus/client_golang/prometheus/promauto"
"go.opentelemetry.io/otel/trace"
)
var httpDuration = promauto.NewHistogramVec(
prometheus.HistogramOpts{
Name: "http_request_duration_seconds",
Help: "HTTP request duration",
Buckets: prometheus.DefBuckets,
},
[]string{"method", "endpoint", "status"},
)
func recordRequest(ctx context.Context, method, endpoint, status string, duration float64) {
// Get trace ID for exemplar
span := trace.SpanFromContext(ctx)
traceID := span.SpanContext().TraceID().String()
// Record metric with exemplar
observer := httpDuration.WithLabelValues(method, endpoint, status)
observer.(prometheus.ExemplarObserver).ObserveWithExemplar(
duration,
prometheus.Labels{"trace_id": traceID},
)
}
Query exemplars in Prometheus:
# Histogram with exemplars
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
In Grafana, exemplars show as dots on histogram graphs. Link to traces.
Got: Grafana shows exemplars on metric graphs. Click opens matching trace.
If fail: Verify Prometheus version ≥2.26 (exemplar support). Check Grafana data source config enables exemplars.
Step 3: Build Unified Dashboard with RED Method
RED Method: Rate, Errors, Duration (for services)
{
"dashboard": {
"title": "API Service - RED Dashboard",
"panels": [
{
"title": "Request Rate (req/s)",
"type": "graph",
"targets": [
{
"expr": "sum(rate(http_requests_total{job=\"api-service\"}[5m])) by (endpoint)",
"legendFormat": "{{ endpoint }}"
}
],
"exemplars": true
},
{
"title": "Error Rate (%)",
"type": "graph",
"targets": [
{
"expr": "sum(rate(http_requests_total{job=\"api-service\", status=~\"5..\"}[5m])) / sum(rate(http_requests_total{job=\"api-service\"}[5m])) * 100",
"legendFormat": "Error %"
}
],
"exemplars": true
},
{
"title": "Request Duration (p50, p95, p99)",
"type": "graph",
"targets": [
{
"expr": "histogram_quantile(0.50, rate(http_request_duration_seconds_bucket{job=\"api-service\"}[5m]))",
"legendFormat": "p50"
},
{
"expr": "histogram_quantile(0.95, rate(http_request_duration_seconds_bucket{job=\"api-service\"}[5m]))",
"legendFormat": "p95"
},
{
"expr": "histogram_quantile(0.99, rate(http_request_duration_seconds_bucket{job=\"api-service\"}[5m]))",
"legendFormat": "p99"
}
],
"exemplars": true
},
{
"title": "Correlated Logs",
"type": "logs",
"datasource": "Loki",
"targets": [
{
"expr": "{job=\"api-service\"} |= \"error\""
}
],
"options": {
"showTime": true,
"enableLogDetails": true
}
}
]
}
}
Got: Single dashboard shows rate, errors, duration + correlated logs.
If fail: Panels show "No Data"? Verify metric names match instrumentation.
Step 4: Implement USE Method for Resources
USE Method: Utilization, Saturation, Errors (for resources like CPU, memory, disk)
{
"dashboard": {
"title": "Node Resources - USE Dashboard",
"panels": [
{
"title": "CPU Utilization (%)",
"type": "graph",
"targets": [
{
"expr": "100 - (avg(rate(node_cpu_seconds_total{mode=\"idle\"}[5m])) * 100)",
"legendFormat": "CPU Usage %"
}
]
},
{
"title": "CPU Saturation (Load Average)",
"type": "graph",
"targets": [
{
"expr": "node_load1",
"legendFormat": "1min load"
},
{
"expr": "node_load5",
"legendFormat": "5min load"
},
{
"expr": "count(node_cpu_seconds_total{mode=\"idle\"})",
"legendFormat": "CPU cores (threshold)"
}
]
},
{
"title": "Memory Utilization (%)",
"type": "graph",
"targets": [
{
"expr": "(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100",
"legendFormat": "Memory Usage %"
}
]
},
{
"title": "Memory Saturation (Page Faults)",
"type": "graph",
"targets": [
{
"expr": "rate(node_vmstat_pgmajfault[5m])",
"legendFormat": "Major page faults/s"
}
]
},
{
"title": "Disk Utilization (%)",
"type": "graph",
"targets": [
{
"expr": "(node_filesystem_size_bytes - node_filesystem_free_bytes) / node_filesystem_size_bytes * 100",
"legendFormat": "{{ device }}"
}
]
},
{
"title": "Disk Saturation (IO Wait %)",
"type": "graph",
"targets": [
{
"expr": "rate(node_cpu_seconds_total{mode=\"iowait\"}[5m]) * 100",
"legendFormat": "IO Wait %"
}
]
}
]
}
}
Got: Dashboard shows resource health across all USE dimensions.
If fail: Ensure node_exporter runs and scrapes system metrics.
Step 5: Link Logs to Traces in Loki
Configure Loki to extract trace IDs:
# loki-config.yml
schema_config:
configs:
- from: 2024-01-01
store: boltdb-shipper
object_store: s3
schema: v11
index:
prefix: index_
period: 24h
# Derived fields for trace linking
query_config:
derived_fields:
- name: TraceID
source: trace_id
url: 'https://tempo.company.com/trace/${__value.raw}'
urlDisplayLabel: 'View Trace'
In Grafana, configure Loki data source:
{
"name": "Loki",
"type": "loki",
"url": "http://loki:3100",
"jsonData": {
"derivedFields": [
{
"datasourceUid": "tempo-uid",
"matcherRegex": "trace_id=(\\w+)",
"name": "TraceID",
"url": "$${__value.raw}"
}
]
}
}
Got: Click trace ID in Loki logs → opens matching trace in Tempo.
If fail: Verify regex matches log format. Check Tempo data source UID.
Step 6: Create Unified Incident View
Build dashboard bringing all signals together:
{
"dashboard": {
"title": "Incident Investigation",
"templating": {
"list": [
{
# ... (see EXAMPLES.md for complete configuration)
Workflow during incident:
- Alert fires for high error rate
- On-call engineer opens Grafana dashboard
- Spots spike in error rate at specific time
- Clicks exemplar dot on duration histogram → opens trace
- Trace shows slow database query
- Clicks "View Logs" on span → opens logs for that trace
- Logs reveal specific SQL query causing timeout
- Root cause found in <2 minutes
Got: Single pane of glass for debugging. Jump between metrics/logs/traces.
If fail: Links break? Check data source configs and trace ID propagation.
Checks
- Trace IDs present in all app logs
- Prometheus scraping exemplars
- Grafana dashboards show exemplar dots on histograms
- Click exemplar opens matching trace in Tempo/Jaeger
- Loki logs have "View Trace" links that work
- RED dashboard built for key services
- USE dashboard built for infrastructure
- Unified incident dashboard tested during GameDay
Pitfalls
- Inconsistent trace ID format: OpenTelemetry uses 32-char hex, Jaeger uses 16-char. Pick one.
- Missing context propagation: Trace IDs don't flow across services → distributed tracing breaks. Use OpenTelemetry auto-instrumentation.
- Exemplar overload: Too many exemplars (>100k) → slow Prometheus. Sample high-volume metrics.
- Clock skew: Traces span many services. Run NTP; clock drift → trace ordering issues.
- Data retention mismatch: Traces expire before metrics → correlation breaks. Align retention policies.
See Also
setup-prometheus-monitoring- metrics foundation for correlationconfigure-log-aggregation- logs foundation for correlationinstrument-distributed-tracing- traces foundation for correlationbuild-grafana-dashboards- unified visualization layer
Repositorio GitHub
Habilidades relacionadas
content-collections
MetaEsta habilidad proporciona una configuración probada en producción para Content Collections, una herramienta centrada en TypeScript que transforma archivos Markdown/MDX en colecciones de datos con tipado seguro mediante validación Zod. Úsala al construir blogs, sitios de documentación o aplicaciones Vite + React con mucho contenido para garantizar seguridad de tipos y validación automática de contenido. Abarca todo, desde la configuración del plugin de Vite y compilación MDX hasta la optimización de despliegue y validación de esquemas.
polymarket
MetaEsta habilidad permite a los desarrolladores crear aplicaciones con la plataforma de mercados de predicción Polymarket, incluyendo la integración de API para operaciones y datos de mercado. También proporciona transmisión de datos en tiempo real a través de WebSocket para monitorear operaciones en vivo y actividad del mercado. Úsela para implementar estrategias de trading o crear herramientas que procesen actualizaciones de mercado en tiempo real.
creating-opencode-plugins
MetaEsta habilidad ayuda a los desarrolladores a crear complementos de OpenCode que se conectan a más de 25 tipos de eventos, como comandos, archivos y operaciones LSP. Proporciona la estructura del complemento, las especificaciones de la API de eventos y los patrones de implementación para módulos en JavaScript/TypeScript. Úsala cuando necesites interceptar, monitorear o extender el ciclo de vida del asistente de IA de OpenCode con lógica personalizada basada en eventos.
sglang
MetaSGLang es un framework de alto rendimiento para el servicio de LLM que se especializa en generación rápida y estructurada para JSON, expresiones regulares y flujos de trabajo de agentes utilizando su caché de prefijos RadixAttention. Ofrece una inferencia significativamente más rápida, especialmente para tareas con prefijos repetidos, lo que lo hace ideal para salidas complejas y estructuradas, y conversaciones multiturno. Elige SGLang sobre alternativas como vLLM cuando necesites decodificación restringida o estés construyendo aplicaciones con uso extensivo de prefijos compartidos.
