correlate-observability-signals
About
This skill unifies metrics, logs, and traces for cohesive debugging and rapid root cause analysis. It helps implement exemplars for log-to-trace linking and build unified dashboards using RED/USE methods. Use it when investigating complex, multi-system incidents or moving from siloed tools to a unified observability platform.
Quick Install
Claude Code
Recommendednpx skills add pjt222/agent-almanac -a claude-code/plugin add https://github.com/pjt222/agent-almanacgit clone https://github.com/pjt222/agent-almanac.git ~/.claude/skills/correlate-observability-signalsCopy and paste this command in Claude Code to install this skill
Documentation
Correlate Observability Signals
Connect metrics, logs, and traces for unified debugging across the three pillars of observability.
When to Use
- Investigating complex incidents that span multiple systems
- Reducing MTTR (mean time to resolution)
- Building unified observability dashboards
- Implementing distributed tracing
- Moving from siloed tools to unified observability
Inputs
- Required: Prometheus (metrics)
- Required: Log aggregation system (Loki, Elasticsearch, CloudWatch)
- Required: Distributed tracing backend (Tempo, Jaeger, Zipkin)
- Optional: Grafana for unified visualization
- Optional: OpenTelemetry instrumentation
Procedure
See Extended Examples for complete configuration files and templates.
Step 1: Implement Trace Context Propagation
Add trace IDs to all logs and metrics using OpenTelemetry:
// Go example: Propagate trace context to logs
package main
import (
"context"
"log"
"go.opentelemetry.io/otel"
"go.opentelemetry.io/otel/trace"
)
func handleRequest(ctx context.Context, userID string) {
// Extract trace context
span := trace.SpanFromContext(ctx)
traceID := span.SpanContext().TraceID().String()
// Include trace ID in structured logs
log.Printf("trace_id=%s user_id=%s action=process_request", traceID, userID)
// Business logic here
processData(ctx, userID)
}
func processData(ctx context.Context, userID string) {
tracer := otel.Tracer("my-service")
ctx, span := tracer.Start(ctx, "processData")
defer span.End()
traceID := span.SpanContext().TraceID().String()
log.Printf("trace_id=%s user_id=%s action=process_data", traceID, userID)
// More work
}
Python example:
# Python: Flask with OpenTelemetry
from flask import Flask, request
from opentelemetry import trace
from opentelemetry.instrumentation.flask import FlaskInstrumentor
import logging
app = Flask(__name__)
FlaskInstrumentor().instrument_app(app)
logging.basicConfig(
format='%(asctime)s trace_id=%(otelTraceID)s span_id=%(otelSpanID)s %(message)s',
level=logging.INFO
)
@app.route('/api/users/<user_id>')
def get_user(user_id):
span = trace.get_current_span()
trace_id = format(span.get_span_context().trace_id, '032x')
logging.info(f"Fetching user {user_id}", extra={
'otelTraceID': trace_id,
'otelSpanID': format(span.get_span_context().span_id, '016x')
})
# Business logic
return {"user_id": user_id}
Got: All logs include trace_id field, enabling log-to-trace correlation.
If fail: If trace IDs missing, check OpenTelemetry SDK initialization and context propagation.
Step 2: Configure Exemplars in Prometheus
Exemplars link metrics to traces:
# prometheus.yml
global:
scrape_interval: 15s
# Enable exemplar storage
exemplars:
max_exemplars: 100000 # Per TSDB block
scrape_configs:
- job_name: 'api-service'
static_configs:
- targets: ['api-service:8080']
# Scrape exemplars
metric_relabel_configs:
- source_labels: [__name__]
regex: 'http_request_duration_seconds.*'
action: keep
Instrument application to emit exemplars:
// Go: Emit exemplars with Prometheus histogram
package main
import (
"github.com/prometheus/client_golang/prometheus"
"github.com/prometheus/client_golang/prometheus/promauto"
"go.opentelemetry.io/otel/trace"
)
var httpDuration = promauto.NewHistogramVec(
prometheus.HistogramOpts{
Name: "http_request_duration_seconds",
Help: "HTTP request duration",
Buckets: prometheus.DefBuckets,
},
[]string{"method", "endpoint", "status"},
)
func recordRequest(ctx context.Context, method, endpoint, status string, duration float64) {
// Get trace ID for exemplar
span := trace.SpanFromContext(ctx)
traceID := span.SpanContext().TraceID().String()
// Record metric with exemplar
observer := httpDuration.WithLabelValues(method, endpoint, status)
observer.(prometheus.ExemplarObserver).ObserveWithExemplar(
duration,
prometheus.Labels{"trace_id": traceID},
)
}
Query exemplars in Prometheus:
# Histogram with exemplars
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
In Grafana, exemplars appear as dots on histogram graphs that link to traces.
Got: Grafana shows exemplars in metric graphs, clicking opens corresponding trace.
If fail: Verify Prometheus version ≥2.26 (exemplar support), check Grafana data source config enables exemplars.
Step 3: Build Unified Dashboard with RED Method
RED Method: Rate, Errors, Duration (for services)
{
"dashboard": {
"title": "API Service - RED Dashboard",
"panels": [
{
"title": "Request Rate (req/s)",
"type": "graph",
"targets": [
{
"expr": "sum(rate(http_requests_total{job=\"api-service\"}[5m])) by (endpoint)",
"legendFormat": "{{ endpoint }}"
}
],
"exemplars": true
},
{
"title": "Error Rate (%)",
"type": "graph",
"targets": [
{
"expr": "sum(rate(http_requests_total{job=\"api-service\", status=~\"5..\"}[5m])) / sum(rate(http_requests_total{job=\"api-service\"}[5m])) * 100",
"legendFormat": "Error %"
}
],
"exemplars": true
},
{
"title": "Request Duration (p50, p95, p99)",
"type": "graph",
"targets": [
{
"expr": "histogram_quantile(0.50, rate(http_request_duration_seconds_bucket{job=\"api-service\"}[5m]))",
"legendFormat": "p50"
},
{
"expr": "histogram_quantile(0.95, rate(http_request_duration_seconds_bucket{job=\"api-service\"}[5m]))",
"legendFormat": "p95"
},
{
"expr": "histogram_quantile(0.99, rate(http_request_duration_seconds_bucket{job=\"api-service\"}[5m]))",
"legendFormat": "p99"
}
],
"exemplars": true
},
{
"title": "Correlated Logs",
"type": "logs",
"datasource": "Loki",
"targets": [
{
"expr": "{job=\"api-service\"} |= \"error\""
}
],
"options": {
"showTime": true,
"enableLogDetails": true
}
}
]
}
}
Got: Single dashboard showing rate, errors, duration + correlated logs.
If fail: If panels show "No Data", verify metric names match your instrumentation.
Step 4: Implement USE Method for Resources
USE Method: Utilization, Saturation, Errors (for resources like CPU, memory, disk)
{
"dashboard": {
"title": "Node Resources - USE Dashboard",
"panels": [
{
"title": "CPU Utilization (%)",
"type": "graph",
"targets": [
{
"expr": "100 - (avg(rate(node_cpu_seconds_total{mode=\"idle\"}[5m])) * 100)",
"legendFormat": "CPU Usage %"
}
]
},
{
"title": "CPU Saturation (Load Average)",
"type": "graph",
"targets": [
{
"expr": "node_load1",
"legendFormat": "1min load"
},
{
"expr": "node_load5",
"legendFormat": "5min load"
},
{
"expr": "count(node_cpu_seconds_total{mode=\"idle\"})",
"legendFormat": "CPU cores (threshold)"
}
]
},
{
"title": "Memory Utilization (%)",
"type": "graph",
"targets": [
{
"expr": "(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100",
"legendFormat": "Memory Usage %"
}
]
},
{
"title": "Memory Saturation (Page Faults)",
"type": "graph",
"targets": [
{
"expr": "rate(node_vmstat_pgmajfault[5m])",
"legendFormat": "Major page faults/s"
}
]
},
{
"title": "Disk Utilization (%)",
"type": "graph",
"targets": [
{
"expr": "(node_filesystem_size_bytes - node_filesystem_free_bytes) / node_filesystem_size_bytes * 100",
"legendFormat": "{{ device }}"
}
]
},
{
"title": "Disk Saturation (IO Wait %)",
"type": "graph",
"targets": [
{
"expr": "rate(node_cpu_seconds_total{mode=\"iowait\"}[5m]) * 100",
"legendFormat": "IO Wait %"
}
]
}
]
}
}
Got: Dashboard showing resource health across all USE dimensions.
If fail: Ensure node_exporter is running and scraping system metrics.
Step 5: Link Logs to Traces in Loki
Configure Loki to extract trace IDs:
# loki-config.yml
schema_config:
configs:
- from: 2024-01-01
store: boltdb-shipper
object_store: s3
schema: v11
index:
prefix: index_
period: 24h
# Derived fields for trace linking
query_config:
derived_fields:
- name: TraceID
source: trace_id
url: 'https://tempo.company.com/trace/${__value.raw}'
urlDisplayLabel: 'View Trace'
In Grafana, configure Loki data source:
{
"name": "Loki",
"type": "loki",
"url": "http://loki:3100",
"jsonData": {
"derivedFields": [
{
"datasourceUid": "tempo-uid",
"matcherRegex": "trace_id=(\\w+)",
"name": "TraceID",
"url": "$${__value.raw}"
}
]
}
}
Got: Clicking trace ID in Loki logs opens corresponding trace in Tempo.
If fail: Verify regex matches your log format, check Tempo data source UID.
Step 6: Create Unified Incident View
Build a dashboard that brings all signals together:
{
"dashboard": {
"title": "Incident Investigation",
"templating": {
"list": [
{
# ... (see EXAMPLES.md for complete configuration)
Workflow during incident:
- Alert fires for high error rate
- On-call engineer opens Grafana dashboard
- Identifies spike in error rate at specific time
- Clicks exemplar dot on duration histogram → opens trace
- Trace shows slow database query
- Clicks "View Logs" on span → opens logs for that trace
- Logs reveal specific SQL query causing timeout
- Root cause identified in <2 minutes
Got: Single pane of glass for debugging, jumping between metrics/logs/traces.
If fail: If links don't work, check data source configurations and trace ID propagation.
Validation
- Trace IDs present in all application logs
- Prometheus scraping exemplars
- Grafana dashboards show exemplar dots on histograms
- Clicking exemplar opens corresponding trace in Tempo/Jaeger
- Loki logs have "View Trace" links that work
- RED dashboard created for key services
- USE dashboard created for infrastructure
- Unified incident dashboard tested during GameDay
Pitfalls
- Inconsistent trace ID format: OpenTelemetry uses 32-char hex, Jaeger uses 16-char. Choose one.
- Missing context propagation: If trace IDs don't flow across services, distributed tracing breaks. Use OpenTelemetry auto-instrumentation.
- Exemplar overload: Too many exemplars (>100k) can slow Prometheus. Sample high-volume metrics.
- Clock skew: Traces span multiple services. Ensure NTP is configured; clock drift causes trace ordering issues.
- Data retention mismatch: If traces expire before metrics, correlation breaks. Align retention policies.
Related Skills
setup-prometheus-monitoring- metrics foundation for correlationconfigure-log-aggregation- logs foundation for correlationinstrument-distributed-tracing- traces foundation for correlationbuild-grafana-dashboards- unified visualization layer
GitHub Repository
Related Skills
content-collections
MetaThis skill provides a production-tested setup for Content Collections, a TypeScript-first tool that transforms Markdown/MDX files into type-safe data collections with Zod validation. Use it when building blogs, documentation sites, or content-heavy Vite + React applications to ensure type safety and automatic content validation. It covers everything from Vite plugin configuration and MDX compilation to deployment optimization and schema validation.
polymarket
MetaThis skill enables developers to build applications with the Polymarket prediction markets platform, including API integration for trading and market data. It also provides real-time data streaming via WebSocket to monitor live trades and market activity. Use it for implementing trading strategies or creating tools that process live market updates.
creating-opencode-plugins
MetaThis skill helps developers create OpenCode plugins that hook into 25+ event types like commands, files, and LSP operations. It provides the plugin structure, event API specifications, and implementation patterns for JavaScript/TypeScript modules. Use it when you need to intercept, monitor, or extend the OpenCode AI assistant's lifecycle with custom event-driven logic.
sglang
MetaSGLang is a high-performance LLM serving framework that specializes in fast, structured generation for JSON, regex, and agentic workflows using its RadixAttention prefix caching. It delivers significantly faster inference, especially for tasks with repeated prefixes, making it ideal for complex, structured outputs and multi-turn conversations. Choose SGLang over alternatives like vLLM when you need constrained decoding or are building applications with extensive prefix sharing.
