correlate-observability-signals
关于
This skill unifies metrics, logs, and traces to enable cohesive debugging and rapid root cause analysis across systems. It helps implement log-to-trace linking via exemplars and build unified dashboards using RED/USE methods. Use it when investigating complex, multi-system incidents or moving from siloed tools to a unified observability platform.
快速安装
Claude Code
推荐npx skills add pjt222/agent-almanac -a claude-code/plugin add https://github.com/pjt222/agent-almanacgit clone https://github.com/pjt222/agent-almanac.git ~/.claude/skills/correlate-observability-signals在 Claude Code 中复制并粘贴此命令以安装该技能
技能文档
關聯可觀測性信號
連接指標、日誌、追蹤以統合除錯,貫通可觀測性三柱。
適用時機
- 查橫跨多系統之複雜事故
- 減 MTTR(平均解決時間)
- 建統一之可觀測性儀表板
- 施分散追蹤
- 由孤立工具移至統一可觀測性
輸入
- 必要:Prometheus(指標)
- 必要:日誌聚合系統(Loki、Elasticsearch、CloudWatch)
- 必要:分散追蹤後端(Tempo、Jaeger、Zipkin)
- 選擇性:Grafana 以行統一可視化
- 選擇性:OpenTelemetry 儀器化
步驟
完整配置檔與範本見 Extended Examples。
步驟一:施追蹤脈絡傳播
以 OpenTelemetry 加追蹤 ID 於所有日誌與指標:
// Go example: Propagate trace context to logs
package main
import (
"context"
"log"
"go.opentelemetry.io/otel"
"go.opentelemetry.io/otel/trace"
)
func handleRequest(ctx context.Context, userID string) {
// Extract trace context
span := trace.SpanFromContext(ctx)
traceID := span.SpanContext().TraceID().String()
// Include trace ID in structured logs
log.Printf("trace_id=%s user_id=%s action=process_request", traceID, userID)
// Business logic here
processData(ctx, userID)
}
func processData(ctx context.Context, userID string) {
tracer := otel.Tracer("my-service")
ctx, span := tracer.Start(ctx, "processData")
defer span.End()
traceID := span.SpanContext().TraceID().String()
log.Printf("trace_id=%s user_id=%s action=process_data", traceID, userID)
// More work
}
Python 之例:
# Python: Flask with OpenTelemetry
from flask import Flask, request
from opentelemetry import trace
from opentelemetry.instrumentation.flask import FlaskInstrumentor
import logging
app = Flask(__name__)
FlaskInstrumentor().instrument_app(app)
logging.basicConfig(
format='%(asctime)s trace_id=%(otelTraceID)s span_id=%(otelSpanID)s %(message)s',
level=logging.INFO
)
@app.route('/api/users/<user_id>')
def get_user(user_id):
span = trace.get_current_span()
trace_id = format(span.get_span_context().trace_id, '032x')
logging.info(f"Fetching user {user_id}", extra={
'otelTraceID': trace_id,
'otelSpanID': format(span.get_span_context().span_id, '016x')
})
# Business logic
return {"user_id": user_id}
預期: 所有日誌含 trace_id 欄,令日誌-追蹤可關聯。
失敗時: 若追蹤 ID 缺,察 OpenTelemetry SDK 初始化及脈絡傳播。
步驟二:配置 Prometheus 中之 exemplars
exemplars 連指標於追蹤:
# prometheus.yml
global:
scrape_interval: 15s
# Enable exemplar storage
exemplars:
max_exemplars: 100000 # Per TSDB block
scrape_configs:
- job_name: 'api-service'
static_configs:
- targets: ['api-service:8080']
# Scrape exemplars
metric_relabel_configs:
- source_labels: [__name__]
regex: 'http_request_duration_seconds.*'
action: keep
儀器化應用以發 exemplars:
// Go: Emit exemplars with Prometheus histogram
package main
import (
"github.com/prometheus/client_golang/prometheus"
"github.com/prometheus/client_golang/prometheus/promauto"
"go.opentelemetry.io/otel/trace"
)
var httpDuration = promauto.NewHistogramVec(
prometheus.HistogramOpts{
Name: "http_request_duration_seconds",
Help: "HTTP request duration",
Buckets: prometheus.DefBuckets,
},
[]string{"method", "endpoint", "status"},
)
func recordRequest(ctx context.Context, method, endpoint, status string, duration float64) {
// Get trace ID for exemplar
span := trace.SpanFromContext(ctx)
traceID := span.SpanContext().TraceID().String()
// Record metric with exemplar
observer := httpDuration.WithLabelValues(method, endpoint, status)
observer.(prometheus.ExemplarObserver).ObserveWithExemplar(
duration,
prometheus.Labels{"trace_id": traceID},
)
}
於 Prometheus 查 exemplars:
# Histogram with exemplars
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
於 Grafana,exemplars 顯為直方圖上之點,可連至追蹤。
預期: Grafana 於指標圖中顯 exemplars,點之則開對應追蹤。
失敗時: 驗 Prometheus 版本 ≥2.26(支援 exemplar),察 Grafana 資料源配置啟 exemplars。
步驟三:以 RED 法建統一儀表板
RED 法:率(Rate)、誤(Errors)、延(Duration)——用於服務。
{
"dashboard": {
"title": "API Service - RED Dashboard",
"panels": [
{
"title": "Request Rate (req/s)",
"type": "graph",
"targets": [
{
"expr": "sum(rate(http_requests_total{job=\"api-service\"}[5m])) by (endpoint)",
"legendFormat": "{{ endpoint }}"
}
],
"exemplars": true
},
{
"title": "Error Rate (%)",
"type": "graph",
"targets": [
{
"expr": "sum(rate(http_requests_total{job=\"api-service\", status=~\"5..\"}[5m])) / sum(rate(http_requests_total{job=\"api-service\"}[5m])) * 100",
"legendFormat": "Error %"
}
],
"exemplars": true
},
{
"title": "Request Duration (p50, p95, p99)",
"type": "graph",
"targets": [
{
"expr": "histogram_quantile(0.50, rate(http_request_duration_seconds_bucket{job=\"api-service\"}[5m]))",
"legendFormat": "p50"
},
{
"expr": "histogram_quantile(0.95, rate(http_request_duration_seconds_bucket{job=\"api-service\"}[5m]))",
"legendFormat": "p95"
},
{
"expr": "histogram_quantile(0.99, rate(http_request_duration_seconds_bucket{job=\"api-service\"}[5m]))",
"legendFormat": "p99"
}
],
"exemplars": true
},
{
"title": "Correlated Logs",
"type": "logs",
"datasource": "Loki",
"targets": [
{
"expr": "{job=\"api-service\"} |= \"error\""
}
],
"options": {
"showTime": true,
"enableLogDetails": true
}
}
]
}
}
預期: 單一儀表板顯率、誤、延與關聯日誌。
失敗時: 若面板顯「No Data」,驗指標名合儀器化。
步驟四:以 USE 法施於資源
USE 法:利用(Utilization)、飽和(Saturation)、誤(Errors)——用於資源如 CPU、記憶體、磁碟。
{
"dashboard": {
"title": "Node Resources - USE Dashboard",
"panels": [
{
"title": "CPU Utilization (%)",
"type": "graph",
"targets": [
{
"expr": "100 - (avg(rate(node_cpu_seconds_total{mode=\"idle\"}[5m])) * 100)",
"legendFormat": "CPU Usage %"
}
]
},
{
"title": "CPU Saturation (Load Average)",
"type": "graph",
"targets": [
{
"expr": "node_load1",
"legendFormat": "1min load"
},
{
"expr": "node_load5",
"legendFormat": "5min load"
},
{
"expr": "count(node_cpu_seconds_total{mode=\"idle\"})",
"legendFormat": "CPU cores (threshold)"
}
]
},
{
"title": "Memory Utilization (%)",
"type": "graph",
"targets": [
{
"expr": "(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100",
"legendFormat": "Memory Usage %"
}
]
},
{
"title": "Memory Saturation (Page Faults)",
"type": "graph",
"targets": [
{
"expr": "rate(node_vmstat_pgmajfault[5m])",
"legendFormat": "Major page faults/s"
}
]
},
{
"title": "Disk Utilization (%)",
"type": "graph",
"targets": [
{
"expr": "(node_filesystem_size_bytes - node_filesystem_free_bytes) / node_filesystem_size_bytes * 100",
"legendFormat": "{{ device }}"
}
]
},
{
"title": "Disk Saturation (IO Wait %)",
"type": "graph",
"targets": [
{
"expr": "rate(node_cpu_seconds_total{mode=\"iowait\"}[5m]) * 100",
"legendFormat": "IO Wait %"
}
]
}
]
}
}
預期: 儀表板顯各 USE 維度之資源健康。
失敗時: 確保 node_exporter 運行,擷取系統指標。
步驟五:Loki 中連日誌於追蹤
配 Loki 以萃取追蹤 ID:
# loki-config.yml
schema_config:
configs:
- from: 2024-01-01
store: boltdb-shipper
object_store: s3
schema: v11
index:
prefix: index_
period: 24h
# Derived fields for trace linking
query_config:
derived_fields:
- name: TraceID
source: trace_id
url: 'https://tempo.company.com/trace/${__value.raw}'
urlDisplayLabel: 'View Trace'
於 Grafana,配 Loki 資料源:
{
"name": "Loki",
"type": "loki",
"url": "http://loki:3100",
"jsonData": {
"derivedFields": [
{
"datasourceUid": "tempo-uid",
"matcherRegex": "trace_id=(\\w+)",
"name": "TraceID",
"url": "$${__value.raw}"
}
]
}
}
預期: 點 Loki 日誌中之追蹤 ID 則於 Tempo 中開對應追蹤。
失敗時: 驗正則合日誌格式,察 Tempo 資料源 UID。
步驟六:建統一事故視圖
造一儀表板,集所有信號:
{
"dashboard": {
"title": "Incident Investigation",
"templating": {
"list": [
{
# ... (see EXAMPLES.md for complete configuration)
事故時之工作流:
- 告警觸於高誤率
- 值班工程師開 Grafana 儀表板
- 辨特定時之誤率尖峰
- 點延遲直方圖上之 exemplar 點——開追蹤
- 追蹤顯緩之資料庫查詢
- 點 span 上之「View Logs」——開該追蹤之日誌
- 日誌揭致逾時之具體 SQL 查詢
- 2 分內辨根因
預期: 單一視窗以除錯,可於指標、日誌、追蹤間跳。
失敗時: 若連結不行,察資料源配置與追蹤 ID 傳播。
驗證
- 追蹤 ID 存於所有應用日誌
- Prometheus 擷取 exemplars
- Grafana 儀表板於直方圖顯 exemplar 點
- 點 exemplar 則於 Tempo/Jaeger 開對應追蹤
- Loki 日誌有可用之「View Trace」連結
- 已為關鍵服務建 RED 儀表板
- 已為基礎設施建 USE 儀表板
- GameDay 時已測統一事故儀表板
常見陷阱
- 追蹤 ID 格式不一:OpenTelemetry 用 32 字十六進制,Jaeger 用 16 字。擇一
- 脈絡傳播缺失:追蹤 ID 若不跨服務流,分散追蹤崩。用 OpenTelemetry 自動儀器化
- exemplar 過載:exemplars 過多(>100k)可緩 Prometheus。取樣高量指標
- 時鐘偏移:追蹤跨多服務。確保 NTP 已配;時鐘漂致追蹤序錯
- 資料保留不配:追蹤若先於指標過期,關聯斷。對齊保留策略
相關技能
setup-prometheus-monitoring- 關聯之指標基礎configure-log-aggregation- 關聯之日誌基礎instrument-distributed-tracing- 關聯之追蹤基礎build-grafana-dashboards- 統一之可視化層
GitHub 仓库
相关推荐技能
content-collections
元Content Collections 是一个 TypeScript 优先的构建工具,可将本地 Markdown/MDX 文件转换为类型安全的数据集合。它专为构建博客、文档站和内容密集型 Vite+React 应用而设计,提供基于 Zod 的自动模式验证。该工具涵盖从 Vite 插件配置、MDX 编译到生产环境部署的完整工作流。
polymarket
元这个Claude Skill为开发者提供完整的Polymarket预测市场开发支持,涵盖API调用、交易执行和市场数据分析。关键特性包括实时WebSocket数据流,可监控实时交易、订单和市场动态。开发者可用它构建预测市场应用、实施交易策略并集成实时市场预测功能。
creating-opencode-plugins
元该Skill帮助开发者创建OpenCode插件,用于接入命令、文件、LSP等25+种事件。它提供了插件结构、事件API规范和JavaScript/TypeScript实现模式,适合需要拦截操作、扩展功能或自定义事件处理的场景。开发者可通过它快速构建响应式模块来增强OpenCode AI助手的能力。
sglang
元SGLang是一个专为LLM设计的高性能推理框架,特别适用于需要结构化输出的场景。它通过RadixAttention前缀缓存技术,在处理JSON、正则表达式、工具调用等具有重复前缀的复杂工作流时,能实现极速生成。如果你正在构建智能体或多轮对话系统,并追求远超vLLM的推理性能,SGLang是理想选择。
