スキル一覧に戻る

setup-prometheus-monitoring

pjt222
更新日 Yesterday
1 閲覧
17
2
17
GitHubで表示
その他general

について

このスキルは、スクレイプ設定、サービスディスカバリー、記録ルールを備えた本番環境対応のPrometheus監視システムを構成します。マイクロサービスとインフラストラクチャのための集中型時系列メトリクス収集を可能にし、SLOとアラートの基盤として機能します。モダンなオブザーバビリティを確立するため、またはレガシー監視システムからの移行にご利用ください。

クイックインストール

Claude Code

推奨
メイン
npx skills add pjt222/agent-almanac -a claude-code
プラグインコマンド代替
/plugin add https://github.com/pjt222/agent-almanac
Git クローン代替
git clone https://github.com/pjt222/agent-almanac.git ~/.claude/skills/setup-prometheus-monitoring

このコマンドをClaude Codeにコピー&ペーストしてスキルをインストールします

ドキュメント

Setup Prometheus Monitoring

Configure prod-ready Prometheus w/ scrape targets, recording rules, federation.

Use When

  • Centralized metrics for microservices|distributed
  • Time-series monitor app+infra
  • Foundation for SLO/SLI + alerting
  • Consolidate metrics from multi Prometheus via federation
  • Migrate legacy → modern observability

In

  • Required: Scrape targets (services, exporters, endpoints)
  • Required: Retention period + storage reqs
  • Optional: Existing service discovery (K8s, Consul, EC2)
  • Optional: Recording rules for pre-agg metrics
  • Optional: Federation hierarchy multi-cluster

Do

Step 1: Install + Configure

# Create Prometheus directory structure
mkdir -p /etc/prometheus/{rules,file_sd}
mkdir -p /var/lib/prometheus

# Download Prometheus (adjust version as needed)
cd /tmp
wget https://github.com/prometheus/prometheus/releases/download/v2.48.0/prometheus-2.48.0.linux-amd64.tar.gz
tar xvf prometheus-2.48.0.linux-amd64.tar.gz
sudo cp prometheus-2.48.0.linux-amd64/{prometheus,promtool} /usr/local/bin/

/etc/prometheus/prometheus.yml:

global:
  scrape_interval: 15s
  scrape_timeout: 10s
  evaluation_interval: 15s
  external_labels:
    cluster: 'production'
    region: 'us-east-1'

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - localhost:9093

# Load recording and alerting rules
rule_files:
  - "rules/*.yml"

# Scrape configurations
scrape_configs:
  # Prometheus self-monitoring
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
        labels:
          env: 'production'

  # Node exporter for host metrics
  - job_name: 'node'
    static_configs:
      - targets:
          - 'node1:9100'
          - 'node2:9100'
        labels:
          env: 'production'

  # Application metrics with file-based service discovery
  - job_name: 'app-services'
    file_sd_configs:
      - files:
          - '/etc/prometheus/file_sd/services.json'
        refresh_interval: 30s
    relabel_configs:
      - source_labels: [__address__]
        target_label: instance
      - source_labels: [env]
        target_label: environment

→ Prometheus starts, UI at http://localhost:9090, targets in Status > Targets.

If err:

  • Syntax: promtool check config /etc/prometheus/prometheus.yml
  • Perms: sudo chown -R prometheus:prometheus /etc/prometheus /var/lib/prometheus
  • Logs: journalctl -u prometheus -f

Step 2: Service Discovery

Dynamic targets → no manual.

K8s add to scrape_configs:

  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Only scrape pods with prometheus.io/scrape annotation
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      # Use custom port if specified
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        target_label: __address__
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
      # Add namespace as label
      - source_labels: [__meta_kubernetes_namespace]
        target_label: kubernetes_namespace
      # Add pod name as label
      - source_labels: [__meta_kubernetes_pod_name]
        target_label: kubernetes_pod_name

File-based /etc/prometheus/file_sd/services.json:

[
  {
    "targets": ["web-app-1:8080", "web-app-2:8080"],
    "labels": {
      "job": "web-app",
      "env": "production",
      "team": "platform"
    }
  },
  {
    "targets": ["api-service-1:9090", "api-service-2:9090"],
    "labels": {
      "job": "api-service",
      "env": "production",
      "team": "backend"
    }
  }
]

Consul:

  - job_name: 'consul-services'
    consul_sd_configs:
      - server: 'consul.example.com:8500'
        services: []  # Empty list means discover all services
    relabel_configs:
      - source_labels: [__meta_consul_service]
        target_label: job
      - source_labels: [__meta_consul_tags]
        regex: '.*,monitoring,.*'
        action: keep

→ Dynamic targets in UI, auto-update on scale|change.

If err:

  • K8s: RBAC kubectl auth can-i list pods --as=system:serviceaccount:monitoring:prometheus
  • File SD: python -m json.tool /etc/prometheus/file_sd/services.json
  • Consul: curl http://consul.example.com:8500/v1/catalog/services

Step 3: Recording Rules

Pre-aggregate expensive queries → dashboard perf + alerting efficiency.

/etc/prometheus/rules/recording_rules.yml:

groups:
  - name: api_aggregations
    interval: 30s
    rules:
      # Calculate request rate per endpoint (5m window)
      - record: job:http_requests:rate5m
        expr: |
          sum by (job, endpoint, method) (
            rate(http_requests_total[5m])
          )

      # Calculate error rate percentage
      - record: job:http_errors:rate5m
        expr: |
          sum by (job) (
            rate(http_requests_total{status=~"5.."}[5m])
          ) / sum by (job) (
            rate(http_requests_total[5m])
          ) * 100

      # P95 latency by endpoint
      - record: job:http_request_duration_seconds:p95
        expr: |
          histogram_quantile(0.95,
            sum by (job, endpoint, le) (
              rate(http_request_duration_seconds_bucket[5m])
            )
          )

  - name: resource_aggregations
    interval: 1m
    rules:
      # CPU usage by instance
      - record: instance:cpu_usage:ratio
        expr: |
          1 - avg by (instance) (
            rate(node_cpu_seconds_total{mode="idle"}[5m])
          )

      # Memory usage percentage
      - record: instance:memory_usage:ratio
        expr: |
          1 - (
            node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes
          )

      # Disk usage by mount point
      - record: instance:disk_usage:ratio
        expr: |
          1 - (
            node_filesystem_avail_bytes{fstype!~"tmpfs|fuse.*"}
            / node_filesystem_size_bytes{fstype!~"tmpfs|fuse.*"}
          )

Validate + reload:

# Validate rules syntax
promtool check rules /etc/prometheus/rules/recording_rules.yml

# Reload Prometheus configuration (without restart)
curl -X POST http://localhost:9090/-/reload

# Or send SIGHUP signal
sudo killall -HUP prometheus

→ Rules eval, new metrics w/ job: prefix, query perf improved.

If err:

  • promtool check rules
  • Eval interval matches data avail
  • Missing source: curl http://localhost:9090/api/v1/targets
  • Logs: journalctl -u prometheus | grep -i error

Step 4: Storage + Retention

/etc/systemd/system/prometheus.service:

[Unit]
Description=Prometheus Monitoring System
Documentation=https://prometheus.io/docs/introduction/overview/
After=network-online.target

[Service]
Type=simple
User=prometheus
Group=prometheus
ExecStart=/usr/local/bin/prometheus \
  --config.file=/etc/prometheus/prometheus.yml \
  --storage.tsdb.path=/var/lib/prometheus \
  --storage.tsdb.retention.time=30d \
  --storage.tsdb.retention.size=50GB \
  --web.console.templates=/etc/prometheus/consoles \
  --web.console.libraries=/etc/prometheus/console_libraries \
  --web.listen-address=:9090 \
  --web.enable-lifecycle \
  --web.enable-admin-api

Restart=always
RestartSec=10s

[Install]
WantedBy=multi-user.target

Key flags:

  • --storage.tsdb.retention.time=30d: 30d data
  • --storage.tsdb.retention.size=50GB: 50GB cap (whichever first)
  • --storage.tsdb.wal-compression: ↓disk I/O
  • --web.enable-lifecycle: Reload via HTTP POST
  • --web.enable-admin-api: Snapshot + delete APIs

Enable + start:

sudo systemctl daemon-reload
sudo systemctl enable prometheus
sudo systemctl start prometheus
sudo systemctl status prometheus

→ Retains per policy, disk within limits, old auto-pruned.

If err:

  • Disk: du -sh /var/lib/prometheus
  • TSDB: curl http://localhost:9090/api/v1/status/tsdb
  • Retention: curl http://localhost:9090/api/v1/status/runtimeinfo | jq .data.storageRetention
  • Force cleanup: curl -X POST http://localhost:9090/api/v1/admin/tsdb/delete_series?match[]={__name__=~".+"}

Step 5: Federation (Multi-Cluster)

Hierarchical for aggregating across clusters.

Edge instances per cluster, set external labels:

global:
  external_labels:
    cluster: 'production-east'
    datacenter: 'us-east-1'

Central add federation scrape:

scrape_configs:
  - job_name: 'federate-production'
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        # Aggregate only pre-computed recording rules
        - '{__name__=~"job:.*"}'
        # Include alert states
        - '{__name__=~"ALERTS.*"}'
        # Include critical infrastructure metrics
        - 'up{job=~".*"}'
    static_configs:
      - targets:
          - 'prometheus-east.example.com:9090'
          - 'prometheus-west.example.com:9090'
        labels:
          env: 'production'
    relabel_configs:
      - source_labels: [__address__]
        target_label: instance
      - source_labels: [__address__]
        regex: 'prometheus-(.*).example.com.*'
        target_label: cluster
        replacement: '$1'

Best practices:

  • honor_labels: true preserves original
  • Federate only recording rules + aggregates (not raw)
  • Scrape intervals longer than edge eval
  • match[] filters → don't federate everything

→ Central shows federated metrics from all, queries span regions, min duplication.

If err:

  • Endpoint accessible: curl http://prometheus-east.example.com:9090/federate?match[]={__name__=~"job:.*"} | head -20
  • Label conflicts (central vs edge external)
  • Federation lag: compare timestamps
  • Match patterns: curl http://localhost:9090/api/v1/label/__name__/values | jq .data | grep "job:"

Step 6: HA (Optional)

Redundant instances identical configs for failover.

Thanos|Cortex for true HA, or load-balanced:

# prometheus-1.yml and prometheus-2.yml (identical configs)
global:
  scrape_interval: 15s
  external_labels:
    prometheus: 'prometheus-1'  # Different per instance
    replica: 'A'

# Use --web.external-url flag for each instance
# prometheus-1: --web.external-url=http://prometheus-1.example.com:9090
# prometheus-2: --web.external-url=http://prometheus-2.example.com:9090

Grafana queries both:

{
  "name": "Prometheus-HA",
  "type": "prometheus",
  "url": "http://prometheus-lb.example.com",
  "jsonData": {
    "httpMethod": "POST",
    "timeInterval": "15s"
  }
}

HAProxy|nginx for LB:

upstream prometheus_backend {
    server prometheus-1.example.com:9090 max_fails=3 fail_timeout=30s;
    server prometheus-2.example.com:9090 max_fails=3 fail_timeout=30s;
}

server {
    listen 9090;
    location / {
        proxy_pass http://prometheus_backend;
        proxy_set_header Host $host;
    }
}

→ Queries balanced, auto-failover if 1 down, no data loss single-instance fail.

If err:

  • Both scrape same targets (slight time skew OK)
  • Config drift between
  • Dedup in queries (Grafana shows dup series)
  • LB health checks

Check

  • UI accessible
  • All scrape targets UP in Status > Targets
  • Service discovery dynamic add|remove
  • Recording rules eval w/o errs
  • Retention matches configured time|size
  • Federation pulls from edge
  • Queries return expected cardinality (not excessive)
  • Disk stable + within budget
  • Reload via HTTP|SIGHUP
  • Self-monitor metrics (up, scrape duration)

Traps

  • High cardinality: Avoid unbounded labels (user IDs, timestamps, UUIDs). Recording rules to agg before storage.
  • Scrape interval mismatch: Recording rules eval ≥ scrape intervals → no gaps.
  • Federation overload: All metrics = massive dup. Only federate aggregated rules.
  • Missing relabel: Service discovery → confusing|dup labels w/o relabel.
  • Retention too short: Set longer than longest dashboard window → no "no data" gaps.
  • No resource limits: Excessive mem w/ high cardinality. Set --storage.tsdb.max-block-duration + monitor heap.
  • Disabled lifecycle: W/o --web.enable-lifecycle, reloads need full restart → scrape gaps.

  • configure-alerting-rules — alerting rules + Alertmanager routing
  • build-grafana-dashboards — visualize w/ Grafana
  • define-slo-sli-sla — SLO/SLI via recording rules + error budget
  • instrument-distributed-tracing — complement metrics w/ tracing

GitHub リポジトリ

pjt222/agent-almanac
パス: i18n/caveman-ultra/skills/setup-prometheus-monitoring
0
agentsagentskillsai-assisted-developmentclaude-codeskillsteams

関連スキル

llamaguard

その他

LlamaGuardは、暴力やヘイトスピーチなど6つの安全性カテゴリーにおいて、LLMの入力と出力をモデレートするMetaの70-80億パラメータモデルです。94〜95%の精度を提供し、vLLM、Hugging Face、Amazon SageMakerを使用してデプロイ可能です。このスキルを使用して、AIアプリケーションにコンテンツフィルタリングと安全策を簡単に統合できます。

スキルを見る

cost-optimization

その他

このClaudeスキルは、リソースの適正サイジング、タグ付け戦略、支出分析を通じて、開発者がクラウドコストを最適化することを支援します。AWS、Azure、GCPにわたるクラウド支出の削減とコストガバナンスの実施のためのフレームワークを提供します。インフラコストの分析、リソースの適正サイジング、または予算制約への対応が必要な際にご利用ください。

スキルを見る

quantizing-models-bitsandbytes

その他

このスキルは、bitsandbytesを使用してLLMを8ビットまたは4ビット精度に量子化し、精度の低下を最小限に抑えつつ50〜75%のメモリ削減を実現します。限られたGPUメモリでより大規模なモデルを実行したり、推論を高速化するのに理想的で、INT8、NF4、FP4などのフォーマットをサポートしています。HuggingFace Transformersと統合され、QLoRAトレーニングや8ビットオプティマイザーを可能にします。

スキルを見る

dispatching-parallel-agents

その他

このClaudeスキルは、複数のエージェントを配備し、3つ以上の独立した問題を並行して調査・修正します。共有状態や依存関係がなく解決可能な、無関係な障害が発生するシナリオ向けに設計されています。中核となる機能は並列問題解決であり、効率を最大化するために独立した問題領域ごとに1つのエージェントを割り当てます。

スキルを見る