setup-prometheus-monitoring

Name: setup-prometheus-monitoring
Author: pjt222


About

This skill sets up a production-ready Prometheus monitoring system with scrape configurations, service discovery, and recording rules. It provides centralized collection of time-series metrics for microservices and infrastructure, serving as the foundation for SLOs and alerting. Use it to adopt modern observability or to migrate from legacy monitoring systems.

Quick install

Claude Code

Primary (recommended)

npx skills add pjt222/agent-almanac -a claude-code

Plugin command (alternative)

/plugin add https://github.com/pjt222/agent-almanac

Git clone (alternative)

git clone https://github.com/pjt222/agent-almanac.git ~/.claude/skills/setup-prometheus-monitoring

Copy and paste this command into Claude Code to install this skill.

Documentation

Setup Prometheus Monitoring

Configure prod-ready Prometheus w/ scrape targets, recording rules, federation.

Use When

Centralized metrics for microservices|distributed
Time-series monitor app+infra
Foundation for SLO/SLI + alerting
Consolidate metrics from multi Prometheus via federation
Migrate legacy → modern observability

In

Required: Scrape targets (services, exporters, endpoints)
Required: Retention period + storage reqs
Optional: Existing service discovery (K8s, Consul, EC2)
Optional: Recording rules for pre-agg metrics
Optional: Federation hierarchy multi-cluster

Do

Step 1: Install + Configure

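# Create a dedicated system user and the Prometheus directory structure
sudo useradd --system --no-create-home --shell /bin/false prometheus
sudo mkdir -p /etc/prometheus/{rules,file_sd}
sudo mkdir -p /var/lib/prometheus

# Download Prometheus (adjust version as needed)
cd /tmp
wget https://github.com/prometheus/prometheus/releases/download/v2.48.0/prometheus-2.48.0.linux-amd64.tar.gz
tar xvf prometheus-2.48.0.linux-amd64.tar.gz
sudo cp prometheus-2.48.0.linux-amd64/{prometheus,promtool} /usr/local/bin/
# Console assets referenced by the --web.console.* flags in Step 4
sudo cp -r prometheus-2.48.0.linux-amd64/{consoles,console_libraries} /etc/prometheus/
sudo chown -R prometheus:prometheus /etc/prometheus /var/lib/prometheus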

/etc/prometheus/prometheus.yml:

global:
  scrape_interval: 15s
  scrape_timeout: 10s
  evaluation_interval: 15s
  external_labels:
    cluster: 'production'
    region: 'us-east-1'

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - localhost:9093

# Load recording and alerting rules
rule_files:
  - "rules/*.yml"

# Scrape configurations
scrape_configs:
  # Prometheus self-monitoring
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
        labels:
          env: 'production'

  # Node exporter for host metrics
  - job_name: 'node'
    static_configs:
      - targets:
          - 'node1:9100'
          - 'node2:9100'
        labels:
          env: 'production'

  # Application metrics with file-based service discovery
  - job_name: 'app-services'
    file_sd_configs:
      - files:
          - '/etc/prometheus/file_sd/services.json'
        refresh_interval: 30s
    relabel_configs:
      - source_labels: [__address__]
        target_label: instance
      - source_labels: [env]
        target_label: environment

→ Config validates; once Prometheus runs (systemd unit in Step 4), UI at http://localhost:9090, targets in Status > Targets.

If err:

Syntax: promtool check config /etc/prometheus/prometheus.yml
Perms: sudo chown -R prometheus:prometheus /etc/prometheus /var/lib/prometheus
Logs: journalctl -u prometheus -f
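
One config rule worth knowing before running promtool: a scrape_timeout set explicitly on a job must not exceed that job's scrape_interval. The sketch below mimics only that single check. The helper names (parse_duration, check_timeouts) are hypothetical, the parser handles only single-unit durations such as 15s or 1m, and the config is passed as a plain dict instead of being loaded from YAML.

```python
import re

# Seconds per unit for the simple single-unit durations handled here.
_UNITS = {"ms": 0.001, "s": 1, "m": 60, "h": 3600, "d": 86400}

def parse_duration(text):
    """Convert a duration such as '15s' or '1m' to seconds."""
    match = re.fullmatch(r"(\d+)(ms|s|m|h|d)", text)
    if match is None:
        raise ValueError(f"unsupported duration: {text!r}")
    return int(match.group(1)) * _UNITS[match.group(2)]

def check_timeouts(config):
    """Return job names whose explicit scrape_timeout exceeds their scrape_interval."""
    global_interval = config.get("global", {}).get("scrape_interval", "1m")
    bad = []
    for job in config.get("scrape_configs", []):
        interval = job.get("scrape_interval", global_interval)
        timeout = job.get("scrape_timeout")  # only explicit per-job values are checked
        if timeout is not None and parse_duration(timeout) > parse_duration(interval):
            bad.append(job["job_name"])
    return bad

# Hypothetical config: the 'fast' job sets a timeout longer than its interval.
config = {
    "global": {"scrape_interval": "15s", "scrape_timeout": "10s"},
    "scrape_configs": [
        {"job_name": "prometheus"},
        {"job_name": "fast", "scrape_interval": "5s", "scrape_timeout": "10s"},
    ],
}
print(check_timeouts(config))
```

promtool check config stays the authoritative check; this only illustrates why a job with a shortened interval may also need a shortened timeout.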

Step 2: Service Discovery

Dynamic targets → no manual.

K8s add to scrape_configs:

  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Only scrape pods with prometheus.io/scrape annotation
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      # Use custom port if specified (regex matches "<address>;<port>")
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        target_label: __address__
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
      # Add namespace as label
      - source_labels: [__meta_kubernetes_namespace]
        target_label: kubernetes_namespace
      # Add pod name as label
      - source_labels: [__meta_kubernetes_pod_name]
        target_label: kubernetes_pod_name

File-based /etc/prometheus/file_sd/services.json:

[
  {
    "targets": ["web-app-1:8080", "web-app-2:8080"],
    "labels": {
      "job": "web-app",
      "env": "production",
      "team": "platform"
    }
  },
  {
    "targets": ["api-service-1:9090", "api-service-2:9090"],
    "labels": {
      "job": "api-service",
      "env": "production",
      "team": "backend"
    }
  }
]
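
If services.json is generated by a script, writing it atomically avoids Prometheus reading a half-written file on refresh. A minimal sketch, assuming the generator runs on the Prometheus host: write_file_sd is a hypothetical helper, the validation covers only the two fields used above, and the demo writes to a temporary directory, not to /etc/prometheus.

```python
import json
import os
import tempfile

def write_file_sd(path, groups):
    """Validate target groups and atomically replace the file_sd JSON file."""
    for group in groups:
        if not group.get("targets"):
            raise ValueError("each group needs a non-empty 'targets' list")
        for key, value in group.get("labels", {}).items():
            if not isinstance(value, str):
                raise ValueError(f"label {key!r} must be a string")
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file lives in the same directory so the final rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        json.dump(groups, handle, indent=2)
    os.chmod(tmp_path, 0o644)  # mkstemp creates 0600; the prometheus user must be able to read it
    os.replace(tmp_path, path)  # atomic rename over the old file

with tempfile.TemporaryDirectory() as demo_dir:
    demo_path = os.path.join(demo_dir, "services.json")
    write_file_sd(demo_path, [{"targets": ["web-app-1:8080"], "labels": {"job": "web-app"}}])
    with open(demo_path) as handle:
        print(json.load(handle))
```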

Consul:

  - job_name: 'consul-services'
    consul_sd_configs:
      - server: 'consul.example.com:8500'
        services: []  # Empty list means discover all services
    relabel_configs:
      - source_labels: [__meta_consul_service]
        target_label: job
      - source_labels: [__meta_consul_tags]
        regex: '.*,monitoring,.*'
        action: keep

→ Dynamic targets in UI, auto-update on scale|change.

If err:

K8s: RBAC kubectl auth can-i list pods --as=system:serviceaccount:monitoring:prometheus
File SD: python3 -m json.tool /etc/prometheus/file_sd/services.json
Consul: curl http://consul.example.com:8500/v1/catalog/services

Step 3: Recording Rules

Pre-aggregate expensive queries → dashboard perf + alerting efficiency.

/etc/prometheus/rules/recording_rules.yml:

groups:
  - name: api_aggregations
    interval: 30s
    rules:
      # Calculate request rate per endpoint (5m window)
      - record: job:http_requests:rate5m
        expr: |
          sum by (job, endpoint, method) (
            rate(http_requests_total[5m])
          )

      # Calculate error rate percentage
      - record: job:http_errors:rate5m
        expr: |
          sum by (job) (
            rate(http_requests_total{status=~"5.."}[5m])
          ) / sum by (job) (
            rate(http_requests_total[5m])
          ) * 100

      # P95 latency by endpoint
      - record: job:http_request_duration_seconds:p95
        expr: |
          histogram_quantile(0.95,
            sum by (job, endpoint, le) (
              rate(http_request_duration_seconds_bucket[5m])
            )
          )

  - name: resource_aggregations
    interval: 1m
    rules:
      # CPU usage by instance
      - record: instance:cpu_usage:ratio
        expr: |
          1 - avg by (instance) (
            rate(node_cpu_seconds_total{mode="idle"}[5m])
          )

      # Memory usage ratio (0-1)
      - record: instance:memory_usage:ratio
        expr: |
          1 - (
            node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes
          )

      # Disk usage by mount point
      - record: instance:disk_usage:ratio
        expr: |
          1 - (
            node_filesystem_avail_bytes{fstype!~"tmpfs|fuse.*"}
            / node_filesystem_size_bytes{fstype!~"tmpfs|fuse.*"}
          )
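
The P95 rule relies on histogram_quantile, which estimates a quantile by linear interpolation inside the cumulative bucket where the target rank falls. The sketch below is a simplified re-implementation for intuition only, not the Prometheus source. It assumes classic cumulative buckets sorted by upper bound, at least one finite bucket before +Inf, a lower bound of 0 for the first bucket, and 0 < q <= 1; the bucket counts are invented.

```python
import math

def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    buckets must be sorted by upper bound and end with the +Inf bucket.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for index, (bound, count) in enumerate(buckets):
        if count >= rank:
            if math.isinf(bound):
                # Quantile falls in the +Inf bucket: report the highest finite bound.
                return buckets[index - 1][0]
            in_bucket = count - prev_count
            if in_bucket == 0:
                return bound
            # Linear interpolation between the bucket's lower and upper bounds.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / in_bucket
        prev_bound, prev_count = bound, count
    return math.nan

# Hypothetical per-second bucket rates for one (job, endpoint).
buckets = [(0.1, 50.0), (0.5, 90.0), (1.0, 100.0), (math.inf, 100.0)]
print(histogram_quantile(0.95, buckets))
```

The estimate can only be as fine as the bucket layout: here every quantile between 0.9 and 1.0 lands somewhere in the 0.5s to 1.0s bucket.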

Validate + reload:

# Validate rules syntax
promtool check rules /etc/prometheus/rules/recording_rules.yml

# Reload Prometheus configuration (without restart)
curl -X POST http://localhost:9090/-/reload

# Or send SIGHUP signal
sudo killall -HUP prometheus

→ Rules eval, new metrics w/ job: + instance: prefixes, query perf improved.

If err:

promtool check rules
Eval interval matches data avail
Missing source: curl http://localhost:9090/api/v1/targets
Logs: journalctl -u prometheus | grep -i error

Step 4: Storage + Retention

/etc/systemd/system/prometheus.service:

[Unit]
Description=Prometheus Monitoring System
Documentation=https://prometheus.io/docs/introduction/overview/
After=network-online.target

[Service]
Type=simple
User=prometheus
Group=prometheus
ExecStart=/usr/local/bin/prometheus \
  --config.file=/etc/prometheus/prometheus.yml \
  --storage.tsdb.path=/var/lib/prometheus \
  --storage.tsdb.retention.time=30d \
  --storage.tsdb.retention.size=50GB \
  --web.console.templates=/etc/prometheus/consoles \
  --web.console.libraries=/etc/prometheus/console_libraries \
  --web.listen-address=:9090 \
  --web.enable-lifecycle \
  --web.enable-admin-api

Restart=always
RestartSec=10s

[Install]
WantedBy=multi-user.target

Key flags:

--storage.tsdb.retention.time=30d: 30d data
--storage.tsdb.retention.size=50GB: 50GB cap (whichever first)
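--storage.tsdb.wal-compression: compresses WAL, ↓disk usage (default on since v2.20, so omitted from unit above)
--web.enable-lifecycle: Reload via HTTP POST
--web.enable-admin-api: Snapshot + delete APIs (no auth; restrict network access)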
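
To see which of the two retention limits is likely to trigger first, the rule of thumb from the Prometheus storage docs can be applied: disk ≈ retention seconds × ingested samples per second × bytes per sample, at roughly 1 to 2 bytes per sample. A rough sketch follows; the function names and the 500,000 active series figure are assumptions for illustration, and GB is taken as 1024³ bytes, as Prometheus does for size flags.

```python
def estimate_disk_bytes(active_series, scrape_interval_s, retention_days, bytes_per_sample=2.0):
    """Rough TSDB size for a retention window (excludes WAL and compaction overhead)."""
    samples_per_second = active_series / scrape_interval_s
    return retention_days * 86400 * samples_per_second * bytes_per_sample

def days_until_size_cap(active_series, scrape_interval_s, cap_bytes, bytes_per_sample=2.0):
    """Approximate days of data that fit under a retention.size cap."""
    bytes_per_day = 86400 * (active_series / scrape_interval_s) * bytes_per_sample
    return cap_bytes / bytes_per_day

GIB = 1024 ** 3
series = 500_000  # hypothetical number of active series

needed = estimate_disk_bytes(series, scrape_interval_s=15, retention_days=30)
days = days_until_size_cap(series, scrape_interval_s=15, cap_bytes=50 * GIB)
print(f"30d needs about {needed / GIB:.0f} GiB; a 50GB cap holds about {days:.1f} days")
```

With these assumed numbers the size cap is reached long before 30 days, so the size limit would be the one that actually governs retention.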

Enable + start:

sudo systemctl daemon-reload
sudo systemctl enable prometheus
sudo systemctl start prometheus
sudo systemctl status prometheus

→ Retains per policy, disk within limits, old auto-pruned.

If err:

Disk: du -sh /var/lib/prometheus
TSDB: curl http://localhost:9090/api/v1/status/tsdb
Retention: curl http://localhost:9090/api/v1/status/runtimeinfo | jq .data.storageRetention
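Force cleanup (destructive: deletes ALL series, needs --web.enable-admin-api): curl -g -X POST 'http://localhost:9090/api/v1/admin/tsdb/delete_series?match[]={__name__=~".%2B"}' then curl -X POST http://localhost:9090/api/v1/admin/tsdb/clean_tombstones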

Step 5: Federation (Multi-Cluster)

Hierarchical for aggregating across clusters.

Edge instances per cluster, set external labels:

global:
  external_labels:
    cluster: 'production-east'
    datacenter: 'us-east-1'

Central add federation scrape:

scrape_configs:
  - job_name: 'federate-production'
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        # Aggregate only pre-computed recording rules
        - '{__name__=~"job:.*"}'
        # Include alert states
        - '{__name__=~"ALERTS.*"}'
        # Include critical infrastructure metrics
        - 'up{job=~".*"}'
    static_configs:
      - targets:
          - 'prometheus-east.example.com:9090'
          - 'prometheus-west.example.com:9090'
        labels:
          env: 'production'
    relabel_configs:
      - source_labels: [__address__]
        target_label: instance
      - source_labels: [__address__]
        regex: 'prometheus-(.*)\.example\.com.*'
        target_label: cluster
        replacement: '$1'

Best practices:

honor_labels: true preserves original
Federate only recording rules + aggregates (not raw)
Scrape intervals longer than edge eval
match[] filters → don't federate everything
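
When testing the match[] filters by hand, the selectors have to be URL-encoded, since characters like {, " and [ do not survive an unquoted URL. A small sketch of building a /federate URL with the standard library; federate_url is a hypothetical helper and the host is the example one used above.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

def federate_url(base_url, selectors):
    """Build a /federate URL with one URL-encoded match[] parameter per selector."""
    query = urlencode([("match[]", selector) for selector in selectors])
    return f"{base_url.rstrip('/')}/federate?{query}"

selectors = ['{__name__=~"job:.*"}', '{__name__=~"ALERTS.*"}', "up"]
url = federate_url("http://prometheus-east.example.com:9090", selectors)
print(url)

# Decoding the query string returns the original selectors unchanged.
print(parse_qs(urlsplit(url).query)["match[]"])
```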

→ Central shows federated metrics from all, queries span regions, min duplication.

If err:

Endpoint accessible: curl -sG http://prometheus-east.example.com:9090/federate --data-urlencode 'match[]={__name__=~"job:.*"}' | head -20
Label conflicts (central vs edge external)
Federation lag: compare timestamps
Match patterns: curl http://localhost:9090/api/v1/label/__name__/values | jq .data | grep "job:"

Step 6: HA (Optional)

Redundant instances identical configs for failover.

Thanos|Cortex for true HA, or load-balanced:

# prometheus-1.yml and prometheus-2.yml (identical configs)
global:
  scrape_interval: 15s
  external_labels:
    prometheus: 'prometheus-1'  # Different per instance
    replica: 'A'

# Use --web.external-url flag for each instance
# prometheus-1: --web.external-url=http://prometheus-1.example.com:9090
# prometheus-2: --web.external-url=http://prometheus-2.example.com:9090

Grafana queries both:

{
  "name": "Prometheus-HA",
  "type": "prometheus",
  "url": "http://prometheus-lb.example.com",
  "jsonData": {
    "httpMethod": "POST",
    "timeInterval": "15s"
  }
}

nginx for LB (HAProxy works too):

upstream prometheus_backend {
    server prometheus-1.example.com:9090 max_fails=3 fail_timeout=30s;
    server prometheus-2.example.com:9090 max_fails=3 fail_timeout=30s;
}

server {
    listen 9090;
    location / {
        proxy_pass http://prometheus_backend;
        proxy_set_header Host $host;
    }
}

→ Queries balanced, auto-failover if 1 down, queries still served on single-instance fail (failed replica has a gap for its downtime).

If err:

Both scrape same targets (slight time skew OK)
Config drift between
Dedup in queries (Grafana shows dup series)
LB health checks
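
Duplicate series from the two replicas differ only in their replica-identifying external labels, so deduplication comes down to dropping those labels and keeping one series per remaining label set. The sketch below shows the idea on in-memory data. It is a simplification and not how Thanos or Cortex implement it; deduplicate is a hypothetical helper and "first replica wins" is an arbitrary choice.

```python
def deduplicate(series, replica_labels=("replica", "prometheus")):
    """Keep one series per label set after dropping the replica-identifying labels."""
    seen = {}
    for labels, value in series:
        key = tuple(sorted((k, v) for k, v in labels.items() if k not in replica_labels))
        seen.setdefault(key, (dict(key), value))  # first replica seen wins
    return list(seen.values())

# The same target scraped by both replicas (hypothetical sample values).
series = [
    ({"__name__": "up", "job": "node", "instance": "node1:9100",
      "prometheus": "prometheus-1", "replica": "A"}, 1.0),
    ({"__name__": "up", "job": "node", "instance": "node1:9100",
      "prometheus": "prometheus-2", "replica": "B"}, 1.0),
    ({"__name__": "up", "job": "node", "instance": "node2:9100",
      "prometheus": "prometheus-1", "replica": "A"}, 0.0),
]
for labels, value in deduplicate(series):
    print(labels, value)
```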

Check

Traps

High cardinality: Avoid unbounded labels (user IDs, timestamps, UUIDs). Drop them w/ metric_relabel_configs before ingestion; recording rules agg only after storage.
Scrape interval mismatch: Recording rules eval ≥ scrape intervals → no gaps.
Federation overload: All metrics = massive dup. Only federate aggregated rules.
Missing relabel: Service discovery → confusing|dup labels w/o relabel.
Retention too short: Set longer than longest dashboard window → no "no data" gaps.
No resource limits: High cardinality → excessive mem. Cap series per scrape w/ sample_limit, set systemd MemoryMax=, monitor heap + prometheus_tsdb_head_series.
Disabled lifecycle: W/o --web.enable-lifecycle, POST /-/reload is refused; reload w/ SIGHUP instead. Full restart → scrape gaps.
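
The high-cardinality trap is easiest to see as arithmetic: the worst-case series count for one metric is the product of the distinct values of each label. A back-of-the-envelope sketch with invented label counts:

```python
from math import prod

def worst_case_series(label_cardinalities):
    """Upper bound on series for one metric: product of distinct values per label."""
    return prod(label_cardinalities.values())

# Hypothetical label value counts for http_requests_total.
bounded = {"job": 2, "endpoint": 20, "method": 4, "status": 6}
unbounded = {**bounded, "user_id": 10_000}

print(worst_case_series(bounded))
print(worst_case_series(unbounded))
```

Real counts are usually lower because not every combination occurs, but a single unbounded label still multiplies the total by its number of values.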

→ Related skills

configure-alerting-rules — alerting rules + Alertmanager routing
build-grafana-dashboards — visualize w/ Grafana
define-slo-sli-sla — SLO/SLI via recording rules + error budget
instrument-distributed-tracing — complement metrics w/ tracing

GitHub repository

pjt222/agent-almanac

Path: i18n/caveman-ultra/skills/setup-prometheus-monitoring


Frequently asked questions

What is the setup-prometheus-monitoring skill?

setup-prometheus-monitoring is a Claude Skill by pjt222. Skills package instructions and resources that Claude loads on demand, so Claude can perform setup-prometheus-monitoring-related tasks without extra prompting.

How do I install setup-prometheus-monitoring?

Use the install commands on this page: add setup-prometheus-monitoring to Claude Code as a plugin, or clone its repository into your skills directory, then restart Claude so it picks up the skill.

What category does setup-prometheus-monitoring belong to?

setup-prometheus-monitoring is in the Other category, tagged general.

Is setup-prometheus-monitoring free to use?

Yes. setup-prometheus-monitoring is listed on AIMCP and free to install. It runs inside Claude, so no separate service account is required to use the skill itself.
