setup-prometheus-monitoring

pjt222
Updated 6 days ago

About

This skill configures Prometheus for comprehensive metrics collection, including scrape configurations, service discovery, and recording rules. It's designed for setting up centralized monitoring of microservices, implementing time-series tracking for applications and infrastructure, and establishing SLO/SLI foundations. Use it when deploying modern observability stacks or migrating from legacy monitoring solutions.

Quick install

Claude Code

Recommended (primary method):
npx skills add pjt222/agent-almanac -a claude-code
Plugin command (alternative):
/plugin add https://github.com/pjt222/agent-almanac
Git clone (alternative):
git clone https://github.com/pjt222/agent-almanac.git ~/.claude/skills/setup-prometheus-monitoring

Copy and paste one of these commands into Claude Code to install the skill.

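Skill documentation

Set Up Prometheus Monitoring

Configure a production-ready Prometheus deployment with scrape targets, recording rules, and federation.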

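Use this skill when:

  • Setting up centralized metrics collection for microservices or distributed systems
  • Implementing time-series monitoring for application and infrastructure metrics
  • Establishing a foundation for SLO/SLI tracking and alerting
  • Aggregating metrics across multiple Prometheus instances via federation
  • Migrating from a legacy monitoring solution to a modern observability stack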

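Inputs:

  • List of scrape targets (services, exporters, endpoints)
  • Retention period and storage requirements
  • Existing service discovery mechanism (Kubernetes, Consul, EC2)
  • Recording rules for pre-aggregated metrics
  • Federation hierarchy for multi-cluster setups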

Step 1: Install and configure Prometheus

Create a base Prometheus configuration with global settings and scrape intervals:

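# Service account used by the chown and systemd steps (assumed not to exist yet)
sudo useradd --system --no-create-home --shell /bin/false prometheus
# Root-owned paths need sudo
sudo mkdir -p /etc/prometheus/{rules,file_sd}
sudo mkdir -p /var/lib/prometheus

cd /tmp
wget https://github.com/prometheus/prometheus/releases/download/v2.48.0/prometheus-2.48.0.linux-amd64.tar.gz
tar xvf prometheus-2.48.0.linux-amd64.tar.gz
sudo cp prometheus-2.48.0.linux-amd64/{prometheus,promtool} /usr/local/bin/
# Console assets referenced by the --web.console.* flags in the systemd unit
sudo cp -r prometheus-2.48.0.linux-amd64/{consoles,console_libraries} /etc/prometheus/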

/etc/prometheus/prometheus.yml

global:
  scrape_interval: 15s
  scrape_timeout: 10s
  evaluation_interval: 15s
  external_labels:
    cluster: 'production'
    region: 'us-east-1'

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - localhost:9093

rule_files:
  - "rules/*.yml"

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
        labels:
          env: 'production'

  - job_name: 'node'
    static_configs:
      - targets:
          - 'node1:9100'
          - 'node2:9100'
        labels:
          env: 'production'

  - job_name: 'app-services'
    file_sd_configs:
      - files:
          - '/etc/prometheus/file_sd/services.json'
        refresh_interval: 30s
    relabel_configs:
      - source_labels: [__address__]
        target_label: instance
      - source_labels: [env]
        target_label: environment

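Expected: Prometheus starts, the web UI is reachable at http://localhost:9090, and the targets are listed under Status > Targets.

On failure:

  • Check the config syntax: promtool check config /etc/prometheus/prometheus.yml
  • Fix file ownership: sudo chown -R prometheus:prometheus /etc/prometheus /var/lib/prometheus
  • Check the logs: journalctl -u prometheus -f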
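The Status > Targets check can also be scripted against the HTTP API. The sketch below is a minimal example, assuming the /api/v1/targets response shape (data.activeTargets entries with labels, health, scrapeUrl, and lastError fields). The sample payload is hypothetical and stands in for a real response; the function name is an illustration, not part of any Prometheus tooling.

```python
import json

def down_targets(targets_response):
    """Return (job, scrapeUrl, lastError) for each active target whose health is not 'up'."""
    active = targets_response.get("data", {}).get("activeTargets", [])
    return [
        (t.get("labels", {}).get("job", ""), t.get("scrapeUrl", ""), t.get("lastError", ""))
        for t in active
        if t.get("health") != "up"
    ]

# Hypothetical payload shaped like the body of GET /api/v1/targets
SAMPLE = json.loads("""
{
  "status": "success",
  "data": {
    "activeTargets": [
      {"labels": {"job": "node"}, "scrapeUrl": "http://node1:9100/metrics",
       "health": "up", "lastError": ""},
      {"labels": {"job": "node"}, "scrapeUrl": "http://node2:9100/metrics",
       "health": "down", "lastError": "connection refused"}
    ]
  }
}
""")

for job, url, error in down_targets(SAMPLE):
    print(f"{job}: {url} is down ({error})")
```

Against a live server, the same function could be fed the parsed output of curl http://localhost:9090/api/v1/targets.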

Step 2: Configure service discovery

Set up dynamic target discovery so targets do not have to be managed by hand.

Kubernetes

  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      # The regex below expects "address;port", so both labels must be listed
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        target_label: __address__
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
      - source_labels: [__meta_kubernetes_namespace]
        target_label: kubernetes_namespace
      - source_labels: [__meta_kubernetes_pod_name]
        target_label: kubernetes_pod_name
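Prometheus joins the values of source_labels with ";" before applying a relabel regex, and the regex must match the whole joined string. The sketch below mimics the address-rewrite rule from the kubernetes-pods job in Python to show what its regex does to an address and a port annotation. It is an illustration under those assumptions, not Prometheus code, and the function name is hypothetical.

```python
import re

# Regex from the kubernetes-pods relabel rule; the replacement there is "$1:$2".
ADDRESS_RE = re.compile(r"([^:]+)(?::\d+)?;(\d+)")

def rewrite_address(address, port_annotation):
    """Emulate the 'replace' action: join the source labels with ';' and rewrite on a full match."""
    joined = f"{address};{port_annotation}"
    match = ADDRESS_RE.fullmatch(joined)  # relabel regexes are anchored at both ends
    if match is None:
        return address  # no match: the target label keeps its current value
    return f"{match.group(1)}:{match.group(2)}"

print(rewrite_address("10.0.0.5:8080", "9102"))
print(rewrite_address("10.0.0.5", "9102"))
print(rewrite_address("10.0.0.5:8080", ""))
```

The first two calls swap in the annotated port; the third leaves the address alone because a pod without the port annotation yields an empty value that the regex does not match.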

File-based service discovery: create /etc/prometheus/file_sd/services.json

[
  {
    "targets": ["web-app-1:8080", "web-app-2:8080"],
    "labels": {
      "job": "web-app",
      "env": "production",
      "team": "platform"
    }
  },
  {
    "targets": ["api-service-1:9090", "api-service-2:9090"],
    "labels": {
      "job": "api-service",
      "env": "production",
      "team": "backend"
    }
  }
]
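Files like this are often generated by a deploy script rather than edited by hand. The following is a minimal sketch, assuming the generator runs on the Prometheus host; it writes the target groups to a temporary file and renames it into place so that a half-written file is never picked up. The function name and the demo path are hypothetical.

```python
import json
import os
import tempfile

def write_file_sd(path, groups):
    """Write target groups as file_sd JSON, replacing the destination in one rename."""
    for group in groups:
        if not group.get("targets"):
            raise ValueError("each group needs a non-empty 'targets' list")
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        json.dump(groups, handle, indent=2)
    os.chmod(tmp_path, 0o644)    # mkstemp creates 0600; the prometheus user must be able to read it
    os.replace(tmp_path, path)   # rename within the same directory

if __name__ == "__main__":
    groups = [
        {"targets": ["web-app-1:8080", "web-app-2:8080"],
         "labels": {"job": "web-app", "env": "production", "team": "platform"}},
    ]
    with tempfile.TemporaryDirectory() as demo_dir:  # stand-in for /etc/prometheus/file_sd
        target = os.path.join(demo_dir, "services.json")
        write_file_sd(target, groups)
        with open(target) as handle:
            print(handle.read())
```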

Consul

  - job_name: 'consul-services'
    consul_sd_configs:
      - server: 'consul.example.com:8500'
        services: []
    relabel_configs:
      - source_labels: [__meta_consul_service]
        target_label: job
      - source_labels: [__meta_consul_tags]
        regex: '.*,monitoring,.*'
        action: keep

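Expected: dynamically discovered targets appear in the Prometheus UI and update automatically as services change or scale.

On failure:

  • Kubernetes: verify RBAC with kubectl auth can-i list pods --as=system:serviceaccount:monitoring:prometheus
  • File SD: validate the JSON with python -m json.tool /etc/prometheus/file_sd/services.json
  • Consul: test connectivity with curl http://consul.example.com:8500/v1/catalog/services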

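Step 3: Create recording rules

Pre-aggregate expensive queries to improve dashboard performance and alerting efficiency.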

/etc/prometheus/rules/recording_rules.yml

groups:
  - name: api_aggregations
    interval: 30s
    rules:
      - record: job:http_requests:rate5m
        expr: |
          sum by (job, endpoint, method) (
            rate(http_requests_total[5m])
          )

      - record: job:http_errors:rate5m
        expr: |
          sum by (job) (
            rate(http_requests_total{status=~"5.."}[5m])
          ) / sum by (job) (
            rate(http_requests_total[5m])
          ) * 100

      - record: job:http_request_duration_seconds:p95
        expr: |
          histogram_quantile(0.95,
            sum by (job, endpoint, le) (
              rate(http_request_duration_seconds_bucket[5m])
            )
          )

  - name: resource_aggregations
    interval: 1m
    rules:
      - record: instance:cpu_usage:ratio
        expr: |
          1 - avg by (instance) (
            rate(node_cpu_seconds_total{mode="idle"}[5m])
          )

      - record: instance:memory_usage:ratio
        expr: |
          1 - (
            node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes
          )

      - record: instance:disk_usage:ratio
        expr: |
          1 - (
            node_filesystem_avail_bytes{fstype!~"tmpfs|fuse.*"}
            / node_filesystem_size_bytes{fstype!~"tmpfs|fuse.*"}
          )
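The p95 rule relies on histogram_quantile, which estimates a quantile by linear interpolation inside the bucket where the target rank falls. The sketch below is a simplified Python model of that idea, assuming cumulative bucket counts sorted by upper bound and ending in a +Inf bucket; it skips edge cases the real function handles (for example negative bucket bounds). The bucket values are made up for illustration.

```python
import math

def histogram_quantile(q, buckets):
    """Estimate quantile q from (upper_bound, cumulative_count) pairs ending with +Inf."""
    if not buckets or not math.isinf(buckets[-1][0]):
        return math.nan
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    # First bucket whose cumulative count reaches the rank
    idx = next(i for i, (_, count) in enumerate(buckets) if count >= rank)
    if idx == len(buckets) - 1:
        # Rank lands in the +Inf bucket: report the highest finite bound instead
        return buckets[-2][0] if len(buckets) > 1 else math.nan
    upper, count = buckets[idx]
    lower, below = (0.0, 0.0) if idx == 0 else buckets[idx - 1]
    if count == below:
        return upper
    # Assume observations are spread evenly inside the bucket
    return lower + (upper - lower) * (rank - below) / (count - below)

# Hypothetical latency buckets in seconds: 50 requests <= 0.1s, 90 <= 0.5s, 100 <= 1s
buckets = [(0.1, 50), (0.5, 90), (1.0, 100), (math.inf, 100)]
print(histogram_quantile(0.95, buckets))
```

With these buckets the 95th request out of 100 falls halfway through the 0.5s to 1.0s bucket, so the estimate is about 0.75 seconds. The result can never be more precise than the bucket boundaries allow.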

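Validate and reload:

# Validate the rule file first
promtool check rules /etc/prometheus/rules/recording_rules.yml

# Reload over HTTP (requires --web.enable-lifecycle)
curl -X POST http://localhost:9090/-/reload

# Or send SIGHUP instead
sudo killall -HUP prometheus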

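Expected: the recording rules evaluate successfully, the new metrics appear in Prometheus with the job: prefix, and dashboard query performance improves.

On failure:

  • Check rule syntax with promtool check rules
  • Verify that the evaluation interval matches how often the source data is available
  • Check for missing source metrics: curl http://localhost:9090/api/v1/targets
  • Check the logs for evaluation errors: journalctl -u prometheus | grep -i error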

Step 4: Configure storage and retention

Tune storage for retention requirements and query performance.

/etc/systemd/system/prometheus.service

[Unit]
Description=Prometheus Monitoring System
Documentation=https://prometheus.io/docs/introduction/overview/
After=network-online.target

[Service]
Type=simple
User=prometheus
Group=prometheus
ExecStart=/usr/local/bin/prometheus \
  --config.file=/etc/prometheus/prometheus.yml \
  --storage.tsdb.path=/var/lib/prometheus \
  --storage.tsdb.retention.time=30d \
  --storage.tsdb.retention.size=50GB \
  --web.console.templates=/etc/prometheus/consoles \
  --web.console.libraries=/etc/prometheus/console_libraries \
  --web.listen-address=:9090 \
  --web.enable-lifecycle \
  --web.enable-admin-api

Restart=always
RestartSec=10s

[Install]
WantedBy=multi-user.target

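Key storage flags:

  • --storage.tsdb.retention.time=30d: keep 30 days of data
  • --storage.tsdb.retention.size=50GB: cap storage at 50GB (whichever limit is reached first applies)
  • --storage.tsdb.wal-compression: compress the WAL to reduce disk I/O (enabled by default since Prometheus 2.20, so the unit file above omits it)
  • --web.enable-lifecycle: allow config reload via HTTP POST
  • --web.enable-admin-api: enable the snapshot and delete APIs

Enable and start the service: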

sudo systemctl daemon-reload
sudo systemctl enable prometheus
sudo systemctl start prometheus
sudo systemctl status prometheus

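Expected: Prometheus retains metrics according to policy, disk usage stays within the limit, and old data is pruned automatically.

On failure:

  • Check disk usage: du -sh /var/lib/prometheus
  • Check TSDB stats: curl http://localhost:9090/api/v1/status/tsdb
  • Verify the retention setting: curl http://localhost:9090/api/v1/status/runtimeinfo | jq .data.storageRetention
  • Force cleanup (this deletes every series; the URL is quoted, -g disables curl globbing, and + is encoded as %2B): curl -g -X POST 'http://localhost:9090/api/v1/admin/tsdb/delete_series?match[]={__name__=~".%2B"}', then reclaim disk with curl -X POST http://localhost:9090/api/v1/admin/tsdb/clean_tombstones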
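Whether the time limit or the size limit is hit first depends on the ingestion rate. A rough estimate, following the rule of thumb in the Prometheus storage documentation (retention seconds × samples per second × bytes per sample, with roughly 1 to 2 bytes per sample), can be sketched as below. The ingestion rates are hypothetical, and the 50GB cap is treated as 50 × 1024³ bytes on the assumption that the flag's units are powers of two.

```python
def estimate_disk_bytes(retention_days, samples_per_second, bytes_per_sample=2.0):
    """Rough disk need: retention seconds * ingestion rate * average bytes per sample."""
    return retention_days * 86_400 * samples_per_second * bytes_per_sample

def fits_budget(retention_days, samples_per_second, budget_bytes, bytes_per_sample=2.0):
    """True if the time-based retention is expected to fit under the size cap."""
    return estimate_disk_bytes(retention_days, samples_per_second, bytes_per_sample) <= budget_bytes

BUDGET = 50 * 1024**3  # --storage.tsdb.retention.size=50GB, assuming power-of-two units

for rate in (10_000, 20_000):  # hypothetical samples ingested per second
    need_gib = estimate_disk_bytes(30, rate) / 1024**3
    verdict = "fits" if fits_budget(30, rate, BUDGET) else "size limit triggers first"
    print(f"{rate} samples/s over 30d: ~{need_gib:.1f} GiB ({verdict})")
```

The actual ingestion rate can be read from Prometheus itself with rate(prometheus_tsdb_head_samples_appended_total[5m]).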

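Step 5: Set up federation (multi-cluster)

Configure a hierarchy of Prometheus servers to aggregate metrics across clusters.

On the edge Prometheus in each cluster, make sure external labels are set: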

global:
  external_labels:
    cluster: 'production-east'
    datacenter: 'us-east-1'

On the central Prometheus, add the federation scrape config:

scrape_configs:
  - job_name: 'federate-production'
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{__name__=~"job:.*"}'
        - '{__name__=~"ALERTS.*"}'
        - 'up{job=~".*"}'
    static_configs:
      - targets:
          - 'prometheus-east.example.com:9090'
          - 'prometheus-west.example.com:9090'
        labels:
          env: 'production'
    relabel_configs:
      - source_labels: [__address__]
        target_label: instance
      - source_labels: [__address__]
        regex: 'prometheus-(.*)\.example\.com.*'
        target_label: cluster
        replacement: '$1'

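Federation best practices:

  • honor_labels: true preserves the original labels
  • Federate only recording rules and aggregates (not raw metrics)
  • Set an appropriate scrape interval (longer than the edge evaluation interval)
  • Filter metrics with match[] (avoid federating everything)

Expected: the central Prometheus shows federated metrics from all clusters, queries can span regions, and data duplication is minimal.

On failure:

  • Verify the federation endpoint is reachable: curl -g 'http://prometheus-east.example.com:9090/federate?match[]={__name__=~"job:.*"}' | head -20
  • Check for label conflicts (central vs. edge external labels)
  • Check federation lag by comparing sample timestamps
  • Check that the match patterns select something: curl http://localhost:9090/api/v1/label/__name__/values | jq .data | grep "job:"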
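Selectors passed to /federate contain braces, quotes, and brackets that shells and HTTP clients tend to mangle. One way to avoid quoting problems when testing by hand is to build the URL programmatically. The sketch below uses only the Python standard library; the helper name is hypothetical and the host is the example one from the config above.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

def federate_url(base, selectors):
    """Build a /federate URL with one percent-encoded match[] parameter per selector."""
    query = urlencode([("match[]", selector) for selector in selectors])
    return f"{base.rstrip('/')}/federate?{query}"

SELECTORS = ['{__name__=~"job:.*"}', '{__name__=~"ALERTS.*"}', 'up{job=~".*"}']
url = federate_url("http://prometheus-east.example.com:9090", SELECTORS)
print(url)

# Decoding the query string gives back the original selectors
print(parse_qs(urlsplit(url).query)["match[]"])
```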

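Step 6: Implement high availability (optional)

Deploy redundant Prometheus instances with identical configuration for failover.

Use Thanos or Cortex for true HA, or a simple load-balanced pair: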

global:
  scrape_interval: 15s
  external_labels:
    prometheus: 'prometheus-1'
    replica: 'A'

Configure Grafana to query both instances through the load balancer:

{
  "name": "Prometheus-HA",
  "type": "prometheus",
  "url": "http://prometheus-lb.example.com",
  "jsonData": {
    "httpMethod": "POST",
    "timeInterval": "15s"
  }
}

Use HAProxy or nginx as the load balancer:

upstream prometheus_backend {
    server prometheus-1.example.com:9090 max_fails=3 fail_timeout=30s;
    server prometheus-2.example.com:9090 max_fails=3 fail_timeout=30s;
}

server {
    listen 9090;
    location / {
        proxy_pass http://prometheus_backend;
        proxy_set_header Host $host;
    }
}

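Expected: query requests are balanced across instances, failover is automatic when one instance fails, and no data is lost on a single-instance failure.

On failure:

  • Verify both instances scrape the same targets (slight timing skew is acceptable)
  • Check for configuration drift between the two instances
  • Check query deduplication (Grafana showing duplicate series)
  • Check the load balancer health checks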
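Duplicate series usually show up when results from both replicas are merged and differ only in the replica label. A deduplicating query layer (Thanos, for instance, is configured with a replica label for this purpose) conceptually drops that label and keeps one copy per remaining label set. The sketch below illustrates the idea on hypothetical query results; it is not how any particular tool implements it.

```python
def dedupe_series(series, replica_label="replica"):
    """Keep one series per label set after dropping the replica label (first copy wins)."""
    kept = {}
    for item in series:
        key = tuple(sorted((k, v) for k, v in item["metric"].items() if k != replica_label))
        kept.setdefault(key, item["value"])
    return [{"metric": dict(key), "value": value} for key, value in kept.items()]

# Hypothetical instant-query results merged from two replicas
merged = [
    {"metric": {"__name__": "up", "job": "node", "replica": "A"}, "value": 1},
    {"metric": {"__name__": "up", "job": "node", "replica": "B"}, "value": 1},
    {"metric": {"__name__": "up", "job": "api", "replica": "A"}, "value": 0},
]
for row in dedupe_series(merged):
    print(row)
```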

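Validation:

  • The Prometheus web UI is reachable on the expected port
  • All configured scrape targets show UP under Status > Targets
  • Service discovery adds and removes targets dynamically as expected
  • Recording rules evaluate successfully (no errors in the logs)
  • Metric retention matches the configured time and size limits
  • Federation (if configured) pulls metrics from the edge instances
  • Queries return the expected metric cardinality (not excessive)
  • Disk usage is stable within the configured storage budget
  • Config reload works via the HTTP endpoint or SIGHUP
  • Prometheus self-monitoring metrics are available (up, scrape duration, etc.)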

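Common pitfalls:

  • High-cardinality metrics: avoid labels with unbounded values (user IDs, timestamps, UUIDs). Aggregate with recording rules before storing.
  • Scrape interval mismatch: keep the recording rule evaluation interval greater than or equal to the scrape interval to avoid gaps.
  • Federation overload: federating all metrics creates heavy data duplication. Federate only aggregated recording rules.
  • Missing relabel config: without proper relabeling, service discovery can produce confusing or duplicate labels.
  • Retention too short: keep retention longer than the longest dashboard time window to avoid "no data" gaps.
  • No resource limits: Prometheus can consume large amounts of memory under high cardinality. Set --storage.tsdb.max-block-duration and monitor heap usage.
  • Lifecycle endpoint disabled: without --web.enable-lifecycle, the HTTP reload endpoint is unavailable. Reload with SIGHUP instead; a full restart would cause scrape gaps.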
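The high-cardinality pitfall is easier to catch with data from the TSDB status endpoint. The sketch below ranks metric names by series count, assuming the /api/v1/status/tsdb response contains a data.seriesCountByMetricName list of name/value pairs. The sample numbers are invented and the function name is hypothetical.

```python
def top_metrics(tsdb_status, limit=5):
    """Return the metric names with the most series as (name, series_count) pairs."""
    rows = tsdb_status.get("data", {}).get("seriesCountByMetricName", [])
    ranked = sorted(rows, key=lambda row: row["value"], reverse=True)
    return [(row["name"], row["value"]) for row in ranked[:limit]]

# Hypothetical excerpt of a GET /api/v1/status/tsdb response
STATUS = {
    "status": "success",
    "data": {
        "seriesCountByMetricName": [
            {"name": "http_request_duration_seconds_bucket", "value": 48210},
            {"name": "up", "value": 42},
            {"name": "http_requests_total", "value": 9120},
        ]
    },
}

for name, count in top_metrics(STATUS, limit=2):
    print(f"{name}: {count} series")
```

Metrics at the top of this list are the first candidates for dropping labels or for replacement by a recording rule.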

  • configure-alerting-rules
  • build-grafana-dashboards
  • define-slo-sli-sla
  • instrument-distributed-tracing

GitHub repository

pjt222/agent-almanac
Path: i18n/wenyan-ultra/skills/setup-prometheus-monitoring