setup-prometheus-monitoring
About
This skill configures a production-ready Prometheus monitoring system with scrape configurations, service discovery, and recording rules. It enables centralized time-series metrics collection for microservices and infrastructure, serving as a foundation for SLOs and alerting. Use it to establish modern observability or migrate from legacy monitoring systems.
Quick Install
Claude Code
Recommendednpx skills add pjt222/agent-almanac -a claude-code/plugin add https://github.com/pjt222/agent-almanacgit clone https://github.com/pjt222/agent-almanac.git ~/.claude/skills/setup-prometheus-monitoringCopy and paste this command in Claude Code to install this skill
Documentation
Setup Prometheus Monitoring
Configure prod-ready Prometheus w/ scrape targets, recording rules, federation.
Use When
- Centralized metrics for microservices|distributed
- Time-series monitor app+infra
- Foundation for SLO/SLI + alerting
- Consolidate metrics from multi Prometheus via federation
- Migrate legacy → modern observability
In
- Required: Scrape targets (services, exporters, endpoints)
- Required: Retention period + storage reqs
- Optional: Existing service discovery (K8s, Consul, EC2)
- Optional: Recording rules for pre-agg metrics
- Optional: Federation hierarchy multi-cluster
Do
Step 1: Install + Configure
# Create Prometheus directory structure
mkdir -p /etc/prometheus/{rules,file_sd}
mkdir -p /var/lib/prometheus
# Download Prometheus (adjust version as needed)
cd /tmp
wget https://github.com/prometheus/prometheus/releases/download/v2.48.0/prometheus-2.48.0.linux-amd64.tar.gz
tar xvf prometheus-2.48.0.linux-amd64.tar.gz
sudo cp prometheus-2.48.0.linux-amd64/{prometheus,promtool} /usr/local/bin/
/etc/prometheus/prometheus.yml:
global:
scrape_interval: 15s
scrape_timeout: 10s
evaluation_interval: 15s
external_labels:
cluster: 'production'
region: 'us-east-1'
# Alertmanager configuration
alerting:
alertmanagers:
- static_configs:
- targets:
- localhost:9093
# Load recording and alerting rules
rule_files:
- "rules/*.yml"
# Scrape configurations
scrape_configs:
# Prometheus self-monitoring
- job_name: 'prometheus'
static_configs:
- targets: ['localhost:9090']
labels:
env: 'production'
# Node exporter for host metrics
- job_name: 'node'
static_configs:
- targets:
- 'node1:9100'
- 'node2:9100'
labels:
env: 'production'
# Application metrics with file-based service discovery
- job_name: 'app-services'
file_sd_configs:
- files:
- '/etc/prometheus/file_sd/services.json'
refresh_interval: 30s
relabel_configs:
- source_labels: [__address__]
target_label: instance
- source_labels: [env]
target_label: environment
→ Prometheus starts, UI at http://localhost:9090, targets in Status > Targets.
If err:
- Syntax:
promtool check config /etc/prometheus/prometheus.yml - Perms:
sudo chown -R prometheus:prometheus /etc/prometheus /var/lib/prometheus - Logs:
journalctl -u prometheus -f
Step 2: Service Discovery
Dynamic targets → no manual.
K8s add to scrape_configs:
- job_name: 'kubernetes-pods'
kubernetes_sd_configs:
- role: pod
relabel_configs:
# Only scrape pods with prometheus.io/scrape annotation
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
action: keep
regex: true
# Use custom port if specified
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_port]
action: replace
target_label: __address__
regex: ([^:]+)(?::\d+)?;(\d+)
replacement: $1:$2
# Add namespace as label
- source_labels: [__meta_kubernetes_namespace]
target_label: kubernetes_namespace
# Add pod name as label
- source_labels: [__meta_kubernetes_pod_name]
target_label: kubernetes_pod_name
File-based /etc/prometheus/file_sd/services.json:
[
{
"targets": ["web-app-1:8080", "web-app-2:8080"],
"labels": {
"job": "web-app",
"env": "production",
"team": "platform"
}
},
{
"targets": ["api-service-1:9090", "api-service-2:9090"],
"labels": {
"job": "api-service",
"env": "production",
"team": "backend"
}
}
]
Consul:
- job_name: 'consul-services'
consul_sd_configs:
- server: 'consul.example.com:8500'
services: [] # Empty list means discover all services
relabel_configs:
- source_labels: [__meta_consul_service]
target_label: job
- source_labels: [__meta_consul_tags]
regex: '.*,monitoring,.*'
action: keep
→ Dynamic targets in UI, auto-update on scale|change.
If err:
- K8s: RBAC
kubectl auth can-i list pods --as=system:serviceaccount:monitoring:prometheus - File SD:
python -m json.tool /etc/prometheus/file_sd/services.json - Consul:
curl http://consul.example.com:8500/v1/catalog/services
Step 3: Recording Rules
Pre-aggregate expensive queries → dashboard perf + alerting efficiency.
/etc/prometheus/rules/recording_rules.yml:
groups:
- name: api_aggregations
interval: 30s
rules:
# Calculate request rate per endpoint (5m window)
- record: job:http_requests:rate5m
expr: |
sum by (job, endpoint, method) (
rate(http_requests_total[5m])
)
# Calculate error rate percentage
- record: job:http_errors:rate5m
expr: |
sum by (job) (
rate(http_requests_total{status=~"5.."}[5m])
) / sum by (job) (
rate(http_requests_total[5m])
) * 100
# P95 latency by endpoint
- record: job:http_request_duration_seconds:p95
expr: |
histogram_quantile(0.95,
sum by (job, endpoint, le) (
rate(http_request_duration_seconds_bucket[5m])
)
)
- name: resource_aggregations
interval: 1m
rules:
# CPU usage by instance
- record: instance:cpu_usage:ratio
expr: |
1 - avg by (instance) (
rate(node_cpu_seconds_total{mode="idle"}[5m])
)
# Memory usage percentage
- record: instance:memory_usage:ratio
expr: |
1 - (
node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes
)
# Disk usage by mount point
- record: instance:disk_usage:ratio
expr: |
1 - (
node_filesystem_avail_bytes{fstype!~"tmpfs|fuse.*"}
/ node_filesystem_size_bytes{fstype!~"tmpfs|fuse.*"}
)
Validate + reload:
# Validate rules syntax
promtool check rules /etc/prometheus/rules/recording_rules.yml
# Reload Prometheus configuration (without restart)
curl -X POST http://localhost:9090/-/reload
# Or send SIGHUP signal
sudo killall -HUP prometheus
→ Rules eval, new metrics w/ job: prefix, query perf improved.
If err:
promtool check rules- Eval interval matches data avail
- Missing source:
curl http://localhost:9090/api/v1/targets - Logs:
journalctl -u prometheus | grep -i error
Step 4: Storage + Retention
/etc/systemd/system/prometheus.service:
[Unit]
Description=Prometheus Monitoring System
Documentation=https://prometheus.io/docs/introduction/overview/
After=network-online.target
[Service]
Type=simple
User=prometheus
Group=prometheus
ExecStart=/usr/local/bin/prometheus \
--config.file=/etc/prometheus/prometheus.yml \
--storage.tsdb.path=/var/lib/prometheus \
--storage.tsdb.retention.time=30d \
--storage.tsdb.retention.size=50GB \
--web.console.templates=/etc/prometheus/consoles \
--web.console.libraries=/etc/prometheus/console_libraries \
--web.listen-address=:9090 \
--web.enable-lifecycle \
--web.enable-admin-api
Restart=always
RestartSec=10s
[Install]
WantedBy=multi-user.target
Key flags:
--storage.tsdb.retention.time=30d: 30d data--storage.tsdb.retention.size=50GB: 50GB cap (whichever first)--storage.tsdb.wal-compression: ↓disk I/O--web.enable-lifecycle: Reload via HTTP POST--web.enable-admin-api: Snapshot + delete APIs
Enable + start:
sudo systemctl daemon-reload
sudo systemctl enable prometheus
sudo systemctl start prometheus
sudo systemctl status prometheus
→ Retains per policy, disk within limits, old auto-pruned.
If err:
- Disk:
du -sh /var/lib/prometheus - TSDB:
curl http://localhost:9090/api/v1/status/tsdb - Retention:
curl http://localhost:9090/api/v1/status/runtimeinfo | jq .data.storageRetention - Force cleanup:
curl -X POST http://localhost:9090/api/v1/admin/tsdb/delete_series?match[]={__name__=~".+"}
Step 5: Federation (Multi-Cluster)
Hierarchical for aggregating across clusters.
Edge instances per cluster, set external labels:
global:
external_labels:
cluster: 'production-east'
datacenter: 'us-east-1'
Central add federation scrape:
scrape_configs:
- job_name: 'federate-production'
honor_labels: true
metrics_path: '/federate'
params:
'match[]':
# Aggregate only pre-computed recording rules
- '{__name__=~"job:.*"}'
# Include alert states
- '{__name__=~"ALERTS.*"}'
# Include critical infrastructure metrics
- 'up{job=~".*"}'
static_configs:
- targets:
- 'prometheus-east.example.com:9090'
- 'prometheus-west.example.com:9090'
labels:
env: 'production'
relabel_configs:
- source_labels: [__address__]
target_label: instance
- source_labels: [__address__]
regex: 'prometheus-(.*).example.com.*'
target_label: cluster
replacement: '$1'
Best practices:
honor_labels: truepreserves original- Federate only recording rules + aggregates (not raw)
- Scrape intervals longer than edge eval
match[]filters → don't federate everything
→ Central shows federated metrics from all, queries span regions, min duplication.
If err:
- Endpoint accessible:
curl http://prometheus-east.example.com:9090/federate?match[]={__name__=~"job:.*"} | head -20 - Label conflicts (central vs edge external)
- Federation lag: compare timestamps
- Match patterns:
curl http://localhost:9090/api/v1/label/__name__/values | jq .data | grep "job:"
Step 6: HA (Optional)
Redundant instances identical configs for failover.
Thanos|Cortex for true HA, or load-balanced:
# prometheus-1.yml and prometheus-2.yml (identical configs)
global:
scrape_interval: 15s
external_labels:
prometheus: 'prometheus-1' # Different per instance
replica: 'A'
# Use --web.external-url flag for each instance
# prometheus-1: --web.external-url=http://prometheus-1.example.com:9090
# prometheus-2: --web.external-url=http://prometheus-2.example.com:9090
Grafana queries both:
{
"name": "Prometheus-HA",
"type": "prometheus",
"url": "http://prometheus-lb.example.com",
"jsonData": {
"httpMethod": "POST",
"timeInterval": "15s"
}
}
HAProxy|nginx for LB:
upstream prometheus_backend {
server prometheus-1.example.com:9090 max_fails=3 fail_timeout=30s;
server prometheus-2.example.com:9090 max_fails=3 fail_timeout=30s;
}
server {
listen 9090;
location / {
proxy_pass http://prometheus_backend;
proxy_set_header Host $host;
}
}
→ Queries balanced, auto-failover if 1 down, no data loss single-instance fail.
If err:
- Both scrape same targets (slight time skew OK)
- Config drift between
- Dedup in queries (Grafana shows dup series)
- LB health checks
Check
- UI accessible
- All scrape targets UP in Status > Targets
- Service discovery dynamic add|remove
- Recording rules eval w/o errs
- Retention matches configured time|size
- Federation pulls from edge
- Queries return expected cardinality (not excessive)
- Disk stable + within budget
- Reload via HTTP|SIGHUP
- Self-monitor metrics (up, scrape duration)
Traps
- High cardinality: Avoid unbounded labels (user IDs, timestamps, UUIDs). Recording rules to agg before storage.
- Scrape interval mismatch: Recording rules eval ≥ scrape intervals → no gaps.
- Federation overload: All metrics = massive dup. Only federate aggregated rules.
- Missing relabel: Service discovery → confusing|dup labels w/o relabel.
- Retention too short: Set longer than longest dashboard window → no "no data" gaps.
- No resource limits: Excessive mem w/ high cardinality. Set
--storage.tsdb.max-block-duration+ monitor heap. - Disabled lifecycle: W/o
--web.enable-lifecycle, reloads need full restart → scrape gaps.
→
configure-alerting-rules— alerting rules + Alertmanager routingbuild-grafana-dashboards— visualize w/ Grafanadefine-slo-sli-sla— SLO/SLI via recording rules + error budgetinstrument-distributed-tracing— complement metrics w/ tracing
GitHub Repository
Related Skills
llamaguard
OtherLlamaGuard is Meta's 7-8B parameter model for moderating LLM inputs and outputs across six safety categories like violence and hate speech. It offers 94-95% accuracy and can be deployed using vLLM, Hugging Face, or Amazon SageMaker. Use this skill to easily integrate content filtering and safety guardrails into your AI applications.
cost-optimization
OtherThis Claude Skill helps developers optimize cloud costs through resource rightsizing, tagging strategies, and spending analysis. It provides a framework for reducing cloud expenses and implementing cost governance across AWS, Azure, and GCP. Use it when you need to analyze infrastructure costs, right-size resources, or meet budget constraints.
quantizing-models-bitsandbytes
OtherThis skill quantizes LLMs to 8-bit or 4-bit precision using bitsandbytes, achieving 50-75% memory reduction with minimal accuracy loss. It's ideal for running larger models on limited GPU memory or accelerating inference, supporting formats like INT8, NF4, and FP4. The skill integrates with HuggingFace Transformers and enables QLoRA training and 8-bit optimizers.
dispatching-parallel-agents
OtherThis Claude Skill dispatches multiple agents to investigate and fix 3+ independent problems concurrently. It is designed for scenarios involving unrelated failures that can be resolved without shared state or dependencies. The core capability is parallel problem-solving, assigning one agent per independent problem domain to maximize efficiency.
