define-slo-sli-sla
О программе
Этот навык Claude помогает разработчикам определять и внедрять измеримые цели надежности (SLO/SLI/SLA) с использованием Prometheus и инструментов вроде Sloth или Pyrra. Он позволяет отслеживать бюджет ошибок, настраивать оповещения по скорости сгорания и автоматизировать отчетность для балансировки между разработкой функций и надежностью системы. Используйте его при внедрении основанных на данных практик SRE для клиентских сервисов.
Быстрая установка
Claude Code
Рекомендуетсяnpx skills add pjt222/agent-almanac -a claude-code/plugin add https://github.com/pjt222/agent-almanacgit clone https://github.com/pjt222/agent-almanac.git ~/.claude/skills/define-slo-sli-slaСкопируйте и вставьте эту команду в Claude Code для установки этого навыка
Документация
Define SLO/SLI/SLA
Measurable reliability targets → SLIs track → err budget manage.
Use When
- Reliability targets → customer-facing svc/API
- Clear expect → provider ↔ consumer
- Feature velocity ↔ reliability via err budget
- Objective criteria → incident severity
- Arbitrary uptime → data-driven metrics
- SRE impl
- Svc quality → measure + improve
In
- Required: Svc desc + critical user journeys
- Required: Historical metrics (req rates, latencies, err rates)
- Optional: Existing SLA commitments
- Optional: Business reqs → availability/perf
- Optional: Incident history + customer impact
Do
See Extended Examples for complete configuration files and templates.
Step 1: SLI/SLO/SLA hierarchy
Relationship + diffs.
Definitions:
SLI (Service Level Indicator)
- **What**: A quantitative measure of service behavior
- **Example**: Request success rate, request latency, system throughput
- **Measurement**: `successful_requests / total_requests * 100`
SLO (Service Level Objective)
- **What**: Target value or range for an SLI over a time window
- **Example**: 99.9% of requests succeed in 30-day window
- **Purpose**: Internal reliability target to guide operations
SLA (Service Level Agreement)
- **What**: Contractual commitment with consequences for missing SLO
- **Example**: 99.9% uptime SLA with refunds if breached
- **Purpose**: External promise to customers with penalties
Hierarchy:
SLA (99.9% uptime, customer refunds)
├─ SLO (99.95% success rate, internal target)
│ └─ SLI (actual measured: 99.97% success rate)
└─ Error Budget (0.05% failures allowed per month)
Key: SLO stricter than SLA → buffer before customer impact.
Ex:
- SLA: 99.9% (promise)
- SLO: 99.95% (internal)
- Buffer: 0.05%
→ Team understands, SLI metrics agreed, SLO targets aligned.
If err:
- Read Google SRE book SLI/SLO/SLA chapters
- Stakeholder workshop → align defs
- Start w/ success-rate SLI before latency SLOs
Step 2: Select SLIs
Reflect user experience + business impact.
Four Golden Signals (Google SRE):
-
Latency: Req serve time
# P95 latency histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le) ) -
Traffic: Demand
# Requests per second sum(rate(http_requests_total[5m])) -
Errors: Failed req rate
# Error rate percentage sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) * 100 -
Saturation: How full
# CPU saturation avg(rate(node_cpu_seconds_total{mode!="idle"}[5m]))
Common SLI patterns:
# Availability SLI
availability:
description: "Percentage of successful requests"
query: |
sum(rate(http_requests_total{status!~"5.."}[5m]))
/ sum(rate(http_requests_total[5m]))
good_threshold: 0.999 # 99.9%
# Latency SLI
latency:
description: "P99 request latency under 500ms"
query: |
histogram_quantile(0.99,
sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
) < 0.5
good_threshold: 0.95 # 95% of windows meet target
# Throughput SLI
throughput:
description: "Requests processed per second"
query: |
sum(rate(http_requests_total[5m]))
good_threshold: 1000 # Minimum 1000 req/s
# Data freshness SLI (for batch jobs)
freshness:
description: "Data updated within last hour"
query: |
(time() - max(data_last_updated_timestamp)) < 3600
good_threshold: 1 # Always fresh
SLI criteria:
- User-visible → reflects experience
- Measurable → from existing metrics
- Actionable → team fixes via eng work
- Meaningful → correlates w/ customer satisfaction
- Simple → easy explain
Avoid:
- Internal sys metrics (CPU, mem) not user-visible
- Vanity metrics → no customer impact
- Complex composite scores
→ 2-4 SLIs/svc, availability+latency min, team agrees on queries.
If err:
- Map user journey → critical fail points
- Incident history → which metrics predicted impact?
- A/B test → degrade metric, measure complaints
- Start simple, iterate
Step 3: SLO targets + time windows
Realistic + achievable.
SLO spec format:
service: user-api
slos:
- name: availability
objective: 99.9
description: |
99.9% of requests return non-5xx status codes
# ... (see EXAMPLES.md for complete configuration)
Time window:
- 30d → external SLAs
- 7d → eng teams feedback
- 1d → high-freq svc
30d window err budget ex:
SLO: 99.9% availability over 30 days
Allowed failures: 0.1%
Total requests per month: 100M
Error budget: 100,000 failed requests
Daily budget: ~3,333 failed requests
Realistic targets:
-
Baseline perf:
# Check actual availability over past 90 days avg_over_time( (sum(rate(http_requests_total{status!~"5.."}[5m])) / sum(rate(http_requests_total[5m])))[90d:5m] ) # Result: 99.95% → Set SLO at 99.9% (safer than current) -
Cost of nines:
99% → 7.2 hours downtime/month (low reliability) 99.9% → 43 minutes downtime/month (good) 99.95% → 22 minutes downtime/month (very good) 99.99% → 4.3 minutes downtime/month (expensive) 99.999% → 26 seconds downtime/month (very expensive) -
Balance:
- Too strict → expensive, slow features
- Too loose → bad UX, churn
- Sweet spot → slightly > user expectations
→ SLOs set w/ buy-in, rationale docs, err budget calc.
If err:
- Start achievable (99% if 98.5% now)
- Iterate quarterly
- Exec sponsorship vs "five nines" demands
- Doc cost-benefit/nine
Step 4: SLO monitoring w/ Sloth
Sloth → Prometheus rules + alerts from SLO specs.
Install Sloth:
# Binary installation
wget https://github.com/slok/sloth/releases/download/v0.11.0/sloth-linux-amd64
chmod +x sloth-linux-amd64
sudo mv sloth-linux-amd64 /usr/local/bin/sloth
# Or Docker
docker pull ghcr.io/slok/sloth:latest
Sloth SLO spec (slos/user-api.yml):
version: "prometheus/v1"
service: "user-api"
labels:
team: "platform"
tier: "1"
slos:
# ... (see EXAMPLES.md for complete configuration)
Generate rules:
# Generate recording and alerting rules
sloth generate -i slos/user-api.yml -o prometheus/rules/user-api-slo.yml
# Validate generated rules
promtool check rules prometheus/rules/user-api-slo.yml
Recording rules (excerpt):
groups:
- name: sloth-slo-sli-recordings-user-api-requests-availability
interval: 30s
rules:
# SLI: Ratio of good events
- record: slo:sli_error:ratio_rate5m
# ... (see EXAMPLES.md for complete configuration)
Alerts:
groups:
- name: sloth-slo-alerts-user-api-requests-availability
rules:
# Fast burn: 2% budget consumed in 1 hour
- alert: UserAPIHighErrorRate
expr: |
# ... (see EXAMPLES.md for complete configuration)
Load rules:
# prometheus.yml
rule_files:
- "rules/user-api-slo.yml"
Reload:
curl -X POST http://localhost:9090/-/reload
→ Multi-window multi-burn alerts, rules eval OK, alerts fire on incidents.
If err:
yamllint slos/user-api.yml- Sloth ver ≥ v0.11
- Verify
curl http://localhost:9090/api/v1/rules - Synth err injection → trigger alerts
- Check Sloth docs → SLI event query format
Step 5: Err budget dashboards
Grafana → SLO compliance + budget consumption.
Grafana JSON (excerpt):
{
"dashboard": {
"title": "SLO Dashboard - User API",
"panels": [
{
"type": "stat",
# ... (see EXAMPLES.md for complete configuration)
Key metrics:
- SLO target vs SLI
- Budget remaining (% + abs)
- Burn rate
- Historical SLI (30d rolling)
- Time to exhaustion
Err budget policy (md panel):
## Error Budget Policy
**Current Status**: 78% budget remaining
### If Error Budget > 50%
- ✅ Full speed ahead on new features
# ... (see EXAMPLES.md for complete configuration)
→ Real-time compliance, budget depletion visible, informed velocity decisions.
If err:
- Verify rules:
curl http://localhost:9090/api/v1/rules | jq '.data.groups[].rules[] | select(.name | contains("slo:"))' - Prometheus datasource URL correct
- Query in Explore view before dashboard
- Time range → 30d for monthly SLOs
Step 6: Err budget policy
Org process → budget mgmt.
Policy template:
service: user-api
slo:
availability: 99.9%
latency_p99: 200ms
window: 30 days
# ... (see EXAMPLES.md for complete configuration)
Automate enforcement:
# Example: Deployment gate script
import requests
import sys
def check_error_budget(service):
# Query Prometheus for error budget
# ... (see EXAMPLES.md for complete configuration)
CI/CD:
# .github/workflows/deploy.yml
jobs:
check-error-budget:
runs-on: ubuntu-latest
steps:
- name: Check SLO Error Budget
run: |
python scripts/check_error_budget.py user-api
- name: Deploy
if: success()
run: |
kubectl apply -f deploy/
→ Policy docs, auto gates block risky deploys on budget depletion, team aligned.
If err:
- Start manual (Slack reminders)
- Automate w/ soft gates (warns)
- Exec buy-in before hard gates (block deploys)
- Quarterly review
Check
- SLIs → user exp + business impact
- SLO targets → stakeholder agree + rationale docs
- Prometheus rules → SLI metrics OK
- Multi-burn alerts → tested w/ synth errs
- Grafana → real-time SLO + budget
- Err budget policy docs + communicated
- Auto gates → block risky deploys
- Weekly/monthly SLO reviews scheduled
- Incident retros → SLO impact analysis
- SLO reports → stakeholders
Traps
- Too strict SLOs: "Five nines" w/o cost analysis → burnout + slow velocity. Start achievable, iterate up.
- Too many SLIs: 10+ → confusion. Focus 2-4 user-facing.
- No SLA buffer: SLO = SLA → no margin. Keep 0.05-0.1%.
- Ignore err budget: Track SLOs w/o action → defeats purpose. Enforce policy.
- Vanity metrics: Internal (CPU, mem) vs user-visible (latency, errs) → misaligned priorities.
- No buy-in: Eng-only SLOs → conflicts w/ product/biz. Get exec sponsorship.
- Static SLOs: Never review → stale. Revisit quarterly.
→
setup-prometheus-monitoring— metrics collection for SLI calcconfigure-alerting-rules— burn rate alerts → Alertmanagerbuild-grafana-dashboards— viz SLO compliance + budgetwrite-incident-runbook— SLO impact in runbooks
GitHub репозиторий
Похожие навыки
railway-docs
ДокументацияЭтот навык получает актуальную документацию Railway, чтобы отвечать на вопросы о функциях, возможностях или конкретных URL-адресах документации. Он гарантирует, что разработчики получают точную и современную информацию напрямую из официальных источников Railway. Используйте его, когда пользователи спрашивают, как работает Railway, или ссылаются на документацию Railway.
n8n-code-python
ДокументацияЭтот навык Claude предоставляет экспертные рекомендации по написанию кода Python в узлах Code платформы n8n, в частности, по использованию стандартной библиотеки Python и работе со специальным синтаксисом n8n, таким как `_input`, `_json` и `_node`. Он помогает разработчикам понять ограничения Python в среде n8n и рекомендует использовать JavaScript для большинства рабочих процессов, предлагая решения на Python для конкретных задач по преобразованию данных.
archon
ДокументацияНавык Archon предоставляет семантический поиск на основе RAG и управление проектами через REST API. Используйте его для запросов к документации, управления иерархическими проектами/задачами и выполнения поиска информации с возможностью загрузки документов. Всегда в первую очередь обращайтесь к Archon при поиске во внешней документации, прежде чем использовать другие источники.
n8n-code-javascript
ДокументацияЭтот навык Claude предоставляет экспертные рекомендации по написанию кода JavaScript в узлах Code платформы n8n. Он охватывает важный синтаксис, специфичный для n8n, включая переменные `$input`/`$json`, HTTP-хелперы и работу с DateTime, а также помогает в устранении распространённых ошибок. Используйте его при разработке рабочих процессов в n8n, требующих кастомной обработки JavaScript в узлах Code.
