define-slo-sli-sla

Name: define-slo-sli-sla
Author: pjt222

Updated 1 month ago

8 views

Category: Documentation. Tags: ai, automation, data

About

This Claude Skill helps developers define and implement measurable reliability targets (SLO/SLI/SLA) with Prometheus and tools such as Sloth or Pyrra. It supports error budget tracking, burn-rate alerts, and automated reporting to balance feature development against system reliability. Use it when introducing data-driven SRE practices for customer-facing services.

Documentation

Define SLO/SLI/SLA

Measurable reliability targets → SLIs track → err budget manage.

Use When

Reliability targets → customer-facing svc/API
Clear expect → provider ↔ consumer
Feature velocity ↔ reliability via err budget
Objective criteria → incident severity
Arbitrary uptime → data-driven metrics
SRE impl
Svc quality → measure + improve

In

Required: Svc desc + critical user journeys
Required: Historical metrics (req rates, latencies, err rates)
Optional: Existing SLA commitments
Optional: Business reqs → availability/perf
Optional: Incident history + customer impact

Do

See Extended Examples for complete configuration files and templates.

Step 1: SLI/SLO/SLA hierarchy

Relationship + diffs.

Definitions:

SLI (Service Level Indicator)
- **What**: A quantitative measure of service behavior
- **Example**: Request success rate, request latency, system throughput
- **Measurement**: `successful_requests / total_requests * 100`

SLO (Service Level Objective)
- **What**: Target value or range for an SLI over a time window
- **Example**: 99.9% of requests succeed in 30-day window
- **Purpose**: Internal reliability target to guide operations

SLA (Service Level Agreement)
- **What**: Contractual commitment with consequences for missing SLO
- **Example**: 99.9% uptime SLA with refunds if breached
- **Purpose**: External promise to customers with penalties

Hierarchy:

SLA (99.9% uptime, customer refunds)
  ├─ SLO (99.95% success rate, internal target)
  │   └─ SLI (actual measured: 99.97% success rate)
  └─ Error Budget (0.05% failures allowed per month)

Key: SLO stricter than SLA → buffer before customer impact.

Ex:

SLA: 99.9% (promise)
SLO: 99.95% (internal)
Buffer: 0.05%
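The buffer can be checked mechanically. A minimal sketch, assuming the 99.9% SLA and 99.95% SLO from the example above; `classify` and its return labels are hypothetical names chosen for illustration, not part of any tool.

```python
from decimal import Decimal

SLA = Decimal("99.9")    # external promise, from the example above
SLO = Decimal("99.95")   # internal target, from the example above


def sli_percent(successful: int, total: int) -> Decimal:
    """SLI as a percentage: successful_requests / total_requests * 100."""
    if total <= 0:
        raise ValueError("total must be positive")
    return Decimal(successful) / Decimal(total) * 100


def classify(sli: Decimal) -> str:
    """Hypothetical labels: where the measured SLI sits relative to SLO and SLA."""
    if sli >= SLO:
        return "ok"
    if sli >= SLA:
        return "slo_missed_inside_buffer"
    return "sla_breached"


# Headroom between internal target and external promise, in percentage points.
buffer = SLO - SLA
```

An SLI that lands in the middle band is the case the buffer exists for: the internal target is missed, yet no customer-facing commitment is broken.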

→ Team understands, SLI metrics agreed, SLO targets aligned.

If err:

Read Google SRE book SLI/SLO/SLA chapters
Stakeholder workshop → align defs
Start w/ success-rate SLI before latency SLOs

Step 2: Select SLIs

Reflect user experience + business impact.

Four Golden Signals (Google SRE):

Latency: Req serve time

# P95 latency
histogram_quantile(0.95,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
)

Traffic: Demand

# Requests per second
sum(rate(http_requests_total[5m]))

Errors: Failed req rate

# Error rate percentage
sum(rate(http_requests_total{status=~"5.."}[5m]))
/ sum(rate(http_requests_total[5m])) * 100

Saturation: How full

# CPU saturation (fraction of non-idle CPU time, 0-1)
1 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m]))

Common SLI patterns:

# Availability SLI
availability:
  description: "Percentage of successful requests"
  query: |
    sum(rate(http_requests_total{status!~"5.."}[5m]))
    / sum(rate(http_requests_total[5m]))
  good_threshold: 0.999  # 99.9%

# Latency SLI
latency:
  description: "P99 request latency under 500ms"
  query: |
    histogram_quantile(0.99,
      sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
    ) < bool 0.5
  good_threshold: 0.95  # 95% of windows meet target

# Throughput SLI
throughput:
  description: "Requests processed per second"
  query: |
    sum(rate(http_requests_total[5m]))
  good_threshold: 1000  # Minimum 1000 req/s

# Data freshness SLI (for batch jobs)
freshness:
  description: "Data updated within last hour"
  query: |
    (time() - max(data_last_updated_timestamp)) < bool 3600
  good_threshold: 1  # Always fresh
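One alternative for latency, in the same "good events / total events" shape as the availability SLI, is to count requests served under the threshold. A sketch under stated assumptions: buckets are cumulative and keyed by upper bound, as Prometheus histograms are; `latency_sli` and the sample data are hypothetical.

```python
import math


def latency_sli(buckets: dict[float, int], threshold_s: float) -> float:
    """Fraction of requests served within threshold_s.

    buckets maps each upper bound (le) to a cumulative request count;
    math.inf holds the total. threshold_s must match an existing bound.
    """
    if threshold_s not in buckets:
        raise KeyError(f"no bucket with le={threshold_s}")
    total = buckets[math.inf]
    if total == 0:
        return 1.0  # assumption: no traffic counts as meeting the target
    return buckets[threshold_s] / total


# Made-up sample data for illustration only.
sample = {0.1: 700, 0.25: 900, 0.5: 980, 1.0: 995, math.inf: 1000}
```

With the sample data, `latency_sli(sample, 0.5)` gives the share of requests under 500ms, which can then be compared against a target directly.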

SLI criteria:

User-visible → reflects experience
Measurable → from existing metrics
Actionable → team fixes via eng work
Meaningful → correlates w/ customer satisfaction
Simple → easy explain

Avoid:

Internal sys metrics (CPU, mem) not user-visible
Vanity metrics → no customer impact
Complex composite scores

→ 2-4 SLIs/svc, availability+latency min, team agrees on queries.

If err:

Map user journey → critical fail points
Incident history → which metrics predicted impact?
A/B test → degrade metric, measure complaints
Start simple, iterate

Step 3: SLO targets + time windows

Realistic + achievable.

SLO spec format:

service: user-api
slos:
  - name: availability
    objective: 99.9
    description: |
      99.9% of requests return non-5xx status codes
# ... (see EXAMPLES.md for complete configuration)

Time window:

30d → external SLAs
7d → eng teams feedback
1d → high-freq svc

30d window err budget ex:

SLO: 99.9% availability over 30 days
Allowed failures: 0.1%
Total requests per month: 100M
Error budget: 100,000 failed requests
Daily budget: ~3,333 failed requests
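The arithmetic above, as a small sketch. `error_budget` is a hypothetical helper; it uses exact fractions to avoid float rounding on the 0.1%.

```python
from fractions import Fraction


def error_budget(total_requests: int, slo_percent: str, window_days: int = 30):
    """Return (allowed failed requests in window, allowed failures per day)."""
    allowed_ratio = 1 - Fraction(slo_percent) / 100
    budget = total_requests * allowed_ratio
    # int() truncates, so the daily figure rounds down.
    return int(budget), int(budget / window_days)


budget, per_day = error_budget(100_000_000, "99.9")
print(budget, per_day)
```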

Realistic targets:

Baseline perf:

# Check actual availability over past 90 days
avg_over_time(
  (sum(rate(http_requests_total{status!~"5.."}[5m]))
  / sum(rate(http_requests_total[5m])))[90d:5m]
)
# Result: 99.95% → set SLO at 99.9% (below current performance, leaves headroom)

Cost of nines:

99%    → 7.2 hours downtime/month (low reliability)
99.9%  → 43 minutes downtime/month (good)
99.95% → 22 minutes downtime/month (very good)
99.99% → 4.3 minutes downtime/month (expensive)
99.999% → 26 seconds downtime/month (very expensive)
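The table can be reproduced for any target. A minimal sketch, assuming a 30-day month as the figures above do; `allowed_downtime` is a hypothetical helper.

```python
from datetime import timedelta


def allowed_downtime(availability_percent: float, window_days: int = 30) -> timedelta:
    """Downtime a target permits over the window, assuming a 30-day month."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability_percent must be between 0 and 100")
    return timedelta(days=window_days) * (1 - availability_percent / 100)


for target in (99, 99.9, 99.95, 99.99, 99.999):
    print(f"{target}% -> {allowed_downtime(target)}")
```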

Balance:
- Too strict → expensive, slow features
- Too loose → bad UX, churn
- Sweet spot → slightly > user expectations

→ SLOs set w/ buy-in, rationale docs, err budget calc.

If err:

Start achievable (99% if 98.5% now)
Iterate quarterly
Exec sponsorship vs "five nines" demands
Doc cost-benefit/nine

Step 4: SLO monitoring w/ Sloth

Sloth → Prometheus rules + alerts from SLO specs.

Install Sloth:

# Binary installation
wget https://github.com/slok/sloth/releases/download/v0.11.0/sloth-linux-amd64
chmod +x sloth-linux-amd64
sudo mv sloth-linux-amd64 /usr/local/bin/sloth

# Or Docker
docker pull ghcr.io/slok/sloth:latest

Sloth SLO spec (slos/user-api.yml):

version: "prometheus/v1"
service: "user-api"
labels:
  team: "platform"
  tier: "1"
slos:
# ... (see EXAMPLES.md for complete configuration)

Generate rules:

# Generate recording and alerting rules
sloth generate -i slos/user-api.yml -o prometheus/rules/user-api-slo.yml

# Validate generated rules
promtool check rules prometheus/rules/user-api-slo.yml

Recording rules (excerpt):

groups:
  - name: sloth-slo-sli-recordings-user-api-requests-availability
    interval: 30s
    rules:
      # SLI: ratio of bad (error) events over 5m
      - record: slo:sli_error:ratio_rate5m
# ... (see EXAMPLES.md for complete configuration)

Alerts:

groups:
  - name: sloth-slo-alerts-user-api-requests-availability
    rules:
      # Fast burn: 2% budget consumed in 1 hour
      - alert: UserAPIHighErrorRate
        expr: |
# ... (see EXAMPLES.md for complete configuration)

Load rules:

# prometheus.yml
rule_files:
  - "rules/user-api-slo.yml"

Reload (Prometheus must run with --web.enable-lifecycle):

curl -X POST http://localhost:9090/-/reload

→ Multi-window multi-burn alerts, rules eval OK, alerts fire on incidents.
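How a multi-window burn-rate check works, sketched in Python. This follows the multiwindow, multi-burn-rate approach described in the Google SRE Workbook; it is not the rule Sloth generates, and all function names are hypothetical.

```python
def burn_rate(error_ratio: float, slo: float) -> float:
    """How many times faster than sustainable the budget is being spent."""
    return error_ratio / (1 - slo)


def burn_rate_threshold(budget_fraction: float, window_hours: float,
                        period_hours: float = 30 * 24) -> float:
    """Burn rate that spends budget_fraction of the budget within window_hours."""
    return budget_fraction * period_hours / window_hours


def should_page(long_err: float, short_err: float, slo: float, threshold: float) -> bool:
    """Fire only if both the long and the short window exceed the threshold."""
    return (burn_rate(long_err, slo) > threshold
            and burn_rate(short_err, slo) > threshold)


# "2% of budget consumed in 1 hour" over a 30-day period.
fast = burn_rate_threshold(0.02, 1)
```

Requiring the short window as well lets the alert clear soon after errors stop, instead of firing until the long window drains.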

If err:

yamllint slos/user-api.yml
Sloth ver ≥ v0.11
Verify rules loaded: curl http://localhost:9090/api/v1/rules
Synth err injection → trigger alerts
Check Sloth docs → SLI event query format

Step 5: Err budget dashboards

Grafana → SLO compliance + budget consumption.

Grafana JSON (excerpt):

{
  "dashboard": {
    "title": "SLO Dashboard - User API",
    "panels": [
      {
        "type": "stat",
# ... (see EXAMPLES.md for complete configuration)

Key metrics:

SLO target vs SLI
Budget remaining (% + abs)
Burn rate
Historical SLI (30d rolling)
Time to exhaustion
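Two of these metrics reduce to short formulas. A sketch under stated assumptions: a 30-day period, a burn rate that stays constant, and hypothetical helper names.

```python
def budget_remaining(bad_events: int, total_events: int, slo: float) -> float:
    """Fraction of error budget left in the window (negative once overspent)."""
    if not 0 < slo < 1:
        raise ValueError("slo must be a ratio strictly between 0 and 1")
    allowed = total_events * (1 - slo)
    return 1 - bad_events / allowed


def hours_to_exhaustion(remaining_fraction: float, current_burn_rate: float,
                        period_hours: float = 30 * 24) -> float:
    """Hours until the budget hits zero if the current burn rate holds."""
    if current_burn_rate <= 0:
        return float("inf")
    return max(remaining_fraction, 0.0) * period_hours / current_burn_rate
```

At burn rate 1 a full budget lasts exactly one period; at burn rate 2 it lasts half that.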

Err budget policy (md panel):

## Error Budget Policy

**Current Status**: 78% budget remaining

### If Error Budget > 50%
- ✅ Full speed ahead on new features
# ... (see EXAMPLES.md for complete configuration)

→ Real-time compliance, budget depletion visible, informed velocity decisions.

If err:

Verify rules: curl http://localhost:9090/api/v1/rules | jq '.data.groups[].rules[] | select(.name | contains("slo:"))'
Prometheus datasource URL correct
Query in Explore view before dashboard
Time range → 30d for monthly SLOs

Step 6: Err budget policy

Org process → budget mgmt.

Policy template:

service: user-api
slo:
  availability: 99.9%
  latency_p99: 200ms
  window: 30 days

# ... (see EXAMPLES.md for complete configuration)

Automate enforcement:

# Example: Deployment gate script
import requests
import sys

def check_error_budget(service):
    # Query Prometheus for error budget
# ... (see EXAMPLES.md for complete configuration)
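The decision logic of such a gate can be sketched separately from the script above, which is elided here. Assumptions: the payload has the shape of a Prometheus HTTP API instant-query response; the query that produces the remaining-budget value is not shown in the source; the 50% warn level echoes the policy excerpt and the 10% block level is hypothetical.

```python
def parse_budget(payload: dict) -> float:
    """Extract a single value from a Prometheus instant-query JSON response."""
    if payload.get("status") != "success":
        raise ValueError("Prometheus query failed")
    result = payload["data"]["result"]
    if len(result) != 1:
        raise ValueError(f"expected one series, got {len(result)}")
    # Instant vectors carry [timestamp, "value-as-string"].
    return float(result[0]["value"][1])


def gate(remaining: float, warn_below: float = 0.5, block_below: float = 0.1) -> str:
    """Hypothetical thresholds: warn under 50% remaining, block under 10%."""
    if remaining < block_below:
        return "block"
    if remaining < warn_below:
        return "warn"
    return "deploy"


def exit_code(decision: str) -> int:
    """Non-zero fails the CI step, which stops a later deploy step."""
    return 1 if decision == "block" else 0
```

A real script would fetch the payload over HTTP and pass `exit_code(gate(parse_budget(payload)))` to `sys.exit`. Returning "warn" with exit code 0 gives the soft-gate behavior; only "block" stops the pipeline.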

CI/CD:

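# .github/workflows/deploy.yml
jobs:
  check-error-budget:
    runs-on: ubuntu-latest
    steps:
      # Check out the repo so scripts/ and deploy/ exist on the runner
      - uses: actions/checkout@v4
      - name: Check SLO Error Budget
        run: |
          python scripts/check_error_budget.py user-api
      # Runs only if the budget check exited 0
      - name: Deploy
        if: success()
        run: |
          kubectl apply -f deploy/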

→ Policy docs, auto gates block risky deploys on budget depletion, team aligned.

If err:

Start manual (Slack reminders)
Automate w/ soft gates (warns)
Exec buy-in before hard gates (block deploys)
Quarterly review

Check

Traps

Too strict SLOs: "Five nines" w/o cost analysis → burnout + slow velocity. Start achievable, iterate up.
Too many SLIs: 10+ → confusion. Focus 2-4 user-facing.
No SLA buffer: SLO = SLA → no margin. Keep 0.05-0.1%.
Ignore err budget: Track SLOs w/o action → defeats purpose. Enforce policy.
Vanity metrics: Internal (CPU, mem) vs user-visible (latency, errs) → misaligned priorities.
No buy-in: Eng-only SLOs → conflicts w/ product/biz. Get exec sponsorship.
Static SLOs: Never review → stale. Revisit quarterly.

→ Related skills

setup-prometheus-monitoring — metrics collection for SLI calc
configure-alerting-rules — burn rate alerts → Alertmanager
build-grafana-dashboards — viz SLO compliance + budget
write-incident-runbook — SLO impact in runbooks

GitHub Repository

pjt222/agent-almanac

Path: i18n/caveman-ultra/skills/define-slo-sli-sla

Topics: agents, agentskills, ai-assisted-development, claude-code, skills, teams

FAQ

Frequently asked questions

What is the define-slo-sli-sla skill?

define-slo-sli-sla is a Claude Skill by pjt222. Skills package instructions and resources that Claude loads on demand, so Claude can perform define-slo-sli-sla-related tasks without extra prompting.

How do I install define-slo-sli-sla?

Use the install commands on this page: add define-slo-sli-sla to Claude Code as a plugin, or clone its repository into your skills directory, then restart Claude so it picks up the skill.

What category does define-slo-sli-sla belong to?

define-slo-sli-sla is in the Documentation category, tagged ai, automation and data.

Is define-slo-sli-sla free to use?

Yes. define-slo-sli-sla is listed on AIMCP and free to install. It runs inside Claude, so no separate service account is required to use the skill itself.
