SKILL·28907D

define-slo-sli-sla

Name: define-slo-sli-sla
Author: pjt222

pjt222

更新于 1 month ago

22 次查看

文档aiautomationdata

关于

This skill helps developers define and implement SLOs, SLIs, and SLAs with error budgets for customer-facing services. It enables data-driven reliability management using Prometheus and tools like Sloth or Pyrra for tracking, alerting, and reporting. Use it to balance feature development against system reliability and to establish measurable SRE practices.

快速安装

Claude Code

技能文档

定 SLO/SLI/SLA

立可量可靠目標，以指標測之，理誤差預算。

用

定客端服務可靠目標
立供需雙方明約
以誤差預算衡功能速度+可靠
建事故嚴重度+應對客觀標準
由任意在線目標→數據驅動
行 SRE 之法
測並改品質

入

必：服務述+關鍵用戶路徑
必：歷史度量（請求率、時延、錯誤率）
可：現存 SLA 承諾
可：業務要求
可：事故史+客影響

法

詳例見 Extended Examples。

一：識 SLI、SLO、SLA 層級

三者之別與繫。

定義：

SLI (Service Level Indicator)
- **What**: A quantitative measure of service behavior
- **Example**: Request success rate, request latency, system throughput
- **Measurement**: `successful_requests / total_requests * 100`

SLO (Service Level Objective)
- **What**: Target value or range for an SLI over a time window
- **Example**: 99.9% of requests succeed in 30-day window
- **Purpose**: Internal reliability target to guide operations

SLA (Service Level Agreement)
- **What**: Contractual commitment with consequences for missing SLO
- **Example**: 99.9% uptime SLA with refunds if breached
- **Purpose**: External promise to customers with penalties

層級：

SLA (99.9% uptime, customer refunds)
  ├─ SLO (99.95% success rate, internal target)
  │   └─ SLI (actual measured: 99.97% success rate)
  └─ Error Budget (0.05% failures allowed per month)

要： SLO 當嚴於 SLA，留緩衝於客受影前。

例：

SLA：99.9% 可用（對客）
SLO：99.95% 可用（內部）
緩衝：0.05% 墊

得：團識別。度量選為 SLI 一致。SLO 目標共識。

敗：

讀 Google SRE 書 SLI/SLO/SLA 章
辦相關者工坊齊定義
先簡成功率 SLI，後繁時延 SLO

二：擇 SLI

繫用戶體驗+業務影響。

四金信號（Google SRE）：

時延：服務請求時

# P95 latency
histogram_quantile(0.95,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
)

流量：系統需求

# Requests per second
sum(rate(http_requests_total[5m]))

錯誤：失敗率

# Error rate percentage
sum(rate(http_requests_total{status=~"5.."}[5m]))
/ sum(rate(http_requests_total[5m])) * 100

飽和：系統「滿」度

# CPU saturation
avg(rate(node_cpu_seconds_total{mode!="idle"}[5m]))

常見 SLI 型：

# Availability SLI
availability:
  description: "Percentage of successful requests"
  query: |
    sum(rate(http_requests_total{status!~"5.."}[5m]))
    / sum(rate(http_requests_total[5m]))
  good_threshold: 0.999  # 99.9%

# Latency SLI
latency:
  description: "P99 request latency under 500ms"
  query: |
    histogram_quantile(0.99,
      sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
    ) < 0.5
  good_threshold: 0.95  # 95% of windows meet target

# Throughput SLI
throughput:
  description: "Requests processed per second"
  query: |
    sum(rate(http_requests_total[5m]))
  good_threshold: 1000  # Minimum 1000 req/s

# Data freshness SLI (for batch jobs)
freshness:
  description: "Data updated within last hour"
  query: |
    (time() - max(data_last_updated_timestamp)) < 3600
  good_threshold: 1  # Always fresh

SLI 擇標：

用戶可見：繫實體驗
可量：由現指標可量
可行：團可工程改之
有義：繫客滿意
簡：易懂易釋

避：

內部指標（CPU、內存）用戶不見
虛榮指標
過繁復合分

得：每服務選 2-4 SLI，至少含可用+時延。團一致於查詢。

敗：

繪用戶路徑識關鍵失敗點
析事故史：何指標預客影響？
A/B 驗：劣化指標→測客訴
先簡可用 SLI，漸加繁

三：定 SLO 目標+窗

現實可達目標。

SLO 規範格式：

service: user-api
slos:
  - name: availability
    objective: 99.9
    description: |
      99.9% of requests return non-5xx status codes
# ... (see EXAMPLES.md for complete configuration)

時窗：

30 日（月）：典型外 SLA
7 日（週）：工程反饋速
1 日（日）：高頻服務需急應

30 日窗誤差預算例：

SLO: 99.9% availability over 30 days
Allowed failures: 0.1%
Total requests per month: 100M
Error budget: 100,000 failed requests
Daily budget: ~3,333 failed requests

定現實目標：

基線現況：

# Check actual availability over past 90 days
avg_over_time(
  (sum(rate(http_requests_total{status!~"5.."}[5m]))
  / sum(rate(http_requests_total[5m])))[90d:5m]
)
# Result: 99.95% → Set SLO at 99.9% (safer than current)

計九成之價：

99%    → 7.2 hours downtime/month (low reliability)
99.9%  → 43 minutes downtime/month (good)
99.95% → 22 minutes downtime/month (very good)
99.99% → 4.3 minutes downtime/month (expensive)
99.999% → 26 seconds downtime/month (very expensive)

衡用戶喜+工程價：
- 太嚴：昂，緩功能開發
- 太鬆：差體驗，客流失
- 最佳：略優於期望

得： SLO 目標定，業務相關者贊同，錄理由，計誤差預算。

敗：

起於可達（若現 98.5%→目標 99%）
按實績每季調 SLO
獲高層支持現實目標抗「五九」
錄每加一九之價效析

四：以 Sloth 行 SLO 監

由 SLO 規範生 Prometheus 記錄規則+告警。

裝 Sloth：

# Binary installation
wget https://github.com/slok/sloth/releases/download/v0.11.0/sloth-linux-amd64
chmod +x sloth-linux-amd64
sudo mv sloth-linux-amd64 /usr/local/bin/sloth

# Or Docker
docker pull ghcr.io/slok/sloth:latest

建 Sloth SLO 規範 (slos/user-api.yml)：

version: "prometheus/v1"
service: "user-api"
labels:
  team: "platform"
  tier: "1"
slos:
# ... (see EXAMPLES.md for complete configuration)

生 Prometheus 規則：

# Generate recording and alerting rules
sloth generate -i slos/user-api.yml -o prometheus/rules/user-api-slo.yml

# Validate generated rules
promtool check rules prometheus/rules/user-api-slo.yml

生記錄規則（節錄）：

groups:
  - name: sloth-slo-sli-recordings-user-api-requests-availability
    interval: 30s
    rules:
      # SLI: Ratio of good events
      - record: slo:sli_error:ratio_rate5m
# ... (see EXAMPLES.md for complete configuration)

生告警：

groups:
  - name: sloth-slo-alerts-user-api-requests-availability
    rules:
      # Fast burn: 2% budget consumed in 1 hour
      - alert: UserAPIHighErrorRate
        expr: |
# ... (see EXAMPLES.md for complete configuration)

載規則入 Prometheus：

# prometheus.yml
rule_files:
  - "rules/user-api-slo.yml"

重載：

curl -X POST http://localhost:9090/-/reload

得： Sloth 生多窗多燃率告警，記錄規則評估成，合成錯注入觸發告警。

敗：

yamllint slos/user-api.yml 驗 YAML
查 Sloth 版本（v0.11+ 宜）
驗 Prometheus 記錄規則評估：curl http://localhost:9090/api/v1/rules
以合成錯注入測告警
查 Sloth 文檔 SLI 事件查詢格式

五：建誤差預算儀板

於 Grafana 見 SLO 遵守+預算耗用。

Grafana 儀板 JSON（節錄）：

{
  "dashboard": {
    "title": "SLO Dashboard - User API",
    "panels": [
      {
        "type": "stat",
# ... (see EXAMPLES.md for complete configuration)

要視指標：

SLO 目標 vs 現 SLI
預算餘（比例+絕值）
燃率（耗速）
歷史 SLI 趨勢（30 日滾窗）
耗盡時（若現率續）

誤差預算政策儀板（markdown 面板）：

## Error Budget Policy

**Current Status**: 78% budget remaining

### If Error Budget > 50%
- ✅ Full speed ahead on new features
# ... (see EXAMPLES.md for complete configuration)

得：儀板示實時 SLO 遵守，預算耗用可見，團可據以決功能速度。

敗：

驗記錄規則存：curl http://localhost:9090/api/v1/rules | jq '.data.groups[].rules[] | select(.name | contains("slo:"))'
查 Grafana 之 Prometheus 數據源 URL
Explore 視圖驗查詢結果後加儀板
時範圍設合（如 30d 月 SLO）

六：立誤差預算政策

定組織級預算管理流程。

政策模：

service: user-api
slo:
  availability: 99.9%
  latency_p99: 200ms
  window: 30 days

# ... (see EXAMPLES.md for complete configuration)

自動政策執行：

# Example: Deployment gate script
import requests
import sys

def check_error_budget(service):
    # Query Prometheus for error budget
# ... (see EXAMPLES.md for complete configuration)

入 CI/CD：

# .github/workflows/deploy.yml
jobs:
  check-error-budget:
    runs-on: ubuntu-latest
    steps:
      - name: Check SLO Error Budget
        run: |
          python scripts/check_error_budget.py user-api
      - name: Deploy
        if: success()
        run: |
          kubectl apply -f deploy/

得：政策明錄，自動門防預算耗時險部署，團可靠優先一致。

敗：

起於手動政策（Slack 提醒）
漸自動化以軟門（警非阻）
硬門前獲高層贊同
每季審政策效，按需調閾

驗

忌

SLO 過嚴：「五九」無價析→倦+緩速。起可達，漸升。
SLI 過多：10+ 指標惑人。聚 2-4 關鍵用戶面。
SLO 無 SLA 緩衝：SLO 等 SLA→無餘地。留 0.05-0.1% 緩衝。
忽略預算：只跟蹤不行動→失義。執行政策。
虛榮指標為 SLI：內部指標（CPU、內存）代用戶面（時延、錯）→錯置優先。
相關者不贊同：純工程 SLO 無產品/業務贊同→衝突。獲高層。
SLO 僵化：系統變而不審→每季按實績+客反饋重訪。

參

setup-prometheus-monitoring
configure-alerting-rules
build-grafana-dashboards
write-incident-runbook

GitHub 仓库

pjt222/agent-almanac

路径: i18n/wenyan-ultra/skills/define-slo-sli-sla

agentsagentskillsai-assisted-developmentclaude-codeskillsteams

FAQ

Frequently asked questions

What is the define-slo-sli-sla skill?

define-slo-sli-sla is a Claude Skill by pjt222. Skills package instructions and resources that Claude loads on demand, so Claude can perform define-slo-sli-sla-related tasks without extra prompting.

How do I install define-slo-sli-sla?

Use the install commands on this page: add define-slo-sli-sla to Claude Code as a plugin, or clone its repository into your skills directory, then restart Claude so it picks up the skill.

What category does define-slo-sli-sla belong to?

define-slo-sli-sla is in the Documentation category, tagged ai, automation and data.

Is define-slo-sli-sla free to use?

Yes. define-slo-sli-sla is listed on AIMCP and free to install. It runs inside Claude, so no separate service account is required to use the skill itself.