MCP HubMCP Hub
返回技能列表

slo-implementation

camoneart
更新于 Today
17 次查看
2
2
在 GitHub 上查看
其他general

关于

This Claude Skill helps developers define and implement Service Level Indicators (SLIs) and Service Level Objectives (SLOs) with error budgets and alerting. Use it when establishing reliability targets, implementing SRE practices, or measuring service performance to balance reliability with innovation velocity.

技能文档

SLO Implementation

Framework for defining and implementing Service Level Indicators (SLIs), Service Level Objectives (SLOs), and error budgets.

Purpose

Implement measurable reliability targets using SLIs, SLOs, and error budgets to balance reliability with innovation velocity.

When to Use

  • Define service reliability targets
  • Measure user-perceived reliability
  • Implement error budgets
  • Create SLO-based alerts
  • Track reliability goals

SLI/SLO/SLA Hierarchy

SLA (Service Level Agreement)
  ↓ Contract with customers
SLO (Service Level Objective)
  ↓ Internal reliability target
SLI (Service Level Indicator)
  ↓ Actual measurement

Defining SLIs

Common SLI Types

1. Availability SLI

# Successful requests / Total requests
sum(rate(http_requests_total{status!~"5.."}[28d]))
/
sum(rate(http_requests_total[28d]))

2. Latency SLI

# Requests below latency threshold / Total requests
sum(rate(http_request_duration_seconds_bucket{le="0.5"}[28d]))
/
sum(rate(http_request_duration_seconds_count[28d]))

3. Durability SLI

# Successful writes / Total writes
sum(storage_writes_successful_total)
/
sum(storage_writes_total)

Reference: See references/slo-definitions.md

Setting SLO Targets

Availability SLO Examples

SLO %Downtime/MonthDowntime/Year
99%7.2 hours3.65 days
99.9%43.2 minutes8.76 hours
99.95%21.6 minutes4.38 hours
99.99%4.32 minutes52.56 minutes

Choose Appropriate SLOs

Consider:

  • User expectations
  • Business requirements
  • Current performance
  • Cost of reliability
  • Competitor benchmarks

Example SLOs:

slos:
  - name: api_availability
    target: 99.9
    window: 28d
    sli: |
      sum(rate(http_requests_total{status!~"5.."}[28d]))
      /
      sum(rate(http_requests_total[28d]))

  - name: api_latency_p95
    target: 99
    window: 28d
    sli: |
      sum(rate(http_request_duration_seconds_bucket{le="0.5"}[28d]))
      /
      sum(rate(http_request_duration_seconds_count[28d]))

Error Budget Calculation

Error Budget Formula

Error Budget = 1 - SLO Target

Example:

  • SLO: 99.9% availability
  • Error Budget: 0.1% = 43.2 minutes/month
  • Current Error: 0.05% = 21.6 minutes/month
  • Remaining Budget: 50%

Error Budget Policy

error_budget_policy:
  - remaining_budget: 100%
    action: Normal development velocity
  - remaining_budget: 50%
    action: Consider postponing risky changes
  - remaining_budget: 10%
    action: Freeze non-critical changes
  - remaining_budget: 0%
    action: Feature freeze, focus on reliability

Reference: See references/error-budget.md

SLO Implementation

Prometheus Recording Rules

# SLI Recording Rules
groups:
  - name: sli_rules
    interval: 30s
    rules:
      # Availability SLI
      - record: sli:http_availability:ratio
        expr: |
          sum(rate(http_requests_total{status!~"5.."}[28d]))
          /
          sum(rate(http_requests_total[28d]))

      # Latency SLI (requests < 500ms)
      - record: sli:http_latency:ratio
        expr: |
          sum(rate(http_request_duration_seconds_bucket{le="0.5"}[28d]))
          /
          sum(rate(http_request_duration_seconds_count[28d]))

  - name: slo_rules
    interval: 5m
    rules:
      # SLO compliance (1 = meeting SLO, 0 = violating)
      - record: slo:http_availability:compliance
        expr: sli:http_availability:ratio >= bool 0.999

      - record: slo:http_latency:compliance
        expr: sli:http_latency:ratio >= bool 0.99

      # Error budget remaining (percentage)
      - record: slo:http_availability:error_budget_remaining
        expr: |
          (sli:http_availability:ratio - 0.999) / (1 - 0.999) * 100

      # Error budget burn rate
      - record: slo:http_availability:burn_rate_5m
        expr: |
          (1 - (
            sum(rate(http_requests_total{status!~"5.."}[5m]))
            /
            sum(rate(http_requests_total[5m]))
          )) / (1 - 0.999)

SLO Alerting Rules

groups:
  - name: slo_alerts
    interval: 1m
    rules:
      # Fast burn: 14.4x rate, 1 hour window
      # Consumes 2% error budget in 1 hour
      - alert: SLOErrorBudgetBurnFast
        expr: |
          slo:http_availability:burn_rate_1h > 14.4
          and
          slo:http_availability:burn_rate_5m > 14.4
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Fast error budget burn detected"
          description: "Error budget burning at {{ $value }}x rate"

      # Slow burn: 6x rate, 6 hour window
      # Consumes 5% error budget in 6 hours
      - alert: SLOErrorBudgetBurnSlow
        expr: |
          slo:http_availability:burn_rate_6h > 6
          and
          slo:http_availability:burn_rate_30m > 6
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Slow error budget burn detected"
          description: "Error budget burning at {{ $value }}x rate"

      # Error budget exhausted
      - alert: SLOErrorBudgetExhausted
        expr: slo:http_availability:error_budget_remaining < 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "SLO error budget exhausted"
          description: "Error budget remaining: {{ $value }}%"

SLO Dashboard

Grafana Dashboard Structure:

┌────────────────────────────────────┐
│ SLO Compliance (Current)           │
│ ✓ 99.95% (Target: 99.9%)          │
├────────────────────────────────────┤
│ Error Budget Remaining: 65%        │
│ ████████░░ 65%                     │
├────────────────────────────────────┤
│ SLI Trend (28 days)                │
│ [Time series graph]                │
├────────────────────────────────────┤
│ Burn Rate Analysis                 │
│ [Burn rate by time window]         │
└────────────────────────────────────┘

Example Queries:

# Current SLO compliance
sli:http_availability:ratio * 100

# Error budget remaining
slo:http_availability:error_budget_remaining

# Days until error budget exhausted (at current burn rate)
(slo:http_availability:error_budget_remaining / 100)
*
28
/
(1 - sli:http_availability:ratio) * (1 - 0.999)

Multi-Window Burn Rate Alerts

# Combination of short and long windows reduces false positives
rules:
  - alert: SLOBurnRateHigh
    expr: |
      (
        slo:http_availability:burn_rate_1h > 14.4
        and
        slo:http_availability:burn_rate_5m > 14.4
      )
      or
      (
        slo:http_availability:burn_rate_6h > 6
        and
        slo:http_availability:burn_rate_30m > 6
      )
    labels:
      severity: critical

SLO Review Process

Weekly Review

  • Current SLO compliance
  • Error budget status
  • Trend analysis
  • Incident impact

Monthly Review

  • SLO achievement
  • Error budget usage
  • Incident postmortems
  • SLO adjustments

Quarterly Review

  • SLO relevance
  • Target adjustments
  • Process improvements
  • Tooling enhancements

Best Practices

  1. Start with user-facing services
  2. Use multiple SLIs (availability, latency, etc.)
  3. Set achievable SLOs (don't aim for 100%)
  4. Implement multi-window alerts to reduce noise
  5. Track error budget consistently
  6. Review SLOs regularly
  7. Document SLO decisions
  8. Align with business goals
  9. Automate SLO reporting
  10. Use SLOs for prioritization

Reference Files

  • assets/slo-template.md - SLO definition template
  • references/slo-definitions.md - SLO definition patterns
  • references/error-budget.md - Error budget calculations

Related Skills

  • prometheus-configuration - For metric collection
  • grafana-dashboards - For SLO visualization

快速安装

/plugin add https://github.com/camoneart/claude-code/tree/main/slo-implementation

在 Claude Code 中复制并粘贴此命令以安装该技能

GitHub 仓库

camoneart/claude-code
路径: skills/slo-implementation

相关推荐技能

analyzing-dependencies

这个Claude Skill能自动分析项目依赖的安全漏洞、过时包和许可证合规问题。它支持npm、pip、composer、gem和go modules等多种包管理器,帮助开发者识别潜在风险。当您需要检查依赖安全性、更新过时包或确保许可证兼容时,可使用"check dependencies"等触发短语来调用。

查看技能

work-execution-principles

其他

这个Claude Skill为开发者提供了一套通用的工作执行原则,涵盖任务分解、范围确定、测试策略和依赖管理。它确保开发活动中的一致质量标准,适用于代码审查、工作规划和架构决策等场景。该技能与所有编程语言和框架兼容,帮助开发者系统化地组织代码结构和定义工作边界。

查看技能

Git Commit Helper

Git Commit Helper能通过分析git diff自动生成规范的提交信息,适用于开发者编写提交消息或审查暂存区变更时。它能识别代码变更类型并自动匹配Conventional Commits规范,提供包含功能类型、作用域和描述的标准化消息。开发者只需提供git diff内容即可获得即用型的提交消息建议。

查看技能

algorithmic-art

该Skill使用p5.js创建包含种子随机性和交互参数探索的算法艺术,适用于生成艺术、流场或粒子系统等需求。它能自动生成算法哲学文档(.md)和对应的交互式艺术代码(.html/.js),确保作品原创性避免侵权。开发者可通过定义计算美学理念快速获得可交互的艺术实现方案。

查看技能