Back to Skills

slo-implementation

camoneart
Updated Yesterday
82 views
2
2
View on GitHub
Othergeneral

About

This Claude Skill helps developers define and implement Service Level Indicators (SLIs) and Service Level Objectives (SLOs) with error budgets and alerting. Use it when establishing reliability targets, implementing SRE practices, or measuring service performance to balance reliability with innovation velocity.

Quick Install

Claude Code

Recommended
Plugin CommandRecommended
/plugin add https://github.com/camoneart/claude-code
Git CloneAlternative
git clone https://github.com/camoneart/claude-code.git ~/.claude/skills/slo-implementation

Copy and paste this command in Claude Code to install this skill

Documentation

SLO Implementation

Framework for defining and implementing Service Level Indicators (SLIs), Service Level Objectives (SLOs), and error budgets.

Purpose

Implement measurable reliability targets using SLIs, SLOs, and error budgets to balance reliability with innovation velocity.

When to Use

  • Define service reliability targets
  • Measure user-perceived reliability
  • Implement error budgets
  • Create SLO-based alerts
  • Track reliability goals

SLI/SLO/SLA Hierarchy

SLA (Service Level Agreement)
  ↓ Contract with customers
SLO (Service Level Objective)
  ↓ Internal reliability target
SLI (Service Level Indicator)
  ↓ Actual measurement

Defining SLIs

Common SLI Types

1. Availability SLI

# Successful requests / Total requests
sum(rate(http_requests_total{status!~"5.."}[28d]))
/
sum(rate(http_requests_total[28d]))

2. Latency SLI

# Requests below latency threshold / Total requests
sum(rate(http_request_duration_seconds_bucket{le="0.5"}[28d]))
/
sum(rate(http_request_duration_seconds_count[28d]))

3. Durability SLI

# Successful writes / Total writes
sum(storage_writes_successful_total)
/
sum(storage_writes_total)

Reference: See references/slo-definitions.md

Setting SLO Targets

Availability SLO Examples

SLO %Downtime/MonthDowntime/Year
99%7.2 hours3.65 days
99.9%43.2 minutes8.76 hours
99.95%21.6 minutes4.38 hours
99.99%4.32 minutes52.56 minutes

Choose Appropriate SLOs

Consider:

  • User expectations
  • Business requirements
  • Current performance
  • Cost of reliability
  • Competitor benchmarks

Example SLOs:

slos:
  - name: api_availability
    target: 99.9
    window: 28d
    sli: |
      sum(rate(http_requests_total{status!~"5.."}[28d]))
      /
      sum(rate(http_requests_total[28d]))

  - name: api_latency_p95
    target: 99
    window: 28d
    sli: |
      sum(rate(http_request_duration_seconds_bucket{le="0.5"}[28d]))
      /
      sum(rate(http_request_duration_seconds_count[28d]))

Error Budget Calculation

Error Budget Formula

Error Budget = 1 - SLO Target

Example:

  • SLO: 99.9% availability
  • Error Budget: 0.1% = 43.2 minutes/month
  • Current Error: 0.05% = 21.6 minutes/month
  • Remaining Budget: 50%

Error Budget Policy

error_budget_policy:
  - remaining_budget: 100%
    action: Normal development velocity
  - remaining_budget: 50%
    action: Consider postponing risky changes
  - remaining_budget: 10%
    action: Freeze non-critical changes
  - remaining_budget: 0%
    action: Feature freeze, focus on reliability

Reference: See references/error-budget.md

SLO Implementation

Prometheus Recording Rules

# SLI Recording Rules
groups:
  - name: sli_rules
    interval: 30s
    rules:
      # Availability SLI
      - record: sli:http_availability:ratio
        expr: |
          sum(rate(http_requests_total{status!~"5.."}[28d]))
          /
          sum(rate(http_requests_total[28d]))

      # Latency SLI (requests < 500ms)
      - record: sli:http_latency:ratio
        expr: |
          sum(rate(http_request_duration_seconds_bucket{le="0.5"}[28d]))
          /
          sum(rate(http_request_duration_seconds_count[28d]))

  - name: slo_rules
    interval: 5m
    rules:
      # SLO compliance (1 = meeting SLO, 0 = violating)
      - record: slo:http_availability:compliance
        expr: sli:http_availability:ratio >= bool 0.999

      - record: slo:http_latency:compliance
        expr: sli:http_latency:ratio >= bool 0.99

      # Error budget remaining (percentage)
      - record: slo:http_availability:error_budget_remaining
        expr: |
          (sli:http_availability:ratio - 0.999) / (1 - 0.999) * 100

      # Error budget burn rate
      - record: slo:http_availability:burn_rate_5m
        expr: |
          (1 - (
            sum(rate(http_requests_total{status!~"5.."}[5m]))
            /
            sum(rate(http_requests_total[5m]))
          )) / (1 - 0.999)

SLO Alerting Rules

groups:
  - name: slo_alerts
    interval: 1m
    rules:
      # Fast burn: 14.4x rate, 1 hour window
      # Consumes 2% error budget in 1 hour
      - alert: SLOErrorBudgetBurnFast
        expr: |
          slo:http_availability:burn_rate_1h > 14.4
          and
          slo:http_availability:burn_rate_5m > 14.4
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Fast error budget burn detected"
          description: "Error budget burning at {{ $value }}x rate"

      # Slow burn: 6x rate, 6 hour window
      # Consumes 5% error budget in 6 hours
      - alert: SLOErrorBudgetBurnSlow
        expr: |
          slo:http_availability:burn_rate_6h > 6
          and
          slo:http_availability:burn_rate_30m > 6
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Slow error budget burn detected"
          description: "Error budget burning at {{ $value }}x rate"

      # Error budget exhausted
      - alert: SLOErrorBudgetExhausted
        expr: slo:http_availability:error_budget_remaining < 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "SLO error budget exhausted"
          description: "Error budget remaining: {{ $value }}%"

SLO Dashboard

Grafana Dashboard Structure:

┌────────────────────────────────────┐
│ SLO Compliance (Current)           │
│ ✓ 99.95% (Target: 99.9%)          │
├────────────────────────────────────┤
│ Error Budget Remaining: 65%        │
│ ████████░░ 65%                     │
├────────────────────────────────────┤
│ SLI Trend (28 days)                │
│ [Time series graph]                │
├────────────────────────────────────┤
│ Burn Rate Analysis                 │
│ [Burn rate by time window]         │
└────────────────────────────────────┘

Example Queries:

# Current SLO compliance
sli:http_availability:ratio * 100

# Error budget remaining
slo:http_availability:error_budget_remaining

# Days until error budget exhausted (at current burn rate)
(slo:http_availability:error_budget_remaining / 100)
*
28
/
(1 - sli:http_availability:ratio) * (1 - 0.999)

Multi-Window Burn Rate Alerts

# Combination of short and long windows reduces false positives
rules:
  - alert: SLOBurnRateHigh
    expr: |
      (
        slo:http_availability:burn_rate_1h > 14.4
        and
        slo:http_availability:burn_rate_5m > 14.4
      )
      or
      (
        slo:http_availability:burn_rate_6h > 6
        and
        slo:http_availability:burn_rate_30m > 6
      )
    labels:
      severity: critical

SLO Review Process

Weekly Review

  • Current SLO compliance
  • Error budget status
  • Trend analysis
  • Incident impact

Monthly Review

  • SLO achievement
  • Error budget usage
  • Incident postmortems
  • SLO adjustments

Quarterly Review

  • SLO relevance
  • Target adjustments
  • Process improvements
  • Tooling enhancements

Best Practices

  1. Start with user-facing services
  2. Use multiple SLIs (availability, latency, etc.)
  3. Set achievable SLOs (don't aim for 100%)
  4. Implement multi-window alerts to reduce noise
  5. Track error budget consistently
  6. Review SLOs regularly
  7. Document SLO decisions
  8. Align with business goals
  9. Automate SLO reporting
  10. Use SLOs for prioritization

Reference Files

  • assets/slo-template.md - SLO definition template
  • references/slo-definitions.md - SLO definition patterns
  • references/error-budget.md - Error budget calculations

Related Skills

  • prometheus-configuration - For metric collection
  • grafana-dashboards - For SLO visualization

GitHub Repository

camoneart/claude-code
Path: skills/slo-implementation

Related Skills

subagent-driven-development

Development

This skill executes implementation plans by dispatching a fresh subagent for each independent task, with code review between tasks. It enables fast iteration while maintaining quality gates through this review process. Use it when working on mostly independent tasks within the same session to ensure continuous progress with built-in quality checks.

View skill

algorithmic-art

Meta

This Claude Skill creates original algorithmic art using p5.js with seeded randomness and interactive parameters. It generates .md files for algorithmic philosophies, plus .html and .js files for interactive generative art implementations. Use it when developers need to create flow fields, particle systems, or other computational art while avoiding copyright issues.

View skill

executing-plans

Design

Use the executing-plans skill when you have a complete implementation plan to execute in controlled batches with review checkpoints. It loads and critically reviews the plan, then executes tasks in small batches (default 3 tasks) while reporting progress between each batch for architect review. This ensures systematic implementation with built-in quality control checkpoints.

View skill

cost-optimization

Other

This Claude Skill helps developers optimize cloud costs through resource rightsizing, tagging strategies, and spending analysis. It provides a framework for reducing cloud expenses and implementing cost governance across AWS, Azure, and GCP. Use it when you need to analyze infrastructure costs, right-size resources, or meet budget constraints.

View skill