
canary-deployment

aj-geddes

About

This Claude Skill enables gradual canary deployments: it rolls out a new version to a small subset of users while monitoring metrics, and automatically rolls back when those metrics breach predefined thresholds, minimizing user impact. Use it for low-risk rollouts, real-world testing with live traffic, and metrics-driven deployment strategies.

Documentation

Canary Deployment

Overview

Deploy new versions gradually to a small percentage of users, monitor metrics for issues, and automatically rollback or proceed based on predefined thresholds.
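One way to picture the split: each user (or request) lands in a weighted bucket, and only the canary share is routed to the new version. A minimal sketch of per-user cohort assignment; the cksum-based hashing and the 10% weight are illustrative assumptions (a service mesh like Istio splits per-request at the proxy layer instead):

```shell
#!/bin/sh
# Sketch: deterministic cohort assignment for a canary rollout.
# The hashing scheme and weight are assumptions for illustration only.
CANARY_WEIGHT=10   # percent of users sent to the new version

assign_cohort() {
  # Hash the user id into a stable bucket in the range 0-99
  bucket=$(( $(printf '%s' "$1" | cksum | cut -d' ' -f1) % 100 ))
  if [ "$bucket" -lt "$CANARY_WEIGHT" ]; then
    echo "canary"
  else
    echo "stable"
  fi
}

assign_cohort "user-123"
```

Because the hash is stable, a given user keeps seeing the same version throughout a rollout phase, which keeps sessions consistent.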

When to Use

  • Low-risk gradual rollouts
  • Real-world testing with live traffic
  • Automatic rollback on errors
  • User impact minimization
  • A/B testing integration
  • Metrics-driven deployments
  • High-traffic services

Implementation Examples

1. Istio-based Canary Deployment

# canary-deployment-istio.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp-v1
  namespace: production
spec:
  replicas: 3
  selector:
    matchLabels:
      app: myapp
      version: v1
  template:
    metadata:
      labels:
        app: myapp
        version: v1
    spec:
      containers:
        - name: myapp
          image: myrepo/myapp:1.0.0
          ports:
            - containerPort: 8080

---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp-v2
  namespace: production
spec:
  replicas: 1  # Start with minimal replicas for canary
  selector:
    matchLabels:
      app: myapp
      version: v2
  template:
    metadata:
      labels:
        app: myapp
        version: v2
    spec:
      containers:
        - name: myapp
          image: myrepo/myapp:2.0.0
          ports:
            - containerPort: 8080

---
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: myapp
  namespace: production
spec:
  hosts:
    - myapp
  http:
    # Targeted testing: send all Chrome traffic to v2
    - match:
        - headers:
            user-agent:
              regex: ".*Chrome.*"  # Test with Chrome
      route:
        - destination:
            host: myapp
            subset: v2
          weight: 100
      timeout: 10s

    # Default route: 95% to v1, 5% to v2
    - route:
        - destination:
            host: myapp
            subset: v1
          weight: 95
        - destination:
            host: myapp
            subset: v2
          weight: 5
      timeout: 10s

---
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: myapp
  namespace: production
spec:
  host: myapp
  trafficPolicy:
    connectionPool:
      http:
        http1MaxPendingRequests: 100
        maxRequestsPerConnection: 2

  subsets:
    - name: v1
      labels:
        version: v1

    - name: v2
      labels:
        version: v2
      trafficPolicy:
        outlierDetection:
          consecutive5xxErrors: 3
          interval: 30s
          baseEjectionTime: 30s

---
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: myapp
  namespace: production
spec:
  # Note: Flagger provisions its own primary/canary workloads from this
  # target; it does not reuse the v1/v2 Deployments defined above.
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: myapp
  progressDeadlineSeconds: 300
  service:
    port: 80

  analysis:
    interval: 1m
    threshold: 5
    maxWeight: 50
    stepWeight: 5
    stepWeightPromotion: 10

  metrics:
    - name: request-success-rate
      thresholdRange:
        min: 99
      interval: 1m

    - name: request-duration
      thresholdRange:
        max: 500
      interval: 30s

  webhooks:
    - name: acceptance-test
      url: http://flagger-loadtester/
      timeout: 30s
      metadata:
        type: smoke
        cmd: "curl -sd 'test' http://myapp-canary/api/test"

    - name: load-test
      url: http://flagger-loadtester/
      timeout: 5s
      metadata:
        cmd: "hey -z 1m -q 10 -c 2 http://myapp-canary/"
        logCmdOutput: "true"

  # Analysis stays enabled; Flagger rolls back automatically once failed
  # checks exceed the threshold above
  skipAnalysis: false
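With the analysis settings above (interval 1m, stepWeight 5, maxWeight 50), the happy path takes a bounded number of iterations before promotion. A quick sketch of that arithmetic, assuming every check passes (real rollouts pause or roll back on failed checks):

```shell
#!/bin/sh
# Estimate the minimum canary phase duration implied by the Canary spec above.
INTERVAL_MINUTES=1
STEP_WEIGHT=5
MAX_WEIGHT=50

steps=$(( MAX_WEIGHT / STEP_WEIGHT ))
echo "iterations=$steps min_duration_minutes=$(( steps * INTERVAL_MINUTES ))"
```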

2. Kubernetes Native Canary Script

#!/bin/bash
# canary-rollout.sh - Canary deployment with k8s native tools

set -euo pipefail

NAMESPACE="${1:-production}"
DEPLOYMENT="${2:-myapp}"
NEW_VERSION="${3:-latest}"
CANARY_WEIGHT=10
MAX_WEIGHT=100
STEP_WEIGHT=10
CHECK_INTERVAL=60
MAX_ERROR_RATE=0.05

echo "Starting canary deployment for $DEPLOYMENT with version $NEW_VERSION"

# Get current replicas
CURRENT_REPLICAS=$(kubectl get deployment $DEPLOYMENT -n "$NAMESPACE" \
  -o jsonpath='{.spec.replicas}')
CANARY_REPLICAS=$((CURRENT_REPLICAS / 10 + 1))

echo "Current replicas: $CURRENT_REPLICAS, Canary replicas: $CANARY_REPLICAS"

# Update the canary deployment's image (assumes ${DEPLOYMENT}-canary exists)
kubectl set image deployment/${DEPLOYMENT}-canary \
  ${DEPLOYMENT}=myrepo/${DEPLOYMENT}:${NEW_VERSION} \
  -n "$NAMESPACE"

# Scale the canary to its share of the replica count
kubectl scale deployment/${DEPLOYMENT}-canary \
  --replicas="$CANARY_REPLICAS" -n "$NAMESPACE"

# Scale up canary gradually
CURRENT_WEIGHT=$CANARY_WEIGHT
while [ $CURRENT_WEIGHT -le $MAX_WEIGHT ]; do
  echo "Setting traffic to canary: ${CURRENT_WEIGHT}%"

  # Update the Istio VirtualService to shift the traffic split
  PATCH=$(printf '{"spec":{"http":[{"route":[{"destination":{"host":"%s-stable"},"weight":%d},{"destination":{"host":"%s-canary"},"weight":%d}]}]}}' \
    "$DEPLOYMENT" "$((100 - CURRENT_WEIGHT))" "$DEPLOYMENT" "$CURRENT_WEIGHT")
  kubectl patch virtualservice "$DEPLOYMENT" -n "$NAMESPACE" --type merge -p "$PATCH"

  # Wait and check metrics
  echo "Monitoring metrics for ${CHECK_INTERVAL}s..."
  sleep $CHECK_INTERVAL

  # Check the canary error rate; raw /metrics counters are cumulative, so
  # query Prometheus for a rate instead (assumes Prometheus at prometheus:9090)
  ERROR_RATE=$(curl -s 'http://prometheus:9090/api/v1/query' \
    --data-urlencode 'query=sum(rate(http_requests_total{status=~"5..",deployment="'"${DEPLOYMENT}"'-canary"}[5m])) / sum(rate(http_requests_total{deployment="'"${DEPLOYMENT}"'-canary"}[5m]))' | \
    jq -r '.data.result[0].value[1] // "0"')

  if (( $(echo "$ERROR_RATE > $MAX_ERROR_RATE" | bc -l) )); then
    echo "ERROR: Error rate exceeded threshold: $ERROR_RATE"
    echo "Rolling back canary deployment..."
    kubectl patch virtualservice "$DEPLOYMENT" -n "$NAMESPACE" --type merge \
      -p '{"spec":{"http":[{"route":[{"destination":{"host":"'"${DEPLOYMENT}"'-stable"},"weight":100}]}]}}'
    exit 1
  fi

  CURRENT_WEIGHT=$((CURRENT_WEIGHT + STEP_WEIGHT))
done

# Promote canary to stable
echo "Canary deployment successful! Promoting to stable..."
kubectl set image deployment/${DEPLOYMENT} \
  ${DEPLOYMENT}=myrepo/${DEPLOYMENT}:${NEW_VERSION} \
  -n "$NAMESPACE"

kubectl rollout status deployment/${DEPLOYMENT} -n "$NAMESPACE" --timeout=5m

echo "Canary deployment complete!"
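The guardrail in the loop above reduces to simple arithmetic: divide 5xx responses by total requests over the check window and compare against the threshold. A self-contained sketch with hypothetical counter samples standing in for live metrics:

```shell
#!/bin/sh
# Sketch of the error-rate threshold check with hypothetical counts.
TOTAL_REQS=2000     # requests observed during the check window (assumed)
ERROR_REQS=60       # 5xx responses in the same window (assumed)
MAX_ERROR_RATE=0.05

if awk -v e="$ERROR_REQS" -v t="$TOTAL_REQS" -v m="$MAX_ERROR_RATE" \
     'BEGIN { exit !(e / t > m) }'; then
  echo "rollback"
else
  echo "proceed"   # 60/2000 = 0.03, under the 5% threshold
fi
```

Using awk for the floating-point comparison avoids a dependency on bc, which is not installed in many minimal container images.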

3. Metrics-Based Canary Analysis

# canary-monitoring.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: canary-analysis
  namespace: production
data:
  analyze.sh: |
    #!/bin/bash
    set -euo pipefail

    CANARY_DEPLOYMENT="${1:-myapp-canary}"
    STABLE_DEPLOYMENT="${2:-myapp-stable}"
    THRESHOLD="${3:-0.05}"  # 5% error rate threshold
    NAMESPACE="production"

    echo "Analyzing canary metrics..."

    # Query Prometheus for metrics; the jq '// "0"' fallback covers empty results
    CANARY_ERROR_RATE=$(curl -s 'http://prometheus:9090/api/v1/query' \
      --data-urlencode 'query=rate(http_requests_total{status=~"5..",deployment="'${CANARY_DEPLOYMENT}'"}[5m])' | \
      jq -r '.data.result[0].value[1] // "0"')

    STABLE_ERROR_RATE=$(curl -s 'http://prometheus:9090/api/v1/query' \
      --data-urlencode 'query=rate(http_requests_total{status=~"5..",deployment="'${STABLE_DEPLOYMENT}'"}[5m])' | \
      jq -r '.data.result[0].value[1] // "0"')

    CANARY_LATENCY=$(curl -s 'http://prometheus:9090/api/v1/query' \
      --data-urlencode 'query=histogram_quantile(0.95, rate(http_request_duration_seconds_bucket{deployment="'${CANARY_DEPLOYMENT}'"}[5m]))' | \
      jq -r '.data.result[0].value[1] // "0"')

    STABLE_LATENCY=$(curl -s 'http://prometheus:9090/api/v1/query' \
      --data-urlencode 'query=histogram_quantile(0.95, rate(http_request_duration_seconds_bucket{deployment="'${STABLE_DEPLOYMENT}'"}[5m]))' | \
      jq -r '.data.result[0].value[1] // "0"')

    echo "Canary Error Rate: $CANARY_ERROR_RATE"
    echo "Stable Error Rate: $STABLE_ERROR_RATE"
    echo "Canary P95 Latency: ${CANARY_LATENCY}s"
    echo "Stable P95 Latency: ${STABLE_LATENCY}s"

    # Check if canary is within acceptable range
    if (( $(echo "$CANARY_ERROR_RATE > $THRESHOLD" | bc -l) )); then
      echo "FAIL: Canary error rate exceeds threshold"
      exit 1
    fi

    if (( $(echo "$CANARY_LATENCY > $STABLE_LATENCY * 1.2" | bc -l) )); then
      echo "FAIL: Canary latency is 20% higher than stable"
      exit 1
    fi

    echo "PASS: Canary meets quality criteria"
    exit 0

---
apiVersion: batch/v1
kind: Job
metadata:
  name: canary-analysis
  namespace: production
spec:
  template:
    spec:
      containers:
        - name: analyzer
          # alpine rather than curlimages/curl: the latter runs as a non-root
          # user and ships neither apk write access nor bash
          image: alpine:latest
          command:
            - sh
            - -c
            - |
              apk add --no-cache bash curl bc jq
              bash /scripts/analyze.sh
          volumeMounts:
            - name: scripts
              mountPath: /scripts
      volumes:
        - name: scripts
          configMap:
            name: canary-analysis
      restartPolicy: Never
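The latency guardrail in analyze.sh is relative: the canary fails when its p95 exceeds 1.2x the stable p95. A sketch of that comparison with hypothetical samples standing in for the two Prometheus queries:

```shell
#!/bin/sh
# Sketch of the relative-latency check, with assumed p95 samples.
CANARY_P95=0.30   # seconds (assumed)
STABLE_P95=0.22   # seconds (assumed)

if awk -v c="$CANARY_P95" -v s="$STABLE_P95" 'BEGIN { exit !(c > s * 1.2) }'; then
  echo "FAIL: canary p95 more than 20% above stable"
else
  echo "PASS"
fi
```

A relative check like this adapts to normal load variation, whereas an absolute latency ceiling would trip whenever overall traffic slows both versions down.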

4. Automated Canary Promotion

#!/bin/bash
# promote-canary.sh - Automatically promote successful canary

set -euo pipefail

NAMESPACE="${1:-production}"
DEPLOYMENT="${2:-myapp}"
MAX_DURATION="${3:-600}"  # Max 10 minutes for canary

start_time=$(date +%s)

echo "Starting automated canary promotion for $DEPLOYMENT"

while true; do
  current_time=$(date +%s)
  elapsed=$((current_time - start_time))

  if [ $elapsed -gt $MAX_DURATION ]; then
    echo "ERROR: Canary exceeded max duration"
    exit 1
  fi

  # Check canary health (readyReplicas is absent until at least one pod is Ready)
  CANARY_REPLICAS=$(kubectl get deployment ${DEPLOYMENT}-canary -n "$NAMESPACE" \
    -o jsonpath='{.status.readyReplicas}')
  CANARY_DESIRED=$(kubectl get deployment ${DEPLOYMENT}-canary -n "$NAMESPACE" \
    -o jsonpath='{.spec.replicas}')

  if [ "${CANARY_REPLICAS:-0}" -ne "$CANARY_DESIRED" ]; then
    echo "Waiting for canary pods to be ready..."
    sleep 10
    continue
  fi

  # Run analysis
  if bash /scripts/analyze.sh "$DEPLOYMENT-canary" "$DEPLOYMENT-stable"; then
    echo "Canary analysis passed! Promoting to stable..."

    # Merge canary into stable
    kubectl set image deployment/${DEPLOYMENT} \
      ${DEPLOYMENT}=myrepo/${DEPLOYMENT}:$(kubectl get deployment ${DEPLOYMENT}-canary -n "$NAMESPACE" \
        -o jsonpath='{.spec.template.spec.containers[0].image}' | cut -d: -f2) \
      -n "$NAMESPACE"

    kubectl rollout status deployment/${DEPLOYMENT} -n "$NAMESPACE" --timeout=5m

    echo "Canary promoted successfully!"
    exit 0
  else
    echo "Canary analysis failed. Rolling back..."
    exit 1
  fi
done
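One edge case worth knowing about the readiness lookup: kubectl's jsonpath yields an empty string for `.status.readyReplicas` until a pod becomes Ready, so a bare `[ -ne ]` comparison would error out. A minimal sketch of the guard, with hypothetical values standing in for the two kubectl lookups:

```shell
#!/bin/sh
# Sketch: handle the empty readyReplicas that kubectl returns while a rollout
# is still starting (values here are hypothetical stand-ins).
READY=""      # what jsonpath yields before any pod is Ready
DESIRED=3

if [ "${READY:-0}" -ne "$DESIRED" ]; then
  echo "waiting"
else
  echo "ready"
fi
```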

Canary Best Practices

✅ DO

  • Start with small traffic percentage (5-10%)
  • Monitor key metrics continuously
  • Increase gradually based on metrics
  • Implement automatic rollback
  • Run load tests on canary
  • Test with real user traffic
  • Set appropriate thresholds
  • Document rollback procedures

❌ DON'T

  • Rush through canary phases
  • Ignore metrics
  • Mix canary and stable versions
  • Deploy to all users at once
  • Skip rollback testing
  • Use artificial load only
  • Set unrealistic thresholds
  • Deploy unvalidated changes

Metrics to Monitor

  • Error Rate: increase in 5xx responses
  • Latency: P95/P99 response time
  • Throughput: Requests per second
  • Resource Usage: CPU, memory
  • Business Metrics: Conversion rate, revenue
  • User Experience: Session duration, bounce rate
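
As a starting point, the metrics above map to PromQL roughly as follows; the metric and label names are assumptions that depend on your exporters and relabeling:

```
# Hypothetical PromQL for a canary labeled deployment="myapp-canary"

# Error rate (fraction of responses that are 5xx)
sum(rate(http_requests_total{status=~"5..",deployment="myapp-canary"}[5m]))
  / sum(rate(http_requests_total{deployment="myapp-canary"}[5m]))

# P99 latency
histogram_quantile(0.99,
  sum(rate(http_request_duration_seconds_bucket{deployment="myapp-canary"}[5m])) by (le))

# Throughput (requests per second)
sum(rate(http_requests_total{deployment="myapp-canary"}[5m]))

# CPU usage
sum(rate(container_cpu_usage_seconds_total{pod=~"myapp-canary.*"}[5m]))
```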

Resources

Quick Install

/plugin add https://github.com/aj-geddes/useful-ai-prompts/tree/main/canary-deployment

Paste this command into Claude Code to install the skill.

GitHub Repository

aj-geddes/useful-ai-prompts
Path: skills/canary-deployment
