canary-deployment
About
This Claude Skill enables gradual canary deployments by rolling out new versions to a subset of users while monitoring metrics. It automatically triggers a rollback when monitored metrics breach predefined thresholds, minimizing user impact. Use it for low-risk rollouts, real-world testing with live traffic, and metrics-driven deployment strategies.
Documentation
Canary Deployment
Overview
Deploy new versions gradually to a small percentage of users, monitor metrics for issues, and automatically rollback or proceed based on predefined thresholds.
When to Use
- Low-risk gradual rollouts
- Real-world testing with live traffic
- Automatic rollback on errors
- User impact minimization
- A/B testing integration
- Metrics-driven deployments
- High-traffic services
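Most of these scenarios reduce to the same mechanic: walk traffic through an increasing weight schedule, checking metrics at each step before moving on. A minimal sketch of such a schedule (the start/step/cap values are illustrative, not prescribed by any tool):

```shell
#!/bin/sh
# Print the traffic weights a canary walks through before full promotion.
# The start, step, and cap values are illustrative; tune them per service.
canary_schedule() {
  start=$1; step=$2; cap=$3
  w=$start
  while [ "$w" -le "$cap" ]; do
    echo "$w"
    w=$((w + step))
  done
}

canary_schedule 5 5 25
```

In practice a metrics check runs between each printed step, and a failed check ends the schedule with a rollback instead of the next increment.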
Implementation Examples
1. Istio-based Canary Deployment
# canary-deployment-istio.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp-v1
  namespace: production
spec:
  replicas: 3
  selector:
    matchLabels:
      app: myapp
      version: v1
  template:
    metadata:
      labels:
        app: myapp
        version: v1
    spec:
      containers:
      - name: myapp
        image: myrepo/myapp:1.0.0
        ports:
        - containerPort: 8080
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp-v2
  namespace: production
spec:
  replicas: 1  # Start with minimal replicas for canary
  selector:
    matchLabels:
      app: myapp
      version: v2
  template:
    metadata:
      labels:
        app: myapp
        version: v2
    spec:
      containers:
      - name: myapp
        image: myrepo/myapp:2.0.0
        ports:
        - containerPort: 8080
---
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: myapp
  namespace: production
spec:
  hosts:
  - myapp
  http:
  # Targeted canary: send Chrome traffic to v2 for early testing
  - match:
    - headers:
        user-agent:
          regex: ".*Chrome.*"
    route:
    - destination:
        host: myapp
        subset: v2
      weight: 100
    timeout: 10s
  # Default route: 95% to v1, 5% to v2
  - route:
    - destination:
        host: myapp
        subset: v1
      weight: 95
    - destination:
        host: myapp
        subset: v2
      weight: 5
    timeout: 10s
---
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: myapp
  namespace: production
spec:
  host: myapp
  trafficPolicy:
    connectionPool:
      http:
        http1MaxPendingRequests: 100
        maxRequestsPerConnection: 2
  subsets:
  - name: v1
    labels:
      version: v1
  - name: v2
    labels:
      version: v2
    trafficPolicy:
      outlierDetection:
        consecutive5xxErrors: 3
        interval: 30s
        baseEjectionTime: 30s
---
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: myapp
  namespace: production
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: myapp
  progressDeadlineSeconds: 300
  service:
    port: 80
  analysis:
    interval: 1m
    threshold: 5
    maxWeight: 50
    stepWeight: 5
    stepWeightPromotion: 10
    metrics:
    - name: request-success-rate
      thresholdRange:
        min: 99
      interval: 1m
    - name: request-duration
      thresholdRange:
        max: 500
      interval: 30s
    webhooks:
    - name: acceptance-test
      url: http://flagger-loadtester/
      timeout: 30s
      metadata:
        type: smoke
        cmd: "curl -sd 'test' http://myapp-canary/api/test"
    - name: load-test
      url: http://flagger-loadtester/
      timeout: 5s
      metadata:
        cmd: "hey -z 1m -q 10 -c 2 http://myapp-canary/"
        logCmdOutput: "true"
  # Automatic rollback on failed analysis
  skipAnalysis: false
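Flagger promotes the canary only while every metric stays inside its thresholdRange (a `min` for success rate, a `max` for duration). A standalone sketch of that gate logic, not Flagger's actual implementation:

```shell
#!/bin/sh
# Sketch of a thresholdRange gate: a metric passes when it stays inside
# its allowed range. Mirrors the analysis rules above conceptually;
# this is not Flagger's internal code. awk is used for float comparison.
passes_min() { # value min -> success when value >= min
  awk -v v="$1" -v m="$2" 'BEGIN { exit !(v >= m) }'
}
passes_max() { # value max -> success when value <= max
  awk -v v="$1" -v m="$2" 'BEGIN { exit !(v <= m) }'
}

# success rate 99.4 (min 99) and p95 duration 320ms (max 500) both pass:
if passes_min 99.4 99 && passes_max 320 500; then
  echo "promote"
else
  echo "halt"
fi
```

A single failing metric increments Flagger's failure counter; once it reaches the `threshold` value (5 above), the canary is rolled back.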
2. Kubernetes Native Canary Script
#!/bin/bash
# canary-rollout.sh - Canary deployment with k8s native tools
set -euo pipefail

NAMESPACE="${1:-production}"
DEPLOYMENT="${2:-myapp}"
NEW_VERSION="${3:-latest}"
CANARY_WEIGHT=10
MAX_WEIGHT=100
STEP_WEIGHT=10
CHECK_INTERVAL=60
MAX_ERROR_RATE=0.05

echo "Starting canary deployment for $DEPLOYMENT with version $NEW_VERSION"

# Size the canary at roughly 10% of the stable replica count (minimum 1)
CURRENT_REPLICAS=$(kubectl get deployment "$DEPLOYMENT" -n "$NAMESPACE" \
  -o jsonpath='{.spec.replicas}')
CANARY_REPLICAS=$((CURRENT_REPLICAS / 10 + 1))
echo "Current replicas: $CURRENT_REPLICAS, Canary replicas: $CANARY_REPLICAS"

# Point the canary deployment at the new image
# (assumes a ${DEPLOYMENT}-canary deployment already exists)
kubectl set image "deployment/${DEPLOYMENT}-canary" \
  "${DEPLOYMENT}=myrepo/${DEPLOYMENT}:${NEW_VERSION}" \
  -n "$NAMESPACE"
kubectl scale "deployment/${DEPLOYMENT}-canary" --replicas="$CANARY_REPLICAS" -n "$NAMESPACE"

# Shift traffic to the canary gradually
CURRENT_WEIGHT=$CANARY_WEIGHT
while [ "$CURRENT_WEIGHT" -le "$MAX_WEIGHT" ]; do
  echo "Setting traffic to canary: ${CURRENT_WEIGHT}%"

  # Update the Istio VirtualService to split traffic
  kubectl patch virtualservice "$DEPLOYMENT" -n "$NAMESPACE" --type merge \
    -p "{\"spec\":{\"http\":[{\"route\":[{\"destination\":{\"host\":\"${DEPLOYMENT}-stable\"},\"weight\":$((100 - CURRENT_WEIGHT))},{\"destination\":{\"host\":\"${DEPLOYMENT}-canary\"},\"weight\":${CURRENT_WEIGHT}}]}]}}"

  # Wait and check metrics
  echo "Monitoring metrics for ${CHECK_INTERVAL}s..."
  sleep "$CHECK_INTERVAL"

  # Compute the 5xx error rate from the canary's Prometheus metrics endpoint.
  # (No -it flags: kubectl exec in a script has no TTY.)
  METRICS=$(kubectl exec "deployment/${DEPLOYMENT}-canary" -n "$NAMESPACE" -- \
    curl -s http://localhost:8080/metrics || true)
  ERRORS=$(echo "$METRICS" | awk '/^http_requests_total\{.*status="5/ { sum += $2 } END { print sum + 0 }')
  TOTAL=$(echo "$METRICS" | awk '/^http_requests_total/ { sum += $2 } END { print sum + 0 }')
  ERROR_RATE=$(awk -v e="$ERRORS" -v t="$TOTAL" 'BEGIN { if (t == 0) print 0; else print e / t }')

  if [ "$(echo "$ERROR_RATE > $MAX_ERROR_RATE" | bc -l)" -eq 1 ]; then
    echo "ERROR: Error rate exceeded threshold: $ERROR_RATE"
    echo "Rolling back canary deployment..."
    kubectl patch virtualservice "$DEPLOYMENT" -n "$NAMESPACE" --type merge \
      -p "{\"spec\":{\"http\":[{\"route\":[{\"destination\":{\"host\":\"${DEPLOYMENT}-stable\"},\"weight\":100}]}]}}"
    exit 1
  fi

  CURRENT_WEIGHT=$((CURRENT_WEIGHT + STEP_WEIGHT))
done

# Promote canary to stable
echo "Canary deployment successful! Promoting to stable..."
kubectl set image "deployment/${DEPLOYMENT}" \
  "${DEPLOYMENT}=myrepo/${DEPLOYMENT}:${NEW_VERSION}" \
  -n "$NAMESPACE"
kubectl rollout status "deployment/${DEPLOYMENT}" -n "$NAMESPACE" --timeout=5m
echo "Canary deployment complete!"
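One subtlety in scripted checks like the one above: Prometheus counters are cumulative, so a usable error rate is the ratio of the 5xx count to the total count over the same window, never a raw counter value. A minimal helper illustrating the arithmetic (the function name and inputs are illustrative):

```shell
#!/bin/sh
# Compute an error rate as a fraction: errors / total requests.
# A raw counter value would grow without bound and is meaningless
# against a ratio threshold like 0.05.
error_rate() { # errors total
  awk -v e="$1" -v t="$2" 'BEGIN { if (t == 0) print 0; else printf "%.4f\n", e / t }'
}

error_rate 12 1000   # 12 errors out of 1000 requests -> 0.0120
```

Guarding the `total == 0` case matters: a freshly started canary with no traffic yet should read as a zero error rate, not a division error.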
3. Metrics-Based Canary Analysis
# canary-monitoring.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: canary-analysis
  namespace: production
data:
  analyze.sh: |
    #!/bin/bash
    set -euo pipefail

    CANARY_DEPLOYMENT="${1:-myapp-canary}"
    STABLE_DEPLOYMENT="${2:-myapp-stable}"
    THRESHOLD="${3:-0.05}"  # 5% error rate threshold
    NAMESPACE="production"

    echo "Analyzing canary metrics..."

    # Query Prometheus. The error rate is 5xx traffic as a fraction of all
    # traffic, so it is comparable to THRESHOLD. jq's // "0" covers empty
    # query results, which would otherwise yield "null".
    CANARY_ERROR_RATE=$(curl -s 'http://prometheus:9090/api/v1/query' \
      --data-urlencode 'query=sum(rate(http_requests_total{status=~"5..",deployment="'${CANARY_DEPLOYMENT}'"}[5m])) / sum(rate(http_requests_total{deployment="'${CANARY_DEPLOYMENT}'"}[5m]))' | \
      jq -r '.data.result[0].value[1] // "0"')
    STABLE_ERROR_RATE=$(curl -s 'http://prometheus:9090/api/v1/query' \
      --data-urlencode 'query=sum(rate(http_requests_total{status=~"5..",deployment="'${STABLE_DEPLOYMENT}'"}[5m])) / sum(rate(http_requests_total{deployment="'${STABLE_DEPLOYMENT}'"}[5m]))' | \
      jq -r '.data.result[0].value[1] // "0"')
    CANARY_LATENCY=$(curl -s 'http://prometheus:9090/api/v1/query' \
      --data-urlencode 'query=histogram_quantile(0.95, rate(http_request_duration_seconds_bucket{deployment="'${CANARY_DEPLOYMENT}'"}[5m]))' | \
      jq -r '.data.result[0].value[1] // "0"')
    STABLE_LATENCY=$(curl -s 'http://prometheus:9090/api/v1/query' \
      --data-urlencode 'query=histogram_quantile(0.95, rate(http_request_duration_seconds_bucket{deployment="'${STABLE_DEPLOYMENT}'"}[5m]))' | \
      jq -r '.data.result[0].value[1] // "0"')

    echo "Canary Error Rate: $CANARY_ERROR_RATE"
    echo "Stable Error Rate: $STABLE_ERROR_RATE"
    echo "Canary P95 Latency: ${CANARY_LATENCY}s"
    echo "Stable P95 Latency: ${STABLE_LATENCY}s"

    # Fail if the canary exceeds the absolute error-rate threshold
    if (( $(echo "$CANARY_ERROR_RATE > $THRESHOLD" | bc -l) )); then
      echo "FAIL: Canary error rate exceeds threshold"
      exit 1
    fi

    # Fail if canary p95 latency regresses more than 20% over stable
    if (( $(echo "$CANARY_LATENCY > $STABLE_LATENCY * 1.2" | bc -l) )); then
      echo "FAIL: Canary latency is more than 20% higher than stable"
      exit 1
    fi

    echo "PASS: Canary meets quality criteria"
    exit 0
---
apiVersion: batch/v1
kind: Job
metadata:
  name: canary-analysis
  namespace: production
spec:
  template:
    spec:
      containers:
      - name: analyzer
        # alpine runs as root by default, so apk can install the tools the
        # script needs; curlimages/curl runs as a non-root user (apk add
        # would fail) and ships without bash
        image: alpine:3.19
        command:
        - sh
        - -c
        - |
          apk add --no-cache bash curl bc jq
          bash /scripts/analyze.sh
        volumeMounts:
        - name: scripts
          mountPath: /scripts
      volumes:
      - name: scripts
        configMap:
          name: canary-analysis
      restartPolicy: Never
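The 20% latency-regression rule in analyze.sh can also be expressed without bc, which keeps the check runnable on minimal images. A sketch using awk (function name and sample values are illustrative):

```shell
#!/bin/sh
# Latency gate: fail when canary p95 exceeds stable p95 by more than 20%.
# awk stands in for bc here, since awk is present on virtually every
# minimal image while bc often is not.
latency_regressed() { # canary_p95 stable_p95 -> success when regressed
  awk -v c="$1" -v s="$2" 'BEGIN { exit !(c > s * 1.2) }'
}

# canary 0.70s vs stable 0.50s: 40% slower, so the gate trips
if latency_regressed 0.70 0.50; then echo "FAIL"; else echo "PASS"; fi
```

Comparing against the stable baseline rather than a fixed number makes the gate self-calibrating: a service whose p95 is normally 50ms and one whose p95 is normally 2s get the same relative protection.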
4. Automated Canary Promotion
#!/bin/bash
# promote-canary.sh - Automatically promote successful canary
set -euo pipefail

NAMESPACE="${1:-production}"
DEPLOYMENT="${2:-myapp}"
MAX_DURATION="${3:-600}"  # Max 10 minutes for canary

start_time=$(date +%s)
echo "Starting automated canary promotion for $DEPLOYMENT"

while true; do
  current_time=$(date +%s)
  elapsed=$((current_time - start_time))
  if [ "$elapsed" -gt "$MAX_DURATION" ]; then
    echo "ERROR: Canary exceeded max duration"
    exit 1
  fi

  # Check canary health (readyReplicas is absent from the status until at
  # least one pod is ready, so default it to 0)
  CANARY_READY=$(kubectl get deployment "${DEPLOYMENT}-canary" -n "$NAMESPACE" \
    -o jsonpath='{.status.readyReplicas}')
  CANARY_READY="${CANARY_READY:-0}"
  CANARY_DESIRED=$(kubectl get deployment "${DEPLOYMENT}-canary" -n "$NAMESPACE" \
    -o jsonpath='{.spec.replicas}')

  if [ "$CANARY_READY" -ne "$CANARY_DESIRED" ]; then
    echo "Waiting for canary pods to be ready..."
    sleep 10
    continue
  fi

  # Run analysis
  if bash /scripts/analyze.sh "${DEPLOYMENT}-canary" "${DEPLOYMENT}-stable"; then
    echo "Canary analysis passed! Promoting to stable..."
    # Promote the canary's image tag to the stable deployment
    CANARY_IMAGE=$(kubectl get deployment "${DEPLOYMENT}-canary" -n "$NAMESPACE" \
      -o jsonpath='{.spec.template.spec.containers[0].image}')
    kubectl set image "deployment/${DEPLOYMENT}" \
      "${DEPLOYMENT}=myrepo/${DEPLOYMENT}:${CANARY_IMAGE##*:}" \
      -n "$NAMESPACE"
    kubectl rollout status "deployment/${DEPLOYMENT}" -n "$NAMESPACE" --timeout=5m
    echo "Canary promoted successfully!"
    exit 0
  else
    echo "Canary analysis failed. Rolling back..."
    # Roll back for real: drain the canary so it stops serving traffic
    kubectl scale "deployment/${DEPLOYMENT}-canary" --replicas=0 -n "$NAMESPACE"
    exit 1
  fi
done
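When promoting, extracting the tag from an image reference deserves care: a registry with a port (for example registry:5000/app:1.4) contains more than one colon, so splitting on the first colon picks the wrong field. Shell suffix expansion strips everything up to the last colon instead (function name is illustrative):

```shell
#!/bin/sh
# Extract the tag from a container image reference.
# ${ref##*:} removes the longest prefix ending in ':', leaving the tag,
# which is correct even when the registry host includes a port.
image_tag() {
  echo "${1##*:}"
}

image_tag "myrepo/myapp:2.0.0"          # -> 2.0.0
image_tag "registry:5000/team/app:1.4"  # -> 1.4
```

Digest references (`app@sha256:...`) would need separate handling; the sketch assumes tag-based references like those used throughout this page.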
Canary Best Practices
✅ DO
- Start with small traffic percentage (5-10%)
- Monitor key metrics continuously
- Increase gradually based on metrics
- Implement automatic rollback
- Run load tests on canary
- Test with real user traffic
- Set appropriate thresholds
- Document rollback procedures
❌ DON'T
- Rush through canary phases
- Ignore metrics
- Mix canary and stable versions
- Deploy to all users at once
- Skip rollback testing
- Use artificial load only
- Set unrealistic thresholds
- Deploy unvalidated changes
Metrics to Monitor
- Error Rate: increase in 5xx responses
- Latency: P95/P99 response time
- Throughput: Requests per second
- Resource Usage: CPU, memory
- Business Metrics: Conversion rate, revenue
- User Experience: Session duration, bounce rate
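The raw material for most of these checks is a PromQL query. Illustrative queries for the first three metrics (the metric names, the deployment label, and the myapp-canary value are assumptions; match them to your own instrumentation):

```shell
#!/bin/sh
# Illustrative PromQL for error rate, p95 latency, and throughput.
APP='myapp-canary'  # hypothetical deployment label value

ERROR_RATE_Q='sum(rate(http_requests_total{status=~"5..",deployment="'$APP'"}[5m])) / sum(rate(http_requests_total{deployment="'$APP'"}[5m]))'
P95_LATENCY_Q='histogram_quantile(0.95, rate(http_request_duration_seconds_bucket{deployment="'$APP'"}[5m]))'
THROUGHPUT_Q='sum(rate(http_requests_total{deployment="'$APP'"}[5m]))'

# Each query would go to Prometheus's instant-query API, e.g.:
#   curl -s http://prometheus:9090/api/v1/query --data-urlencode "query=$ERROR_RATE_Q"
printf '%s\n' "$ERROR_RATE_Q" "$P95_LATENCY_Q" "$THROUGHPUT_Q"
```

Business and user-experience metrics usually live outside Prometheus (analytics pipelines, product dashboards), so they tend to gate manual promotion decisions rather than automated ones.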
Resources
Quick Install
/plugin add https://github.com/aj-geddes/useful-ai-prompts/tree/main/canary-deployment
Copy and paste this command in Claude Code to install this skill.
GitHub Repository
