
optimize-cloud-costs

Name: optimize-cloud-costs
Author: pjt222

Updated 1 month ago

Category: Other (tagged general)

About

This skill helps developers optimize Kubernetes cloud costs using tools like Kubecost. It provides visibility, recommends resource adjustments, implements autoscaling, and leverages spot instances for savings. Use it when cloud costs outpace business value, resource usage is inefficient, or you need to implement chargeback reporting.

Quick Install

Claude Code

Recommended (primary):

npx skills add pjt222/agent-almanac -a claude-code

Alternative (plugin command):

/plugin add https://github.com/pjt222/agent-almanac

Alternative (git clone):

git clone https://github.com/pjt222/agent-almanac.git ~/.claude/skills/optimize-cloud-costs

Documentation

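name: optimize-cloud-costs
description: >
  Implement cloud cost optimization strategies for Kubernetes workloads using
  tools such as Kubecost, including visibility analysis, right-sizing
  recommendations, horizontal and vertical Pod autoscaling, Spot/preemptible
  instances, and resource quotas. Covers cost allocation, showback and
  chargeback reporting, and continuous optimization practices. Use when cloud
  costs grow out of step with business value, resource requests do not match
  actual usage, manual scaling leads to over-provisioning, or chargeback is
  needed for internal cost accountability.
license: MIT
allowed-tools: Read Write Edit Bash Grep Glob
metadata:
  author: Philipp Thoss
  version: "1.0"
  domain: devops
  complexity: intermediate
  language: multi
  tags: cost-optimization, kubecost, hpa, vpa, spot-instances, resource-management, kubernetes
  locale: zh-CN
  source_locale: en
  source_commit: 6f65f316
  translator: claude-opus-4-6
  translation_date: "2026-03-16"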

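Optimize Cloud Costs

Implement a comprehensive cost optimization strategy for Kubernetes clusters to reduce cloud spending.

When to Use

Cloud infrastructure costs are growing without a matching increase in business value
You need visibility into cost allocation by team, application, or environment
Resource requests/limits do not match actual usage patterns
Manual scaling leads to over-provisioning and wasted resources
You want to run non-critical workloads on Spot/preemptible instances
You need showback or chargeback for internal cost allocation
You want to build a FinOps culture of cost awareness and accountability

Inputs

Required: A Kubernetes cluster running workloads
Required: Access to the cloud provider's billing API
Required: Metrics Server or Prometheus for resource metrics
Optional: Historical usage data for trend analysis
Optional: Cost allocation requirements (by namespace, label, team)
Optional: Service level objectives (SLOs) that define performance constraints
Optional: Budget limits or cost-reduction targets

Steps

See the Extended Examples for complete configuration files and templates.

Step 1: Deploy Cost Visibility Tooling

Install Kubecost or OpenCost for cost monitoring and allocation.

Install Kubecost: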

# Add Kubecost Helm repository
helm repo add kubecost https://kubecost.github.io/cost-analyzer/
helm repo update

# Install Kubecost with Prometheus integration
helm install kubecost kubecost/cost-analyzer \
  --namespace kubecost \
  --create-namespace \
  --set kubecostToken="your-token-here" \
  --set prometheus.server.global.external_labels.cluster_id="production-cluster" \
  --set prometheus.nodeExporter.enabled=true \
  --set prometheus.serviceAccounts.nodeExporter.create=true

# Alternative: reuse an existing Prometheus instead of the bundled one
# (run this in place of the install above, not in addition to it)
helm install kubecost kubecost/cost-analyzer \
  --namespace kubecost \
  --create-namespace \
  --set global.prometheus.enabled=false \
  --set global.prometheus.fqdn="http://prometheus-server.monitoring.svc.cluster.local"

# Verify installation
kubectl get pods -n kubecost
kubectl get svc -n kubecost

# Access Kubecost UI
kubectl port-forward -n kubecost svc/kubecost-cost-analyzer 9090:9090
# Open http://localhost:9090

Configure the cloud provider integration:

# kubecost-cloud-integration.yaml
apiVersion: v1
kind: Secret
metadata:
  name: cloud-integration
  namespace: kubecost
type: Opaque
stringData:
  # For AWS
  cloud-integration.json: |
    {
      "aws": [
        {
          "serviceKeyName": "AWS_ACCESS_KEY_ID",
          "serviceKeySecret": "AWS_SECRET_ACCESS_KEY",
          "athenaProjectID": "cur-query-results",
          "athenaBucketName": "s3://your-cur-bucket",
          "athenaRegion": "us-east-1",
          "athenaDatabase": "athenacurcfn_my_cur",
          "athenaTable": "my_cur"
        }
      ]
    }
---
# For GCP
apiVersion: v1
kind: Secret
metadata:
  name: gcp-key
  namespace: kubecost
type: Opaque
data:
  key.json: <base64-encoded-service-account-key>
---
# For Azure
apiVersion: v1
kind: ConfigMap
metadata:
  name: azure-config
  namespace: kubecost
data:
  azure.json: |
    {
      "azureSubscriptionID": "your-subscription-id",
      "azureClientID": "your-client-id",
      "azureClientSecret": "your-client-secret",
      "azureTenantID": "your-tenant-id",
      "azureOfferDurableID": "MS-AZR-0003P"
    }

Apply the cloud integration:

kubectl apply -f kubecost-cloud-integration.yaml

# Verify cloud costs are being imported
kubectl logs -n kubecost -l app=cost-analyzer -c cost-model --tail=100 | grep -i "cloud"

# Check Kubecost API for cost data
kubectl port-forward -n kubecost svc/kubecost-cost-analyzer 9090:9090 &
sleep 2  # give the background port-forward a moment to start
curl -s "http://localhost:9090/model/allocation?window=7d" | jq .

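Expected: Kubecost pods are running. The UI is reachable and shows cost breakdowns by namespace, Deployment, and Pod. Cloud provider costs begin importing (the initial sync can take 24-48 hours). The API returns allocation data.

On failure:

Check that Prometheus is running and reachable: kubectl get svc -n monitoring prometheus-server
Verify that the cloud credentials have billing API access
Check the cost-model logs: kubectl logs -n kubecost -l app=cost-analyzer -c cost-model
Make sure Metrics Server or the Prometheus node-exporter is collecting resource metrics
Check for network policies that block access to the cloud billing API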
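
Once the allocation query in this step returns data, a quick ranking by cost shows where to look first. The following is a minimal sketch, not part of the skill itself: it assumes the response carries a `data` list of per-window maps whose entries expose a `totalCost` field, and the sample payload is hypothetical. Check the field names against the response from your Kubecost version before relying on it.

```python
import json

# Hypothetical sample shaped like an allocation API response.
# The "data" and "totalCost" field names are assumptions to verify
# against what your Kubecost version actually returns.
SAMPLE = """
{"code": 200, "data": [{
  "production": {"name": "production", "totalCost": 412.50},
  "staging": {"name": "staging", "totalCost": 96.25},
  "kubecost": {"name": "kubecost", "totalCost": 18.00}
}]}
"""


def rank_by_cost(payload: dict) -> list:
    """Sum totalCost per allocation name across all windows, highest first."""
    totals = {}
    for window in payload.get("data", []):
        for name, alloc in (window or {}).items():
            totals[name] = totals.get(name, 0.0) + float(alloc.get("totalCost", 0.0))
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


ranked = rank_by_cost(json.loads(SAMPLE))
for name, cost in ranked:
    print(f"{name:<12} ${cost:,.2f}")
```

To use it with live data, save the curl output to a file and pass `json.load(open(path))` to `rank_by_cost` in place of the sample.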

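Step 2: Analyze Current Resource Utilization

Identify over-provisioned resources and optimization opportunities.

Query resource utilization: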

# Snapshot current per-container CPU and memory usage
kubectl top pods --all-namespaces --containers | \
  awk 'NR>1 {print $1,$2,$3,$4,$5}' > current-usage.txt

# Collect per-container resource requests for comparison with usage
cat <<'EOF' > analyze-utilization.sh
#!/bin/bash
# Emit one CSV row per container of every running pod
echo "Pod,Namespace,Container,CPU-Request,Memory-Request"
for ns in $(kubectl get ns -o jsonpath='{.items[*].metadata.name}'); do
  kubectl get pods -n "$ns" -o json | jq -r '
    .items[] |
    select(.status.phase == "Running") |
    . as $pod |
    .spec.containers[] |
    "\($pod.metadata.name),\($pod.metadata.namespace),\(.name),\(.resources.requests.cpu // "none"),\(.resources.requests.memory // "none")"
  ' 2>/dev/null
done
EOF

chmod +x analyze-utilization.sh
./analyze-utilization.sh > resource-requests.csv

# Get actual usage from metrics server
kubectl top pods --all-namespaces --containers > actual-usage.txt

Use Kubecost recommendations:

# Get right-sizing recommendations via API
curl "http://localhost:9090/model/savings/requestSizing?window=7d" | jq . > recommendations.json

# Extract top wasteful resources
jq '.data[] | select(.totalRecommendedSavings > 10) | {
  cluster: .clusterID,
# ... (see EXAMPLES.md for complete configuration)

Create a utilization dashboard:

# grafana-utilization-dashboard.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: utilization-dashboard
  namespace: monitoring
# ... (see EXAMPLES.md for complete configuration)

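Expected: A clear view of current resource requests versus actual usage. Pods with utilization below 30% (over-provisioned) are identified. Optimization opportunities are listed with estimated savings. The dashboard shows utilization trends over time.

On failure:

Make sure Metrics Server is running: kubectl get deployment metrics-server -n kube-system
Check that Prometheus has node-exporter metrics: curl http://prometheus:9090/api/v1/query?query=node_cpu_seconds_total
Verify that pods have been running long enough to produce meaningful data (at least 24 hours)
Look for gaps in metric collection: check the Prometheus retention period and scrape interval
For Kubecost, make sure at least 48 hours of data have been collected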
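
The comparison between requests and usage in this step can be sketched as follows. This is an illustrative example rather than the skill's own tooling: the sample rows are hypothetical, and treating a pod as over-provisioned only when both CPU and memory utilization fall below 30% is an assumption chosen to keep the flag conservative.

```python
# Multipliers for Kubernetes memory quantity suffixes.
_BINARY = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40}
_DECIMAL = {"k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12}

THRESHOLD = 0.30  # utilization below this counts as over-provisioned


def cpu_to_millicores(quantity: str) -> float:
    """Convert a CPU quantity such as "250m", "1" or "0.5" to millicores."""
    if quantity.endswith("m"):
        return float(quantity[:-1])
    return float(quantity) * 1000


def memory_to_bytes(quantity: str) -> float:
    """Convert a memory quantity such as "128Mi" or "1G" to bytes."""
    for suffix, factor in _BINARY.items():
        if quantity.endswith(suffix):
            return float(quantity[:-2]) * factor
    for suffix, factor in _DECIMAL.items():
        if quantity.endswith(suffix):
            return float(quantity[:-1]) * factor
    return float(quantity)


# Hypothetical rows: (pod, cpu request, cpu usage, memory request, memory usage)
rows = [
    ("api-server", "500m", "100m", "512Mi", "128Mi"),
    ("worker", "250m", "200m", "256Mi", "200Mi"),
    ("cache", "1", "150m", "1Gi", "900Mi"),
]

flagged = []
for pod, cpu_req, cpu_use, mem_req, mem_use in rows:
    cpu_util = cpu_to_millicores(cpu_use) / cpu_to_millicores(cpu_req)
    mem_util = memory_to_bytes(mem_use) / memory_to_bytes(mem_req)
    # Assumption: flag only when both dimensions are under the threshold.
    if cpu_util < THRESHOLD and mem_util < THRESHOLD:
        flagged.append(pod)
    print(f"{pod:<12} cpu={cpu_util:.0%} mem={mem_util:.0%}")

print("over-provisioned:", flagged)
```

In this sample the cache pod has low CPU utilization but high memory utilization, so it is not flagged; shrinking its CPU request alone may still be worthwhile.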

Step 3: Implement Horizontal Pod Autoscaling (HPA)

Configure autoscaling based on CPU, memory, or custom metrics.

Create a CPU-based HPA:

# hpa-cpu.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-server-hpa
  namespace: production
# ... (see EXAMPLES.md for complete configuration)

Deploy and verify the HPA:

kubectl apply -f hpa-cpu.yaml

# Check HPA status
kubectl get hpa -n production
kubectl describe hpa api-server-hpa -n production

# Monitor scaling events
kubectl get events -n production --field-selector involvedObject.kind=HorizontalPodAutoscaler --watch

# Generate load to test autoscaling
kubectl run load-generator --rm -it --image=busybox -- /bin/sh -c \
  "while true; do wget -q -O- http://api-server.production.svc.cluster.local; done"

# Watch replicas scale
watch kubectl get hpa,deployment -n production

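Expected: The HPA is created and shows current/target metrics. The pod count increases under load and decreases when load drops (after the stabilization window). Scaling events are recorded. There is no thrashing (rapid scale-up/scale-down cycles).

On failure:

Verify that Metrics Server is running: kubectl get apiservice v1beta1.metrics.k8s.io
Check that the Deployment has resource requests set (HPA requires them)
Inspect HPA events: kubectl describe hpa api-server-hpa -n production
Make sure the target Deployment has not already reached its maximum replica count
For custom metrics, verify that a metrics adapter is installed and configured
Check the HPA controller logs: kubectl logs -n kube-system -l component=kube-controller-manager | grep horizontal-pod-autoscaler (self-managed control planes only; managed services expose controller logs through the provider)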
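
When the replica count does not move as expected, it helps to work through the replica calculation by hand. The sketch below follows the formula given in the Kubernetes HPA documentation, desired = ceil(current × currentMetric / targetMetric), with the default 10% tolerance. The `min_replicas` and `max_replicas` defaults are hypothetical stand-ins for the values in your HPA spec, and the real controller also applies stabilization windows and scaling policies that are not modeled here.

```python
import math


def desired_replicas(current_replicas: int, current_metric: float,
                     target_metric: float, min_replicas: int = 1,
                     max_replicas: int = 10, tolerance: float = 0.1) -> int:
    """Approximate the HPA replica calculation for a single metric."""
    ratio = current_metric / target_metric
    # Within the tolerance band the controller leaves the replica count alone.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    # Clamp to the bounds configured on the HPA.
    return max(min_replicas, min(max_replicas, desired))


# 4 replicas at 90% CPU against a 60% target: scale up
print(desired_replicas(4, 90, 60))
# 4 replicas at 63% against 60%: inside the tolerance band, no change
print(desired_replicas(4, 63, 60))
# 4 replicas at 20% against 60%: scale down
print(desired_replicas(4, 20, 60))
# 8 replicas at 120% against 60%: capped by max_replicas
print(desired_replicas(8, 120, 60))
```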

Step 4: Configure Vertical Pod Autoscaling (VPA)

Adjust resource requests automatically based on actual usage patterns.

Install VPA:

# Clone VPA repository
git clone https://github.com/kubernetes/autoscaler.git
cd autoscaler/vertical-pod-autoscaler

# Install VPA
./hack/vpa-up.sh

# Verify installation
kubectl get pods -n kube-system | grep vpa

# Check VPA CRDs
kubectl get crd | grep verticalpodautoscaler

Create VPA policies:

# vpa-policies.yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: api-server-vpa
  namespace: production
# ... (see EXAMPLES.md for complete configuration)

Deploy and monitor the VPA:

kubectl apply -f vpa-policies.yaml

# Check VPA recommendations
kubectl get vpa -n production
kubectl describe vpa api-server-vpa -n production

# View detailed recommendations
kubectl get vpa api-server-vpa -n production -o jsonpath='{.status.recommendation}' | jq .

# Monitor VPA-initiated pod updates
kubectl get events -n production --field-selector involvedObject.kind=VerticalPodAutoscaler --watch

# Compare recommendations to current requests
kubectl get deployment api-server -n production -o json | \
  jq '.spec.template.spec.containers[].resources.requests'

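Expected: VPA provides recommendations or updates resource requests automatically. Recommendations are based on percentile usage patterns (typically P95). In Auto/Recreate mode, pods restart with the new request values. HPA and VPA do not conflict (HPA manages the replica count; VPA manages per-pod resources).

On failure:

Make sure Metrics Server has enough data (VPA needs several days to produce accurate recommendations)
Check that the VPA components are running: kubectl get pods -n kube-system | grep vpa
Check the VPA admission controller logs: kubectl logs -n kube-system -l app=vpa-admission-controller
Verify that the webhook is registered: kubectl get mutatingwebhookconfigurations vpa-webhook-config
Do not use VPA and HPA on the same metric (CPU/memory); they will conflict
Start in "Off" mode to review recommendations before enabling automatic updates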
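
To build intuition for why a percentile-based recommendation differs from an average, the idea can be sketched with a nearest-rank percentile plus a safety margin. This is a deliberate simplification and not the VPA recommender's actual algorithm; the 15% margin and the sample values are hypothetical.

```python
import math


def percentile(samples: list, pct: int) -> float:
    """Nearest-rank percentile of a non-empty list of samples."""
    ordered = sorted(samples)
    rank = (pct * len(ordered) + 99) // 100  # integer ceiling of pct% of n
    return ordered[max(rank, 1) - 1]


def recommend_request(samples_millicores: list, pct: int = 95,
                      margin_pct: int = 15) -> int:
    """Suggest a CPU request: the chosen percentile plus a safety margin."""
    base = percentile(samples_millicores, pct)
    return math.ceil(base * (100 + margin_pct) / 100)


# Hypothetical CPU usage samples in millicores: mostly idle, two bursts.
samples = [100] * 18 + [400, 900]

mean = sum(samples) / len(samples)
print(f"mean usage:        {mean:.0f}m")
print(f"P95 usage:         {percentile(samples, 95)}m")
print(f"suggested request: {recommend_request(samples)}m")
```

A request sized to the mean would be exceeded by both bursts, while the P95-based value covers all but the single largest sample.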

Step 5: Use Spot/Preemptible Instances

Configure workloads to schedule onto cost-effective Spot instances.

Create a Spot instance node pool:

# For AWS (via Karpenter)
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: spot-provisioner
spec:
# ... (see EXAMPLES.md for complete configuration)

Configure workloads to use Spot instances:

# spot-workload.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: batch-processor
  namespace: production
# ... (see EXAMPLES.md for complete configuration)

Deploy and monitor Spot usage:

kubectl apply -f spot-workload.yaml

# Monitor spot node allocation
kubectl get nodes -l node-type=spot

# Check workload distribution
# ... (see EXAMPLES.md for complete configuration)

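Expected: Workloads schedule successfully onto Spot nodes. Costs drop significantly (typically 60-90% savings compared with on-demand instances). Spot interruptions are handled gracefully and pods are rescheduled. Monitoring shows the Spot interruption rate and successful recoveries.

On failure:

Verify that Spot instances are available in your region/availability zone
Check that node labels and taints match the workload tolerations
Check the Karpenter logs: kubectl logs -n karpenter -l app.kubernetes.io/name=karpenter
Make sure workloads are stateless or have appropriate state management to survive interruptions
Test interruption handling: manually cordon and drain a Spot node
Monitor the interruption rate; if it is too high, consider falling back to on-demand nodes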
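
The savings figure quoted in this step applies only to the share of capacity that actually moves to Spot, so the overall reduction is smaller than the headline discount. A rough estimate under stated assumptions: the monthly bill, Spot share, and discount below are hypothetical inputs, and interruption overhead is ignored.

```python
def blended_monthly_cost(on_demand_monthly: float, spot_fraction: float,
                         spot_discount: float) -> float:
    """Estimate monthly compute cost after moving part of the capacity to Spot.

    spot_fraction: share of capacity moved to Spot (0.0 to 1.0)
    spot_discount: Spot price reduction versus on-demand (0.0 to 1.0)
    """
    spot_part = on_demand_monthly * spot_fraction * (1 - spot_discount)
    on_demand_part = on_demand_monthly * (1 - spot_fraction)
    return spot_part + on_demand_part


baseline = 10_000.0  # hypothetical all-on-demand monthly bill
blended = blended_monthly_cost(baseline, spot_fraction=0.5, spot_discount=0.7)
savings = baseline - blended

print(f"blended cost: ${blended:,.2f}")
print(f"savings:      ${savings:,.2f} ({savings / baseline:.0%} of baseline)")
```

With half the capacity on Spot at a 70% discount, the overall bill drops by roughly a third, not by 70%.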

Step 6: Implement Resource Quotas and Budget Alerts

Set hard limits and cost-control alerts.

Create resource quotas:

# resource-quotas.yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: production-quota
  namespace: production
# ... (see EXAMPLES.md for complete configuration)

Configure budget alerts:

# kubecost-budget-alerts.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: budget-alerts
  namespace: kubecost
# ... (see EXAMPLES.md for complete configuration)

Apply and monitor:

kubectl apply -f resource-quotas.yaml
kubectl apply -f kubecost-budget-alerts.yaml

# Check quota usage
kubectl get resourcequota -n production
kubectl describe resourcequota production-quota -n production
# ... (see EXAMPLES.md for complete configuration)

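Expected: Resource quotas enforce limits per namespace. Pod creation is blocked when a quota is exceeded. Alerts are sent when budget thresholds are crossed. Cost spike detection works. Reports are sent to stakeholders on a regular schedule.

On failure:

Verify that ResourceQuota and LimitRange objects are applied correctly: kubectl get resourcequota,limitrange -A
Look for pods that failed because of quota: kubectl get events -n production | grep quota
Check the Kubecost alert configuration: kubectl logs -n kubecost -l app=cost-analyzer | grep alert
Make sure Prometheus has Kubecost metrics: curl http://prometheus:9090/api/v1/query?query=kubecost_monthly_cost
Test alert routing: verify the email/Slack webhook configuration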
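
Before rolling out a new Deployment into a quota-constrained namespace, the headroom can be checked by hand from the output of `kubectl describe resourcequota`. A minimal sketch of that arithmetic, with hypothetical quota values and a hypothetical `Quota` helper type; it covers only `requests.cpu` and `requests.memory`.

```python
from dataclasses import dataclass


@dataclass
class Quota:
    """Hard and used values for a namespace quota (hypothetical helper type)."""
    hard_cpu_m: int   # requests.cpu hard limit, in millicores
    used_cpu_m: int
    hard_mem_mi: int  # requests.memory hard limit, in MiB
    used_mem_mi: int


def exceeded_resources(quota: Quota, req_cpu_m: int, req_mem_mi: int,
                       replicas: int = 1) -> list:
    """Return the quota keys that the new pods would push over the hard limit."""
    exceeded = []
    if quota.used_cpu_m + req_cpu_m * replicas > quota.hard_cpu_m:
        exceeded.append("requests.cpu")
    if quota.used_mem_mi + req_mem_mi * replicas > quota.hard_mem_mi:
        exceeded.append("requests.memory")
    return exceeded


# Hypothetical quota state for the production namespace.
quota = Quota(hard_cpu_m=20_000, used_cpu_m=18_500,
              hard_mem_mi=40_960, used_mem_mi=30_000)

# Four replicas at 500m / 1024Mi each need 2000m, but only 1500m remain.
print(exceeded_resources(quota, req_cpu_m=500, req_mem_mi=1024, replicas=4))
# Two replicas fit within both limits.
print(exceeded_resources(quota, req_cpu_m=500, req_mem_mi=1024, replicas=2))
```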

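Validation Checklist

Common Pitfalls

Aggressive right-sizing: Do not apply VPA recommendations immediately. Start in "Off" mode, observe the recommendations for a week, then apply them gradually. Sudden changes can cause OOMKills or CPU throttling.
HPA + VPA conflict: Never use HPA and VPA on the same metric (CPU/memory). Use HPA for horizontal scaling and VPA for per-pod resource tuning, or run HPA on custom metrics while VPA manages resources.
Spot without fault tolerance: Run only fault-tolerant, stateless workloads on Spot. Never run databases, stateful services, or single-replica critical services there. Always use a PodDisruptionBudget.
Insufficient monitoring period: Cost optimization decisions need historical data. Wait at least 7 days before making changes; VPA recommendations need 30 days and trend analysis needs 90 days.
Ignoring burst demand: Limits set too low from average usage cause throttling during traffic peaks. Use the P95 or P99 percentile, not the mean, for capacity planning.
Network egress costs: Compute costs are visible in Kubecost, but egress (data transfer) can be significant. Monitor cross-AZ traffic, use topology-aware routing, and account for data transfer costs in the architecture.
Overlooking storage costs: PersistentVolume costs are often forgotten. Audit unused PVCs, right-size volumes, use volume expansion instead of over-provisioning, and implement a PV cleanup policy.
Overly strict quotas: Quotas set too low block legitimate growth. Review quota usage monthly, adjust to actual needs, and communicate limits to teams before enforcing them.
False savings from the wrong metrics: Optimizing on CPU/memory alone ignores I/O, network, and storage costs. Consider total cost of ownership, not just compute cost.
Chargeback before trust: Introducing chargeback before teams understand and trust the cost data creates friction. Start with showback (informational), build a cost-aware culture, and only then move to chargeback.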
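
A showback report of the kind recommended as a starting point can be as simple as direct cost per team plus a proportional share of shared overhead. The sketch below is one possible allocation rule, not the one Kubecost applies; the team names and amounts are hypothetical.

```python
def showback(direct_costs: dict, shared_cost: float) -> dict:
    """Split shared cost across teams in proportion to their direct cost."""
    total_direct = sum(direct_costs.values())
    report = {}
    for team, direct in direct_costs.items():
        # Multiply before dividing to keep the arithmetic as exact as possible.
        share = shared_cost * direct / total_direct if total_direct else 0.0
        report[team] = {"direct": direct, "shared": share, "total": direct + share}
    return report


# Hypothetical monthly figures: direct cost per team and shared overhead
# (idle capacity, system namespaces, monitoring).
direct = {"payments": 600.0, "search": 300.0, "platform": 100.0}
report = showback(direct, shared_cost=200.0)

for team, row in report.items():
    print(f"{team:<10} direct=${row['direct']:.2f} "
          f"shared=${row['shared']:.2f} total=${row['total']:.2f}")
```

Proportional splitting is easy to explain to teams, which matters while the report is still informational; an even split or usage-weighted split are alternatives if direct cost is a poor proxy for shared usage.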

GitHub Repository

pjt222/agent-almanac

Path: i18n/zh-CN/skills/optimize-cloud-costs


FAQ

Frequently asked questions

What is the optimize-cloud-costs skill?

optimize-cloud-costs is a Claude Skill by pjt222. Skills package instructions and resources that Claude loads on demand, so Claude can perform optimize-cloud-costs-related tasks without extra prompting.

How do I install optimize-cloud-costs?

Use the install commands on this page: add optimize-cloud-costs to Claude Code as a plugin, or clone its repository into your skills directory, then restart Claude so it picks up the skill.

What category does optimize-cloud-costs belong to?

optimize-cloud-costs is in the Other category, tagged general.

Is optimize-cloud-costs free to use?

Yes. optimize-cloud-costs is listed on AIMCP and free to install. It runs inside Claude, so no separate service account is required to use the skill itself.

