
deploy-ml-model-serving

pjt222
Updated 2 days ago

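About

This skill deploys ML models to production using MLflow, BentoML, or Seldon Core, exposing REST/gRPC endpoints. It implements autoscaling, monitoring, and A/B testing for high-performance inference at scale. Use it when setting up real-time prediction APIs, managing variable load, or moving from batch to online inference.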

Quick Install

Claude Code

Recommended (default):
npx skills add pjt222/agent-almanac -a claude-code
Plugin command (alternative):
/plugin add https://github.com/pjt222/agent-almanac
Git clone (alternative):
git clone https://github.com/pjt222/agent-almanac.git ~/.claude/skills/deploy-ml-model-serving

Copy and paste one of these commands into Claude Code to install the skill.

Documentation

Deploy ML Model Serving

See Extended Examples for complete configuration files and templates.

Deploy ML models to production with scalable serving, monitoring, and A/B testing.

Use When

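  • Deploying trained models to production for real-time inference
  • Serving predictions through REST or gRPC APIs
  • Autoscaling to handle variable load
  • A/B testing model versions
  • Migrating from batch to real-time inference
  • Building low-latency prediction services
  • Managing multiple model versions in production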

Inputs

  • Required: Registered model (MLflow Model Registry) or trained model artifact
  • Required: Kubernetes or another container orchestration platform
  • Required: Serving framework (MLflow, BentoML, Seldon Core, TorchServe)
  • Optional: GPU resources for deep learning models
  • Optional: Monitoring stack (Prometheus, Grafana)
  • Optional: Load balancer and ingress controller

Procedure

Step 1: MLflow Models Serving

MLflow's built-in server is the quickest way to serve scikit-learn, PyTorch, or TensorFlow models.

# Serve model locally for testing
mlflow models serve \
  --model-uri models:/customer-churn-classifier/Production \
  --port 5001 \
  --host 0.0.0.0

# Test endpoint
curl -X POST http://localhost:5001/invocations \
  -H 'Content-Type: application/json' \
  -d '{
    "dataframe_records": [
      {"feature1": 1.0, "feature2": 2.0, "feature3": 3.0}
    ]
  }'

Docker deploy:

# Dockerfile.mlflow-serving
FROM python:3.9-slim

# Install MLflow and dependencies
RUN pip install mlflow boto3 scikit-learn

# Set environment variables
ENV MLFLOW_TRACKING_URI=http://mlflow-server:5000
# ... (see EXAMPLES.md for complete implementation)

Docker Compose:

# docker-compose.mlflow-serving.yml
version: '3.8'

services:
  model-server:
    build:
      context: .
      dockerfile: Dockerfile.mlflow-serving
# ... (see EXAMPLES.md for complete implementation)

Test:

# test_mlflow_serving.py
import requests
import json

def test_prediction():
    url = "http://localhost:8080/invocations"

    # Prepare input data
# ... (see EXAMPLES.md for complete implementation)

Expected: the server starts, accepts HTTP POST requests, and returns JSON predictions; the Docker container runs without errors.

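If it fails: check that the model URI is valid (confirm the model name and stage in the MLflow Model Registry), the tracking server is reachable, dependencies are installed in the container, the port is free (netstat -tulpn | grep 8080), and the model flavor is compatible; then inspect docker logs <container-id>.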
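
The elided test script can be approximated with a small client. This is a minimal sketch, assuming an MLflow 2.x scoring server that accepts the `dataframe_records` payload shown in the curl call and wraps results in a `predictions` key. The helper names and the default URL (copied from the local curl example) are illustrative and are not taken from EXAMPLES.md.

```python
import json
import urllib.request


def build_payload(records):
    """Wrap feature dicts in the `dataframe_records` envelope used by the curl call."""
    if not records:
        raise ValueError("at least one record is required")
    return json.dumps({"dataframe_records": list(records)}).encode("utf-8")


def extract_predictions(body):
    """Return the prediction list from a response body.

    Assumes the MLflow 2.x shape {"predictions": [...]}; a bare JSON list
    is passed through unchanged.
    """
    parsed = json.loads(body)
    if isinstance(parsed, dict):
        return parsed["predictions"]
    return parsed


def predict(records, url="http://localhost:5001/invocations", timeout=10):
    """POST records to the scoring endpoint and return the predictions."""
    request = urllib.request.Request(
        url,
        data=build_payload(records),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return extract_predictions(response.read())
```

With the server from the first command running, `predict([{"feature1": 1.0, "feature2": 2.0, "feature3": 3.0}])` should return one prediction per record. Change `url` if the container publishes a different port.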

Step 2: BentoML for production scale

BentoML adds more advanced serving features and better performance than the built-in MLflow server.

# bentoml_service.py
import bentoml
from bentoml.io import JSON, NumpyNdarray
import numpy as np
import pandas as pd

# Load model from MLflow
import mlflow
# ... (see EXAMPLES.md for complete implementation)

Build and containerize:

# Build Bento
bentoml build

# Containerize
bentoml containerize customer_churn_classifier:latest \
  --image-tag customer-churn:v1.0

# Run container
docker run -p 3000:3000 customer-churn:v1.0

BentoML config:

# bentofile.yaml
service: "bentoml_service:ChurnPredictionService"
include:
  - "bentoml_service.py"
  - "preprocessing.py"
python:
  packages:
    - scikit-learn==1.0.2
    - pandas==1.4.0
    - numpy==1.22.0
    - mlflow==2.0.1
docker:
  distro: debian
  python_version: "3.9"
  cuda_version: null  # Set to "11.6" for GPU support

K8s deploy:

# k8s/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: churn-prediction
  labels:
    app: churn-prediction
spec:
# ... (see EXAMPLES.md for complete implementation)

Deploy to Kubernetes:

# Apply Kubernetes manifests
kubectl apply -f k8s/deployment.yaml

# Check deployment status
kubectl get deployments
kubectl get pods
kubectl get services

# Test endpoint
EXTERNAL_IP=$(kubectl get svc churn-prediction-service -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
curl -X POST http://$EXTERNAL_IP/predict \
  -H 'Content-Type: application/json' \
  -d '{"instances": [{"tenure": 12, "monthly_charges": 70.35}]}'

Expected: the Bento builds, the container serves requests, Kubernetes runs 3 replicas, the load balancer exposes an external endpoint, and health checks pass.

If it fails: check bentoml --version, that the model is in the store (bentoml models list), that Docker is running, cluster access (kubectl cluster-info), resource limits, pod logs (kubectl logs <pod-name>), and that the Service selector matches the pod labels.
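
A Service selector that does not match the pod labels is a common reason for an endpoint that never responds. The sketch below shows the matching rule as plain Python: every key/value pair in the selector must be present in the pod's labels. The function names are hypothetical. The `app: churn-prediction` label comes from the Deployment manifest above, and the extra `pod-template-hash` value in the usage note is made up.

```python
def selector_matches(selector, pod_labels):
    """Return True when every selector key/value pair appears in the pod labels.

    An empty selector returns False here, because a Service without a
    selector does not get automatically managed endpoints.
    """
    if not selector:
        return False
    return all(pod_labels.get(key) == value for key, value in selector.items())


def mismatched_keys(selector, pod_labels):
    """List the selector keys whose values differ from (or are missing in) the pod labels."""
    return sorted(key for key, value in selector.items() if pod_labels.get(key) != value)
```

For example, `selector_matches({"app": "churn-prediction"}, {"app": "churn-prediction", "pod-template-hash": "abc123"})` is true, while a selector of `{"app": "churn-predictor"}` fails and `mismatched_keys` reports `["app"]`.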

Step 3: Seldon Core for advanced deployments

Seldon Core supports multi-model serving, A/B testing, and explainability.

# seldon_wrapper.py
import logging
from typing import Dict, List, Union
import numpy as np
import mlflow

logger = logging.getLogger(__name__)

# ... (see EXAMPLES.md for complete implementation)

Seldon deploy config:

# seldon-deployment.yaml
apiVersion: machinelearning.seldon.io/v1
kind: SeldonDeployment
metadata:
  name: churn-classifier
  namespace: seldon
spec:
  name: churn-classifier
# ... (see EXAMPLES.md for complete implementation)

A/B test:

# seldon-ab-test.yaml
apiVersion: machinelearning.seldon.io/v1
kind: SeldonDeployment
metadata:
  name: churn-classifier-ab
spec:
  name: churn-classifier-ab
  predictors:
# ... (see EXAMPLES.md for complete implementation)

Deploy:

# Install Seldon Core operator
kubectl create namespace seldon-system
helm install seldon-core seldon-core-operator \
  --repo https://storage.googleapis.com/seldon-charts \
  --namespace seldon-system \
  --set usageMetrics.enabled=true

# Create namespace for models
# ... (see EXAMPLES.md for complete implementation)

Expected: the Seldon operator is running, model pods are created, the REST endpoint responds, the A/B test splits traffic, and analytics are recorded.

If it fails: check the operator (kubectl get pods -n seldon-system), the SeldonDeployment status (kubectl describe seldondeployment), image registry access, model URI resolution, RBAC permissions, and the model container logs.
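
To judge whether an A/B test "splits traffic" as configured, it helps to know what the observed proportions should look like. The simulation below is only an illustration of weighted routing with made-up predictor names (`model-a`, `model-b`) and an assumed 80/20 split. It does not reproduce how Seldon Core or its ingress layer routes requests.

```python
import random
from collections import Counter


def route_request(traffic, rng):
    """Pick one predictor name at random, weighted by its traffic percentage."""
    names = list(traffic)
    weights = [traffic[name] for name in names]
    return rng.choices(names, weights=weights, k=1)[0]


def simulate_split(traffic, n_requests, seed=0):
    """Route n_requests and count how many each predictor received."""
    if sum(traffic.values()) != 100:
        raise ValueError("traffic percentages must sum to 100")
    rng = random.Random(seed)  # seeded so repeated runs give the same counts
    return Counter(route_request(traffic, rng) for _ in range(n_requests))
```

Running `simulate_split({"model-a": 80, "model-b": 20}, 10_000)` gives counts close to 8,000 and 2,000. Request counts per predictor in the serving metrics should show a similar ratio once enough traffic has flowed.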

Step 4: Monitoring and observability

Collect comprehensive serving metrics.

# monitoring.py
from prometheus_client import Counter, Histogram, Gauge, start_http_server
import time
import logging

logger = logging.getLogger(__name__)

# Prometheus metrics
# ... (see EXAMPLES.md for complete implementation)

Prometheus config:

# prometheus-config.yaml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'model-serving'
    kubernetes_sd_configs:
# ... (see EXAMPLES.md for complete implementation)

Grafana dashboard JSON:

{
  "dashboard": {
    "title": "ML Model Serving Metrics",
    "panels": [
      {
        "title": "Predictions Per Second",
        "targets": [
          {
# ... (see EXAMPLES.md for complete implementation)

Expected: Prometheus scrapes the metrics successfully, and Grafana shows throughput, latency, error rates, and active requests in real time.

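If it fails: check that scrape targets are UP (http://prometheus:9090/targets), the metrics endpoint responds (curl http://model-pod:8000/metrics), Kubernetes service discovery is configured, the Grafana data source is set, and the metrics port is open in the firewall.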
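
Latency panels built on a Prometheus Histogram report estimated percentiles, not exact ones. The sketch below shows the idea in plain Python: find the bucket that contains the target rank and interpolate linearly inside it. It is a simplified illustration in the spirit of PromQL's `histogram_quantile`, and it assumes the lowest bucket starts at 0. The bucket boundaries and counts in the usage note are invented.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative histogram buckets.

    `buckets` is a list of (upper_bound, cumulative_count) pairs sorted by
    upper bound and ending with the (math.inf, total_count) bucket.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Rank falls in the overflow bucket: report the highest finite bound.
                return prev_bound
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound
```

With buckets `[(0.1, 50), (0.5, 90), (1.0, 100), (math.inf, 100)]`, the estimated p95 is 0.75 seconds, halfway through the 0.5 to 1.0 bucket. The estimate can only be as precise as the bucket boundaries, so boundaries should bracket the latency target.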

Step 5: Autoscaling

Use a HorizontalPodAutoscaler (HPA) to scale on request load.

# hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: churn-prediction-hpa
  namespace: seldon
spec:
  scaleTargetRef:
# ... (see EXAMPLES.md for complete implementation)

Apply:

# Enable metrics server (if not already installed)
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml

# Apply HPA
kubectl apply -f hpa.yaml

# Check HPA status
kubectl get hpa -n seldon
kubectl describe hpa churn-prediction-hpa -n seldon

# Load test to trigger scaling
kubectl run -it --rm load-generator --image=busybox --restart=Never -- /bin/sh -c "while sleep 0.01; do wget -q -O- http://churn-prediction-service/predict; done"

# Watch scaling
kubectl get hpa -n seldon --watch

Expected: the HPA monitors CPU, memory, or custom metrics, scales up under load, scales down after the stabilization window, and respects the min/max replica limits.

If it fails: check metrics-server (kubectl get deployment metrics-server -n kube-system), that pod resource requests are defined, that custom metrics are available, RBAC permissions, and the stabilization windows.
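
When the HPA does not scale as expected, it helps to compute by hand what it should do. The sketch below applies the documented scaling rule, desired = ceil(current replicas × current metric / target metric), with the default 10% tolerance and clamping to the min/max bounds. It ignores stabilization windows, scaling policies, and pods that are not ready. The function name and the numbers in the usage note are illustrative.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas, max_replicas, tolerance=0.1):
    """Compute the replica count an HPA would aim for on a single metric."""
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        # Close enough to the target: no scaling action.
        desired = current_replicas
    else:
        desired = math.ceil(current_replicas * ratio)
    # The result is always kept within the configured bounds.
    return max(min_replicas, min(max_replicas, desired))
```

For instance, 3 replicas at 90% average CPU against a 60% target give `desired_replicas(3, 90, 60, 2, 10) == 5`, while 63% against the same target stays at 3 because it is inside the tolerance band.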

Step 6: Canary deployment

Shift traffic gradually from the stable model to the canary.

# canary-deployment.yaml
apiVersion: machinelearning.seldon.io/v1
kind: SeldonDeployment
metadata:
  name: churn-classifier-canary
spec:
  name: churn-classifier-canary
  predictors:
# ... (see EXAMPLES.md for complete implementation)

Gradual rollout:

# canary_rollout.py
import time
import subprocess
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

# ... (see EXAMPLES.md for complete implementation)

Expected: the canary starts at 0% traffic, traffic shifts gradually, health checks pass at each stage, the rollout rolls back on degradation, and the full rollout completes once all stages pass.

If it fails: check that multiple predictors are defined, traffic weights sum to 100%, the canary image is pullable, Prometheus metrics are available for health checks, the rollback logic works, and the logs of both versions.
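
The control flow of a gradual rollout can be sketched independently of Seldon. The version below is not the elided canary_rollout.py. It assumes two caller-supplied hooks, `set_canary_traffic` and `is_healthy` (both hypothetical, for example a kubectl patch and a Prometheus query), and an arbitrary stage list.

```python
import logging
import time

logger = logging.getLogger(__name__)


def run_canary_rollout(set_canary_traffic, is_healthy,
                       stages=(10, 25, 50, 100), soak_seconds=0.0):
    """Shift traffic stage by stage; roll back to 0% on the first failed check.

    Returns True if the rollout reached the final stage, False if it rolled back.
    """
    set_canary_traffic(0)  # start with the canary receiving no traffic
    for percent in stages:
        set_canary_traffic(percent)
        if soak_seconds > 0:
            time.sleep(soak_seconds)  # let metrics accumulate before judging
        if not is_healthy(percent):
            logger.warning("Canary unhealthy at %d%%, rolling back", percent)
            set_canary_traffic(0)
            return False
        logger.info("Canary healthy at %d%%", percent)
    return True
```

Keeping the hooks as parameters makes the rollback path testable without a cluster: pass a health check that fails at a chosen stage and confirm that the last traffic value set is 0.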

Check

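  • Model server responds to prediction requests
  • REST/gRPC endpoints work and are documented
  • Docker containers build and run
  • Kubernetes creates the expected number of replicas
  • Load balancer exposes an external endpoint
  • Liveness and readiness probes pass
  • Prometheus scrapes metrics
  • Grafana dashboards update in real time
  • Autoscaling triggers under load
  • A/B test splits traffic correctly
  • Canary rolls out gradually
  • Rollback works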

Traps

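  • Cold start: The first request is slow. Add a readiness probe delay and cache the loaded model.
  • Memory leaks: Memory accumulates over time. Monitor usage, restart periodically, and profile.
  • Dependency conflicts: Pin exact versions and test the Docker image before deploying.
  • Resource limits too low: Causes OOM kills or CPU throttling. Profile and set limits from load tests.
  • No health checks: Kubernetes routes traffic to unhealthy pods. Add liveness and readiness probes.
  • No rollback plan: A bad deploy stays live. Use canary releases and keep the previous version.
  • Ignoring latency: Optimizing only for accuracy. Benchmark, optimize, and batch requests.
  • Single replica: No high availability. Run at least 2 replicas with pod anti-affinity.
  • No monitoring: Problems surface only through user complaints. Collect metrics from day one.
  • GPU unused: Check CUDA visible devices and the Kubernetes GPU allocation.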
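
For the latency and cold-start traps, a small benchmark gives concrete numbers before load testing. The harness below is a sketch using only the standard library. `predict_fn` stands for any callable that performs one prediction (an assumption, not an API from this skill), and the warmup calls are excluded so cold-start cost does not skew the percentiles.

```python
import statistics
import time


def benchmark(predict_fn, payload, warmup=5, runs=50):
    """Time repeated calls to predict_fn and summarize latency in seconds."""
    if runs < 2:
        raise ValueError("runs must be at least 2")
    for _ in range(warmup):
        predict_fn(payload)  # warm caches and lazy model loading
    latencies = []
    for _ in range(runs):
        start = time.perf_counter()
        predict_fn(payload)
        latencies.append(time.perf_counter() - start)
    # quantiles(n=100) returns 99 cut points; index 94 is the 95th percentile.
    cuts = statistics.quantiles(latencies, n=100, method="inclusive")
    return {
        "p50": statistics.median(latencies),
        "p95": cuts[94],
        "max": max(latencies),
    }
```

Comparing the first-call time against `p50` from this harness shows how large the cold-start penalty is, which in turn suggests a readiness probe delay.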

Related Skills

  • register-ml-model: register a model before deploying
  • run-ab-test-models: A/B test model versions
  • deploy-to-kubernetes: Kubernetes deployment patterns
  • monitor-ml-model-performance: drift and degradation monitoring
  • orchestrate-ml-pipeline: automated retraining and deployment

GitHub Repository

pjt222/agent-almanac
Path: i18n/caveman-ultra/skills/deploy-ml-model-serving
