MCP HubMCP Hub
스킬 목록으로 돌아가기

version-ml-data

pjt222
업데이트됨 Yesterday
1 조회
17
2
17
GitHub에서 보기
메타automationdesigndata

정보

이 스킬은 DVC를 사용하여 머신러닝 데이터셋을 버전 관리하며, 의존성 추적과 Git 연동을 통해 재현 가능한 데이터 파이프라인을 구축합니다. Git에 저장하기 어려운 대용량 데이터셋을 처리하도록 설계되었으며, 모델 재현성을 위한 데이터 계보를 보장합니다. 코드와 함께 데이터 변경 사항을 추적하거나, 팀 내에서 데이터셋을 공유하거나, 규정 준수를 위해 데이터를 감사해야 할 때 사용하세요.

빠른 설치

Claude Code

추천
기본
npx skills add pjt222/agent-almanac -a claude-code
플러그인 명령대체
/plugin add https://github.com/pjt222/agent-almanac
Git 클론대체
git clone https://github.com/pjt222/agent-almanac.git ~/.claude/skills/version-ml-data

Claude Code에서 이 명령을 복사하여 붙여넣어 스킬을 설치하세요

문서

Version ML Data

See Extended Examples for complete configuration files and templates.

Implement data version control for machine learning datasets to ensure reproducibility and track data lineage.

When to Use

  • Versioning large datasets that don't fit in Git
  • Tracking data changes alongside code changes
  • Ensuring reproducibility of ML experiments
  • Building automated data pipelines with dependency tracking
  • Sharing datasets across team members
  • Rolling back to previous data versions
  • Auditing data lineage for compliance
  • Managing multiple dataset variants (train/test splits, feature sets)

Inputs

  • Required: Git repository for metadata tracking
  • Required: DVC installation (pip install dvc)
  • Required: Raw data files or directories to version
  • Optional: Remote storage backend (S3, Azure Blob, GCS, SSH, local)
  • Optional: Data processing scripts for pipeline automation
  • Optional: CI/CD integration for automated pipeline execution

Procedure

Step 1: Initialize DVC in Git Repository

Set up DVC for data versioning alongside code versioning.

# Navigate to project root
cd /path/to/ml-project

# Initialize Git (if not already done)
git init
git add .
git commit -m "Initial commit"

# ... (see EXAMPLES.md for complete implementation)

Configure DVC settings:

# Set analytics opt-out (optional)
dvc config core.analytics false

# Configure autostage (automatically git add .dvc files)
dvc config core.autostage true

# Set default remote name
dvc config core.remote storage

# Commit configuration
git add .dvc/config
git commit -m "Configure DVC settings"

Expected: .dvc/ directory created with config files, .dvcignore file present, DVC files tracked by Git, large data files not in Git staging area.

On failure: Verify Git repository initialized (git status), check DVC installation (dvc version), ensure write permissions in project directory, check for conflicting .dvc/ directory from previous setup, verify Python environment active.

Step 2: Configure Remote Storage Backend

Set up remote storage for data sharing and backup.

# AWS S3
dvc remote add -d storage s3://my-dvc-bucket/ml-project
dvc remote modify storage region us-west-2

# Configure credentials (use IAM roles in production)
dvc remote modify storage access_key_id YOUR_ACCESS_KEY
dvc remote modify storage secret_access_key YOUR_SECRET_KEY

# ... (see EXAMPLES.md for complete implementation)

Test remote connection:

# List remote storage contents
dvc remote list storage

# Test write access
echo "test" > test.txt
dvc add test.txt
dvc push
rm test.txt test.txt.dvc .dvc/cache -rf

# Test read access
dvc pull

# Clean up test
rm test.txt test.txt.dvc
git checkout .

Expected: Remote storage configured and accessible, credentials stored securely in .dvc/config.local (git-ignored), test push/pull succeeds, remote storage shows uploaded cache files.

On failure: Verify cloud credentials (aws s3 ls or equivalent CLI), check bucket/container exists and is accessible, ensure IAM permissions for read/write, verify network connectivity to remote, check firewall rules, test SSH key authentication for SSH remotes, verify storage path has write permissions.

Step 3: Version Datasets with DVC

Add datasets to DVC tracking and push to remote storage.

# Add single file
dvc add data/raw/customers.csv

# Add directory (all files inside)
dvc add data/raw/

# DVC creates .dvc files (metadata)
ls data/raw/
# ... (see EXAMPLES.md for complete implementation)

Version management:

# version_dataset.py
import pandas as pd
import subprocess
from datetime import datetime

def version_dataset(data_path, git_message=None):
    """
    Version dataset with DVC and Git.
# ... (see EXAMPLES.md for complete implementation)

Expected: .dvc metadata files created and committed to Git, original data files git-ignored automatically, dvc push uploads data to remote storage, .dvc/cache contains data hash, remote storage has cached data files.

On failure: Check DVC remote configured (dvc remote list), verify write permissions in data directory, ensure sufficient disk space for cache, check network connectivity for push, verify no special characters in file paths, check for large file warnings from Git.

Step 4: Build Reproducible Data Pipelines

Create DVC pipelines for automated, dependency-tracked data processing.

# dvc.yaml - Pipeline definition
stages:
  download_data:
    cmd: python scripts/download_data.py
    deps:
      - scripts/download_data.py
    outs:
      - data/raw/customers.csv
# ... (see EXAMPLES.md for complete implementation)

Parameters file:

# params.yaml
preprocess:
  feature_engineering: true
  outlier_threshold: 3.0

split:
  test_size: 0.2
  random_state: 42

model:
  algorithm: random_forest
  hyperparameters:
    n_estimators: 100
    max_depth: 10
    min_samples_split: 5

Run pipeline:

# Run entire pipeline
dvc repro

# DVC automatically:
# - Detects which stages need rerun (based on deps/params changes)
# - Executes stages in correct order
# - Caches outputs
# - Tracks metrics
# ... (see EXAMPLES.md for complete implementation)

Expected: DVC pipeline executes in correct dependency order, only changed stages rerun, outputs cached efficiently, metrics tracked automatically, Git commits include dvc.yaml and dvc.lock.

On failure: Check script paths exist and are executable, verify dependencies specified correctly, ensure params.yaml keys match script usage, check for circular dependencies in pipeline, verify output paths writable, inspect script error messages in stderr, check Python environment has required packages.

Step 5: Share and Reproduce Data Versions

Enable team members to reproduce exact data versions.

# Team member clones repository
git clone https://github.com/team/ml-project.git
cd ml-project

# Install DVC
pip install dvc[s3]  # or appropriate backend

# Configure remote (if not in .dvc/config)
# ... (see EXAMPLES.md for complete implementation)

Switch between data versions:

# View data version history
git log --oneline -- data/raw/customers.csv.dvc

# Checkout previous data version
git checkout abc123 -- data/raw/customers.csv.dvc

# Pull that version's data
dvc checkout
# ... (see EXAMPLES.md for complete implementation)

Branching workflow:

# Create experiment branch
git checkout -b experiment/new-features

# Modify data pipeline
vim scripts/preprocess.py

# Add new features
dvc repro preprocess
# ... (see EXAMPLES.md for complete implementation)

Expected: git clone + dvc pull reproduces exact environment, data versions match across team, experiments isolated in branches, metrics comparable across versions.

On failure: Verify remote access configured correctly, check credentials for new team members, ensure all .dvc files committed to Git, verify dvc.lock tracked by Git (pins exact versions), check network bandwidth for large pulls, verify storage backend has all referenced cache files.

Step 6: Integrate with MLflow and CI/CD

Connect DVC data versioning with experiment tracking and automation.

# train_with_mlflow.py
import mlflow
import dvc.api
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score

# Get DVC-tracked data path and version
# ... (see EXAMPLES.md for complete implementation)

GitHub Actions CI/CD:

# .github/workflows/ml-pipeline.yml
name: ML Pipeline
locale: caveman
source_locale: en
source_commit: 82c77053
translator: "Julius Brussee homage — caveman"
translation_date: "2026-04-19"

on:
  push:
    branches: [main]
  pull_request:
    branches: [main]
# ... (see EXAMPLES.md for complete implementation)

Expected: MLflow logs DVC data versions with runs, CI/CD automatically pulls data and runs pipeline, metrics validated before deployment, reproducibility enforced by CI.

On failure: Check secrets configured in GitHub repository settings, verify DVC remote accessible from CI runners, ensure Git credentials configured for push, check Python dependencies installed, verify metrics validation logic, inspect CI logs for DVC/MLflow errors.

Validation

  • DVC initialized in Git repository
  • Remote storage configured and accessible
  • Datasets versioned and pushed to remote
  • .dvc files committed to Git
  • Large data files git-ignored automatically
  • DVC pipeline executes successfully
  • Team members can reproduce data with dvc pull
  • Data versions switchable via Git checkout
  • Metrics tracked across pipeline runs
  • Integration with MLflow working
  • CI/CD pipeline reproduces results

Common Pitfalls

  • Committing large files to Git: Forgot to run dvc add first - always use DVC for large files (>10MB), check .gitignore
  • Missing remote configuration: dvc push fails because no remote - configure remote before sharing, test with dvc remote list
  • Lost data versions: Deleted .dvc/cache without pushing - always dvc push before cleaning cache
  • Inconsistent environments: Different Python/package versions - use virtual environments, pin dependencies in requirements.txt
  • Broken pipelines: Changed script without updating dvc.yaml - keep pipeline definitions in sync with code
  • Slow pipeline: Rerunning unchanged stages - DVC caches by default, check dvc status to diagnose
  • Merge conflicts: .dvc files conflict during merges - resolve like code conflicts, use dvc checkout after resolution
  • Large pull times: Pulling all data for small experiments - use dvc pull <specific.dvc> for selective pulls
  • Credential leaks: Committing .dvc/config.local - keep credentials in config.local (git-ignored), not config
  • No data lineage: Not tracking preprocessing steps - use DVC pipelines to track all transformations

Related Skills

  • track-ml-experiments - Integrate DVC versions with MLflow experiment tracking
  • orchestrate-ml-pipeline - Combine DVC pipelines with Airflow/Prefect orchestration
  • build-feature-store - Version raw data sources for feature engineering
  • serialize-data-formats - Choose efficient formats for DVC-tracked datasets
  • design-serialization-schema - Design schemas for versioned data files

GitHub 저장소

pjt222/agent-almanac
경로: i18n/caveman/skills/version-ml-data
0
agentsagentskillsai-assisted-developmentclaude-codeskillsteams

연관 스킬

content-collections

메타

이 스킬은 콘텐츠 콜렉션(Content Collections)을 위한 프로덕션 검증된 설정을 제공합니다. 콘텐츠 콜렉션은 Markdown/MDX 파일을 Zod 검증이 포함된 타입 안전한 데이터 콜렉션으로 변환해주는 TypeScript 최우선 도구입니다. 블로그, 문서 사이트 또는 콘텐츠 중심의 Vite + React 애플리케이션을 구축할 때 타입 안전성과 자동 콘텐츠 검증을 보장하기 위해 사용하세요. Vite 플러그인 구성과 MDX 컴파일부터 배포 최적화 및 스키마 검증에 이르기까지 모든 것을 다룹니다.

스킬 보기

polymarket

메타

이 스킬은 개발자들이 Polymarket 예측 시장 플랫폼을 활용한 애플리케이션을 구축할 수 있도록 지원하며, 거래 및 시장 데이터를 위한 API 통합 기능을 포함합니다. 또한 WebSocket을 통한 실시간 데이터 스트리밍을 제공하여 실시간 거래와 시장 활동을 모니터링할 수 있습니다. 이를 통해 거래 전략을 구현하거나 실시간 시장 업데이트를 처리하는 도구를 생성하는 데 활용할 수 있습니다.

스킬 보기

creating-opencode-plugins

메타

이 스킬은 개발자들이 명령어, 파일, LSP 작업 등 25개 이상의 이벤트 유형에 연결되는 OpenCode 플러그인을 만들 수 있도록 돕습니다. JavaScript/TypeScript 모듈을 위한 플러그인 구조, 이벤트 API 명세, 구현 패턴을 제공합니다. OpenCode AI 어시스턴트의 라이프사이클을 사용자 정의 이벤트 기반 로직으로 가로채거나, 모니터링하거나, 확장해야 할 때 사용하세요.

스킬 보기

sglang

메타

SGLang은 RadixAttention 프리픽스 캐싱을 활용하여 JSON, 정규식, 에이전트 워크플로우를 위한 고속 구조화 생성에 특화된 고성능 LLM 서빙 프레임워크입니다. 특히 반복되는 프리픽스가 있는 작업에서 상당히 빠른 추론 속도를 제공하여 복잡한 구조화 출력 및 다중 턴 대화에 이상적입니다. 제약 디코딩이 필요하거나 광범위한 프리픽스 공유가 있는 애플리케이션을 구축할 때는 vLLM과 같은 대안보다 SGLang을 선택하십시오.

스킬 보기