version-ml-data
Über
Diese Fähigkeit ermöglicht Versionskontrolle für maschinelle Lern-Datensätze mittels DVC mit Remote-Speicher und integriert sich in Git-Workflows, um Datenänderungen parallel zum Code nachzuverfolgen. Sie erstellt reproduzierbare Datenpipelines mit Abhängigkeitsverfolgung und sichert die Datenherkunft für die Reproduzierbarkeit von Modellen. Nutzen Sie sie bei der Verarbeitung großer Datensätze, die nicht in Git passen, der gemeinsamen Datennutzung über Teams hinweg oder für Compliance-Audits.
Schnellinstallation
Claude Code
Empfohlennpx skills add pjt222/agent-almanac -a claude-code/plugin add https://github.com/pjt222/agent-almanacgit clone https://github.com/pjt222/agent-almanac.git ~/.claude/skills/version-ml-dataKopieren Sie diesen Befehl und fügen Sie ihn in Claude Code ein, um diese Fähigkeit zu installieren
Dokumentation
Version ML Data
See Extended Examples for complete configuration files and templates.
Implement data version control for machine learning datasets to ensure reproducibility and track data lineage.
When to Use
- Versioning large datasets that don't fit in Git
- Tracking data changes alongside code changes
- Ensuring reproducibility of ML experiments
- Building automated data pipelines with dependency tracking
- Sharing datasets across team members
- Rolling back to previous data versions
- Auditing data lineage for compliance
- Managing multiple dataset variants (train/test splits, feature sets)
Inputs
- Required: Git repository for metadata tracking
- Required: DVC installation (
pip install dvc) - Required: Raw data files or directories to version
- Optional: Remote storage backend (S3, Azure Blob, GCS, SSH, local)
- Optional: Data processing scripts for pipeline automation
- Optional: CI/CD integration for automated pipeline execution
Procedure
Step 1: Initialize DVC in Git Repository
Set up DVC for data versioning alongside code versioning.
# Navigate to project root
cd /path/to/ml-project
# Initialize Git (if not already done)
git init
git add .
git commit -m "Initial commit"
# ... (see EXAMPLES.md for complete implementation)
Configure DVC settings:
# Set analytics opt-out (optional)
dvc config core.analytics false
# Configure autostage (automatically git add .dvc files)
dvc config core.autostage true
# Set default remote name
dvc config core.remote storage
# Commit configuration
git add .dvc/config
git commit -m "Configure DVC settings"
Expected: .dvc/ directory created with config files, .dvcignore file present, DVC files tracked by Git, large data files not in Git staging area.
On failure: Verify Git repository initialized (git status), check DVC installation (dvc version), ensure write permissions in project directory, check for conflicting .dvc/ directory from previous setup, verify Python environment active.
Step 2: Configure Remote Storage Backend
Set up remote storage for data sharing and backup.
# AWS S3
dvc remote add -d storage s3://my-dvc-bucket/ml-project
dvc remote modify storage region us-west-2
# Configure credentials (use IAM roles in production)
dvc remote modify storage access_key_id YOUR_ACCESS_KEY
dvc remote modify storage secret_access_key YOUR_SECRET_KEY
# ... (see EXAMPLES.md for complete implementation)
Test remote connection:
# List remote storage contents
dvc remote list storage
# Test write access
echo "test" > test.txt
dvc add test.txt
dvc push
rm test.txt test.txt.dvc .dvc/cache -rf
# Test read access
dvc pull
# Clean up test
rm test.txt test.txt.dvc
git checkout .
Expected: Remote storage configured and accessible, credentials stored securely in .dvc/config.local (git-ignored), test push/pull succeeds, remote storage shows uploaded cache files.
On failure: Verify cloud credentials (aws s3 ls or equivalent CLI), check bucket/container exists and is accessible, ensure IAM permissions for read/write, verify network connectivity to remote, check firewall rules, test SSH key authentication for SSH remotes, verify storage path has write permissions.
Step 3: Version Datasets with DVC
Add datasets to DVC tracking and push to remote storage.
# Add single file
dvc add data/raw/customers.csv
# Add directory (all files inside)
dvc add data/raw/
# DVC creates .dvc files (metadata)
ls data/raw/
# ... (see EXAMPLES.md for complete implementation)
Version management:
# version_dataset.py
import pandas as pd
import subprocess
from datetime import datetime
def version_dataset(data_path, git_message=None):
"""
Version dataset with DVC and Git.
# ... (see EXAMPLES.md for complete implementation)
Expected: .dvc metadata files created and committed to Git, original data files git-ignored automatically, dvc push uploads data to remote storage, .dvc/cache contains data hash, remote storage has cached data files.
On failure: Check DVC remote configured (dvc remote list), verify write permissions in data directory, ensure sufficient disk space for cache, check network connectivity for push, verify no special characters in file paths, check for large file warnings from Git.
Step 4: Build Reproducible Data Pipelines
Create DVC pipelines for automated, dependency-tracked data processing.
# dvc.yaml - Pipeline definition
stages:
download_data:
cmd: python scripts/download_data.py
deps:
- scripts/download_data.py
outs:
- data/raw/customers.csv
# ... (see EXAMPLES.md for complete implementation)
Parameters file:
# params.yaml
preprocess:
feature_engineering: true
outlier_threshold: 3.0
split:
test_size: 0.2
random_state: 42
model:
algorithm: random_forest
hyperparameters:
n_estimators: 100
max_depth: 10
min_samples_split: 5
Run pipeline:
# Run entire pipeline
dvc repro
# DVC automatically:
# - Detects which stages need rerun (based on deps/params changes)
# - Executes stages in correct order
# - Caches outputs
# - Tracks metrics
# ... (see EXAMPLES.md for complete implementation)
Expected: DVC pipeline executes in correct dependency order, only changed stages rerun, outputs cached efficiently, metrics tracked automatically, Git commits include dvc.yaml and dvc.lock.
On failure: Check script paths exist and are executable, verify dependencies specified correctly, ensure params.yaml keys match script usage, check for circular dependencies in pipeline, verify output paths writable, inspect script error messages in stderr, check Python environment has required packages.
Step 5: Share and Reproduce Data Versions
Enable team members to reproduce exact data versions.
# Team member clones repository
git clone https://github.com/team/ml-project.git
cd ml-project
# Install DVC
pip install dvc[s3] # or appropriate backend
# Configure remote (if not in .dvc/config)
# ... (see EXAMPLES.md for complete implementation)
Switch between data versions:
# View data version history
git log --oneline -- data/raw/customers.csv.dvc
# Checkout previous data version
git checkout abc123 -- data/raw/customers.csv.dvc
# Pull that version's data
dvc checkout
# ... (see EXAMPLES.md for complete implementation)
Branching workflow:
# Create experiment branch
git checkout -b experiment/new-features
# Modify data pipeline
vim scripts/preprocess.py
# Add new features
dvc repro preprocess
# ... (see EXAMPLES.md for complete implementation)
Expected: git clone + dvc pull reproduces exact environment, data versions match across team, experiments isolated in branches, metrics comparable across versions.
On failure: Verify remote access configured correctly, check credentials for new team members, ensure all .dvc files committed to Git, verify dvc.lock tracked by Git (pins exact versions), check network bandwidth for large pulls, verify storage backend has all referenced cache files.
Step 6: Integrate with MLflow and CI/CD
Connect DVC data versioning with experiment tracking and automation.
# train_with_mlflow.py
import mlflow
import dvc.api
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score
# Get DVC-tracked data path and version
# ... (see EXAMPLES.md for complete implementation)
GitHub Actions CI/CD:
# .github/workflows/ml-pipeline.yml
name: ML Pipeline
locale: caveman-ultra
source_locale: en
source_commit: 82c77053
translator: "Julius Brussee homage — caveman"
translation_date: "2026-04-19"
on:
push:
branches: [main]
pull_request:
branches: [main]
# ... (see EXAMPLES.md for complete implementation)
Expected: MLflow logs DVC data versions with runs, CI/CD automatically pulls data and runs pipeline, metrics validated before deployment, reproducibility enforced by CI.
On failure: Check secrets configured in GitHub repository settings, verify DVC remote accessible from CI runners, ensure Git credentials configured for push, check Python dependencies installed, verify metrics validation logic, inspect CI logs for DVC/MLflow errors.
Validation
- DVC initialized in Git repository
- Remote storage configured and accessible
- Datasets versioned and pushed to remote
-
.dvcfiles committed to Git - Large data files git-ignored automatically
- DVC pipeline executes successfully
- Team members can reproduce data with
dvc pull - Data versions switchable via Git checkout
- Metrics tracked across pipeline runs
- Integration with MLflow working
- CI/CD pipeline reproduces results
Common Pitfalls
- Committing large files to Git: Forgot to run
dvc addfirst - always use DVC for large files (>10MB), check.gitignore - Missing remote configuration:
dvc pushfails because no remote - configure remote before sharing, test withdvc remote list - Lost data versions: Deleted
.dvc/cachewithout pushing - alwaysdvc pushbefore cleaning cache - Inconsistent environments: Different Python/package versions - use virtual environments, pin dependencies in
requirements.txt - Broken pipelines: Changed script without updating
dvc.yaml- keep pipeline definitions in sync with code - Slow pipeline: Rerunning unchanged stages - DVC caches by default, check
dvc statusto diagnose - Merge conflicts:
.dvcfiles conflict during merges - resolve like code conflicts, usedvc checkoutafter resolution - Large pull times: Pulling all data for small experiments - use
dvc pull <specific.dvc>for selective pulls - Credential leaks: Committing
.dvc/config.local- keep credentials inconfig.local(git-ignored), notconfig - No data lineage: Not tracking preprocessing steps - use DVC pipelines to track all transformations
Related Skills
track-ml-experiments- Integrate DVC versions with MLflow experiment trackingorchestrate-ml-pipeline- Combine DVC pipelines with Airflow/Prefect orchestrationbuild-feature-store- Version raw data sources for feature engineeringserialize-data-formats- Choose efficient formats for DVC-tracked datasetsdesign-serialization-schema- Design schemas for versioned data files
GitHub Repository
Verwandte Skills
content-collections
MetaDiese Skill bietet eine produktionsgetestete Einrichtung für Content Collections – ein TypeScript-first-Tool, das Markdown/MDX-Dateien in typsichere Datensammlungen mit Zod-Validierung umwandelt. Verwenden Sie ihn beim Erstellen von Blogs, Dokumentationsseiten oder inhaltsstarken Vite + React-Anwendungen, um Typsicherheit und automatische Inhaltsvalidierung zu gewährleisten. Er behandelt alles von der Vite-Plugin-Konfiguration und MDX-Kompilierung bis hin zur Deployment-Optimierung und Schema-Validierung.
polymarket
MetaDiese Fähigkeit ermöglicht es Entwicklern, Anwendungen mit der Polymarket-Prognosemärkte-Plattform zu erstellen, einschließlich API-Integration für Handel und Marktdaten. Sie bietet außerdem Echtzeit-Datenstreaming über WebSocket, um Live-Trades und Marktaktivitäten zu überwachen. Nutzen Sie sie zur Implementierung von Handelsstrategien oder zur Erstellung von Tools, die Live-Marktaktualisierungen verarbeiten.
creating-opencode-plugins
MetaDiese Fähigkeit unterstützt Entwickler dabei, OpenCode-Plugins zu erstellen, die in über 25 Ereignistypen wie Befehle, Dateien und LSP-Operationen eingreifen. Sie bietet die Plugin-Struktur, Event-API-Spezifikationen und Implementierungsmuster für JavaScript/TypeScript-Module. Nutzen Sie sie, wenn Sie den Lebenszyklus des OpenCode KI-Assistenten mit benutzerdefinierter ereignisgesteuerter Logik abfangen, überwachen oder erweitern müssen.
sglang
MetaSGLang ist ein hochperformantes LLM-Serving-Framework, das sich auf schnelle, strukturierte Generierung für JSON, Regex und agentenbasierte Workflows unter Verwendung seines RadixAttention-Prefix-Cachings spezialisiert. Es bietet deutlich schnellere Inferenz, insbesondere für Aufgaben mit wiederholten Präfixen, was es ideal für komplexe, strukturierte Ausgaben und Mehrfachdialoge macht. Wählen Sie SGLang gegenüber Alternativen wie vLLM, wenn Sie constrained decoding benötigen oder Anwendungen mit umfangreicher Präfix-Weitergabe entwickeln.
