version-ml-data

Name: version-ml-data
Author: pjt222

Updated 1 month ago

Category: Meta. Tags: automation, design, data

About

This skill provides version control for machine learning datasets using DVC with remote storage, integrating with Git workflows to track data changes alongside code. It helps build reproducible data pipelines with dependency tracking and maintains data lineage for model reproducibility. Use it when versioning large datasets that don't fit in Git, sharing datasets across teams, or auditing data for regulatory compliance.

Quick install

Claude Code

Primary (recommended)

npx skills add pjt222/agent-almanac -a claude-code

Plugin command (alternative)

/plugin add https://github.com/pjt222/agent-almanac

Git clone (alternative)

git clone https://github.com/pjt222/agent-almanac.git ~/.claude/skills/version-ml-data

Copy and paste this command into Claude Code to install this skill

Documentation

Version ML Data

See Extended Examples for complete configuration files and templates.

Implement data version control for machine learning datasets to ensure reproducibility and track data lineage.

When to use

Versioning large datasets that don't fit in Git
Tracking data changes alongside code changes
Ensuring reproducibility of ML experiments
Building automated data pipelines with dependency tracking
Sharing datasets across team members
Rolling back to previous data versions
Auditing data lineage for compliance
Managing multiple dataset variants (train/test splits, feature sets)

Inputs

Required: Git repository for metadata tracking
Required: DVC installation (pip install dvc)
Required: Raw data files or directories to version
Optional: Remote storage backend (S3, Azure Blob, GCS, SSH, local)
Optional: Data processing scripts for pipeline automation
Optional: CI/CD integration for automated pipeline execution

Procedure

Step 1: Initialize DVC in Git Repository

Set up DVC for data versioning alongside code versioning.

# Navigate to project root
cd /path/to/ml-project

# Initialize Git (if not already done)
git init
git add .
git commit -m "Initial commit"

# ... (see EXAMPLES.md for complete implementation)

Configure DVC settings:

# Set analytics opt-out (optional)
dvc config core.analytics false

# Configure autostage (automatically git add .dvc files)
dvc config core.autostage true

# Set default remote name
dvc config core.remote storage

# Commit configuration
git add .dvc/config
git commit -m "Configure DVC settings"

Expected: .dvc/ directory created with config files, .dvcignore file present, DVC files tracked by Git, large data files not in Git staging area.

On failure: Verify the Git repository is initialized (git status), check the DVC installation (dvc version), ensure write permissions in the project directory, check for a conflicting .dvc/ directory from a previous setup, and verify the Python environment is active.
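The expected state above can be checked with a few file-system tests. The following is a minimal sketch, not part of the skill itself; the function names `check_dvc_initialized` and `missing_checks` are hypothetical, and it only looks for the files named in the expected outcome.

```python
from pathlib import Path


def check_dvc_initialized(project_root):
    """Return a dict of checks mirroring the expected state after `dvc init`."""
    root = Path(project_root)
    return {
        "git_repo": (root / ".git").exists(),
        "dvc_dir": (root / ".dvc").is_dir(),
        "dvc_config": (root / ".dvc" / "config").is_file(),
        "dvcignore": (root / ".dvcignore").is_file(),
    }


def missing_checks(project_root):
    """List the names of the checks that failed, in a stable order."""
    results = check_dvc_initialized(project_root)
    return [name for name, ok in results.items() if not ok]
```

An empty list from `missing_checks("/path/to/ml-project")` suggests the initialization step completed; any names returned point at which item in the failure checklist to look at first.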

Step 2: Configure Remote Storage Backend

Set up remote storage for data sharing and backup.

# AWS S3
dvc remote add -d storage s3://my-dvc-bucket/ml-project
dvc remote modify storage region us-west-2

# Configure credentials (use IAM roles in production)
# --local writes to .dvc/config.local, which is git-ignored
dvc remote modify --local storage access_key_id YOUR_ACCESS_KEY
dvc remote modify --local storage secret_access_key YOUR_SECRET_KEY

# ... (see EXAMPLES.md for complete implementation)

Test remote connection:

# List configured remotes
dvc remote list

# Test write access
echo "test" > test.txt
dvc add test.txt
dvc push
# Remove the local copy and cache, but keep test.txt.dvc so the pull has a target
rm -rf test.txt .dvc/cache

# Test read access
dvc pull

# Clean up test
rm test.txt test.txt.dvc
git checkout .

Expected: Remote storage configured and accessible, credentials stored securely in .dvc/config.local (git-ignored), test push/pull succeeds, remote storage shows uploaded cache files.

On failure: Verify cloud credentials (aws s3 ls or equivalent CLI), check the bucket/container exists and is accessible, ensure IAM permissions for read/write, verify network connectivity to the remote, check firewall rules, test SSH key authentication for SSH remotes, and verify the storage path has write permissions.
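Because credentials belong in .dvc/config.local rather than the committed .dvc/config, a small scan of the shared config can catch mistakes before a commit. This is a rough sketch under stated assumptions: the function name `find_committed_secrets` is hypothetical, the set of option names treated as secrets is illustrative, and the file is scanned line by line rather than parsed with DVC's own config loader.

```python
from pathlib import Path

# Option names treated as secrets here; this set is an assumption, extend as needed.
SENSITIVE_KEYS = {"access_key_id", "secret_access_key", "password"}


def find_committed_secrets(config_path):
    """Return (line_number, key) pairs for sensitive options found in a config file."""
    hits = []
    path = Path(config_path)
    if not path.is_file():
        return hits
    for number, line in enumerate(path.read_text().splitlines(), start=1):
        stripped = line.strip()
        # Skip blanks, comments, section headers, and lines without an assignment.
        if not stripped or stripped.startswith(("#", ";", "[")) or "=" not in stripped:
            continue
        key = stripped.split("=", 1)[0].strip()
        if key in SENSITIVE_KEYS:
            hits.append((number, key))
    return hits
```

Running `find_committed_secrets(".dvc/config")` in a pre-commit hook and failing on a non-empty result is one possible way to use it.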

Step 3: Version Datasets with DVC

Add datasets to DVC tracking and push them to remote storage.

# Add single file
dvc add data/raw/customers.csv

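# Or add a directory (all files inside); DVC rejects overlapping outputs,
# so track either the file above or its parent directory, not both
dvc add data/raw/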

# DVC creates .dvc files (metadata)
ls data/raw/
# ... (see EXAMPLES.md for complete implementation)

Version management:

# version_dataset.py
import pandas as pd
import subprocess
from datetime import datetime

def version_dataset(data_path, git_message=None):
    """Version dataset with DVC and Git."""
    # ... (see EXAMPLES.md for complete implementation)

Expected: .dvc metadata files created and committed to Git, original data files git-ignored automatically, dvc push uploads data to remote storage, .dvc/cache contains the data stored by content hash, remote storage has the cached data files.

On failure: Check that a DVC remote is configured (dvc remote list), verify write permissions in the data directory, ensure sufficient disk space for the cache, check network connectivity for push, verify there are no special characters in file paths, and check for large file warnings from Git.
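The .dvc metadata files identify data by a content hash (MD5 by default), which is why a small text file in Git can stand in for a large dataset. The sketch below illustrates the idea of content hashing for a single file; `file_md5` and `has_changed` are hypothetical helpers, and the result is not guaranteed to match what DVC records in every version or for directories.

```python
import hashlib


def file_md5(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so large datasets do not need to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def has_changed(path, recorded_md5):
    """Compare the current content hash with a previously recorded one."""
    return file_md5(path) != recorded_md5
```

Any change to the bytes of the file produces a different hash, so comparing against a recorded value is enough to tell that a new data version exists.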

Step 4: Create Reproducible Data Pipelines

Create DVC pipelines for automated, dependency-tracked data processing.

# dvc.yaml - Pipeline definition
stages:
  download_data:
    cmd: python scripts/download_data.py
    deps:
      - scripts/download_data.py
    outs:
      - data/raw/customers.csv
# ... (see EXAMPLES.md for complete implementation)

Parameters file:

# params.yaml
preprocess:
  feature_engineering: true
  outlier_threshold: 3.0

split:
  test_size: 0.2
  random_state: 42

model:
  algorithm: random_forest
  hyperparameters:
    n_estimators: 100
    max_depth: 10
    min_samples_split: 5

Run the pipeline:

# Run entire pipeline
dvc repro

# DVC automatically:
# - Detects which stages need rerun (based on deps/params changes)
# - Executes stages in correct order
# - Caches outputs
# - Tracks metrics
# ... (see EXAMPLES.md for complete implementation)

Expected: DVC pipeline executes in correct dependency order, only changed stages rerun, outputs cached efficiently, metrics tracked automatically, Git commits include dvc.yaml and dvc.lock.

On failure: Check that script paths exist and are executable, verify dependencies are specified correctly, ensure params.yaml keys match script usage, check for circular dependencies in the pipeline, verify output paths are writable, inspect script error messages in stderr, and check that the Python environment has the required packages.
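Dependency order and circular dependencies both follow from treating stages as a graph: a stage depends on whichever stage produces one of its deps. The sketch below shows that idea with Python's standard `graphlib`; it is an illustration, not DVC's implementation, and `stage_order` plus the stage names in the usage note are hypothetical.

```python
from graphlib import CycleError, TopologicalSorter


def stage_order(stages):
    """Order stages so each runs after the stages that produce its deps.

    `stages` maps a stage name to {"deps": [...], "outs": [...]}.
    Raises ValueError if the stages depend on each other in a cycle.
    """
    # Map every output path to the stage that produces it.
    producer = {
        out: name for name, spec in stages.items() for out in spec.get("outs", [])
    }
    sorter = TopologicalSorter()
    for name, spec in stages.items():
        # Deps that no stage produces (scripts, raw inputs) add no edges.
        upstream = {producer[dep] for dep in spec.get("deps", []) if dep in producer}
        sorter.add(name, *sorted(upstream))
    try:
        return list(sorter.static_order())
    except CycleError as error:
        raise ValueError(f"circular dependency between stages: {error.args[1]}") from error
```

For a chain such as download_data, then a preprocess stage that reads data/raw/customers.csv, then a training stage that reads the preprocessed file, the function returns the stages in that order regardless of how they are listed in the input.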

Step 5: Share and Reproduce Data Versions

Enable team members to reproduce exact data versions.

# Team member clones repository
git clone https://github.com/team/ml-project.git
cd ml-project

# Install DVC
pip install "dvc[s3]"  # or appropriate backend; quotes keep shells like zsh from expanding the brackets

# Configure remote (if not in .dvc/config)
# ... (see EXAMPLES.md for complete implementation)

Switch between data versions:

# View data version history
git log --oneline -- data/raw/customers.csv.dvc

# Checkout previous data version
git checkout abc123 -- data/raw/customers.csv.dvc

# Restore that version's data from the local cache
# (run dvc pull first if it is not cached locally)
dvc checkout
# ... (see EXAMPLES.md for complete implementation)

Branching workflow:

# Create experiment branch
git checkout -b experiment/new-features

# Modify data pipeline
vim scripts/preprocess.py

# Add new features
dvc repro preprocess
# ... (see EXAMPLES.md for complete implementation)

Expected: git clone + dvc pull reproduces the exact environment, data versions match across the team, experiments are isolated in branches, metrics are comparable across versions.

On failure: Verify remote access is configured correctly, check credentials for new team members, ensure all .dvc files are committed to Git, verify dvc.lock is tracked by Git (it pins exact versions), check network bandwidth for large pulls, and verify the storage backend has all referenced cache files.
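Switching data versions is a two-step operation: Git restores the .dvc metadata file at a revision, then DVC restores the data that file points to. A minimal sketch of wrapping those two commands follows; `checkout_data_version` is a hypothetical helper, and the `run` parameter exists only so the commands can be inspected or replaced in a dry run.

```python
import subprocess


def checkout_data_version(revision, dvc_file, run=subprocess.run):
    """Restore the data tracked by `dvc_file` as it was at Git `revision`."""
    commands = [
        # Step 1: bring back the metadata file from the chosen revision.
        ["git", "checkout", revision, "--", dvc_file],
        # Step 2: make the workspace data match that metadata file.
        ["dvc", "checkout", dvc_file],
    ]
    for command in commands:
        run(command, check=True)
    return commands
```

With the default `run`, a failing command raises `subprocess.CalledProcessError`, so the DVC step is not attempted if the Git step fails.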

Step 6: Integrate with MLflow and CI/CD

Connect DVC data versioning with experiment tracking and automation.

# train_with_mlflow.py
import mlflow
import dvc.api
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score

# Get DVC-tracked data path and version
# ... (see EXAMPLES.md for complete implementation)

GitHub Actions CI/CD:

# .github/workflows/ml-pipeline.yml
name: ML Pipeline

on:
  push:
    branches: [main]
  pull_request:
    branches: [main]
# ... (see EXAMPLES.md for complete implementation)

Expected: MLflow logs DVC data versions with runs, CI/CD automatically pulls data and runs the pipeline, metrics are validated before deployment, reproducibility is enforced by CI.

On failure: Check that secrets are configured in the GitHub repository settings, verify the DVC remote is accessible from CI runners, ensure Git credentials are configured for push, check that Python dependencies are installed, verify the metrics validation logic, and inspect CI logs for DVC/MLflow errors.
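Validating metrics before deployment can be as simple as comparing a metrics file against minimum thresholds and failing the CI job when any check fails. The following sketch assumes the pipeline writes a flat JSON file of metric names to numbers; the function name `validate_metrics`, the file layout, and the thresholds are all assumptions for illustration.

```python
import json


def validate_metrics(metrics_path, minimums):
    """Return a list of failure messages; an empty list means the gate passes."""
    with open(metrics_path, encoding="utf-8") as handle:
        metrics = json.load(handle)
    failures = []
    for name, minimum in minimums.items():
        value = metrics.get(name)
        if value is None:
            failures.append(f"{name}: missing")
        elif value < minimum:
            failures.append(f"{name}: {value} < {minimum}")
    return failures
```

In a CI step, a script could call this function and exit with a non-zero status when the returned list is not empty, which blocks the deployment job.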

Validation

Common Pitfalls

Committing large files to Git: Forgot to run dvc add first - always use DVC for large files (>10MB), check .gitignore
Missing remote configuration: dvc push fails because no remote is set - configure a remote before sharing, test with dvc remote list
Lost data versions: Deleted .dvc/cache without pushing - always dvc push before cleaning the cache
Inconsistent environments: Different Python/package versions - use virtual environments, pin dependencies in requirements.txt
Broken pipelines: Changed a script without updating dvc.yaml - keep pipeline definitions in sync with code
Slow pipeline: Rerunning unchanged stages - DVC caches by default, check dvc status to diagnose
Merge conflicts: .dvc files conflict during merges - resolve like code conflicts, use dvc checkout after resolution
Large pull times: Pulling all data for small experiments - use dvc pull <specific.dvc> for selective pulls
Credential leaks: Committing credentials in .dvc/config - keep credentials in config.local (git-ignored), not config
No data lineage: Not tracking preprocessing steps - use DVC pipelines to track all transformations
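The first pitfall, large files slipping into Git, can be caught by scanning the working tree before committing. This is a rough sketch: `find_large_files` is a hypothetical helper, the 10 MB default mirrors the threshold mentioned above, and it does not consult .gitignore, so files that are already ignored may still be listed.

```python
import os


def find_large_files(root, threshold_bytes=10 * 1024 * 1024, skip_dirs=(".git", ".dvc")):
    """Return sorted relative paths of files larger than `threshold_bytes`."""
    large = []
    for current, dirs, files in os.walk(root):
        # Prune internal directories in place so os.walk does not descend into them.
        dirs[:] = [d for d in dirs if d not in skip_dirs]
        for name in files:
            path = os.path.join(current, name)
            if os.path.getsize(path) > threshold_bytes:
                large.append(os.path.relpath(path, root))
    return sorted(large)
```

Any path it reports is a candidate for dvc add rather than git add.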

GitHub repository

pjt222/agent-almanac

Path: i18n/de/skills/version-ml-data

Topics: agents, agentskills, ai-assisted-development, claude-code, skills, teams

FAQ

Frequently asked questions

What is the version-ml-data skill?

version-ml-data is a Claude Skill by pjt222. Skills package instructions and resources that Claude loads on demand, so Claude can perform version-ml-data-related tasks without extra prompting.

How do I install version-ml-data?

Use the install commands on this page: add version-ml-data to Claude Code as a plugin, or clone its repository into your skills directory, then restart Claude so it picks up the skill.

What category does version-ml-data belong to?

version-ml-data is in the Meta category, tagged automation, design and data.

Is version-ml-data free to use?

Yes. version-ml-data is listed on AIMCP and free to install. It runs inside Claude, so no separate service account is required to use the skill itself.
