MCP HubMCP Hub
スキル一覧に戻る

ray-data

zechenzhangAGI
更新日 Today
87 閲覧
62
2
62
GitHubで表示
その他aidata

について

Ray Dataは、MLワークロード向けにスケーラブルな分散データ処理を提供し、CPU/GPUにまたがるストリーミング実行と、Parquet、CSV、JSON、画像などの複数のファイル形式をサポートします。Ray Train、PyTorch、TensorFlowなどのMLフレームワークとシームレスに統合します。大規模なデータ前処理、バッチ推論、マルチモーダルなデータローディング、あるいは単一マシンから数百ノードにスケールする必要がある分散ETLパイプラインにご利用ください。

クイックインストール

Claude Code

推奨
プラグインコマンド推奨
/plugin add https://github.com/zechenzhangAGI/AI-research-SKILLs
Git クローン代替
git clone https://github.com/zechenzhangAGI/AI-research-SKILLs.git ~/.claude/skills/ray-data

このコマンドをClaude Codeにコピー&ペーストしてスキルをインストールします

ドキュメント

Ray Data - Scalable ML Data Processing

Distributed data processing library for ML and AI workloads.

When to use Ray Data

Use Ray Data when:

  • Processing large datasets (>100GB) for ML training
  • Need distributed data preprocessing across cluster
  • Building batch inference pipelines
  • Loading multi-modal data (images, audio, video)
  • Scaling data processing from laptop to cluster

Key features:

  • Streaming execution: Process data larger than memory
  • GPU support: Accelerate transforms with GPUs
  • Framework integration: PyTorch, TensorFlow, HuggingFace
  • Multi-modal: Images, Parquet, CSV, JSON, audio, video

Use alternatives instead:

  • Pandas: Small data (<1GB) on single machine
  • Dask: Tabular data, SQL-like operations
  • Spark: Enterprise ETL, SQL queries

Quick start

Installation

pip install -U 'ray[data]'

Load and transform data

import ray

# Read Parquet files
ds = ray.data.read_parquet("s3://bucket/data/*.parquet")

# Transform data (lazy execution)
ds = ds.map_batches(lambda batch: {"processed": batch["text"].str.lower()})

# Consume data
for batch in ds.iter_batches(batch_size=100):
    print(batch)

Integration with Ray Train

import ray
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer

# Create dataset
train_ds = ray.data.read_parquet("s3://bucket/train/*.parquet")

def train_func(config):
    # Access dataset in training
    train_ds = ray.train.get_dataset_shard("train")

    for epoch in range(10):
        for batch in train_ds.iter_batches(batch_size=32):
            # Train on batch
            pass

# Train with Ray
trainer = TorchTrainer(
    train_func,
    datasets={"train": train_ds},
    scaling_config=ScalingConfig(num_workers=4, use_gpu=True)
)
trainer.fit()

Reading data

From cloud storage

import ray

# Parquet (recommended for ML)
ds = ray.data.read_parquet("s3://bucket/data/*.parquet")

# CSV
ds = ray.data.read_csv("s3://bucket/data/*.csv")

# JSON
ds = ray.data.read_json("gs://bucket/data/*.json")

# Images
ds = ray.data.read_images("s3://bucket/images/")

From Python objects

# From list
ds = ray.data.from_items([{"id": i, "value": i * 2} for i in range(1000)])

# From range
ds = ray.data.range(1000000)  # Synthetic data

# From pandas
import pandas as pd
df = pd.DataFrame({"col1": [1, 2, 3], "col2": [4, 5, 6]})
ds = ray.data.from_pandas(df)

Transformations

Map batches (vectorized)

# Batch transformation (fast)
def process_batch(batch):
    batch["doubled"] = batch["value"] * 2
    return batch

ds = ds.map_batches(process_batch, batch_size=1000)

Row transformations

# Row-by-row (slower)
def process_row(row):
    row["squared"] = row["value"] ** 2
    return row

ds = ds.map(process_row)

Filter

# Filter rows
ds = ds.filter(lambda row: row["value"] > 100)

Group by and aggregate

# Group by column
ds = ds.groupby("category").count()

# Custom aggregation
ds = ds.groupby("category").map_groups(lambda group: {"sum": group["value"].sum()})

GPU-accelerated transforms

# Use GPU for preprocessing
def preprocess_images_gpu(batch):
    import torch
    images = torch.tensor(batch["image"]).cuda()
    # GPU preprocessing
    processed = images * 255
    return {"processed": processed.cpu().numpy()}

ds = ds.map_batches(
    preprocess_images_gpu,
    batch_size=64,
    num_gpus=1  # Request GPU
)

Writing data

# Write to Parquet
ds.write_parquet("s3://bucket/output/")

# Write to CSV
ds.write_csv("output/")

# Write to JSON
ds.write_json("output/")

Performance optimization

Repartition

# Control parallelism
ds = ds.repartition(100)  # 100 blocks for 100-core cluster

Batch size tuning

# Larger batches = faster vectorized ops
ds.map_batches(process_fn, batch_size=10000)  # vs batch_size=100

Streaming execution

# Process data larger than memory
ds = ray.data.read_parquet("s3://huge-dataset/")
for batch in ds.iter_batches(batch_size=1000):
    process(batch)  # Streamed, not loaded to memory

Common patterns

Batch inference

import ray

# Load model
def load_model():
    # Load once per worker
    return MyModel()

# Inference function
class BatchInference:
    def __init__(self):
        self.model = load_model()

    def __call__(self, batch):
        predictions = self.model(batch["input"])
        return {"prediction": predictions}

# Run distributed inference
ds = ray.data.read_parquet("s3://data/")
predictions = ds.map_batches(BatchInference, batch_size=32, num_gpus=1)
predictions.write_parquet("s3://output/")

Data preprocessing pipeline

# Multi-step pipeline
ds = (
    ray.data.read_parquet("s3://raw/")
    .map_batches(clean_data)
    .map_batches(tokenize)
    .map_batches(augment)
    .write_parquet("s3://processed/")
)

Integration with ML frameworks

PyTorch

# Convert to PyTorch
torch_ds = ds.to_torch(label_column="label", batch_size=32)

for batch in torch_ds:
    # batch is dict with tensors
    inputs, labels = batch["features"], batch["label"]

TensorFlow

# Convert to TensorFlow
tf_ds = ds.to_tf(feature_columns=["image"], label_column="label", batch_size=32)

for features, labels in tf_ds:
    # Train model
    pass

Supported data formats

FormatReadWriteUse Case
ParquetML data (recommended)
CSVTabular data
JSONSemi-structured
ImagesComputer vision
NumPyArrays
PandasDataFrames

Performance benchmarks

Scaling (processing 100GB data):

  • 1 node (16 cores): ~30 minutes
  • 4 nodes (64 cores): ~8 minutes
  • 16 nodes (256 cores): ~2 minutes

GPU acceleration (image preprocessing):

  • CPU only: 1,000 images/sec
  • 1 GPU: 5,000 images/sec
  • 4 GPUs: 18,000 images/sec

Use cases

Production deployments:

  • Pinterest: Last-mile data processing for model training
  • ByteDance: Scaling offline inference with multi-modal LLMs
  • Spotify: ML platform for batch inference

References

Resources

GitHub リポジトリ

zechenzhangAGI/AI-research-SKILLs
パス: 05-data-processing/ray-data
aiai-researchclaudeclaude-codeclaude-skillscodex

関連スキル

content-collections

メタ

This skill provides a production-tested setup for Content Collections, a TypeScript-first tool that transforms Markdown/MDX files into type-safe data collections with Zod validation. Use it when building blogs, documentation sites, or content-heavy Vite + React applications to ensure type safety and automatic content validation. It covers everything from Vite plugin configuration and MDX compilation to deployment optimization and schema validation.

スキルを見る

evaluating-llms-harness

テスト

This Claude Skill runs the lm-evaluation-harness to benchmark LLMs across 60+ standardized academic tasks like MMLU and GSM8K. It's designed for developers to compare model quality, track training progress, or report academic results. The tool supports various backends including HuggingFace and vLLM models.

スキルを見る

sglang

メタ

SGLang is a high-performance LLM serving framework that specializes in fast, structured generation for JSON, regex, and agentic workflows using its RadixAttention prefix caching. It delivers significantly faster inference, especially for tasks with repeated prefixes, making it ideal for complex, structured outputs and multi-turn conversations. Choose SGLang over alternatives like vLLM when you need constrained decoding or are building applications with extensive prefix sharing.

スキルを見る

polymarket

メタ

This skill enables developers to build applications with the Polymarket prediction markets platform, including API integration for trading and market data. It also provides real-time data streaming via WebSocket to monitor live trades and market activity. Use it for implementing trading strategies or creating tools that process live market updates.

スキルを見る