vaex
について
vaexスキルは、遅延評価を備えたアウトオブコアDataFrameを通じて、RAM容量を超える大規模な表形式データセット(数十億行)の処理を可能にします。大容量のCSV/HDF5/Arrow/Parquetファイルに対し、高速な集計処理、ビッグデータの可視化、機械学習パイプラインを提供します。メモリに収まらない大規模データセットを分析する必要がある場合や、巨大ファイルに対して効率的な統計処理を実行する場合にご利用ください。
クイックインストール
Claude Code
推奨npx skills add K-Dense-AI/claude-scientific-skills -a claude-code/plugin add https://github.com/K-Dense-AI/claude-scientific-skillsgit clone https://github.com/K-Dense-AI/claude-scientific-skills.git ~/.claude/skills/vaexこのコマンドをClaude Codeにコピー&ペーストしてスキルをインストールします
ドキュメント
Vaex
Overview
Vaex is a high-performance Python library designed for lazy, out-of-core DataFrames to process and visualize tabular datasets that are too large to fit into RAM. Vaex can process over a billion rows per second, enabling interactive data exploration and analysis on datasets with billions of rows.
Installation
Install the full meta-package (recommended):
uv pip install vaex
Minimal install (pick only what you need):
uv pip install vaex-core vaex-viz vaex-hdf5 vaex-ml
The vaex package is a meta-package that pulls in vaex-core, vaex-viz, vaex-hdf5, vaex-ml, and other sub-packages. Arrow support is built into vaex-core (the separate vaex-arrow package is deprecated). vaex-distributed is deprecated in favor of vaex-enterprise.
Version notes (vaex 4.19.0+): Python 3.12 and NumPy v2 require vaex >= 4.19.0. On Windows, you may need Python dev headers to build the annoy dependency.
When to Use This Skill
Use Vaex when:
- Processing tabular datasets larger than available RAM (gigabytes to terabytes)
- Performing fast statistical aggregations on massive datasets
- Creating visualizations and heatmaps of large datasets
- Building machine learning pipelines on big data
- Converting between data formats (CSV, HDF5, Arrow, Parquet)
- Needing lazy evaluation and virtual columns to avoid memory overhead
- Working with astronomical data, financial time series, or other large-scale scientific datasets
Vaex vs alternatives: Use polars when data fits in RAM and you need maximum in-memory speed. Use dask when you need distributed pandas/NumPy across a cluster. Use vaex for single-machine, out-of-core analytics on tabular data that exceeds RAM via memory-mapped HDF5/Arrow files.
Core Capabilities
Vaex provides six primary capability areas, each documented in detail in the references directory:
1. DataFrames and Data Loading
Load and create Vaex DataFrames from various sources including files (HDF5, CSV, Arrow, Parquet), pandas DataFrames, NumPy arrays, and dictionaries. Reference references/core_dataframes.md for:
- Opening large files efficiently
- Converting from pandas/NumPy/Arrow
- Working with example datasets
- Understanding DataFrame structure
2. Data Processing and Manipulation
Perform filtering, create virtual columns, use expressions, and aggregate data without loading everything into memory. Reference references/data_processing.md for:
- Filtering and selections
- Virtual columns and expressions
- Groupby operations and aggregations
- String operations and datetime handling
- Working with missing data
3. Performance and Optimization
Leverage Vaex's lazy evaluation, caching strategies, and memory-efficient operations. Reference references/performance.md for:
- Understanding lazy evaluation
- Using
delay=Truefor batching operations - Materializing columns when needed
- Caching strategies
- Asynchronous operations
4. Data Visualization
Create interactive visualizations of large datasets including heatmaps, histograms, and scatter plots. Reference references/visualization.md for:
- Creating 1D and 2D plots
- Heatmap visualizations
- Working with selections
- Customizing plots and subplots
5. Machine Learning Integration
Build ML pipelines with transformers, encoders, and integration with scikit-learn, XGBoost, and other frameworks. Reference references/machine_learning.md for:
- Feature scaling and encoding
- PCA and dimensionality reduction
- K-means clustering
- Integration with scikit-learn/XGBoost/CatBoost
- Model serialization and deployment
6. I/O Operations
Efficiently read and write data in various formats with optimal performance. Reference references/io_operations.md for:
- File format recommendations
- Export strategies
- Working with Apache Arrow
- CSV handling for large files
- Server and remote data access
Quick Start Pattern
For most Vaex tasks, follow this pattern:
import vaex
# 1. Open or create DataFrame
df = vaex.open('large_file.hdf5') # or .csv, .arrow, .parquet
# OR
df = vaex.from_pandas(pandas_df)
# 2. Explore the data
print(df) # Shows first/last rows and column info
df.describe() # Statistical summary
# 3. Create virtual columns (no memory overhead)
df['new_column'] = df.x ** 2 + df.y
# 4. Filter with selections
df_filtered = df[df.age > 25]
# 5. Compute statistics (fast, lazy evaluation)
mean_val = df.x.mean()
stats = df.groupby('category').agg({'value': 'sum'})
# 6. Visualize (df.viz is the recommended accessor since vaex 4.0)
df.viz.heatmap(df.x, df.y, limits='99.7%', show=True)
# Legacy: df.plot1d() and df.plot() still work on the DataFrame
# 7. Export if needed
df.export_hdf5('output.hdf5')
Working with References
The reference files contain detailed information about each capability area. Load references into context based on the specific task:
- Basic operations: Start with
references/core_dataframes.mdandreferences/data_processing.md - Performance issues: Check
references/performance.md - Visualization tasks: Use
references/visualization.md - ML pipelines: Reference
references/machine_learning.md - File I/O: Consult
references/io_operations.md
Best Practices
- Use HDF5 or Apache Arrow formats for optimal performance with large datasets
- Leverage virtual columns instead of materializing data to save memory
- Batch operations using
delay=Truewhen performing multiple calculations - Export to efficient formats rather than keeping data in CSV
- Use expressions for complex calculations without intermediate storage
- Profile with
df.describe()anddf.nbytesto understand data shape and memory usage
Common Patterns
Pattern: Converting Large CSV to HDF5
import vaex
# Open large CSV lazily (vaex 4.14+), or use from_csv to convert to HDF5
df = vaex.open('large_file.csv')
# df = vaex.from_csv('large_file.csv', convert='large_file.hdf5')
# Export to HDF5 for faster future access
df.export_hdf5('large_file.hdf5')
# Future loads are instant
df = vaex.open('large_file.hdf5')
Pattern: Efficient Aggregations
# Use delay=True to batch multiple operations
mean_x = df.x.mean(delay=True)
std_y = df.y.std(delay=True)
sum_z = df.z.sum(delay=True)
# Execute all at once
results = vaex.execute([mean_x, std_y, sum_z])
Pattern: Virtual Columns for Feature Engineering
# No memory overhead - computed on the fly
df['age_squared'] = df.age ** 2
df['full_name'] = df.first_name + ' ' + df.last_name
df['is_adult'] = df.age >= 18
Resources
This skill includes reference documentation in the references/ directory:
core_dataframes.md- DataFrame creation, loading, and basic structuredata_processing.md- Filtering, expressions, aggregations, and transformationsperformance.md- Optimization strategies and lazy evaluationvisualization.md- Plotting and interactive visualizationsmachine_learning.md- ML pipelines and model integrationio_operations.md- File formats and data import/export
GitHub リポジトリ
関連スキル
content-collections
メタこのスキルは、Content Collections(Markdown/MDXファイルを型安全なデータコレクションに変換するTypeScriptファーストのツール)の本番環境でテストされた設定を提供します。Zodバリデーションによる型安全性を実現し、ブログ、ドキュメントサイト、コンテンツ重視のVite + Reactアプリケーション構築時にご利用ください。Viteプラグインの設定、MDXコンパイルから、デプロイ最適化、スキーマバリデーションまで、すべてを網羅しています。
polymarket
メタこのスキルは、開発者がPolymarket予測市場プラットフォームを活用したアプリケーション構築を可能にします。API統合による取引や市場データの取得に加え、WebSocketを介したリアルタイムデータストリーミングにより、ライブ取引や市場活動を監視できます。取引戦略の実装や、ライブ市場更新を処理するツールの作成にご利用ください。
creating-opencode-plugins
メタこのスキルは、開発者がコマンド、ファイル、LSP操作など25種類以上のイベントタイプにフックするOpenCodeプラグインを作成することを支援します。JavaScript/TypeScriptモジュール向けに、プラグイン構造、イベントAPI仕様、および実装パターンを提供します。カスタムイベント駆動ロジックでOpenCode AIアシスタントのライフサイクルをインターセプト、監視、または拡張する必要がある場合にご利用ください。
sglang
メタSGLangは、高性能なLLMサービングフレームワークであり、RadixAttentionプレフィックスキャッシュを活用したJSON、正規表現、エージェントワークフロー向けの高速で構造化された生成を特長とします。特にプレフィックスが繰り返されるタスクにおいて、大幅に高速な推論を実現し、複雑な構造化出力やマルチターン対話に最適です。制約付きデコードが必要な場合や、広範なプレフィックス共有を伴うアプリケーションを構築する場合は、vLLMなどの代替案ではなくSGLangを選択してください。
