スキル一覧に戻る

anndata

K-Dense-AI
更新日 Today
26,534
2,743
26,534
GitHubで表示
メタautomationdata

について

anndataスキルは、単一細胞解析における注釈付き行列を扱うためのコアデータ構造を提供し、主にscverseエコシステム内で.h5adファイルを操作するために使用されます。このスキルは、観測データや変数メタデータと共に実験データを効率的に保存・操作することを可能にします。基本的なデータ入出力や構造操作にはこのスキルを使用し、分析、モデリング、大規模クエリには他の専門ツールを活用してください。

クイックインストール

Claude Code

推奨
メイン
npx skills add K-Dense-AI/claude-scientific-skills -a claude-code
プラグインコマンド代替
/plugin add https://github.com/K-Dense-AI/claude-scientific-skills
Git クローン代替
git clone https://github.com/K-Dense-AI/claude-scientific-skills.git ~/.claude/skills/anndata

このコマンドをClaude Codeにコピー&ペーストしてスキルをインストールします

ドキュメント

AnnData

Overview

AnnData is a Python package for handling annotated data matrices, storing experimental measurements (X) alongside observation metadata (obs), variable metadata (var), and multi-dimensional annotations (obsm, varm, obsp, varp, uns). Originally designed for single-cell genomics through Scanpy, it now serves as a general-purpose framework for any annotated data requiring efficient storage, manipulation, and analysis.

When to Use This Skill

Use this skill when:

  • Creating, reading, or writing AnnData objects
  • Working with h5ad, zarr, or other genomics data formats
  • Performing single-cell RNA-seq analysis
  • Managing large datasets with sparse matrices or backed mode
  • Concatenating multiple datasets or experimental batches
  • Subsetting, filtering, or transforming annotated data
  • Integrating with scanpy, scvi-tools, or other scverse ecosystem tools

Installation

Requires Python 3.11+ (anndata 0.11+ dropped 3.9). Current stable release: 0.12.x.

uv pip install anndata

# Lazy I/O and dask-backed operations
uv pip install "anndata[dask,lazy]"

# Development / docs (contributors)
uv pip install "anndata[dev,test,doc]"

Quick Start

Creating an AnnData object

import anndata as ad
import numpy as np
import pandas as pd

# Minimal creation
X = np.random.rand(100, 2000)  # 100 cells × 2000 genes
adata = ad.AnnData(X)

# With metadata
obs = pd.DataFrame({
    'cell_type': ['T cell', 'B cell'] * 50,
    'sample': ['A', 'B'] * 50
}, index=[f'cell_{i}' for i in range(100)])

var = pd.DataFrame({
    'gene_name': [f'Gene_{i}' for i in range(2000)]
}, index=[f'ENSG{i:05d}' for i in range(2000)])

adata = ad.AnnData(X=X, obs=obs, var=var)

Reading data

# Native formats (read_h5ad/read_zarr remain at top-level)
adata = ad.read_h5ad('data.h5ad')
adata = ad.read_h5ad('large_data.h5ad', backed='r')  # lazy load for large files
adata = ad.read_zarr('data.zarr')

# Other formats: prefer anndata.io (top-level imports are deprecated)
from anndata.io import read_csv, read_loom, read_mtx

adata = read_csv('data.csv')
adata = read_loom('data.loom')

# 10X Genomics: use scanpy (not anndata) — see scanpy skill
import scanpy as sc
adata = sc.read_10x_h5('filtered_feature_bc_matrix.h5')
adata = sc.read_10x_mtx('filtered_feature_bc_matrix/')

Writing data

# Write h5ad file
adata.write_h5ad('output.h5ad')

# Write with compression
adata.write_h5ad('output.h5ad', compression='gzip')

# Write other formats
adata.write_zarr('output.zarr')
adata.write_csvs('output_dir/')

Basic operations

# Subset by conditions
t_cells = adata[adata.obs['cell_type'] == 'T cell']

# Subset by indices
subset = adata[0:50, 0:100]

# Add metadata
adata.obs['quality_score'] = np.random.rand(adata.n_obs)
adata.var['highly_variable'] = np.random.rand(adata.n_vars) > 0.8

# Access dimensions
print(f"{adata.n_obs} observations × {adata.n_vars} variables")

Core Capabilities

1. Data Structure

Understand the AnnData object structure including X, obs, var, layers, obsm, varm, obsp, varp, uns, and raw components.

See: references/data_structure.md for comprehensive information on:

  • Core components (X, obs, var, layers, obsm, varm, obsp, varp, uns, raw)
  • Creating AnnData objects from various sources
  • Accessing and manipulating data components
  • Memory-efficient practices

2. Input/Output Operations

Read and write data in various formats with support for compression, backed mode, and cloud storage.

See: references/io_operations.md for details on:

  • Native formats (h5ad, zarr)
  • Alternative formats (CSV, MTX, Loom, 10X, Excel)
  • Backed mode for large datasets
  • Remote data access
  • Format conversion
  • Performance optimization

Common commands:

from anndata.io import read_mtx

# Read/write h5ad
adata = ad.read_h5ad('data.h5ad', backed='r')
adata.write_h5ad('output.h5ad', compression='gzip')

# 10X Genomics (via scanpy)
import scanpy as sc
adata = sc.read_10x_h5('filtered_feature_bc_matrix.h5')

# Read MTX format
adata = read_mtx('matrix.mtx').T

3. Concatenation

Combine multiple AnnData objects along observations or variables with flexible join strategies.

See: references/concatenation.md for comprehensive coverage of:

  • Basic concatenation (axis=0 for observations, axis=1 for variables)
  • Join types (inner, outer)
  • Merge strategies (same, unique, first, only)
  • Tracking data sources with labels
  • Lazy concatenation (AnnCollection)
  • On-disk concatenation for large datasets

Common commands:

# Concatenate observations (combine samples)
adata = ad.concat(
    [adata1, adata2, adata3],
    axis=0,
    join='inner',
    label='batch',
    keys=['batch1', 'batch2', 'batch3']
)

# Concatenate variables (combine modalities)
adata = ad.concat([adata_rna, adata_protein], axis=1)

# Lazy concatenation
from anndata.experimental import AnnCollection
collection = AnnCollection(
    ['data1.h5ad', 'data2.h5ad'],
    join_obs='outer',
    label='dataset'
)

4. Data Manipulation

Transform, subset, filter, and reorganize data efficiently.

See: references/manipulation.md for detailed guidance on:

  • Subsetting (by indices, names, boolean masks, metadata conditions)
  • Transposition
  • Copying (full copies vs views)
  • Renaming (observations, variables, categories)
  • Type conversions (strings to categoricals, sparse/dense)
  • Adding/removing data components
  • Reordering
  • Quality control filtering

Common commands:

# Subset by metadata
filtered = adata[adata.obs['quality_score'] > 0.8]
hv_genes = adata[:, adata.var['highly_variable']]

# Transpose
adata_T = adata.T

# Copy vs view
view = adata[0:100, :]  # View (lightweight reference)
copy = adata[0:100, :].copy()  # Independent copy

# Convert strings to categoricals
adata.strings_to_categoricals()

5. Best Practices

Follow recommended patterns for memory efficiency, performance, and reproducibility.

See: references/best_practices.md for guidelines on:

  • Memory management (sparse matrices, categoricals, backed mode)
  • Views vs copies
  • Data storage optimization
  • Performance optimization
  • Working with raw data
  • Metadata management
  • Reproducibility
  • Error handling
  • Integration with other tools
  • Common pitfalls and solutions

Key recommendations:

# Use sparse matrices for sparse data
from scipy.sparse import csr_matrix
adata.X = csr_matrix(adata.X)

# Convert strings to categoricals
adata.strings_to_categoricals()

# Use backed mode for large files
adata = ad.read_h5ad('large.h5ad', backed='r')

# Store raw before filtering
adata.raw = adata.copy()
adata = adata[:, adata.var['highly_variable']]

Integration with Scverse Ecosystem

AnnData serves as the foundational data structure for the scverse ecosystem:

Scanpy (Single-cell analysis)

import scanpy as sc

# Preprocessing
sc.pp.filter_cells(adata, min_genes=200)
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
sc.pp.highly_variable_genes(adata, n_top_genes=2000)

# Dimensionality reduction
sc.pp.pca(adata, n_comps=50)
sc.pp.neighbors(adata, n_neighbors=15)
sc.tl.umap(adata)
sc.tl.leiden(adata)

# Visualization
sc.pl.umap(adata, color=['cell_type', 'leiden'])

Muon (Multimodal data)

import muon as mu

# Combine RNA and protein data
mdata = mu.MuData({'rna': adata_rna, 'protein': adata_protein})

PyTorch integration

from anndata.experimental import AnnLoader

# Create DataLoader for deep learning
dataloader = AnnLoader(adata, batch_size=128, shuffle=True)

for batch in dataloader:
    X = batch.X
    # Train model

Common Workflows

Single-cell RNA-seq analysis

import anndata as ad
import scanpy as sc

# 1. Load data (10X via scanpy; anndata handles h5ad/zarr natively)
adata = sc.read_10x_h5('filtered_feature_bc_matrix.h5')

# 2. Quality control
adata.obs['n_genes'] = (adata.X > 0).sum(axis=1)
adata.obs['n_counts'] = adata.X.sum(axis=1)
adata = adata[adata.obs['n_genes'] > 200]
adata = adata[adata.obs['n_counts'] < 50000]

# 3. Store raw
adata.raw = adata.copy()

# 4. Normalize and filter
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
sc.pp.highly_variable_genes(adata, n_top_genes=2000)
adata = adata[:, adata.var['highly_variable']]

# 5. Save processed data
adata.write_h5ad('processed.h5ad')

Batch integration

# Load multiple batches
adata1 = ad.read_h5ad('batch1.h5ad')
adata2 = ad.read_h5ad('batch2.h5ad')
adata3 = ad.read_h5ad('batch3.h5ad')

# Concatenate with batch labels
adata = ad.concat(
    [adata1, adata2, adata3],
    label='batch',
    keys=['batch1', 'batch2', 'batch3'],
    join='inner'
)

# Apply batch correction
import scanpy as sc
sc.pp.combat(adata, key='batch')

# Continue analysis
sc.pp.pca(adata)
sc.pp.neighbors(adata)
sc.tl.umap(adata)

Working with large datasets

# Open in backed mode
adata = ad.read_h5ad('100GB_dataset.h5ad', backed='r')

# Filter based on metadata (no data loading)
high_quality = adata[adata.obs['quality_score'] > 0.8]

# Load filtered subset
adata_subset = high_quality.to_memory()

# Process subset
process(adata_subset)

# Or process in chunks
chunk_size = 1000
for i in range(0, adata.n_obs, chunk_size):
    chunk = adata[i:i+chunk_size, :].to_memory()
    process(chunk)

Troubleshooting

Out of memory errors

Use backed mode or convert to sparse matrices:

# Backed mode
adata = ad.read_h5ad('file.h5ad', backed='r')

# Sparse matrices
from scipy.sparse import csr_matrix
adata.X = csr_matrix(adata.X)

Slow file reading

Use compression and appropriate formats:

# Optimize for storage
adata.strings_to_categoricals()
adata.write_h5ad('file.h5ad', compression='gzip')

# Use Zarr for cloud storage (v3 optional since anndata 0.12)
import anndata
anndata.settings.zarr_write_format = 3  # default is 2
adata.write_zarr('file.zarr', chunks=(1000, 1000))

Index alignment issues

Always align external data on index:

# Wrong
adata.obs['new_col'] = external_data['values']

# Correct
adata.obs['new_col'] = external_data.set_index('cell_id').loc[adata.obs_names, 'values']

Additional Resources

GitHub リポジトリ

K-Dense-AI/claude-scientific-skills
パス: skills/anndata
0
agent-skillsai-scientistbioinformaticschemoinformaticsclaudeclaude-skills

関連スキル

content-collections

メタ

このスキルは、Content Collections(Markdown/MDXファイルを型安全なデータコレクションに変換するTypeScriptファーストのツール)の本番環境でテストされた設定を提供します。Zodバリデーションによる型安全性を実現し、ブログ、ドキュメントサイト、コンテンツ重視のVite + Reactアプリケーション構築時にご利用ください。Viteプラグインの設定、MDXコンパイルから、デプロイ最適化、スキーマバリデーションまで、すべてを網羅しています。

スキルを見る

polymarket

メタ

このスキルは、開発者がPolymarket予測市場プラットフォームを活用したアプリケーション構築を可能にします。API統合による取引や市場データの取得に加え、WebSocketを介したリアルタイムデータストリーミングにより、ライブ取引や市場活動を監視できます。取引戦略の実装や、ライブ市場更新を処理するツールの作成にご利用ください。

スキルを見る

creating-opencode-plugins

メタ

このスキルは、開発者がコマンド、ファイル、LSP操作など25種類以上のイベントタイプにフックするOpenCodeプラグインを作成することを支援します。JavaScript/TypeScriptモジュール向けに、プラグイン構造、イベントAPI仕様、および実装パターンを提供します。カスタムイベント駆動ロジックでOpenCode AIアシスタントのライフサイクルをインターセプト、監視、または拡張する必要がある場合にご利用ください。

スキルを見る

sglang

メタ

SGLangは、高性能なLLMサービングフレームワークであり、RadixAttentionプレフィックスキャッシュを活用したJSON、正規表現、エージェントワークフロー向けの高速で構造化された生成を特長とします。特にプレフィックスが繰り返されるタスクにおいて、大幅に高速な推論を実現し、複雑な構造化出力やマルチターン対話に最適です。制約付きデコードが必要な場合や、広範なプレフィックス共有を伴うアプリケーションを構築する場合は、vLLMなどの代替案ではなくSGLangを選択してください。

スキルを見る