返回技能列表

anndata

K-Dense-AI
更新于 Today
26,534
2,743
26,534
在 GitHub 上查看
automationdata

关于

The anndata skill provides the core data structure for handling annotated matrices in single-cell analysis, primarily for working with .h5ad files within the scverse ecosystem. It enables efficient storage and manipulation of experimental data alongside observation and variable metadata. Use this skill for fundamental data I/O and structure operations, while employing other specialized tools for analysis, modeling, or large-scale queries.

快速安装

Claude Code

推荐
主要方式
npx skills add K-Dense-AI/claude-scientific-skills -a claude-code
插件命令备选方式
/plugin add https://github.com/K-Dense-AI/claude-scientific-skills
Git 克隆备选方式
git clone https://github.com/K-Dense-AI/claude-scientific-skills.git ~/.claude/skills/anndata

在 Claude Code 中复制并粘贴此命令以安装该技能

技能文档

AnnData

Overview

AnnData is a Python package for handling annotated data matrices, storing experimental measurements (X) alongside observation metadata (obs), variable metadata (var), and multi-dimensional annotations (obsm, varm, obsp, varp, uns). Originally designed for single-cell genomics through Scanpy, it now serves as a general-purpose framework for any annotated data requiring efficient storage, manipulation, and analysis.

When to Use This Skill

Use this skill when:

  • Creating, reading, or writing AnnData objects
  • Working with h5ad, zarr, or other genomics data formats
  • Performing single-cell RNA-seq analysis
  • Managing large datasets with sparse matrices or backed mode
  • Concatenating multiple datasets or experimental batches
  • Subsetting, filtering, or transforming annotated data
  • Integrating with scanpy, scvi-tools, or other scverse ecosystem tools

Installation

Requires Python 3.11+ (anndata 0.11+ dropped 3.9). Current stable release: 0.12.x.

uv pip install anndata

# Lazy I/O and dask-backed operations
uv pip install "anndata[dask,lazy]"

# Development / docs (contributors)
uv pip install "anndata[dev,test,doc]"

Quick Start

Creating an AnnData object

import anndata as ad
import numpy as np
import pandas as pd

# Minimal creation
X = np.random.rand(100, 2000)  # 100 cells × 2000 genes
adata = ad.AnnData(X)

# With metadata
obs = pd.DataFrame({
    'cell_type': ['T cell', 'B cell'] * 50,
    'sample': ['A', 'B'] * 50
}, index=[f'cell_{i}' for i in range(100)])

var = pd.DataFrame({
    'gene_name': [f'Gene_{i}' for i in range(2000)]
}, index=[f'ENSG{i:05d}' for i in range(2000)])

adata = ad.AnnData(X=X, obs=obs, var=var)

Reading data

# Native formats (read_h5ad/read_zarr remain at top-level)
adata = ad.read_h5ad('data.h5ad')
adata = ad.read_h5ad('large_data.h5ad', backed='r')  # lazy load for large files
adata = ad.read_zarr('data.zarr')

# Other formats: prefer anndata.io (top-level imports are deprecated)
from anndata.io import read_csv, read_loom, read_mtx

adata = read_csv('data.csv')
adata = read_loom('data.loom')

# 10X Genomics: use scanpy (not anndata) — see scanpy skill
import scanpy as sc
adata = sc.read_10x_h5('filtered_feature_bc_matrix.h5')
adata = sc.read_10x_mtx('filtered_feature_bc_matrix/')

Writing data

# Write h5ad file
adata.write_h5ad('output.h5ad')

# Write with compression
adata.write_h5ad('output.h5ad', compression='gzip')

# Write other formats
adata.write_zarr('output.zarr')
adata.write_csvs('output_dir/')

Basic operations

# Subset by conditions
t_cells = adata[adata.obs['cell_type'] == 'T cell']

# Subset by indices
subset = adata[0:50, 0:100]

# Add metadata
adata.obs['quality_score'] = np.random.rand(adata.n_obs)
adata.var['highly_variable'] = np.random.rand(adata.n_vars) > 0.8

# Access dimensions
print(f"{adata.n_obs} observations × {adata.n_vars} variables")

Core Capabilities

1. Data Structure

Understand the AnnData object structure including X, obs, var, layers, obsm, varm, obsp, varp, uns, and raw components.

See: references/data_structure.md for comprehensive information on:

  • Core components (X, obs, var, layers, obsm, varm, obsp, varp, uns, raw)
  • Creating AnnData objects from various sources
  • Accessing and manipulating data components
  • Memory-efficient practices

2. Input/Output Operations

Read and write data in various formats with support for compression, backed mode, and cloud storage.

See: references/io_operations.md for details on:

  • Native formats (h5ad, zarr)
  • Alternative formats (CSV, MTX, Loom, 10X, Excel)
  • Backed mode for large datasets
  • Remote data access
  • Format conversion
  • Performance optimization

Common commands:

from anndata.io import read_mtx

# Read/write h5ad
adata = ad.read_h5ad('data.h5ad', backed='r')
adata.write_h5ad('output.h5ad', compression='gzip')

# 10X Genomics (via scanpy)
import scanpy as sc
adata = sc.read_10x_h5('filtered_feature_bc_matrix.h5')

# Read MTX format
adata = read_mtx('matrix.mtx').T

3. Concatenation

Combine multiple AnnData objects along observations or variables with flexible join strategies.

See: references/concatenation.md for comprehensive coverage of:

  • Basic concatenation (axis=0 for observations, axis=1 for variables)
  • Join types (inner, outer)
  • Merge strategies (same, unique, first, only)
  • Tracking data sources with labels
  • Lazy concatenation (AnnCollection)
  • On-disk concatenation for large datasets

Common commands:

# Concatenate observations (combine samples)
adata = ad.concat(
    [adata1, adata2, adata3],
    axis=0,
    join='inner',
    label='batch',
    keys=['batch1', 'batch2', 'batch3']
)

# Concatenate variables (combine modalities)
adata = ad.concat([adata_rna, adata_protein], axis=1)

# Lazy concatenation
from anndata.experimental import AnnCollection
collection = AnnCollection(
    ['data1.h5ad', 'data2.h5ad'],
    join_obs='outer',
    label='dataset'
)

4. Data Manipulation

Transform, subset, filter, and reorganize data efficiently.

See: references/manipulation.md for detailed guidance on:

  • Subsetting (by indices, names, boolean masks, metadata conditions)
  • Transposition
  • Copying (full copies vs views)
  • Renaming (observations, variables, categories)
  • Type conversions (strings to categoricals, sparse/dense)
  • Adding/removing data components
  • Reordering
  • Quality control filtering

Common commands:

# Subset by metadata
filtered = adata[adata.obs['quality_score'] > 0.8]
hv_genes = adata[:, adata.var['highly_variable']]

# Transpose
adata_T = adata.T

# Copy vs view
view = adata[0:100, :]  # View (lightweight reference)
copy = adata[0:100, :].copy()  # Independent copy

# Convert strings to categoricals
adata.strings_to_categoricals()

5. Best Practices

Follow recommended patterns for memory efficiency, performance, and reproducibility.

See: references/best_practices.md for guidelines on:

  • Memory management (sparse matrices, categoricals, backed mode)
  • Views vs copies
  • Data storage optimization
  • Performance optimization
  • Working with raw data
  • Metadata management
  • Reproducibility
  • Error handling
  • Integration with other tools
  • Common pitfalls and solutions

Key recommendations:

# Use sparse matrices for sparse data
from scipy.sparse import csr_matrix
adata.X = csr_matrix(adata.X)

# Convert strings to categoricals
adata.strings_to_categoricals()

# Use backed mode for large files
adata = ad.read_h5ad('large.h5ad', backed='r')

# Store raw before filtering
adata.raw = adata.copy()
adata = adata[:, adata.var['highly_variable']]

Integration with Scverse Ecosystem

AnnData serves as the foundational data structure for the scverse ecosystem:

Scanpy (Single-cell analysis)

import scanpy as sc

# Preprocessing
sc.pp.filter_cells(adata, min_genes=200)
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
sc.pp.highly_variable_genes(adata, n_top_genes=2000)

# Dimensionality reduction
sc.pp.pca(adata, n_comps=50)
sc.pp.neighbors(adata, n_neighbors=15)
sc.tl.umap(adata)
sc.tl.leiden(adata)

# Visualization
sc.pl.umap(adata, color=['cell_type', 'leiden'])

Muon (Multimodal data)

import muon as mu

# Combine RNA and protein data
mdata = mu.MuData({'rna': adata_rna, 'protein': adata_protein})

PyTorch integration

from anndata.experimental import AnnLoader

# Create DataLoader for deep learning
dataloader = AnnLoader(adata, batch_size=128, shuffle=True)

for batch in dataloader:
    X = batch.X
    # Train model

Common Workflows

Single-cell RNA-seq analysis

import anndata as ad
import scanpy as sc

# 1. Load data (10X via scanpy; anndata handles h5ad/zarr natively)
adata = sc.read_10x_h5('filtered_feature_bc_matrix.h5')

# 2. Quality control
adata.obs['n_genes'] = (adata.X > 0).sum(axis=1)
adata.obs['n_counts'] = adata.X.sum(axis=1)
adata = adata[adata.obs['n_genes'] > 200]
adata = adata[adata.obs['n_counts'] < 50000]

# 3. Store raw
adata.raw = adata.copy()

# 4. Normalize and filter
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
sc.pp.highly_variable_genes(adata, n_top_genes=2000)
adata = adata[:, adata.var['highly_variable']]

# 5. Save processed data
adata.write_h5ad('processed.h5ad')

Batch integration

# Load multiple batches
adata1 = ad.read_h5ad('batch1.h5ad')
adata2 = ad.read_h5ad('batch2.h5ad')
adata3 = ad.read_h5ad('batch3.h5ad')

# Concatenate with batch labels
adata = ad.concat(
    [adata1, adata2, adata3],
    label='batch',
    keys=['batch1', 'batch2', 'batch3'],
    join='inner'
)

# Apply batch correction
import scanpy as sc
sc.pp.combat(adata, key='batch')

# Continue analysis
sc.pp.pca(adata)
sc.pp.neighbors(adata)
sc.tl.umap(adata)

Working with large datasets

# Open in backed mode
adata = ad.read_h5ad('100GB_dataset.h5ad', backed='r')

# Filter based on metadata (no data loading)
high_quality = adata[adata.obs['quality_score'] > 0.8]

# Load filtered subset
adata_subset = high_quality.to_memory()

# Process subset
process(adata_subset)

# Or process in chunks
chunk_size = 1000
for i in range(0, adata.n_obs, chunk_size):
    chunk = adata[i:i+chunk_size, :].to_memory()
    process(chunk)

Troubleshooting

Out of memory errors

Use backed mode or convert to sparse matrices:

# Backed mode
adata = ad.read_h5ad('file.h5ad', backed='r')

# Sparse matrices
from scipy.sparse import csr_matrix
adata.X = csr_matrix(adata.X)

Slow file reading

Use compression and appropriate formats:

# Optimize for storage
adata.strings_to_categoricals()
adata.write_h5ad('file.h5ad', compression='gzip')

# Use Zarr for cloud storage (v3 optional since anndata 0.12)
import anndata
anndata.settings.zarr_write_format = 3  # default is 2
adata.write_zarr('file.zarr', chunks=(1000, 1000))

Index alignment issues

Always align external data on index:

# Wrong
adata.obs['new_col'] = external_data['values']

# Correct
adata.obs['new_col'] = external_data.set_index('cell_id').loc[adata.obs_names, 'values']

Additional Resources

GitHub 仓库

K-Dense-AI/claude-scientific-skills
路径: skills/anndata
0
agent-skillsai-scientistbioinformaticschemoinformaticsclaudeclaude-skills

相关推荐技能

content-collections

Content Collections 是一个 TypeScript 优先的构建工具,可将本地 Markdown/MDX 文件转换为类型安全的数据集合。它专为构建博客、文档站和内容密集型 Vite+React 应用而设计,提供基于 Zod 的自动模式验证。该工具涵盖从 Vite 插件配置、MDX 编译到生产环境部署的完整工作流。

查看技能

polymarket

这个Claude Skill为开发者提供完整的Polymarket预测市场开发支持,涵盖API调用、交易执行和市场数据分析。关键特性包括实时WebSocket数据流,可监控实时交易、订单和市场动态。开发者可用它构建预测市场应用、实施交易策略并集成实时市场预测功能。

查看技能

creating-opencode-plugins

该Skill帮助开发者创建OpenCode插件,用于接入命令、文件、LSP等25+种事件。它提供了插件结构、事件API规范和JavaScript/TypeScript实现模式,适合需要拦截操作、扩展功能或自定义事件处理的场景。开发者可通过它快速构建响应式模块来增强OpenCode AI助手的能力。

查看技能

sglang

SGLang是一个专为LLM设计的高性能推理框架,特别适用于需要结构化输出的场景。它通过RadixAttention前缀缓存技术,在处理JSON、正则表达式、工具调用等具有重复前缀的复杂工作流时,能实现极速生成。如果你正在构建智能体或多轮对话系统,并追求远超vLLM的推理性能,SGLang是理想选择。

查看技能