SKILL·D11D37

cellxgene-census

Name: cellxgene-census
Author: K-Dense-AI

K-Dense-AI

업데이트됨 1 month ago

31,025

3,113

31,025

GitHub에서 보기

기타data

정보

이 스킬은 6천1백만 개 이상의 세포를 보유한 가장 큰 정제된 단일 세포 아틀라스인 CELLxGENE Census에 프로그래밍 방식으로 쿼리할 수 있는 기능을 제공합니다. 이는 집단 규모 분석, 참조 아틀라스 비교, 조직, 질병 또는 세포 유형에 걸친 표준화된 발현 데이터 추출에 최적화되어 있습니다. 공개 아틀라스 쿼리에는 이 도구를 사용하고, 사용자 개인의 데이터 분석에는 scanpy나 scvi-tools와 같은 도구를 활용하세요.

빠른 설치

Claude Code

문서

CZ CELLxGENE Census

Overview

The CZ CELLxGENE Census provides programmatic access to a comprehensive, versioned collection of standardized single-cell genomics data from CZ CELLxGENE Discover. This skill enables efficient querying and analysis of millions of cells across thousands of datasets.

The Census includes:

61+ million cells from human and mouse
Standardized metadata (cell types, tissues, diseases, donors)
Raw gene expression matrices
Pre-calculated embeddings and statistics
Integration with PyTorch, scanpy, and other analysis tools

When to Use This Skill

This skill should be used when:

Querying single-cell expression data by cell type, tissue, or disease
Exploring available single-cell datasets and metadata
Training machine learning models on single-cell data
Performing large-scale cross-dataset analyses
Integrating Census data with scanpy or other analysis frameworks
Computing statistics across millions of cells
Accessing pre-calculated embeddings or model predictions

Installation and Setup

Install the Census API:

uv pip install cellxgene-census

For machine learning workflows, install additional dependencies:

uv pip install cellxgene-census[experimental]

Core Workflow Patterns

1. Opening the Census

Always use the context manager to ensure proper resource cleanup:

import cellxgene_census

# Open latest stable version
with cellxgene_census.open_soma() as census:
    # Work with census data

# Open specific version for reproducibility
with cellxgene_census.open_soma(census_version="2023-07-25") as census:
    # Work with census data

Key points:

Use context manager (with statement) for automatic cleanup
Specify census_version for reproducible analyses
Default opens latest "stable" release

2. Exploring Census Information

Before querying expression data, explore available datasets and metadata.

Access summary information:

# Get summary statistics
summary = census["census_info"]["summary"].read().concat().to_pandas()
print(f"Total cells: {summary['total_cell_count'][0]}")

# Get all datasets
datasets = census["census_info"]["datasets"].read().concat().to_pandas()

# Filter datasets by criteria
covid_datasets = datasets[datasets["disease"].str.contains("COVID", na=False)]

Query cell metadata to understand available data:

# Get unique cell types in a tissue
cell_metadata = cellxgene_census.get_obs(
    census,
    "homo_sapiens",
    value_filter="tissue_general == 'brain' and is_primary_data == True",
    column_names=["cell_type"]
)
unique_cell_types = cell_metadata["cell_type"].unique()
print(f"Found {len(unique_cell_types)} cell types in brain")

# Count cells by tissue
tissue_counts = cell_metadata.groupby("tissue_general").size()

Important: Always filter for is_primary_data == True to avoid counting duplicate cells unless specifically analyzing duplicates.

3. Querying Expression Data (Small to Medium Scale)

For queries returning < 100k cells that fit in memory, use get_anndata():

# Basic query with cell type and tissue filters
adata = cellxgene_census.get_anndata(
    census=census,
    organism="Homo sapiens",  # or "Mus musculus"
    obs_value_filter="cell_type == 'B cell' and tissue_general == 'lung' and is_primary_data == True",
    obs_column_names=["assay", "disease", "sex", "donor_id"],
)

# Query specific genes with multiple filters
adata = cellxgene_census.get_anndata(
    census=census,
    organism="Homo sapiens",
    var_value_filter="feature_name in ['CD4', 'CD8A', 'CD19', 'FOXP3']",
    obs_value_filter="cell_type == 'T cell' and disease == 'COVID-19' and is_primary_data == True",
    obs_column_names=["cell_type", "tissue_general", "donor_id"],
)

Filter syntax:

Use obs_value_filter for cell filtering
Use var_value_filter for gene filtering
Combine conditions with and, or
Use in for multiple values: tissue in ['lung', 'liver']
Select only needed columns with obs_column_names

Getting metadata separately:

# Query cell metadata
cell_metadata = cellxgene_census.get_obs(
    census, "homo_sapiens",
    value_filter="disease == 'COVID-19' and is_primary_data == True",
    column_names=["cell_type", "tissue_general", "donor_id"]
)

# Query gene metadata
gene_metadata = cellxgene_census.get_var(
    census, "homo_sapiens",
    value_filter="feature_name in ['CD4', 'CD8A']",
    column_names=["feature_id", "feature_name", "feature_length"]
)

4. Large-Scale Queries (Out-of-Core Processing)

For queries exceeding available RAM, use axis_query() with iterative processing:

import tiledbsoma as soma

# Create axis query
query = census["census_data"]["homo_sapiens"].axis_query(
    measurement_name="RNA",
    obs_query=soma.AxisQuery(
        value_filter="tissue_general == 'brain' and is_primary_data == True"
    ),
    var_query=soma.AxisQuery(
        value_filter="feature_name in ['FOXP2', 'TBR1', 'SATB2']"
    )
)

# Iterate through expression matrix in chunks
iterator = query.X("raw").tables()
for batch in iterator:
    # batch is a pyarrow.Table with columns:
    # - soma_data: expression value
    # - soma_dim_0: cell (obs) coordinate
    # - soma_dim_1: gene (var) coordinate
    process_batch(batch)

Computing incremental statistics:

# Example: Calculate mean expression
n_observations = 0
sum_values = 0.0

iterator = query.X("raw").tables()
for batch in iterator:
    values = batch["soma_data"].to_numpy()
    n_observations += len(values)
    sum_values += values.sum()

mean_expression = sum_values / n_observations

5. Machine Learning with PyTorch

For training models, use the experimental PyTorch integration:

from cellxgene_census.experimental.ml import experiment_dataloader

with cellxgene_census.open_soma() as census:
    # Create dataloader
    dataloader = experiment_dataloader(
        census["census_data"]["homo_sapiens"],
        measurement_name="RNA",
        X_name="raw",
        obs_value_filter="tissue_general == 'liver' and is_primary_data == True",
        obs_column_names=["cell_type"],
        batch_size=128,
        shuffle=True,
    )

    # Training loop
    for epoch in range(num_epochs):
        for batch in dataloader:
            X = batch["X"]  # Gene expression tensor
            labels = batch["obs"]["cell_type"]  # Cell type labels

            # Forward pass
            outputs = model(X)
            loss = criterion(outputs, labels)

            # Backward pass
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

Train/test splitting:

from cellxgene_census.experimental.ml import ExperimentDataset

# Create dataset from experiment
dataset = ExperimentDataset(
    experiment_axis_query,
    layer_name="raw",
    obs_column_names=["cell_type"],
    batch_size=128,
)

# Split into train and test
train_dataset, test_dataset = dataset.random_split(
    split=[0.8, 0.2],
    seed=42
)

6. Integration with Scanpy

Seamlessly integrate Census data with scanpy workflows:

import scanpy as sc

# Load data from Census
adata = cellxgene_census.get_anndata(
    census=census,
    organism="Homo sapiens",
    obs_value_filter="cell_type == 'neuron' and tissue_general == 'cortex' and is_primary_data == True",
)

# Standard scanpy workflow
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
sc.pp.highly_variable_genes(adata, n_top_genes=2000)

# Dimensionality reduction
sc.pp.pca(adata, n_comps=50)
sc.pp.neighbors(adata)
sc.tl.umap(adata)

# Visualization
sc.pl.umap(adata, color=["cell_type", "tissue", "disease"])

7. Multi-Dataset Integration

Query and integrate multiple datasets:

# Strategy 1: Query multiple tissues separately
tissues = ["lung", "liver", "kidney"]
adatas = []

for tissue in tissues:
    adata = cellxgene_census.get_anndata(
        census=census,
        organism="Homo sapiens",
        obs_value_filter=f"tissue_general == '{tissue}' and is_primary_data == True",
    )
    adata.obs["tissue"] = tissue
    adatas.append(adata)

# Concatenate
combined = adatas[0].concatenate(adatas[1:])

# Strategy 2: Query multiple datasets directly
adata = cellxgene_census.get_anndata(
    census=census,
    organism="Homo sapiens",
    obs_value_filter="tissue_general in ['lung', 'liver', 'kidney'] and is_primary_data == True",
)

Key Concepts and Best Practices

Always Filter for Primary Data

Unless analyzing duplicates, always include is_primary_data == True in queries to avoid counting cells multiple times:

obs_value_filter="cell_type == 'B cell' and is_primary_data == True"

Specify Census Version for Reproducibility

Always specify the Census version in production analyses:

census = cellxgene_census.open_soma(census_version="2023-07-25")

Estimate Query Size Before Loading

For large queries, first check the number of cells to avoid memory issues:

# Get cell count
metadata = cellxgene_census.get_obs(
    census, "homo_sapiens",
    value_filter="tissue_general == 'brain' and is_primary_data == True",
    column_names=["soma_joinid"]
)
n_cells = len(metadata)
print(f"Query will return {n_cells:,} cells")

# If too large (>100k), use out-of-core processing

Use tissue_general for Broader Groupings

The tissue_general field provides coarser categories than tissue, useful for cross-tissue analyses:

# Broader grouping
obs_value_filter="tissue_general == 'immune system'"

# Specific tissue
obs_value_filter="tissue == 'peripheral blood mononuclear cell'"

Select Only Needed Columns

Minimize data transfer by specifying only required metadata columns:

obs_column_names=["cell_type", "tissue_general", "disease"]  # Not all columns

Check Dataset Presence for Gene-Specific Queries

When analyzing specific genes, verify which datasets measured them:

presence = cellxgene_census.get_presence_matrix(
    census,
    "homo_sapiens",
    var_value_filter="feature_name in ['CD4', 'CD8A']"
)

Two-Step Workflow: Explore Then Query

First explore metadata to understand available data, then query expression:

# Step 1: Explore what's available
metadata = cellxgene_census.get_obs(
    census, "homo_sapiens",
    value_filter="disease == 'COVID-19' and is_primary_data == True",
    column_names=["cell_type", "tissue_general"]
)
print(metadata.value_counts())

# Step 2: Query based on findings
adata = cellxgene_census.get_anndata(
    census=census,
    organism="Homo sapiens",
    obs_value_filter="disease == 'COVID-19' and cell_type == 'T cell' and is_primary_data == True",
)

Available Metadata Fields

Cell Metadata (obs)

Key fields for filtering:

cell_type, cell_type_ontology_term_id
tissue, tissue_general, tissue_ontology_term_id
disease, disease_ontology_term_id
assay, assay_ontology_term_id
donor_id, sex, self_reported_ethnicity
development_stage, development_stage_ontology_term_id
dataset_id
is_primary_data (Boolean: True = unique cell)

Gene Metadata (var)

feature_id (Ensembl gene ID, e.g., "ENSG00000161798")
feature_name (Gene symbol, e.g., "FOXP2")
feature_length (Gene length in base pairs)

Reference Documentation

This skill includes detailed reference documentation:

references/census_schema.md

Comprehensive documentation of:

Census data structure and organization
All available metadata fields
Value filter syntax and operators
SOMA object types
Data inclusion criteria

When to read: When you need detailed schema information, full list of metadata fields, or complex filter syntax.

references/common_patterns.md

Examples and patterns for:

Exploratory queries (metadata only)
Small-to-medium queries (AnnData)
Large queries (out-of-core processing)
PyTorch integration
Scanpy integration workflows
Multi-dataset integration
Best practices and common pitfalls

When to read: When implementing specific query patterns, looking for code examples, or troubleshooting common issues.

Common Use Cases

Use Case 1: Explore Cell Types in a Tissue

with cellxgene_census.open_soma() as census:
    cells = cellxgene_census.get_obs(
        census, "homo_sapiens",
        value_filter="tissue_general == 'lung' and is_primary_data == True",
        column_names=["cell_type"]
    )
    print(cells["cell_type"].value_counts())

Use Case 2: Query Marker Gene Expression

with cellxgene_census.open_soma() as census:
    adata = cellxgene_census.get_anndata(
        census=census,
        organism="Homo sapiens",
        var_value_filter="feature_name in ['CD4', 'CD8A', 'CD19']",
        obs_value_filter="cell_type in ['T cell', 'B cell'] and is_primary_data == True",
    )

Use Case 3: Train Cell Type Classifier

from cellxgene_census.experimental.ml import experiment_dataloader

with cellxgene_census.open_soma() as census:
    dataloader = experiment_dataloader(
        census["census_data"]["homo_sapiens"],
        measurement_name="RNA",
        X_name="raw",
        obs_value_filter="is_primary_data == True",
        obs_column_names=["cell_type"],
        batch_size=128,
        shuffle=True,
    )

    # Train model
    for epoch in range(epochs):
        for batch in dataloader:
            # Training logic
            pass

Use Case 4: Cross-Tissue Analysis

with cellxgene_census.open_soma() as census:
    adata = cellxgene_census.get_anndata(
        census=census,
        organism="Homo sapiens",
        obs_value_filter="cell_type == 'macrophage' and tissue_general in ['lung', 'liver', 'brain'] and is_primary_data == True",
    )

    # Analyze macrophage differences across tissues
    sc.tl.rank_genes_groups(adata, groupby="tissue_general")

Troubleshooting

Query Returns Too Many Cells

Add more specific filters to reduce scope
Use tissue instead of tissue_general for finer granularity
Filter by specific dataset_id if known
Switch to out-of-core processing for large queries

Memory Errors

Reduce query scope with more restrictive filters
Select fewer genes with var_value_filter
Use out-of-core processing with axis_query()
Process data in batches

Duplicate Cells in Results

Always include is_primary_data == True in filters
Check if intentionally querying across multiple datasets

Gene Not Found

Verify gene name spelling (case-sensitive)
Try Ensembl ID with feature_id instead of feature_name
Check dataset presence matrix to see if gene was measured
Some genes may have been filtered during Census construction

Version Inconsistencies

Always specify census_version explicitly
Use same version across all analyses
Check release notes for version-specific changes

GitHub 저장소

K-Dense-AI/claude-scientific-skills

경로: skills/cellxgene-census

agent-skillsai-scientistbioinformaticschemoinformaticsclaudeclaude-skills

FAQ

Frequently asked questions

What is the cellxgene-census skill?

cellxgene-census is a Claude Skill by K-Dense-AI. Skills package instructions and resources that Claude loads on demand, so Claude can perform cellxgene-census-related tasks without extra prompting.

How do I install cellxgene-census?

Use the install commands on this page: add cellxgene-census to Claude Code as a plugin, or clone its repository into your skills directory, then restart Claude so it picks up the skill.

What category does cellxgene-census belong to?

cellxgene-census is in the Other category, tagged data.

Is cellxgene-census free to use?

Yes. cellxgene-census is listed on AIMCP and free to install. It runs inside Claude, so no separate service account is required to use the skill itself.

연관 스킬

llamaguard

기타

LlamaGuard는 폭력 및 혐오 발언 등 6가지 안전 범주에서 LLM 입력과 출력을 조정하기 위한 Meta의 70-80억 파라미터 모델입니다. 94-95% 정확도를 제공하며 vLLM, Hugging Face 또는 Amazon SageMaker를 사용해 배포할 수 있습니다. 이 기술을 사용하여 AI 애플리케이션에 콘텐츠 필터링 및 안전 가드레일을 손쉽게 통합하세요.

스킬 보기

cost-optimization

기타

이 Claude Skill은 리소스 적정화, 태깅 전략, 지출 분석을 통해 개발자들이 클라우드 비용을 최적화할 수 있도록 지원합니다. AWS, Azure, GCP에서 클라우드 비용을 절감하고 비용 거버넌스를 구현하기 위한 프레임워크를 제공합니다. 인프라 비용을 분석하거나, 리소스를 적정화하거나, 예산 제약을 충족해야 할 때 사용하세요.

스킬 보기

sports-betting-analyzer

기타

이 Claude Skill은 스프레드, 오버/언더, 프로프 베트를 포함한 스포츠 베팅 시장을 분석합니다. 역사적 추이와 상황별 통계를 검토하여 가치 베트를 발견하고, 교육적 목적으로 실행 가능한 권장 사항이 담긴 구조화된 마크다운 결과를 제공합니다. 개발자는 이 기능을 스포츠 베팅 분석 도구에 활용할 수 있으며, 단순히 엔터테인먼트/교육 목적으로만 설계되었음을 유의해야 합니다.

스킬 보기

quantizing-models-bitsandbytes

기타

이 스킬은 bitsandbytes를 사용하여 LLM을 8비트 또는 4비트 정밀도로 양자화하며, 최소한의 정확도 손실로 50-75%의 메모리 감소를 달성합니다. 제한된 GPU 메모리에서 더 큰 모델을 실행하거나 추론을 가속화하는 데 이상적이며, INT8, NF4, FP4와 같은 형식을 지원합니다. 이 스킬은 HuggingFace Transformers와 통합되어 QLoRA 학습 및 8비트 옵티마이저를 가능하게 합니다.

스킬 보기