scikit-bio
정보
scikit-bio는 생물정보학을 위한 Python 라이브러리로, 서열 분석, 계통학 작업, 마이크로바이옴 연구를 가능하게 합니다. 다양성 지표 계산(UniFrac), 정량화 분석 수행(PCoA), 생물학적 파일 형식 처리(FASTA, Newick)와 같은 작업에 사용하세요. DNA/RNA 서열, 정렬, 또는 미생물 생태학 데이터를 다루는 개발자에게 이상적입니다.
빠른 설치
Claude Code
추천npx skills add K-Dense-AI/claude-scientific-skills -a claude-code/plugin add https://github.com/K-Dense-AI/claude-scientific-skillsgit clone https://github.com/K-Dense-AI/claude-scientific-skills.git ~/.claude/skills/scikit-bioClaude Code에서 이 명령을 복사하여 붙여넣어 스킬을 설치하세요
문서
scikit-bio
Overview
scikit-bio is a comprehensive Python library for working with biological data. Apply this skill for bioinformatics analyses spanning sequence manipulation, alignment, phylogenetics, microbial ecology, and multivariate statistics.
When to Use This Skill
This skill should be used when the user:
- Works with biological sequences (DNA, RNA, protein)
- Needs to read/write biological file formats (FASTA, FASTQ, GenBank, Newick, BIOM, etc.)
- Performs sequence alignments or searches for motifs
- Constructs or analyzes phylogenetic trees
- Calculates diversity metrics (alpha/beta diversity, UniFrac distances)
- Performs ordination analysis (PCoA, CCA, RDA)
- Runs statistical tests on biological/ecological data (PERMANOVA, ANOSIM, Mantel)
- Analyzes microbiome or community ecology data
- Works with protein embeddings from language models
- Needs to manipulate biological data tables
Core Capabilities
1. Sequence Manipulation
Work with biological sequences using specialized classes for DNA, RNA, and protein data.
Key operations:
- Read/write sequences from FASTA, FASTQ, GenBank, EMBL formats
- Sequence slicing, concatenation, and searching
- Reverse complement, transcription (DNA→RNA), and translation (RNA→protein)
- Find motifs and patterns using regex
- Calculate distances (Hamming, k-mer based)
- Handle sequence quality scores and metadata
Common patterns:
import skbio
# Read sequences from file
seq = skbio.DNA.read('input.fasta')
# Sequence operations
rc = seq.reverse_complement()
rna = seq.transcribe()
protein = rna.translate()
# Find motifs
motif_positions = seq.find_with_regex('ATG[ACGT]{3}')
# Check for properties
has_degens = seq.has_degenerates()
seq_no_gaps = seq.degap()
Important notes:
- Use
DNA,RNA,Proteinclasses for grammared sequences with validation - Use
Sequenceclass for generic sequences without alphabet restrictions - Quality scores automatically loaded from FASTQ files into positional metadata
- Metadata types: sequence-level (ID, description), positional (per-base), interval (regions/features)
2. Sequence Alignment
Perform pairwise and multiple sequence alignments using dynamic programming algorithms.
Key capabilities:
- Global alignment (Needleman-Wunsch with semi-global variant)
- Local alignment (Smith-Waterman)
- Configurable scoring schemes (match/mismatch, gap penalties, substitution matrices)
- CIGAR string conversion
- Multiple sequence alignment storage and manipulation with
TabularMSA
Common patterns:
from skbio.alignment import local_pairwise_align_ssw, TabularMSA
# Pairwise alignment
alignment = local_pairwise_align_ssw(seq1, seq2)
# Access aligned sequences
msa = alignment.aligned_sequences
# Read multiple alignment from file
msa = TabularMSA.read('alignment.fasta', constructor=skbio.DNA)
# Calculate consensus
consensus = msa.consensus()
Important notes:
- Use
local_pairwise_align_sswfor local alignments (faster, SSW-based) - Use
StripedSmithWatermanfor protein alignments - Affine gap penalties recommended for biological sequences
- Can convert between scikit-bio, BioPython, and Biotite alignment formats
3. Phylogenetic Trees
Construct, manipulate, and analyze phylogenetic trees representing evolutionary relationships.
Key capabilities:
- Tree construction from distance matrices (UPGMA, WPGMA, Neighbor Joining, GME, BME)
- Tree manipulation (pruning, rerooting, traversal)
- Distance calculations (patristic, cophenetic, Robinson-Foulds)
- ASCII visualization
- Newick format I/O
Common patterns:
from skbio import TreeNode
from skbio.tree import nj
# Read tree from file
tree = TreeNode.read('tree.nwk')
# Construct tree from distance matrix
tree = nj(distance_matrix)
# Tree operations
subtree = tree.shear(['taxon1', 'taxon2', 'taxon3'])
tips = [node for node in tree.tips()]
lca = tree.lowest_common_ancestor(['taxon1', 'taxon2'])
# Calculate distances
patristic_dist = tree.find('taxon1').distance(tree.find('taxon2'))
cophenetic_matrix = tree.cophenetic_matrix()
# Compare trees
rf_distance = tree.robinson_foulds(other_tree)
Important notes:
- Use
nj()for neighbor joining (classic phylogenetic method) - Use
upgma()for UPGMA (assumes molecular clock) - GME and BME are highly scalable for large trees
- Trees can be rooted or unrooted; some metrics require specific rooting
4. Diversity Analysis
Calculate alpha and beta diversity metrics for microbial ecology and community analysis.
Key capabilities:
- Alpha diversity: richness, Shannon entropy, Simpson index, Faith's PD, Pielou's evenness
- Beta diversity: Bray-Curtis, Jaccard, weighted/unweighted UniFrac, Euclidean distances
- Phylogenetic diversity metrics (require tree input)
- Rarefaction and subsampling
- Integration with ordination and statistical tests
Common patterns:
from skbio.diversity import alpha_diversity, beta_diversity
import skbio
# Alpha diversity
alpha = alpha_diversity('shannon', counts_matrix, ids=sample_ids)
faith_pd = alpha_diversity('faith_pd', counts_matrix, ids=sample_ids,
tree=tree, otu_ids=feature_ids)
# Beta diversity
bc_dm = beta_diversity('braycurtis', counts_matrix, ids=sample_ids)
unifrac_dm = beta_diversity('unweighted_unifrac', counts_matrix,
ids=sample_ids, tree=tree, otu_ids=feature_ids)
# Get available metrics
from skbio.diversity import get_alpha_diversity_metrics
print(get_alpha_diversity_metrics())
Important notes:
- Counts must be integers representing abundances, not relative frequencies
- Phylogenetic metrics (Faith's PD, UniFrac) require tree and OTU ID mapping
- Use
partial_beta_diversity()for computing specific sample pairs only - Alpha diversity returns Series, beta diversity returns DistanceMatrix
5. Ordination Methods
Reduce high-dimensional biological data to visualizable lower-dimensional spaces.
Key capabilities:
- PCoA (Principal Coordinate Analysis) from distance matrices
- CA (Correspondence Analysis) for contingency tables
- CCA (Canonical Correspondence Analysis) with environmental constraints
- RDA (Redundancy Analysis) for linear relationships
- Biplot projection for feature interpretation
Common patterns:
from skbio.stats.ordination import pcoa, cca
# PCoA from distance matrix
pcoa_results = pcoa(distance_matrix)
pc1 = pcoa_results.samples['PC1']
pc2 = pcoa_results.samples['PC2']
# CCA with environmental variables
cca_results = cca(species_matrix, environmental_matrix)
# Save/load ordination results
pcoa_results.write('ordination.txt')
results = skbio.OrdinationResults.read('ordination.txt')
Important notes:
- PCoA works with any distance/dissimilarity matrix
- CCA reveals environmental drivers of community composition
- Ordination results include eigenvalues, proportion explained, and sample/feature coordinates
- Results integrate with plotting libraries (matplotlib, seaborn, plotly)
6. Statistical Testing
Perform hypothesis tests specific to ecological and biological data.
Key capabilities:
- PERMANOVA: test group differences using distance matrices
- ANOSIM: alternative test for group differences
- PERMDISP: test homogeneity of group dispersions
- Mantel test: correlation between distance matrices
- Bioenv: find environmental variables correlated with distances
Common patterns:
from skbio.stats.distance import permanova, anosim, mantel
# Test if groups differ significantly
permanova_results = permanova(distance_matrix, grouping, permutations=999)
print(f"p-value: {permanova_results['p-value']}")
# ANOSIM test
anosim_results = anosim(distance_matrix, grouping, permutations=999)
# Mantel test between two distance matrices
mantel_results = mantel(dm1, dm2, method='pearson', permutations=999)
print(f"Correlation: {mantel_results[0]}, p-value: {mantel_results[1]}")
Important notes:
- Permutation tests provide non-parametric significance testing
- Use 999+ permutations for robust p-values
- PERMANOVA sensitive to dispersion differences; pair with PERMDISP
- Mantel tests assess matrix correlation (e.g., geographic vs genetic distance)
7. File I/O and Format Conversion
Read and write 19+ biological file formats with automatic format detection.
Supported formats:
- Sequences: FASTA, FASTQ, GenBank, EMBL, QSeq
- Alignments: Clustal, PHYLIP, Stockholm
- Trees: Newick
- Tables: BIOM (HDF5 and JSON)
- Distances: delimited square matrices
- Analysis: BLAST+6/7, GFF3, Ordination results
- Metadata: TSV/CSV with validation
Common patterns:
import skbio
# Read with automatic format detection
seq = skbio.DNA.read('file.fasta', format='fasta')
tree = skbio.TreeNode.read('tree.nwk')
# Write to file
seq.write('output.fasta', format='fasta')
# Generator for large files (memory efficient)
for seq in skbio.io.read('large.fasta', format='fasta', constructor=skbio.DNA):
process(seq)
# Convert formats
seqs = list(skbio.io.read('input.fastq', format='fastq', constructor=skbio.DNA))
skbio.io.write(seqs, format='fasta', into='output.fasta')
Important notes:
- Use generators for large files to avoid memory issues
- Format can be auto-detected when
intoparameter specified - Some objects can be written to multiple formats
- Support for stdin/stdout piping with
verify=False
8. Distance Matrices
Create and manipulate distance/dissimilarity matrices with statistical methods.
Key capabilities:
- Store symmetric (DistanceMatrix) or asymmetric (DissimilarityMatrix) data
- ID-based indexing and slicing
- Integration with diversity, ordination, and statistical tests
- Read/write delimited text format
Common patterns:
from skbio import DistanceMatrix
import numpy as np
# Create from array
data = np.array([[0, 1, 2], [1, 0, 3], [2, 3, 0]])
dm = DistanceMatrix(data, ids=['A', 'B', 'C'])
# Access distances
dist_ab = dm['A', 'B']
row_a = dm['A']
# Read from file
dm = DistanceMatrix.read('distances.txt')
# Use in downstream analyses
pcoa_results = pcoa(dm)
permanova_results = permanova(dm, grouping)
Important notes:
- DistanceMatrix enforces symmetry and zero diagonal
- DissimilarityMatrix allows asymmetric values
- IDs enable integration with metadata and biological knowledge
- Compatible with pandas, numpy, and scikit-learn
9. Biological Tables
Work with feature tables (OTU/ASV tables) common in microbiome research.
Key capabilities:
- BIOM format I/O (HDF5 and JSON)
- Integration with pandas, polars, AnnData, numpy
- Data augmentation techniques (phylomix, mixup, compositional methods)
- Sample/feature filtering and normalization
- Metadata integration
Common patterns:
from skbio import Table
# Read BIOM table
table = Table.read('table.biom')
# Access data
sample_ids = table.ids(axis='sample')
feature_ids = table.ids(axis='observation')
counts = table.matrix_data
# Filter
filtered = table.filter(sample_ids_to_keep, axis='sample')
# Convert to/from pandas
df = table.to_dataframe()
table = Table.from_dataframe(df)
Important notes:
- BIOM tables are standard in QIIME 2 workflows
- Rows typically represent samples, columns represent features (OTUs/ASVs)
- Supports sparse and dense representations
- Output format configurable (pandas/polars/numpy)
10. Protein Embeddings
Work with protein language model embeddings for downstream analysis.
Key capabilities:
- Store embeddings from protein language models (ESM, ProtTrans, etc.)
- Convert embeddings to distance matrices
- Generate ordination objects for visualization
- Export to numpy/pandas for ML workflows
Common patterns:
from skbio.embedding import ProteinEmbedding, ProteinVector
# Create embedding from array
embedding = ProteinEmbedding(embedding_array, sequence_ids)
# Convert to distance matrix for analysis
dm = embedding.to_distances(metric='euclidean')
# PCoA visualization of embedding space
pcoa_results = embedding.to_ordination(metric='euclidean', method='pcoa')
# Export for machine learning
array = embedding.to_array()
df = embedding.to_dataframe()
Important notes:
- Embeddings bridge protein language models with traditional bioinformatics
- Compatible with scikit-bio's distance/ordination/statistics ecosystem
- SequenceEmbedding and ProteinEmbedding provide specialized functionality
- Useful for sequence clustering, classification, and visualization
Best Practices
Installation
uv pip install scikit-bio
Performance Considerations
- Use generators for large sequence files to minimize memory usage
- For massive phylogenetic trees, prefer GME or BME over NJ
- Beta diversity calculations can be parallelized with
partial_beta_diversity() - BIOM format (HDF5) more efficient than JSON for large tables
Integration with Ecosystem
- Sequences interoperate with Biopython via standard formats
- Tables integrate with pandas, polars, and AnnData
- Distance matrices compatible with scikit-learn
- Ordination results visualizable with matplotlib/seaborn/plotly
- Works seamlessly with QIIME 2 artifacts (BIOM, trees, distance matrices)
Common Workflows
- Microbiome diversity analysis: Read BIOM table → Calculate alpha/beta diversity → Ordination (PCoA) → Statistical testing (PERMANOVA)
- Phylogenetic analysis: Read sequences → Align → Build distance matrix → Construct tree → Calculate phylogenetic distances
- Sequence processing: Read FASTQ → Quality filter → Trim/clean → Find motifs → Translate → Write FASTA
- Comparative genomics: Read sequences → Pairwise alignment → Calculate distances → Build tree → Analyze clades
Reference Documentation
For detailed API information, parameter specifications, and advanced usage examples, refer to references/api_reference.md which contains comprehensive documentation on:
- Complete method signatures and parameters for all capabilities
- Extended code examples for complex workflows
- Troubleshooting common issues
- Performance optimization tips
- Integration patterns with other libraries
Additional Resources
- Official documentation: https://scikit.bio/docs/latest/
- GitHub repository: https://github.com/scikit-bio/scikit-bio
- Forum support: https://forum.qiime2.org (scikit-bio is part of QIIME 2 ecosystem)
GitHub 저장소
연관 스킬
llamaguard
기타LlamaGuard는 폭력 및 혐오 발언 등 6가지 안전 범주에서 LLM 입력과 출력을 조정하기 위한 Meta의 70-80억 파라미터 모델입니다. 94-95% 정확도를 제공하며 vLLM, Hugging Face 또는 Amazon SageMaker를 사용해 배포할 수 있습니다. 이 기술을 사용하여 AI 애플리케이션에 콘텐츠 필터링 및 안전 가드레일을 손쉽게 통합하세요.
cost-optimization
기타이 Claude Skill은 리소스 적정화, 태깅 전략, 지출 분석을 통해 개발자들이 클라우드 비용을 최적화할 수 있도록 지원합니다. AWS, Azure, GCP에서 클라우드 비용을 절감하고 비용 거버넌스를 구현하기 위한 프레임워크를 제공합니다. 인프라 비용을 분석하거나, 리소스를 적정화하거나, 예산 제약을 충족해야 할 때 사용하세요.
quantizing-models-bitsandbytes
기타이 스킬은 bitsandbytes를 사용하여 LLM을 8비트 또는 4비트 정밀도로 양자화하며, 최소한의 정확도 손실로 50-75%의 메모리 감소를 달성합니다. 제한된 GPU 메모리에서 더 큰 모델을 실행하거나 추론을 가속화하는 데 이상적이며, INT8, NF4, FP4와 같은 형식을 지원합니다. 이 스킬은 HuggingFace Transformers와 통합되어 QLoRA 학습 및 8비트 옵티마이저를 가능하게 합니다.
dispatching-parallel-agents
기타이 Claude Skill은 3개 이상의 독립적인 문제를 동시에 조사하고 해결하기 위해 다중 에이전트를 배치합니다. 공유 상태나 의존성 없이 해결 가능한 무관련 장애 시나리오에 맞게 설계되었습니다. 핵심 기능은 병렬 문제 해결로, 각 독립 문제 영역마다 하나의 에이전트를 할당하여 효율성을 극대화합니다.
