pysam
About
Pysam is a Python genomic file toolkit for reading, writing, and analyzing SAM/BAM/CRAM, VCF/BCF, and FASTA/FASTQ files. It provides a Pythonic interface to htslib for tasks like region extraction, coverage calculation, and executing samtools commands. Use this skill in NGS data processing pipelines for alignment analysis, variant calling, and sequencing data quality control.
Quick Install
Claude Code
Install the skill in Claude Code with one of the following commands:
Recommended: npx skills add K-Dense-AI/claude-scientific-skills -a claude-code
/plugin add https://github.com/K-Dense-AI/claude-scientific-skills
git clone https://github.com/K-Dense-AI/claude-scientific-skills.git ~/.claude/skills/pysam
Skill Documentation
Pysam
Overview
Pysam is a Python module for reading, manipulating, and writing genomic datasets. Read/write SAM/BAM/CRAM alignment files, VCF/BCF variant files, and FASTA/FASTQ sequences with a Pythonic interface to htslib. Query tabix-indexed files, perform pileup analysis for coverage, and execute samtools/bcftools commands.
When to Use This Skill
This skill should be used when:
- Working with sequencing alignment files (BAM/CRAM)
- Analyzing genetic variants (VCF/BCF)
- Extracting reference sequences or gene regions
- Processing raw sequencing data (FASTQ)
- Calculating coverage or read depth
- Implementing bioinformatics analysis pipelines
- Quality control of sequencing data
- Variant calling and annotation workflows
Quick Start
Installation
uv pip install pysam
Basic Examples
Read alignment file:
import pysam
# Open BAM file and fetch reads in region
samfile = pysam.AlignmentFile("example.bam", "rb")
for read in samfile.fetch("chr1", 1000, 2000):
    print(f"{read.query_name}: {read.reference_start}")
samfile.close()
Read variant file:
# Open VCF file and iterate variants
vcf = pysam.VariantFile("variants.vcf")
for variant in vcf:
    print(f"{variant.chrom}:{variant.pos} {variant.ref}>{variant.alts}")
vcf.close()
Query reference sequence:
# Open FASTA and extract sequence
fasta = pysam.FastaFile("reference.fasta")
sequence = fasta.fetch("chr1", 1000, 2000)
print(sequence)
fasta.close()
Core Capabilities
1. Alignment File Operations (SAM/BAM/CRAM)
Use the AlignmentFile class to work with aligned sequencing reads. This is appropriate for analyzing mapping results, calculating coverage, extracting reads, or quality control.
Common operations:
- Open and read BAM/SAM/CRAM files
- Fetch reads from specific genomic regions
- Filter reads by mapping quality, flags, or other criteria
- Write filtered or modified alignments
- Calculate coverage statistics
- Perform pileup analysis (base-by-base coverage)
- Access read sequences, quality scores, and alignment information
Reference: See references/alignment_files.md for detailed documentation on:
- Opening and reading alignment files
- AlignedSegment attributes and methods
- Region-based fetching with fetch()
- Pileup analysis for coverage
- Writing and creating BAM files
- Coordinate systems and indexing
- Performance optimization tips
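The filtering and writing operations listed above can be sketched roughly as follows. This is a minimal illustration, not the skill's reference implementation: the function names, the default threshold of 20, and the choice of which flags to exclude are assumptions made for the example. The pysam attributes used (is_unmapped, is_secondary, is_supplementary, is_duplicate, mapping_quality) and the template= argument are part of the AlignmentFile/AlignedSegment API.

```python
def passes_filters(read, min_mapq=20):
    """Return True for mapped, primary, non-duplicate reads at or above min_mapq.

    The threshold and the set of excluded flags are illustrative choices.
    """
    if read.is_unmapped or read.is_secondary or read.is_supplementary:
        return False
    if read.is_duplicate:
        return False
    return read.mapping_quality >= min_mapq


def write_filtered(in_path, out_path, contig, start, stop, min_mapq=20):
    """Copy reads overlapping [start, stop) that pass the filters; return the number kept."""
    import pysam  # imported lazily so passes_filters() can be used on its own

    kept = 0
    with pysam.AlignmentFile(in_path, "rb") as src, \
            pysam.AlignmentFile(out_path, "wb", template=src) as dst:
        for read in src.fetch(contig, start, stop):
            if passes_filters(read, min_mapq):
                dst.write(read)
                kept += 1
    return kept
```

A call such as write_filtered("example.bam", "filtered.bam", "chr1", 1000, 2000) would need an indexed input BAM; the output file name here is hypothetical.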
2. Variant File Operations (VCF/BCF)
Use the VariantFile class to work with genetic variants from variant calling pipelines. This is appropriate for variant analysis, filtering, annotation, or population genetics.
Common operations:
- Read and write VCF/BCF files
- Query variants in specific regions
- Access variant information (position, alleles, quality)
- Extract genotype data for samples
- Filter variants by quality, allele frequency, or other criteria
- Annotate variants with additional information
- Subset samples or regions
Reference: See references/variant_files.md for detailed documentation on:
- Opening and reading variant files
- VariantRecord attributes and methods
- Accessing INFO and FORMAT fields
- Working with genotypes and samples
- Creating and writing VCF files
- Filtering and subsetting variants
- Multi-sample VCF operations
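As a rough sketch of the filtering and genotype operations above, the helpers below work on any iterable of VariantRecord-like objects. The function names, the QUAL threshold of 30, and the rule "FILTER unset or PASS" are assumptions for illustration; rec.qual, rec.filter.keys(), and rec.samples[name]["GT"] are the pysam accessors being relied on.

```python
from collections import Counter


def passing_records(records, min_qual=30.0):
    """Yield records with QUAL >= min_qual whose FILTER is unset or PASS."""
    for rec in records:
        if rec.qual is None or rec.qual < min_qual:
            continue
        filters = list(rec.filter.keys())
        if filters and filters != ["PASS"]:
            continue
        yield rec


def genotype_counts(records, sample):
    """Count GT tuples such as (0, 1) for one sample across the given records."""
    return Counter(rec.samples[sample]["GT"] for rec in records)
```

With a real file this might be used as genotype_counts(passing_records(vcf), "sample1"), where vcf is an open pysam.VariantFile and "sample1" is a placeholder sample name.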
3. Sequence File Operations (FASTA/FASTQ)
Use FastaFile for random access to reference sequences and FastxFile for reading raw sequencing data. This is appropriate for extracting gene sequences, validating variants against reference, or processing raw reads.
Common operations:
- Query reference sequences by genomic coordinates
- Extract sequences for genes or regions of interest
- Read FASTQ files with quality scores
- Validate variant reference alleles
- Calculate sequence statistics
- Filter reads by quality or length
- Convert between FASTA and FASTQ formats
Reference: See references/sequence_files.md for detailed documentation on:
- FASTA file access and indexing
- Extracting sequences by region
- Handling reverse complement for genes
- Reading FASTQ files sequentially
- Quality score conversion and filtering
- Working with tabix-indexed files (BED, GTF, GFF)
- Common sequence processing patterns
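Two of the patterns above, reverse-complementing an extracted sequence and filtering FASTQ reads by quality and length, can be sketched in plain Python. The helper names and default thresholds are hypothetical, and Phred+33 encoding is assumed; the entries are expected to expose the sequence and quality attributes that pysam.FastxFile records provide.

```python
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Reverse-complement a DNA string; characters other than ACGTN are left unchanged."""
    return seq.translate(_COMPLEMENT)[::-1]


def mean_phred(quality_string, offset=33):
    """Mean Phred score of an ASCII quality string (Phred+33 assumed by default)."""
    if not quality_string:
        return 0.0
    return sum(ord(ch) - offset for ch in quality_string) / len(quality_string)


def filter_fastq(entries, min_mean_quality=20.0, min_length=50):
    """Yield entries that are long enough and have a high enough mean quality."""
    for entry in entries:
        if len(entry.sequence) < min_length:
            continue
        if mean_phred(entry.quality) >= min_mean_quality:
            yield entry
```

For a gene on the minus strand, reverse_complement(fasta.fetch("chr1", 1000, 2000)) would give the transcribed-strand sequence; filter_fastq can be fed directly from an open pysam.FastxFile.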
4. Integrated Bioinformatics Workflows
Pysam excels at integrating multiple file types for comprehensive genomic analyses. Common workflows combine alignment files, variant files, and reference sequences.
Common workflows:
- Calculate coverage statistics for specific regions
- Validate variants against aligned reads
- Annotate variants with coverage information
- Extract sequences around variant positions
- Filter alignments or variants based on multiple criteria
- Generate coverage tracks for visualization
- Quality control across multiple data types
Reference: See references/common_workflows.md for detailed examples of:
- Quality control workflows (BAM statistics, reference consistency)
- Coverage analysis (per-base coverage, low coverage detection)
- Variant analysis (annotation, filtering by read support)
- Sequence extraction (variant contexts, gene sequences)
- Read filtering and subsetting
- Integration patterns (BAM+VCF, VCF+BED, etc.)
- Performance optimization for complex workflows
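One small integration pattern, checking VCF reference alleles against the FASTA, might look like the sketch below. The function name is hypothetical and the comparison is deliberately simple (case-insensitive string equality); it assumes records expose chrom, pos, start, and ref as pysam.VariantRecord does, and that fasta.fetch() takes 0-based, half-open coordinates.

```python
def mismatched_ref_alleles(records, fasta):
    """Yield (chrom, pos, vcf_ref, fasta_ref) where the VCF REF disagrees with the FASTA."""
    for rec in records:
        start = rec.start              # 0-based start of the REF allele
        stop = start + len(rec.ref)    # exclusive end
        observed = fasta.fetch(rec.chrom, start, stop)
        if observed.upper() != rec.ref.upper():
            yield rec.chrom, rec.pos, rec.ref, observed
```

An empty result from list(mismatched_ref_alleles(vcf, fasta)) suggests the VCF and the reference belong to the same genome build; any hits are worth inspecting before further analysis.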
Key Concepts
Coordinate Systems
Critical: Pysam uses 0-based, half-open coordinates (Python convention):
- Start positions are 0-based (first base is position 0)
- End positions are exclusive (not included in the range)
- Region 1000-2000 includes bases 1000-1999 (1000 bases total)
Exception: Region strings passed to fetch() follow the samtools convention (1-based, inclusive):
samfile.fetch("chr1", 999, 2000) # 0-based, half-open: positions 999-1999
samfile.fetch(region="chr1:1000-2000") # 1-based string: positions 1000-2000 (the same bases)
VCF files: Use 1-based coordinates in the file format. VariantRecord.pos is 1-based, while VariantRecord.start is 0-based and VariantRecord.stop is the exclusive 0-based end.
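The conversion between the two conventions is simple arithmetic: subtract 1 from the start and leave the end unchanged. The helpers below are a sketch with hypothetical names; they are not part of pysam.

```python
def region_to_interval(region):
    """Convert a 1-based inclusive region string to a 0-based half-open interval."""
    contig, _, coords = region.rpartition(":")
    first, _, last = coords.replace(",", "").partition("-")
    if not contig or not first or not last:
        raise ValueError(f"expected 'contig:start-end', got {region!r}")
    return contig, int(first) - 1, int(last)


def interval_to_region(contig, start, stop):
    """Convert a 0-based half-open interval to a 1-based inclusive region string."""
    return f"{contig}:{start + 1}-{stop}"
```

For example, region_to_interval("chr1:1000-2000") gives ("chr1", 999, 2000), an interval of 2000 - 999 = 1001 bases, matching the 1001 positions in the inclusive range 1000-2000.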
Indexing Requirements
Random access to specific genomic regions requires index files:
- BAM files: Require a .bai index (create with pysam.index())
- CRAM files: Require a .crai index
- FASTA files: Require a .fai index (create with pysam.faidx())
- VCF.gz files: Require a .tbi tabix index (create with pysam.tabix_index())
- BCF files: Require a .csi index
Without an index, use fetch(until_eof=True) for sequential reading.
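A sketch of how code might cope with both cases: use the index when AlignmentFile.has_index() reports one, and otherwise scan the whole file with until_eof=True and apply the overlap test by hand. The function name is hypothetical, and the fallback is a linear scan, so it is only reasonable for small files or one-off queries.

```python
def reads_in_region(bam, contig, start, stop):
    """Yield reads overlapping the 0-based half-open interval [start, stop)."""
    if bam.has_index():
        yield from bam.fetch(contig, start, stop)
        return
    # No index: read the file sequentially and keep only overlapping reads.
    for read in bam.fetch(until_eof=True):
        if read.is_unmapped or read.reference_name != contig:
            continue
        end = read.reference_end  # exclusive end; None if there is no alignment
        if end is not None and read.reference_start < stop and end > start:
            yield read
```

Here bam is assumed to be an open pysam.AlignmentFile. For repeated queries, creating the index once with pysam.index() is usually the better choice.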
File Modes
Specify format when opening files:
- "rb" - Read BAM (binary)
- "r" - Read SAM (text)
- "rc" - Read CRAM
- "wb" - Write BAM
- "w" - Write SAM
- "wc" - Write CRAM
Performance Considerations
- Always use indexed files for random access operations
- Use pileup() for column-wise analysis instead of repeated fetch operations
- Use count() for counting instead of iterating and counting manually
- Process regions in parallel when analyzing independent genomic regions
- Close files explicitly to free resources
- Use until_eof=True for sequential processing without index
- Avoid multiple iterators unless necessary (use multiple_iterators=True if needed)
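As an illustration of preferring the built-in counters, the sketch below derives two summary numbers from AlignmentFile.count() and AlignmentFile.count_coverage() rather than looping over reads in Python. The function names are hypothetical. Note that count_coverage() returns four per-base arrays (A, C, G, T) and applies a base-quality threshold (15 by default), so the depth it implies can be lower than a raw read count suggests.

```python
def reads_per_kb(bam, contig, start, stop):
    """Number of reads overlapping [start, stop), scaled to reads per 1000 bases."""
    width = stop - start
    if width <= 0:
        return 0.0
    return bam.count(contig, start, stop) * 1000.0 / width


def mean_depth(bam, contig, start, stop, quality_threshold=15):
    """Mean per-base depth over [start, stop), summing the A, C, G and T counts."""
    width = stop - start
    if width <= 0:
        return 0.0
    per_base = bam.count_coverage(contig, start, stop,
                                  quality_threshold=quality_threshold)
    total = sum(sum(counts) for counts in per_base)
    return total / width
```

Both helpers assume bam is an open, indexed pysam.AlignmentFile.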
Common Pitfalls
- Coordinate confusion: Remember 0-based vs 1-based systems in different contexts
- Missing indices: Many operations require index files, so create them first
- Partial overlaps: fetch() returns reads overlapping region boundaries, not just those fully contained
- Iterator scope: Keep pileup iterator references alive to avoid "PileupProxy accessed after iterator finished" errors
- Quality score editing: Assigning query_sequence invalidates query_qualities, so take a copy of the qualities first and reassign them afterwards
- Stream limitations: Only stdin/stdout are supported for streaming, not arbitrary Python file objects
- Thread safety: While the GIL is released during I/O, comprehensive thread-safety hasn't been fully validated
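Two of these pitfalls have short workarounds, sketched below with hypothetical helper names. The first keeps only reads that lie entirely inside a region, since fetch() also returns reads that merely overlap it. The second saves the base qualities before assigning a new sequence and restores them afterwards; it assumes the replacement has the same length as the original.

```python
def fully_contained(reads, start, stop):
    """Yield reads whose alignment lies entirely within [start, stop)."""
    for read in reads:
        end = read.reference_end  # exclusive end; None for reads without an alignment
        if end is not None and read.reference_start >= start and end <= stop:
            yield read


def replace_sequence(read, new_sequence):
    """Assign a same-length sequence while keeping the original base qualities."""
    if len(new_sequence) != len(read.query_sequence):
        raise ValueError("replacement sequence must have the same length")
    saved = read.query_qualities       # save before the assignment below discards it
    read.query_sequence = new_sequence
    read.query_qualities = saved
    return read
```

For example, fully_contained(samfile.fetch("chr1", 1000, 2000), 1000, 2000) drops reads that extend past either boundary.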
Command-Line Tools
Pysam provides access to samtools and bcftools commands:
# Sort BAM file
pysam.samtools.sort("-o", "sorted.bam", "input.bam")
# Index BAM
pysam.samtools.index("sorted.bam")
# View specific region
pysam.samtools.view("-b", "-o", "region.bam", "input.bam", "chr1:1000-2000")
# BCF tools
pysam.bcftools.view("-O", "z", "-o", "output.vcf.gz", "input.vcf")
Error handling:
try:
    pysam.samtools.sort("-o", "output.bam", "input.bam")
except pysam.SamtoolsError as e:
    print(f"Error: {e}")
Resources
references/
Detailed documentation for each major capability:
- alignment_files.md - Complete guide to SAM/BAM/CRAM operations, including AlignmentFile class, AlignedSegment attributes, fetch operations, pileup analysis, and writing alignments
- variant_files.md - Complete guide to VCF/BCF operations, including VariantFile class, VariantRecord attributes, genotype handling, INFO/FORMAT fields, and multi-sample operations
- sequence_files.md - Complete guide to FASTA/FASTQ operations, including FastaFile and FastxFile classes, sequence extraction, quality score handling, and tabix-indexed file access
- common_workflows.md - Practical examples of integrated bioinformatics workflows combining multiple file types, including quality control, coverage analysis, variant validation, and sequence extraction
Getting Help
For detailed information on specific operations, refer to the appropriate reference document:
- Working with BAM files or calculating coverage → alignment_files.md
- Analyzing variants or genotypes → variant_files.md
- Extracting sequences or processing FASTQ → sequence_files.md
- Complex workflows integrating multiple file types → common_workflows.md
Official documentation: https://pysam.readthedocs.io/
