polars-bio
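About
polars-bio is a high-performance Python library for genomic interval operations and bioinformatics file I/O, built around Polars DataFrames. It provides core operations such as overlap, nearest, merge, and coverage for BED/VCF/BAM/GFF formats, and supports streaming and cloud-native storage. For large-scale genomic data processing, it can serve as a faster, more scalable alternative to bioframe.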
Quick Install
Claude Code
Copy and paste one of these commands into Claude Code to install the skill:
- Recommended: npx skills add K-Dense-AI/claude-scientific-skills -a claude-code
- /plugin add https://github.com/K-Dense-AI/claude-scientific-skills
- git clone https://github.com/K-Dense-AI/claude-scientific-skills.git ~/.claude/skills/polars-bio
Documentation
polars-bio
Overview
polars-bio is a high-performance Python library for genomic interval operations and bioinformatics file I/O, built on Polars, Apache Arrow, and Apache DataFusion. It provides a familiar DataFrame-centric API for interval arithmetic (overlap, nearest, merge, coverage, complement, subtract) and reading/writing common bioinformatics formats (BED, VCF, BAM, CRAM, GFF/GTF, FASTA, FASTQ).
Key value propositions:
- 6-38x faster than bioframe on real-world genomic benchmarks
- Streaming/out-of-core support for large genomes via DataFusion
- Cloud-native file I/O (S3, GCS, Azure) with predicate pushdown
- Two API styles: functional (pb.overlap(df1, df2)) and method-chaining (df1.lazy().pb.overlap(df2))
- SQL interface for genomic data via DataFusion SQL engine
When to Use This Skill
Use this skill when:
- Performing genomic interval operations (overlap, nearest, merge, coverage, complement, subtract)
- Reading/writing bioinformatics file formats (BED, VCF, BAM, CRAM, GFF/GTF, FASTA, FASTQ)
- Processing large genomic datasets that don't fit in memory (streaming mode)
- Running SQL queries on genomic data files
- Migrating from bioframe to a faster alternative
- Computing read depth/pileup from BAM/CRAM files
- Working with Polars DataFrames containing genomic intervals
Quick Start
Installation
Requires Python 3.11–3.14 (see PyPI).
uv pip install "polars-bio==0.31.0"
For pandas compatibility (pandas ≥3.0):
uv pip install "polars-bio[pandas]==0.31.0"
Basic Overlap Example
import polars as pl
import polars_bio as pb
# Create two interval DataFrames
df1 = pl.DataFrame({
    "chrom": ["chr1", "chr1", "chr1"],
    "start": [1, 5, 22],
    "end": [6, 9, 30],
})
df2 = pl.DataFrame({
    "chrom": ["chr1", "chr1"],
    "start": [3, 25],
    "end": [8, 28],
})
# Functional API (returns LazyFrame by default)
result = pb.overlap(df1, df2)
result_df = result.collect()
# Get a DataFrame directly
result_df = pb.overlap(df1, df2, output_type="polars.DataFrame")
# Method-chaining API (via .pb accessor on LazyFrame)
result = df1.lazy().pb.overlap(df2)
result_df = result.collect()
Reading a BED File
import polars_bio as pb
# Eager read (loads entire file)
df = pb.read_bed("regions.bed")
# Lazy scan (streaming, for large files)
lf = pb.scan_bed("regions.bed")
result = lf.collect()
Core Capabilities
1. Genomic Interval Operations
polars-bio provides 8 core interval operations for genomic range arithmetic. All operations accept Polars DataFrames with chrom, start, end columns (configurable). All operations return a LazyFrame by default (use output_type="polars.DataFrame" for eager results).
Operations:
- overlap / count_overlaps - Find or count overlapping intervals between two sets (overlap_output="left" returns df1-only hits since 0.30.0)
- nearest - Find nearest intervals (with configurable k, overlap, distance params)
- merge - Merge overlapping/bookended intervals within a set
- cluster - Assign cluster IDs to overlapping intervals
- coverage - Compute per-interval coverage counts (two-input operation)
- complement - Find gaps between intervals within a genome
- subtract - Remove portions of intervals that overlap another set
Example:
import polars_bio as pb
# Find overlapping intervals (returns LazyFrame)
result = pb.overlap(df1, df2, suffixes=("_1", "_2"))
# Count overlaps per interval
counts = pb.count_overlaps(df1, df2)
# Merge overlapping intervals
merged = pb.merge(df1)
# Find nearest intervals
nearest = pb.nearest(df1, df2)
# Collect any LazyFrame result to DataFrame
result_df = result.collect()
Reference: See references/interval_operations.md for detailed documentation on all operations, parameters, output schemas, and performance considerations.
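To make the merge semantics described above concrete, here is a minimal pure-Python sketch. It is not polars-bio's implementation: merge_intervals is a hypothetical helper, and it assumes half-open intervals, so two intervals are treated as bookended when one ends exactly where the next starts.

```python
from collections import defaultdict


def merge_intervals(intervals):
    """Merge overlapping or bookended half-open intervals per chromosome.

    `intervals` is an iterable of (chrom, start, end) tuples.
    Returns a sorted list of (chrom, start, end, n_merged) tuples.
    """
    by_chrom = defaultdict(list)
    for chrom, start, end in intervals:
        by_chrom[chrom].append((start, end))

    merged = []
    for chrom in sorted(by_chrom):
        cur_start, cur_end, count = None, None, 0
        for start, end in sorted(by_chrom[chrom]):
            if cur_start is not None and start <= cur_end:
                # Overlapping or bookended: extend the current run.
                cur_end = max(cur_end, end)
                count += 1
            else:
                # Gap found: flush the current run and start a new one.
                if cur_start is not None:
                    merged.append((chrom, cur_start, cur_end, count))
                cur_start, cur_end, count = start, end, 1
        if cur_start is not None:
            merged.append((chrom, cur_start, cur_end, count))
    return merged


example = [("chr1", 1, 6), ("chr1", 5, 9), ("chr1", 22, 30), ("chr2", 3, 8)]
merged_example = merge_intervals(example)
```

The sketch sorts once per chromosome and then makes a single pass. It is only meant to show what "merge" means for a small in-memory input, not how the library executes the operation.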
2. Bioinformatics File I/O
Read and write common bioinformatics formats with read_*, scan_*, write_*, and sink_* functions. Supports cloud storage (S3, GCS, Azure) and compression (GZIP, BGZF).
Supported formats:
- BED - Genomic intervals (read_bed, scan_bed, write_* via generic)
- VCF - Genetic variants (read_vcf, scan_vcf, write_vcf, sink_vcf)
- VCF Zarr - Analysis-ready Zarr stores (read_vcf_zarr, scan_vcf_zarr; local directory paths)
- BAM - Aligned reads (read_bam, scan_bam, write_bam, sink_bam)
- CRAM - Compressed alignments (read_cram, scan_cram, write_cram, sink_cram)
- GFF - Gene annotations (read_gff, scan_gff)
- GTF - Gene annotations (read_gtf, scan_gtf)
- FASTA - Reference sequences (read_fasta, scan_fasta, write_fasta, sink_fasta)
- FASTQ - Sequencing reads (read_fastq, scan_fastq, write_fastq, sink_fastq)
- SAM - Text alignments (read_sam, scan_sam, write_sam, sink_sam)
- Hi-C pairs - Chromatin contacts (read_pairs, scan_pairs)
Example:
import polars_bio as pb
# Read VCF file
variants = pb.read_vcf("samples.vcf.gz")
# Lazy scan BAM file (streaming)
alignments = pb.scan_bam("aligned.bam")
# Read GFF annotations
genes = pb.read_gff("annotations.gff3")
# Cloud storage (individual params, not a dict)
df = pb.read_bed("s3://bucket/regions.bed", allow_anonymous=True)
Reference: See references/file_io.md for per-format column schemas, parameters, cloud storage options, and compression support.
3. SQL Data Processing
Register bioinformatics files as tables and query them using DataFusion SQL. This combines standard SQL with polars-bio's genomic-aware readers.
import polars as pl
import polars_bio as pb
# Register files as SQL tables (path first, name= keyword)
pb.register_vcf("samples.vcf.gz", name="variants")
pb.register_bed("target_regions.bed", name="regions")
# Query with SQL (returns LazyFrame)
result = pb.sql("SELECT chrom, start, end, ref, alt FROM variants WHERE qual > 30")
result_df = result.collect()
# Register a Polars DataFrame as a SQL table
pb.from_polars("my_intervals", df)
result = pb.sql("SELECT * FROM my_intervals WHERE chrom = 'chr1'").collect()
Reference: See references/sql_processing.md for register functions, SQL syntax, and examples.
4. Pileup Operations
Compute per-base read depth from BAM/CRAM files with CIGAR-aware depth calculation.
import polars_bio as pb
# Compute depth across a BAM file
depth_lf = pb.depth("aligned.bam")
depth_df = depth_lf.collect()
# With quality filter
depth_lf = pb.depth("aligned.bam", min_mapping_quality=20)
Reference: See references/pileup_operations.md for parameters and integration patterns.
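As a conceptual illustration of what a depth calculation produces, the following pure-Python sketch turns aligned blocks into run-length depth blocks with a sweep over start/end events. It is not polars-bio's pileup code: depth_blocks is a hypothetical helper, it does not parse CIGAR strings, and it assumes the aligned blocks on one contig have already been extracted as half-open (start, end) pairs.

```python
def depth_blocks(aligned_blocks):
    """Return (start, end, depth) runs for half-open aligned blocks on one contig."""
    # +1 at each block start, -1 at each block end.
    events = {}
    for start, end in aligned_blocks:
        events[start] = events.get(start, 0) + 1
        events[end] = events.get(end, 0) - 1

    blocks = []
    depth = 0
    prev = None
    for pos in sorted(events):
        delta = events[pos]
        if delta == 0:
            # One block ends exactly where another starts: depth is unchanged.
            continue
        if prev is not None and depth > 0:
            blocks.append((prev, pos, depth))
        depth += delta
        prev = pos
    return blocks


reads = [(0, 10), (5, 15), (20, 25)]
depth_example = depth_blocks(reads)
```

Regions with zero depth are omitted from the output, so the gap between positions 15 and 20 in the example produces no row.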
Key Concepts
Coordinate Systems
polars-bio defaults to 1-based coordinates (genomic convention). This can be changed globally:
import polars_bio as pb
# Switch to 0-based half-open coordinates (default is 1-based / False)
pb.set_option("datafusion.bio.coordinate_system_zero_based", True)
# Switch back to 1-based (default)
pb.set_option("datafusion.bio.coordinate_system_zero_based", False)
I/O functions also accept use_zero_based to set coordinate metadata on the resulting DataFrame:
# Read BED with explicit 0-based metadata
df = pb.read_bed("regions.bed", use_zero_based=True)
Important: BED files are always 0-based half-open in the file format. polars-bio handles the conversion automatically when reading BED files. Coordinate metadata is attached to DataFrames by I/O functions and propagated through operations.
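The relationship between the two conventions can be sketched in a few lines of plain Python. The helper names below are hypothetical and are not part of polars-bio. They only show the arithmetic: a 0-based half-open interval [start, end) covers the same bases as the 1-based closed interval [start + 1, end].

```python
def bed_to_one_based(start, end):
    """Convert a 0-based half-open interval [start, end) to 1-based closed."""
    if start < 0 or end < start:
        raise ValueError(f"invalid 0-based half-open interval: {start}, {end}")
    return start + 1, end


def one_based_to_bed(start, end):
    """Convert a 1-based closed interval [start, end] to 0-based half-open."""
    if start < 1 or end < start - 1:
        raise ValueError(f"invalid 1-based closed interval: {start}, {end}")
    return start - 1, end


def length_zero_based(start, end):
    """Number of bases covered by a 0-based half-open interval."""
    return end - start


def length_one_based(start, end):
    """Number of bases covered by a 1-based closed interval."""
    return end - start + 1


# The first 100 bases of a chromosome in both conventions.
one_based = bed_to_one_based(0, 100)
```

Only the start coordinate shifts; the end value is numerically the same in both systems. This is why mixing conventions tends to produce off-by-one errors rather than obviously wrong results.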
Two API Styles
Functional API - standalone functions, explicit inputs:
result = pb.overlap(df1, df2, suffixes=("_1", "_2"))
merged = pb.merge(df)
Method-chaining API - via .pb accessor on LazyFrames (not DataFrames):
result = df1.lazy().pb.overlap(df2)
merged = df.lazy().pb.merge()
Important: The .pb accessor for interval operations is only available on LazyFrame. On DataFrame, .pb provides write operations only (write_bam, write_vcf, etc.).
Method-chaining enables fluent pipelines:
# Chain interval operations (note: overlap outputs suffixed columns,
# so rename before merge which expects chrom/start/end)
result = (
    df1.lazy()
    .pb.overlap(df2)
    .filter(pl.col("start_2") > 1000)
    .select(
        pl.col("chrom_1").alias("chrom"),
        pl.col("start_1").alias("start"),
        pl.col("end_1").alias("end"),
    )
    .pb.merge()
    .collect()
)
Probe-Build Architecture
For two-input operations (overlap, nearest, count_overlaps, coverage), polars-bio uses a probe-build join strategy:
- The first DataFrame is the probe (iterated over)
- The second DataFrame is the build (indexed for lookup)
For best performance, pass the larger DataFrame as the first argument (probe) and the smaller one as the second (build).
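The probe/build split can be illustrated with a small pure-Python sketch. This is a conceptual model only and makes no claim about the data structures polars-bio uses internally: build_index and probe_overlaps are hypothetical helpers, the index is a sorted list per chromosome, and intervals are treated as 1-based closed, so they overlap when each starts no later than the other ends.

```python
from bisect import bisect_right
from collections import defaultdict


def build_index(build_intervals):
    """Index the (smaller) build side: per-chromosome lists sorted by start."""
    index = defaultdict(list)
    for chrom, start, end in build_intervals:
        index[chrom].append((start, end))
    for chrom in index:
        index[chrom].sort()
    return index


def probe_overlaps(probe_intervals, index):
    """Stream over the (larger) probe side and look up hits in the index."""
    hits = []
    for chrom, p_start, p_end in probe_intervals:
        candidates = index.get(chrom, [])
        # Only build intervals that start at or before p_end can overlap.
        upper = bisect_right(candidates, (p_end, float("inf")))
        for b_start, b_end in candidates[:upper]:
            if b_end >= p_start:
                hits.append(((chrom, p_start, p_end), (chrom, b_start, b_end)))
    return hits


probe = [("chr1", 1, 6), ("chr1", 5, 9), ("chr1", 22, 30)]
build = [("chr1", 3, 8), ("chr1", 25, 28)]
overlap_pairs = probe_overlaps(probe, build_index(build))
```

The build side is held in memory as an index while the probe side is only iterated once. This is the intuition behind indexing the smaller input and streaming the larger one.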
Column Conventions
By default, polars-bio expects columns named chrom, start, end. Custom column names can be specified via lists:
result = pb.overlap(
    df1, df2,
    cols1=["chromosome", "begin", "finish"],
    cols2=["chr", "pos_start", "pos_end"],
)
Return Types and Collecting Results
All interval operations and pb.sql() return a LazyFrame by default. Use .collect() to materialize results, or pass output_type="polars.DataFrame" for eager evaluation:
# Lazy (default) - collect when needed
result_lf = pb.overlap(df1, df2)
result_df = result_lf.collect()
# Eager - get DataFrame directly
result_df = pb.overlap(df1, df2, output_type="polars.DataFrame")
Streaming and Out-of-Core Processing
For datasets larger than available RAM, use scan_* functions and streaming execution:
# Scan files lazily
lf = pb.scan_bed("large_intervals.bed")
# Process with Polars streaming (requires polars ≥1.37, bundled with polars-bio)
result = lf.collect(engine="streaming")
DataFusion streaming is enabled by default for interval operations, processing data in batches without loading the full dataset into memory.
Common Pitfalls
- .pb accessor on DataFrame vs LazyFrame: Interval operations (overlap, merge, etc.) are only on LazyFrame.pb. DataFrame.pb only has write methods. Use .lazy() to convert before chaining interval ops.
- LazyFrame returns: All interval operations and pb.sql() return LazyFrame by default. Don't forget .collect() or use output_type="polars.DataFrame".
- Column name mismatches: polars-bio expects chrom, start, end by default. Use cols1/cols2 parameters (as lists) if your columns have different names.
- Coordinate system metadata: Interval operations read coordinate metadata from I/O functions or DataFrame config_meta. For manually built DataFrames, set df.config_meta.set(coordinate_system_zero_based=True) (0-based) or False (1-based). If metadata is missing, polars-bio falls back to the global datafusion.bio.coordinate_system_zero_based setting (with a warning). Set pb.set_option("datafusion.bio.coordinate_system_check", True) to raise MissingCoordinateSystemError instead. Mismatched systems between inputs raise CoordinateSystemMismatchError.
- Probe-build order matters: For overlap, nearest, and coverage, the first DataFrame is probed against the second. Swapping arguments changes which intervals appear in the left vs right output columns, and can affect performance.
- INT32 position limit: Genomic positions are stored as 32-bit integers, limiting coordinates to ~2.1 billion. This is sufficient for all known genomes but may be an issue with custom coordinate spaces.
- BAM index requirements: read_bam and scan_bam require a .bai index file alongside the BAM. Create one with samtools index if missing.
- Parallel execution disabled by default: DataFusion parallelism defaults to 1 partition. Enable for large datasets:
  pb.set_option("datafusion.execution.target_partitions", 8)
- CRAM has separate functions: Use read_cram/scan_cram/register_cram for CRAM files (not read_bam). CRAM functions require a reference_path parameter.
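For the INT32 position limit listed above, a quick pre-flight check on custom coordinate spaces can be sketched in plain Python. The helper below is hypothetical and not a polars-bio function. It only tests whether positions fit in a signed 32-bit integer, whose maximum is 2**31 - 1.

```python
INT32_MAX = 2**31 - 1  # 2,147,483,647


def out_of_range_positions(intervals):
    """Return the (chrom, start, end) tuples whose positions do not fit in int32."""
    return [
        (chrom, start, end)
        for chrom, start, end in intervals
        if not (0 <= start <= INT32_MAX and 0 <= end <= INT32_MAX)
    ]


# A concatenated "super-contig" coordinate space can exceed the limit.
custom = [("chr1", 1, 248_956_422), ("concat", 2_000_000_000, 3_100_000_000)]
too_large = out_of_range_positions(custom)
```

Running a check like this before loading data makes the limit visible up front instead of surfacing later as an overflow or cast error.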
Best Practices
- Use scan_* for large files: Prefer scan_bed, scan_vcf, etc. over read_* for files larger than available RAM. Scan functions enable streaming and predicate pushdown.
- Configure parallelism for large datasets:
  import os
  pb.set_option("datafusion.execution.target_partitions", os.cpu_count())
- Use BGZF compression: BGZF-compressed files (.bed.gz, .vcf.gz) support parallel block decompression, significantly faster than plain GZIP.
- Select columns early: When only specific columns are needed, select them early to reduce memory usage:
  df = pb.read_vcf("large.vcf.gz").select("chrom", "start", "end", "ref", "alt")
- Use cloud paths directly: Pass S3/GCS/Azure URIs directly to read/scan/register functions instead of downloading files first. Authenticated access uses your cloud SDK credentials (AWS_ACCESS_KEY_ID/AWS_SECRET_ACCESS_KEY, GOOGLE_APPLICATION_CREDENTIALS, Azure defaults) only when those cloud paths are accessed:
  df = pb.read_bed("s3://my-bucket/regions.bed", allow_anonymous=True)
- Prefer functional API for single operations, method-chaining for pipelines: Use pb.overlap() for one-off operations and .lazy().pb.overlap() when building multi-step pipelines.
Resources
references/
Detailed documentation for each major capability:
- interval_operations.md - All 8 interval operations with parameters, examples, output schemas, and performance tips. Core reference for genomic range arithmetic.
- file_io.md - Supported formats table, per-format column schemas, cloud storage configuration, compression support, and common parameters.
- sql_processing.md - Register functions, DataFusion SQL syntax, combining SQL with interval operations, and example queries.
- pileup_operations.md - Per-base read depth computation from BAM/CRAM files, parameters, and integration with interval operations.
- configuration.md - Global settings (parallelism, coordinate systems, streaming modes), logging, and metadata management.
- bioframe_migration.md - Operation mapping table, API differences, performance comparison, migration code examples, and pandas compatibility mode.
