geniml

K-Dense-AI

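About

The geniml skill enables machine learning on genomic interval data from BED files, including training region embeddings and analyzing single-cell ATAC-seq data. It supports tasks such as building consensus peaks, learning representations of genomic features, and running similarity searches or clustering. Use this skill for machine-learning-based analysis of genomic regions, chromatin accessibility datasets, or scATAC-seq data.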

Quick Install

Claude Code

Default (recommended)
npx skills add K-Dense-AI/claude-scientific-skills -a claude-code
Plugin command (alternative)
/plugin add https://github.com/K-Dense-AI/claude-scientific-skills
Git clone (alternative)
git clone https://github.com/K-Dense-AI/claude-scientific-skills.git ~/.claude/skills/geniml

Copy and paste this command in Claude Code to install the skill.

Documentation

Geniml: Genomic Interval Machine Learning

Overview

Geniml is a Python package for building machine learning models on genomic interval data from BED files. It provides unsupervised methods for learning embeddings of genomic regions, single cells, and metadata labels, enabling similarity searches, clustering, and downstream ML tasks.

Installation

Install geniml using uv:

uv pip install geniml

For ML dependencies (PyTorch, etc.):

uv pip install 'geniml[ml]'

Development version from GitHub:

uv pip install git+https://github.com/databio/geniml.git

Core Capabilities

Geniml provides five primary capabilities, each detailed in dedicated reference files:

1. Region2Vec: Genomic Region Embeddings

Train unsupervised embeddings of genomic regions using word2vec-style learning.

Use for: Dimensionality reduction of BED files, region similarity analysis, feature vectors for downstream ML.

Workflow:

  1. Tokenize BED files using a universe reference
  2. Train Region2Vec model on tokens
  3. Generate embeddings for regions

Reference: See references/region2vec.md for detailed workflow, parameters, and examples.
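
To make step 1 of the workflow concrete, the sketch below shows the general idea of overlap-based ("hard") tokenization: each input region is replaced by the indices of the universe regions it overlaps. It is a plain-Python illustration, not geniml's tokenizer; the `overlaps` and `tokenize` helpers and the toy coordinates are hypothetical, and the real implementation may apply additional filtering.

```python
from typing import List, Tuple

# (chrom, start, end) with half-open coordinates, as in BED files
Region = Tuple[str, int, int]


def overlaps(a: Region, b: Region) -> bool:
    """Return True when two regions on the same chromosome share a base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def tokenize(regions: List[Region], universe: List[Region]) -> List[int]:
    """Map each region to the indices of the universe regions it overlaps.

    Hypothetical helper: a linear scan for clarity, not for speed.
    Regions that overlap nothing in the universe produce no token.
    """
    tokens: List[int] = []
    for region in regions:
        for idx, ref in enumerate(universe):
            if overlaps(region, ref):
                tokens.append(idx)
    return tokens


# Toy data (made up for illustration)
universe = [("chr1", 100, 200), ("chr1", 300, 400), ("chr2", 50, 150)]
sample = [("chr1", 150, 320), ("chr2", 10, 40)]

tokens = tokenize(sample, universe)
print(tokens)  # [0, 1]
```

The second sample region overlaps nothing and is silently dropped, which is why universe completeness matters for the quality of the resulting token sequences.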

2. BEDspace: Joint Region and Metadata Embeddings

Train shared embeddings for region sets and metadata labels using StarSpace.

Use for: Metadata-aware searches, cross-modal queries (region→label or label→region), joint analysis of genomic content and experimental conditions.

Workflow:

  1. Preprocess regions and metadata
  2. Train BEDspace model
  3. Compute distances
  4. Query across regions and labels

Reference: See references/bedspace.md for detailed workflow, search types, and examples.
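
As an illustration of a region→label query in a shared embedding space, the sketch below ranks label vectors by cosine similarity to a query region-set vector. It assumes the embeddings have already been trained and loaded; the `rank_labels` helper, the label names, and the 2-dimensional vectors are hypothetical, and the distance measure BEDspace actually uses may differ.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_labels(
    query: Sequence[float],
    label_vectors: Dict[str, Sequence[float]],
    top_n: int = 10,
) -> List[Tuple[str, float]]:
    """Return the top_n labels most similar to the query vector."""
    scored = [
        (label, cosine_similarity(query, vec))
        for label, vec in label_vectors.items()
    ]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_n]


# Toy shared space (values made up for illustration)
label_vectors = {
    "K562": [1.0, 0.0],
    "HepG2": [0.0, 1.0],
    "GM12878": [0.7, 0.7],
}
query_region_set = [0.9, 0.1]

ranking = rank_labels(query_region_set, label_vectors, top_n=2)
print([label for label, _ in ranking])  # ['K562', 'GM12878']
```

A label→region query works the same way with the roles swapped: the label vector becomes the query and the region-set vectors are ranked.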

3. scEmbed: Single-Cell Chromatin Accessibility Embeddings

Train Region2Vec models on single-cell ATAC-seq data for cell-level embeddings.

Use for: scATAC-seq clustering, cell-type annotation, dimensionality reduction of single cells, integration with scanpy workflows.

Workflow:

  1. Prepare AnnData with peak coordinates
  2. Pre-tokenize cells
  3. Train scEmbed model
  4. Generate cell embeddings
  5. Cluster and visualize with scanpy

Reference: See references/scembed.md for detailed workflow, parameters, and examples.
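
One simple way to turn region embeddings into a cell embedding (step 4) is to average the vectors of the regions that are accessible in that cell. The sketch below shows that pooling step in plain Python. Treat it as an assumption about the general approach rather than a description of geniml's internals; the `cell_embedding` helper and the toy vectors are hypothetical.

```python
from typing import Dict, List


def cell_embedding(
    token_ids: List[int],
    region_vectors: Dict[int, List[float]],
) -> List[float]:
    """Average the embeddings of a cell's tokens (mean pooling).

    Tokens without a trained vector are skipped.
    """
    vectors = [region_vectors[t] for t in token_ids if t in region_vectors]
    if not vectors:
        raise ValueError("cell has no tokens with a known embedding")
    dim = len(vectors[0])
    return [sum(vec[i] for vec in vectors) / len(vectors) for i in range(dim)]


# Toy region embeddings keyed by universe index (values made up)
region_vectors = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [1.0, 1.0]}

# Token 7 has no vector and is ignored
cell_tokens = [0, 2, 7]

embedding = cell_embedding(cell_tokens, region_vectors)
print(embedding)  # [1.0, 0.5]
```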

4. Consensus Peaks: Universe Building

Build reference peak sets (universes) from BED file collections using multiple statistical methods.

Use for: Creating tokenization references, standardizing regions across datasets, defining consensus features with statistical rigor.

Workflow:

  1. Combine BED files
  2. Generate coverage tracks
  3. Build universe using CC, CCF, ML, or HMM method

Methods:

  • CC (Coverage Cutoff): Simple threshold-based
  • CCF (Coverage Cutoff Flexible): Confidence intervals for boundaries
  • ML (Maximum Likelihood): Probabilistic modeling of positions
  • HMM (Hidden Markov Model): Complex state modeling

Reference: See references/consensus_peaks.md for method comparison, parameters, and examples.
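
The coverage cutoff (CC) idea can be sketched in a few lines: keep positions whose coverage reaches a threshold, merge nearby runs, and drop runs that are too short. The function below is a minimal illustration on a per-base coverage list for a single chromosome. Its `cutoff`, `merge`, and `filter_size` arguments are named after the CLI flags shown in this document, but their exact semantics in geniml (for example, inclusive versus exclusive comparisons) are assumptions here.

```python
from typing import List, Tuple


def coverage_cutoff(
    coverage: List[int],
    cutoff: int,
    merge: int,
    filter_size: int,
) -> List[Tuple[int, int]]:
    """Return half-open (start, end) intervals built from a coverage track."""
    # 1. Collect runs of positions with coverage >= cutoff
    runs: List[Tuple[int, int]] = []
    start = None
    for pos, depth in enumerate(coverage):
        if depth >= cutoff and start is None:
            start = pos
        elif depth < cutoff and start is not None:
            runs.append((start, pos))
            start = None
    if start is not None:
        runs.append((start, len(coverage)))

    # 2. Merge runs separated by a gap of at most `merge` bases
    merged: List[Tuple[int, int]] = []
    for run_start, run_end in runs:
        if merged and run_start - merged[-1][1] <= merge:
            merged[-1] = (merged[-1][0], run_end)
        else:
            merged.append((run_start, run_end))

    # 3. Drop intervals shorter than `filter_size`
    return [(s, e) for s, e in merged if e - s >= filter_size]


# Toy coverage track: number of BED files covering each base (made up)
coverage = [0, 5, 6, 5, 0, 5, 5, 0, 0, 0, 0, 7, 0]

peaks = coverage_cutoff(coverage, cutoff=5, merge=1, filter_size=2)
print(peaks)  # [(1, 7)]
```

The two runs at positions 1-3 and 5-6 are joined across the one-base gap, while the isolated single-base run at position 11 is filtered out.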

5. Utilities: Supporting Tools

Additional tools for caching, randomization, evaluation, and search.

Available utilities:

  • BBClient: BED file caching for repeated access
  • BEDshift: Randomization preserving genomic context
  • Evaluation: Metrics for embedding quality (silhouette, Davies-Bouldin, etc.)
  • Tokenization: Region tokenization utilities (hard, soft, universe-based)
  • Text2BedNN: Neural search backends for genomic queries

Reference: See references/utilities.md for detailed usage of each utility.
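
To illustrate what a randomized null model of a region set can look like, the sketch below shifts each region by a random offset while keeping its length and chromosome, and clamps the result to the chromosome bounds. This is only a toy version of the idea and not BEDshift's algorithm; the `shift_regions` helper, its parameters, and the chromosome size are hypothetical.

```python
import random
from typing import Dict, List, Tuple

Region = Tuple[str, int, int]  # (chrom, start, end), half-open


def shift_regions(
    regions: List[Region],
    chrom_sizes: Dict[str, int],
    max_shift: int,
    seed: int,
) -> List[Region]:
    """Shift each region by a random offset in [-max_shift, max_shift].

    Region length and chromosome are preserved. Assumes every region
    is no longer than its chromosome.
    """
    rng = random.Random(seed)  # fixed seed keeps the null model reproducible
    shifted: List[Region] = []
    for chrom, start, end in regions:
        length = end - start
        offset = rng.randint(-max_shift, max_shift)
        upper = chrom_sizes[chrom] - length
        new_start = max(0, min(start + offset, upper))
        shifted.append((chrom, new_start, new_start + length))
    return shifted


# Toy inputs (made up for illustration)
chrom_sizes = {"chr1": 1000}
peaks = [("chr1", 10, 60), ("chr1", 950, 1000)]

null_set = shift_regions(peaks, chrom_sizes, max_shift=100, seed=42)
print(null_set)
```

Generating many such sets with different seeds gives a background distribution against which an observed overlap statistic can be compared.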

Common Workflows

Basic Region Embedding Pipeline

from geniml.tokenization import hard_tokenization
from geniml.region2vec import region2vec
from geniml.evaluation import evaluate_embeddings

# Step 1: Tokenize BED files
hard_tokenization(
    src_folder='bed_files/',
    dst_folder='tokens/',
    universe_file='universe.bed',
    p_value_threshold=1e-9
)

# Step 2: Train Region2Vec
region2vec(
    token_folder='tokens/',
    save_dir='model/',
    num_shufflings=1000,
    embedding_dim=100
)

# Step 3: Evaluate
metrics = evaluate_embeddings(
    embeddings_file='model/embeddings.npy',
    labels_file='metadata.csv'
)

scATAC-seq Analysis Pipeline

import scanpy as sc
from geniml.scembed import ScEmbed
from geniml.io import tokenize_cells

# Step 1: Load data
adata = sc.read_h5ad('scatac_data.h5ad')

# Step 2: Tokenize cells
tokenize_cells(
    adata='scatac_data.h5ad',
    universe_file='universe.bed',
    output='tokens.parquet'
)

# Step 3: Train scEmbed
model = ScEmbed(embedding_dim=100)
model.train(dataset='tokens.parquet', epochs=100)

# Step 4: Generate embeddings
embeddings = model.encode(adata)
adata.obsm['scembed_X'] = embeddings

# Step 5: Cluster with scanpy
sc.pp.neighbors(adata, use_rep='scembed_X')
sc.tl.leiden(adata)
sc.tl.umap(adata)

Universe Building and Evaluation

# Generate coverage
cat bed_files/*.bed > combined.bed
uniwig -m 25 combined.bed chrom.sizes coverage/

# Build universe with coverage cutoff
geniml universe build cc \
  --coverage-folder coverage/ \
  --output-file universe.bed \
  --cutoff 5 \
  --merge 100 \
  --filter-size 50

# Evaluate universe quality
geniml universe evaluate \
  --universe universe.bed \
  --coverage-folder coverage/ \
  --bed-folder bed_files/

CLI Reference

Geniml provides command-line interfaces for major operations:

# Region2Vec training
geniml region2vec --token-folder tokens/ --save-dir model/ --num-shuffle 1000

# BEDspace preprocessing
geniml bedspace preprocess --input regions/ --metadata labels.csv --universe universe.bed

# BEDspace training
geniml bedspace train --input preprocessed.txt --output model/ --dim 100

# BEDspace search
geniml bedspace search -t r2l -d distances.pkl -q query.bed -n 10

# Universe building
geniml universe build cc --coverage-folder coverage/ --output-file universe.bed --cutoff 5

# BEDshift randomization
geniml bedshift --input peaks.bed --genome hg38 --preserve-chrom --iterations 100

When to Use Which Tool

Use Region2Vec when:

  • Working with bulk genomic data (ChIP-seq, ATAC-seq, etc.)
  • Need unsupervised embeddings without metadata
  • Comparing region sets across experiments
  • Building features for downstream supervised learning

Use BEDspace when:

  • Metadata labels available (cell types, tissues, conditions)
  • Need to query regions by metadata or vice versa
  • Want joint embedding space for regions and labels
  • Building searchable genomic databases

Use scEmbed when:

  • Analyzing single-cell ATAC-seq data
  • Clustering cells by chromatin accessibility
  • Annotating cell types from scATAC-seq
  • Integration with scanpy is desired

Use Universe Building when:

  • Need reference peak sets for tokenization
  • Combining multiple experiments into consensus
  • Want statistically rigorous region definitions
  • Building standard references for a project

Use Utilities when:

  • Need to cache remote BED files (BBClient)
  • Generating null models for statistics (BEDshift)
  • Evaluating embedding quality (Evaluation)
  • Building search interfaces (Text2BedNN)

Best Practices

General Guidelines

  • Universe quality is critical: Invest time in building comprehensive, well-constructed universes
  • Tokenization validation: Check coverage (>80% ideal) before training
  • Parameter tuning: Experiment with embedding dimensions, learning rates, and training epochs
  • Evaluation: Always validate embeddings with multiple metrics and visualizations
  • Documentation: Record parameters and random seeds for reproducibility
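
The tokenization validation guideline can be checked with a small script before training. The sketch below takes "coverage" to mean the fraction of input regions that overlap at least one universe region, which is an assumption about how the 80% figure is measured; the `tokenization_coverage` helper and the toy data are hypothetical.

```python
from typing import List, Tuple

Region = Tuple[str, int, int]  # (chrom, start, end), half-open


def overlaps(a: Region, b: Region) -> bool:
    """Return True when two regions on the same chromosome share a base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def tokenization_coverage(regions: List[Region], universe: List[Region]) -> float:
    """Fraction of input regions that overlap at least one universe region."""
    if not regions:
        return 0.0
    covered = sum(
        1 for region in regions if any(overlaps(region, ref) for ref in universe)
    )
    return covered / len(regions)


# Toy data (made up for illustration): 4 of 5 regions hit the universe
universe = [("chr1", 100, 200), ("chr1", 300, 400)]
regions = [
    ("chr1", 110, 120),
    ("chr1", 190, 210),
    ("chr1", 350, 360),
    ("chr1", 399, 450),
    ("chr1", 500, 600),
]

coverage = tokenization_coverage(regions, universe)
print(f"coverage = {coverage:.0%}")  # coverage = 80%
if coverage < 0.8:
    print("Warning: low coverage; revisit the universe before training")
```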

Performance Considerations

  • Pre-tokenization: For scEmbed, always pre-tokenize cells for faster training
  • Memory management: Large datasets may require batch processing or downsampling
  • Computational resources: ML/HMM universe methods are computationally intensive
  • Caching: Use BBClient to cache BED files and avoid repeated downloads

Integration Patterns

  • With scanpy: scEmbed embeddings integrate seamlessly as adata.obsm entries
  • With BEDbase: Use BBClient for accessing remote BED repositories
  • With Hugging Face: Export trained models for sharing and reproducibility
  • With R: Use reticulate for R integration (see utilities reference)

Related Projects

Geniml is part of the BEDbase ecosystem:

  • BEDbase: Unified platform for genomic regions
  • BEDboss: Processing pipeline for BED files
  • Gtars: Genomic tools and utilities
  • BBClient: Client for BEDbase repositories

Additional Resources

Troubleshooting

"Tokenization coverage too low":

  • Check universe quality and completeness
  • Adjust p-value threshold (try 1e-6 instead of 1e-9)
  • Ensure universe matches genome assembly

"Training not converging":

  • Adjust learning rate (try 0.01-0.05 range)
  • Increase training epochs
  • Check data quality and preprocessing

"Out of memory errors":

  • Reduce batch size for scEmbed
  • Process data in chunks
  • Use pre-tokenization for single-cell data
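
Processing data in chunks can be as simple as reading a BED file through a generator instead of loading it whole. The sketch below yields fixed-size batches of regions from any text handle. It is independent of geniml, and the `read_bed_in_chunks` helper is hypothetical.

```python
import io
from typing import Iterator, List, TextIO, Tuple

Region = Tuple[str, int, int]  # (chrom, start, end)


def read_bed_in_chunks(handle: TextIO, chunk_size: int) -> Iterator[List[Region]]:
    """Yield lists of at most chunk_size regions from a BED-formatted stream."""
    chunk: List[Region] = []
    for line in handle:
        line = line.strip()
        # Skip blank lines and common BED header lines
        if not line or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split()[:3]
        chunk.append((chrom, int(start), int(end)))
        if len(chunk) == chunk_size:
            yield chunk
            chunk = []
    if chunk:  # emit the final, possibly shorter, chunk
        yield chunk


# In-memory stand-in for a BED file (contents made up for illustration)
bed_text = (
    "# toy example\n"
    "chr1\t100\t200\n"
    "chr1\t300\t400\n"
    "chr2\t50\t150\n"
    "chr2\t500\t650\n"
    "chr3\t10\t90\n"
)

chunk_sizes = [len(c) for c in read_bed_in_chunks(io.StringIO(bed_text), 2)]
print(chunk_sizes)  # [2, 2, 1]
```

With a real file, replace the `io.StringIO` object with `open(path)` and handle each chunk before requesting the next, so only one chunk is held in memory at a time.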

"StarSpace not found" (BEDspace):

For detailed troubleshooting and method-specific issues, consult the appropriate reference file.

GitHub Repository

K-Dense-AI/claude-scientific-skills
Path: skills/geniml
