bio-read-qc-contamination-screening
About
This skill screens FASTQ files for contamination using FastQ Screen, which aligns reads against multiple reference genomes. It detects cross-species contamination, bacterial/viral sequences, adapters, and sample swaps. Use it when you suspect contamination or work with samples prone to microbial contamination.
Quick Install
Claude Code
Recommended (paste into Claude Code):
/plugin add https://github.com/majiayu000/claude-skill-registry
Manual install:
git clone https://github.com/majiayu000/claude-skill-registry.git ~/.claude/skills/bio-read-qc-contamination-screening
Documentation
Contamination Screening
Screen FASTQ files against multiple genomes to identify contamination sources using FastQ Screen.
FastQ Screen Overview
FastQ Screen aligns a subset of reads against multiple reference genomes to identify:
- Cross-species contamination
- Bacterial/viral contamination
- Adapter sequences
- PhiX spike-in
- Sample swaps
Basic Usage
# Screen against configured genomes
fastq_screen sample.fastq.gz
# Multiple files
fastq_screen *.fastq.gz
# Specify output directory
fastq_screen --outdir qc_results/ sample.fastq.gz
# Custom config file
fastq_screen --conf my_screen.conf sample.fastq.gz
Configuration File
Create fastq_screen.conf:
# Database locations
DATABASE Human /path/to/human/genome
DATABASE Mouse /path/to/mouse/genome
DATABASE Ecoli /path/to/ecoli/genome
DATABASE PhiX /path/to/phix/genome
DATABASE Adapters /path/to/adapters
DATABASE rRNA /path/to/rrna
# Aligner (bowtie2 recommended)
BOWTIE2 /path/to/bowtie2
# Or use BWA
# BWA /path/to/bwa
# Threads
THREADS 8
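Each DATABASE path in the config is a Bowtie2 index basename rather than a directory, so a typo there only shows up when the run fails. The sketch below checks a config before screening. It assumes Bowtie2 indices (files named `<basename>.1.bt2`, or `<basename>.1.bt2l` for large genomes); `check_screen_conf` is a hypothetical helper name, not part of FastQ Screen.

```shell
# check_screen_conf: report DATABASE entries whose Bowtie2 index is missing.
# Hypothetical helper, not part of FastQ Screen.
# Assumption: indices were built with bowtie2-build, so <basename>.1.bt2
# (or <basename>.1.bt2l for large genomes) must exist.
# Usage: check_screen_conf <config_file>
check_screen_conf() {
    # Commented-out lines are ignored because their first field is not "DATABASE"
    awk '$1 == "DATABASE" { print $2, $3 }' "$1" |
    while read -r name base; do
        if [ ! -e "$base.1.bt2" ] && [ ! -e "$base.1.bt2l" ]; then
            echo "missing index for $name: $base"
        fi
    done
}
```

Running `check_screen_conf fastq_screen.conf` prints nothing when every index is found.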
Pre-built Databases
# Download common screening databases
fastq_screen --get_genomes
# Downloads into a FastQ_Screen_Genomes/ folder in the current directory (or in --outdir), together with a ready-to-use fastq_screen.conf
# Includes: Human, Mouse, Rat, E.coli, PhiX, Adapters, etc.
Screening Options
# Number of reads to sample (default 100000)
fastq_screen --subset 200000 sample.fastq.gz
# Use all reads (slow)
fastq_screen --subset 0 sample.fastq.gz
# Set threads
fastq_screen --threads 8 sample.fastq.gz
# Paired-end data: screening R1 alone is usually sufficient
fastq_screen sample_R1.fastq.gz
# Screen both mates together (--paired is not accepted by every release; confirm with fastq_screen --help)
fastq_screen --paired sample_R1.fastq.gz sample_R2.fastq.gz
Output Options
# Generate PNG plot (default)
fastq_screen sample.fastq.gz
# No plot (text only)
fastq_screen --nograph sample.fastq.gz
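# Tag each read header with the genomes it mapped to
fastq_screen --tag sample.fastq.gz
# Filter reads by mapping status; --filter reads those tags, so combine it with --tag (or run it on a tagged file)
# Keep reads unmapped to all genomes (one character per configured genome, four here)
fastq_screen --tag --filter 0000 sample.fastq.gz
# Keep reads mapping uniquely to the first genome (e.g., Human), ignoring the rest
fastq_screen --tag --filter 1--- sample.fastq.gz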
Filter Codes
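Use --filter to select reads based on mapping status. The filter string takes one character per DATABASE entry, in the order the entries appear in the config file, and it reads the tags written by --tag, so run it together with --tag or on an already tagged file: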
| Code | Meaning |
|---|---|
| 0 | Did not map to genome |
| 1 | Mapped uniquely |
| 2 | Mapped more than once |
| 3 | Mapped (unique or multi) |
| - | Ignore this genome |
# Example: Keep reads mapping only to Human (first genome)
# Human:1, all others:0
fastq_screen --filter 10000 sample.fastq.gz
# Keep reads NOT mapping to anything (clean reads)
fastq_screen --filter 00000 sample.fastq.gz
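Because the filter string is positional, it is easy to miscount characters when the config changes. The sketch below derives the string from the config instead of typing it by hand. `make_filter` is a hypothetical helper name, and the defaults (`3` for the chosen genome, `-` for the rest) are an illustrative choice, not a FastQ Screen convention.

```shell
# make_filter: build a --filter string with one character per DATABASE
# line of a FastQ Screen config, in file order.
# Hypothetical helper, not part of FastQ Screen.
# Usage: make_filter <config> <genome_name> [code_for_genome] [code_for_others]
# Defaults: 3 (mapped, unique or multi) for the genome, - (ignore) for the rest.
# Returns non-zero if the genome name is not found in the config.
make_filter() {
    awk -v target="$2" -v hit="${3:-3}" -v other="${4:--}" '
        $1 == "DATABASE" {
            if ($2 == target) { s = s hit; seen = 1 }
            else              { s = s other }
        }
        END { if (!seen) exit 1; print s }
    ' "$1"
}
```

A possible use is `fastq_screen --tag --filter "$(make_filter fastq_screen.conf Human)" sample.fastq.gz`; passing `1 0` as the last two arguments gives the stricter "unique to this genome, unmapped elsewhere" pattern.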
Output Files
| File | Description |
|---|---|
| *_screen.txt | Tab-delimited results |
| *_screen.png | Visualization |
| *_screen.html | HTML report |
Results Format
#Fastq_screen version: 0.15.3
Genome #Reads_processed #Unmapped %Unmapped #One_hit_one_genome %One_hit_one_genome #Multiple_hits_one_genome %Multiple_hits_one_genome #One_hit_multiple_genomes %One_hit_multiple_genomes Multiple_hits_multiple_genomes %Multiple_hits_multiple_genomes
Human 100000 2000 2.00 95000 95.00 1000 1.00 1500 1.50 500 0.50
Mouse 100000 98000 98.00 100 0.10 50 0.05 1500 1.50 350 0.35
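In each row the five count columns partition the sampled reads, so 2000 + 95000 + 1000 + 1500 + 500 = 100000 for Human above. Two figures are usually the most informative: the total mapped fraction (100 minus %Unmapped) and the fraction that hit only that genome (%One_hit_one_genome plus %Multiple_hits_one_genome). The sketch below extracts both, assuming the 12-column layout shown above; `screen_summary` is a hypothetical helper name.

```shell
# screen_summary: print genome, % mapped, and % genome-specific hits for
# each row of a *_screen.txt report.
# Hypothetical helper, not part of FastQ Screen.
# Assumes the column layout shown above:
#   $4 = %Unmapped, $6 = %One_hit_one_genome, $8 = %Multiple_hits_one_genome
# Usage: screen_summary <sample_screen.txt>
screen_summary() {
    awk '
        # Skip the version line, the header, any trailing % summary line, and blanks
        /^[#%]/ || $1 == "Genome" || NF < 8 { next }
        { printf "%s\t%.2f\t%.2f\n", $1, 100 - $4, $6 + $8 }
    ' "$1"
}
```

For the Mouse row above this gives 2.00% mapped but only 0.15% genome-specific, which shows that most of those reads also align to another genome.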
Interpreting Results
Expected Results by Sample Type
| Sample Type | Expected Pattern |
|---|---|
| Human sample | >90% Human, <1% others |
| Mouse sample | >90% Mouse, <1% others |
| Human + PhiX | >80% Human, ~10% PhiX |
| Contaminated | Significant % to unexpected genome |
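The "<1% others" rule of thumb can be checked automatically. The sketch below lists every genome other than the expected one whose genome-specific hits exceed a threshold. It counts only reads that hit a single genome (%One_hit_one_genome plus %Multiple_hits_one_genome), on the assumption that reads shared between genomes are conserved sequence rather than contamination. `flag_unexpected` and the 1% default are illustrative choices, not FastQ Screen features.

```shell
# flag_unexpected: list genomes, other than the expected one, whose
# genome-specific hit percentage exceeds a threshold.
# Hypothetical helper, not part of FastQ Screen.
# Assumes $6 = %One_hit_one_genome and $8 = %Multiple_hits_one_genome.
# Usage: flag_unexpected <expected_genome> <sample_screen.txt> [threshold_percent]
flag_unexpected() {
    awk -v expected="$1" -v limit="${3:-1}" '
        # Skip the version line, the header, any trailing % summary line, and blanks
        /^[#%]/ || $1 == "Genome" || NF < 8 { next }
        $1 != expected && ($6 + $8) > (limit + 0) {
            printf "%s\t%.2f\n", $1, $6 + $8
        }
    ' "$2"
}
```

An empty result means no unexpected genome crossed the threshold; adjust the threshold for sample types where a spike-in such as PhiX is expected.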
Common Issues
| Pattern | Likely Cause |
|---|---|
| High adapter % | Library prep issue |
| High PhiX % | Spike-in not removed |
| High E.coli % | Bacterial contamination |
| High rRNA % | rRNA depletion failed |
| Multiple species | Sample swap or contamination |
MultiQC Integration
FastQ Screen results are automatically detected by MultiQC:
# Screen all samples
for f in *.fastq.gz; do
    fastq_screen --outdir screen_results/ "$f"
done
# Aggregate with MultiQC
multiqc screen_results/
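When the loop above is re-run after an interruption, it can help to skip samples that already have a report. The sketch below lists only the FASTQ files still to be screened. It assumes the report is named `<sample>_screen.txt` after the `.fastq`/`.fq` extension (and any `.gz`) is stripped, which matches the Output Files table; `pending_screens` is a hypothetical helper name.

```shell
# pending_screens: print FASTQ files that have no *_screen.txt report in
# the output directory yet.
# Hypothetical helper, not part of FastQ Screen.
# Assumption: reports are named <sample>_screen.txt, where <sample> is the
# file name without .fastq/.fq and without .gz.
# Usage: pending_screens <outdir> <fastq>...
pending_screens() {
    outdir=$1
    shift
    for f in "$@"; do
        base=$(basename "$f")
        base=${base%.gz}       # drop compression suffix
        base=${base%.fastq}    # drop .fastq if present
        base=${base%.fq}       # drop .fq if present
        [ -e "$outdir/${base}_screen.txt" ] || printf '%s\n' "$f"
    done
}
```

One way to use it is `pending_screens screen_results/ *.fastq.gz | while read -r f; do fastq_screen --outdir screen_results/ "$f"; done`.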
Custom Database Setup
Create Bowtie2 Index
# Index a FASTA file
bowtie2-build reference.fa reference
# Add to config
# DATABASE MyGenome /path/to/reference
Common Databases to Include
| Genome | Purpose |
|---|---|
| Human (GRCh38) | Human samples |
| Mouse (GRCm39) | Mouse samples |
| E. coli | Bacterial contamination |
| PhiX | Illumina spike-in |
| Adapters | Library prep |
| rRNA | Ribosomal RNA |
| Vectors | Cloning vectors |
| Mycoplasma | Cell culture contamination |
Example Workflows
Standard Screening
# Download databases
fastq_screen --get_genomes
# Screen samples
fastq_screen --outdir screen_results/ --threads 8 *.fastq.gz
# Check results
multiqc screen_results/
Remove Contamination
# Screen and tag reads
fastq_screen --tag sample.fastq.gz
# Or tag and filter in one step: keep reads that map to Human (the first of six DATABASE entries), ignoring the other five
fastq_screen --filter 3----- --tag sample.fastq.gz
# Or use BBDuk for removal
bbduk.sh in=sample.fastq.gz out=clean.fastq.gz \
    ref=contaminants.fa k=31 hdist=1
Related Skills
- quality-reports - FastQC shows overrepresented sequences
- adapter-trimming - Remove adapter contamination
- metagenomics - Deeper taxonomic analysis
