phylogenetics
Acerca de
Esta habilidad permite la construcción y análisis de árboles filogenéticos mediante un flujo de trabajo estándar de bioinformática utilizando MAFFT, IQ-TREE 2 y FastTree. Proporciona herramientas para alineamiento de secuencias, inferencia de árboles de máxima verosimilitud y visualización con ETE3 o FigTree. Úsela para estudios evolutivos, genómica microbiana, filodinámica viral y análisis de familias de proteínas.
Instalación rápida
Claude Code
Recomendadonpx skills add K-Dense-AI/claude-scientific-skills -a claude-code/plugin add https://github.com/K-Dense-AI/claude-scientific-skillsgit clone https://github.com/K-Dense-AI/claude-scientific-skills.git ~/.claude/skills/phylogeneticsCopia y pega este comando en Claude Code para instalar esta habilidad
Documentación
Phylogenetics
Overview
Phylogenetic analysis reconstructs the evolutionary history of biological sequences (genes, proteins, genomes) by inferring the branching pattern of descent. This skill covers the standard pipeline:
- MAFFT — Multiple sequence alignment
- IQ-TREE 2 — Maximum likelihood tree inference with model selection
- FastTree — Fast approximate maximum likelihood (for large datasets)
- ETE3 — Python library for tree manipulation and visualization
Installation:
# Conda (recommended for CLI tools)
conda install -c bioconda mafft iqtree fasttree
pip install ete3
When to Use This Skill
Use phylogenetics when:
- Evolutionary relationships: Which organism/gene is most closely related to my sequence?
- Viral phylodynamics: Trace outbreak spread and estimate transmission dates
- Protein family analysis: Infer evolutionary relationships within a gene family
- Horizontal gene transfer detection: Identify genes with discordant species/gene trees
- Ancestral sequence reconstruction: Infer ancestral protein sequences
- Molecular clock analysis: Estimate divergence dates using temporal sampling
- GWAS companion: Place variants in evolutionary context (e.g., SARS-CoV-2 variants)
- Microbiology: Species phylogeny from 16S rRNA or core genome phylogeny
Standard Workflow
1. Multiple Sequence Alignment with MAFFT
import subprocess
import os
def run_mafft(input_fasta: str, output_fasta: str, method: str = "auto",
n_threads: int = 4) -> str:
"""
Align sequences with MAFFT.
Args:
input_fasta: Path to unaligned FASTA file
output_fasta: Path for aligned output
method: 'auto' (auto-select), 'einsi' (accurate), 'linsi' (accurate, slow),
'fftnsi' (medium), 'fftns' (fast), 'retree2' (fast)
n_threads: Number of CPU threads
Returns:
Path to aligned FASTA file
"""
methods = {
"auto": ["mafft", "--auto"],
"einsi": ["mafft", "--genafpair", "--maxiterate", "1000"],
"linsi": ["mafft", "--localpair", "--maxiterate", "1000"],
"fftnsi": ["mafft", "--fftnsi"],
"fftns": ["mafft", "--fftns"],
"retree2": ["mafft", "--retree", "2"],
}
cmd = methods.get(method, methods["auto"])
cmd += ["--thread", str(n_threads), "--inputorder", input_fasta]
with open(output_fasta, 'w') as out:
result = subprocess.run(cmd, stdout=out, stderr=subprocess.PIPE, text=True)
if result.returncode != 0:
raise RuntimeError(f"MAFFT failed:\n{result.stderr}")
# Count aligned sequences
with open(output_fasta) as f:
n_seqs = sum(1 for line in f if line.startswith('>'))
print(f"MAFFT: aligned {n_seqs} sequences → {output_fasta}")
return output_fasta
# MAFFT method selection guide:
# Few sequences (<200), accurate: linsi or einsi
# Many sequences (<1000), moderate: fftnsi
# Large datasets (>1000): fftns or auto
# Ultra-fast (>10000): mafft --retree 1
2. Trim Alignment (Optional but Recommended)
def trim_alignment_trimal(aligned_fasta: str, output_fasta: str,
method: str = "automated1") -> str:
"""
Trim poorly aligned columns with TrimAl.
Methods:
- 'automated1': Automatic heuristic (recommended)
- 'gappyout': Remove gappy columns
- 'strict': Strict gap threshold
"""
cmd = ["trimal", f"-{method}", "-in", aligned_fasta, "-out", output_fasta, "-fasta"]
result = subprocess.run(cmd, capture_output=True, text=True)
if result.returncode != 0:
print(f"TrimAl warning: {result.stderr}")
# Fall back to using the untrimmed alignment
import shutil
shutil.copy(aligned_fasta, output_fasta)
return output_fasta
3. IQ-TREE 2 — Maximum Likelihood Tree
def run_iqtree(aligned_fasta: str, output_prefix: str,
model: str = "TEST", bootstrap: int = 1000,
n_threads: int = 4, extra_args: list = None) -> dict:
"""
Build a maximum likelihood tree with IQ-TREE 2.
Args:
aligned_fasta: Aligned FASTA file
output_prefix: Prefix for output files
model: 'TEST' for automatic model selection, or specify (e.g., 'GTR+G' for DNA,
'LG+G4' for proteins, 'JTT+G' for proteins)
bootstrap: Number of ultrafast bootstrap replicates (1000 recommended)
n_threads: Number of threads ('AUTO' to auto-detect)
extra_args: Additional IQ-TREE arguments
Returns:
Dict with paths to output files
"""
cmd = [
"iqtree2",
"-s", aligned_fasta,
"--prefix", output_prefix,
"-m", model,
"-B", str(bootstrap), # Ultrafast bootstrap
"-T", str(n_threads),
"--redo" # Overwrite existing results
]
if extra_args:
cmd.extend(extra_args)
result = subprocess.run(cmd, capture_output=True, text=True)
if result.returncode != 0:
raise RuntimeError(f"IQ-TREE failed:\n{result.stderr}")
# Print model selection result
log_file = f"{output_prefix}.log"
if os.path.exists(log_file):
with open(log_file) as f:
for line in f:
if "Best-fit model" in line:
print(f"IQ-TREE: {line.strip()}")
output_files = {
"tree": f"{output_prefix}.treefile",
"log": f"{output_prefix}.log",
"iqtree": f"{output_prefix}.iqtree", # Full report
"model": f"{output_prefix}.model.gz",
}
print(f"IQ-TREE: Tree saved to {output_files['tree']}")
return output_files
# IQ-TREE model selection guide:
# DNA: TEST → GTR+G, HKY+G, TrN+G
# Protein: TEST → LG+G4, WAG+G, JTT+G, Q.pfam+G
# Codon: TEST → MG+F3X4
# For temporal (molecular clock) analysis, add:
# extra_args = ["--date", "dates.txt", "--clock-test", "--date-CI", "95"]
4. FastTree — Fast Approximate ML
For large datasets (>1000 sequences) where IQ-TREE is too slow:
def run_fasttree(aligned_fasta: str, output_tree: str,
sequence_type: str = "nt", model: str = "gtr",
n_threads: int = 4) -> str:
"""
Build a fast approximate ML tree with FastTree.
Args:
sequence_type: 'nt' for nucleotide or 'aa' for amino acid
model: For nt: 'gtr' (recommended) or 'jc'; for aa: 'lg', 'wag', 'jtt'
"""
if sequence_type == "nt":
cmd = ["FastTree", "-nt", "-gtr"]
else:
cmd = ["FastTree", f"-{model}"]
cmd += [aligned_fasta]
with open(output_tree, 'w') as out:
result = subprocess.run(cmd, stdout=out, stderr=subprocess.PIPE, text=True)
if result.returncode != 0:
raise RuntimeError(f"FastTree failed:\n{result.stderr}")
print(f"FastTree: Tree saved to {output_tree}")
return output_tree
5. Tree Analysis and Visualization with ETE3
from ete3 import Tree, TreeStyle, NodeStyle, TextFace, PhyloTree
import matplotlib.pyplot as plt
def load_tree(tree_file: str) -> Tree:
"""Load a Newick tree file."""
t = Tree(tree_file)
print(f"Tree: {len(t)} leaves, {len(list(t.traverse()))} nodes")
return t
def basic_tree_stats(t: Tree) -> dict:
"""Compute basic tree statistics."""
leaves = t.get_leaves()
distances = [t.get_distance(l1, l2) for l1 in leaves[:min(50, len(leaves))]
for l2 in leaves[:min(50, len(leaves))] if l1 != l2]
stats = {
"n_leaves": len(leaves),
"n_internal_nodes": len(t) - len(leaves),
"total_branch_length": sum(n.dist for n in t.traverse()),
"max_leaf_distance": max(distances) if distances else 0,
"mean_leaf_distance": sum(distances)/len(distances) if distances else 0,
}
return stats
def find_mrca(t: Tree, leaf_names: list) -> Tree:
"""Find the most recent common ancestor of a set of leaves."""
return t.get_common_ancestor(*leaf_names)
def visualize_tree(t: Tree, output_file: str = "tree.png",
show_branch_support: bool = True,
color_groups: dict = None,
width: int = 800) -> None:
"""
Render phylogenetic tree to image.
Args:
t: ETE3 Tree object
color_groups: Dict mapping leaf_name → color (for coloring taxa)
show_branch_support: Show bootstrap values
"""
ts = TreeStyle()
ts.show_leaf_name = True
ts.show_branch_support = show_branch_support
ts.mode = "r" # 'r' = rectangular, 'c' = circular
if color_groups:
for node in t.traverse():
if node.is_leaf() and node.name in color_groups:
nstyle = NodeStyle()
nstyle["fgcolor"] = color_groups[node.name]
nstyle["size"] = 8
node.set_style(nstyle)
t.render(output_file, tree_style=ts, w=width, units="px")
print(f"Tree saved to: {output_file}")
def midpoint_root(t: Tree) -> Tree:
"""Root tree at midpoint (use when outgroup unknown)."""
t.set_outgroup(t.get_midpoint_outgroup())
return t
def prune_tree(t: Tree, keep_leaves: list) -> Tree:
"""Prune tree to keep only specified leaves."""
t.prune(keep_leaves, preserve_branch_length=True)
return t
6. Complete Analysis Script
import subprocess, os
from ete3 import Tree
def full_phylogenetic_analysis(
input_fasta: str,
output_dir: str = "phylo_results",
sequence_type: str = "nt",
n_threads: int = 4,
bootstrap: int = 1000,
use_fasttree: bool = False
) -> dict:
"""
Complete phylogenetic pipeline: align → trim → tree → visualize.
Args:
input_fasta: Unaligned FASTA
sequence_type: 'nt' (nucleotide) or 'aa' (amino acid/protein)
use_fasttree: Use FastTree instead of IQ-TREE (faster for large datasets)
"""
os.makedirs(output_dir, exist_ok=True)
prefix = os.path.join(output_dir, "phylo")
print("=" * 50)
print("Step 1: Multiple Sequence Alignment (MAFFT)")
aligned = run_mafft(input_fasta, f"{prefix}_aligned.fasta",
method="auto", n_threads=n_threads)
print("\nStep 2: Tree Inference")
if use_fasttree:
tree_file = run_fasttree(
aligned, f"{prefix}.tree",
sequence_type=sequence_type,
model="gtr" if sequence_type == "nt" else "lg"
)
else:
model = "TEST" if sequence_type == "nt" else "TEST"
iqtree_files = run_iqtree(
aligned, prefix,
model=model,
bootstrap=bootstrap,
n_threads=n_threads
)
tree_file = iqtree_files["tree"]
print("\nStep 3: Tree Analysis")
t = Tree(tree_file)
t = midpoint_root(t)
stats = basic_tree_stats(t)
print(f"Tree statistics: {stats}")
print("\nStep 4: Visualization")
visualize_tree(t, f"{prefix}_tree.png", show_branch_support=True)
# Save rooted tree
rooted_tree_file = f"{prefix}_rooted.nwk"
t.write(format=1, outfile=rooted_tree_file)
results = {
"aligned_fasta": aligned,
"tree_file": tree_file,
"rooted_tree": rooted_tree_file,
"visualization": f"{prefix}_tree.png",
"stats": stats
}
print("\n" + "=" * 50)
print("Phylogenetic analysis complete!")
print(f"Results in: {output_dir}/")
return results
IQ-TREE Model Guide
DNA Models
| Model | Description | Use case |
|---|---|---|
GTR+G4 | General Time Reversible + Gamma | Most flexible DNA model |
HKY+G4 | Hasegawa-Kishino-Yano + Gamma | Two-rate model (common) |
TrN+G4 | Tamura-Nei | Unequal transitions |
JC | Jukes-Cantor | Simplest; all rates equal |
Protein Models
| Model | Description | Use case |
|---|---|---|
LG+G4 | Le-Gascuel + Gamma | Best average protein model |
WAG+G4 | Whelan-Goldman | Widely used |
JTT+G4 | Jones-Taylor-Thornton | Classical model |
Q.pfam+G4 | pfam-trained | For Pfam-like protein families |
Q.bird+G4 | Bird-specific | Vertebrate proteins |
Tip: Use -m TEST to let IQ-TREE automatically select the best model.
Best Practices
- Alignment quality first: Poor alignment → unreliable trees; check alignment manually
- Use
linsifor small (<200 seq),fftnsorautofor large alignments - Model selection: Always use
-m TESTfor IQ-TREE unless you have a specific reason - Bootstrap: Use ≥1000 ultrafast bootstraps (
-B 1000) for branch support - Root the tree: Unrooted trees can be misleading; use outgroup or midpoint rooting
- FastTree for >5000 sequences: IQ-TREE becomes slow; FastTree is 10–100× faster
- Trim long alignments: TrimAl removes unreliable columns; improves tree accuracy
- Check for recombination in viral/bacterial sequences before building trees (
RDP4,GARD)
Additional Resources
- MAFFT: https://mafft.cbrc.jp/alignment/software/
- IQ-TREE 2: http://www.iqtree.org/ | Tutorial: https://www.iqtree.org/workshop/molevol2022
- FastTree: http://www.microbesonline.org/fasttree/
- ETE3: http://etetoolkit.org/
- FigTree (GUI visualization): https://tree.bio.ed.ac.uk/software/figtree/
- iTOL (web visualization): https://itol.embl.de/
- MUSCLE (alternative aligner): https://www.drive5.com/muscle/
- TrimAl (alignment trimming): https://vicfero.github.io/trimal/
Repositorio GitHub
Habilidades relacionadas
content-collections
MetaEsta habilidad proporciona una configuración probada en producción para Content Collections, una herramienta centrada en TypeScript que transforma archivos Markdown/MDX en colecciones de datos con tipado seguro mediante validación Zod. Úsala al construir blogs, sitios de documentación o aplicaciones Vite + React con mucho contenido para garantizar seguridad de tipos y validación automática de contenido. Abarca todo, desde la configuración del plugin de Vite y compilación MDX hasta la optimización de despliegue y validación de esquemas.
polymarket
MetaEsta habilidad permite a los desarrolladores crear aplicaciones con la plataforma de mercados de predicción Polymarket, incluyendo la integración de API para operaciones y datos de mercado. También proporciona transmisión de datos en tiempo real a través de WebSocket para monitorear operaciones en vivo y actividad del mercado. Úsela para implementar estrategias de trading o crear herramientas que procesen actualizaciones de mercado en tiempo real.
creating-opencode-plugins
MetaEsta habilidad ayuda a los desarrolladores a crear complementos de OpenCode que se conectan a más de 25 tipos de eventos, como comandos, archivos y operaciones LSP. Proporciona la estructura del complemento, las especificaciones de la API de eventos y los patrones de implementación para módulos en JavaScript/TypeScript. Úsala cuando necesites interceptar, monitorear o extender el ciclo de vida del asistente de IA de OpenCode con lógica personalizada basada en eventos.
sglang
MetaSGLang es un framework de alto rendimiento para el servicio de LLM que se especializa en generación rápida y estructurada para JSON, expresiones regulares y flujos de trabajo de agentes utilizando su caché de prefijos RadixAttention. Ofrece una inferencia significativamente más rápida, especialmente para tareas con prefijos repetidos, lo que lo hace ideal para salidas complejas y estructuradas, y conversaciones multiturno. Elige SGLang sobre alternativas como vLLM cuando necesites decodificación restringida o estés construyendo aplicaciones con uso extensivo de prefijos compartidos.
