datamol
정보
Datamol은 RDKit을 위한 파이썬 스타일 래퍼로, SMILES 파싱, 분자 표준화, 디스크립터 계산과 같은 일반적인 신약개발 작업을 합리적인 기본값으로 간소화합니다. 표준 워크플로우에 이상적이며, 병렬 처리와 현대적인 I/O를 제공하면서도 RDKit의 네이티브 분자 객체를 반환합니다. 고급 맞춤화가 필요한 개발자는 RDKit을 직접 사용해야 합니다.
빠른 설치
Claude Code
추천npx skills add K-Dense-AI/claude-scientific-skills -a claude-code/plugin add https://github.com/K-Dense-AI/claude-scientific-skillsgit clone https://github.com/K-Dense-AI/claude-scientific-skills.git ~/.claude/skills/datamolClaude Code에서 이 명령을 복사하여 붙여넣어 스킬을 설치하세요
문서
Datamol Cheminformatics Skill
Overview
Datamol is a Python library that provides a lightweight, Pythonic abstraction layer over RDKit for molecular cheminformatics. Simplify complex molecular operations with sensible defaults, efficient parallelization, and modern I/O capabilities. All molecular objects are native rdkit.Chem.Mol instances, ensuring full compatibility with the RDKit ecosystem.
Version note: Examples target datamol 0.12.x (PyPI stable: 0.12.5, June 2024). Since 0.10.0, modules are lazy-loaded by default (set DATAMOL_DISABLE_LAZY_LOADING=1 to disable). Since 0.12.2, RDKit is a direct PyPI dependency of datamol. Fingerprints use RDKit's rdFingerprintGenerator API (0.12.5+).
Key capabilities:
- Molecular format conversion (SMILES, SELFIES, InChI)
- Structure standardization and sanitization
- Molecular descriptors and fingerprints
- 3D conformer generation and analysis
- Clustering and diversity selection
- Scaffold and fragment analysis
- Chemical reaction application
- Visualization and alignment
- Batch processing with parallelization
- Cloud storage support via fsspec
Installation and Setup
Guide users to install datamol:
uv pip install datamol
RDKit is installed automatically with datamol. For remote file paths (S3, GCS, HTTP), install the matching fsspec backend:
uv pip install s3fs # AWS S3
uv pip install gcsfs # Google Cloud Storage
Import convention:
import datamol as dm
Core Workflows
1. Basic Molecule Handling
Creating molecules from SMILES:
import datamol as dm
# Single molecule
mol = dm.to_mol("CCO") # Ethanol
# From list of SMILES
smiles_list = ["CCO", "c1ccccc1", "CC(=O)O"]
mols = [dm.to_mol(smi) for smi in smiles_list]
# Error handling
mol = dm.to_mol("invalid_smiles") # Returns None
if mol is None:
print("Failed to parse SMILES")
Converting molecules to SMILES:
# Canonical SMILES
smiles = dm.to_smiles(mol)
# Isomeric SMILES (includes stereochemistry)
smiles = dm.to_smiles(mol, isomeric=True)
# Other formats
inchi = dm.to_inchi(mol)
inchikey = dm.to_inchikey(mol)
selfies = dm.to_selfies(mol)
Standardization and sanitization (always recommend for user-provided molecules):
# Sanitize molecule
mol = dm.sanitize_mol(mol)
# Full standardization (recommended for datasets)
mol = dm.standardize_mol(
mol,
disconnect_metals=True,
normalize=True,
reionize=True
)
# For SMILES strings directly
clean_smiles = dm.standardize_smiles(smiles)
2. Reading and Writing Molecular Files
Refer to references/io_module.md for comprehensive I/O documentation.
Reading files:
# SDF files (most common in chemistry)
df = dm.read_sdf("compounds.sdf", mol_column='mol')
# SMILES files
df = dm.read_smi("molecules.smi", smiles_column='smiles', mol_column='mol')
# CSV with SMILES column
df = dm.read_csv("data.csv", smiles_column="SMILES", mol_column="mol")
# Excel files
df = dm.read_excel("compounds.xlsx", sheet_name=0, mol_column="mol")
# Universal reader/writer (auto-detects format; supports compression)
df = dm.open_df("file.sdf") # .sdf, .csv, .xlsx, .parquet, .json, .gz, etc.
dm.save_df(df, "output.parquet")
Writing files:
# Save as SDF
dm.to_sdf(mols, "output.sdf")
# Or from DataFrame
dm.to_sdf(df, "output.sdf", mol_column="mol")
# Save as SMILES file
dm.to_smi(mols, "output.smi")
# Excel with rendered molecule images
dm.to_xlsx(df, "output.xlsx", mol_columns=["mol"])
Remote file support (S3, GCS, HTTP via fsspec):
Only use cloud paths when the user explicitly requests them. Confirm the destination before writing.
# Read from cloud storage or HTTPS (user-provided URLs only)
df = dm.read_sdf("s3://bucket/compounds.sdf")
df = dm.read_csv("https://example.com/data.csv")
# Write to cloud storage — confirm path with user first
dm.to_sdf(mols, "s3://bucket/output.sdf")
Cloud backends read credentials from the standard provider environment (for example AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_DEFAULT_REGION, or GOOGLE_APPLICATION_CREDENTIALS). Datamol passes these to fsspec locally; it does not collect or transmit environment variables to third-party endpoints. Scope credential access to the named provider variables only.
3. Molecular Descriptors and Properties
Refer to references/descriptors_viz.md for detailed descriptor documentation.
Computing descriptors for a single molecule:
# Get standard descriptor set
descriptors = dm.descriptors.compute_many_descriptors(mol)
# Returns: {'mw': 46.07, 'logp': -0.03, 'hbd': 1, 'hba': 1,
# 'tpsa': 20.23, 'n_aromatic_atoms': 0, ...}
Batch descriptor computation (recommended for datasets):
# Compute for all molecules in parallel
desc_df = dm.descriptors.batch_compute_many_descriptors(
mols,
n_jobs=-1, # Use all CPU cores
progress=True # Show progress bar
)
Specific descriptors:
# Aromaticity
n_aromatic = dm.descriptors.n_aromatic_atoms(mol)
aromatic_ratio = dm.descriptors.n_aromatic_atoms_proportion(mol)
# Stereochemistry
n_stereo = dm.descriptors.n_stereo_centers(mol)
n_unspec = dm.descriptors.n_stereo_centers_unspecified(mol)
# Flexibility
n_rigid = dm.descriptors.n_rigid_bonds(mol)
Drug-likeness filtering (Lipinski's Rule of Five):
# Filter compounds
def is_druglike(mol):
desc = dm.descriptors.compute_many_descriptors(mol)
return (
desc['mw'] <= 500 and
desc['logp'] <= 5 and
desc['hbd'] <= 5 and
desc['hba'] <= 10
)
druglike_mols = [mol for mol in mols if is_druglike(mol)]
4. Molecular Fingerprints and Similarity
Generating fingerprints:
Datamol defaults to ECFP6 (radius=3, n_bits=2048). Pass radius=2 explicitly for ECFP4.
# ECFP4 (common in similarity screening)
fp = dm.to_fp(mol, fp_type='ecfp', radius=2, n_bits=2048)
# Other fingerprint types
fp_maccs = dm.to_fp(mol, fp_type='maccs')
fp_topological = dm.to_fp(mol, fp_type='topological')
fp_atompair = dm.to_fp(mol, fp_type='atompair')
fp_rdkit = dm.to_fp(mol, fp_type='rdkit')
Similarity calculations:
# Pairwise distances within a set
distance_matrix = dm.pdist(mols, n_jobs=-1)
# Distances between two sets
distances = dm.cdist(query_mols, library_mols, n_jobs=-1)
# Find most similar molecules (scipy is a PyPI package, not a file in this skill)
from scipy.spatial.distance import squareform # third-party library
dist_matrix = squareform(dm.pdist(mols))
# Lower distance = higher similarity (Tanimoto distance = 1 - Tanimoto similarity)
5. Clustering and Diversity Selection
Refer to references/core_api.md for clustering details.
Butina clustering:
# Cluster molecules by structural similarity
clusters = dm.cluster_mols(
mols,
cutoff=0.2, # Tanimoto distance threshold (0=identical, 1=completely different)
n_jobs=-1 # Parallel processing
)
# Each cluster is a list of molecule indices
for i, cluster in enumerate(clusters):
print(f"Cluster {i}: {len(cluster)} molecules")
cluster_mols = [mols[idx] for idx in cluster]
Important: Butina clustering builds a full distance matrix - suitable for ~1000 molecules, not for 10,000+.
Diversity selection:
# Pick diverse subset
diverse_mols = dm.pick_diverse(
mols,
npick=100 # Select 100 diverse molecules
)
# Pick cluster centroids
centroids = dm.pick_centroids(
mols,
npick=50 # Select 50 representative molecules
)
6. Scaffold Analysis
Refer to references/fragments_scaffolds.md for complete scaffold documentation.
Extracting Murcko scaffolds:
# Get Bemis-Murcko scaffold (core structure)
scaffold = dm.to_scaffold_murcko(mol)
scaffold_smiles = dm.to_smiles(scaffold)
Scaffold-based analysis:
# Group compounds by scaffold
from collections import Counter
scaffolds = [dm.to_scaffold_murcko(mol) for mol in mols]
scaffold_smiles = [dm.to_smiles(s) for s in scaffolds]
# Count scaffold frequency
scaffold_counts = Counter(scaffold_smiles)
most_common = scaffold_counts.most_common(10)
# Create scaffold-to-molecules mapping
scaffold_groups = {}
for mol, scaf_smi in zip(mols, scaffold_smiles):
if scaf_smi not in scaffold_groups:
scaffold_groups[scaf_smi] = []
scaffold_groups[scaf_smi].append(mol)
Scaffold-based train/test splitting (for ML):
# Ensure train and test sets have different scaffolds
scaffold_to_mols = {}
for mol, scaf in zip(mols, scaffold_smiles):
if scaf not in scaffold_to_mols:
scaffold_to_mols[scaf] = []
scaffold_to_mols[scaf].append(mol)
# Split scaffolds into train/test
import random
scaffolds = list(scaffold_to_mols.keys())
random.shuffle(scaffolds)
split_idx = int(0.8 * len(scaffolds))
train_scaffolds = scaffolds[:split_idx]
test_scaffolds = scaffolds[split_idx:]
# Get molecules for each split
train_mols = [mol for scaf in train_scaffolds for mol in scaffold_to_mols[scaf]]
test_mols = [mol for scaf in test_scaffolds for mol in scaffold_to_mols[scaf]]
7. Molecular Fragmentation
Refer to references/fragments_scaffolds.md for fragmentation details.
BRICS fragmentation (16 bond types):
# Fragment molecule
fragments = dm.fragment.brics(mol)
# Returns: set of fragment SMILES with attachment points like '[1*]CCN'
RECAP fragmentation (11 bond types):
fragments = dm.fragment.recap(mol)
Fragment analysis:
# Find common fragments across compound library
from collections import Counter
all_fragments = []
for mol in mols:
frags = dm.fragment.brics(mol)
all_fragments.extend(frags)
fragment_counts = Counter(all_fragments)
common_frags = fragment_counts.most_common(20)
# Fragment-based scoring
def fragment_score(mol, reference_fragments):
mol_frags = dm.fragment.brics(mol)
overlap = mol_frags.intersection(reference_fragments)
return len(overlap) / len(mol_frags) if mol_frags else 0
8. 3D Conformer Generation
Refer to references/conformers_module.md for detailed conformer documentation.
Generating conformers:
# Generate 3D conformers
mol_3d = dm.conformers.generate(
mol,
n_confs=50, # Number to generate (auto if None)
rms_cutoff=0.5, # Filter similar conformers (Ångströms)
minimize_energy=True, # Minimize with UFF force field
method='ETKDGv3' # Embedding method (recommended)
)
# Access conformers
n_conformers = mol_3d.GetNumConformers()
conf = mol_3d.GetConformer(0) # Get first conformer
positions = conf.GetPositions() # Nx3 array of atom coordinates
Conformer clustering:
# Cluster conformers by RMSD
clusters = dm.conformers.cluster(
mol_3d,
rms_cutoff=1.0,
centroids=False
)
# Get representative conformers
centroids = dm.conformers.return_centroids(mol_3d, clusters)
SASA calculation:
# Calculate solvent accessible surface area
sasa_values = dm.conformers.sasa(mol_3d, n_jobs=-1)
# Access SASA from conformer properties
conf = mol_3d.GetConformer(0)
sasa = conf.GetDoubleProp('rdkit_free_sasa')
9. Visualization
Refer to references/descriptors_viz.md for visualization documentation.
Basic molecule grid:
# Visualize molecules
dm.viz.to_image(
mols[:20],
legends=[dm.to_smiles(m) for m in mols[:20]],
n_cols=5,
mol_size=(300, 300)
)
# Save to file
dm.viz.to_image(mols, outfile="molecules.png")
# SVG for publications
dm.viz.to_image(mols, outfile="molecules.svg", use_svg=True)
Aligned visualization (for SAR analysis):
# Align molecules by common substructure
dm.viz.to_image(
similar_mols,
align=True, # Enable MCS alignment
legends=activity_labels,
n_cols=4
)
Highlighting substructures:
# Highlight specific atoms and bonds
dm.viz.to_image(
mol,
highlight_atom=[0, 1, 2, 3], # Atom indices
highlight_bond=[0, 1, 2] # Bond indices
)
Conformer visualization:
# Display multiple conformers
dm.viz.conformers(
mol_3d,
n_confs=10,
align_conf=True,
n_cols=3
)
10. Chemical Reactions
Refer to references/reactions_data.md for reactions documentation.
Applying reactions:
from rdkit.Chem import rdChemReactions
# Define reaction from SMARTS
rxn_smarts = '[C:1](=[O:2])[OH:3]>>[C:1](=[O:2])[Cl:3]'
rxn = rdChemReactions.ReactionFromSmarts(rxn_smarts)
# Apply to molecule
reactant = dm.to_mol("CC(=O)O") # Acetic acid
product = dm.reactions.apply_reaction(
rxn,
(reactant,),
sanitize=True
)
# Convert to SMILES
product_smiles = dm.to_smiles(product)
Batch reaction application:
# Apply reaction to library
products = []
for mol in reactant_mols:
try:
prod = dm.reactions.apply_reaction(rxn, (mol,))
if prod is not None:
products.append(prod)
except Exception as e:
print(f"Reaction failed: {e}")
Parallelization
Datamol includes built-in parallelization for many operations. Use n_jobs parameter:
n_jobs=1: Sequential (no parallelization)n_jobs=-1: Use all available CPU coresn_jobs=4: Use 4 cores
Functions supporting parallelization:
dm.read_sdf(..., n_jobs=-1)dm.descriptors.batch_compute_many_descriptors(..., n_jobs=-1)dm.cluster_mols(..., n_jobs=-1)dm.pdist(..., n_jobs=-1)dm.conformers.sasa(..., n_jobs=-1)
Progress bars: Many batch operations support progress=True parameter.
Common Workflows and Patterns
Complete Pipeline: Data Loading → Filtering → Analysis
import datamol as dm
import pandas as pd
# 1. Load molecules
df = dm.read_sdf("compounds.sdf")
# 2. Standardize
df['mol'] = df['mol'].apply(lambda m: dm.standardize_mol(m) if m else None)
df = df[df['mol'].notna()] # Remove failed molecules
# 3. Compute descriptors
desc_df = dm.descriptors.batch_compute_many_descriptors(
df['mol'].tolist(),
n_jobs=-1,
progress=True
)
# 4. Filter by drug-likeness
druglike = (
(desc_df['mw'] <= 500) &
(desc_df['logp'] <= 5) &
(desc_df['hbd'] <= 5) &
(desc_df['hba'] <= 10)
)
filtered_df = df[druglike]
# 5. Cluster and select diverse subset
diverse_mols = dm.pick_diverse(
filtered_df['mol'].tolist(),
npick=100
)
# 6. Visualize results
dm.viz.to_image(
diverse_mols,
legends=[dm.to_smiles(m) for m in diverse_mols],
outfile="diverse_compounds.png",
n_cols=10
)
Structure-Activity Relationship (SAR) Analysis
# Group by scaffold
scaffolds = [dm.to_scaffold_murcko(mol) for mol in mols]
scaffold_smiles = [dm.to_smiles(s) for s in scaffolds]
# Create DataFrame with activities
sar_df = pd.DataFrame({
'mol': mols,
'scaffold': scaffold_smiles,
'activity': activities # User-provided activity data
})
# Analyze each scaffold series
for scaffold, group in sar_df.groupby('scaffold'):
if len(group) >= 3: # Need multiple examples
print(f"\nScaffold: {scaffold}")
print(f"Count: {len(group)}")
print(f"Activity range: {group['activity'].min():.2f} - {group['activity'].max():.2f}")
# Visualize with activities as legends
dm.viz.to_image(
group['mol'].tolist(),
legends=[f"Activity: {act:.2f}" for act in group['activity']],
align=True # Align by common substructure
)
Virtual Screening Pipeline
import numpy as np
# 1. Calculate Tanimoto distances between query actives and library
distances = dm.cdist(query_actives, library_mols, n_jobs=-1)
# 3. Find closest matches (min distance to any query)
min_distances = distances.min(axis=0)
similarities = 1 - min_distances # Convert distance to similarity
# 4. Rank and select top hits
top_indices = np.argsort(similarities)[::-1][:100] # Top 100
top_hits = [library_mols[i] for i in top_indices]
top_scores = [similarities[i] for i in top_indices]
# 5. Visualize hits
dm.viz.to_image(
top_hits[:20],
legends=[f"Sim: {score:.3f}" for score in top_scores[:20]],
outfile="screening_hits.png"
)
Reference Documentation
For detailed API documentation, consult these reference files:
references/core_api.md: Core namespace functions (conversions, standardization, fingerprints, clustering)references/io_module.md: File I/O operations (read/write SDF, CSV, Excel, remote files)references/conformers_module.md: 3D conformer generation, clustering, SASA calculationsreferences/descriptors_viz.md: Molecular descriptors and visualization functionsreferences/fragments_scaffolds.md: Scaffold extraction, BRICS/RECAP fragmentationreferences/reactions_data.md: Chemical reactions and toy datasets
Best Practices
-
Always standardize molecules from external sources:
mol = dm.standardize_mol(mol, disconnect_metals=True, normalize=True, reionize=True) -
Check for None values after molecule parsing:
mol = dm.to_mol(smiles) if mol is None: # Handle invalid SMILES -
Use parallel processing for large datasets:
result = dm.operation(..., n_jobs=-1, progress=True) -
Use cloud I/O only when requested — confirm remote write paths; install
s3fs/gcsfsas needed:df = dm.read_sdf("s3://bucket/compounds.sdf") -
Use appropriate fingerprints for similarity:
- ECFP (Morgan): General purpose, structural similarity
- MACCS: Fast, smaller feature space
- Atom pairs: Considers atom pairs and distances
-
Consider scale limitations:
- Butina clustering: ~1,000 molecules (full distance matrix)
- For larger datasets: Use diversity selection or hierarchical methods
-
Scaffold splitting for ML: Ensure proper train/test separation by scaffold
-
Align molecules when visualizing SAR series
Error Handling
# Safe molecule creation
def safe_to_mol(smiles):
try:
mol = dm.to_mol(smiles)
if mol is not None:
mol = dm.standardize_mol(mol)
return mol
except Exception as e:
print(f"Failed to process {smiles}: {e}")
return None
# Safe batch processing
valid_mols = []
for smiles in smiles_list:
mol = safe_to_mol(smiles)
if mol is not None:
valid_mols.append(mol)
Integration with Machine Learning
Datamol ships with scipy and scikit-learn as dependencies. Import them as normal PyPI packages — they are not scripts bundled in this skill.
import numpy as np
# Feature generation
X = np.array([dm.to_fp(mol) for mol in mols])
# Or descriptors
desc_df = dm.descriptors.batch_compute_many_descriptors(mols, n_jobs=-1)
X = desc_df.values
# Train model (scikit-learn PyPI package)
from sklearn.ensemble import RandomForestRegressor # third-party library
model = RandomForestRegressor()
model.fit(X, y_target)
# Predict
predictions = model.predict(X_test)
Troubleshooting
Issue: Molecule parsing fails
- Solution: Use
dm.standardize_smiles()first or trydm.fix_mol()
Issue: Memory errors with clustering
- Solution: Use
dm.pick_diverse()instead of full clustering for large sets
Issue: Slow conformer generation
- Solution: Reduce
n_confsor increaserms_cutoffto generate fewer conformers
Issue: Remote file access fails
- Solution: Install the matching fsspec backend (
uv pip install s3fsorgcsfs) and verify only the provider credentials needed for that backend are set (see Remote file support above)
Additional Resources
- Datamol Documentation: https://docs.datamol.io/
- RDKit Documentation: https://www.rdkit.org/docs/
- GitHub Repository: https://github.com/datamol-io/datamol
GitHub 저장소
연관 스킬
llamaguard
기타LlamaGuard는 폭력 및 혐오 발언 등 6가지 안전 범주에서 LLM 입력과 출력을 조정하기 위한 Meta의 70-80억 파라미터 모델입니다. 94-95% 정확도를 제공하며 vLLM, Hugging Face 또는 Amazon SageMaker를 사용해 배포할 수 있습니다. 이 기술을 사용하여 AI 애플리케이션에 콘텐츠 필터링 및 안전 가드레일을 손쉽게 통합하세요.
cost-optimization
기타이 Claude Skill은 리소스 적정화, 태깅 전략, 지출 분석을 통해 개발자들이 클라우드 비용을 최적화할 수 있도록 지원합니다. AWS, Azure, GCP에서 클라우드 비용을 절감하고 비용 거버넌스를 구현하기 위한 프레임워크를 제공합니다. 인프라 비용을 분석하거나, 리소스를 적정화하거나, 예산 제약을 충족해야 할 때 사용하세요.
quantizing-models-bitsandbytes
기타이 스킬은 bitsandbytes를 사용하여 LLM을 8비트 또는 4비트 정밀도로 양자화하며, 최소한의 정확도 손실로 50-75%의 메모리 감소를 달성합니다. 제한된 GPU 메모리에서 더 큰 모델을 실행하거나 추론을 가속화하는 데 이상적이며, INT8, NF4, FP4와 같은 형식을 지원합니다. 이 스킬은 HuggingFace Transformers와 통합되어 QLoRA 학습 및 8비트 옵티마이저를 가능하게 합니다.
dispatching-parallel-agents
기타이 Claude Skill은 3개 이상의 독립적인 문제를 동시에 조사하고 해결하기 위해 다중 에이전트를 배치합니다. 공유 상태나 의존성 없이 해결 가능한 무관련 장애 시나리오에 맞게 설계되었습니다. 핵심 기능은 병렬 문제 해결로, 각 독립 문제 영역마다 하나의 에이전트를 할당하여 효율성을 극대화합니다.
