umap-learn
About
This skill provides UMAP, a fast nonlinear dimensionality reduction technique for visualizing high-dimensional data in 2D or 3D. It is useful for preserving data structure in embeddings and for preprocessing ahead of clustering algorithms such as HDBSCAN, and it supports both supervised and parametric modes. It follows scikit-learn conventions for easy integration into machine learning workflows.
Quick Install
Claude Code
Recommended: npx skills add K-Dense-AI/claude-scientific-skills -a claude-code
/plugin add https://github.com/K-Dense-AI/claude-scientific-skills
git clone https://github.com/K-Dense-AI/claude-scientific-skills.git ~/.claude/skills/umap-learn
Copy and paste this command into Claude Code to install the skill.
Skill Documentation
UMAP-Learn
Overview
UMAP (Uniform Manifold Approximation and Projection) is a non-linear dimensionality reduction technique used both for visualization and as a general-purpose reduction step. Apply this skill when you need fast, scalable embeddings that preserve local structure and much of the global structure, supervised or semi-supervised embeddings, or preprocessing for clustering.
Quick Start
Installation
Requires Python 3.9+. Pin to a verified release:
uv pip install umap-learn==0.5.12
Basic Usage
UMAP follows scikit-learn conventions (fit, transform, fit_transform), so it can stand in wherever a scikit-learn reducer such as PCA or TSNE is used.
import umap
from sklearn.preprocessing import StandardScaler
# Prepare data (standardization is essential)
scaled_data = StandardScaler().fit_transform(data)
# Method 1: Single step (fit and transform)
embedding = umap.UMAP().fit_transform(scaled_data)
# Method 2: Separate steps (for reusing trained model)
reducer = umap.UMAP(random_state=42)
reducer.fit(scaled_data)
embedding = reducer.embedding_ # Access the trained embedding
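Critical preprocessing requirement: Standardize features to comparable scales before applying UMAP. With a distance such as euclidean, a feature measured on a much larger scale would otherwise dominate the neighbor search and outweigh the other dimensions.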
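To make the scaling requirement concrete, here is a minimal NumPy-only sketch on synthetic data (the data and the helper name `squared_distance_share` are hypothetical). It measures how much of the total squared Euclidean distance each column contributes before and after z-scoring, which is the same operation StandardScaler performs with its default settings.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: feature 0 spans roughly [0, 1], feature 1 roughly [0, 1000].
data = np.column_stack([rng.uniform(0, 1, 500), rng.uniform(0, 1000, 500)])

def squared_distance_share(X):
    """Fraction of the total squared pairwise Euclidean distance due to each column.

    Summed over all pairs, the squared distance along a column is proportional
    to that column's variance, so the variance ratio gives the share.
    """
    var = X.var(axis=0)
    return var / var.sum()

# Z-score each column: subtract the mean, divide by the standard deviation.
scaled = (data - data.mean(axis=0)) / data.std(axis=0)

raw_share = squared_distance_share(data)
scaled_share = squared_distance_share(scaled)

print("raw share per feature:   ", np.round(raw_share, 6))
print("scaled share per feature:", np.round(scaled_share, 6))
```

Before scaling, nearly all of the distance comes from the large-scale feature; after scaling, both features contribute equally.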
Typical Workflow
import umap
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
# 1. Preprocess data
scaler = StandardScaler()
scaled_data = scaler.fit_transform(raw_data)
# 2. Create and fit UMAP
reducer = umap.UMAP(
    n_neighbors=15,
    min_dist=0.1,
    n_components=2,
    metric='euclidean',
    random_state=42
)
embedding = reducer.fit_transform(scaled_data)
# 3. Visualize
plt.scatter(embedding[:, 0], embedding[:, 1], c=labels, cmap='Spectral', s=5)
plt.colorbar()
plt.title('UMAP Embedding')
plt.show()
Parameter Tuning Guide
UMAP has four primary parameters that control the embedding behavior. Understanding these is crucial for effective usage.
n_neighbors (default: 15)
Purpose: Balances local versus global structure in the embedding.
How it works: Controls the size of the local neighborhood UMAP examines when learning manifold structure.
Effects by value:
- Low values (2-5): Emphasizes fine local detail but may fragment data into disconnected components
- Medium values (15-20): Balanced view of both local structure and global relationships (recommended starting point)
- High values (50-200): Prioritizes broad topological structure at the expense of fine-grained details
Recommendation: Start with 15 and adjust based on results. Increase for more global structure, decrease for more local detail.
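The fragmentation effect of small n_neighbors values can be seen on a plain k-nearest-neighbor graph. The sketch below is a simplified illustration only: it builds an unweighted kNN graph with NumPy and counts connected components, whereas UMAP itself builds a weighted (fuzzy) graph on top of the same neighbor lists. The function name `knn_components` and the toy points are hypothetical.

```python
import numpy as np

def knn_components(points, k):
    """Count connected components of the undirected k-nearest-neighbor graph."""
    n = len(points)
    # Pairwise Euclidean distances; a point is never its own neighbor.
    diffs = points[:, None, :] - points[None, :, :]
    dist = np.sqrt((diffs ** 2).sum(axis=-1))
    np.fill_diagonal(dist, np.inf)
    neighbors = np.argsort(dist, axis=1)[:, :k]

    # Union-find over the undirected edges i -- j.
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(n):
        for j in neighbors[i]:
            parent[find(i)] = find(int(j))
    return len({find(i) for i in range(n)})

# Two groups of five points each, far apart on a line.
points = np.array([[x] for x in [0, 1, 2, 3, 4, 100, 101, 102, 103, 104]], dtype=float)

few = knn_components(points, k=2)   # every neighbor lies inside the same group
many = knn_components(points, k=5)  # the fifth neighbor must come from the other group
print(few, many)
```

With k=2 the two groups never link up, so the graph has two components; with k=5 each point runs out of same-group neighbors and the graph becomes connected.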
min_dist (default: 0.1)
Purpose: Controls how tightly points cluster in the low-dimensional space.
How it works: Sets the minimum distance apart that points are allowed to be in the output representation.
Effects by value:
- Low values (0.0-0.1): Creates clumped embeddings useful for clustering; reveals fine topological details
- High values (0.5-0.99): Prevents tight packing; emphasizes broad topological preservation over local structure
Recommendation: Use 0.0 for clustering applications, 0.1-0.3 for visualization, 0.5+ for loose structure.
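As a rough illustration of how min_dist shapes the layout: to the best of my understanding, umap-learn fits its low-dimensional similarity curve to a target that equals 1 for embedded distances below min_dist and decays exponentially beyond it, scaled by the spread parameter (default 1.0). The sketch below reproduces only that target curve with NumPy; the function name `target_similarity` is hypothetical, and the curve fitting that umap-learn performs afterwards is omitted.

```python
import numpy as np

def target_similarity(d, min_dist, spread=1.0):
    """Target similarity at embedded distance d: 1 below min_dist, exponential decay beyond."""
    d = np.asarray(d, dtype=float)
    return np.where(d < min_dist, 1.0, np.exp(-(d - min_dist) / spread))

distances = np.array([0.0, 0.05, 0.25, 0.5, 1.0])
tight = target_similarity(distances, min_dist=0.0)  # any separation already lowers similarity
loose = target_similarity(distances, min_dist=0.5)  # no extra reward for packing closer than 0.5

for d, t, l in zip(distances, tight, loose):
    print(f"d={d:4.2f}  min_dist=0.0: {t:.3f}  min_dist=0.5: {l:.3f}")
```

With min_dist=0.0 the optimizer keeps gaining by pulling neighbors closer together, which yields the dense clumps that suit clustering. With min_dist=0.5 points gain nothing from moving closer than 0.5, so the layout stays spread out.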
n_components (default: 2)
Purpose: Determines the dimensionality of the embedded output space.
Key feature: Unlike t-SNE, UMAP scales well in the embedding dimension, enabling use beyond visualization.
Common uses:
- 2-3 dimensions: Visualization
- 5-10 dimensions: Clustering preprocessing (better preserves density than 2D)
- 10-50 dimensions: Feature engineering for downstream ML models
Recommendation: Use 2 for visualization, 5-10 for clustering, higher for ML pipelines.
metric (default: 'euclidean')
Purpose: Specifies how distance is calculated between input data points.
Supported metrics:
- Minkowski variants: euclidean, manhattan, chebyshev
- Spatial metrics: canberra, braycurtis, haversine (haversine applies only to two-dimensional latitude/longitude input)
- Angular and correlation metrics: cosine, correlation (good for text/document embeddings)
- Binary data metrics: hamming, jaccard, dice, russellrao, kulsinski, rogerstanimoto, sokalmichener, sokalsneath, yule
- Custom metrics: User-defined distance functions via Numba
Recommendation: Use euclidean for numeric data, cosine for text/document vectors, hamming for binary data.
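The recommendation to use cosine for document vectors can be motivated with a small NumPy sketch. The term-count vectors below are hypothetical: one document is a tenfold repetition of another, and a third uses different terms entirely.

```python
import numpy as np

def euclidean_distance(a, b):
    """Straight-line distance; sensitive to vector length."""
    return float(np.linalg.norm(a - b))

def cosine_distance(a, b):
    """1 - cosine similarity; depends only on the angle between a and b."""
    return float(1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical term-count vectors over a four-word vocabulary.
short_doc = np.array([2.0, 1.0, 0.0, 0.0])
long_doc = 10 * short_doc                    # same content, ten times longer
other_doc = np.array([0.0, 0.0, 1.0, 2.0])   # shares no terms with short_doc

print("euclidean:", euclidean_distance(short_doc, long_doc),
      euclidean_distance(short_doc, other_doc))
print("cosine:   ", cosine_distance(short_doc, long_doc),
      cosine_distance(short_doc, other_doc))
```

Euclidean distance places the longer copy farther away than the unrelated document, purely because of its length. Cosine distance treats the two copies as essentially identical and the unrelated document as maximally different among non-negative vectors.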
Parameter Tuning Example
# For visualization with emphasis on local structure
umap.UMAP(n_neighbors=15, min_dist=0.1, n_components=2, metric='euclidean')
# For clustering preprocessing
umap.UMAP(n_neighbors=30, min_dist=0.0, n_components=10, metric='euclidean')
# For document embeddings
umap.UMAP(n_neighbors=15, min_dist=0.1, n_components=2, metric='cosine')
# For preserving global structure
umap.UMAP(n_neighbors=100, min_dist=0.5, n_components=2, metric='euclidean')
Supervised and Semi-Supervised Dimension Reduction
UMAP supports incorporating label information to guide the embedding process, enabling class separation while preserving internal structure.
Supervised UMAP
Pass target labels via the y parameter when fitting:
# Supervised dimension reduction
embedding = umap.UMAP().fit_transform(data, y=labels)
Key benefits:
- Achieves cleanly separated classes
- Preserves internal structure within each class
- Maintains global relationships between classes
When to use: When you have labeled data and want to separate known classes while keeping meaningful point embeddings.
Semi-Supervised UMAP
For partial labels, mark unlabeled points with -1 following scikit-learn convention:
# Create semi-supervised labels
semi_labels = labels.copy()
semi_labels[unlabeled_indices] = -1
# Fit with partial labels
embedding = umap.UMAP().fit_transform(data, y=semi_labels)
When to use: When labeling is expensive or you have more data than labels available.
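The label-masking step can be tried on synthetic labels before wiring it into a real pipeline. The sketch below uses made-up labels and an assumed 20% annotation rate; only the -1 convention comes from the text above. It stops short of calling UMAP, so it runs with NumPy alone.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical ground-truth labels for 100 samples in 3 classes.
# A signed integer dtype is needed so that -1 can be stored.
labels = rng.integers(0, 3, size=100)

# Pretend only 20% of the samples were annotated: hide the other 80.
unlabeled_indices = rng.choice(len(labels), size=80, replace=False)

semi_labels = labels.copy()          # copy so the full labels stay intact
semi_labels[unlabeled_indices] = -1  # -1 marks "no label available"

n_known = int((semi_labels != -1).sum())
print(f"{n_known} labeled, {len(semi_labels) - n_known} unlabeled")
```

The resulting `semi_labels` array is what would be passed as y to fit_transform.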
Metric Learning with UMAP
Train a supervised embedding on labeled data, then apply to new unlabeled data:
# Train on labeled data
mapper = umap.UMAP().fit(train_data, train_labels)
# Transform unlabeled test data
test_embedding = mapper.transform(test_data)
# Use as feature engineering for downstream classifier
from sklearn.svm import SVC
clf = SVC().fit(mapper.embedding_, train_labels)
predictions = clf.predict(test_embedding)
When to use: For supervised feature engineering in machine learning pipelines.
UMAP for Clustering
UMAP serves as effective preprocessing for density-based clustering algorithms like HDBSCAN: reducing the dimensionality first mitigates the curse of dimensionality, which makes density estimates unreliable in the original space.
Best Practices for Clustering
Key principle: Configure UMAP differently for clustering than for visualization.
Recommended parameters:
- n_neighbors: Increase to ~30 (the default of 15 can be too local and may create artificial fine-grained clusters)
- min_dist: Set to 0.0 (pack points densely within clusters for clearer boundaries)
- n_components: Use 5-10 dimensions (maintains performance while improving density preservation vs. 2D)
Clustering Workflow
Install HDBSCAN separately for density-based clustering:
uv pip install hdbscan
import umap
import hdbscan
from sklearn.preprocessing import StandardScaler
# 1. Preprocess data
scaled_data = StandardScaler().fit_transform(data)
# 2. UMAP with clustering-optimized parameters
reducer = umap.UMAP(
    n_neighbors=30,
    min_dist=0.0,
    n_components=10, # Higher than 2 for better density preservation
    metric='euclidean',
    random_state=42
)
embedding = reducer.fit_transform(scaled_data)
# 3. Apply HDBSCAN clustering
clusterer = hdbscan.HDBSCAN(
    min_cluster_size=15,
    min_samples=5,
    metric='euclidean'
)
labels = clusterer.fit_predict(embedding)
# 4. Evaluate
from sklearn.metrics import adjusted_rand_score
score = adjusted_rand_score(true_labels, labels)
print(f"Adjusted Rand Score: {score:.3f}")
print(f"Number of clusters: {len(set(labels)) - (1 if -1 in labels else 0)}")
print(f"Noise points: {sum(labels == -1)}")
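The cluster and noise counts printed above rely on `labels` being a NumPy array. As a small convenience sketch (the helper name `summarize_clusters` is hypothetical), the same summary can be computed for any sequence of HDBSCAN-style labels, where -1 denotes noise:

```python
import numpy as np

def summarize_clusters(labels):
    """Return (number of clusters, number of noise points, cluster sizes) for labels where -1 is noise."""
    labels = np.asarray(labels)
    noise = int(np.sum(labels == -1))
    ids, counts = np.unique(labels[labels != -1], return_counts=True)
    sizes = {int(i): int(c) for i, c in zip(ids, counts)}
    return len(ids), noise, sizes

# Hypothetical clustering result for nine points.
example_labels = [0, 0, 1, -1, 1, 1, 2, -1, 0]
n_clusters, n_noise, sizes = summarize_clusters(example_labels)
print(n_clusters, n_noise, sizes)
```

Looking at the cluster sizes alongside the noise count is a quick way to spot the artificial fine-grained clusters that overly local settings can produce.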
Visualization After Clustering
# Create 2D embedding for visualization (separate from clustering)
vis_reducer = umap.UMAP(n_neighbors=15, min_dist=0.1, n_components=2, random_state=42)
vis_embedding = vis_reducer.fit_transform(scaled_data)
# Plot with cluster labels
import matplotlib.pyplot as plt
plt.scatter(vis_embedding[:, 0], vis_embedding[:, 1], c=labels, cmap='Spectral', s=5)
plt.colorbar()
plt.title('UMAP Visualization with HDBSCAN Clusters')
plt.show()
Important caveat: UMAP does not completely preserve density and can create artificial cluster divisions. Always validate and explore resulting clusters.
Transforming New Data
UMAP enables preprocessing of new data through its transform() method, allowing trained models to project unseen data into the learned embedding space.
Basic Transform Usage
# Train on training data
trans = umap.UMAP(n_neighbors=15, random_state=42).fit(X_train)
# Transform test data
test_embedding = trans.transform(X_test)
Integration with Machine Learning Pipelines
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import umap
# Split data
X_train, X_test, y_train, y_test = train_test_split(data, labels, test_size=0.2)
# Preprocess
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Train UMAP
reducer = umap.UMAP(n_components=10, random_state=42)
X_train_embedded = reducer.fit_transform(X_train_scaled)
X_test_embedded = reducer.transform(X_test_scaled)
# Train classifier on embeddings
clf = SVC()
clf.fit(X_train_embedded, y_train)
accuracy = clf.score(X_test_embedded, y_test)
print(f"Test accuracy: {accuracy:.3f}")
Important Considerations
Data consistency: The transform method assumes the overall distribution in the higher-dimensional space is consistent between training and test data. When this assumption fails, consider using Parametric UMAP instead.
Performance: Transforming new points is generally much cheaper than fitting, and the time grows with the amount of new data. The first call may be noticeably slower because of Numba JIT compilation.
Scikit-learn compatibility: UMAP follows standard sklearn conventions and works seamlessly in pipelines. Since 0.5.x, UMAP implements get_feature_names_out() for sklearn column-transformer pipelines:
from sklearn.pipeline import Pipeline
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('umap', umap.UMAP(n_components=10)),
    ('classifier', SVC())
])
pipeline.fit(X_train, y_train)
predictions = pipeline.predict(X_test)
feature_names = pipeline.named_steps['umap'].get_feature_names_out()
Advanced Features
Parametric UMAP
Parametric UMAP replaces direct embedding optimization with a learned neural network mapping function.
Key differences from standard UMAP:
- Uses TensorFlow/Keras to train encoder networks
- Enables efficient transformation of new data
- Supports reconstruction via decoder networks (inverse transform)
- Allows custom architectures (CNNs for images, RNNs for sequences)
Installation:
uv pip install "umap-learn[parametric-umap]==0.5.12"
# Requires TensorFlow 2.x (install separately if needed)
Basic usage:
from umap.parametric_umap import ParametricUMAP
# Default architecture (3-layer 100-neuron fully-connected network)
embedder = ParametricUMAP()
embedding = embedder.fit_transform(data)
# Transform new data efficiently
new_embedding = embedder.transform(new_data)
Custom architecture:
import tensorflow as tf
from umap.parametric_umap import ParametricUMAP
input_dim = data.shape[1] # Number of input features
# Define custom encoder
encoder = tf.keras.Sequential([
    tf.keras.layers.InputLayer(input_shape=(input_dim,)),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(2) # Output dimension; must match n_components
])
embedder = ParametricUMAP(encoder=encoder, dims=(input_dim,))
embedding = embedder.fit_transform(data)
When to use Parametric UMAP:
- Need efficient transformation of new data after training
- Require reconstruction capabilities (inverse transforms)
- Want to combine UMAP with autoencoders
- Working with complex data types (images, sequences) benefiting from specialized architectures
When to use standard UMAP:
- Need simplicity and quick prototyping
- Dataset is small and computational efficiency isn't critical
- Don't require learned transformations for future data
Inverse Transforms
Inverse transforms enable reconstruction of high-dimensional data from low-dimensional embeddings.
Basic usage:
reducer = umap.UMAP()
embedding = reducer.fit_transform(data)
# Reconstruct high-dimensional data from embedding coordinates
reconstructed = reducer.inverse_transform(embedding)
Important limitations:
- Computationally expensive operation
- Works poorly outside the convex hull of the embedding
- Accuracy decreases in regions with gaps between clusters
Use cases:
- Understanding structure of embedded data
- Visualizing smooth transitions between clusters
- Exploring interpolations between data points
- Generating synthetic samples in embedding space
Example: Exploring embedding space:
import numpy as np
# Create grid of points in embedding space
x = np.linspace(embedding[:, 0].min(), embedding[:, 0].max(), 10)
y = np.linspace(embedding[:, 1].min(), embedding[:, 1].max(), 10)
xx, yy = np.meshgrid(x, y)
grid_points = np.c_[xx.ravel(), yy.ravel()]
# Reconstruct samples from grid
reconstructed_samples = reducer.inverse_transform(grid_points)
AlignedUMAP
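For analyzing temporal or related datasets (e.g., time-series experiments, batch data). AlignedUMAP must be told which rows correspond across consecutive datasets, via the relations argument to fit:
from umap import AlignedUMAP
# List of related datasets
datasets = [day1_data, day2_data, day3_data]
# One dict per consecutive pair, mapping row indices in the earlier dataset
# to row indices in the later one. Assumption: every dataset holds the same
# samples in the same row order, so each mapping is the identity.
n_samples = len(day1_data)
relations = [{i: i for i in range(n_samples)} for _ in range(len(datasets) - 1)]
# Create aligned embeddings
mapper = AlignedUMAP().fit(datasets, relations=relations)
aligned_embeddings = mapper.embeddings_ # List of embeddings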
When to use: Comparing embeddings across related datasets while maintaining consistent coordinate systems.
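AlignedUMAP's fit expects a relations keyword argument: a list with one dict per consecutive pair of datasets, each mapping row indices in the earlier dataset to row indices in the later one. When datasets share only some samples, or store them in a different order, these dicts can be derived from sample identifiers. The sketch below is pure Python; the helper name `build_relations` and the sample IDs are hypothetical.

```python
def build_relations(id_lists):
    """For each consecutive pair of datasets, map the row index in the earlier
    dataset to the row index in the later one for every sample ID found in both."""
    relations = []
    for earlier, later in zip(id_lists[:-1], id_lists[1:]):
        later_pos = {sample_id: j for j, sample_id in enumerate(later)}
        relations.append({
            i: later_pos[sample_id]
            for i, sample_id in enumerate(earlier)
            if sample_id in later_pos
        })
    return relations

# Hypothetical sample IDs, in row order, for three daily snapshots.
day1_ids = ["a", "b", "c"]
day2_ids = ["b", "c", "d"]
day3_ids = ["d", "b"]

relations = build_relations([day1_ids, day2_ids, day3_ids])
print(relations)
```

The result has one entry fewer than the number of datasets and would be passed as `AlignedUMAP().fit(datasets, relations=relations)`.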
Reproducibility
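To get reproducible results, set the random_state parameter:
reducer = umap.UMAP(random_state=42)
UMAP uses stochastic optimization, so results vary slightly between runs without a fixed random state. Note the trade-off: with random_state set, umap-learn runs single-threaded, so fitting large datasets takes longer than an unseeded, parallel run.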
Common Issues and Solutions
Issue: Disconnected components or fragmented clusters
- Solution: Increase n_neighbors to emphasize more global structure
Issue: Clusters too spread out or not well separated
- Solution: Decrease min_dist to allow tighter packing
Issue: Poor clustering results
- Solution: Use clustering-specific parameters (n_neighbors=30, min_dist=0.0, n_components=5-10)
Issue: Transform results differ significantly from training
- Solution: Ensure test data distribution matches training, or use Parametric UMAP
Issue: Slow performance or high memory use on large datasets
- Solution: Reduce dimensionality with PCA first. Keep low_memory=True (the default) when memory is the bottleneck; it lowers memory use rather than run time
Issue: NaN or inf values in input data
- Solution: Impute or drop invalid rows before fitting (0.5.6+ accepts NaN/inf in some paths, but clean numeric input is still recommended)
Issue: All points collapsed to single cluster
- Solution: Check data preprocessing (ensure proper scaling), increase min_dist
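For the NaN/inf issue, dropping invalid rows can be sketched with NumPy alone. The helper name `drop_non_finite_rows` is hypothetical. The returned mask matters because any labels or metadata must be filtered with the same mask to stay aligned with the cleaned rows.

```python
import numpy as np

def drop_non_finite_rows(X):
    """Return (clean rows, boolean mask of kept rows); rows containing NaN or +/-inf are dropped."""
    X = np.asarray(X, dtype=float)
    keep = np.isfinite(X).all(axis=1)
    return X[keep], keep

# Hypothetical input with one NaN row and one inf row.
X = np.array([
    [1.0, 2.0],
    [np.nan, 3.0],
    [4.0, np.inf],
    [5.0, 6.0],
])
X_clean, keep = drop_non_finite_rows(X)
print(X_clean.shape, keep)
```

Imputation is the alternative when too many rows would be lost; which is appropriate depends on the dataset.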
Resources
Official documentation
- UMAP user guide
- Release notes
- PyPI package (current stable: 0.5.12)
- GitHub repository
references/
Contains detailed API documentation:
api_reference.md: Complete UMAP class parameters and methods
Load these references when detailed parameter information or advanced method usage is needed.
