Classification Modeling
About
This skill enables developers to build classification models using algorithms like logistic regression, decision trees, and ensemble methods. It supports binary and multiclass problems for predicting categorical outcomes. Use it when you need to categorize data into discrete classes and evaluate performance with metrics like accuracy, precision, and recall.
Documentation
Classification Modeling
Classification modeling predicts categorical target values, assigning observations to discrete classes or categories based on input features.
Classification Types
- Binary Classification: Two classes (yes/no, success/failure)
- Multiclass: More than two classes
- Multi-label: Multiple classes per observation
Common Algorithms
- Logistic Regression: Linear classification
- Decision Trees: Rule-based non-linear
- Random Forest: Ensemble of decision trees
- Gradient Boosting: Sequential tree building
- SVM: Support Vector Machines
- Naive Bayes: Probabilistic classifier
Key Metrics
- Accuracy: Overall correct predictions
- Precision: True positives / (true + false positives)
- Recall: True positives / (true + false negatives)
- F1-Score: Harmonic mean of precision/recall
- AUC-ROC: Area under receiver operating characteristic curve
Implementation with Python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import (
confusion_matrix, classification_report, roc_auc_score, roc_curve,
precision_recall_curve, f1_score, accuracy_score
)
import seaborn as sns
# Generate sample binary classification data
np.random.seed(42)
from sklearn.datasets import make_classification
X, y = make_classification(
n_samples=1000, n_features=20, n_informative=10,
n_redundant=5, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
# Standardize features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Logistic Regression
lr_model = LogisticRegression(max_iter=1000)
lr_model.fit(X_train_scaled, y_train)
y_pred_lr = lr_model.predict(X_test_scaled)
y_proba_lr = lr_model.predict_proba(X_test_scaled)[:, 1]
print("Logistic Regression:")
print(classification_report(y_test, y_pred_lr))
print(f"AUC-ROC: {roc_auc_score(y_test, y_proba_lr):.4f}\n")
# Decision Tree
dt_model = DecisionTreeClassifier(max_depth=10, random_state=42)
dt_model.fit(X_train, y_train)
y_pred_dt = dt_model.predict(X_test)
y_proba_dt = dt_model.predict_proba(X_test)[:, 1]
print("Decision Tree:")
print(classification_report(y_test, y_pred_dt))
print(f"AUC-ROC: {roc_auc_score(y_test, y_proba_dt):.4f}\n")
# Random Forest
rf_model = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=42)
rf_model.fit(X_train, y_train)
y_pred_rf = rf_model.predict(X_test)
y_proba_rf = rf_model.predict_proba(X_test)[:, 1]
print("Random Forest:")
print(classification_report(y_test, y_pred_rf))
print(f"AUC-ROC: {roc_auc_score(y_test, y_proba_rf):.4f}\n")
# Gradient Boosting
gb_model = GradientBoostingClassifier(n_estimators=100, max_depth=5, random_state=42)
gb_model.fit(X_train, y_train)
y_pred_gb = gb_model.predict(X_test)
y_proba_gb = gb_model.predict_proba(X_test)[:, 1]
print("Gradient Boosting:")
print(classification_report(y_test, y_pred_gb))
print(f"AUC-ROC: {roc_auc_score(y_test, y_proba_gb):.4f}\n")
# Confusion matrices
fig, axes = plt.subplots(2, 2, figsize=(12, 10))
models = [
(y_pred_lr, 'Logistic Regression'),
(y_pred_dt, 'Decision Tree'),
(y_pred_rf, 'Random Forest'),
(y_pred_gb, 'Gradient Boosting'),
]
for idx, (y_pred, title) in enumerate(models):
cm = confusion_matrix(y_test, y_pred)
ax = axes[idx // 2, idx % 2]
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', ax=ax)
ax.set_title(title)
ax.set_ylabel('True Label')
ax.set_xlabel('Predicted Label')
plt.tight_layout()
plt.show()
# ROC Curves
plt.figure(figsize=(10, 8))
probas = [
(y_proba_lr, 'Logistic Regression'),
(y_proba_dt, 'Decision Tree'),
(y_proba_rf, 'Random Forest'),
(y_proba_gb, 'Gradient Boosting'),
]
for y_proba, label in probas:
fpr, tpr, _ = roc_curve(y_test, y_proba)
auc = roc_auc_score(y_test, y_proba)
plt.plot(fpr, tpr, label=f'{label} (AUC={auc:.4f})')
plt.plot([0, 1], [0, 1], 'k--', label='Random Classifier')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curves Comparison')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()
# Precision-Recall Curves
plt.figure(figsize=(10, 8))
for y_proba, label in probas:
precision, recall, _ = precision_recall_curve(y_test, y_proba)
f1 = f1_score(y_test, (y_proba > 0.5).astype(int))
plt.plot(recall, precision, label=f'{label} (F1={f1:.4f})')
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title('Precision-Recall Curves')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()
# Feature importance
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
# Tree-based feature importance
feature_importance_rf = pd.Series(
rf_model.feature_importances_, index=range(X.shape[1])
).sort_values(ascending=False)
axes[0].barh(range(10), feature_importance_rf.values[:10])
axes[0].set_yticks(range(10))
axes[0].set_yticklabels([f'Feature {i}' for i in feature_importance_rf.index[:10]])
axes[0].set_title('Random Forest - Top 10 Features')
axes[0].set_xlabel('Importance')
# Logistic regression coefficients
lr_coef = pd.Series(lr_model.coef_[0], index=range(X.shape[1])).abs().sort_values(ascending=False)
axes[1].barh(range(10), lr_coef.values[:10])
axes[1].set_yticks(range(10))
axes[1].set_yticklabels([f'Feature {i}' for i in lr_coef.index[:10]])
axes[1].set_title('Logistic Regression - Top 10 Features (abs coef)')
axes[1].set_xlabel('Absolute Coefficient')
plt.tight_layout()
plt.show()
# Model comparison
results = pd.DataFrame({
'Model': ['Logistic Regression', 'Decision Tree', 'Random Forest', 'Gradient Boosting'],
'Accuracy': [
accuracy_score(y_test, y_pred_lr),
accuracy_score(y_test, y_pred_dt),
accuracy_score(y_test, y_pred_rf),
accuracy_score(y_test, y_pred_gb),
],
'AUC-ROC': [
roc_auc_score(y_test, y_proba_lr),
roc_auc_score(y_test, y_proba_dt),
roc_auc_score(y_test, y_proba_rf),
roc_auc_score(y_test, y_proba_gb),
],
'F1-Score': [
f1_score(y_test, y_pred_lr),
f1_score(y_test, y_pred_dt),
f1_score(y_test, y_pred_rf),
f1_score(y_test, y_pred_gb),
]
})
print("Model Comparison:")
print(results)
# Cross-validation
cv_scores = cross_val_score(
RandomForestClassifier(n_estimators=100, random_state=42),
X_train, y_train, cv=5, scoring='roc_auc'
)
print(f"\nCross-validation AUC scores: {cv_scores}")
print(f"Mean CV AUC: {cv_scores.mean():.4f} (+/- {cv_scores.std():.4f})")
# Probability calibration
from sklearn.calibration import calibration_curve
prob_true, prob_pred = calibration_curve(y_test, y_proba_rf, n_bins=10)
plt.figure(figsize=(8, 6))
plt.plot(prob_pred, prob_true, 'o-', label='Random Forest')
plt.plot([0, 1], [0, 1], 'k--', label='Perfect Calibration')
plt.xlabel('Mean Predicted Probability')
plt.ylabel('Fraction of Positives')
plt.title('Calibration Curve')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()
Class Imbalance Handling
- Oversampling: Increase minority class samples
- Undersampling: Reduce majority class samples
- SMOTE: Synthetic minority oversampling
- Class weights: Penalize misclassifying minority class
Threshold Selection
- Default (0.5): Equal misclassification cost
- Custom threshold: Based on business requirements
- Optimal: Maximizing F1-score or AUC
Deliverables
- Classification metrics (accuracy, precision, recall, F1)
- Confusion matrices for all models
- ROC and Precision-Recall curves
- Feature importance analysis
- Model comparison table
- Recommendations for best model
- Probability calibration plots
Quick Install
/plugin add https://github.com/aj-geddes/useful-ai-prompts/tree/main/classification-modelingCopy and paste this command in Claude Code to install this skill
GitHub 仓库
Related Skills
langchain
MetaLangChain is a framework for building LLM applications using agents, chains, and RAG pipelines. It supports multiple LLM providers, offers 500+ integrations, and includes features like tool calling and memory management. Use it for rapid prototyping and deploying production systems like chatbots, autonomous agents, and question-answering services.
Algorithmic Art Generation
MetaThis skill helps developers create algorithmic art using p5.js, focusing on generative art, computational aesthetics, and interactive visualizations. It automatically activates for topics like "generative art" or "p5.js visualization" and guides you through creating unique algorithms with features like seeded randomness, flow fields, and particle systems. Use it when you need to build reproducible, code-driven artistic patterns.
webapp-testing
TestingThis Claude Skill provides a Playwright-based toolkit for testing local web applications through Python scripts. It enables frontend verification, UI debugging, screenshot capture, and log viewing while managing server lifecycles. Use it for browser automation tasks but run scripts directly rather than reading their source code to avoid context pollution.
finishing-a-development-branch
TestingThis skill helps developers complete finished work by verifying tests pass and then presenting structured integration options. It guides the workflow for merging, creating PRs, or cleaning up branches after implementation is done. Use it when your code is ready and tested to systematically finalize the development process.
