Feature Engineering
About
This Claude Skill enables feature engineering for machine learning by creating and transforming features using techniques like encoding, scaling, and polynomial features. It helps developers improve model performance and interpretability by applying domain-specific transformations and mathematical operations. Use it to preprocess data and generate more predictive input features for your ML models.
Documentation
Feature Engineering
Feature engineering creates and transforms features to improve model performance, interpretability, and generalization through domain knowledge and mathematical transformations.
Engineering Techniques
- Encoding: Converting categorical to numerical
- Scaling: Normalizing feature ranges
- Polynomial Features: Higher-order terms
- Interactions: Combining features
- Domain-specific: Business-relevant transformations
- Temporal: Time-based features
Key Principles
- Create features based on domain knowledge
- Remove redundant features
- Scale features appropriately
- Handle categorical variables
- Create meaningful interactions
Implementation with Python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import (
StandardScaler, MinMaxScaler, RobustScaler, PolynomialFeatures,
OneHotEncoder, OrdinalEncoder, LabelEncoder
)
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
import seaborn as sns
# Create sample dataset
np.random.seed(42)
df = pd.DataFrame({
'age': np.random.uniform(18, 80, 1000),
'income': np.random.uniform(20000, 150000, 1000),
'experience_years': np.random.uniform(0, 50, 1000),
'category': np.random.choice(['A', 'B', 'C'], 1000),
'city': np.random.choice(['NYC', 'LA', 'Chicago'], 1000),
'purchased': np.random.choice([0, 1], 1000),
})
print("Original Data:")
print(df.head())
print(df.info())
# 1. Categorical Encoding
# One-Hot Encoding
print("\n1. One-Hot Encoding:")
df_ohe = pd.get_dummies(df, columns=['category', 'city'], drop_first=True)
print(df_ohe.head())
# Ordinal Encoding
print("\n2. Ordinal Encoding:")
ordinal_encoder = OrdinalEncoder()
df['category_ordinal'] = ordinal_encoder.fit_transform(df[['category']])
print(df[['category', 'category_ordinal']].head())
# Label Encoding
print("\n3. Label Encoding:")
le = LabelEncoder()
df['city_encoded'] = le.fit_transform(df['city'])
print(df[['city', 'city_encoded']].head())
# 2. Feature Scaling
print("\n4. Feature Scaling:")
X = df[['age', 'income', 'experience_years']].copy()
# StandardScaler (mean=0, std=1)
scaler = StandardScaler()
X_standard = scaler.fit_transform(X)
# MinMaxScaler [0, 1]
minmax_scaler = MinMaxScaler()
X_minmax = minmax_scaler.fit_transform(X)
# RobustScaler (resistant to outliers)
robust_scaler = RobustScaler()
X_robust = robust_scaler.fit_transform(X)
# Visualization
fig, axes = plt.subplots(2, 2, figsize=(12, 8))
axes[0, 0].hist(X['age'], bins=30, edgecolor='black')
axes[0, 0].set_title('Original Age')
axes[0, 1].hist(X_standard[:, 0], bins=30, edgecolor='black')
axes[0, 1].set_title('StandardScaler Age')
axes[1, 0].hist(X_minmax[:, 0], bins=30, edgecolor='black')
axes[1, 0].set_title('MinMaxScaler Age')
axes[1, 1].hist(X_robust[:, 0], bins=30, edgecolor='black')
axes[1, 1].set_title('RobustScaler Age')
plt.tight_layout()
plt.show()
# 3. Polynomial Features
print("\n5. Polynomial Features:")
X_simple = df[['age']].copy()
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X_simple)
X_poly_df = pd.DataFrame(X_poly, columns=['age', 'age^2'])
print(X_poly_df.head())
# Visualization
plt.figure(figsize=(12, 5))
plt.scatter(df['age'], df['income'], alpha=0.5)
plt.xlabel('Age')
plt.ylabel('Income')
plt.title('Age vs Income')
plt.grid(True, alpha=0.3)
plt.show()
# 4. Feature Interactions
print("\n6. Feature Interactions:")
df['age_income_interaction'] = df['age'] * df['income'] / 10000
df['age_experience_ratio'] = df['age'] / (df['experience_years'] + 1)
print(df[['age', 'income', 'age_income_interaction', 'age_experience_ratio']].head())
# 5. Domain-specific Transformations
print("\n7. Domain-specific Features:")
df['age_group'] = pd.cut(df['age'], bins=[0, 30, 45, 60, 100],
labels=['Young', 'Middle', 'Senior', 'Retired'])
df['income_level'] = pd.qcut(df['income'], q=3, labels=['Low', 'Medium', 'High'])
df['log_income'] = np.log1p(df['income'])
df['sqrt_experience'] = np.sqrt(df['experience_years'])
print(df[['age', 'age_group', 'income', 'income_level', 'log_income']].head())
# 6. Temporal Features (if date data available)
print("\n8. Temporal Features:")
dates = pd.date_range('2023-01-01', periods=len(df))
df['date'] = dates
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
df['day_of_week'] = df['date'].dt.dayofweek
df['quarter'] = df['date'].dt.quarter
df['is_weekend'] = df['date'].dt.dayofweek >= 5
print(df[['date', 'year', 'month', 'day_of_week', 'is_weekend']].head())
# 7. Feature Standardization Pipeline
print("\n9. Feature Engineering Pipeline:")
# Separate numerical and categorical features
numerical_features = ['age', 'income', 'experience_years']
categorical_features = ['category', 'city']
# Create preprocessing pipeline
preprocessor = ColumnTransformer(
transformers=[
('num', StandardScaler(), numerical_features),
('cat', OneHotEncoder(drop='first'), categorical_features),
]
)
X_processed = preprocessor.fit_transform(df[numerical_features + categorical_features])
print(f"Processed shape: {X_processed.shape}")
# 8. Feature Statistics
print("\n10. Feature Statistics:")
X_for_stats = df[numerical_features].copy()
X_for_stats['category_A'] = (df['category'] == 'A').astype(int)
X_for_stats['city_NYC'] = (df['city'] == 'NYC').astype(int)
feature_stats = pd.DataFrame({
'Feature': X_for_stats.columns,
'Mean': X_for_stats.mean(),
'Std': X_for_stats.std(),
'Min': X_for_stats.min(),
'Max': X_for_stats.max(),
'Skewness': X_for_stats.skew(),
'Kurtosis': X_for_stats.kurtosis(),
})
print(feature_stats)
# 9. Feature Correlations
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
X_numeric = df[numerical_features].copy()
X_numeric['purchased'] = df['purchased']
corr_matrix = X_numeric.corr()
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', center=0, ax=axes[0])
axes[0].set_title('Feature Correlation Matrix')
# Distribution of engineered features
axes[1].hist(df['age_income_interaction'], bins=30, edgecolor='black', alpha=0.7)
axes[1].set_title('Age-Income Interaction Distribution')
axes[1].set_xlabel('Value')
axes[1].set_ylabel('Frequency')
plt.tight_layout()
plt.show()
# 10. Feature Binning / Discretization
print("\n11. Feature Binning:")
df['age_bin_equal'] = pd.cut(df['age'], bins=5)
df['age_bin_quantile'] = pd.qcut(df['age'], q=5)
df['income_bins'] = pd.cut(df['income'], bins=[0, 50000, 100000, 150000])
print("Equal Width Binning:")
print(df['age_bin_equal'].value_counts().sort_index())
print("\nEqual Frequency Binning:")
print(df['age_bin_quantile'].value_counts().sort_index())
# 11. Missing Value Creation and Handling
print("\n12. Missing Value Imputation:")
df_with_missing = df.copy()
missing_indices = np.random.choice(len(df), 50, replace=False)
df_with_missing.loc[missing_indices, 'age'] = np.nan
# Mean imputation
age_mean = df_with_missing['age'].mean()
df_with_missing['age_imputed_mean'] = df_with_missing['age'].fillna(age_mean)
# Median imputation
age_median = df_with_missing['age'].median()
df_with_missing['age_imputed_median'] = df_with_missing['age'].fillna(age_median)
# Forward fill
df_with_missing['age_imputed_ffill'] = df_with_missing['age'].fillna(method='ffill')
print(df_with_missing[['age', 'age_imputed_mean', 'age_imputed_median']].head(10))
print("\nFeature Engineering Complete!")
print(f"Original features: {len(df.columns) - 5}")
print(f"Final features available: {len(df.columns)}")
Best Practices
- Understand your domain before engineering features
- Create features that are interpretable
- Avoid data leakage (using future information)
- Test feature importance after engineering
- Document all transformations
- Use appropriate scaling for different algorithms
Common Transformations
- Log Transform: For skewed distributions
- Polynomial Features: For non-linear relationships
- Interaction Terms: For combined effects
- Binning: For categorical approximation
- Normalization: For comparison across scales
Deliverables
- Engineered feature dataset
- Feature transformation documentation
- Correlation analysis of new features
- Distribution comparisons (before/after)
- Feature importance rankings
- Preprocessing pipeline code
- Data dictionary with feature descriptions
Quick Install
/plugin add https://github.com/aj-geddes/useful-ai-prompts/tree/main/feature-engineeringCopy and paste this command in Claude Code to install this skill
GitHub 仓库
Related Skills
sglang
MetaSGLang is a high-performance LLM serving framework that specializes in fast, structured generation for JSON, regex, and agentic workflows using its RadixAttention prefix caching. It delivers significantly faster inference, especially for tasks with repeated prefixes, making it ideal for complex, structured outputs and multi-turn conversations. Choose SGLang over alternatives like vLLM when you need constrained decoding or are building applications with extensive prefix sharing.
evaluating-llms-harness
TestingThis Claude Skill runs the lm-evaluation-harness to benchmark LLMs across 60+ standardized academic tasks like MMLU and GSM8K. It's designed for developers to compare model quality, track training progress, or report academic results. The tool supports various backends including HuggingFace and vLLM models.
llamaguard
OtherLlamaGuard is Meta's 7-8B parameter model for moderating LLM inputs and outputs across six safety categories like violence and hate speech. It offers 94-95% accuracy and can be deployed using vLLM, Hugging Face, or Amazon SageMaker. Use this skill to easily integrate content filtering and safety guardrails into your AI applications.
langchain
MetaLangChain is a framework for building LLM applications using agents, chains, and RAG pipelines. It supports multiple LLM providers, offers 500+ integrations, and includes features like tool calling and memory management. Use it for rapid prototyping and deploying production systems like chatbots, autonomous agents, and question-answering services.
