Feature Engineering

aj-geddes

About

This Claude Skill enables feature engineering for machine learning by creating and transforming features using techniques like encoding, scaling, and polynomial features. It helps developers improve model performance and interpretability by applying domain-specific transformations and mathematical operations. Use it to preprocess data and generate more predictive input features for your ML models.

Documentation

Feature Engineering

Feature engineering creates and transforms features to improve model performance, interpretability, and generalization through domain knowledge and mathematical transformations.

Engineering Techniques

Encoding: Converting categorical to numerical
Scaling: Normalizing feature ranges
Polynomial Features: Higher-order terms
Interactions: Combining features
Domain-specific: Business-relevant transformations
Temporal: Time-based features

Key Principles

Create features based on domain knowledge
Remove redundant features
Scale features appropriately
Handle categorical variables
Create meaningful interactions
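
As one illustration of the "remove redundant features" principle, the sketch below drops one column from each highly correlated pair. The helper name `drop_highly_correlated`, the 0.95 threshold, and the demo columns are assumptions made for this example and are not part of the skill itself. Pearson correlation only catches linear redundancy, so treat this as a starting point rather than a complete method.

```python
import numpy as np
import pandas as pd

def drop_highly_correlated(df: pd.DataFrame, threshold: float = 0.95) -> pd.DataFrame:
    """Drop the later column of each pair whose absolute Pearson correlation exceeds threshold."""
    corr = df.corr().abs()
    # Keep only the strict upper triangle so each pair is inspected once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop)

# Hypothetical demo data: 'age_months' is an exact linear copy of 'age'
rng = np.random.default_rng(0)
age = rng.uniform(18, 80, 200)
demo = pd.DataFrame({
    "age": age,
    "age_months": age * 12,
    "income": rng.uniform(20_000, 150_000, 200),
})

reduced = drop_highly_correlated(demo)
print(list(reduced.columns))
```

The threshold is a judgment call: a lower value removes more columns but risks discarding features that carry some independent signal.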

Implementation with Python

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import (
    StandardScaler, MinMaxScaler, RobustScaler, PolynomialFeatures,
    OneHotEncoder, OrdinalEncoder, LabelEncoder
)
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
import seaborn as sns

# Create sample dataset
np.random.seed(42)
df = pd.DataFrame({
    'age': np.random.uniform(18, 80, 1000),
    'income': np.random.uniform(20000, 150000, 1000),
    'experience_years': np.random.uniform(0, 50, 1000),
    'category': np.random.choice(['A', 'B', 'C'], 1000),
    'city': np.random.choice(['NYC', 'LA', 'Chicago'], 1000),
    'purchased': np.random.choice([0, 1], 1000),
})

print("Original Data:")
print(df.head())
print(df.info())

# 1. Categorical Encoding
# One-Hot Encoding
print("\n1. One-Hot Encoding:")
df_ohe = pd.get_dummies(df, columns=['category', 'city'], drop_first=True)
print(df_ohe.head())

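# Ordinal Encoding
# By default categories are ordered alphabetically; pass categories=[[...]]
# when the levels have a meaningful order.
print("\n2. Ordinal Encoding:")
ordinal_encoder = OrdinalEncoder()
# fit_transform returns a 2-D array of shape (n, 1); flatten it into a single column
df['category_ordinal'] = ordinal_encoder.fit_transform(df[['category']]).ravel()
print(df[['category', 'category_ordinal']].head())

# Label Encoding
# LabelEncoder is designed for target labels; it is applied to a feature here
# for illustration only. Prefer OrdinalEncoder or OneHotEncoder for inputs.
print("\n3. Label Encoding:")
le = LabelEncoder()
df['city_encoded'] = le.fit_transform(df['city'])
print(df[['city', 'city_encoded']].head())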

# 2. Feature Scaling
print("\n4. Feature Scaling:")
X = df[['age', 'income', 'experience_years']].copy()

# StandardScaler (mean=0, std=1)
scaler = StandardScaler()
X_standard = scaler.fit_transform(X)

# MinMaxScaler [0, 1]
minmax_scaler = MinMaxScaler()
X_minmax = minmax_scaler.fit_transform(X)

# RobustScaler (resistant to outliers)
robust_scaler = RobustScaler()
X_robust = robust_scaler.fit_transform(X)

# Visualization
fig, axes = plt.subplots(2, 2, figsize=(12, 8))

axes[0, 0].hist(X['age'], bins=30, edgecolor='black')
axes[0, 0].set_title('Original Age')

axes[0, 1].hist(X_standard[:, 0], bins=30, edgecolor='black')
axes[0, 1].set_title('StandardScaler Age')

axes[1, 0].hist(X_minmax[:, 0], bins=30, edgecolor='black')
axes[1, 0].set_title('MinMaxScaler Age')

axes[1, 1].hist(X_robust[:, 0], bins=30, edgecolor='black')
axes[1, 1].set_title('RobustScaler Age')

plt.tight_layout()
plt.show()

# 3. Polynomial Features
print("\n5. Polynomial Features:")
X_simple = df[['age']].copy()
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X_simple)
# Take column names from the transformer instead of hard-coding them
X_poly_df = pd.DataFrame(X_poly, columns=poly.get_feature_names_out())
print(X_poly_df.head())

# Visualization
plt.figure(figsize=(12, 5))
plt.scatter(df['age'], df['income'], alpha=0.5)
plt.xlabel('Age')
plt.ylabel('Income')
plt.title('Age vs Income')
plt.grid(True, alpha=0.3)
plt.show()

# 4. Feature Interactions
print("\n6. Feature Interactions:")
df['age_income_interaction'] = df['age'] * df['income'] / 10000
df['age_experience_ratio'] = df['age'] / (df['experience_years'] + 1)
print(df[['age', 'income', 'age_income_interaction', 'age_experience_ratio']].head())

# 5. Domain-specific Transformations
print("\n7. Domain-specific Features:")
df['age_group'] = pd.cut(df['age'], bins=[0, 30, 45, 60, 100],
                          labels=['Young', 'Middle', 'Senior', 'Retired'])
df['income_level'] = pd.qcut(df['income'], q=3, labels=['Low', 'Medium', 'High'])
df['log_income'] = np.log1p(df['income'])
df['sqrt_experience'] = np.sqrt(df['experience_years'])

print(df[['age', 'age_group', 'income', 'income_level', 'log_income']].head())

# 6. Temporal Features (if date data available)
print("\n8. Temporal Features:")
dates = pd.date_range('2023-01-01', periods=len(df))
df['date'] = dates
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
df['day_of_week'] = df['date'].dt.dayofweek
df['quarter'] = df['date'].dt.quarter
df['is_weekend'] = df['date'].dt.dayofweek >= 5

print(df[['date', 'year', 'month', 'day_of_week', 'is_weekend']].head())

# 7. Feature Standardization Pipeline
print("\n9. Feature Engineering Pipeline:")

# Separate numerical and categorical features
numerical_features = ['age', 'income', 'experience_years']
categorical_features = ['category', 'city']

# Create preprocessing pipeline
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numerical_features),
        ('cat', OneHotEncoder(drop='first'), categorical_features),
    ]
)

X_processed = preprocessor.fit_transform(df[numerical_features + categorical_features])
print(f"Processed shape: {X_processed.shape}")

# 8. Feature Statistics
print("\n10. Feature Statistics:")
X_for_stats = df[numerical_features].copy()
X_for_stats['category_A'] = (df['category'] == 'A').astype(int)
X_for_stats['city_NYC'] = (df['city'] == 'NYC').astype(int)

feature_stats = pd.DataFrame({
    'Feature': X_for_stats.columns,
    'Mean': X_for_stats.mean(),
    'Std': X_for_stats.std(),
    'Min': X_for_stats.min(),
    'Max': X_for_stats.max(),
    'Skewness': X_for_stats.skew(),
    'Kurtosis': X_for_stats.kurtosis(),
})

print(feature_stats)

# 9. Feature Correlations
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

X_numeric = df[numerical_features].copy()
X_numeric['purchased'] = df['purchased']
corr_matrix = X_numeric.corr()

sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', center=0, ax=axes[0])
axes[0].set_title('Feature Correlation Matrix')

# Distribution of engineered features
axes[1].hist(df['age_income_interaction'], bins=30, edgecolor='black', alpha=0.7)
axes[1].set_title('Age-Income Interaction Distribution')
axes[1].set_xlabel('Value')
axes[1].set_ylabel('Frequency')

plt.tight_layout()
plt.show()

# 10. Feature Binning / Discretization
print("\n11. Feature Binning:")
df['age_bin_equal'] = pd.cut(df['age'], bins=5)
df['age_bin_quantile'] = pd.qcut(df['age'], q=5)
df['income_bins'] = pd.cut(df['income'], bins=[0, 50000, 100000, 150000])

print("Equal Width Binning:")
print(df['age_bin_equal'].value_counts().sort_index())

print("\nEqual Frequency Binning:")
print(df['age_bin_quantile'].value_counts().sort_index())

# 11. Missing Value Creation and Handling
print("\n12. Missing Value Imputation:")
df_with_missing = df.copy()
missing_indices = np.random.choice(len(df), 50, replace=False)
df_with_missing.loc[missing_indices, 'age'] = np.nan

# Mean imputation
age_mean = df_with_missing['age'].mean()
df_with_missing['age_imputed_mean'] = df_with_missing['age'].fillna(age_mean)

# Median imputation
age_median = df_with_missing['age'].median()
df_with_missing['age_imputed_median'] = df_with_missing['age'].fillna(age_median)

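# Forward fill (only meaningful when row order carries information, e.g. time series;
# a missing value in the first row stays NaN).
# Series.ffill() replaces the deprecated fillna(method='ffill').
df_with_missing['age_imputed_ffill'] = df_with_missing['age'].ffill()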

print(df_with_missing[['age', 'age_imputed_mean', 'age_imputed_median']].head(10))

print("\nFeature Engineering Complete!")
# The sample dataset was created with 6 columns (5 inputs plus the 'purchased' target)
print("Original features: 6")
print(f"Final features available: {len(df.columns)}")

Best Practices

Understand your domain before engineering features
Create features that are interpretable
Avoid data leakage (e.g., using future information, or fitting scalers and encoders on test data)
Test feature importance after engineering
Document all transformations
Use appropriate scaling for different algorithms
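
The leakage and feature-importance points above can be sketched together. The example below is a minimal illustration under stated assumptions: the synthetic data, the income-only target rule, and the choice of `LogisticRegression` are hypothetical and chosen only to keep the example short. The preprocessing lives inside a `Pipeline`, so scaler and encoder statistics are learned from the training rows only, and `permutation_importance` is then computed on the held-out rows.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical data: the target depends on income only
rng = np.random.default_rng(42)
n = 500
data = pd.DataFrame({
    "age": rng.uniform(18, 80, n),
    "income": rng.uniform(20_000, 150_000, n),
    "city": rng.choice(["NYC", "LA", "Chicago"], n),
})
y = (data["income"] > 85_000).astype(int)

# Split BEFORE fitting any transformer so test rows never influence the statistics
X_train, X_test, y_train, y_test = train_test_split(
    data, y, test_size=0.25, random_state=0
)

preprocessor = ColumnTransformer([
    ("num", StandardScaler(), ["age", "income"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["city"]),
])
model = Pipeline([
    ("prep", preprocessor),
    ("clf", LogisticRegression(max_iter=1000)),
])

model.fit(X_train, y_train)  # scaler and encoder are fitted on training rows only
test_accuracy = model.score(X_test, y_test)

# Permutation importance: drop in score when one input column is shuffled
result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=0)
importances = pd.Series(result.importances_mean, index=X_test.columns)
importances = importances.sort_values(ascending=False)
print(f"test accuracy: {test_accuracy:.3f}")
print(importances)
```

Because the importance is measured on the raw input columns, it ranks the original features rather than the one-hot columns, which tends to be easier to document.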

Common Transformations

Log Transform: For skewed distributions
Polynomial Features: For non-linear relationships
Interaction Terms: For combined effects
Binning: For converting continuous values into discrete categories
Normalization: For comparison across scales
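
To make the log-transform entry concrete, the following sketch compares skewness before and after `np.log1p`. The log-normal sample is a hypothetical stand-in for a right-skewed variable such as income; real data may respond differently, so it is worth checking the skewness yourself rather than assuming the transform helps.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed data drawn from a log-normal distribution
rng = np.random.default_rng(0)
income = pd.Series(rng.lognormal(mean=10.5, sigma=0.8, size=5_000), name="income")

# log1p computes log(1 + x), which is also safe for zero values
log_income = np.log1p(income)

skew_before = income.skew()
skew_after = log_income.skew()
print(f"skewness before: {skew_before:.2f}, after: {skew_after:.2f}")
```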

Deliverables

Engineered feature dataset
Feature transformation documentation
Correlation analysis of new features
Distribution comparisons (before/after)
Feature importance rankings
Preprocessing pipeline code
Data dictionary with feature descriptions
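
One possible way to start the data dictionary deliverable is to generate a skeleton from the engineered DataFrame and fill in descriptions by hand. The helper `build_data_dictionary`, its column layout, and the demo data are assumptions made for this sketch, not a format prescribed by the skill.

```python
import pandas as pd

def build_data_dictionary(df: pd.DataFrame, descriptions: dict) -> pd.DataFrame:
    """Summarize each column; features without a description get a TODO marker."""
    rows = []
    for col in df.columns:
        rows.append({
            "feature": col,
            "dtype": str(df[col].dtype),
            "n_missing": int(df[col].isna().sum()),
            "n_unique": int(df[col].nunique()),
            "description": descriptions.get(col, "TODO: describe"),
        })
    return pd.DataFrame(rows)

# Hypothetical engineered features
demo = pd.DataFrame({
    "age": [25.0, 40.0, None],
    "log_income": [10.3, 11.1, 11.6],
})
data_dictionary = build_data_dictionary(
    demo, {"log_income": "np.log1p of annual income"}
)
print(data_dictionary)
```

The TODO marker makes undocumented features easy to spot before the dictionary is handed over.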

Quick Install

/plugin add https://github.com/aj-geddes/useful-ai-prompts/tree/main/feature-engineering

Copy and paste this command in Claude Code to install this skill

GitHub Repository

aj-geddes/useful-ai-prompts

Path: skills/feature-engineering
