detecting-data-anomalies

jeremylongshore


About

This skill detects anomalies and outliers in datasets using machine learning algorithms like those in scikit-learn. Use it when analyzing data for unusual patterns or unexpected deviations from normal behavior. It's triggered with phrases like "detect anomalies" and requires a prepared dataset in formats like CSV or JSON.

Quick Install

Claude Code

Plugin Command (Recommended)

/plugin add https://github.com/jeremylongshore/claude-code-plugins-plus

Git Clone (Alternative)

git clone https://github.com/jeremylongshore/claude-code-plugins-plus.git ~/.claude/skills/detecting-data-anomalies

Copy and paste this command in Claude Code to install this skill

Documentation

Prerequisites

Before using this skill, ensure you have:

Dataset in accessible format (CSV, JSON, or database)
Python environment with scikit-learn or similar ML libraries
Understanding of data distribution and expected patterns
Sufficient data volume for statistical significance
Knowledge of domain-specific normal behavior
Data preprocessing capabilities for cleaning and scaling

Instructions

Step 1: Prepare Data for Analysis

Set up the dataset for anomaly detection:

Load dataset using Read tool
Inspect data structure and identify relevant features
Clean data by handling missing values and inconsistencies
Normalize or scale features as appropriate for algorithm
Split temporal data if time-series analysis is needed
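
The preparation steps above can be sketched roughly as follows, assuming pandas and scikit-learn are installed. The inline CSV, the column names (`timestamp`, `temperature`, `pressure`), and the `prepare_features` helper are hypothetical, and median imputation is only one of several reasonable ways to handle missing values.

```python
import io

import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical inline CSV standing in for a real dataset file.
RAW_CSV = """timestamp,temperature,pressure
2024-01-01 00:00,20.1,101.2
2024-01-01 01:00,20.4,
2024-01-01 02:00,19.8,101.0
2024-01-01 03:00,55.0,101.3
"""


def prepare_features(csv_source, feature_cols):
    """Load, clean, and scale the chosen feature columns."""
    df = pd.read_csv(csv_source, parse_dates=["timestamp"])
    # Keep rows in chronological order for any later time-based split.
    df = df.sort_values("timestamp").reset_index(drop=True)
    # Fill missing numeric values with each column's median (an assumption).
    df[feature_cols] = df[feature_cols].fillna(df[feature_cols].median())
    # Standardize so every feature has zero mean and unit variance.
    X = StandardScaler().fit_transform(df[feature_cols])
    return df, X


df, X = prepare_features(io.StringIO(RAW_CSV), ["temperature", "pressure"])
print(df)
print(X.round(2))
```

In practice `csv_source` would be a file path; `io.StringIO` is used here only to keep the sketch self-contained.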

Step 2: Select Detection Algorithm

Choose appropriate anomaly detection method based on data characteristics:

Isolation Forest: For high-dimensional data with complex anomalies
One-Class SVM: For clearly defined normal behavior patterns
Local Outlier Factor (LOF): For density-based anomaly detection
Statistical Methods: For simple univariate or multivariate analysis
Autoencoders: For complex patterns in large datasets
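
As an illustration of how this choice might look in code, the sketch below maps a method name to a scikit-learn estimator. The `make_detector` helper and its name strings are hypothetical; autoencoders and the purely statistical methods are left out because they are not single scikit-learn estimators. Using `nu` as a stand-in for the contamination rate in One-Class SVM is an approximation, not an equivalence.

```python
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor
from sklearn.svm import OneClassSVM


def make_detector(name, contamination=0.05, random_state=0):
    """Return an unfitted detector for the given (hypothetical) method name."""
    if name == "isolation_forest":
        return IsolationForest(contamination=contamination,
                               random_state=random_state)
    if name == "one_class_svm":
        # nu upper-bounds the fraction of training points treated as outliers,
        # so it plays a role similar to the contamination rate.
        return OneClassSVM(kernel="rbf", gamma="scale", nu=contamination)
    if name == "lof":
        return LocalOutlierFactor(n_neighbors=20, contamination=contamination)
    raise ValueError(f"unknown detector: {name!r}")


detector = make_detector("isolation_forest", contamination=0.02)
print(type(detector).__name__)
```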

Step 3: Configure Detection Parameters

Set algorithm parameters to balance sensitivity:

Define contamination rate (expected proportion of anomalies)
Set distance metrics appropriate for feature types
Configure threshold values for anomaly scoring
Establish validation strategy for parameter tuning
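
One way to connect the contamination rate to a scoring threshold is to take a quantile of the anomaly scores. The sketch below assumes scores where higher means more anomalous and uses NumPy only; `threshold_from_contamination` is a hypothetical helper, and the evenly spaced scores are placeholder data.

```python
import numpy as np


def threshold_from_contamination(scores, contamination):
    """Score cutoff such that roughly `contamination` of points exceed it."""
    if not 0.0 < contamination < 0.5:
        raise ValueError("contamination should be in (0, 0.5)")
    return float(np.quantile(scores, 1.0 - contamination))


# Placeholder scores; higher = more anomalous.
scores = np.arange(100, dtype=float)
threshold = threshold_from_contamination(scores, contamination=0.05)
is_anomaly = scores > threshold
print(threshold, int(is_anomaly.sum()))
```

Raising the contamination rate lowers the threshold and flags more points, which is the sensitivity trade-off this step is tuning.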

Step 4: Execute Anomaly Detection

Run the detection algorithm on prepared data:

Apply selected algorithm using Bash tool
Generate anomaly scores for each data point
Classify points as normal or anomalous based on threshold
Extract characteristics of identified anomalies
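
A minimal sketch of this step using scikit-learn's `IsolationForest` on synthetic data. The data, the four planted outliers, and the parameter values are illustrative assumptions, not settings prescribed by the skill.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic data: 200 normal points plus 4 planted outliers far from the cluster.
rng = np.random.default_rng(42)
normal = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
planted = np.array([[8.0, 8.0], [-8.0, 8.0], [8.0, -8.0], [-8.0, -8.0]])
X = np.vstack([normal, planted])

model = IsolationForest(n_estimators=200, contamination=0.03, random_state=0)
labels = model.fit_predict(X)            # -1 = anomaly, 1 = normal
anomaly_score = -model.score_samples(X)  # negated so higher = more anomalous

# Indices and feature values of the flagged points.
flagged = np.flatnonzero(labels == -1)
for i in flagged:
    print(i, X[i].round(2), round(float(anomaly_score[i]), 3))
```

`score_samples` returns lower values for more abnormal points, so the sign is flipped here to make the score easier to read.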

Step 5: Analyze and Report Results

Interpret detection results and provide insights:

Summarize number and distribution of anomalies
Highlight most significant outliers with context
Identify patterns or clusters among anomalies
Generate visualizations showing anomaly distribution
Provide recommendations for further investigation

Output

The skill produces comprehensive anomaly detection results:

Anomaly Summary Report

Total data points analyzed
Number of anomalies detected
Contamination rate (percentage of anomalies)
Algorithm used and configuration parameters
Confidence scores for detected anomalies
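
A summary of this shape could be assembled as a plain dictionary, as in the sketch below. The `build_summary` helper and its field names are hypothetical, labels follow the scikit-learn convention (-1 for anomalies), and the raw anomaly scores are reported as-is rather than as calibrated confidence values.

```python
import json


def build_summary(labels, scores, algorithm, params):
    """Collect headline numbers for an anomaly detection run."""
    total = len(labels)
    anomaly_idx = [i for i, lab in enumerate(labels) if lab == -1]
    rate = round(100.0 * len(anomaly_idx) / total, 2) if total else 0.0
    return {
        "total_points": total,
        "anomalies_detected": len(anomaly_idx),
        "contamination_rate_pct": rate,
        "algorithm": algorithm,
        "parameters": params,
        # Raw score per flagged record, keyed by row index.
        "anomaly_scores": {i: scores[i] for i in anomaly_idx},
    }


summary = build_summary(
    labels=[1, 1, -1, 1, -1],
    scores=[0.10, 0.20, 0.90, 0.15, 0.80],
    algorithm="IsolationForest",
    params={"contamination": 0.4},
)
print(json.dumps(summary, indent=2))
```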

Detailed Anomaly List

For each detected anomaly:

Record identifier and timestamp (if applicable)
Anomaly score and confidence level
Feature values showing deviation from normal
Contextual information about the outlier
Severity classification (low, medium, high, critical)
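
The severity classification could be derived from the anomaly score with fixed cutoffs. The sketch below assumes scores normalized to the range 0 to 1; the cutoff values and the `classify_severity` helper are hypothetical and would need tuning with domain knowledge.

```python
from bisect import bisect_right

SEVERITY_LEVELS = ["low", "medium", "high", "critical"]
# Hypothetical cutoffs on a score normalized to [0, 1].
SEVERITY_CUTOFFS = [0.5, 0.7, 0.9]


def classify_severity(score, cutoffs=SEVERITY_CUTOFFS):
    """Map a normalized anomaly score to a severity label."""
    return SEVERITY_LEVELS[bisect_right(cutoffs, score)]


for s in (0.3, 0.5, 0.75, 0.95):
    print(s, classify_severity(s))
```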

Statistical Analysis

Distribution of anomaly scores across dataset
Feature importance for anomaly classification
Comparison with normal data patterns
Temporal distribution of anomalies (if time-series)
Clustering analysis of anomaly types

Visualizations

Scatter plots highlighting anomalies in feature space
Time-series plots with anomaly markers
Distribution histograms comparing normal vs anomalous data
Heatmaps showing feature correlations for anomalies
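
As a rough sketch of the first visualization, the following draws a scatter plot with anomalies marked, assuming matplotlib and NumPy are available. The data is synthetic, the `plot_anomalies` helper is hypothetical, and the simple distance rule only stands in for a real detector's output.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np


def plot_anomalies(X, is_anomaly):
    """Scatter plot of two features with anomalies highlighted."""
    fig, ax = plt.subplots(figsize=(6, 4))
    ax.scatter(X[~is_anomaly, 0], X[~is_anomaly, 1], s=12, label="normal")
    ax.scatter(X[is_anomaly, 0], X[is_anomaly, 1], s=40, marker="x",
               color="red", label="anomaly")
    ax.set_xlabel("feature 0")
    ax.set_ylabel("feature 1")
    ax.set_title("Anomalies in feature space")
    ax.legend()
    return fig


rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
# Stand-in for detector output: flag points far from the origin.
is_anomaly = np.linalg.norm(X, axis=1) > 2.5
fig = plot_anomalies(X, is_anomaly)
# fig.savefig("anomalies.png")  # uncomment to write the image to disk
```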

Recommendations

Suggested follow-up investigations for critical anomalies
Data quality improvements to reduce false positives
Monitoring strategies for real-time detection
Algorithm refinements based on domain knowledge

Error Handling

Common issues and solutions:

Insufficient Data Volume

Error: Not enough data points for statistical significance
Solution: Collect more data, adjust contamination rate, or use simpler statistical methods
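
For small datasets, one example of a simpler statistical method is the modified z-score based on the median and the median absolute deviation (MAD). The sketch below uses only the standard library; the 3.5 cutoff is a commonly used rule of thumb, and the sample values are made up.

```python
from statistics import median


def modified_z_scores(values):
    """Modified z-score: 0.6745 * (x - median) / MAD."""
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        # No spread around the median; the score is undefined, so flag nothing.
        return [0.0 for _ in values]
    return [0.6745 * (v - med) / mad for v in values]


def flag_outliers(values, cutoff=3.5):
    """Indices whose absolute modified z-score exceeds the cutoff."""
    return [i for i, z in enumerate(modified_z_scores(values)) if abs(z) > cutoff]


readings = [10, 11, 9, 10, 12, 10, 11, 50]
print(flag_outliers(readings))
```

Because the median and MAD are barely affected by a single extreme value, this approach tends to stay usable where a mean and standard deviation would be distorted by the outlier itself.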

High False Positive Rate

Error: Too many normal points classified as anomalies
Solution: Adjust detection threshold, refine feature selection, or use domain-specific constraints

Algorithm Performance Issues

Error: Detection algorithm too slow for large datasets
Solution: Use sampling techniques, optimize parameters, or switch to faster algorithms like Isolation Forest

Feature Scaling Problems

Error: Anomaly scores dominated by high-magnitude features
Solution: Apply appropriate normalization or standardization to all features before detection
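
The effect can be seen in a small NumPy sketch. The age and income values are invented for illustration: the last row has an implausible age, but in raw units its distance from the mean is dwarfed by ordinary variation in income.

```python
import numpy as np

# Columns: age (years), income (currency units). Last row has an unusual age.
X = np.array([
    [30, 50000],
    [40, 52000],
    [35, 48000],
    [45, 51000],
    [38, 49000],
    [95, 50500],
], dtype=float)


def distance_from_mean(X):
    """Euclidean distance of each row from the column means."""
    return np.linalg.norm(X - X.mean(axis=0), axis=1)


raw_dist = distance_from_mean(X)

# Standardize each column to zero mean and unit variance.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
scaled_dist = distance_from_mean(X_std)

print("farthest row (raw):   ", int(raw_dist.argmax()))
print("farthest row (scaled):", int(scaled_dist.argmax()))
```

Before scaling, the farthest row is one with a slightly low income; after standardization, the row with the unusual age stands out.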

Missing Ground Truth

Error: Unable to validate detection accuracy without labels
Solution: Use domain expertise for manual validation and implement a feedback loop for model improvement

Resources

Anomaly Detection Algorithms

Isolation Forest documentation and implementation examples
One-Class SVM for novelty detection
Local Outlier Factor (LOF) for density-based detection
Autoencoder-based anomaly detection for deep learning approaches

Python Libraries

scikit-learn anomaly detection module
PyOD (Python Outlier Detection) comprehensive library
TensorFlow/PyTorch for deep learning-based detection
statsmodels for statistical anomaly detection

Domain-Specific Applications

Fraud detection in financial transactions
Network intrusion detection and security monitoring
Manufacturing quality control and defect detection
Healthcare anomaly detection for patient monitoring
IoT sensor data anomaly identification

Best Practices

Balance sensitivity to avoid excessive false positives
Validate results with domain experts
Monitor detection performance over time
Update models as normal behavior evolves
Document anomaly investigation procedures

GitHub Repository

jeremylongshore/claude-code-plugins-plus

Path: plugins/ai-ml/anomaly-detection-system/skills/anomaly-detection-system

