detect-anomalies-aiops

Name: detect-anomalies-aiops
Author: pjt222

Updated 1 month ago

Category: Other. Tags: ai, api

About

This skill implements AI-powered anomaly detection for operational metrics using time series analysis (Isolation Forest, Prophet, LSTM), alert correlation, and root cause analysis. It reduces alert fatigue by intelligently identifying genuine anomalies in system metrics, logs, and traces, beyond what simple static thresholds can catch. Use it when operations teams are overwhelmed by alert volume, when complex multi-metric anomalies need to be detected, or when seasonal patterns render traditional thresholds ineffective.

Documentation

Detect Anomalies for AIOps

See Extended Examples for complete configuration files and templates.

ML → anomalies in ops metrics + alert correlation + cut false positives.

Use When

Ops team drowns in alerts (>100/day)
Multi-metric anomalies (not just threshold)
Seasonal patterns → static thresholds fail
Predict issues before user impact
Correlate alerts → root cause
Monitoring → too many false positives
Subtle perf degradation trends

In

Required: Time series metrics (CPU, mem, latency, err rate)
Required: Historical data (30-90 days min)
Optional: Alert history w/ labels (TP/FP)
Optional: Sys topology (svc deps)
Optional: Logs → correlation
Optional: Deploy/change events → context

Do

Step 1: Env + Load Data

Install deps + prep time series.

# Create virtual environment
python -m venv venv
source venv/bin/activate

# Install anomaly detection libraries
pip install prophet scikit-learn pandas numpy
pip install tensorflow keras  # for LSTM models
pip install pyod  # Python Outlier Detection library
pip install statsmodels  # for statistical methods
pip install prometheus-api-client  # if using Prometheus
pip install networkx schedule requests  # imported in Steps 4-6

# Visualization
pip install plotly matplotlib seaborn

Load + prep:

# aiops/data_loader.py
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
from typing import List, Dict
import logging

logging.basicConfig(level=logging.INFO)
# ... (see EXAMPLES.md for complete implementation)

→ Time series loaded w/ regular intervals, missing vals handled, features engineered.

If err: Prometheus conn fails → verify URL + net. Data gaps → forward-fill or interpolate. Ensure ts col is datetime. Mem issues on large ranges → chunks.
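The prep described above (regular intervals, gap handling, time features) can be sketched with pandas alone. This is a minimal illustration, not the loader from EXAMPLES.md: the function name `prepare_series`, the column names `timestamp` and `value`, the 5-minute interval, and the 3-bucket interpolation limit are all assumptions.

```python
import numpy as np
import pandas as pd


def prepare_series(df: pd.DataFrame, ts_col: str = "timestamp",
                   value_col: str = "value", freq: str = "5min") -> pd.DataFrame:
    """Regularize a metric series and add simple time features (hypothetical helper)."""
    out = df.copy()
    # Ensure the timestamp column is a real datetime, then index and sort by it.
    out[ts_col] = pd.to_datetime(out[ts_col])
    out = out.set_index(ts_col).sort_index()
    # Resample to a fixed interval; buckets with no samples become NaN.
    out = out[[value_col]].resample(freq).mean()
    # Interpolate short gaps (up to 3 buckets), forward-fill whatever remains.
    out[value_col] = out[value_col].interpolate(method="time", limit=3).ffill()
    # Time features so downstream models can see daily/weekly patterns.
    out["hour"] = out.index.hour
    out["dayofweek"] = out.index.dayofweek
    return out


# Toy input with a 15-minute gap between the 2nd and 3rd samples.
raw = pd.DataFrame({
    "timestamp": ["2024-01-01 00:00", "2024-01-01 00:05", "2024-01-01 00:20"],
    "value": [10.0, 12.0, 18.0],
})
prepared = prepare_series(raw)
print(prepared)
```

The two missing buckets are filled by time-weighted interpolation. Longer gaps fall back to forward-fill, which may or may not suit a given metric.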

Step 2: Isolation Forest (Multivariate)

Unsupervised Isolation Forest.

# aiops/isolation_forest_detector.py
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import StandardScaler
import pandas as pd
import numpy as np
from typing import Dict, List
import joblib

# ... (see EXAMPLES.md for complete implementation)

→ Model trained on history, anomalies scored, typically 0.5-2% flagged.

If err: too many (>5%) → reduce contamination or retrain on cleaner baseline. Too few (<0.1%) → increase contamination or check scaling. Verify features have variance.
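A minimal sketch of the scale, fit, score flow with scikit-learn, assuming synthetic data and hypothetical feature names (`cpu`, `latency_ms`). It is not the `IsolationForestDetector` from EXAMPLES.md. The `contamination` value here is the tuning knob the troubleshooting note refers to.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Synthetic "clean" baseline: 1000 samples of two metrics (illustrative only).
baseline = pd.DataFrame({
    "cpu": rng.normal(40, 5, 1000),
    "latency_ms": rng.normal(120, 10, 1000),
})

# Scale features so no single metric dominates the random splits.
scaler = StandardScaler().fit(baseline)
model = IsolationForest(n_estimators=200, contamination=0.01, random_state=42)
model.fit(scaler.transform(baseline))

# Score new observations: one ordinary point, one far outside the baseline.
new_points = pd.DataFrame({
    "cpu": [41.0, 95.0],
    "latency_ms": [118.0, 400.0],
})
X_new = scaler.transform(new_points)
labels = model.predict(X_new)            # 1 = normal, -1 = anomaly
scores = model.decision_function(X_new)  # lower = more anomalous

# Share of the training data flagged; tracks the contamination setting.
flagged_share = (model.predict(scaler.transform(baseline)) == -1).mean()
print(labels, scores, flagged_share)
```

Raising or lowering `contamination` moves the share of flagged points directly, so checking `flagged_share` after each retrain is a quick sanity test.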

Step 3: Prophet (Forecast + Anomaly)

Facebook Prophet → seasonality + deviations.

# aiops/prophet_detector.py
from prophet import Prophet
import pandas as pd
import numpy as np
from typing import Dict, Tuple
import logging

logger = logging.getLogger(__name__)
# ... (see EXAMPLES.md for complete implementation)

→ Prophet captures daily/weekly seasonality, flags anomalies when actuals fall outside the 99% uncertainty interval (interval_width=0.99), forecasts usable for capacity planning.

If err: too slow (>5 min/metric) → reduce history to 30 days or disable weekly_seasonality. Too many FP → raise interval_width to 0.995. Missing seasonal → add custom seasonalities. Keep TZ consistent in ts (Prophet rejects tz-aware ds → convert to one TZ, then strip tzinfo).
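The flagging rule itself is simple once a forecast exists. The sketch below assumes a forecast frame with Prophet's usual output columns (`ds`, `yhat`, `yhat_lower`, `yhat_upper`) and hand-written numbers in place of a fitted model, so it runs without Prophet installed. The helper name `flag_interval_anomalies` is hypothetical.

```python
import pandas as pd


def flag_interval_anomalies(actuals: pd.DataFrame, forecast: pd.DataFrame) -> pd.DataFrame:
    """Join actuals (ds, y) to a forecast and flag points outside the interval."""
    merged = actuals.merge(
        forecast[["ds", "yhat", "yhat_lower", "yhat_upper"]], on="ds", how="inner"
    )
    merged["is_anomaly"] = (
        (merged["y"] < merged["yhat_lower"]) | (merged["y"] > merged["yhat_upper"])
    )
    # Signed distance outside the band: positive above, negative below, 0 inside.
    above = (merged["y"] - merged["yhat_upper"]).clip(lower=0)
    below = (merged["yhat_lower"] - merged["y"]).clip(lower=0)
    merged["deviation"] = above - below
    return merged


ds = pd.date_range("2024-01-01", periods=4, freq="D")
actuals = pd.DataFrame({"ds": ds, "y": [100.0, 105.0, 180.0, 60.0]})
# Stand-in for model.predict(...) output; the values are made up for illustration.
forecast = pd.DataFrame({
    "ds": ds,
    "yhat": [100.0, 102.0, 104.0, 106.0],
    "yhat_lower": [90.0, 92.0, 94.0, 96.0],
    "yhat_upper": [110.0, 112.0, 114.0, 116.0],
})
result = flag_interval_anomalies(actuals, forecast)
print(result[["ds", "y", "is_anomaly", "deviation"]])
```

With a real model, the band width comes from the `interval_width` passed when constructing `Prophet`, so widening it is what reduces false positives.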

Step 4: Correlate Alerts + Root Cause

Group related anomalies, find causes.

# aiops/alert_correlation.py
import pandas as pd
import numpy as np
from sklearn.cluster import DBSCAN
from typing import List, Dict
from datetime import timedelta
import networkx as nx

# ... (see EXAMPLES.md for complete implementation)

→ Related anomalies → incidents, root causes via dep graph, incident summaries.

If err: all anomalies as separate → increase time_window_minutes. Root cause unclear → define metric_relationships per architecture. Verify ts sort.
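One way to picture the grouping and root-cause step, using only the standard library rather than DBSCAN or networkx. The names `time_window_minutes` and `metric_relationships` come from the text above, but their shapes here (an integer and a metric-to-upstream-metrics dict) are assumptions, as are the metric names and both helper functions.

```python
from datetime import datetime, timedelta
from typing import Dict, List

# Hypothetical dependency map: metric -> upstream metrics that can cause it.
metric_relationships: Dict[str, List[str]] = {
    "api_latency": ["db_latency"],
    "api_error_rate": ["api_latency"],
    "db_latency": [],
}


def group_into_incidents(anomalies: List[dict], time_window_minutes: int = 10) -> List[List[dict]]:
    """Chain anomalies into one incident while each follows the previous within the window."""
    window = timedelta(minutes=time_window_minutes)
    incidents: List[List[dict]] = []
    for a in sorted(anomalies, key=lambda x: x["ts"]):  # sorting by ts is essential
        if incidents and a["ts"] - incidents[-1][-1]["ts"] <= window:
            incidents[-1].append(a)
        else:
            incidents.append([a])
    return incidents


def likely_root_causes(incident: List[dict]) -> List[str]:
    """Metrics in the incident whose upstream dependencies are not themselves anomalous."""
    present = {a["metric"] for a in incident}
    return sorted(
        m for m in present
        if not any(up in present for up in metric_relationships.get(m, []))
    )


t0 = datetime(2024, 1, 1, 12, 0)
anomalies = [
    {"metric": "api_error_rate", "ts": t0 + timedelta(minutes=5)},
    {"metric": "db_latency", "ts": t0},
    {"metric": "api_latency", "ts": t0 + timedelta(minutes=2)},
    {"metric": "disk_usage", "ts": t0 + timedelta(minutes=60)},
]
incidents = group_into_incidents(anomalies)
root_causes = [likely_root_causes(i) for i in incidents]
print(len(incidents), root_causes)
```

If every anomaly lands in its own incident, the window is likely too small for how quickly failures propagate between services.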

Step 5: Integrate w/ Alerting

Smart alerts + noise suppress.

# aiops/intelligent_alerting.py
import requests
import logging
from typing import Dict, List
from datetime import datetime, timedelta
import json

logger = logging.getLogger(__name__)
# ... (see EXAMPLES.md for complete implementation)

→ High sev → PagerDuty, med → Slack, low → log only, dupes suppressed in 15-min window.

If err: test webhook w/ curl first. Verify severity (0.5-0.9 range). Check rate limit doesn't suppress all. TZ handling for last_alerts.
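Routing and duplicate suppression can be sketched without any network calls. The cutoffs below (0.9 for high, 0.5 for medium) are an assumed reading of the severity range mentioned above. `last_alerts` is taken from the text, but its structure here is a guess, and the channel names are placeholders for the real PagerDuty/Slack integrations.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, Optional

SUPPRESSION_WINDOW = timedelta(minutes=15)
# Assumed structure: "metric:channel" -> time of the last alert actually sent.
last_alerts: Dict[str, datetime] = {}


def route(severity: float) -> str:
    """Map a severity score to a channel (thresholds are assumptions)."""
    if severity >= 0.9:
        return "pagerduty"
    if severity >= 0.5:
        return "slack"
    return "log"


def dispatch(metric: str, severity: float, now: datetime) -> Optional[str]:
    """Return the channel to notify, or None if suppressed as a duplicate."""
    channel = route(severity)
    if channel == "log":
        return channel  # log-only events are never suppressed
    key = f"{metric}:{channel}"
    last = last_alerts.get(key)
    if last is not None and now - last < SUPPRESSION_WINDOW:
        return None
    last_alerts[key] = now
    return channel


# Timezone-aware timestamps avoid naive/aware comparison errors.
t0 = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
first = dispatch("cpu", 0.95, t0)
duplicate = dispatch("cpu", 0.95, t0 + timedelta(minutes=5))
after_window = dispatch("cpu", 0.95, t0 + timedelta(minutes=20))
medium = dispatch("latency", 0.6, t0)
low = dispatch("disk", 0.2, t0)
print(first, duplicate, after_window, medium, low)
```

Keying suppression on metric plus channel means an escalation from Slack to PagerDuty still goes through during the window, which is a design choice worth revisiting per team.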

Step 6: Deploy as Continuous Svc

Auto pipeline on interval.

# aiops/monitoring_service.py
import schedule
import time
import logging
from datetime import datetime, timedelta
from data_loader import MetricsDataLoader
from isolation_forest_detector import IsolationForestDetector
from prophet_detector import ProphetAnomalyDetector
# ... (see EXAMPLES.md for complete implementation)

→ Svc runs continuous, detects every 5 min, alerts on incidents, logs all.

If err: check scheduler process alive (systemd/supervisor in prod). Verify Prometheus conn. Confirm models loaded OK. Add dead man's switch in case svc stops. Monitor mem (reload models periodically if it grows).
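A dead man's switch can be as small as a heartbeat file that an independent watcher checks. The sketch below is one possible shape, assuming a 5-minute detection cycle and a 10-minute staleness limit. The function names, file name, and limit are hypothetical. The `now` parameter exists only to make the example deterministic.

```python
import tempfile
import time
from pathlib import Path
from typing import Callable, Optional


def run_cycle(detect: Callable[[], None], heartbeat: Path,
              now: Optional[float] = None) -> bool:
    """Run one detection cycle; record a heartbeat only if it succeeded."""
    try:
        detect()
    except Exception:
        return False  # no heartbeat written, so the watcher will notice
    heartbeat.write_text(str(time.time() if now is None else now))
    return True


def is_stale(heartbeat: Path, max_age_seconds: float = 600,
             now: Optional[float] = None) -> bool:
    """True if the service has not completed a cycle recently (or ever)."""
    if not heartbeat.exists():
        return True
    current = time.time() if now is None else now
    return current - float(heartbeat.read_text()) > max_age_seconds


def failing_detect() -> None:
    raise RuntimeError("Prometheus unreachable")


with tempfile.TemporaryDirectory() as tmp:
    hb = Path(tmp) / "aiops.heartbeat"
    stale_before_first_run = is_stale(hb, now=1000.0)
    ok = run_cycle(lambda: None, hb, now=1000.0)
    fresh_after_run = not is_stale(hb, now=1300.0)       # 300 s old
    failed = run_cycle(failing_detect, hb, now=1600.0)   # heartbeat not updated
    stale_after_outage = is_stale(hb, now=1700.0)        # 700 s old
print(stale_before_first_run, ok, fresh_after_run, failed, stale_after_outage)
```

The watcher should run outside the detection service (a cron job or an external monitor), otherwise it stops along with the thing it is watching.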

Check

Traps

Train on anomaly data: Baseline must be clean (no incidents). Manual review or labeled data.
Ignore seasonality: Static models fail on daily/weekly. Prophet or time features.
Too sensitive: 99% CI flags normal peaks. Start 99.5% + tune on FP.
Skip missing data: Gaps → model errors. Robust preprocess + interpolate.
Alert fatigue from low sev: Filter below threshold. High-conf only.
Ignore topology: Treating metrics solo misses cascades. Define deps.
Model drift: Old data → stale. Retrain monthly or on sys changes.
Resource contention: Detecting every metric costly. Prioritize critical svcs or sample.
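For the first trap, one simple way to keep incidents out of the training baseline is to mask known incident windows before fitting. This sketch assumes incidents are available as (start, end) pairs. The helper name and the data are illustrative.

```python
import numpy as np
import pandas as pd


def drop_incident_windows(df: pd.DataFrame, incidents) -> pd.DataFrame:
    """Remove rows whose timestamp index falls inside any (start, end) window."""
    keep = np.ones(len(df), dtype=bool)
    for start, end in incidents:
        inside = (df.index >= pd.Timestamp(start)) & (df.index <= pd.Timestamp(end))
        keep &= ~inside
    return df[keep]


metrics = pd.DataFrame(
    {"cpu": np.arange(10, dtype=float)},
    index=pd.date_range("2024-01-01", periods=10, freq="D"),
)
incident_windows = [("2024-01-03", "2024-01-05")]  # hypothetical labeled incident
clean_baseline = drop_incident_windows(metrics, incident_windows)
print(clean_baseline)
```

Masking leaves gaps in the series, which is fine for Isolation Forest but may need the gap handling from Step 1 before a forecasting model sees it.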

→

monitor-model-drift — detect when detection models degrade
monitor-data-integrity — data quality before detection
setup-prometheus-monitoring — collect ops metrics
forecast-operational-metrics — capacity planning w/ Prophet

GitHub Repository

pjt222/agent-almanac

Path: i18n/caveman-ultra/skills/detect-anomalies-aiops


FAQ

What is the detect-anomalies-aiops skill?

detect-anomalies-aiops is a Claude Skill by pjt222. Skills package instructions and resources that Claude loads on demand, so Claude can perform detect-anomalies-aiops-related tasks without extra prompting.

How do I install detect-anomalies-aiops?

Use the install commands on this page: add detect-anomalies-aiops to Claude Code as a plugin, or clone its repository into your skills directory, then restart Claude so it picks up the skill.

What category does detect-anomalies-aiops belong to?

detect-anomalies-aiops is in the Other category, tagged ai and api.

Is detect-anomalies-aiops free to use?

Yes. detect-anomalies-aiops is listed on AIMCP and free to install. It runs inside Claude, so no separate service account is required to use the skill itself.