
detect-anomalies-aiops

pjt222
Updated 2 days ago

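About

This skill uses AI models such as Isolation Forest, Prophet, and LSTM to detect true anomalies in operational time series data, logs, and traces. By correlating alerts and performing root cause analysis, it goes beyond static thresholds and reduces alert fatigue. Use it when you are overwhelmed by alert volume, when dealing with complex multi-metric anomalies or seasonal patterns, or when you want to predict problems before they occur.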

Quick Install

Claude Code

Recommended
Main
npx skills add pjt222/agent-almanac -a claude-code
Plugin command (alternative)
/plugin add https://github.com/pjt222/agent-almanac
Git clone (alternative)
git clone https://github.com/pjt222/agent-almanac.git ~/.claude/skills/detect-anomalies-aiops

Copy and paste this command into Claude Code to install the skill.

Documentation

Detect Anomalies for AIOps

See Extended Examples for complete configuration files and templates.

Apply ML to find anomalies in operational metrics. Correlate alerts, cut false positives.

When to Use

  • Ops team drowning in alerts (>100/day)
  • Need to detect complex multi-metric anomalies (not just threshold breaches)
  • Seasonal patterns make static thresholds useless
  • Want to predict issues before they hit users (proactive detection)
  • Need to correlate related alerts → root cause
  • Monitoring creates too many false positives
  • Want to spot subtle perf degradation trends

Inputs

  • Required: Time series metrics from monitoring (CPU, memory, latency, error rate)
  • Required: Historical data (30-90 days min)
  • Optional: Alert history with labels (true positive / false positive)
  • Optional: System topology (service deps)
  • Optional: Log data for correlation
  • Optional: Deploy/change events for context

Steps

Step 1: Set Up Environment + Load Data

Install deps. Prep time series data.

# Create virtual environment
python -m venv venv
source venv/bin/activate

# Install anomaly detection libraries
pip install prophet scikit-learn pandas numpy
pip install tensorflow keras  # for LSTM models
pip install pyod  # Python Outlier Detection library
pip install statsmodels  # for statistical methods
pip install prometheus-api-client  # if using Prometheus

# Visualization
pip install plotly matplotlib seaborn

Load + prep data:

# aiops/data_loader.py
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
from typing import List, Dict
import logging

logging.basicConfig(level=logging.INFO)
# ... (see EXAMPLES.md for complete implementation)

Got: Time series loaded, regular intervals, missing values handled, features engineered for ML.

If fail: Prometheus connection fails? Check URL + network. Data gaps? Forward-fill or interpolate. Timestamp column must be datetime. Memory issues with big date ranges? Process in chunks.
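
The gap handling above can be sketched with pandas. This is a minimal sketch, not the loader from EXAMPLES.md: the `timestamp` column name, the 5-minute interval, and the `max_gap` limit are assumptions chosen for illustration.

```python
import pandas as pd


def regularize_series(df: pd.DataFrame, freq: str = "5min", max_gap: int = 3) -> pd.DataFrame:
    """Put metrics on a fixed time grid and fill short gaps.

    Assumes a 'timestamp' column plus numeric metric columns (hypothetical layout).
    """
    out = df.copy()
    # Timestamp column must be datetime; parse as UTC to avoid timezone drift
    out["timestamp"] = pd.to_datetime(out["timestamp"], utc=True)
    out = out.set_index("timestamp").sort_index()
    # Regular intervals: one row per bucket, NaN where no sample arrived
    out = out.resample(freq).mean()
    # Fill at most max_gap consecutive missing points per gap, never extrapolate
    return out.interpolate(method="time", limit=max_gap, limit_area="inside")


raw = pd.DataFrame(
    {
        "timestamp": ["2024-01-01 00:00", "2024-01-01 00:05", "2024-01-01 00:15"],
        "cpu": [10.0, 20.0, 40.0],
    }
)
print(regularize_series(raw))
```

Capping the fill length keeps long outages visible as missing data instead of inventing a smooth line through them.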

Step 2: Implement Isolation Forest for Multivariate Anomaly Detection

Unsupervised Isolation Forest finds anomalies.

# aiops/isolation_forest_detector.py
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import StandardScaler
import pandas as pd
import numpy as np
from typing import Dict, List
import joblib

# ... (see EXAMPLES.md for complete implementation)

Got: Model trained on historical data. Anomalies detected with scores. Usually 0.5-2% of points flagged.

If fail: Too many anomalies (>5%)? Reduce contamination or retrain on cleaner baseline. Too few (<0.1%)? Increase contamination or check feature scaling. Features need variance.
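
A minimal sketch of the technique, assuming scikit-learn's `IsolationForest` with scaled features. The function name, the synthetic CPU/latency data, and `contamination=0.01` are illustrative assumptions, not the detector class from EXAMPLES.md.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import StandardScaler


def detect_multivariate_anomalies(X: np.ndarray, contamination: float = 0.01):
    """Return (is_anomaly, score) per row; lower score means more anomalous."""
    # Scale so no single metric dominates because of its units
    scaled = StandardScaler().fit_transform(X)
    model = IsolationForest(n_estimators=200, contamination=contamination, random_state=42)
    labels = model.fit_predict(scaled)        # -1 = anomaly, 1 = normal
    scores = model.decision_function(scaled)  # negative = below the anomaly threshold
    return labels == -1, scores


# Synthetic data: 1000 normal points (cpu %, latency ms) plus three joint spikes
rng = np.random.default_rng(0)
normal = rng.normal(loc=[50.0, 200.0], scale=[5.0, 20.0], size=(1000, 2))
spikes = np.array([[95.0, 900.0], [97.0, 950.0], [99.0, 1000.0]])
X = np.vstack([normal, spikes])

is_anomaly, scores = detect_multivariate_anomalies(X, contamination=0.01)
print(f"flagged {is_anomaly.mean():.2%} of points")
```

`contamination` sets the share of training points that land below the threshold, so it is the knob to turn when too many or too few points are flagged.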

Step 3: Implement Prophet for Time Series Forecasting + Anomaly Detection

Facebook Prophet models seasonality, finds deviations.

# aiops/prophet_detector.py
from prophet import Prophet
import pandas as pd
import numpy as np
from typing import Dict, Tuple
import logging

logger = logging.getLogger(__name__)
# ... (see EXAMPLES.md for complete implementation)

Got: Prophet models capture daily/weekly seasonality. Anomalies flagged when actual outside 99% CI. Forecasts for capacity planning.

If fail: Prophet too slow (>5 min per metric)? Cut history to 30 days or disable weekly_seasonality. Too many false positives? Raise interval_width to 0.995. Missing seasonal patterns? Add custom seasonalities. Check timezone consistency.
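
The flagging rule can be sketched separately from model fitting. Prophet's `predict` output carries `yhat`, `yhat_lower`, and `yhat_upper` columns, with the band width set by the `interval_width` constructor argument. The forecast frame below is hand-built so the sketch runs without Prophet installed; the function name and the `deviation` column are assumptions.

```python
import pandas as pd


def flag_interval_breaches(actual: pd.DataFrame, forecast: pd.DataFrame) -> pd.DataFrame:
    """Join actuals (ds, y) with a forecast and mark points outside the interval."""
    cols = ["ds", "yhat", "yhat_lower", "yhat_upper"]
    merged = actual.merge(forecast[cols], on="ds", how="inner")
    merged["is_anomaly"] = (merged["y"] < merged["yhat_lower"]) | (merged["y"] > merged["yhat_upper"])
    # Signed distance past the nearest bound; 0 while inside the interval
    above = (merged["y"] - merged["yhat_upper"]).clip(lower=0)
    below = (merged["yhat_lower"] - merged["y"]).clip(lower=0)
    merged["deviation"] = above - below
    return merged


ds = pd.date_range("2024-01-01", periods=3, freq="h")
actual = pd.DataFrame({"ds": ds, "y": [100.0, 180.0, 40.0]})
# Stand-in for Prophet output, e.g. Prophet(interval_width=0.99).fit(history).predict(future)
forecast = pd.DataFrame({"ds": ds, "yhat": 100.0, "yhat_lower": 80.0, "yhat_upper": 120.0})

result = flag_interval_breaches(actual, forecast)
print(result[["ds", "y", "is_anomaly", "deviation"]])
```

Keeping the signed deviation next to the flag makes it possible to rank anomalies by how far they sit outside the band.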

Step 4: Correlate Alerts + Find Root Cause

Group related anomalies. Identify root causes.

# aiops/alert_correlation.py
import pandas as pd
import numpy as np
from sklearn.cluster import DBSCAN
from typing import List, Dict
from datetime import timedelta
import networkx as nx

# ... (see EXAMPLES.md for complete implementation)

Got: Related anomalies grouped into incidents. Root causes from dependency graph. Incident summaries for investigation.

If fail: All anomalies separate incidents? Raise time_window_minutes. Root cause unclear? Define metric_relationships explicitly from the architecture. Check that timestamps are sorted.
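
A simplified stdlib sketch of both ideas. It swaps in plain gap-based grouping instead of DBSCAN and a dictionary instead of a networkx graph, so it illustrates the logic rather than the EXAMPLES.md implementation. The service names and the `depends_on` map are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Dict, List, Set, Tuple

Anomaly = Tuple[datetime, str]  # (timestamp, service name)


def group_into_incidents(anomalies: List[Anomaly], time_window_minutes: int = 10) -> List[List[Anomaly]]:
    """Start a new incident whenever the gap to the previous anomaly exceeds the window."""
    window = timedelta(minutes=time_window_minutes)
    incidents: List[List[Anomaly]] = []
    for item in sorted(anomalies):  # unsorted timestamps would break gap-based grouping
        if incidents and item[0] - incidents[-1][-1][0] <= window:
            incidents[-1].append(item)
        else:
            incidents.append([item])
    return incidents


def likely_root_causes(incident: List[Anomaly], depends_on: Dict[str, Set[str]]) -> Set[str]:
    """Anomalous services with no anomalous direct dependency."""
    affected = {service for _, service in incident}
    return {s for s in affected if not (depends_on.get(s, set()) & affected)}


# Hypothetical topology: web calls api, api calls db
depends_on = {"web": {"api"}, "api": {"db"}, "db": set()}
anomalies = [
    (datetime(2024, 1, 1, 10, 4), "web"),
    (datetime(2024, 1, 1, 10, 0), "db"),
    (datetime(2024, 1, 1, 10, 2), "api"),
    (datetime(2024, 1, 1, 11, 30), "cache"),
]

for incident in group_into_incidents(anomalies):
    print([s for _, s in incident], "root cause:", likely_root_causes(incident, depends_on))
```

A larger `time_window_minutes` merges more anomalies into one incident, which is why raising it helps when every anomaly shows up as its own incident.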

Step 5: Integrate with Alerting System

Send intelligent alerts with context. Suppress noise.

# aiops/intelligent_alerting.py
import requests
import logging
from typing import Dict, List
from datetime import datetime, timedelta
import json

logger = logging.getLogger(__name__)
# ... (see EXAMPLES.md for complete implementation)

Got: High-severity → PagerDuty. Medium → Slack. Low → logged only. Duplicate alerts suppressed in 15-min window.

If fail: Test webhook URLs with curl first. Severity calc should give 0.5-0.9 range. Rate limiting must not suppress all alerts. Check timezone for last_alerts tracking.
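
Routing and duplicate suppression can be sketched without any network calls. The severity cutoffs (0.8 and 0.6), the channel names, and the alert key format are assumptions; only the 15-minute window and the `last_alerts` name come from the text above.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict


def route_for(severity: float, high: float = 0.8, medium: float = 0.6) -> str:
    """Map a severity score to a channel; cutoffs are illustrative."""
    if severity >= high:
        return "pagerduty"
    if severity >= medium:
        return "slack"
    return "log"


class AlertSuppressor:
    """Drop repeats of the same alert key inside a time window."""

    def __init__(self, window_minutes: int = 15):
        self.window = timedelta(minutes=window_minutes)
        self.last_alerts: Dict[str, datetime] = {}

    def should_send(self, key: str, now: datetime) -> bool:
        last = self.last_alerts.get(key)
        if last is not None and now - last < self.window:
            return False  # duplicate inside the window
        self.last_alerts[key] = now  # only sent alerts reset the window
        return True


suppressor = AlertSuppressor(window_minutes=15)
start = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)  # timezone-aware on purpose
for offset in (0, 5, 20):
    now = start + timedelta(minutes=offset)
    print(offset, suppressor.should_send("api:latency", now), route_for(0.85))
```

Because suppressed alerts do not update `last_alerts`, a persistent problem is re-sent once per window rather than silenced forever.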

Step 6: Deploy as Continuous Monitoring Service

Auto-pipeline runs periodically.

# aiops/monitoring_service.py
import schedule
import time
import logging
from datetime import datetime, timedelta
from data_loader import MetricsDataLoader
from isolation_forest_detector import IsolationForestDetector
from prophet_detector import ProphetAnomalyDetector
# ... (see EXAMPLES.md for complete implementation)

Got: Service runs continuously. Detects anomalies every 5 min. Alerts sent for incidents. Logs all activity.

If fail: Scheduler process must stay alive (use systemd/supervisor for prod). Check Prometheus connection. Models must load OK. Add dead man's switch alert if service stops. Monitor memory (reload models periodically if growing).
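
One way to build the dead man's switch is a heartbeat file that the service updates after each cycle and that a separate watchdog checks. This is a sketch under assumptions: the file-based approach, the function names, and the 900-second limit (three missed 5-minute cycles) are not from the original service.

```python
import time
from pathlib import Path
from typing import Optional


def write_heartbeat(path: Path, now: Optional[float] = None) -> None:
    """Record when the last detection cycle finished (call at the end of each run)."""
    path.write_text(str(time.time() if now is None else now))


def heartbeat_is_stale(path: Path, max_age_seconds: float = 900, now: Optional[float] = None) -> bool:
    """True when the heartbeat is missing or older than max_age_seconds."""
    if not path.exists():
        return True
    current = time.time() if now is None else now
    return current - float(path.read_text()) > max_age_seconds
```

The watchdog should run outside the monitoring process (cron, systemd timer, or similar) so that it still fires when the service itself has died.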

Checks

  • Historical data loaded, no missing timestamps
  • Isolation Forest finds known anomalies in test set
  • Prophet models capture daily/weekly seasonality
  • Alert correlation groups temporally-related anomalies
  • Root cause detection finds upstream issues
  • Intelligent alerting suppresses duplicates
  • Severity calc gives reasonable scores (0.5-0.9)
  • Monitoring service runs continuously 7+ days, no crash
  • False positive rate < 10% (vs labeled data)
  • True positive rate > 80% for critical incidents
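
The last two checks can be computed from labeled data as below. This sketch assumes the standard confusion-matrix definitions (TPR = TP / (TP + FN), FPR = FP / (FP + TN)); if "false positive rate" is meant as the share of sent alerts that were false, use FP / (FP + TP) instead. The sample labels are made up.

```python
from typing import Sequence, Tuple


def detection_rates(predicted: Sequence[bool], actual: Sequence[bool]) -> Tuple[float, float]:
    """Return (true_positive_rate, false_positive_rate) for paired labels."""
    tp = sum(p and a for p, a in zip(predicted, actual))
    fp = sum(p and not a for p, a in zip(predicted, actual))
    fn = sum(not p and a for p, a in zip(predicted, actual))
    tn = sum(not p and not a for p, a in zip(predicted, actual))
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return tpr, fpr


predicted = [True, True, False, False, True, False, False, False, False, False]
actual = [True, True, True, False, False, False, False, False, False, False]
tpr, fpr = detection_rates(predicted, actual)
print(f"TPR={tpr:.2f} FPR={fpr:.2f}")
```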

Pitfalls

  • Training on anomalous data: Baseline period for training must be clean (no incidents). Manually review or use labeled data.
  • Ignoring seasonality: Static models fail on daily/weekly patterns. Use Prophet or add time features.
  • Too sensitive thresholds: 99% CI may flag normal peaks. Start at 99.5%, tune by false positive rate.
  • Not handling missing data: Gaps cause model errors. Robust preprocessing with interpolation.
  • Alert fatigue from low severity: Filter below threshold. Focus on high-confidence.
  • Ignoring system topology: Treating metrics as independent misses cascading failures. Define deps.
  • Model drift: Old-data models go stale. Retrain monthly or on system change.
  • Resource contention: Running detection on every metric = expensive. Prioritize critical services or sample.
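
For the seasonality pitfall, "add time features" can look like the sketch below. Cyclic sine/cosine encoding is one common choice, not necessarily what EXAMPLES.md uses, and the feature names are assumptions.

```python
import math
from datetime import datetime
from typing import Dict


def time_features(ts: datetime) -> Dict[str, float]:
    """Encode hour-of-day and day-of-week so a model can see daily/weekly cycles."""
    hour_angle = 2 * math.pi * (ts.hour + ts.minute / 60) / 24
    dow_angle = 2 * math.pi * ts.weekday() / 7  # Monday = 0
    return {
        "hour_sin": math.sin(hour_angle),
        "hour_cos": math.cos(hour_angle),
        "dow_sin": math.sin(dow_angle),
        "dow_cos": math.cos(dow_angle),
        "is_weekend": float(ts.weekday() >= 5),
    }


print(time_features(datetime(2024, 1, 6, 23, 30)))
```

The sine/cosine pair keeps 23:30 and 00:30 close together in feature space, which a raw hour number (23 vs 0) would not.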

See Also

  • monitor-model-drift - Find when anomaly models degrade
  • monitor-data-integrity - Data quality checks before anomaly detection
  • setup-prometheus-monitoring - Collect operational metrics
  • forecast-operational-metrics - Capacity planning with Prophet forecasts

GitHub Repository

pjt222/agent-almanac
Path: i18n/caveman/skills/detect-anomalies-aiops