moai-domain-monitoring

modu-ai

Updated Today

35 views

424

Otherai

About

This Claude Skill provides enterprise application monitoring with AI-powered observability for scalable modern applications. It integrates Context7 for intelligent performance orchestration and supports monitoring, metrics, logging, and tracing workflows. Use this skill when you need to analyze application performance, troubleshoot production issues, or implement comprehensive observability solutions.

Quick Install

Claude Code

Recommended

Plugin CommandRecommended

/plugin add https://github.com/modu-ai/moai-adk

Git CloneAlternative

git clone https://github.com/modu-ai/moai-adk.git ~/.claude/skills/moai-domain-monitoring

Copy and paste this command in Claude Code to install this skill

Documentation

Enterprise Application Monitoring Expert v4.0.0

Skill Metadata

Field	Value
Skill Name	moai-domain-monitoring
Version	4.0.0 (2025-11-13)
Tier	Enterprise Monitoring Expert
AI-Powered	✅ Context7 Integration, Intelligent Architecture
Auto-load	On demand when monitoring keywords detected

What It Does

Enterprise Application Monitoring expert with AI-powered observability architecture, Context7 integration, and intelligent performance orchestration for scalable modern applications.

Revolutionary v4.0.0 capabilities:

🤖 AI-Powered Monitoring Architecture using Context7 MCP for latest observability patterns
📊 Intelligent Performance Analytics with automated anomaly detection and optimization
🚀 Advanced Observability Integration with AI-driven distributed tracing and correlation
🔗 Enterprise Alerting Systems with zero-configuration intelligent incident management
📈 Predictive Performance Insights with usage forecasting and capacity planning

When to Use

Automatic triggers:

Application monitoring architecture and observability strategy discussions
Performance optimization and bottleneck analysis planning
Alerting and incident management system implementation
Distributed tracing and system correlation analysis

Manual invocation:

Designing enterprise monitoring architectures with optimal observability
Implementing comprehensive performance monitoring and analytics
Planning incident response and alerting strategies
Optimizing system performance and capacity planning

Quick Reference (Level 1)

Modern Monitoring Stack (November 2025)

Core Monitoring Components

Metrics Collection: Prometheus, Grafana, DataDog, New Relic
Logging: ELK Stack, Grafana Loki, Fluentd, Logstash
Tracing: Jaeger, OpenTelemetry, Zipkin, AWS X-Ray
APM: Application Performance Monitoring with real-time insights
Synthetic Monitoring: Active user experience simulation

Key Observability Pillars

Logs: Structured event logging with correlation IDs
Metrics: Time-series data for system performance
Traces: Distributed request flow across services
Events: Business and system event correlation
Profiles: Application performance profiling

Popular Integration Patterns

OpenTelemetry: Vendor-neutral observability data collection
Prometheus: Metrics collection and alerting
Grafana: Visualization and dashboarding
DataDog: Full-stack monitoring and APM
New Relic: Application performance and infrastructure monitoring

Alerting Strategy

SLI/SLO Monitoring: Service level objectives and indicators
Threshold-based Alerts: Performance and availability thresholds
Anomaly Detection: AI-powered anomaly identification
Escalation Policies: Multi-level alerting and notification

Core Implementation (Level 2)

Monitoring Architecture Intelligence

# AI-powered monitoring architecture optimization with Context7
class MonitoringArchitectOptimizer:
    def __init__(self):
        self.context7_client = Context7Client()
        self.observability_analyzer = ObservabilityAnalyzer()
        self.performance_optimizer = PerformanceOptimizer()
    
    async def design_optimal_monitoring_architecture(self, 
                                                   requirements: MonitoringRequirements) -> MonitoringArchitecture:
        """Design optimal monitoring architecture using AI analysis."""
        
        # Get latest monitoring and observability documentation via Context7
        monitoring_docs = await self.context7_client.get_library_docs(
            context7_library_id='/monitoring/docs',
            topic="observability metrics tracing logging alerting 2025",
            tokens=3000
        )
        
        observability_docs = await self.context7_client.get_library_docs(
            context7_library_id='/observability/docs',
            topic="opentelemetry prometheus grafana performance 2025",
            tokens=2000
        )
        
        # Optimize observability stack
        observability_design = self.observability_analyzer.optimize_stack(
            requirements.application_complexity,
            requirements.scale_requirements,
            monitoring_docs
        )
        
        # Design alerting strategy
        alerting_strategy = self.performance_optimizer.design_alerting(
            requirements.service_level_objectives,
            requirements.notification_preferences,
            observability_docs
        )
        
        return MonitoringArchitecture(
            metrics_collection=self._configure_metrics(requirements),
            logging_system=self._configure_logging(requirements),
            tracing_setup=self._configure_tracing(requirements),
            alerting_framework=alerting_strategy,
            observability_stack=observability_design,
            dashboard_configuration=self._design_dashboards(requirements),
            performance_predictions=observability_design.predictions
        )

OpenTelemetry Integration

// Comprehensive OpenTelemetry setup for Node.js applications
import { NodeSDK } from '@opentelemetry/sdk-node';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { Resource } from '@opentelemetry/resources';
import { SemanticResourceAttributes } from '@opentelemetry/semantic-conventions';
import { OTLPTraceExporter } from '@opentelemetry/exporter-otlp-grpc';
import { OTLPMetricExporter } from '@opentelemetry/exporter-otlp-grpc';
import { PrometheusExporter } from '@opentelemetry/exporter-prometheus';

// Initialize OpenTelemetry SDK
const sdk = new NodeSDK({
  resource: new Resource({
    [SemanticResourceAttributes.SERVICE_NAME]: 'your-service-name',
    [SemanticResourceAttributes.SERVICE_VERSION]: '1.0.0',
    [SemanticResourceAttributes.DEPLOYMENT_ENVIRONMENT]: process.env.NODE_ENV,
  }),
  
  // Auto-instrumentation for popular libraries
  instrumentations: [getNodeAutoInstrumentations()],
  
  // Trace exporter for distributed tracing
  traceExporter: new OTLPTraceExporter({
    url: process.env.OTEL_EXPORTER_OTLP_TRACES_ENDPOINT || 'http://jaeger:4317',
  }),
  
  // Metrics exporter
  metricExporter: new OTLPMetricExporter({
    url: process.env.OTEL_EXPORTER_OTLP_METRICS_ENDPOINT || 'http://prometheus:9090',
  }),
  
  // Additional Prometheus endpoint
  metricReader: new PrometheusExporter({
    port: 9464,
    endpoint: '/metrics',
  }),
  
  // Performance optimizations
  spanLimits: {
    attributeCountLimit: 100,
    eventCountLimit: 1000,
    linkCountLimit: 100,
  },
});

// Start the SDK
sdk.start().then(() => {
  console.log('OpenTelemetry initialized successfully');
});

// Graceful shutdown
process.on('SIGTERM', () => {
  sdk.shutdown()
    .then(() => console.log('OpenTelemetry shut down successfully'))
    .catch((error) => console.error('Error shutting down OpenTelemetry', error))
    .finally(() => process.exit(0));
});

// Custom span creation for business logic
import { trace } from '@opentelemetry/api';

export function createBusinessSpan(operationName: string, attributes: Record<string, string>) {
  const tracer = trace.getTracer('business-logic');
  
  return tracer.startSpan(operationName, {
    attributes: {
      'business.operation': operationName,
      'service.name': 'your-service-name',
      ...attributes,
    },
  });
}

// Example usage in business logic
export async function processUserOrder(userId: string, orderId: string) {
  const span = createBusinessSpan('process_user_order', {
    'user.id': userId,
    'order.id': orderId,
  });
  
  try {
    // Business logic here
    const result = await orderService.process(userId, orderId);
    
    span.setAttributes({
      'order.status': result.status,
      'order.amount': result.amount.toString(),
    });
    
    return result;
  } catch (error) {
    span.recordException(error as Error);
    throw error;
  } finally {
    span.end();
  }
}

Prometheus Metrics Implementation

// Custom Prometheus metrics for application monitoring
import { Counter, Histogram, Gauge, register } from 'prom-client';

// Business metrics
export const businessMetrics = {
  // Request counters
  httpRequestsTotal: new Counter({
    name: 'http_requests_total',
    help: 'Total number of HTTP requests',
    labelNames: ['method', 'route', 'status_code'],
  }),
  
  // Response time histograms
  httpRequestDuration: new Histogram({
    name: 'http_request_duration_seconds',
    help: 'HTTP request duration in seconds',
    labelNames: ['method', 'route'],
    buckets: [0.1, 0.3, 0.5, 0.7, 1, 3, 5, 7, 10],
  }),
  
  // Active connections gauge
  activeConnections: new Gauge({
    name: 'active_connections',
    help: 'Number of active connections',
  }),
  
  // Business operations
  ordersProcessed: new Counter({
    name: 'orders_processed_total',
    help: 'Total number of orders processed',
    labelNames: ['status', 'payment_method'],
  }),
  
  revenueGenerated: new Counter({
    name: 'revenue_generated_total',
    help: 'Total revenue generated',
    labelNames: ['currency'],
  }),
};

// System metrics
export const systemMetrics = {
  // Memory usage
  memoryUsage: new Gauge({
    name: 'memory_usage_bytes',
    help: 'Memory usage in bytes',
    labelNames: ['type'], // heap, external, array_buffers
  }),
  
  // CPU usage
  cpuUsage: new Gauge({
    name: 'cpu_usage_percent',
    help: 'CPU usage percentage',
  }),
  
  // Event loop lag
  eventLoopLag: new Histogram({
    name: 'event_loop_lag_seconds',
    help: 'Event loop lag in seconds',
    buckets: [0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 5],
  }),
};

// Metrics collection middleware
export function metricsMiddleware() {
  return (req: Request, res: Response, next: NextFunction) => {
    const start = Date.now();
    
    // Increment active connections
    systemMetrics.activeConnections.inc();
    
    res.on('finish', () => {
      const duration = (Date.now() - start) / 1000;
      
      // Record HTTP request metrics
      businessMetrics.httpRequestsTotal
        .labels(req.method, req.route?.path || req.path, res.statusCode.toString())
        .inc();
      
      businessMetrics.httpRequestDuration
        .labels(req.method, req.route?.path || req.path)
        .observe(duration);
      
      // Decrement active connections
      systemMetrics.activeConnections.dec();
    });
    
    next();
  };
}

// Export metrics for Prometheus
export function getMetrics() {
  return register.metrics();
}

// System metrics collection
setInterval(() => {
  const memUsage = process.memoryUsage();
  systemMetrics.memoryUsage.labels('heap').set(memUsage.heapUsed);
  systemMetrics.memoryUsage.labels('external').set(memUsage.external);
  systemMetrics.memoryUsage.labels('array_buffers').set(memUsage.arrayBuffers);
}, 5000);

Advanced Implementation (Level 3)

Advanced Alerting and Incident Management

class IntelligentAlertingSystem:
    def __init__(self):
        self.anomaly_detector = AnomalyDetector()
        self.escalation_manager = EscalationManager()
        self.correlation_engine = AlertCorrelationEngine()
    
    async def setup_intelligent_alerting(self, 
                                       monitoring_config: MonitoringConfiguration) -> AlertingSetup:
        """Configure intelligent alerting with anomaly detection."""
        
        # Set up anomaly detection
        anomaly_config = self.anomaly_detector.configure_detection(
            monitoring_config.metrics,
            sensitivity_level=monitoring_config.sensitivity,
            learning_period=monitoring_config.learning_period
        )
        
        # Configure escalation policies
        escalation_policies = self.escalation_manager.create_policies(
            monitoring_config.severity_levels,
            monitoring_config.notification_channels
        )
        
        # Set up alert correlation
        correlation_rules = self.correlation_engine.define_correlation_rules(
            monitoring_config.service_dependencies,
            monitoring_config.infrastructure_topology
        )
        
        return AlertingSetup(
            anomaly_detection=anomaly_config,
            escalation_policies=escalation_policies,
            correlation_rules=correlation_rules,
            suppression_rules=self._configure_suppression_rules(),
            enrichment_rules=self._configure_enrichment_rules()
        )

Performance Optimization with Machine Learning

// AI-powered performance optimization
export class PerformanceOptimizer {
  private performanceData: PerformanceMetrics[] = [];
  private model: PerformanceModel;

  constructor() {
    this.model = new PerformanceModel();
  }

  async collectPerformanceMetrics(): Promise<void> {
    // Collect comprehensive performance metrics
    const metrics = await this.gatherMetrics();
    this.performanceData.push(metrics);
    
    // Keep only recent data for training
    if (this.performanceData.length > 1000) {
      this.performanceData = this.performanceData.slice(-1000);
    }
  }

  async predictPerformanceIssues(): Promise<PerformancePrediction[]> {
    // Use trained model to predict potential issues
    const features = this.extractFeatures(this.performanceData);
    const predictions = await this.model.predict(features);
    
    return predictions.map((prediction, index) => ({
      timestamp: Date.now() + (index * 60000), // Next hour predictions
      issue_type: prediction.type,
      confidence: prediction.confidence,
      severity: prediction.severity,
      recommended_actions: this.getRecommendedActions(prediction),
    }));
  }

  async optimizeResourceAllocation(): Promise<ResourceOptimization> {
    // Analyze current resource usage patterns
    const usagePatterns = this.analyzeUsagePatterns();
    
    // Generate optimization recommendations
    return {
      cpu_scaling: this.optimizeCPUAllocation(usagePatterns.cpu),
      memory_scaling: this.optimizeMemoryAllocation(usagePatterns.memory),
      database_scaling: this.optimizeDatabaseAllocation(usagePatterns.database),
      cache_optimization: this.optimizeCacheConfiguration(usagePatterns.cache),
    };
  }

  private getRecommendedActions(prediction: any): string[] {
    const actions: string[] = [];
    
    switch (prediction.type) {
      case 'high_cpu':
        actions.push('Scale up CPU resources');
        actions.push('Optimize CPU-intensive operations');
        break;
      case 'memory_leak':
        actions.push('Investigate memory usage patterns');
        actions.push('Consider memory profiling');
        break;
      case 'slow_database':
        actions.push('Check database query performance');
        actions.push('Optimize database indexes');
        break;
      case 'high_response_time':
        actions.push('Analyze request handling bottlenecks');
        actions.push('Implement request batching');
        break;
    }
    
    return actions;
  }
}

Distributed Tracing Implementation

// Advanced distributed tracing with correlation
export class DistributedTracing {
  private tracer: Tracer;

  constructor() {
    this.tracer = trace.getTracer('distributed-tracing');
  }

  async traceWorkflow(workflowName: string, steps: WorkflowStep[]): Promise<void> {
    const mainSpan = this.tracer.startSpan(`workflow.${workflowName}`, {
      attributes: {
        'workflow.name': workflowName,
        'workflow.steps_count': steps.length.toString(),
      },
    });

    try {
      for (const step of steps) {
        const stepSpan = this.tracer.startSpan(`step.${step.name}`, {
          parent: mainSpan,
          attributes: {
            'step.name': step.name,
            'step.type': step.type,
            'step.service': step.service,
          },
        });

        try {
          await this.executeStep(step);
          
          stepSpan.setAttributes({
            'step.status': 'success',
            'step.duration': stepSpan.duration[0].toString(),
          });
        } catch (error) {
          stepSpan.recordException(error as Error);
          stepSpan.setAttributes({
            'step.status': 'error',
            'step.error': (error as Error).message,
          });
          throw error;
        } finally {
          stepSpan.end();
        }
      }
    } finally {
      mainSpan.end();
    }
  }

  private async executeStep(step: WorkflowStep): Promise<void> {
    // Add custom baggage for context propagation
    const baggage = propagate.getActiveBaggage();
    if (!baggage) {
      propagate.setBaggage(
        Baggage.fromEntries([
          ['workflow.id', crypto.randomUUID()],
          ['correlation.id', crypto.randomUUID()],
          ['user.id', step.context?.userId || 'anonymous'],
        ])
      );
    }

    // Execute the step with proper context
    await step.execute();
  }

  // Correlation analysis for distributed systems
  async analyzeCorrelations(traceData: TraceData[]): Promise<CorrelationAnalysis> {
    const correlations = new Map<string, CorrelationResult>();
    
    // Analyze trace patterns
    for (const trace of traceData) {
      const correlationId = trace.attributes['correlation.id'];
      
      if (correlationId) {
        const existing = correlations.get(correlationId) || {
          correlationId,
          spans: [],
          services: new Set(),
          errors: [],
          totalDuration: 0,
        };
        
        existing.spans.push(trace);
        existing.services.add(trace.attributes['service.name']);
        
        if (trace.attributes['error']) {
          existing.errors.push(trace);
        }
        
        existing.totalDuration += trace.duration || 0;
        correlations.set(correlationId, existing);
      }
    }
    
    return {
      totalCorrelations: correlations.size,
      correlationResults: Array.from(correlations.values()),
      errorRate: this.calculateErrorRate(correlations),
      averageDuration: this.calculateAverageDuration(correlations),
    };
  }
}

Reference & Integration (Level 4)

API Reference

Core Monitoring Operations

create_metric(name, type, labels) - Create custom metric
record_event(event_name, attributes) - Record business event
create_span(name, parent_span) - Create tracing span
set_alert(condition, severity, channels) - Configure alert
create_dashboard(metrics, visualization) - Create monitoring dashboard

Context7 Integration

get_latest_monitoring_documentation() - Official monitoring docs via Context7
analyze_observability_patterns() - Observability best practices via Context7
optimize_monitoring_stack() - Monitoring optimization via Context7

Best Practices (November 2025)

DO

Use OpenTelemetry for vendor-neutral observability
Implement structured logging with correlation IDs
Set up comprehensive alerting with proper escalation
Monitor business metrics alongside technical metrics
Use dashboards for real-time system visibility
Implement anomaly detection for proactive monitoring
Set up SLI/SLO monitoring for service reliability
Use distributed tracing for microservice debugging

DON'T

Skip monitoring for development environments
Create too many alerts without proper prioritization
Ignore business metrics and user experience
Forget to monitor infrastructure costs
Use alerting as a replacement for proper monitoring
Skip performance testing and benchmarking
Ignore monitoring data retention policies
Forget to secure monitoring endpoints and data

Works Well With

moai-baas-foundation (Enterprise BaaS monitoring)
moai-essentials-perf (Performance optimization)
moai-security-api (Security monitoring)
moai-foundation-trust (Compliance monitoring)
moai-domain-backend (Backend application monitoring)
moai-domain-frontend (Frontend performance monitoring)
moai-domain-devops (DevOps and infrastructure monitoring)
moai-security-owasp (Security threat monitoring)

Changelog

v4.0.0 (2025-11-13): Complete Enterprise v4.0 rewrite with 40% content reduction, 4-layer Progressive Disclosure structure, Context7 integration, November 2025 monitoring stack updates, and intelligent alerting patterns
v2.0.0 (2025-11-11): Complete metadata structure, monitoring patterns, alerting configuration
v1.0.0 (2025-11-11): Initial application monitoring

End of Skill | Updated 2025-11-13

Security & Compliance

Monitoring Security

Secure transmission of monitoring data with encryption
Access controls for sensitive metrics and logs
Data anonymization for user privacy protection
Secure API endpoints for monitoring data collection

Compliance Management

GDPR compliance with data minimization in monitoring
SOC2 monitoring controls and audit trails
Industry-specific compliance monitoring (HIPAA, PCI-DSS)
Automated compliance reporting and alerting

End of Enterprise Application Monitoring Expert v4.0.0

GitHub Repository

modu-ai/moai-adk

Path: .claude/skills/moai-domain-monitoring

agentic-aiagentic-codingagentic-workflowclaudeclaudecodevibe-coding

Related Skills

sglang

evaluating-llms-harness

Testing

This Claude Skill runs the lm-evaluation-harness to benchmark LLMs across 60+ standardized academic tasks like MMLU and GSM8K. It's designed for developers to compare model quality, track training progress, or report academic results. The tool supports various backends including HuggingFace and vLLM models.

View skill

llamaguard

Other

LlamaGuard is Meta's 7-8B parameter model for moderating LLM inputs and outputs across six safety categories like violence and hate speech. It offers 94-95% accuracy and can be deployed using vLLM, Hugging Face, or Amazon SageMaker. Use this skill to easily integrate content filtering and safety guardrails into your AI applications.