
Must-Gather Analyzer

openshift-eng

About

This skill analyzes OpenShift must-gather diagnostic data to provide insights into cluster health. It helps diagnose issues with cluster operators, pods, nodes, and network components. Use it when you need to check operator status, identify failing pods, or investigate cluster problems from diagnostic data.

Quick Install

Claude Code

Plugin Command (Recommended)
/plugin add https://github.com/openshift-eng/ai-helpers

Git Clone (Alternative)
git clone https://github.com/openshift-eng/ai-helpers.git ~/.claude/skills/"Must-Gather Analyzer"

Copy and paste one of these commands into Claude Code to install the skill.

Documentation

Must-Gather Analyzer Skill

Comprehensive analysis of OpenShift must-gather diagnostic data, with helper scripts that parse YAML and display output in an oc-like format.

Overview

This skill provides analysis for:

  • ClusterVersion: Current version, update status, and capabilities
  • Cluster Operators: Status, degradation, and availability
  • Pods: Health, restarts, crashes, and failures across namespaces
  • Nodes: Conditions, capacity, and readiness
  • Network: OVN/SDN diagnostics and connectivity
  • Events: Warning and error events across namespaces
  • etcd: Cluster health, member status, and quorum
  • Storage: PersistentVolumes and PersistentVolumeClaims status

Must-Gather Directory Structure

Important: Must-gather data is contained in a subdirectory with a long hash name:

must-gather/
└── registry-ci-openshift-org-origin-...-sha256-<hash>/
    ├── cluster-scoped-resources/
    │   ├── config.openshift.io/clusteroperators/
    │   └── core/nodes/
    ├── namespaces/
    │   └── <namespace>/
    │       └── pods/
    │           └── <pod-name>/
    │               └── <pod-name>.yaml
    └── network_logs/

The analysis scripts expect the path to the subdirectory (the one with the hash), not the root must-gather folder.
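For scripting, the hashed subdirectory can be located programmatically. Below is a minimal sketch in Python of that lookup; the helper function name is illustrative and not part of the skill's scripts:

from pathlib import Path

def find_must_gather_root(path: str) -> Path:
    """Return the directory that actually holds the gathered data."""
    root = Path(path)
    # The user may already have pointed at the hashed subdirectory.
    if (root / "cluster-scoped-resources").is_dir() and (root / "namespaces").is_dir():
        return root
    # Otherwise, look one level down for a child with the expected layout.
    for child in root.iterdir():
        if (child / "cluster-scoped-resources").is_dir() and (child / "namespaces").is_dir():
            return child
    raise FileNotFoundError(f"No must-gather data found under {root}")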

Instructions

1. Get Must-Gather Path

Ask the user for the must-gather directory path if not already provided.

  • If they provide the root directory, look for the subdirectory with the hash name
  • The correct path contains cluster-scoped-resources/ and namespaces/ directories

2. Choose Analysis Type

Based on the user's request, run the appropriate helper script:

ClusterVersion Analysis

./scripts/analyze_clusterversion.py <must-gather-path>

Shows cluster version information similar to oc get clusterversion:

  • Current version and update status
  • Progressing state
  • Available updates
  • Version conditions
  • Enabled capabilities
  • Update history
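
To illustrate what such a parser does, here is a minimal sketch that reads the same file the script uses (see the reference section below), assuming PyYAML and the standard ClusterVersion schema:

import yaml
from pathlib import Path

def read_clusterversion(mg: Path) -> None:
    cv_file = mg / "cluster-scoped-resources/config.openshift.io/clusterversions/version.yaml"
    cv = yaml.safe_load(cv_file.read_text())
    # history[0] is the most recent update (current or in progress).
    current = cv["status"]["history"][0]
    print(f"VERSION: {current['version']}  STATE: {current['state']}")
    for cond in cv["status"].get("conditions", []):
        print(f"{cond['type']:<20} {cond['status']:<6} {cond.get('message', '')}")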

Cluster Operators Analysis

./scripts/analyze_clusteroperators.py <must-gather-path>

Shows cluster operator status similar to oc get clusteroperators:

  • Available, Progressing, Degraded conditions
  • Version information
  • Time since condition change
  • Detailed messages for operators with issues
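
The core logic is small: read each operator's YAML and index its status conditions. A minimal sketch, assuming one file per operator in the standard location:

import yaml
from pathlib import Path

def operator_conditions(mg: Path):
    """Yield (name, available, progressing, degraded) per ClusterOperator."""
    co_dir = mg / "cluster-scoped-resources/config.openshift.io/clusteroperators"
    for f in sorted(co_dir.glob("*.yaml")):
        co = yaml.safe_load(f.read_text())
        conds = {c["type"]: c["status"] for c in co["status"]["conditions"]}
        yield (co["metadata"]["name"], conds.get("Available"),
               conds.get("Progressing"), conds.get("Degraded"))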

Pods Analysis

# All namespaces
./scripts/analyze_pods.py <must-gather-path>

# Specific namespace
./scripts/analyze_pods.py <must-gather-path> --namespace <namespace>

# Show only problematic pods
./scripts/analyze_pods.py <must-gather-path> --problems-only

Shows pod status similar to oc get pods -A:

  • Ready/Total containers
  • Status (Running, Pending, CrashLoopBackOff, etc.)
  • Restart counts
  • Age
  • Categorized issues (crashlooping, pending, failed)
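
A hedged sketch of the pod classification, assuming the per-pod YAML layout shown earlier; the set of "problem" reasons here is illustrative, not the script's exact list:

import yaml
from pathlib import Path

PROBLEM_REASONS = {"CrashLoopBackOff", "ImagePullBackOff", "ErrImagePull", "Error"}

def scan_pods(mg: Path, problems_only: bool = False):
    for pod_file in mg.glob("namespaces/*/pods/*/*.yaml"):
        pod = yaml.safe_load(pod_file.read_text())
        status = pod.get("status", {})
        containers = status.get("containerStatuses", [])
        ready = sum(1 for c in containers if c.get("ready"))
        restarts = sum(c.get("restartCount", 0) for c in containers)
        phase = status.get("phase", "Unknown")
        # A waiting reason such as CrashLoopBackOff overrides the phase, as oc does.
        waiting = {c["state"]["waiting"].get("reason", "")
                   for c in containers if "waiting" in c.get("state", {})}
        shown = next(iter(waiting & PROBLEM_REASONS), phase)
        healthy = phase == "Running" and ready == len(containers) and not waiting
        if problems_only and healthy:
            continue
        meta = pod["metadata"]
        print(f"{meta['namespace']:<35} {meta['name']:<55} "
              f"{ready}/{len(containers)}  {shown}  restarts={restarts}")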

Nodes Analysis

./scripts/analyze_nodes.py <must-gather-path>

# Show only nodes with issues
./scripts/analyze_nodes.py <must-gather-path> --problems-only

Shows node status similar to oc get nodes:

  • Ready status
  • Roles (master, worker)
  • Age
  • Kubernetes version
  • Node conditions (DiskPressure, MemoryPressure, etc.)
  • Capacity and allocatable resources
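
A minimal sketch of the node parsing, assuming one YAML file per node in the standard location:

import yaml
from pathlib import Path

def node_summary(mg: Path):
    for f in sorted((mg / "cluster-scoped-resources/core/nodes").glob("*.yaml")):
        node = yaml.safe_load(f.read_text())
        conds = {c["type"]: c["status"] for c in node["status"]["conditions"]}
        ready = "Ready" if conds.get("Ready") == "True" else "NotReady"
        # Roles come from node-role.kubernetes.io/<role> labels.
        roles = ",".join(sorted(
            label.rsplit("/", 1)[1]
            for label in node["metadata"].get("labels", {})
            if label.startswith("node-role.kubernetes.io/")
        )) or "<none>"
        version = node["status"]["nodeInfo"]["kubeletVersion"]
        print(f"{node['metadata']['name']:<50} {ready:<9} {roles:<15} {version}")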

Network Analysis

./scripts/analyze_network.py <must-gather-path>

Shows network health:

  • Network type (OVN-Kubernetes, OpenShift SDN)
  • Network operator status
  • OVN pod health
  • PodNetworkConnectivityCheck results
  • Network-related issues
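
The network type itself can usually be read straight from the cluster Network config. A sketch, assuming the config sits under the standard cluster-scoped path (the exact location can vary between must-gather images):

import yaml
from pathlib import Path

def network_type(mg: Path) -> str:
    # Assumed path; adjust if your gather stores the Network config elsewhere.
    net_file = mg / "cluster-scoped-resources/config.openshift.io/networks/cluster.yaml"
    if not net_file.exists():
        return "unknown"
    net = yaml.safe_load(net_file.read_text())
    return net.get("spec", {}).get("networkType", "unknown")  # e.g. "OVNKubernetes"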

Events Analysis

# Recent events (last 100)
./scripts/analyze_events.py <must-gather-path>

# Warning events only
./scripts/analyze_events.py <must-gather-path> --type Warning

# Events in specific namespace
./scripts/analyze_events.py <must-gather-path> --namespace openshift-etcd

# Show last 50 events
./scripts/analyze_events.py <must-gather-path> --count 50

Shows cluster events:

  • Event type (Warning, Normal)
  • Last seen timestamp
  • Reason and message
  • Affected object
  • Event count
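
Aggregating events across namespaces amounts to merging the per-namespace events.yaml files and sorting by last occurrence. A minimal sketch:

import yaml
from pathlib import Path

def recent_events(mg: Path, count=100, type_filter=None):
    events = []
    for f in mg.glob("namespaces/*/core/events.yaml"):
        events.extend(yaml.safe_load(f.read_text()).get("items", []))
    if type_filter:
        events = [e for e in events if e.get("type") == type_filter]
    # Newest first; str() keeps datetime and missing values comparable.
    events.sort(key=lambda e: str(e.get("lastTimestamp") or ""), reverse=True)
    for e in events[:count]:
        obj = e.get("involvedObject", {})
        ts = str(e.get("lastTimestamp", "-"))
        print(f"{ts:<22} {e.get('type', ''):<8} {e.get('reason', ''):<25} "
              f"{obj.get('kind', '')}/{obj.get('name', '')}: {e.get('message', '')}")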

etcd Analysis

./scripts/analyze_etcd.py <must-gather-path>

Shows etcd cluster health:

  • Member health status
  • Member list with IDs and URLs
  • Endpoint status (leader, version, DB size)
  • Quorum status
  • Cluster summary
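
A minimal sketch of the health check, assuming the JSON files named in the reference section follow etcdctl's JSON output format:

import json
from pathlib import Path

def etcd_health(mg: Path):
    info = mg / "etcd_info"
    # etcdctl endpoint health -w json emits a list of {endpoint, health, ...}.
    health = json.loads((info / "endpoint_health.json").read_text())
    healthy = sum(1 for h in health if h.get("health"))
    print(f"{healthy}/{len(health)} endpoints healthy")
    # etcdctl member list -w json emits {"members": [{name, clientURLs, ...}]}.
    members = json.loads((info / "member_list.json").read_text())
    for m in members.get("members", []):
        print(f"{m.get('name', '?'):<30} {','.join(m.get('clientURLs', []))}")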

Storage Analysis

# All PVs and PVCs
./scripts/analyze_pvs.py <must-gather-path>

# PVCs in specific namespace
./scripts/analyze_pvs.py <must-gather-path> --namespace openshift-monitoring

Shows storage resources:

  • PersistentVolumes (capacity, status, claims)
  • PersistentVolumeClaims (binding, capacity)
  • Storage classes
  • Pending/unbound volumes
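
A minimal sketch of the PV pass, assuming one YAML file per PersistentVolume in the standard location:

import yaml
from pathlib import Path

def pv_summary(mg: Path):
    pv_dir = mg / "cluster-scoped-resources/core/persistentvolumes"
    for f in sorted(pv_dir.glob("*.yaml")):
        pv = yaml.safe_load(f.read_text())
        phase = pv.get("status", {}).get("phase", "Unknown")
        capacity = pv.get("spec", {}).get("capacity", {}).get("storage", "?")
        claim = pv.get("spec", {}).get("claimRef") or {}
        claim_ref = f"{claim.get('namespace', '')}/{claim.get('name', '')}" if claim else "-"
        flag = "" if phase == "Bound" else "  <- not bound"
        print(f"{pv['metadata']['name']:<40} {capacity:<8} {phase:<10} {claim_ref}{flag}")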

Monitoring Analysis

# All alerts
./scripts/analyze_prometheus.py <must-gather-path>

# Alerts in specific namespace
./scripts/analyze_prometheus.py <must-gather-path> --namespace openshift-monitoring

Shows monitoring information:

  • Alerts (state, namespace, name, active since, labels)
  • Total counts of pending and firing alerts
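
If the gather includes a Prometheus alerts dump, the per-alert fields follow Prometheus's /api/v1/alerts response shape. A sketch; note the file path here is an assumption and differs between must-gather images:

import json
from pathlib import Path

def firing_alerts(mg: Path):
    # Assumed location of the alerts dump; adjust for your gather.
    data = json.loads((mg / "monitoring/alerts.json").read_text())
    for alert in data.get("data", {}).get("alerts", []):
        labels = alert.get("labels", {})
        print(f"{alert.get('state', '?'):<10} {labels.get('namespace', '-'):<30} "
              f"{labels.get('alertname', '?')}  active since {alert.get('activeAt', '?')}")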

3. Interpret and Report

After running the scripts:

  1. Review the summary statistics
  2. Focus on items flagged with issues
  3. Provide actionable insights and next steps
  4. Suggest log analysis for specific components if needed
  5. Cross-reference issues (e.g., degraded operator → failing pods → node issues)

Output Format

All scripts provide:

  • Summary Section: High-level statistics with emoji indicators
  • Table View: oc-like formatted output
  • Issues Section: Detailed breakdown of problems

Example summary format:

================================================================================
SUMMARY: 25/28 operators healthy
  ⚠️  3 operators with issues
  🔄 1 progressing
  ❌ 2 degraded
================================================================================

Helper Scripts Reference

scripts/analyze_clusterversion.py

Parses: cluster-scoped-resources/config.openshift.io/clusterversions/version.yaml
Output: ClusterVersion table with detailed version info, conditions, and capabilities

scripts/analyze_clusteroperators.py

Parses: cluster-scoped-resources/config.openshift.io/clusteroperators/
Output: ClusterOperator status table with conditions

scripts/analyze_pods.py

Parses: namespaces/*/pods/*/*.yaml (individual pod directories)
Output: Pod status table with issues categorized

scripts/analyze_nodes.py

Parses: cluster-scoped-resources/core/nodes/
Output: Node status table with conditions and capacity

scripts/analyze_network.py

Parses: network_logs/, network operator, OVN resources
Output: Network health summary and diagnostics

scripts/analyze_events.py

Parses: namespaces/*/core/events.yaml
Output: Event table sorted by last occurrence

scripts/analyze_etcd.py

Parses: etcd_info/ (endpoint_health.json, member_list.json, endpoint_status.json)
Output: etcd cluster health and member status

scripts/analyze_pvs.py

Parses: cluster-scoped-resources/core/persistentvolumes/, namespaces/*/core/persistentvolumeclaims.yaml
Output: PV and PVC status tables

Tips for Analysis

  1. Start with Cluster Operators: They often reveal system-wide issues
  2. Check Timing: Look at "SINCE" columns to understand when issues started
  3. Follow Dependencies: Degraded operator → check its namespace pods → check hosting nodes
  4. Look for Patterns: Multiple pods failing on the same node suggests a node issue
  5. Cross-reference: Use multiple scripts together for a complete picture

Common Scenarios

"Why is my cluster degraded?"

  1. Run analyze_clusteroperators.py - identify degraded operators
  2. Run analyze_pods.py --namespace <operator-namespace> - check operator pods
  3. Run analyze_nodes.py - verify node health

"Pods keep crashing"

  1. Run analyze_pods.py --problems-only - find crashlooping pods
  2. Check which nodes they're on
  3. Run analyze_nodes.py - verify node conditions
  4. Suggest checking pod logs in must-gather data

"Network connectivity issues"

  1. Run analyze_network.py - check network health
  2. Run analyze_pods.py --namespace openshift-ovn-kubernetes
  3. Check PodNetworkConnectivityCheck results

Next Steps After Analysis

Based on findings, suggest:

  • Examining specific pod logs in namespaces/<ns>/pods/<pod>/<container>/logs/
  • Reviewing events in namespaces/<ns>/core/events.yaml
  • Checking audit logs in audit_logs/
  • Analyzing metrics data if available
  • Looking at host service logs in host_service_logs/

GitHub Repository

openshift-eng/ai-helpers
Path: plugins/must-gather/skills/must-gather-analyzer
