intermittent-issue-debugging
About
This Claude Skill helps debug sporadic, hard-to-reproduce issues like flaky tests and race conditions. It provides systematic investigation techniques including comprehensive logging and monitoring strategies. Developers should use it when dealing with intermittent errors, timing-dependent bugs, or resource exhaustion issues.
Documentation
Intermittent Issue Debugging
Overview
Intermittent issues are among the hardest to debug because they don't occur consistently. A systematic approach and comprehensive monitoring are essential.
When to Use
- Sporadic errors in logs
- Users report occasional issues
- Flaky tests
- Race conditions suspected
- Timing-dependent bugs
- Resource exhaustion issues
Instructions
1. Capturing Intermittent Issues
// Strategy 1: Comprehensive Logging
// Add detailed logging around suspected code
function processPayment(orderId) {
  const startTime = Date.now();
  console.log(`[${startTime}] Payment start: order=${orderId}`);
  try {
    const result = chargeCard(orderId);
    console.log(`[${Date.now()}] Payment success: ${orderId}`);
    return result;
  } catch (error) {
    const duration = Date.now() - startTime;
    console.error(`[${Date.now()}] Payment FAILED:`, {
      order: orderId,
      error: error.message,
      duration_ms: duration,
      error_type: error.constructor.name,
      stack: error.stack
    });
    throw error;
  }
}
// Strategy 2: Correlation IDs
// Track a request across systems
const correlationId = generateId();
const orderId = 123;
logger.info({
  correlationId,
  action: 'payment_start',
  orderId
});
// Pass the same ID downstream so other services can log it too
chargeCard(orderId, {headers: {correlationId}});
logger.info({
  correlationId,
  action: 'payment_end',
  status: 'success'
});
// Later, grep the logs by correlationId to see the full trace
// Strategy 3: Error Sampling
// Capture the full error context when an error occurs (browser only)
window.addEventListener('error', (event) => {
  const errorData = {
    message: event.message,
    url: event.filename,
    line: event.lineno,
    col: event.colno,
    stack: event.error?.stack,
    userAgent: navigator.userAgent,
    // performance.memory is non-standard (Chromium only); undefined elsewhere
    memory: performance.memory?.usedJSHeapSize,
    timestamp: new Date().toISOString()
  };
  sendToMonitoring(errorData); // Send to error tracking
});
2. Common Intermittent Issues
Issue: Race Condition
Symptom: Inconsistent behavior depending on timing
Example:
Thread 1: Read count (5)
Thread 2: Read count (5), increment to 6, write
Thread 1: Increment to 6, write (overwrites Thread 2's update)
Result: Count should be 7, but is 6 (lost update)
Debug:
1. Add detailed timestamps
2. Log all operations
3. Look for overlapping operations
4. Check if order matters
Solution:
- Use locks/mutexes
- Use atomic operations
- Use message queues
- Ensure single writer
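The lost update above, and the lock-based fix, can be sketched in JavaScript, where the same interleaving happens between awaits rather than threads. This is a minimal illustration, not a production lock: `Mutex`, `makeCounter`, `unsafeIncrement`, and `safeIncrement` are hypothetical names invented for this example, and the counter is an in-memory stand-in for a shared store.

```javascript
// Minimal promise-chain mutex (hypothetical helper, not from any library).
class Mutex {
  constructor() {
    this._tail = Promise.resolve();
  }

  // Runs fn only after every previously queued fn has settled.
  runExclusive(fn) {
    const result = this._tail.then(() => fn());
    // Keep the queue usable even if fn rejects.
    this._tail = result.catch(() => {});
    return result;
  }
}

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Simulated shared store with asynchronous reads and writes.
function makeCounter(initial) {
  let count = initial;
  return {
    read: async () => { await sleep(1); return count; },
    write: async (value) => { await sleep(1); count = value; },
    peek: () => count,
  };
}

// Unsafe: another caller can read the old value between our read and write.
async function unsafeIncrement(counter) {
  const current = await counter.read();
  await counter.write(current + 1);
}

// Safe: the whole read-modify-write sequence runs under the mutex.
function safeIncrement(counter, mutex) {
  return mutex.runExclusive(() => unsafeIncrement(counter));
}

async function demo() {
  const racy = makeCounter(5);
  await Promise.all([unsafeIncrement(racy), unsafeIncrement(racy)]);

  const guarded = makeCounter(5);
  const mutex = new Mutex();
  await Promise.all([safeIncrement(guarded, mutex), safeIncrement(guarded, mutex)]);

  return { racy: racy.peek(), guarded: guarded.peek() };
}
```

Calling `demo()` shows the unguarded counter losing one of the two increments while the guarded counter keeps both. An in-process mutex like this only serializes callers inside one process; across processes or machines, an atomic operation in the data store or a single-writer queue is the closer equivalent.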
---
Issue: Timing-Dependent Bug
Symptom: Test sometimes passes and sometimes fails
Example:
test_user_creation:
1. Create user (sometimes slow)
2. Check user exists
3. Fails if create took too long
Debug:
- Add timeout logging
- Increase wait time
- Add explicit waits
- Mock slow operations
Solution:
- Explicit wait for condition
- Remove time-dependent assertions
- Use proper test fixtures
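An explicit wait for a condition might look like the sketch below: instead of sleeping for a fixed time and hoping the user exists, the test polls until the condition holds or a deadline passes. `waitFor`, `makeSlowUserStore`, and the default timeout and interval values are assumptions made for this illustration; many test frameworks ship their own equivalent helper.

```javascript
// Polls an async predicate until it returns a truthy value or the timeout elapses.
async function waitFor(predicate, { timeoutMs = 2000, intervalMs = 20 } = {}) {
  const deadline = Date.now() + timeoutMs;
  while (true) {
    const value = await predicate();
    if (value) return value;
    if (Date.now() >= deadline) {
      throw new Error(`Condition not met within ${timeoutMs} ms`);
    }
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
}

// Hypothetical stand-in for a service where creation completes after a delay.
function makeSlowUserStore(delayMs) {
  const users = new Map();
  return {
    createUser(name) {
      setTimeout(() => users.set(name, { name }), delayMs);
    },
    findUser: async (name) => users.get(name) ?? null,
  };
}

// The test waits for the condition itself, not for a guessed duration.
async function testUserCreation() {
  const store = makeSlowUserStore(50);
  store.createUser('alice');
  const user = await waitFor(() => store.findUser('alice'), { timeoutMs: 1000 });
  return user.name;
}
```

The timeout still bounds the test, but it now acts as an upper limit for failure rather than as the expected duration, so a slow create no longer fails the test unless it is slower than the limit.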
---
Issue: Resource Exhaustion
Symptom: Works fine at first, but fails after running for a while
Example:
- Memory grows over time
- Connection pool exhausted
- Disk space fills up
- Max open files reached
Debug:
- Monitor resources continuously
- Check for leaks (memory growth)
- Monitor connection count
- Check long-running processes
Solution:
- Fix memory leak
- Increase resource limits
- Implement cleanup
- Add monitoring/alerts
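Continuous monitoring for memory growth could be sketched as follows for a Node.js process: sample the heap at a fixed interval and alert when the fitted trend over a recent window is steadily upward. `process.memoryUsage()` is the standard Node.js call; `slope`, `isGrowing`, `startHeapMonitor`, and every default value (interval, window size, threshold) are assumptions chosen for this example and would need tuning per application.

```javascript
// Least-squares slope of the values against their sample index
// (units: bytes per sample when given heap sizes).
function slope(values) {
  const n = values.length;
  if (n < 2) return 0;
  const meanX = (n - 1) / 2;
  const meanY = values.reduce((sum, v) => sum + v, 0) / n;
  let numerator = 0;
  let denominator = 0;
  values.forEach((y, x) => {
    numerator += (x - meanX) * (y - meanY);
    denominator += (x - meanX) ** 2;
  });
  return numerator / denominator;
}

// True when the fitted trend exceeds the allowed growth per sample.
function isGrowing(samples, thresholdBytesPerSample) {
  return slope(samples) > thresholdBytesPerSample;
}

// Samples heap usage periodically and calls onAlert on sustained growth.
// Returns a function that stops the monitor.
function startHeapMonitor({
  intervalMs = 60_000,
  windowSize = 30,
  thresholdBytesPerSample = 1024 * 1024,
  onAlert = console.warn,
} = {}) {
  const samples = [];
  const timer = setInterval(() => {
    samples.push(process.memoryUsage().heapUsed);
    if (samples.length > windowSize) samples.shift();
    if (samples.length === windowSize && isGrowing(samples, thresholdBytesPerSample)) {
      onAlert('Heap usage is trending upward', { latest: samples[samples.length - 1] });
    }
  }, intervalMs);
  timer.unref(); // do not keep the process alive just for monitoring
  return () => clearInterval(timer);
}
```

Fitting a trend over a window, rather than comparing two readings, avoids alerting on the normal sawtooth that garbage collection produces.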
---
Issue: Intermittent Network Failure
Symptom: API calls occasionally fail
Debug:
- Check network logs
- Identify timeout patterns
- Check if time-of-day dependent
- Check if load dependent
Solution:
- Implement exponential backoff retry
- Add circuit breaker
- Increase timeout
- Add redundancy
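Exponential backoff retry, the first solution listed above, can be sketched as below. The delay cap doubles after each failed attempt, and a random "full jitter" factor is applied so that many clients do not retry in lockstep. `retryWithBackoff`, its option names, and the default values are assumptions for this illustration; `sleepFn` is injectable only so the behavior can be tested without real waiting.

```javascript
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Retries an async operation, waiting up to baseDelayMs * 2^attempt
// (capped at maxDelayMs) between attempts.
async function retryWithBackoff(
  operation,
  { retries = 4, baseDelayMs = 100, maxDelayMs = 5000, sleepFn = sleep } = {}
) {
  for (let attempt = 0; ; attempt++) {
    try {
      return await operation(attempt);
    } catch (error) {
      // Out of retries: surface the last error to the caller.
      if (attempt >= retries) throw error;
      const cap = Math.min(maxDelayMs, baseDelayMs * 2 ** attempt);
      const delay = Math.random() * cap; // full jitter: anywhere in [0, cap)
      await sleepFn(delay);
    }
  }
}
```

A call such as `retryWithBackoff(() => fetch(url))` would retry on rejected promises only. Retrying is appropriate for idempotent calls; for operations such as charging a card, a retry without an idempotency key risks performing the action twice.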
3. Systematic Investigation Process
Step 1: Understand the Pattern
Questions:
- How often does it occur? (1/100, 1/1000?)
- When does it occur? (time of day, load, specific user?)
- What are the conditions? (network, memory, load?)
- Is it reproducible? (deterministic or random?)
- Any recent changes?
Analysis:
- Review error logs
- Check error rate trends
- Identify patterns
- Correlate with changes
Step 2: Reproduce Reliably
Methods:
- Increase test frequency (run 1000 times)
- Stress test (heavy load)
- Simulate poor conditions (network, memory)
- Run on different machines
- Run in production-like environment
Goal: Make the issue reproduce consistently enough to analyze
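The first method above, running the scenario many times, can be sketched as a small harness that records which runs fail. `measureFailureRate` is a hypothetical helper written for this example; the function under test is whatever suspect operation is being investigated.

```javascript
// Runs fn repeatedly and records which iterations throw or reject.
async function measureFailureRate(fn, runs = 1000) {
  const failures = [];
  for (let i = 0; i < runs; i++) {
    try {
      await fn(i);
    } catch (error) {
      failures.push({ run: i, message: error.message });
    }
  }
  return {
    runs,
    failed: failures.length,
    rate: failures.length / runs,
    failures,
  };
}
```

The measured rate gives a baseline: after a candidate fix, the same harness with the same number of runs shows whether the failure rate actually dropped, which a single passing run cannot show.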
Step 3: Add Instrumentation
- Add detailed logging
- Add monitoring metrics
- Add trace IDs
- Capture errors fully
- Log system state
Step 4: Capture the Issue
- Recreate scenario
- Capture full context
- Note system state
- Document conditions
- Get reproduction case
Step 5: Analyze Data
- Review logs
- Look for patterns
- Compare normal vs error cases
- Check timing correlations
- Identify root cause
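Comparing normal and error cases is easier when logs are grouped by correlation ID. The sketch below assumes one JSON object per log line with `correlationId` and `action` fields, and reuses the `payment_start` and `payment_end` action names from the logging example earlier; `findIncompleteTraces` is a hypothetical helper, and real log formats will differ.

```javascript
// Groups JSON log lines by correlationId and returns the IDs whose trace
// has a start event but no matching end event.
function findIncompleteTraces(logLines, startAction = 'payment_start', endAction = 'payment_end') {
  const traces = new Map();
  for (const line of logLines) {
    let entry;
    try {
      entry = JSON.parse(line);
    } catch {
      continue; // skip lines that are not JSON
    }
    if (!entry || !entry.correlationId) continue;
    if (!traces.has(entry.correlationId)) traces.set(entry.correlationId, []);
    traces.get(entry.correlationId).push(entry.action);
  }
  return [...traces]
    .filter(([, actions]) => actions.includes(startAction) && !actions.includes(endAction))
    .map(([id]) => id);
}
```

The returned IDs are the requests worth reading in full: each one started but never logged completion, so its trace can be compared side by side with a trace that finished normally.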
Step 6: Implement Fix
- Based on root cause
- Verify with reproduction case
- Test extensively
- Add regression test
4. Monitoring & Prevention
Monitoring Strategy:
Real User Monitoring (RUM):
- Error rates by feature
- Latency percentiles
- User impact
- Trend analysis
Application Performance Monitoring (APM):
- Request traces
- Database query performance
- External service calls
- Resource usage
Synthetic Monitoring:
- Regular test execution
- Simulate user flows
- Alert on failures
- Trend tracking
---
Alerting:
Set up alerts for:
- Error rate spike
- Response time > threshold
- Memory growth trend
- Failed transactions
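An error-rate spike alert can be sketched as a sliding-window counter. In practice this check usually lives in the monitoring system rather than in application code, so the class below only illustrates the logic. `ErrorRateMonitor` and its defaults (a 60-second window, a 5% threshold, a minimum of 20 requests) are assumptions for this example.

```javascript
// Tracks request outcomes in a sliding time window and reports
// when the error rate exceeds a threshold.
class ErrorRateMonitor {
  constructor({ windowMs = 60_000, threshold = 0.05, minRequests = 20 } = {}) {
    this.windowMs = windowMs;
    this.threshold = threshold;
    this.minRequests = minRequests;
    this.events = [];
  }

  // Records one request outcome; ok is true for success, false for failure.
  record(ok, now = Date.now()) {
    this.events.push({ time: now, ok });
    this._prune(now);
  }

  // Drops events older than the window.
  _prune(now) {
    const cutoff = now - this.windowMs;
    while (this.events.length > 0 && this.events[0].time < cutoff) {
      this.events.shift();
    }
  }

  errorRate(now = Date.now()) {
    this._prune(now);
    if (this.events.length === 0) return 0;
    const errors = this.events.filter((event) => !event.ok).length;
    return errors / this.events.length;
  }

  // Requires a minimum sample size so a single failure on a quiet
  // system does not count as a spike.
  shouldAlert(now = Date.now()) {
    this._prune(now);
    return this.events.length >= this.minRequests && this.errorRate(now) > this.threshold;
  }
}
```

The minimum-request guard matters for intermittent issues in particular: at low traffic, one failure out of three requests is a 33% error rate but not necessarily a spike.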
---
Prevention Checklist:
[ ] Comprehensive logging in place
[ ] Error tracking configured
[ ] Performance monitoring active
[ ] Resource monitoring enabled
[ ] Correlation IDs used
[ ] Failed requests captured
[ ] Timeout values appropriate
[ ] Retry logic implemented
[ ] Circuit breakers in place
[ ] Load testing performed
[ ] Stress testing performed
[ ] Race conditions reviewed
[ ] Timing dependencies checked
---
Tools:
Monitoring:
- New Relic / Datadog
- Prometheus / Grafana
- Sentry / Rollbar
- Custom logging
Testing:
- Load testing (k6, JMeter)
- Chaos engineering (Gremlin)
- Property-based testing (Hypothesis)
- Fuzz testing
Debugging:
- Distributed tracing (Jaeger)
- Correlation IDs
- Detailed logging
- Debuggers
Key Points
- Comprehensive logging is essential
- Add correlation IDs for tracing
- Monitor for patterns and trends
- Stress test to reproduce
- Use detailed error context
- Implement exponential backoff for retries
- Monitor resource exhaustion
- Add circuit breakers for external services
- Log system state with errors
- Implement proper monitoring/alerting
Quick Install
/plugin add https://github.com/aj-geddes/useful-ai-prompts/tree/main/intermittent-issue-debugging
Copy and paste this command in Claude Code to install this skill.
