Root Cause Tracing

bobmatnyc

Updated Yesterday

Tags: Other, debugging, root-cause, tracing, call-stack

About

This skill systematically traces bugs backward through the call stack to identify their original triggers rather than just fixing symptoms. It's designed for use when errors occur deep in execution with unclear data origins or long call chains. The approach involves observing symptoms, finding immediate causes, and repeatedly asking "what called this" until reaching the source.

Documentation

Overview

Bugs often manifest deep in the call stack: git init run in the wrong directory, a file created in the wrong location, a database opened with the wrong path. Your instinct is to fix the code where the error appears, but that only treats a symptom.

Core principle: Trace backward through the call chain until you find the original trigger, then fix at the source.

This skill is a specialized technique within the systematic-debugging workflow, typically applied during Phase 1 (Root Cause Investigation) when dealing with deep call stacks.

When to Use This Skill

Use root-cause-tracing when:

Error happens deep in execution (not at entry point)
Stack trace shows long call chain
Unclear where invalid data originated
Need to find which test/code triggers the problem
Symptom appears far from actual cause

Relationship with systematic-debugging:

systematic-debugging: The overall framework (Phases 1-4)
root-cause-tracing: A specific technique for Phase 1 investigation
Use root-cause-tracing WITHIN systematic-debugging Phase 1

The Iron Law

NEVER FIX JUST WHERE THE ERROR APPEARS
ALWAYS TRACE BACK TO FIND THE ORIGINAL TRIGGER

Fixing symptoms creates band-aid solutions that mask root problems.

Core Principles

Trace Backward: Follow call chain from symptom to source
Find Original Trigger: Identify where bad data/state originated
Fix at Source: Address root cause, not symptom
Defense-in-Depth: Add validation at each layer after fixing source

Quick Start

The 5-Step Trace Process

1. Observe the Symptom: What error message? What failed operation?
2. Find Immediate Cause: What code directly causes this error?
3. Ask What Called This: Trace one level up the call stack
4. Keep Tracing Up: Continue until you find the original trigger
5. Fix at Source + Defense: Fix root cause and add layer validation

Decision Tree

Error appears deep in stack?
  → Yes: Start tracing backward
    → Can identify caller? → Trace one level up → Repeat
    → Cannot identify caller? → Add instrumentation (see advanced-techniques.md)
  → No: May not need tracing (error at entry point)
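When the caller cannot be identified by reading the code, one possible form of instrumentation is to capture the call chain at the point where the bad value is used. The sketch below assumes a V8-style `Error.stack` format (as in Node.js); `captureCallChain`, `initRepo`, and `createWorktree` are hypothetical names used only for illustration, not part of the skill.

```typescript
// Hypothetical helper: returns the frames that led to the current call site,
// innermost caller first. Assumes the V8 stack format, where the first line
// is the error header and each following line is one "at ..." frame.
function captureCallChain(): string[] {
  const stack = new Error().stack ?? "";
  return stack
    .split("\n")
    .slice(1) // drop the "Error" header line
    .map((line) => line.trim())
    .filter((line) => line.length > 0)
    .slice(1); // drop the frame for captureCallChain itself
}

// Hypothetical deep function whose caller is unclear.
function initRepo(cwd: string): string[] {
  const chain = captureCallChain();
  // console.error is used because a test runner may suppress the app logger.
  console.error("DEBUG initRepo", { cwd, chain });
  return chain;
}

// Hypothetical intermediate caller.
function createWorktree(projectDir: string): string[] {
  return initRepo(projectDir);
}
```

Calling `createWorktree("")` logs the chain and returns it, so the frames above `initRepo` answer "what called this" one level at a time.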

The Tracing Process

Example: Git init in wrong directory

Error symptom → execFileAsync('git', ['init'], { cwd: '' })
  ← WorktreeManager.createSessionWorktree(projectDir='')
  ← Session.create() → Project.create() → Test code
  ← ROOT CAUSE: setupCoreTest() returns { tempDir: '' } before beforeEach

At each level ask: Where did this value come from? Is this the origin?
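One way to keep track of those two questions is to write the trace down as data while you work. The following is only a sketch: the `TraceStep` shape and the `findOrigin` and `formatTrace` helpers are assumptions made for illustration, and the recorded steps simply restate the git init chain above.

```typescript
// One hop in a documented trace.
interface TraceStep {
  location: string;      // where the bad value was observed
  receivedFrom?: string; // who supplied it; undefined means it originated here
}

// Returns the first step with no upstream supplier: the candidate original trigger.
function findOrigin(steps: TraceStep[]): TraceStep | undefined {
  return steps.find((step) => step.receivedFrom === undefined);
}

// Renders the chain from symptom back to source.
function formatTrace(steps: TraceStep[]): string {
  return steps.map((step) => step.location).join(" <- ");
}

// The chain from the example, recorded symptom first.
const trace: TraceStep[] = [
  { location: "execFileAsync('git', ['init'])", receivedFrom: "WorktreeManager.createSessionWorktree" },
  { location: "WorktreeManager.createSessionWorktree", receivedFrom: "Session.create" },
  { location: "Session.create", receivedFrom: "Project.create" },
  { location: "Project.create", receivedFrom: "test code" },
  { location: "test code", receivedFrom: "setupCoreTest" },
  { location: "setupCoreTest" }, // no upstream supplier: the empty tempDir starts here
];
```

Here `findOrigin(trace)` returns the `setupCoreTest` step, which is where the fix belongs.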

For detailed tracing methodology, see Tracing Techniques.
For complete real-world examples, see Examples.

After Finding Root Cause

Fix at the source (throw if the value is accessed before initialization), then add defense-in-depth: validation at Project.create and WorkspaceManager, environment guards, and instrumentation.

This prevents similar bugs and catches issues earlier.
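A minimal sketch of what "fix at source plus one defensive layer" might look like follows. The names `TestContext`, `init`, and `assertValidProjectDir` are hypothetical and stand in for the real setup helper and validation code, which the source does not show.

```typescript
// Hypothetical test-setup helper. Fix at source: reading tempDir before
// init() has run throws, instead of silently yielding an empty string.
class TestContext {
  private dir: string | undefined;

  // Would be called from beforeEach in a real test suite (assumption).
  init(dir: string): void {
    this.dir = dir;
  }

  get tempDir(): string {
    if (this.dir === undefined) {
      throw new Error("tempDir accessed before initialization");
    }
    return this.dir;
  }
}

// Defense-in-depth: a later layer still rejects an empty directory,
// in case some other code path produces one.
function assertValidProjectDir(projectDir: string): string {
  if (projectDir.trim() === "") {
    throw new Error("projectDir must be a non-empty path");
  }
  return projectDir;
}
```

With the getter in place the failure surfaces at the first premature access, close to the root cause, rather than several calls later at git init.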

Navigation

For detailed information:

Tracing Techniques: Complete tracing methodology, patterns, and decision trees
Examples: Real-world debugging scenarios with full trace chains
Advanced Techniques: Stack traces, instrumentation, test pollution detection
Integration: How to use with systematic-debugging and other skills

Key Reminders

NEVER fix just where the error appears
ALWAYS trace back to find the original trigger
Use console.error() for debugging in tests (logger may be suppressed)
Log BEFORE the dangerous operation, not after it fails
Include context: directory, cwd, environment, timestamps
Add defense-in-depth after fixing source
Document your trace as you go (write down the call chain)
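The logging reminders above could be combined into a small wrapper like the one below. This is a sketch under assumptions: `logBeforeRisky` is a hypothetical helper, and it assumes a Node.js environment where `process` is available.

```typescript
// Hypothetical wrapper: logs context BEFORE running a risky operation,
// so the log line exists even if the operation then throws.
function logBeforeRisky<T>(
  label: string,
  context: Record<string, unknown>,
  op: () => T,
): T {
  // console.error rather than an app logger, which tests may suppress.
  console.error(`DEBUG ${label}:`, {
    ...context,
    cwd: process.cwd(),
    nodeEnv: process.env["NODE_ENV"],
    timestamp: new Date().toISOString(),
    stack: new Error().stack, // shows who triggered the operation
  });
  return op();
}
```

A call such as `logBeforeRisky("git init", { directory }, () => runGitInit(directory))`, where `runGitInit` is whatever performs the operation, records the directory and call stack before the command runs.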

Red Flags - STOP

If you catch yourself thinking:

"I'll just add validation here" (without finding source)
"This will prevent the error" (symptom fix)
"Too hard to trace back" (add instrumentation instead)
"Quick fix for now" (creates technical debt)

ALL of these mean: Continue tracing to find root cause.

Integration with Other Skills

systematic-debugging: Use root-cause-tracing during Phase 1
defense-in-depth: Add after finding root cause
verification-before-completion: Verify fix worked at source
test-driven-development: Write test for root cause, not symptom

See Integration for complete workflow examples.

Real-World Impact

From a debugging session (2025-10-03):

Found root cause through 5-level trace
Fixed at source (getter validation)
Added 4 layers of defense
1847 tests passed, zero pollution
Time saved: 3+ hours vs symptom-fix approach

Bottom line: Tracing takes 15-30 minutes. Symptom fixes take hours of whack-a-mole.

Quick Install

/plugin add https://github.com/bobmatnyc/claude-mpm/tree/main/root-cause-tracing

Copy and paste this command into Claude Code to install this skill.

GitHub Repository

bobmatnyc/claude-mpm

Path: src/claude_mpm/skills/bundled/debugging/root-cause-tracing

Related Skills

smart-bug-fix

Testing

This skill provides an intelligent bug-fixing workflow that systematically identifies root causes using deep analysis and multi-model reasoning. It then generates fixes through Codex auto-fix and validates them with comprehensive testing and regression analysis. Use this skill for methodical debugging that combines automated fixes with thorough validation.

sherlock-review

Other

Sherlock-review is an evidence-based code review skill that uses deductive reasoning to systematically verify implementation claims, investigate bugs, and perform root cause analysis. It guides developers through a process of observation, deduction, and elimination to determine what actually happened versus what was claimed. This skill is ideal for validating fixes, conducting security audits, and performing performance validation.

when-debugging-ml-training-use-ml-training-debugger

Other

This skill helps developers diagnose and fix common machine learning training issues like loss divergence, overfitting, and slow convergence. It provides systematic debugging to identify root causes and delivers fixed code with optimization recommendations. Use it when facing training problems like NaN losses, poor validation performance, or when training fails to converge properly.

systematic-debugging

Other

This skill provides a structured four-phase debugging framework to replace random code changes with systematic problem diagnosis. It helps developers methodically investigate bugs, errors, and unexpected behavior by forming specific hypotheses and testing single changes. Use it when under time pressure or when quick fixes seem obvious to ensure reliable problem resolution.
