Data Lineage
About
Data Lineage tracks data flow from source to destination, enabling impact analysis and troubleshooting. It automatically captures transformations at orchestration boundaries and provides column-level lineage for critical data. Use this skill to build transparency and trust in data by understanding where it came from and how it changed.
Quick Install
Claude Code
Copy and paste either command in Claude Code to install this skill.

```bash
# Recommended: install via the plugin command
/plugin add https://github.com/majiayu000/claude-skill-registry

# Alternative: clone the registry manually
git clone https://github.com/majiayu000/claude-skill-registry.git ~/.claude/skills/"Data Lineage"
```
Documentation
Data Lineage
Overview
Data Lineage is the process of tracking what happens to data as it flows through various stages—from its original source, through transformations (ETL), to its final destination (dashboards, ML models, or external APIs). Lineage provides the "genealogy" of a dataset.
Core Principle: "To trust the data, you must know where it came from and how it changed."
Best Practices
- Capture lineage automatically at orchestration boundaries (Airflow, Spark, dbt) instead of manual docs.
- Standardize dataset naming (`namespace` + `name`) and keep it stable across environments.
- Enrich events with run context (job version/git SHA, run ID, owner/team, and environment).
- Prioritize column-level lineage for PII and business-critical metrics; keep table-level for everything else.
- Make lineage actionable: use it in schema change reviews and incident RCA/impact analysis.
Quick Start
- Choose a lineage standard/tooling (e.g., OpenLineage + Marquez/DataHub).
- Instrument your orchestrator to emit lineage events for each job run.
- Register stable dataset identifiers (warehouse tables, S3 paths, Kafka topics).
- Visualize lineage and validate it during schema changes.
- Alert on missing lineage for critical pipelines (treat as a reliability issue).
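The sketch below builds a minimal OpenLineage run event as a plain dict; the job and dataset names are illustrative.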
```python
from __future__ import annotations

import json
from datetime import datetime, timezone
from uuid import uuid4


def build_openlineage_run_event(
    *,
    job_namespace: str,
    job_name: str,
    input_dataset: tuple[str, str],
    output_dataset: tuple[str, str],
    event_type: str = "COMPLETE",
) -> dict:
    """Build a minimal OpenLineage run event as a plain dict.

    Datasets are (namespace, name) pairs, e.g. ("db_raw", "raw_orders").
    """
    event_time = datetime.now(tz=timezone.utc).isoformat()
    run_id = str(uuid4())  # each job run gets a unique ID
    input_ns, input_name = input_dataset
    output_ns, output_name = output_dataset
    return {
        "eventType": event_type,
        "eventTime": event_time,
        "run": {"runId": run_id},
        "job": {"namespace": job_namespace, "name": job_name},
        "inputs": [{"namespace": input_ns, "name": input_name}],
        "outputs": [{"namespace": output_ns, "name": output_name}],
        "producer": "https://openlineage.io/",
        "schemaURL": "https://openlineage.io/spec/1-0-0/OpenLineage.json",
    }


if __name__ == "__main__":
    event = build_openlineage_run_event(
        job_namespace="prod-etl",
        job_name="clean_orders_job",
        input_dataset=("db_raw", "raw_orders"),
        output_dataset=("db_prod", "clean_orders"),
    )
    print(json.dumps(event, indent=2))
```
1. Why Data Lineage Matters
| Benefit | Use Case |
|---|---|
| Root Cause Analysis | A dashboard is wrong; which upstream table caused the error? |
| Impact Analysis | I want to delete a column; will it break any downstream reports? |
| Compliance | Where exactly does PII (SSN, Email) flow in our system? (GDPR/CCPA) |
| Data Discovery | How is the `active_users` metric actually calculated? |
2. Types of Data Lineage
- Table-Level Lineage: Shows how data moves between tables (e.g., `raw_orders` -> `clean_orders` -> `orders_summary`).
- Column-Level Lineage: Shows how a specific field is transformed (e.g., `first_name` + `last_name` -> `full_name`).
- Business Lineage: High-level view showing how data moves across departments or broad systems (SaaS -> Data Warehouse -> BI Dashboard).
3. Technical Implementation
A. SQL Parsing
Reading SQL scripts to identify `INSERT INTO ... SELECT FROM` patterns.
- Tools: `sqlglot` or `sqlfluff`.
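As a rough illustration, `sqlglot` can parse a statement and expose the tables it reads and writes. The SQL and table names below are illustrative, and a real pipeline needs more care (CTEs, subqueries, dialect differences):

```python
# A minimal sketch: derive table-level lineage from one INSERT ... SELECT
# statement with sqlglot (pip install sqlglot). Illustrative names only.
import sqlglot
from sqlglot import exp

sql = """
INSERT INTO db_prod.clean_orders
SELECT o.order_id, o.amount
FROM db_raw.raw_orders AS o
JOIN db_raw.customers AS c ON o.customer_id = c.id
"""


def qualified(table: exp.Table) -> str:
    # Render db.table, ignoring any alias.
    return ".".join(part for part in (table.db, table.name) if part)


parsed = sqlglot.parse_one(sql)
target = qualified(parsed.this)  # the table the INSERT writes to
sources = {qualified(t) for t in parsed.find_all(exp.Table)} - {target}

print(f"{sorted(sources)} -> {target}")
# ['db_raw.customers', 'db_raw.raw_orders'] -> db_prod.clean_orders
```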
B. OpenLineage Standard
OpenLineage is an open standard for lineage metadata collection. It uses "Jobs" and "Datasets" to represent relationships.
```json
{
  "eventTime": "2024-01-15T12:00:00Z",
  "job": { "namespace": "prod-etl", "name": "clean_orders_job" },
  "inputs": [ { "namespace": "db_raw", "name": "raw_orders" } ],
  "outputs": [ { "namespace": "db_prod", "name": "clean_orders" } ]
}
```
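Events like this are typically POSTed to a lineage backend. A hedged sketch against Marquez's lineage endpoint (the URL and port assume a local default installation):

```python
# A minimal sketch: send the event built earlier to a local Marquez instance.
# Assumes Marquez is running on its default port; adjust the URL for your setup.
import requests

resp = requests.post(
    "http://localhost:5000/api/v1/lineage",
    json=event,  # the dict returned by build_openlineage_run_event(...)
    timeout=10,
)
resp.raise_for_status()
```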
C. dbt Lineage
dbt automatically generates a lineage graph (the "DAG") from your project dependencies.
```bash
# Generate and serve the lineage documentation
dbt docs generate
dbt docs serve
```
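The same graph is available programmatically: `dbt docs generate` writes a `manifest.json` whose `child_map` records each node's downstream dependents. A rough sketch (paths and node IDs follow dbt's defaults):

```python
# A minimal sketch: read table-level lineage from dbt's compiled manifest.
# Assumes `dbt docs generate` has produced target/manifest.json.
import json
from pathlib import Path

manifest = json.loads(Path("target/manifest.json").read_text())

# child_map maps each node ID (e.g., "model.my_project.clean_orders")
# to the node IDs that depend on it.
for node_id, children in manifest["child_map"].items():
    if node_id.startswith("model.") and children:
        print(f"{node_id} -> {children}")
```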
4. Tools for Data Lineage
| Tool | Focus | Best For |
|---|---|---|
| OpenLineage | Standard | Orchestrators like Airflow, Spark, dbt. |
| Amundsen | Data Discovery | Built by Lyft; focused on collaborative, user-driven search. |
| DataHub | Metadata Platform | Built by LinkedIn; extensive lineage and ownership tracking. |
| Monte Carlo | Observability | Automatically infers lineage from query logs. |
| Marquez | Metadata Store | Reference implementation for OpenLineage. |
5. Root Cause Analysis (RCA) with Lineage
Imagine a "Monthly Revenue" dashboard shows $0.
- Check Output: The dashboard reads from the `gold_monthly_revenue` table.
- Trace Upstream: `gold_monthly_revenue` is populated from `silver_orders`.
- Investigate Link: `silver_orders` is 10 GB but is usually 50 GB.
- Identify Source: `silver_orders` gets its data from `raw_stripe_api`.
- Conclusion: The Stripe API extraction job failed yesterday, causing missing data downstream.
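This walk is mechanical once the graph exists. A toy sketch of the upstream traversal (the adjacency map is illustrative):

```python
# A minimal sketch: breadth-first walk upstream through a table-level
# lineage graph to list candidate root causes. Illustrative graph only.
from collections import deque

upstream = {
    "gold_monthly_revenue": ["silver_orders"],
    "silver_orders": ["raw_stripe_api"],
    "raw_stripe_api": [],
}


def trace_upstream(dataset: str) -> list[str]:
    seen: set[str] = set()
    order: list[str] = []
    queue = deque([dataset])
    while queue:
        for parent in upstream.get(queue.popleft(), []):
            if parent not in seen:
                seen.add(parent)
                order.append(parent)
                queue.append(parent)
    return order


print(trace_upstream("gold_monthly_revenue"))
# ['silver_orders', 'raw_stripe_api']
```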
6. Impact Analysis Workflow
Before running `DROP COLUMN ccv` in a production database:
- Query Lineage Tool: "Search for usages of `transactions.ccv`."
- Identify Consumers: Discovery shows it is used by the `fraud_detection_model`.
- Coordinate: Contact the Fraud Team lead to ensure the model no longer needs the column.
- Action: Proceed with the "Tombstoning" strategy (see `schema-management`).
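A hedged sketch of the first step against Marquez's lineage API (the endpoint shape follows Marquez's documentation; node IDs and the response format may vary by version):

```python
# A minimal sketch: fetch the lineage graph around a dataset before a
# destructive schema change. Endpoint and node ID format assume Marquez.
import requests

node_id = "dataset:db_prod:transactions"
resp = requests.get(
    "http://localhost:5000/api/v1/lineage",
    params={"nodeId": node_id, "depth": 2},
    timeout=10,
)
resp.raise_for_status()

# Inspect the returned graph for downstream jobs/datasets to notify.
for node in resp.json().get("graph", []):
    print(node["id"])
```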
7. Tracking PII Flow
Lineage is the primary tool for privacy compliance. You can "Tag" a source field as PII and the lineage tool will propagate that tag down the flow.
- Source: `users.email` (Tagged: PII)
- Transformation: `LOWER(email)` (Propagated: PII)
- Target: `marketing_leads.contact` (Auto-Propagated: PII)
This allows security teams to identify which S3 buckets or BigQuery datasets require encryption at rest without manual audits.
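Propagation itself is a simple fixed-point computation over column-level edges. A toy sketch (the edges and tags are illustrative):

```python
# A minimal sketch: propagate PII tags along column-level lineage edges
# until no tag set changes. Columns are (table, column) pairs; illustrative.
edges = {
    ("users", "email"): [("staging", "email_lower")],
    ("staging", "email_lower"): [("marketing_leads", "contact")],
}
tags: dict[tuple[str, str], set[str]] = {("users", "email"): {"PII"}}

changed = True
while changed:
    changed = False
    for src, targets in edges.items():
        for dst in targets:
            merged = tags.get(dst, set()) | tags.get(src, set())
            if merged != tags.get(dst, set()):
                tags[dst] = merged
                changed = True

print(tags[("marketing_leads", "contact")])  # {'PII'}
```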
8. Automated vs. Manual Lineage
- Automated: Captured from query logs or the orchestrator (preferred). Low maintenance and reflects what actually ran.
- Manual: Documented in a wiki. High maintenance; quickly becomes outdated and unreliable.
9. Data Lineage Checklist
- Completeness: Does our lineage cover cross-system boundaries (e.g., Salesforce to Snowflake)?
- Granularity: Do we have column-level lineage for our most sensitive data?
- Orchestration: Is lineage captured automatically from every Airflow/dbt run?
- Impact Analysis: Is there a standard process to check lineage before a schema change?
- Ownership: Does every table in the lineage graph have a defined team/individual owner?
Related Skills
- 43-data-reliability/data-contracts
- 43-data-reliability/schema-management
- 44-ai-governance/ai-compliance
GitHub Repository: https://github.com/majiayu000/claude-skill-registry
