
data-engineer

majiayu000

About

The data-engineer skill provides scalable data pipeline development and ETL/ELT implementation expertise. It specializes in building data infrastructure using modern tools like Airflow, dbt, Spark, and Kafka, with a focus on reliability and cost optimization. Use it for designing data lakes/warehouses, stream processing, and data governance tasks.

Quick Install

Claude Code

Plugin Command (Recommended)
/plugin add https://github.com/majiayu000/claude-skill-registry

Git Clone (Alternative)
git clone https://github.com/majiayu000/claude-skill-registry.git ~/.claude/skills/data-engineer

Copy and paste one of these commands into Claude Code to install this skill.

Documentation

Data Engineer

Purpose

Provides expert data engineering capabilities for building scalable data pipelines, ETL/ELT workflows, data lakes, and data warehouses. Specializes in distributed data processing, stream processing, data quality, and modern data stack technologies (Airflow, dbt, Spark, Kafka) with a focus on reliability and cost optimization.

When to Use

  • Designing end-to-end data pipelines from source to consumption layer
  • Implementing ETL/ELT workflows with error handling and data quality checks
  • Building data lakes or data warehouses with optimal storage and querying
  • Setting up real-time stream processing (Kafka, Flink, Kinesis)
  • Optimizing data infrastructure costs (storage tiering, compute efficiency)
  • Implementing data governance and compliance (GDPR, data lineage)
  • Migrating legacy data systems to modern data platforms

Quick Start

Invoke this skill for any of the scenarios listed under When to Use above.

Do NOT invoke when:

  • Only SQL query optimization is needed (use database-optimizer instead)
  • Machine learning model development (use ml-engineer or data-scientist)
  • Simple data analysis or visualization (use data-analyst)
  • Database administration tasks (use database-administrator)
  • API integration without data transformation (use backend-developer)

Decision Framework

Pipeline Architecture Selection

├─ Batch Processing?
│   ├─ Daily/hourly schedules → Airflow + dbt (see the DAG sketch after this tree)
│   │   Pros: Mature ecosystem, SQL-based transforms
│   │   Cost: Low-medium
│   │
│   ├─ Large-scale (TB+) → Spark (EMR/Databricks)
│   │   Pros: Distributed processing, handles scale
│   │   Cost: Medium-high (compute-intensive)
│   │
│   └─ Simple transforms → dbt Cloud or Fivetran
│       Pros: Managed, low maintenance
│       Cost: Medium (SaaS pricing)
│
├─ Stream Processing?
│   ├─ Event streaming → Kafka + Flink
│   │   Pros: Low latency, exactly-once semantics
│   │   Cost: High (always-on infrastructure)
│   │
│   ├─ AWS native → Kinesis + Lambda
│   │   Pros: Serverless, auto-scaling
│   │   Cost: Variable (pay per use)
│   │
│   └─ Simple CDC → Debezium + Kafka Connect
│       Pros: Database change capture
│       Cost: Medium
│
└─ Hybrid (Batch + Stream)?
    └─ Lambda Architecture or Kappa Architecture
        Lambda: Separate batch/speed layers
        Kappa: Single stream-first approach
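
For the daily/hourly batch branch above (Airflow orchestrating dbt transforms), a minimal DAG might look like the sketch below. The DAG id, paths, and schedule are illustrative assumptions, not part of this skill, and the schedule argument assumes Airflow 2.4+ (older releases use schedule_interval).

# Minimal Airflow + dbt DAG sketch; paths and names are placeholders
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="daily_warehouse_build",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",      # Airflow 2.4+; use schedule_interval on older releases
    catchup=False,
) as dag:
    extract = BashOperator(
        task_id="extract_raw",
        # {{ ds }} is the templated execution date, keeping each run scoped to one day
        bash_command="python /opt/pipelines/extract_raw.py --date {{ ds }}",
    )
    transform = BashOperator(
        task_id="dbt_run",
        bash_command="cd /opt/analytics_dbt && dbt run",
    )
    extract >> transform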

Data Storage Selection

| Use Case | Technology | Pros | Cons |
|---|---|---|---|
| Structured analytics | Snowflake/BigQuery | SQL, fast queries | Cost at scale |
| Semi-structured | Delta Lake/Iceberg | ACID, schema evolution | Complexity |
| Raw storage | S3/GCS | Cheap, durable | No query engine |
| Real-time | Redis/DynamoDB | Low latency | Limited analytics |
| Time-series | TimescaleDB/InfluxDB | Optimized for time data | Specific use case |

ETL vs ELT Decision

| Factor | ETL (Transform First) | ELT (Load First) |
|---|---|---|
| Data volume | Small-medium | Large (TB+) |
| Transformation | Complex, pre-load | SQL-based, in-warehouse |
| Latency | Higher | Lower |
| Cost | Compute before load | Warehouse compute |
| Best for | Legacy systems | Modern cloud DW |

Core Patterns

Pattern 1: Idempotent Partition Overwrite

Use case: Safely re-run batch jobs without creating duplicates.

# PySpark example: overwrite only the partition for the given execution date,
# so re-running the job replaces that day's data instead of appending duplicates
from pyspark.sql import functions as F

def write_daily_partition(df, target_table, execution_date):
    (df
     # Stamp every row with this run's partition value (the idempotency key)
     .withColumn("process_date", F.lit(execution_date))
     .write
     .mode("overwrite")
     .partitionBy("process_date")
     # Dynamic mode replaces only the partitions present in df; others stay intact
     .option("partitionOverwriteMode", "dynamic")
     .format("parquet")
     .saveAsTable(target_table))

Pattern 2: Slowly Changing Dimension Type 2 (SCD2)

Use case: Track history of changes without losing past states.

-- dbt implementation of SCD2: one row per user per change, with a validity window
{{ config(
    materialized='incremental',
    unique_key='scd_id'   -- surrogate key; user_id alone would collapse the history
) }}

SELECT
    -- requires the dbt_utils package
    {{ dbt_utils.generate_surrogate_key(['user_id', 'updated_at']) }} AS scd_id,
    user_id, address, email, status,
    updated_at AS valid_from,
    LEAD(updated_at, 1, '9999-12-31') OVER (
        PARTITION BY user_id ORDER BY updated_at
    ) AS valid_to
FROM {{ source('raw', 'users') }}

Pattern 3: Dead Letter Queue (DLQ) for Streaming

Use case: Handle malformed messages without stopping the pipeline.
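
A minimal consumer-side sketch of this pattern, assuming the kafka-python client and illustrative topic names (orders and orders.dlq); the expected order_id field is also an assumption.

# Dead letter queue sketch with kafka-python; brokers, topics, and fields are placeholders
import json

from kafka import KafkaConsumer, KafkaProducer

consumer = KafkaConsumer("orders", bootstrap_servers="localhost:9092",
                         group_id="orders-etl", enable_auto_commit=False)
producer = KafkaProducer(bootstrap_servers="localhost:9092")

for message in consumer:
    try:
        record = json.loads(message.value)   # parse and validate the payload
        order_id = record["order_id"]        # required field for downstream processing
        # ... transform and write the good record to the sink here ...
    except (json.JSONDecodeError, KeyError, UnicodeDecodeError) as exc:
        # Route the malformed message to the dead letter topic with the error attached,
        # so the main pipeline keeps flowing and bad records can be replayed later
        producer.send("orders.dlq", value=message.value,
                      headers=[("error", str(exc).encode("utf-8"))])
    consumer.commit()   # commit the offset only after the message has been handled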

Pattern 4: Data Quality Circuit Breaker

Use case: Stop pipeline execution if data quality drops below threshold.
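
A simple PySpark sketch of the idea, assuming the breaker watches the null rate of one key column; the column name and threshold are placeholders, and in practice tools such as dbt tests or Great Expectations can supply the checks.

# Illustrative circuit breaker: fail the run when data quality drops below a threshold
from pyspark.sql import DataFrame, functions as F

def assert_null_rate(df: DataFrame, column: str, max_null_rate: float = 0.01) -> None:
    total = df.count()
    nulls = df.filter(F.col(column).isNull()).count()
    null_rate = (nulls / total) if total else 1.0
    if null_rate > max_null_rate:
        # Raising fails the task, so the orchestrator halts downstream loads
        # instead of propagating bad data
        raise ValueError(
            f"Quality circuit breaker tripped: {column} null rate {null_rate:.2%} "
            f"exceeds the {max_null_rate:.2%} threshold"
        )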

Quality Checklist

Data Pipeline

  • Idempotent (safe to retry)
  • Schema validation enforced
  • Error handling with retries
  • Data quality checks automated
  • Monitoring and alerting configured
  • Lineage documented

Performance

  • Pipeline completes within SLA (e.g., <1 hour)
  • Incremental loading where applicable
  • Partitioning strategy optimized
  • Query performance <30 seconds (P95)

Cost Optimization

  • Storage tiering implemented (hot/warm/cold)
  • Compute auto-scaling configured
  • Query cost monitoring active
  • Compression enabled (Parquet/ORC); see the write sketch below
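
A small PySpark sketch tying the partitioning and compression items together; the bucket path, tier name, and column are assumptions for illustration.

# Partitioned, compressed Parquet write to an illustrative "warm" storage tier
from pyspark.sql import DataFrame

def write_to_warm_tier(df: DataFrame) -> None:
    (df.write
       .mode("overwrite")
       .partitionBy("event_date")           # prune scans by date
       .option("compression", "snappy")     # fast, widely supported Parquet codec
       .parquet("s3://analytics-lake/warm/events/"))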

Additional Resources

GitHub Repository

majiayu000/claude-skill-registry
Path: skills/data-engineer-skill

Related Skills

content-collections

Meta

This skill provides a production-tested setup for Content Collections, a TypeScript-first tool that transforms Markdown/MDX files into type-safe data collections with Zod validation. Use it when building blogs, documentation sites, or content-heavy Vite + React applications to ensure type safety and automatic content validation. It covers everything from Vite plugin configuration and MDX compilation to deployment optimization and schema validation.

View skill

creating-opencode-plugins

Meta

This skill provides the structure and API specifications for creating OpenCode plugins that hook into 25+ event types like commands, files, and LSP operations. It offers implementation patterns for JavaScript/TypeScript modules that intercept and extend the AI assistant's lifecycle. Use it when you need to build event-driven plugins for monitoring, custom handling, or extending OpenCode's capabilities.

View skill

langchain

Meta

LangChain is a framework for building LLM applications using agents, chains, and RAG pipelines. It supports multiple LLM providers, offers 500+ integrations, and includes features like tool calling and memory management. Use it for rapid prototyping and deploying production systems like chatbots, autonomous agents, and question-answering services.

View skill

Algorithmic Art Generation

Meta

This skill helps developers create algorithmic art using p5.js, focusing on generative art, computational aesthetics, and interactive visualizations. It automatically activates for topics like "generative art" or "p5.js visualization" and guides you through creating unique algorithms with features like seeded randomness, flow fields, and particle systems. Use it when you need to build reproducible, code-driven artistic patterns.

View skill