MCP Hub

data-pipeline-engineer

majiayu000
Updated: Today
Tags: etl, spark, kafka, airflow, data-warehouse

About

This skill provides expert data engineering for building and optimizing ETL/ELT pipelines, streaming systems (Spark/Kafka), and data warehouses. Use it for data modeling, workflow orchestration (Airflow/dbt), batch/stream processing, and data quality assurance. API design, machine learning model training, and dashboard development are out of scope, as they call for other specialized skills.

Quick Install

Claude Code (Recommended)

Plugin command (recommended):
/plugin add https://github.com/majiayu000/claude-skill-registry

Git clone (alternative):
git clone https://github.com/majiayu000/claude-skill-registry.git ~/.claude/skills/data-pipeline-engineer

Copy and paste this command into Claude Code to install the skill.

Documentation

Data Pipeline Engineer

Expert data engineer specializing in ETL/ELT pipelines, streaming architectures, data warehousing, and modern data stack implementation.

Quick Start

  1. Identify sources - data formats, volumes, freshness requirements
  2. Choose architecture - Medallion (Bronze/Silver/Gold), Lambda, or Kappa
  3. Design layers - staging → intermediate → marts (dbt pattern)
  4. Add quality gates - Great Expectations or dbt tests at each layer
  5. Orchestrate - Airflow DAGs with sensors and retries
  6. Monitor - lineage, freshness, anomaly detection
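
As a rough illustration of how steps 3–5 might hang together, here is a minimal Airflow sketch that runs dbt staging and mart layers with a test gate between them. The DAG id, task ids, and dbt selectors are assumptions for illustration, not part of this skill's reference files (the intermediate layer is omitted for brevity):

from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.bash import BashOperator

# Illustrative DAG: staging -> (tests) -> marts -> (tests).
default_args = {"retries": 3, "retry_delay": timedelta(minutes=5)}

with DAG(
    dag_id="daily_sales_pipeline",          # hypothetical name
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    default_args=default_args,
    catchup=False,
) as dag:
    run_staging = BashOperator(task_id="dbt_run_staging",
                               bash_command="dbt run --select staging")
    test_staging = BashOperator(task_id="dbt_test_staging",
                                bash_command="dbt test --select staging")
    run_marts = BashOperator(task_id="dbt_run_marts",
                             bash_command="dbt run --select marts")
    test_marts = BashOperator(task_id="dbt_test_marts",
                              bash_command="dbt test --select marts")

    run_staging >> test_staging >> run_marts >> test_marts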

Core Capabilities

Capability        | Technologies                  | Key Patterns
Batch Processing  | Spark, dbt, Databricks        | Incremental, partitioning, Delta/Iceberg
Stream Processing | Kafka, Flink, Spark Streaming | Watermarks, exactly-once, windowing
Orchestration     | Airflow, Dagster, Prefect     | DAG design, sensors, task groups
Data Modeling     | dbt, SQL                      | Kimball, Data Vault, SCD
Data Quality      | Great Expectations, dbt tests | Validation suites, freshness

Architecture Patterns

Medallion Architecture (Recommended)

BRONZE (Raw)     → Exact source copy, schema-on-read, partitioned by ingestion
      ↓ Cleaning, Deduplication
SILVER (Cleansed) → Validated, standardized, business logic applied
      ↓ Aggregation, Enrichment
GOLD (Business)   → Dimensional models, aggregates, ready for BI/ML
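
A rough PySpark sketch of the Bronze → Silver → Gold flow above. The paths, column names, and the orders example are assumptions for illustration; the actual layouts live in the reference files:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("medallion-demo").getOrCreate()

# BRONZE: land the raw source as-is, partitioned by ingestion date.
raw = spark.read.json("s3://lake/landing/orders/")          # hypothetical path
(raw.withColumn("ingest_date", F.current_date())
    .write.mode("append").partitionBy("ingest_date")
    .parquet("s3://lake/bronze/orders/"))

# SILVER: clean, deduplicate, standardize, apply business rules.
bronze = spark.read.parquet("s3://lake/bronze/orders/")
silver = (bronze.dropDuplicates(["order_id"])
                .filter(F.col("order_total") >= 0)
                .withColumn("order_date", F.to_date("order_ts")))
silver.write.mode("overwrite").parquet("s3://lake/silver/orders/")

# GOLD: business-level aggregate ready for BI/ML.
gold = silver.groupBy("order_date").agg(F.sum("order_total").alias("daily_revenue"))
gold.write.mode("overwrite").parquet("s3://lake/gold/daily_revenue/")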

Lambda vs Kappa

  • Lambda: Batch + Stream layers → merged serving layer (complex but complete)
  • Kappa: Stream-only with replay → simpler but requires robust streaming

Reference Examples

Full implementation examples in ./references/:

File                          | Description
dbt-project-structure.md      | Complete dbt layout with staging, intermediate, marts
airflow-dag.py                | Production DAG with sensors, task groups, quality checks
spark-streaming.py            | Kafka-to-Delta processor with windowing
great-expectations-suite.json | Comprehensive data quality expectation suite

Anti-Patterns (10 Critical Mistakes)

1. Full Table Refreshes

Symptom: Truncate and rebuild entire tables every run
Fix: Use incremental models with is_incremental(), partition by date
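
In dbt the fix is an incremental model guarded by is_incremental(). The same idea in PySpark might look roughly like the sketch below, rewriting only the affected partition instead of the whole table; the paths, column names, and the ds value are assumptions for illustration:

from pyspark.sql import SparkSession, functions as F

spark = (SparkSession.builder
         # Overwrite only the partitions present in the incoming batch.
         .config("spark.sql.sources.partitionOverwriteMode", "dynamic")
         .getOrCreate())

ds = "2024-06-01"  # hypothetical run date, e.g. passed in by the orchestrator

# Read just the new slice of source data instead of the full history.
new_rows = (spark.read.parquet("s3://lake/silver/orders/")
                 .filter(F.col("order_date") == ds))

# Overwrites only the matching order_date partition; older partitions stay untouched.
(new_rows.write.mode("overwrite")
         .partitionBy("order_date")
         .parquet("s3://lake/gold/orders_daily/"))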

2. Tight Coupling to Source Schemas

Symptom: Pipeline breaks when upstream adds/removes columns
Fix: Explicit source contracts, select only needed columns in staging
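
In dbt terms the fix is a thin staging model that names its columns explicitly. A PySpark equivalent sketch, with a hypothetical source path and column names:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Staging: pull only the columns downstream models actually need, renamed to
# stable names, so upstream column additions or reorderings can't break anything.
stg_customers = (spark.read.parquet("s3://lake/bronze/crm_customers/")
                      .select(
                          F.col("id").alias("customer_id"),
                          F.col("email_addr").alias("email"),
                          F.col("created").cast("timestamp").alias("created_at"),
                      ))
stg_customers.write.mode("overwrite").parquet("s3://lake/silver/stg_customers/")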

3. Monolithic DAGs

Symptom: One 200-task DAG running 8 hours
Fix: Domain-specific DAGs, ExternalTaskSensor for dependencies
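
A sketch of splitting by domain and wiring the pieces together with ExternalTaskSensor; the DAG and task ids are hypothetical:

from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.sensors.external_task import ExternalTaskSensor

# A downstream "marketing" DAG waits for the upstream "sales" DAG to finish,
# instead of both living inside one giant 200-task DAG.
with DAG(
    dag_id="marketing_marts",               # hypothetical
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    wait_for_sales = ExternalTaskSensor(
        task_id="wait_for_sales_marts",
        external_dag_id="sales_marts",      # upstream domain DAG
        external_task_id="dbt_test_marts",  # its final quality gate
        mode="reschedule",                  # free the worker slot while waiting
        timeout=60 * 60,
    )
    build = BashOperator(task_id="dbt_run_marketing",
                         bash_command="dbt run --select marketing")

    wait_for_sales >> build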

4. No Data Quality Gates

Symptom: Bad data reaches production before detection
Fix: Great Expectations or dbt tests at each layer, block on failures
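
Great Expectations suites or dbt tests are the named tools here; the minimal hand-rolled PySpark sketch below only shows the blocking behaviour (hypothetical path and column names). The key point is that a failed check raises, so the orchestrator stops promotion to the next layer:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet("s3://lake/silver/orders/")   # hypothetical layer output

# Minimal gate: in practice these assertions would come from a Great
# Expectations suite or dbt tests, but the failure must be loud either way.
checks = {
    "no_null_keys": df.filter(F.col("order_id").isNull()).count() == 0,
    "non_negative_totals": df.filter(F.col("order_total") < 0).count() == 0,
    "has_rows": df.count() > 0,
}

failed = [name for name, ok in checks.items() if not ok]
if failed:
    # Raising marks the task failed and blocks downstream layers.
    raise ValueError(f"Data quality gate failed: {failed}")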

5. Processing Before Archiving

Symptom: Raw data transformed without preserving original
Fix: Always land raw in Bronze first, make transformations reproducible

6. Hardcoded Dates in Queries

Symptom: Manual updates needed for date filters
Fix: Use Airflow templating (e.g., ds variable) or dynamic date functions
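
A sketch of the Airflow-templating version of the fix: {{ ds }} resolves to the logical run date at execution time, so backfills and reruns pick up the right slice without editing anything by hand. The DAG id, model name, and dbt variable are assumptions:

from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="orders_daily_load",             # hypothetical
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    # bash_command is a templated field: "{{ ds }}" renders per run (e.g. 2024-06-01).
    load_partition = BashOperator(
        task_id="load_orders_partition",
        bash_command=(
            "dbt run --select orders_daily "
            "--vars '{run_date: \"{{ ds }}\"}'"
        ),
    )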

7. Missing Watermarks in Streaming

Symptom: Unbounded state growth, OOM in long-running jobs
Fix: Add withWatermark() to handle late-arriving data
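
A Structured Streaming sketch of the fix; the broker, topic, and window/watermark sizes are assumptions for illustration:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("events-windowed").getOrCreate()

# Requires the spark-sql-kafka connector package on the classpath.
events = (spark.readStream
               .format("kafka")
               .option("kafka.bootstrap.servers", "broker:9092")   # hypothetical broker
               .option("subscribe", "clickstream")                  # hypothetical topic
               .load()
               .selectExpr("CAST(value AS STRING) AS payload",
                           "timestamp AS event_time"))

# The watermark bounds how long Spark keeps window state around for late events,
# preventing unbounded state growth (and eventual OOM) in long-running jobs.
counts = (events
          .withWatermark("event_time", "15 minutes")
          .groupBy(F.window("event_time", "5 minutes"))
          .count())

query = (counts.writeStream
               .outputMode("append")
               .format("console")
               .start())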

8. No Retry/Backoff Strategy

Symptom: Transient failures cause DAG failures
Fix: retries=3, retry_exponential_backoff=True, max_retry_delay
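
The Airflow knobs from the fix, applied as DAG-level default_args; the values and DAG id are illustrative:

from datetime import datetime, timedelta
from airflow import DAG

# Applied to every task in the DAG: transient failures (a flaky API, a brief
# warehouse lock) get retried with exponentially growing, capped delays.
default_args = {
    "retries": 3,
    "retry_delay": timedelta(minutes=2),
    "retry_exponential_backoff": True,
    "max_retry_delay": timedelta(minutes=30),
}

with DAG(
    dag_id="orders_ingest",                 # hypothetical
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    default_args=default_args,
    catchup=False,
) as dag:
    # tasks defined here inherit the retry policy above
    ...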

9. Undocumented Data Lineage

Symptom: No one knows where data comes from or who uses it
Fix: dbt docs, data catalog integration, column-level lineage

10. Testing Only in Production

Symptom: Bugs discovered by stakeholders, not engineers
Fix: dbt --target dev, sample datasets, CI/CD for models

Quality Checklist

Pipeline Design:

  • Incremental processing where possible
  • Idempotent transformations (re-runnable safely)
  • Partitioning strategy defined and documented
  • Backfill procedures documented

Data Quality:

  • Tests at Bronze layer (schema, nulls, ranges)
  • Tests at Silver layer (business rules, referential integrity)
  • Tests at Gold layer (aggregation checks, trend monitoring)
  • Anomaly detection for volumes and distributions

Orchestration:

  • Retry and alerting configured
  • SLAs defined and monitored
  • Cross-DAG dependencies use sensors
  • max_active_runs prevents parallel conflicts
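
One way the orchestration items above might look on a single DAG; the callback body, SLA value, and ids are assumptions for illustration:

from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.bash import BashOperator


def notify_on_failure(context):
    # Hook point for Slack/PagerDuty/etc.; 'context' carries task and run metadata.
    print(f"Task failed: {context['task_instance'].task_id}")


with DAG(
    dag_id="finance_marts",                 # hypothetical
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
    max_active_runs=1,                      # no overlapping runs fighting over the same partitions
    default_args={
        "retries": 3,
        "retry_delay": timedelta(minutes=5),
        "on_failure_callback": notify_on_failure,
        "sla": timedelta(hours=2),          # flag tasks that run past their SLA
    },
) as dag:
    build = BashOperator(task_id="dbt_run_finance",
                         bash_command="dbt run --select finance")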

Operations:

  • Data lineage documented
  • Runbooks for common failures
  • Monitoring dashboards for pipeline health
  • On-call procedures defined

Validation Script

Run ./scripts/validate-pipeline.sh to check:

  • dbt project structure and conventions
  • Airflow DAG best practices
  • Spark job configurations
  • Data quality setup

External Resources

GitHub Repository

majiayu000/claude-skill-registry
Path: skills/data-pipeline-engineer

Related Skills

content-collections

Meta

This skill provides a production-tested setup for Content Collections, a TypeScript-first tool that transforms Markdown/MDX files into type-safe data collections with Zod validation. Use it when building blogs, documentation sites, or content-heavy Vite + React applications to ensure type safety and automatic content validation. It covers everything from Vite plugin configuration and MDX compilation to deployment optimization and schema validation.

View skill

creating-opencode-plugins

Meta

This skill provides the structure and API specifications for creating OpenCode plugins that hook into 25+ event types like commands, files, and LSP operations. It offers implementation patterns for JavaScript/TypeScript modules that intercept and extend the AI assistant's lifecycle. Use it when you need to build event-driven plugins for monitoring, custom handling, or extending OpenCode's capabilities.

View skill

evaluating-llms-harness

Testing

This Claude Skill runs the lm-evaluation-harness to benchmark LLMs across 60+ standardized academic tasks like MMLU and GSM8K. It's designed for developers to compare model quality, track training progress, or report academic results. The tool supports various backends including HuggingFace and vLLM models.

View skill

sglang

Meta

SGLang is a high-performance LLM serving framework that specializes in fast, structured generation for JSON, regex, and agentic workflows using its RadixAttention prefix caching. It delivers significantly faster inference, especially for tasks with repeated prefixes, making it ideal for complex, structured outputs and multi-turn conversations. Choose SGLang over alternatives like vLLM when you need constrained decoding or are building applications with extensive prefix sharing.

View skill