data-refresh-eval

majiayu000

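About

This skill fetches customer support conversations from Front to build and refresh an evaluation dataset. Developers can then run routing evals against that dataset and analyze the quality of agent responses. Use it to keep test data fresh and to evaluate the support system's performance on an ongoing basis.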

Quick Install

Claude Code

Plugin command (recommended)
/plugin add https://github.com/majiayu000/claude-skill-registry
Git clone (alternative)
git clone https://github.com/majiayu000/claude-skill-registry.git ~/.claude/skills/data-refresh-eval

Copy and paste this command into Claude Code to install the skill.

Documentation

Data Refresh & Eval Skill

Workflow for keeping the eval dataset fresh and running quality checks on agent responses.

Quick Start

cd ~/Code/skillrecordings/support/packages/cli

# Refresh dataset from Front (last 30 days, 200 responses max)
# Uses GNU date syntax; on macOS/BSD use: date -v-30d +%Y-%m-%d
bun src/index.ts dataset build --since "$(date -d "30 days ago" +%Y-%m-%d)" --limit 200 --output data/eval-dataset.json

# Run routing eval
bun src/index.ts eval routing data/eval-dataset.json

Dataset Commands

Build fresh dataset

# Recent data (recommended for ongoing work)
bun src/index.ts dataset build --since 2025-01-01 --limit 200 --output data/eval-dataset.json

# App-specific
bun src/index.ts dataset build --app total-typescript --limit 100 --output data/tt-dataset.json

# Include conversation history for context
bun src/index.ts dataset build --since 2025-01-01 --include-history --output data/dataset-with-history.json

# Only labeled responses (good/bad)
bun src/index.ts dataset build --labeled-only --output data/labeled-only.json

Convert to evalite format

bun src/index.ts dataset to-evalite -i data/eval-dataset.json -o data/evalite-format.json

Running Evals

Routing eval (default thresholds)

bun src/index.ts eval routing data/eval-dataset.json

Custom thresholds

bun src/index.ts eval routing data/eval-dataset.json \
  --min-precision 0.95 \
  --min-recall 0.98 \
  --max-fp-rate 0.02 \
  --max-fn-rate 0.01
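
The four thresholds correspond to confusion-matrix metrics. The sketch below shows one way such a gate could be computed. It assumes the textbook definitions, which the source does not spell out, and `Counts`, `Thresholds`, `computeMetrics`, and `passes` are hypothetical names rather than part of the CLI.

```typescript
// Hypothetical sketch: textbook confusion-matrix metrics and a threshold gate.
// The CLI's actual definitions are not documented here and may differ.
interface Counts {
  tp: number; // true positives
  fp: number; // false positives
  tn: number; // true negatives
  fn: number; // false negatives
}

interface Metrics {
  precision: number;
  recall: number;
  fpRate: number;
  fnRate: number;
}

interface Thresholds {
  minPrecision: number;
  minRecall: number;
  maxFpRate: number;
  maxFnRate: number;
}

// Divide, treating an empty denominator as 0 instead of NaN.
const ratio = (num: number, den: number): number => (den === 0 ? 0 : num / den);

function computeMetrics(c: Counts): Metrics {
  return {
    precision: ratio(c.tp, c.tp + c.fp),
    recall: ratio(c.tp, c.tp + c.fn),
    fpRate: ratio(c.fp, c.fp + c.tn),
    fnRate: ratio(c.fn, c.fn + c.tp),
  };
}

// True only when every metric satisfies its threshold.
function passes(m: Metrics, t: Thresholds): boolean {
  return (
    m.precision >= t.minPrecision &&
    m.recall >= t.minRecall &&
    m.fpRate <= t.maxFpRate &&
    m.fnRate <= t.maxFnRate
  );
}
```

Under these definitions `fnRate` equals `1 - recall`, so a maximum false-negative rate of 0.01 is stricter than a minimum recall of 0.98. The CLI may compute its rates differently (for example over all examples), so check its source before relying on this reading.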

JSON output for CI/automation

bun src/index.ts eval routing data/eval-dataset.json --json

Response Analysis

Find bad responses for debugging

# List responses rated "bad"
bun src/index.ts responses list --rating bad

# Get details with conversation context
bun src/index.ts responses get <actionId> --context

# Export bad responses for analysis
bun src/index.ts responses export --rating bad -o bad-responses.json

Analyze unrated responses

bun src/index.ts responses list --rating unrated --limit 50

Recommended Workflow

Daily data refresh

cd ~/Code/skillrecordings/support/packages/cli

# 1. Pull fresh data (GNU date syntax; on macOS/BSD use: date -v-7d +%Y-%m-%d)
bun src/index.ts dataset build --since "$(date -d "7 days ago" +%Y-%m-%d)" --limit 100 --output data/eval-dataset.json

# 2. Check dataset stats
jq 'length' data/eval-dataset.json

# 3. Run eval
bun src/index.ts eval routing data/eval-dataset.json

# 4. Check for failures
bun src/index.ts responses list --rating bad --limit 10
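
Step 2 reports only the number of entries. For a finer breakdown, a small script can count entries per app and per label. This is a sketch, assuming the dataset file is a top-level JSON array of the eval points described under Dataset Schema; `summarize` and `PointSlice` are hypothetical and not part of the CLI.

```typescript
// Hypothetical sketch: per-app and per-label counts for a dataset.
// Only the two fields used here are typed; a real point has more.
interface PointSlice {
  app: string;
  label?: "good" | "bad" | null; // missing or null means unlabeled
}

interface Summary {
  total: number;
  byApp: Record<string, number>;
  byLabel: { good: number; bad: number; unlabeled: number };
}

function summarize(points: PointSlice[]): Summary {
  const summary: Summary = {
    total: points.length,
    byApp: {},
    byLabel: { good: 0, bad: 0, unlabeled: 0 },
  };
  for (const p of points) {
    summary.byApp[p.app] = (summary.byApp[p.app] ?? 0) + 1;
    summary.byLabel[p.label ?? "unlabeled"] += 1;
  }
  return summary;
}

// Usage under Bun (assumes the file is a JSON array):
//   const points = await Bun.file("data/eval-dataset.json").json();
//   console.log(summarize(points));
```

A high unlabeled count is a hint that the eval is running mostly on responses nobody has reviewed yet.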

Pre-deploy validation

# 1. Build comprehensive dataset
bun src/index.ts dataset build --since 2025-01-01 --limit 500 --output data/full-dataset.json

# 2. Run eval with strict thresholds
bun src/index.ts eval routing data/full-dataset.json --min-precision 0.95 --min-recall 0.98 --json

# 3. Check exit code
echo "Exit code: $?"

Dataset Schema

Each eval point includes:

  • id - Action ID
  • app - App slug (total-typescript, aihero, etc.)
  • conversationId - Front conversation ID
  • customerEmail - Customer email (if available)
  • triggerMessage - The inbound message that triggered the response
    • subject, body, timestamp
  • agentResponse - The agent's drafted response
    • text, category, timestamp
  • label - "good" | "bad" | undefined
  • labeledBy - Who approved/rejected
  • conversationHistory - (optional) Full message history
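
The field list above maps roughly onto the TypeScript shape below. It is a sketch inferred only from that list: the field types (string IDs, string or numeric timestamps, the element type of `conversationHistory`) are assumptions, and `EvalPoint` and `isEvalPoint` are hypothetical names, not the CLI's own types.

```typescript
// Hypothetical sketch of an eval point, inferred from the field list.
// All concrete types below are assumptions.
interface TriggerMessage {
  subject: string;
  body: string;
  timestamp: string | number; // assumed: ISO string or epoch value
}

interface AgentResponse {
  text: string;
  category: string;
  timestamp: string | number;
}

interface EvalPoint {
  id: string; // action ID
  app: string; // app slug, e.g. "total-typescript"
  conversationId: string; // Front conversation ID
  customerEmail?: string;
  triggerMessage: TriggerMessage;
  agentResponse: AgentResponse;
  label?: "good" | "bad" | null;
  labeledBy?: string;
  conversationHistory?: unknown[]; // element shape not documented
}

function isRecord(v: unknown): v is Record<string, unknown> {
  return typeof v === "object" && v !== null;
}

// Loose runtime check covering the fields an eval is most likely to need.
function isEvalPoint(v: unknown): v is EvalPoint {
  if (!isRecord(v)) return false;
  const { id, app, conversationId, triggerMessage, agentResponse, label } = v;
  return (
    typeof id === "string" &&
    typeof app === "string" &&
    typeof conversationId === "string" &&
    isRecord(triggerMessage) &&
    typeof triggerMessage.body === "string" &&
    isRecord(agentResponse) &&
    typeof agentResponse.text === "string" &&
    (label === undefined || label === null || label === "good" || label === "bad")
  );
}
```

A guard like this can filter malformed entries before they reach an eval run, for example with `points.filter(isEvalPoint)`.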

Environment

Required in .env.local:

FRONT_API_TOKEN=          # Front API access
DATABASE_URL=             # Database connection

Troubleshooting

"FRONT_API_TOKEN environment variable required"

# set -a exports the sourced variables so child processes such as bun can see them
set -a; source apps/front/.env.local; set +a
# or set in .env.local at repo root

Dataset building slowly

The Front API is rate limited, so large builds take a while. Lower --limit to fetch fewer responses per run.

No labeled data

Labels come from human-in-the-loop (HITL) approvals and rejections, so new responses stay unlabeled until someone reviews them.

GitHub Repository

majiayu000/claude-skill-registry
Path: skills/data-refresh-eval
