data-refresh-eval
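About
This skill pulls customer support conversations from Front to build and refresh evaluation datasets. Developers can then run routing evals and analyze the quality of agent responses against those datasets. Use it to keep test data fresh and to evaluate support-system performance continuously.
Quick Install
Claude Code
Recommended:
/plugin add https://github.com/majiayu000/claude-skill-registry
Or clone manually:
git clone https://github.com/majiayu000/claude-skill-registry.git ~/.claude/skills/data-refresh-eval
Copy and paste the command into Claude Code to install the skill.
Documentation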
Data Refresh & Eval Skill
Workflow for keeping the eval dataset fresh and running quality checks on agent responses.
Quick Start
cd ~/Code/skillrecordings/support/packages/cli
# Refresh dataset from Front (last 30 days, 200 responses max)
# Note: date -d "30 days ago" is GNU date syntax; on macOS/BSD use $(date -v-30d +%Y-%m-%d)
bun src/index.ts dataset build --since $(date -d "30 days ago" +%Y-%m-%d) --limit 200 --output data/eval-dataset.json
# Run routing eval
bun src/index.ts eval routing data/eval-dataset.json
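The `--since` value in the Quick Start is produced with GNU `date`. Where that is unavailable, the same YYYY-MM-DD string can be computed in TypeScript. This is a minimal sketch: `sinceDate` is a hypothetical helper, not part of the CLI, and it works in UTC, whereas `date` uses the local time zone.

```typescript
// Hypothetical helper (not part of the CLI): returns the date `daysAgo` days
// before `now`, formatted as YYYY-MM-DD for use with --since.
function sinceDate(daysAgo: number, now: Date = new Date()): string {
  const msPerDay = 24 * 60 * 60 * 1000;
  const then = new Date(now.getTime() - daysAgo * msPerDay);
  // toISOString() always reports UTC, e.g. "2025-01-01T12:00:00.000Z";
  // the first 10 characters are the calendar date.
  return then.toISOString().slice(0, 10);
}

// Print the value to pass as --since for a 30-day window.
console.log(sinceDate(30));
```

The printed value can be pasted into the `dataset build --since` command in place of the `$(date ...)` substitution.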
Dataset Commands
Build fresh dataset
# Recent data (recommended for ongoing work)
bun src/index.ts dataset build --since 2025-01-01 --limit 200 --output data/eval-dataset.json
# App-specific
bun src/index.ts dataset build --app total-typescript --limit 100 --output data/tt-dataset.json
# Include conversation history for context
bun src/index.ts dataset build --since 2025-01-01 --include-history --output data/dataset-with-history.json
# Only labeled responses (good/bad)
bun src/index.ts dataset build --labeled-only --output data/labeled-only.json
Convert to evalite format
bun src/index.ts dataset to-evalite -i data/eval-dataset.json -o data/evalite-format.json
Running Evals
Routing eval (default thresholds)
bun src/index.ts eval routing data/eval-dataset.json
Custom thresholds
bun src/index.ts eval routing data/eval-dataset.json \
--min-precision 0.95 \
--min-recall 0.98 \
--max-fp-rate 0.02 \
--max-fn-rate 0.01
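To make the four thresholds concrete, here is a sketch of how such metrics are commonly computed from confusion counts and compared against limits. It uses the textbook definitions; this document does not show how the CLI actually derives its numbers, so the type names, function names, and formulas below are assumptions.

```typescript
// Confusion counts for a binary routing decision (assumed shape).
interface Counts { tp: number; fp: number; tn: number; fn: number; }

// Mirrors the four CLI flags; the field names are illustrative only.
interface Thresholds {
  minPrecision: number;
  minRecall: number;
  maxFpRate: number;
  maxFnRate: number;
}

// Avoid division by zero when a class is empty.
function ratio(num: number, den: number): number {
  return den === 0 ? 0 : num / den;
}

// Textbook definitions; the CLI may define these rates differently.
function metrics(c: Counts) {
  return {
    precision: ratio(c.tp, c.tp + c.fp),
    recall: ratio(c.tp, c.tp + c.fn),
    fpRate: ratio(c.fp, c.fp + c.tn),
    fnRate: ratio(c.fn, c.fn + c.tp),
  };
}

// True only when every metric is within its threshold.
function passes(c: Counts, t: Thresholds): boolean {
  const m = metrics(c);
  return (
    m.precision >= t.minPrecision &&
    m.recall >= t.minRecall &&
    m.fpRate <= t.maxFpRate &&
    m.fnRate <= t.maxFnRate
  );
}

const strict: Thresholds = { minPrecision: 0.95, minRecall: 0.98, maxFpRate: 0.02, maxFnRate: 0.01 };
console.log(passes({ tp: 100, fp: 1, tn: 99, fn: 0 }, strict));
```

Under these definitions the false-negative rate equals 1 minus recall, so `--max-fn-rate 0.01` would be the stricter of the two recall-related limits in the example above. That only holds if the CLI uses the same definitions.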
JSON output for CI/automation
bun src/index.ts eval routing data/eval-dataset.json --json
Response Analysis
Find bad responses for debugging
# List responses rated "bad"
bun src/index.ts responses list --rating bad
# Get details with conversation context
bun src/index.ts responses get <actionId> --context
# Export bad responses for analysis
bun src/index.ts responses export --rating bad -o bad-responses.json
Analyze unrated responses
bun src/index.ts responses list --rating unrated --limit 50
Recommended Workflow
Daily data refresh
cd ~/Code/skillrecordings/support/packages/cli
# 1. Pull fresh data (GNU date syntax; on macOS/BSD use $(date -v-7d +%Y-%m-%d))
bun src/index.ts dataset build --since $(date -d "7 days ago" +%Y-%m-%d) --limit 100 --output data/eval-dataset.json
# 2. Check dataset stats
jq 'length' data/eval-dataset.json
# 3. Run eval
bun src/index.ts eval routing data/eval-dataset.json
# 4. Check for failures
bun src/index.ts responses list --rating bad --limit 10
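The stats check in step 2 only reports the number of entries. A slightly richer summary, counting eval points per label and per app, could look like the sketch below. It assumes the dataset file is a JSON array of objects carrying the `app` and `label` fields described under Dataset Schema. `summarize` is a hypothetical helper, and the sample array stands in for the parsed file contents.

```typescript
// Only the fields this summary needs (assumed from the Dataset Schema section).
interface PointSummaryInput {
  app: string;
  label?: "good" | "bad";
}

// Hypothetical helper: counts points per label and per app.
function summarize(points: PointSummaryInput[]) {
  const byLabel = { good: 0, bad: 0, unlabeled: 0 };
  const byApp: Record<string, number> = {};
  for (const p of points) {
    // Points without a label are counted as "unlabeled".
    byLabel[p.label ?? "unlabeled"] += 1;
    byApp[p.app] = (byApp[p.app] ?? 0) + 1;
  }
  return { total: points.length, byLabel, byApp };
}

// Illustrative sample; in practice this would be the parsed dataset file.
const sample: PointSummaryInput[] = [
  { app: "total-typescript", label: "good" },
  { app: "total-typescript" },
  { app: "aihero", label: "bad" },
];
console.log(summarize(sample));
```

A low labeled count here is expected for fresh data, since new responses start unlabeled.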
Pre-deploy validation
# 1. Build comprehensive dataset
bun src/index.ts dataset build --since 2025-01-01 --limit 500 --output data/full-dataset.json
# 2. Run eval with strict thresholds
bun src/index.ts eval routing data/full-dataset.json --min-precision 0.95 --min-recall 0.98 --json
# 3. Check exit code
echo "Exit code: $?"
Dataset Schema
Each eval point includes:
id - Action ID
app - App slug (total-typescript, aihero, etc.)
conversationId - Front conversation ID
customerEmail - Customer email (if available)
triggerMessage - The inbound message that triggered the response
  subject, body, timestamp
agentResponse - The agent's drafted response
  text, category, timestamp
label - "good" | "bad" | undefined
labeledBy - Who approved/rejected
conversationHistory - (optional) Full message history
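The field list above can be written down as a TypeScript type, which is handy when post-processing the dataset file. This is a sketch inferred from the field names only. The primitive types (for example whether `timestamp` is a number or a string, or what a history entry looks like) are assumptions, and `isEvalPoint` is a hypothetical guard, not something the CLI exports.

```typescript
// Types inferred from the field list; primitive types are assumptions.
interface TriggerMessage { subject: string; body: string; timestamp: number | string; }
interface AgentResponse { text: string; category: string; timestamp: number | string; }

interface EvalPoint {
  id: string;                      // Action ID
  app: string;                     // App slug
  conversationId: string;          // Front conversation ID
  customerEmail?: string;          // present only if available
  triggerMessage: TriggerMessage;
  agentResponse: AgentResponse;
  label?: "good" | "bad";          // undefined until a human labels it
  labeledBy?: string;
  conversationHistory?: unknown[]; // entry shape is not documented here
}

// Hypothetical shallow guard: checks the identifying fields, the label value,
// and that the two nested objects exist. It does not validate nested fields.
function isEvalPoint(value: unknown): value is EvalPoint {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  const hasIds =
    typeof v["id"] === "string" &&
    typeof v["app"] === "string" &&
    typeof v["conversationId"] === "string";
  const labelOk = v["label"] === undefined || v["label"] === "good" || v["label"] === "bad";
  const nestedOk =
    typeof v["triggerMessage"] === "object" && v["triggerMessage"] !== null &&
    typeof v["agentResponse"] === "object" && v["agentResponse"] !== null;
  return hasIds && labelOk && nestedOk;
}
```

Filtering a parsed dataset with `isEvalPoint` before analysis surfaces malformed entries early without depending on the CLI's internal types.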
Environment
Required in .env.local:
FRONT_API_TOKEN= # Front API access
DATABASE_URL= # Database connection
Troubleshooting
"FRONT_API_TOKEN environment variable required"
source apps/front/.env.local
# or set in .env.local at repo root
Dataset building slowly
The Front API is rate limited, so large builds take a while. Use --limit to cap the number of responses fetched.
No labeled data
Labels come from HITL approvals/rejections. New responses start unlabeled.
