evaluate-agent-framework
关于
This skill evaluates open-source AI agent frameworks to determine if they are worth your team's investment. It analyzes community health, architecture, and governance risks, outputting a clear INVEST/EVALUATE-FURTHER/CONTRIBUTE-CAUTIOUSLY/AVOID recommendation. Use it before committing engineering resources to a new framework to make data-driven adoption decisions.
快速安装
Claude Code
推荐npx skills add pjt222/agent-almanac -a claude-code/plugin add https://github.com/pjt222/agent-almanacgit clone https://github.com/pjt222/agent-almanac.git ~/.claude/skills/evaluate-agent-framework在 Claude Code 中复制并粘贴此命令以安装该技能
技能文档
Evaluate Agent Framework
Structured check of open-source agent framework invest-readiness. New value sits in Steps 2-3: count community health by contribution survival rate; measure supersession risk — biggest reason external engineering effort wastes. Final tier (INVEST / EVALUATE-FURTHER / CONTRIBUTE-CAUTIOUSLY / AVOID) sets resource spend before commit dev cycles.
When Use
- Picking whether to adopt agent framework for prod
- Measuring dep risk on framework project leans on
- Deciding whether to give engineering effort to external project
- Compare competing frameworks for build-vs-adopt pick
- Re-check framework after big release, governance shift, or buyout
Inputs
- Required:
framework_url— GitHub URL of framework repo - Optional:
comparison_frameworks— list of other framework URLs to benchmarkuse_case— planned use case for arch alignment check (e.g., "multi-agent orchestration", "tool-use pipelines")contribution_budget— planned engineering hours, for tier calibration
Steps
Step 1: Gather Framework Census
Grab base data on project size, activity, landscape place before deep dig.
- Fetch and read
README.md,CONTRIBUTING.md,LICENSE, and any arch docs (docs/,ARCHITECTURE.md) - Grab counts:
- Stars, forks, open issues, open PRs:
gh repo view <repo> --json stargazerCount,forkCount,issues,pullRequests - Dependent repos: check GitHub's "Used by" count or
gh api repos/<owner>/<repo>/dependents - Release cadence:
gh release list --limit 10— note how often and if releases follow semver
- Stars, forks, open issues, open PRs:
- Count bus factor: find top 5 contributors by commit count over last 12 months. Top contributor do >60% of commits? Bus factor critically low
- Map landscape place:
- Pioneer: first mover, defines category (high sway, high supersession risk to followers)
- Fast-follower: launched within 6 months of pioneer, iterating on concept
- Late entrant: arrived after category stable, competing on features or governance
- If
comparison_frameworksgiven, grab same counts for each
Got: Census table with stars, forks, dependents, release cadence, bus factor, landscape place for target (and compares if given).
If fail: Repo private or API-rate-limited? Fall back to manual README read. Counts not there (e.g., self-hosted GitLab)? Note gap and go with qualitative check.
Step 2: Assess Community Health
Count whether project welcomes, supports, keeps external contributors.
- Count external contribution survival rate:
- Pull last 50 closed PRs:
gh pr list --state closed --limit 50 --json author,mergedAt,closedAt,labels - Sort each PR author as internal (org member) or external
- Compute:
survival_rate = merged_external_PRs / total_external_PRs - Healthy threshold: >50% survival rate; concerning: <30%
- Pull last 50 closed PRs:
- Measure response:
- Issue first-response time: median from issue open to first maintainer comment
- PR merge lag: median from PR open to merge for external PRs
- Healthy: <7 days first-response, <30 days merge; concerning: >30 days first-response
- Check contributor spread:
- External/internal contributor ratio over last 6 months
- Count unique external contributors with >=2 merged PRs (repeat contributors signal healthy ecosystem)
- Check governance artifacts:
CONTRIBUTING.mdexists and is actionable (not just "submit a PR")CODE_OF_CONDUCT.mdexists- Governance docs describe decision process
- Issue/PR templates guide contributors
Got: Community health scorecard with survival rate, response times, spread ratio, governance artifact checklist.
If fail: PR data thin (new project with <20 closed PRs)? Note sample-size limit and weight other signals more. Project uses non-GitHub platform? Adapt queries to that platform API.
Step 3: Calculate Supersession Risk
Figure how likely external contributions get wiped by internal dev — single biggest risk for framework adopters and contributors.
- Sample last 50-100 merged external PRs (or all if fewer)
- For each merged external PR, check if contributed code was later:
- Reverted: explicit revert commit ref-ing PR
- Rewritten: same file/module big change within 90 days by internal contributor
- Obsoleted: feature removed or replaced in later release
- Count:
supersession_rate = (reverted + rewritten + obsoleted) / total_merged_external - Map published roadmap (if out) against areas where external contributors active:
- High overlap = high supersession risk (internals will build over external work)
- Low overlap = lower supersession risk (externals fill gaps internals won't)
- Check for "contribution traps": areas look contribution-friendly but scheduled for internal rewrite
- Benchmark: NemoClaw study showed 71% external PRs superseded within 6 months — use as calibration point
Got: Supersession rate as percent, with breakdown by type (reverted/rewritten/obsoleted). Roadmap overlap check.
If fail: Commit history shallow or squash-merged (losing author info)? Estimate supersession by compare external PR file paths vs files changed in later releases. Note lower confidence.
Step 4: Evaluate Architecture Alignment
Check whether framework arch supports your use case with no heavy lock-in.
- Map extension points:
- Plugin/extension API: does framework expose documented plugin interface?
- Config surface: can behavior be tuned without fork?
- Hook/callback system: can intercept and change framework behavior at key points?
- Check lock-in risk:
- Rewrite cost: estimate engineering effort to move away (days/weeks/months)
- Data portability: can data/state export in standard formats?
- Standard compliance: does framework use open standards (agentskills.io, MCP, A2A) or custom protocols?
- Check API stability:
- Count breaking changes per major release (CHANGELOG, migration guides)
- Check deprecation policy (heads-up before removal)
- Check semver (breaking changes only in major versions)
- Check fit with your specific use case:
- If
use_casegiven, check whether framework arch naturally supports it - Spot any arch mismatch that would need workarounds
- If
- Check interop:
- agentskills.io compat (skill model fit)
- MCP support (tool integration)
- A2A protocol support (agent-to-agent talk)
Got: Architecture fit report with extension point list, lock-in risk rate (low/medium/high), API stability score, use-case fit check.
If fail: Arch docs thin? Derive check from code shape and public API surface. Framework too young for stability history? Note this and weight governance signals more.
Step 5: Assess Governance and Sustainability
Check whether project governance model supports long-term life and fair treat of external contributors.
- Sort governance model:
- BDFL (Benevolent Dictator for Life): one decider — fast calls, bus factor risk
- Committee/Core team: spread decision — slower but tougher
- Foundation-backed: formal governance (Apache, Linux Foundation, CNCF) — most durable
- Corporate-controlled: one company drives dev — watch for rug-pull risk
- Check funding and sustainability:
- Funding sources: VC-backed, corporate-sponsored, grants, community-funded, unfunded
- Full-time maintainer count: >=2 is healthy; 0 is red flag
- Revenue model (if any): how does project keep going?
- Check contributor protections:
- License type: permissive (MIT, Apache-2.0) vs copyleft (GPL) vs custom
- CLA rules: does signing CLA shift rights in way that hurt contributors?
- Contributor credit: external contributors credited in releases, changelogs, docs?
- Check security stance:
- Security disclosure policy (
SECURITY.mdor same) - Median time from CVE disclose to patch release
- Dep update patterns (Dependabot, Renovate, manual)
- Security disclosure policy (
- Check trajectory:
- Governance model shifting (e.g., moving toward foundation)?
- Recent leadership change, buyout, or relicense?
- Public conflicts between maintainers and contributors?
Got: Governance check with model class, durability rate (durable/at-risk/critical), contributor protection check, security stance summary.
If fail: Governance info not logged? Take the absence itself as yellow flag. Check for hidden governance by who merges PRs, who closes issues, who makes release picks.
Step 6: Classify Investment Readiness
Fold all finds into four-tier sort with specific reasons and actionable advice.
- Score each dimension (1-5 scale):
- Community health: survival rate, response, spread
- Supersession risk: rate, roadmap overlap, contribution traps (invert: lower is better)
- Architecture fit: extension points, lock-in, stability, use-case fit
- Governance durability: model, funding, protections, security
- Apply tier thresholds:
- INVEST (all dimensions >=4): Healthy community, low supersession (<20%), fit arch, durable governance. Safe to adopt and give engineering effort.
- EVALUATE-FURTHER (mixed, no dimension <2): Mixed signals need specific follow-ups. Log what needs clarify and set re-eval date.
- CONTRIBUTE-CAUTIOUSLY (any dimension 2, none <2): High supersession (>40%) or governance worries. Limit contributions to explicit-requested work, maintainer-approved scope, or plugin/extension dev decoupled from core.
- AVOID (any dimension 1): Critical red flags — abandoned project, hostile to externals (survival rate <15%), incompatible license, or soon rug-pull signs. Do not give engineering effort.
- Write tier report:
- Lead with tier and one-line reason
- Sum each dimension score with key evidence
- If
contribution_budgetgiven, advise how to split those hours given tier - For EVALUATE-FURTHER, list specific questions that need answers and set timeline
- For CONTRIBUTE-CAUTIOUSLY, say which contribution types safe (plugins, docs, tests) vs risky (core features)
- If
comparison_frameworkschecked, make compare matrix ranking all frameworks
Got: Tier report with label, dimension scores, evidence sum, actionable advice tuned to invest context.
If fail: Data gaps block confident sort? Default to EVALUATE-FURTHER with clear log of what data missing and how to get it. Never default to INVEST when unsure.
Validation
- Census data grabbed: stars, forks, dependents, release cadence, bus factor, landscape place
- Community health counted: survival rate, response times, contributor spread, governance artifacts
- Supersession risk counted with breakdown by type (reverted/rewritten/obsoleted)
- Architecture fit checked: extension points, lock-in risk, API stability, use-case fit
- Governance checked: model, funding, contributor protections, security stance
- Tier made: one of INVEST / EVALUATE-FURTHER / CONTRIBUTE-CAUTIOUSLY / AVOID
- Each dimension score backed with specific evidence from analysis
- Advice actionable and tuned to contribution budget (if given)
- Data gaps and confidence limits clearly logged
Pitfalls
- Mix popularity with health: High stars but low contributor spread mean single fail point. 50k-star project with one maintainer is less healthy than 2k-star project with 15 active contributors.
- Ignore supersession risk: Most common reason external contributions fail. Welcoming community means nothing if internal dev keep overwriting external work.
- Over-weight arch, skip governance: Pretty-designed framework can still fail if governance model is not durable or hostile to externals.
- Treat EVALUATE-FURTHER as AVOID: Mixed signals need dig, not reject. Set concrete re-eval date and list specific questions to answer.
- Snapshot bias: All counts are point-in-time. Declining project with great current counts is worse than improving project with meh current counts. Always check trend over 6-12 months.
- CLA complacency: Some CLAs shift copyright to project owner, meaning your contributions become their property. Read CLA text, not just checkbox.
- Anchor on single framework: With no compare frameworks, any project looks either great or awful. Always benchmark vs at least one alternative, even informal.
See Also
- polish-claw-project — contribution flow this check feeds
- review-software-architecture — used in Step 4 for arch check
- forage-solutions — other framework find for compare
- search-prior-art — landscape map and prior work check
- security-audit-codebase — security stance check from Step 5
- assess-ip-landscape — license and IP risk check
GitHub 仓库
相关推荐技能
executing-plans
设计该Skill用于当开发者提供完整实施计划时,以受控批次方式执行代码实现。它会先审阅计划并提出疑问,然后分批次执行任务(默认每批3个任务),并在批次间暂停等待审查。关键特性包括分批次执行、内置检查点和架构师审查机制,确保复杂系统实现的可控性。
requesting-code-review
设计该Skill可在完成任务、实现主要功能或合并代码前自动调度代码审查子代理,确保实现符合需求和计划。它支持通过指定git SHA范围进行精准的代码变更审查,帮助开发者在关键节点及时发现潜在问题。核心原则是"早审查、勤审查",适用于开发流程的各个关键阶段。
connect-mcp-server
设计这个Skill指导开发者如何将MCP服务器连接到Claude Code,支持HTTP、stdio和SSE三种传输协议。它涵盖了从安装配置到认证安全的完整流程,适用于集成GitHub、Notion、数据库等外部服务。当开发者需要添加集成、配置外部工具或提及MCP相关功能时,这个Skill能提供实用的操作指南。
web-cli-teleport
设计该Skill帮助开发者根据任务特性选择Claude Code的Web或CLI界面,并指导如何在两种环境间无缝迁移会话。它能分析任务复杂度、迭代需求等要素,推荐最优工作界面和工作流。关键特性包括会话状态管理、环境切换指导和上下文优化建议。
