evaluate-agent-framework

pjt222
Updated 2 days ago

About

This skill evaluates open-source agent frameworks for investment readiness by analyzing community health, supersession risk, architecture, and governance. It outputs a four-tier classification (INVEST, EVALUATE-FURTHER, CONTRIBUTE-CAUTIOUSLY, AVOID) to guide engineering resource allocation. Use it to perform structured due diligence before committing to a framework.

Quick Install

Claude Code

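Primary method (recommended)
npx skills add pjt222/agent-almanac -a claude-code
Plugin command (alternative method)
/plugin add https://github.com/pjt222/agent-almanac
Git clone (alternative method)
git clone https://github.com/pjt222/agent-almanac.git ~/.claude/skills/evaluate-agent-framework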
git clone https://github.com/pjt222/agent-almanac.git ~/.claude/skills/evaluate-agent-framework

Copy and paste this command into Claude Code to install the skill

Skill Documentation

Evaluate Agent Framework

Structured assessment of an open-source agent framework's investment readiness. The novel value is in Steps 2-3: quantifying community health through contribution survival rates and measuring supersession risk — the most common reason external engineering effort is wasted. The final classification (INVEST / EVALUATE-FURTHER / CONTRIBUTE-CAUTIOUSLY / AVOID) calibrates resource allocation before committing development cycles.

When to Use

  • Evaluating whether to adopt an agent framework for production use
  • Assessing dependency risk on a framework your project relies on
  • Deciding whether to contribute engineering effort to an external project
  • Comparing competing frameworks for a build-vs-adopt decision
  • Re-evaluating a framework after a major release, governance change, or acquisition

Inputs

  • Required: framework_url — GitHub URL of the framework repository
  • Optional:
    • comparison_frameworks — list of alternative framework URLs to benchmark against
    • use_case — intended use case for architecture alignment assessment (e.g., "multi-agent orchestration", "tool-use pipelines")
    • contribution_budget — planned engineering hours, for calibrating the investment tier
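
The inputs above can be captured in a small typed container. This is a minimal sketch, not part of the skill itself: the class name `EvaluationInputs`, the validation rules, and the placeholder URL are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class EvaluationInputs:
    """Hypothetical container; field names mirror the input list above."""

    framework_url: str  # required: GitHub URL of the framework repository
    comparison_frameworks: list[str] = field(default_factory=list)  # optional
    use_case: Optional[str] = None  # optional, e.g. "multi-agent orchestration"
    contribution_budget: Optional[float] = None  # optional, planned engineering hours

    def __post_init__(self) -> None:
        # Assumption: only a non-empty URL is enforced; no network check is made.
        if not self.framework_url.strip():
            raise ValueError("framework_url is required")
        if self.contribution_budget is not None and self.contribution_budget < 0:
            raise ValueError("contribution_budget must be non-negative")


inputs = EvaluationInputs(
    framework_url="https://github.com/example-org/example-framework",  # placeholder
    use_case="tool-use pipelines",
)
print(inputs)
```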

Procedure

Step 1: Gather Framework Census

Collect foundational data about the project's size, activity, and landscape position before deeper analysis.

  1. Fetch and read README.md, CONTRIBUTING.md, LICENSE, and any architecture docs (docs/, ARCHITECTURE.md)
  2. Collect quantitative metrics:
    • Stars, forks, open issues, open PRs: gh repo view <repo> --json stargazerCount,forkCount,issues,pullRequests
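    • Dependent repositories: check the "Used by" count on the repository page (the Dependents view of the dependency graph). The GitHub REST API does not appear to expose a dependents endpoint, so record this figure manually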
    • Release cadence: gh release list --limit 10 — note frequency and whether releases follow semver
  3. Calculate bus factor: identify top 5 contributors by commit count over the last 12 months. If the top contributor accounts for >60% of commits, bus factor is critically low
  4. Map landscape position:
    • Pioneer: first mover, defines the category (high influence, high supersession risk to followers)
    • Fast-follower: launched within 6 months of pioneer, iterating on the concept
    • Late entrant: arrived after the category stabilized, competing on features or governance
  5. If comparison_frameworks is provided, gather the same metrics for each alternative

Expected: Census table with stars, forks, dependents, release cadence, bus factor, and landscape position for the target (and for each comparison framework, if provided).

On failure: If the repository is private or the API is rate-limited, fall back to manual README analysis. If metrics are unavailable (e.g., self-hosted GitLab), note the gap and proceed with qualitative assessment.
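
The bus factor check in item 3 reduces to a share-of-commits calculation. The sketch below assumes you already have one author name per commit for the last 12 months (for example from `git log --since="12 months ago" --format='%an'`); the function names and the sample data are hypothetical.

```python
from collections import Counter


def top_contributor_share(commit_authors: list[str], top_n: int = 5):
    """Return the top-N contributors with their share of commits, largest first."""
    counts = Counter(commit_authors)
    total = sum(counts.values())
    if total == 0:
        return []
    return [(author, n / total) for author, n in counts.most_common(top_n)]


def bus_factor_is_critical(commit_authors: list[str], threshold: float = 0.60) -> bool:
    """True when the single top contributor exceeds the threshold share of commits."""
    shares = top_contributor_share(commit_authors, top_n=1)
    return bool(shares) and shares[0][1] > threshold


# Hypothetical sample: one entry per commit in the last 12 months.
sample = ["alice"] * 70 + ["bob"] * 20 + ["carol"] * 10
print(top_contributor_share(sample))
print(bus_factor_is_critical(sample))  # True: alice authored 70% of commits
```

The threshold is strict (`>`), matching the ">60%" wording, so a contributor with exactly 60% of commits is not flagged.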

Step 2: Assess Community Health

Quantify whether the project welcomes, supports, and retains external contributors.

  1. Calculate the external contribution survival rate:
    • Pull the last 50 closed PRs: gh pr list --state closed --limit 50 --json author,mergedAt,closedAt,labels
    • Classify each PR author as internal (org member) or external
    • Compute: survival_rate = merged_external_PRs / total_external_PRs
    • Healthy threshold: >50% survival rate; concerning: <30%
  2. Measure responsiveness:
    • Issue first-response time: median time from issue creation to first maintainer comment
    • PR merge latency: median time from PR open to merge for external PRs
    • Healthy: <7 days first-response, <30 days merge; concerning: >30 days first-response
  3. Assess contributor diversity:
    • External/internal contributor ratio over last 6 months
    • Number of unique external contributors with >=2 merged PRs (repeat contributors signal a healthy ecosystem)
  4. Check governance artifacts:
    • CONTRIBUTING.md exists and is actionable (not just "submit a PR")
    • CODE_OF_CONDUCT.md exists
    • Governance docs describe decision-making process
    • Issue/PR templates guide contributors
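
The survival-rate formula in item 1 can be sketched as follows. The record shape assumes the JSON returned by the `gh pr list --json author,mergedAt,...` command shown above (an `author.login` field, and `mergedAt` empty for unmerged PRs); the set of internal logins, the function names, and the "borderline" label for the 30-50% band are assumptions, since the text only names the two outer thresholds.

```python
from typing import Optional


def survival_rate(closed_prs: list[dict], internal_logins: set[str]) -> Optional[float]:
    """merged_external_PRs / total_external_PRs, or None if there are no external PRs."""
    external = [pr for pr in closed_prs if pr["author"]["login"] not in internal_logins]
    if not external:
        return None
    merged = sum(1 for pr in external if pr.get("mergedAt"))
    return merged / len(external)


def rate_label(rate: Optional[float]) -> str:
    """Map a survival rate onto the thresholds stated above."""
    if rate is None:
        return "insufficient data"
    if rate > 0.50:
        return "healthy"
    if rate < 0.30:
        return "concerning"
    return "borderline"  # assumption: name for the unlabeled 30-50% band


# Hypothetical sample of closed PRs.
prs = [
    {"author": {"login": "maintainer1"}, "mergedAt": "2024-01-05T10:00:00Z"},
    {"author": {"login": "ext-a"}, "mergedAt": "2024-01-06T10:00:00Z"},
    {"author": {"login": "ext-b"}, "mergedAt": None},
    {"author": {"login": "ext-c"}, "mergedAt": "2024-01-09T10:00:00Z"},
    {"author": {"login": "ext-d"}, "mergedAt": None},
]
rate = survival_rate(prs, internal_logins={"maintainer1"})
print(rate, rate_label(rate))  # 0.5 borderline
```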

Expected: Community health scorecard with survival rate, response times, diversity ratio, and governance artifact checklist.

On failure: If PR data is insufficient (e.g., a new project with <20 closed PRs), note the sample size limitation and weight other signals more heavily. If the project uses a non-GitHub platform, adapt the queries to that platform's API.
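
The responsiveness medians in item 2 can be computed from timestamp pairs. This is a rough sketch assuming ISO 8601 timestamps of the kind GitHub returns (e.g. `2024-01-05T10:00:00Z`); the helper names, the sample pairs, and the "borderline" label for the 7-30 day band are hypothetical.

```python
from datetime import datetime
from statistics import median


def parse_ts(ts: str) -> datetime:
    """Parse an ISO 8601 timestamp such as '2024-01-05T10:00:00Z'."""
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))


def median_days(pairs: list[tuple[str, str]]) -> float:
    """Median elapsed days across (start, end) timestamp pairs."""
    if not pairs:
        raise ValueError("no samples to measure")
    deltas = [
        (parse_ts(end) - parse_ts(start)).total_seconds() / 86400
        for start, end in pairs
    ]
    return median(deltas)


def first_response_label(days: float) -> str:
    """Apply the first-response thresholds stated above."""
    if days < 7:
        return "healthy"
    if days > 30:
        return "concerning"
    return "borderline"  # assumption: name for the unlabeled 7-30 day band


# Hypothetical (issue created, first maintainer comment) pairs.
pairs = [
    ("2024-01-01T00:00:00Z", "2024-01-03T00:00:00Z"),  # 2 days
    ("2024-01-02T00:00:00Z", "2024-01-12T00:00:00Z"),  # 10 days
    ("2024-01-05T00:00:00Z", "2024-01-06T00:00:00Z"),  # 1 day
]
days = median_days(pairs)
print(days, first_response_label(days))  # 2.0 healthy
```

The same `median_days` helper applies to PR merge latency by passing (opened, merged) pairs for external PRs.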

Step 3: Calculate Supersession Risk

Determine how likely it is that external contributions will be rendered obsolete by internal development — the single biggest risk for framework adopters and contributors.

  1. Sample the last 50-100 merged external PRs (or all if fewer exist)
  2. For each merged external PR, check whether the contributed code was later:
    • Reverted: explicit revert commit referencing the PR
    • Rewritten: same file/module substantially changed within 90 days by an internal contributor
    • Obsoleted: feature removed or replaced in a subsequent release
  3. Calculate: supersession_rate = (reverted + rewritten + obsoleted) / total_merged_external
  4. Map the published roadmap (if available) against areas where external contributors are active:
    • High overlap = high supersession risk (internals will build over external work)
    • Low overlap = lower supersession risk (externals fill gaps internals won't)
  5. Check for "contribution traps": areas that look contribution-friendly but are scheduled for internal rewrite
  6. Reference benchmark: the NemoClaw analysis showed 71% of external PRs superseded within 6 months; use it as a calibration point

Expected: Supersession rate as a percentage, with a breakdown by type (reverted/rewritten/obsoleted), plus a roadmap overlap assessment.

On failure: If commit history is shallow or squash-merged (losing attribution), estimate supersession by comparing external PR file paths against files changed in subsequent releases. Note the reduced confidence in the estimate.
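
Once each sampled PR has been labeled by hand, the supersession rate and its breakdown follow directly from the formula in item 3. The sketch below assumes each merged external PR receives exactly one outcome label (so a PR that was rewritten and later removed is counted once); the label strings and function name are hypothetical.

```python
from collections import Counter

SUPERSEDED = ("reverted", "rewritten", "obsoleted")
OUTCOMES = SUPERSEDED + ("survived",)


def supersession_summary(outcomes: list[str]) -> dict:
    """Compute (reverted + rewritten + obsoleted) / total_merged_external."""
    unknown = set(outcomes) - set(OUTCOMES)
    if unknown:
        raise ValueError(f"unknown outcome labels: {sorted(unknown)}")
    total = len(outcomes)
    if total == 0:
        raise ValueError("no merged external PRs in the sample")
    counts = Counter(outcomes)
    superseded = sum(counts[kind] for kind in SUPERSEDED)
    return {
        "supersession_rate": superseded / total,
        "breakdown": {kind: counts[kind] / total for kind in SUPERSEDED},
        "sample_size": total,
    }


# Hypothetical labels for 20 merged external PRs.
labels = ["reverted"] * 2 + ["rewritten"] * 5 + ["obsoleted"] * 3 + ["survived"] * 10
summary = supersession_summary(labels)
print(summary["supersession_rate"])  # 0.5
print(summary["breakdown"])
```

Reporting `sample_size` alongside the rate makes it easier to flag low-confidence estimates when fewer than 50 merged external PRs exist.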

Step 4: Evaluate Architecture Alignment

Assess whether the framework's architecture supports your use case without excessive lock-in.

  1. Map extension points:
    • Plugin/extension API: does the framework expose a documented plugin interface?
    • Configuration surface: can behavior be customized without forking?
    • Hook/callback system: can you intercept and modify framework behavior at key points?
  2. Assess lock-in risk:
    • Rewrite cost: estimate engineering effort to migrate away (days/weeks/months)
    • Data portability: can data/state be exported in standard formats?
    • Standard compliance: does the framework use open standards (agentskills.io, MCP, A2A) or proprietary protocols?
  3. Evaluate API stability:
    • Count breaking changes per major release (CHANGELOG, migration guides)
    • Check for deprecation policy (advance warning before removal)
    • Assess semver compliance (breaking changes only in major versions)
  4. Check alignment with your specific use case:
    • If use_case is provided, evaluate whether the framework's architecture naturally supports it
    • Identify any architectural mismatches that would require workarounds
  5. Evaluate interoperability:
    • agentskills.io compatibility (skill model alignment)
    • MCP support (tool integration)
    • A2A protocol support (agent-to-agent communication)

Expected: Architecture alignment report with extension point inventory, lock-in risk assessment (low/medium/high), API stability score, and use-case fit evaluation.

On failure: If architecture documentation is sparse, derive the assessment from code structure and public API surface. If the framework is too young to have a stability history, note this and weight governance signals more heavily.
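
The semver compliance check in item 3 can be partly mechanized once the changelog has been read. This sketch assumes you have manually recorded, for each release tag, whether its notes list breaking changes; the function names and sample data are hypothetical, and pre-release or build suffixes are not handled.

```python
def parse_version(tag: str) -> tuple[int, int, int]:
    """Parse 'v1.2.3' or '1.2.3' into integers (no pre-release suffixes)."""
    major, minor, patch = tag.lstrip("v").split(".")
    return int(major), int(minor), int(patch)


def semver_violations(releases: list[tuple[str, bool]]) -> list[str]:
    """Return tags that contain breaking changes without a major version bump.

    `releases` is ordered oldest to newest; each item is
    (tag, has_breaking_changes) as recorded from the changelog.
    """
    violations = []
    for (prev_tag, _), (tag, breaking) in zip(releases, releases[1:]):
        major = parse_version(tag)[0]
        if major == 0:
            # Semantic Versioning treats 0.y.z as initial development,
            # where anything may change, so these are not flagged.
            continue
        if breaking and major == parse_version(prev_tag)[0]:
            violations.append(tag)
    return violations


# Hypothetical release history.
history = [
    ("v1.0.0", False),
    ("v1.1.0", True),   # breaking change in a minor release
    ("v2.0.0", True),   # breaking change with a major bump
    ("v2.0.1", False),
]
print(semver_violations(history))  # ['v1.1.0']
```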

Step 5: Assess Governance and Sustainability

Evaluate whether the project's governance model supports long-term viability and fair treatment of external contributors.

  1. Classify governance model:
    • BDFL (Benevolent Dictator for Life): single decision-maker — fast decisions, bus factor risk
    • Committee/Core team: distributed decision-making — slower but more resilient
    • Foundation-backed: formal governance (Apache, Linux Foundation, CNCF) — most sustainable
    • Corporate-controlled: single company drives development — watch for rug-pull risk
  2. Assess funding and sustainability:
    • Funding sources: VC-backed, corporate-sponsored, grants, community-funded, unfunded
    • Full-time maintainer count: >=2 is healthy; 0 is a red flag
    • Revenue model (if any): how does the project sustain itself?
  3. Evaluate contributor protections:
    • License type: permissive (MIT, Apache-2.0) vs copyleft (GPL) vs custom
    • CLA requirements: does signing a CLA transfer rights that disadvantage contributors?
    • Contributor recognition: are external contributors credited in releases, changelogs, docs?
  4. Check security posture:
    • Security disclosure policy (SECURITY.md or equivalent)
    • Median time from CVE disclosure to patch release
    • Dependency update practices (Dependabot, Renovate, manual)
  5. Assess trajectory:
    • Is the governance model evolving (e.g., moving toward a foundation)?
    • Has there been a recent leadership change, acquisition, or relicensing?
    • Are there public conflicts between maintainers and contributors?

Expected: Governance assessment with model classification, sustainability rating (sustainable/at-risk/critical), contributor protection evaluation, and security posture summary.

On failure: If governance information is undocumented, treat the absence itself as a yellow flag. Check for implicit governance by examining who merges PRs, who closes issues, and who makes release decisions.

Step 6: Classify Investment Readiness

Synthesize all findings into a four-tier classification with specific justifications and actionable recommendations.

  1. Score each dimension (1-5 scale):
    • Community health: survival rate, responsiveness, diversity
    • Supersession risk: rate, roadmap overlap, contribution traps (the score is inverted: a lower supersession rate earns a higher score)
    • Architecture alignment: extension points, lock-in, stability, use-case fit
    • Governance sustainability: model, funding, protections, security
  2. Apply classification thresholds:
    • INVEST (all dimensions >=4): Healthy community, low supersession (<20%), aligned architecture, sustainable governance. Safe to adopt and contribute engineering effort.
    • EVALUATE-FURTHER (mixed, no dimension <3): Mixed signals requiring specific follow-ups. Document what needs clarification and set a re-evaluation date.
    • CONTRIBUTE-CAUTIOUSLY (any dimension 2, none <2): High supersession (>40%) or governance concerns. Limit contributions to explicitly requested work, maintainer-approved scope, or plugin/extension development that is decoupled from core.
    • AVOID (any dimension 1): Critical red flags — abandoned project, hostile to externals (survival rate <15%), incompatible license, or imminent rug-pull indicators. Do not invest engineering effort.
  3. Write the classification report:
    • Lead with the tier classification and one-sentence rationale
    • Summarize each dimension score with key evidence
    • If contribution_budget was provided, recommend how to allocate those hours given the tier
    • For EVALUATE-FURTHER, list specific questions that need answers and propose a timeline
    • For CONTRIBUTE-CAUTIOUSLY, specify which contribution types are safe (plugins, docs, tests) vs risky (core features)
  4. If comparison_frameworks were evaluated, produce a comparison matrix ranking all frameworks

Expected: Classification report with tier, dimension scores, evidence summary, and actionable recommendations tailored to the investment context.

On failure: If data gaps prevent confident classification, default to EVALUATE-FURTHER with explicit documentation of what data is missing and how to obtain it. Never default to INVEST when uncertain.
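
The tier thresholds in item 2 can be expressed as a small decision function. This sketch assumes integer scores from 1 to 5 for the four dimensions, with the supersession score already inverted (5 means lowest risk). It checks the tiers in a fixed order (AVOID, then CONTRIBUTE-CAUTIOUSLY, then INVEST, with EVALUATE-FURTHER as the fallback), which is an assumption about precedence rather than something the thresholds state. The dimension keys and function name are hypothetical.

```python
DIMENSIONS = (
    "community_health",
    "supersession_risk",          # inverted: 5 = lowest supersession risk
    "architecture_alignment",
    "governance_sustainability",
)


def classify(scores: dict[str, int]) -> str:
    """Map four 1-5 dimension scores onto an investment tier."""
    if any(dim not in scores for dim in DIMENSIONS):
        # Data gap: the procedure says never to default to INVEST when uncertain.
        return "EVALUATE-FURTHER"
    values = [scores[dim] for dim in DIMENSIONS]
    if any(not 1 <= v <= 5 for v in values):
        raise ValueError("scores must be between 1 and 5")
    lowest = min(values)
    if lowest == 1:
        return "AVOID"
    if lowest == 2:
        return "CONTRIBUTE-CAUTIOUSLY"
    if lowest >= 4:
        return "INVEST"
    return "EVALUATE-FURTHER"


# Hypothetical scorecard.
scorecard = {
    "community_health": 4,
    "supersession_risk": 2,
    "architecture_alignment": 5,
    "governance_sustainability": 4,
}
print(classify(scorecard))  # CONTRIBUTE-CAUTIOUSLY
```

Because the function keys off the lowest score, a single weak dimension decides the tier regardless of how strong the others are, which matches the "any dimension" wording of the AVOID and CONTRIBUTE-CAUTIOUSLY thresholds.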

Validation

  • Census data collected: stars, forks, dependents, release cadence, bus factor, landscape position
  • Community health quantified: survival rate, response times, contributor diversity, governance artifacts
  • Supersession risk calculated with breakdown by type (reverted/rewritten/obsoleted)
  • Architecture alignment assessed: extension points, lock-in risk, API stability, use-case fit
  • Governance evaluated: model, funding, contributor protections, security posture
  • Classification produced: one of INVEST / EVALUATE-FURTHER / CONTRIBUTE-CAUTIOUSLY / AVOID
  • Each dimension score justified with specific evidence from the analysis
  • Recommendations are actionable and calibrated to the contribution budget (if provided)
  • Data gaps and confidence limitations explicitly documented

Pitfalls

  • Confusing popularity with health: High stars but low contributor diversity means a single point of failure. A 50k-star project with one maintainer is less healthy than a 2k-star project with 15 active contributors.
  • Ignoring supersession risk: The most common reason external contributions fail. A welcoming community means nothing if internal development routinely overwrites external work.
  • Over-weighting architecture without checking governance: A beautifully designed framework can still fail if the governance model is unsustainable or hostile to externals.
  • Treating EVALUATE-FURTHER as AVOID: Mixed signals require investigation, not rejection. Set a concrete re-evaluation date and list the specific questions to answer.
  • Snapshot bias: All metrics are point-in-time. A declining project with great current metrics is worse than an improving project with mediocre current metrics. Always check the trend direction over 6-12 months.
  • CLA complacency: Some CLAs transfer copyright to the project owner, meaning your contributions become their proprietary asset. Read the CLA text, not just the checkbox.
  • Anchoring on a single framework: Without comparison frameworks, any project looks either great or terrible. Always benchmark against at least one alternative, even informally.

GitHub Repository

pjt222/agent-almanac
Path: i18n/caveman-lite/skills/evaluate-agent-framework
