evaluate-agent-framework
About
This skill evaluates the investment readiness of open-source agent frameworks by analyzing community health, supersession risk, architecture, and governance. It produces a four-tier classification (INVEST, EVALUATE-FURTHER, CONTRIBUTE-CAUTIOUSLY, AVOID) to guide the allocation of engineering resources. Use it to carry out structured due diligence before committing to a framework.
Quick install
Claude Code
Recommended:
npx skills add pjt222/agent-almanac -a claude-code
/plugin add https://github.com/pjt222/agent-almanac
git clone https://github.com/pjt222/agent-almanac.git ~/.claude/skills/evaluate-agent-framework
Copy and paste the command into Claude Code to install this skill.
Documentation
Evaluate Agent Framework
Structured assessment of an open-source agent framework's investment readiness. The novel value is in Steps 2-3: quantifying community health through contribution survival rates and measuring supersession risk — the most common reason external engineering effort is wasted. The final classification (INVEST / EVALUATE-FURTHER / CONTRIBUTE-CAUTIOUSLY / AVOID) calibrates resource allocation before committing development cycles.
When to Use
- Evaluating whether to adopt an agent framework for production use
- Assessing dependency risk on a framework your project relies on
- Deciding whether to contribute engineering effort to an external project
- Comparing competing frameworks for a build-vs-adopt decision
- Re-evaluating a framework after a major release, governance change, or acquisition
Inputs
- Required:
- framework_url: GitHub URL of the framework repository
- Optional:
- comparison_frameworks: list of alternative framework URLs to benchmark against
- use_case: intended use case for architecture alignment assessment (e.g., "multi-agent orchestration", "tool-use pipelines")
- contribution_budget: planned engineering hours, for calibrating the investment tier
Procedure
Step 1: Gather Framework Census
Collect foundational data about the project's size, activity, and landscape position before deeper analysis.
- Fetch and read README.md, CONTRIBUTING.md, LICENSE, and any architecture docs (docs/, ARCHITECTURE.md)
- Collect quantitative metrics:
- Stars, forks, open issues, open PRs: gh repo view <repo> --json stargazerCount,forkCount,issues,pullRequests
- Dependent repositories: check GitHub's "Used by" count, or try gh api repos/<owner>/<repo>/dependents (this endpoint may not be available; the "Used by" count is the fallback)
- Release cadence: gh release list --limit 10 (note frequency and whether releases follow semver)
- Calculate bus factor: identify top 5 contributors by commit count over the last 12 months. If the top contributor accounts for >60% of commits, bus factor is critically low
- Map landscape position:
- Pioneer: first mover, defines the category (high influence, high supersession risk to followers)
- Fast-follower: launched within 6 months of pioneer, iterating on the concept
- Late entrant: arrived after the category stabilized, competing on features or governance
- If comparison_frameworks is provided, gather the same metrics for each alternative
Expected: Census table with stars, forks, dependents, release cadence, bus factor, and landscape position for the target (and for each comparison framework, if provided).
On failure: If the repository is private or the API is rate-limited, fall back to manual README analysis. If metrics are unavailable (e.g., self-hosted GitLab), note the gap and proceed with qualitative assessment.
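The bus factor check in this step can be sketched in Python. This is a minimal illustration rather than part of the skill: the function name bus_factor_report and the sample author list are hypothetical, and it assumes you have already exported one author name per commit for the last 12 months (for example with git log --since="12 months ago" --format=%an).

```python
from collections import Counter


def bus_factor_report(commit_authors, top_n=5, critical_share=0.60):
    """Summarise commit concentration from a list with one author name per commit."""
    counts = Counter(commit_authors)
    total = sum(counts.values())
    if total == 0:
        # No commits in the window: nothing to measure.
        return {"top": [], "top_share": 0.0, "critical": False}
    top = counts.most_common(top_n)
    top_share = top[0][1] / total
    # Step 1 flags the bus factor as critically low when the top
    # contributor accounts for more than 60% of commits.
    return {"top": top, "top_share": top_share, "critical": top_share > critical_share}


# Hypothetical sample: 100 commits over the last 12 months.
authors = ["alice"] * 70 + ["bob"] * 20 + ["carol"] * 10
report = bus_factor_report(authors)
print(report["top"], report["top_share"], report["critical"])
```

In the sample, the top contributor holds 70 of 100 commits, so the report marks the bus factor as critical.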
Step 2: Assess Community Health
Quantify whether the project welcomes, supports, and retains external contributors.
- Calculate the external contribution survival rate:
- Pull the last 50 closed PRs: gh pr list --state closed --limit 50 --json author,mergedAt,closedAt,labels
- Classify each PR author as internal (org member) or external
- Compute: survival_rate = merged_external_PRs / total_external_PRs
- Healthy threshold: >50% survival rate; concerning: <30%
- Measure responsiveness:
- Issue first-response time: median time from issue creation to first maintainer comment
- PR merge latency: median time from PR open to merge for external PRs
- Healthy: <7 days first-response, <30 days merge; concerning: >30 days first-response
- Assess contributor diversity:
- External/internal contributor ratio over last 6 months
- Number of unique external contributors with >=2 merged PRs (repeat contributors signal a healthy ecosystem)
- Check governance artifacts:
- CONTRIBUTING.md exists and is actionable (not just "submit a PR")
- CODE_OF_CONDUCT.md exists
- Governance docs describe decision-making process
- Issue/PR templates guide contributors
Expected: Community health scorecard with survival rate, response times, diversity ratio, and governance artifact checklist.
On failure: If PR data is insufficient (new project with <20 closed PRs), note the sample size limitation and weight other signals more heavily. If the project uses a non-GitHub platform, adapt the queries to that platform's API.
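The survival rate formula from this step can be sketched in Python as follows. The sketch assumes the JSON shape produced by the gh pr list command above (an author object with a login field, and mergedAt set to null for unmerged PRs). The internal_logins set is a hypothetical input you would build from the organization's member list, and the "mixed" label for rates between 30% and 50% is an assumption, since the text only names the healthy and concerning bands.

```python
import json


def survival_rate(prs, internal_logins):
    """Return merged_external_PRs / total_external_PRs, or None if no external PRs."""
    external = [pr for pr in prs if pr["author"]["login"] not in internal_logins]
    if not external:
        return None
    merged = sum(1 for pr in external if pr.get("mergedAt"))
    return merged / len(external)


def rate_survival(rate):
    """Map a survival rate onto the bands from Step 2."""
    if rate is None:
        return "insufficient data"
    if rate > 0.50:
        return "healthy"
    if rate < 0.30:
        return "concerning"
    return "mixed"  # assumption: the text does not name the 30-50% band


# Hypothetical sample in the shape of `gh pr list --json author,mergedAt,closedAt,labels`.
sample = json.loads("""[
  {"author": {"login": "maintainer1"}, "mergedAt": "2024-05-01T10:00:00Z", "closedAt": "2024-05-01T10:00:00Z", "labels": []},
  {"author": {"login": "ext-a"}, "mergedAt": "2024-05-02T09:00:00Z", "closedAt": "2024-05-02T09:00:00Z", "labels": []},
  {"author": {"login": "ext-b"}, "mergedAt": null, "closedAt": "2024-05-03T12:00:00Z", "labels": []},
  {"author": {"login": "ext-c"}, "mergedAt": null, "closedAt": "2024-05-04T08:00:00Z", "labels": []},
  {"author": {"login": "ext-a"}, "mergedAt": null, "closedAt": "2024-05-05T08:00:00Z", "labels": []}
]""")

rate = survival_rate(sample, internal_logins={"maintainer1"})
print(rate, rate_survival(rate))
```

Here one of four external PRs was merged, so the sample lands in the concerning band.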
Step 3: Calculate Supersession Risk
Determine how likely it is that external contributions will be rendered obsolete by internal development — the single biggest risk for framework adopters and contributors.
- Sample the last 50-100 merged external PRs (or all if fewer exist)
- For each merged external PR, check whether the contributed code was later:
- Reverted: explicit revert commit referencing the PR
- Rewritten: same file/module substantially changed within 90 days by an internal contributor
- Obsoleted: feature removed or replaced in a subsequent release
- Calculate: supersession_rate = (reverted + rewritten + obsoleted) / total_merged_external
- Map the published roadmap (if available) against areas where external contributors are active:
- High overlap = high supersession risk (internals will build over external work)
- Low overlap = lower supersession risk (externals fill gaps internals won't)
- Check for "contribution traps": areas that look contribution-friendly but are scheduled for internal rewrite
- Reference benchmark: the NemoClaw analysis showed 71% of external PRs superseded within 6 months; use this as a calibration point
Expected: Supersession rate as a percentage, with breakdown by type (reverted/rewritten/obsoleted), plus a roadmap overlap assessment.
On failure: If commit history is shallow or squash-merged (losing attribution), estimate supersession by comparing external PR file paths against files changed in subsequent releases. Note the reduced confidence in the estimate.
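Once each sampled PR has been labeled by hand, the supersession rate and its breakdown reduce to simple counting. The sketch below assumes one label per merged external PR; the "intact" label for PRs that were not superseded, the function name, and the sample counts are all hypothetical.

```python
from collections import Counter

SUPERSEDED = ("reverted", "rewritten", "obsoleted")
ALLOWED = SUPERSEDED + ("intact",)  # "intact" is an assumed label for surviving PRs


def supersession_summary(outcomes):
    """Compute (reverted + rewritten + obsoleted) / total_merged_external with a breakdown."""
    unknown = set(outcomes) - set(ALLOWED)
    if unknown:
        raise ValueError(f"unknown outcome labels: {sorted(unknown)}")
    total = len(outcomes)
    if total == 0:
        return None
    counts = Counter(outcomes)
    superseded = sum(counts[label] for label in SUPERSEDED)
    return {
        "rate": superseded / total,
        "breakdown": {label: counts[label] / total for label in SUPERSEDED},
    }


# Hypothetical sample: 20 merged external PRs, labeled after manual review.
labels = ["reverted"] * 2 + ["rewritten"] * 5 + ["obsoleted"] * 3 + ["intact"] * 10
summary = supersession_summary(labels)
print(summary)
```

Keeping the breakdown alongside the headline rate matters because a high rewritten share points to roadmap overlap, while a high reverted share points to review or quality problems.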
Step 4: Evaluate Architecture Alignment
Assess whether the framework's architecture supports your use case without excessive lock-in.
- Map extension points:
- Plugin/extension API: does the framework expose a documented plugin interface?
- Configuration surface: can behavior be customized without forking?
- Hook/callback system: can you intercept and modify framework behavior at key points?
- Assess lock-in risk:
- Rewrite cost: estimate engineering effort to migrate away (days/weeks/months)
- Data portability: can data/state be exported in standard formats?
- Standard compliance: does the framework use open standards (agentskills.io, MCP, A2A) or proprietary protocols?
- Evaluate API stability:
- Count breaking changes per major release (CHANGELOG, migration guides)
- Check for deprecation policy (advance warning before removal)
- Assess semver compliance (breaking changes only in major versions)
- Check alignment with your specific use case:
- If use_case is provided, evaluate whether the framework's architecture naturally supports it
- Identify any architectural mismatches that would require workarounds
- Evaluate interoperability:
- agentskills.io compatibility (skill model alignment)
- MCP support (tool integration)
- A2A protocol support (agent-to-agent communication)
Expected: Architecture alignment report with extension point inventory, lock-in risk assessment (low/medium/high), API stability score, and use-case fit evaluation.
On failure: If architecture documentation is sparse, derive the assessment from code structure and public API surface. If the framework is too young for stability history, note this and weight governance signals more heavily.
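The semver compliance check in the API stability item can be approximated with a short script. This is a sketch under stated assumptions: release tags follow a plain MAJOR.MINOR.PATCH shape (optionally prefixed with "v"), and the has_breaking_change flag for each release is something you record yourself from the CHANGELOG or migration guides. The function names and sample releases are hypothetical.

```python
def parse_version(tag):
    """Parse 'v1.2.3' or '1.2.3' into a tuple of ints (no pre-release suffixes)."""
    major, minor, patch = tag.lstrip("v").split(".")
    return int(major), int(minor), int(patch)


def semver_violations(releases):
    """Return tags that shipped a breaking change without bumping the major version.

    releases: list of (tag, has_breaking_change) in chronological order.
    Note: 0.x releases are checked like any other here, although SemVer
    allows breaking changes at any time during 0.y.z development.
    """
    violations = []
    previous = None
    for tag, breaking in releases:
        current = parse_version(tag)
        if previous is not None and breaking and current[0] == previous[0]:
            violations.append(tag)
        previous = current
    return violations


# Hypothetical release history, annotated by hand from the changelog.
releases = [
    ("v1.0.0", False),
    ("v1.1.0", True),   # breaking change in a minor release
    ("v1.1.1", False),
    ("v2.0.0", True),   # breaking change in a major release: compliant
    ("v2.0.1", True),   # breaking change in a patch release
]
print(semver_violations(releases))
```

The count of violations per major version line can feed directly into the API stability score.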
Step 5: Assess Governance and Sustainability
Evaluate whether the project's governance model supports long-term viability and fair treatment of external contributors.
- Classify governance model:
- BDFL (Benevolent Dictator for Life): single decision-maker — fast decisions, bus factor risk
- Committee/Core team: distributed decision-making — slower but more resilient
- Foundation-backed: formal governance (Apache, Linux Foundation, CNCF) — most sustainable
- Corporate-controlled: single company drives development — watch for rug-pull risk
- Assess funding and sustainability:
- Funding sources: VC-backed, corporate-sponsored, grants, community-funded, unfunded
- Full-time maintainer count: >=2 is healthy; 0 is a red flag
- Revenue model (if any): how does the project sustain itself?
- Evaluate contributor protections:
- License type: permissive (MIT, Apache-2.0) vs copyleft (GPL) vs custom
- CLA requirements: does signing a CLA transfer rights that disadvantage contributors?
- Contributor recognition: are external contributors credited in releases, changelogs, docs?
- Check security posture:
- Security disclosure policy (SECURITY.md or equivalent)
- Median time from CVE disclosure to patch release
- Dependency update practices (Dependabot, Renovate, manual)
- Assess trajectory:
- Is the governance model evolving (e.g., moving toward a foundation)?
- Has there been a recent leadership change, acquisition, or relicensing?
- Are there public conflicts between maintainers and contributors?
Expected: Governance assessment with model classification, sustainability rating (sustainable/at-risk/critical), contributor protection evaluation, and security posture summary.
On failure: If governance information is undocumented, treat the absence itself as a yellow flag. Check for implicit governance by examining who merges PRs, who closes issues, and who makes release decisions.
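The median disclosure-to-patch time from the security posture check can be computed as below. The sketch assumes you have collected disclosure and patch dates by hand from the project's advisories; the function name and the sample dates are hypothetical, and no threshold is applied because the text does not define one.

```python
from datetime import date
from statistics import median


def median_patch_days(advisories):
    """Median days from disclosure to patch release, ignoring unpatched advisories.

    advisories: list of (disclosed, patched) date pairs; patched is None if no fix shipped.
    """
    days = [
        (patched - disclosed).days
        for disclosed, patched in advisories
        if patched is not None
    ]
    return median(days) if days else None


# Hypothetical advisory history.
advisories = [
    (date(2024, 1, 10), date(2024, 1, 13)),
    (date(2024, 3, 1), date(2024, 3, 15)),
    (date(2024, 6, 20), date(2024, 6, 27)),
    (date(2024, 9, 1), None),  # still unpatched: report separately
]
unpatched = sum(1 for _, patched in advisories if patched is None)
print(median_patch_days(advisories), unpatched)
```

Unpatched advisories are excluded from the median, so they should be reported alongside it rather than silently dropped.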
Step 6: Classify Investment Readiness
Synthesize all findings into a four-tier classification with specific justifications and actionable recommendations.
- Score each dimension (1-5 scale):
- Community health: survival rate, responsiveness, diversity
- Supersession risk: rate, roadmap overlap, contribution traps (invert: lower is better)
- Architecture alignment: extension points, lock-in, stability, use-case fit
- Governance sustainability: model, funding, protections, security
- Apply classification thresholds:
- INVEST (all dimensions >=4): Healthy community, low supersession (<20%), aligned architecture, sustainable governance. Safe to adopt and contribute engineering effort.
- EVALUATE-FURTHER (mixed, no dimension <2): Mixed signals requiring specific follow-ups. Document what needs clarification and set a re-evaluation date.
- CONTRIBUTE-CAUTIOUSLY (any dimension 2, none <2): High supersession (>40%) or governance concerns. Limit contributions to explicitly requested work, maintainer-approved scope, or plugin/extension development that is decoupled from core.
- AVOID (any dimension 1): Critical red flags — abandoned project, hostile to externals (survival rate <15%), incompatible license, or imminent rug-pull indicators. Do not invest engineering effort.
- Write the classification report:
- Lead with the tier classification and one-sentence rationale
- Summarize each dimension score with key evidence
- If contribution_budget was provided, recommend how to allocate those hours given the tier
- For EVALUATE-FURTHER, list specific questions that need answers and propose a timeline
- For CONTRIBUTE-CAUTIOUSLY, specify which contribution types are safe (plugins, docs, tests) vs risky (core features)
- If comparison_frameworks were evaluated, produce a comparison matrix ranking all frameworks
Expected: Classification report with tier, dimension scores, evidence summary, and actionable recommendations tailored to the investment context.
On failure: If data gaps prevent confident classification, default to EVALUATE-FURTHER with explicit documentation of what data is missing and how to obtain it. Never default to INVEST when uncertain.
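The tier thresholds in this step can be expressed as a small decision function. This is one possible reading, not a definitive rule: the dimension keys are hypothetical names, the supersession score is assumed to be already inverted (5 means low risk), and because the EVALUATE-FURTHER and CONTRIBUTE-CAUTIOUSLY conditions both match when the lowest score is 2, the sketch assumes a precedence order of AVOID, INVEST, CONTRIBUTE-CAUTIOUSLY, then EVALUATE-FURTHER.

```python
DIMENSIONS = (
    "community_health",
    "supersession_risk",       # assumed already inverted: 5 = low risk
    "architecture_alignment",
    "governance_sustainability",
)


def classify(scores):
    """Map 1-5 dimension scores onto the four investment tiers."""
    # Data gaps default to EVALUATE-FURTHER, never to INVEST.
    if any(scores.get(d) is None for d in DIMENSIONS):
        return "EVALUATE-FURTHER"
    values = [scores[d] for d in DIMENSIONS]
    if any(not 1 <= v <= 5 for v in values):
        raise ValueError("scores must be on the 1-5 scale")
    lowest = min(values)
    if lowest == 1:
        return "AVOID"
    if lowest >= 4:
        return "INVEST"
    if lowest == 2:
        # Assumed precedence: a score of 2 takes the more cautious tier.
        return "CONTRIBUTE-CAUTIOUSLY"
    return "EVALUATE-FURTHER"


# Hypothetical scorecard.
scorecard = {
    "community_health": 4,
    "supersession_risk": 2,
    "architecture_alignment": 4,
    "governance_sustainability": 3,
}
print(classify(scorecard))
```

The function only selects the tier; the report still needs the evidence behind each score and, where a budget was given, the allocation recommendation.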
Validation
- Census data collected: stars, forks, dependents, release cadence, bus factor, landscape position
- Community health quantified: survival rate, response times, contributor diversity, governance artifacts
- Supersession risk calculated with breakdown by type (reverted/rewritten/obsoleted)
- Architecture alignment assessed: extension points, lock-in risk, API stability, use-case fit
- Governance evaluated: model, funding, contributor protections, security posture
- Classification produced: one of INVEST / EVALUATE-FURTHER / CONTRIBUTE-CAUTIOUSLY / AVOID
- Each dimension score justified with specific evidence from the analysis
- Recommendations are actionable and calibrated to the contribution budget (if provided)
- Data gaps and confidence limitations explicitly documented
Pitfalls
- Confusing popularity with health: High stars but low contributor diversity means a single point of failure. A 50k-star project with one maintainer is less healthy than a 2k-star project with 15 active contributors.
- Ignoring supersession risk: The most common reason external contributions fail. A welcoming community means nothing if internal development routinely overwrites external work.
- Over-weighting architecture without checking governance: A beautifully designed framework can still fail if the governance model is unsustainable or hostile to externals.
- Treating EVALUATE-FURTHER as AVOID: Mixed signals require investigation, not rejection. Set a concrete re-evaluation date and list the specific questions to answer.
- Snapshot bias: All metrics are point-in-time. A declining project with great current metrics is worse than an improving project with mediocre current metrics. Always check the trend direction over 6-12 months.
- CLA complacency: Some CLAs transfer copyright to the project owner, meaning your contributions become their proprietary asset. Read the CLA text, not just the checkbox.
- Anchoring on a single framework: Without comparison frameworks, any project looks either great or terrible. Always benchmark against at least one alternative, even informally.
Related Skills
- polish-claw-project — contribution workflow this assessment informs
- review-software-architecture — used in Step 4 for architecture evaluation
- forage-solutions — alternative framework discovery for comparison
- search-prior-art — landscape mapping and prior work analysis
- security-audit-codebase — security posture assessment referenced in Step 5
- assess-ip-landscape — license and IP risk analysis
