Back to Skills

evaluating-llms-harness

zechenzhangAGI
Updated 2 days ago
1543 views
62
2
62
View on GitHub
Testingaitestingapi

About

This Claude Skill runs the lm-evaluation-harness to benchmark LLMs across 60+ standardized academic tasks like MMLU and GSM8K. It's designed for developers to compare model quality, track training progress, or report academic results. The tool supports various backends including HuggingFace and vLLM models.

Quick Install

Claude Code

Recommended
Primary
npx skills add zechenzhangAGI/AI-research-SKILLs -a claude-code
Plugin CommandAlternative
/plugin add https://github.com/zechenzhangAGI/AI-research-SKILLs
Git CloneAlternative
git clone https://github.com/zechenzhangAGI/AI-research-SKILLs.git ~/.claude/skills/evaluating-llms-harness

Copy and paste this command in Claude Code to install this skill

GitHub Repository

zechenzhangAGI/AI-research-SKILLs
Path: 11-evaluation/lm-evaluation-harness
0
aiai-researchclaudeclaude-codeclaude-skillscodex

Related Skills

content-collections

Meta

This skill provides a production-tested setup for Content Collections, a TypeScript-first tool that transforms Markdown/MDX files into type-safe data collections with Zod validation. Use it when building blogs, documentation sites, or content-heavy Vite + React applications to ensure type safety and automatic content validation. It covers everything from Vite plugin configuration and MDX compilation to deployment optimization and schema validation.

View skill

polymarket

Meta

This skill enables developers to build applications with the Polymarket prediction markets platform, including API integration for trading and market data. It also provides real-time data streaming via WebSocket to monitor live trades and market activity. Use it for implementing trading strategies or creating tools that process live market updates.

View skill

creating-opencode-plugins

Meta

This skill helps developers create OpenCode plugins that hook into 25+ event types like commands, files, and LSP operations. It provides the plugin structure, event API specifications, and implementation patterns for JavaScript/TypeScript modules. Use it when you need to intercept, monitor, or extend the OpenCode AI assistant's lifecycle with custom event-driven logic.

View skill

himalaya-email-manager

Communication

This Claude Skill enables email management through the Himalaya CLI tool using IMAP. It allows developers to search, summarize, and delete emails from an IMAP account with natural language queries. Use it for automated email workflows like getting daily summaries or performing batch operations directly from Claude.

View skill