clip
About
CLIP is a vision-language model for zero-shot image classification and cross-modal retrieval, requiring no fine-tuning. It excels at general-purpose tasks like image-text matching, semantic search, and content moderation. Developers can use it for vision-language applications by providing image and text pairs for similarity scoring.
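The similarity scoring described above can be sketched roughly as follows. This is a minimal illustration, not the skill's actual code: the embeddings are made-up toy vectors standing in for the outputs of CLIP's image and text encoders, the function names are hypothetical, and the logit scale of 100 is an assumed typical value.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def zero_shot_scores(image_embedding, text_embeddings, logit_scale=100.0):
    """Softmax over scaled image-text cosine similarities.

    logit_scale=100.0 is an assumption for illustration, not a value
    taken from this page.
    """
    logits = [logit_scale * cosine_similarity(image_embedding, t)
              for t in text_embeddings]
    # Subtract the max logit before exponentiating for numerical stability.
    max_logit = max(logits)
    exps = [math.exp(l - max_logit) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy data (hypothetical): a real pipeline would obtain these vectors
# from CLIP's image encoder and text encoder.
labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]
image_embedding = [0.9, 0.1, 0.2]
text_embeddings = [
    [1.0, 0.0, 0.1],
    [0.1, 1.0, 0.0],
    [0.0, 0.2, 1.0],
]

probs = zero_shot_scores(image_embedding, text_embeddings)
best_label = labels[probs.index(max(probs))]
print(best_label)
```

Zero-shot classification then amounts to writing one text prompt per candidate class and picking the prompt whose embedding is closest to the image embedding, which is why no fine-tuning is needed.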
Quick Install
Claude Code
Recommended
npx skills add davila7/claude-code-templates -a claude-code
/plugin add https://github.com/davila7/claude-code-templates
git clone https://github.com/davila7/claude-code-templates.git ~/.claude/skills/clip
Copy and paste this command in Claude Code to install this skill
GitHub Repository
Related Skills
blip-2-vision-language
Category: Design
BLIP-2 is a vision-language framework that connects a frozen image encoder with a large language model for multimodal tasks. Use it for zero-shot image captioning, visual question answering, or image-text retrieval without task-specific fine-tuning. It's ideal for developers needing to add state-of-the-art visual understanding to LLM-based applications.
stable-diffusion-image-generation
Category: Meta
This skill enables text-to-image generation and image manipulation using Stable Diffusion via HuggingFace Diffusers. It supports image generation from prompts, image-to-image translation, inpainting, and custom pipeline creation. Developers should use it when building applications requiring AI-powered visual content generation or editing.
audiocraft-audio-generation
Category: Meta
This Claude Skill provides text-to-music and text-to-audio generation using Meta's AudioCraft PyTorch library. It enables developers to generate music from descriptions, create sound effects, and perform melody-conditioned music generation. Key capabilities include using the MusicGen and AudioGen models for controllable, high-quality stereo audio output.
whisper
Category: Other
Whisper is OpenAI's multilingual speech recognition model for transcription and translation across 99 languages. It handles tasks like speech-to-text, podcast transcription, and processing noisy or multilingual audio. Developers should use it for robust, production-ready automatic speech recognition (ASR).
