web-scraper
About
This skill scrapes web pages to HTML or Markdown, including text and images, using minimal dependencies (requests and beautifulsoup4). It is designed to download and archive content locally when given a URL. Key features include automatic image saving and optional recursive link following.
Quick install
Claude Code
Recommended
npx skills add agentbay-ai/agentbay-skills -a claude-code
/plugin add https://github.com/agentbay-ai/agentbay-skills
git clone https://github.com/agentbay-ai/agentbay-skills.git ~/.claude/skills/web-scraper
Copy and paste the command into Claude Code to install this skill.
Documentation
name: web-scraper
description: Scrape web pages and save as HTML or Markdown (with text and images). Minimal dependencies - only requests and beautifulsoup4. Use when the user provides a URL and wants to download/archive the content locally.
homepage: https://requests.readthedocs.io/
metadata: { "openclaw": { "emoji": "🕷️", "requires": { "bins": ["python3"], "env": [] }, }, }
Web Scraper
Fetch web page content (text + images) and save as HTML or Markdown locally.
Minimal dependencies: Only requires requests and beautifulsoup4 - no browser automation.
Default behavior: Downloads images to local images/ directory automatically.
Quick start
Single page
{baseDir}/scripts/scrape.py --url "https://example.com" --format html --output /tmp/page.html
{baseDir}/scripts/scrape.py --url "https://example.com" --format md --output /tmp/page.md
Recursive (follow links)
{baseDir}/scripts/scrape.py --url "https://docs.example.com" --format md --recursive --max-depth 2 --output ~/Downloads/docs-archive
Setup
Requires Python 3.8+ and minimal dependencies:
cd {baseDir}
pip install -r requirements.txt
Or install manually:
pip install requests beautifulsoup4
Note: No browser or driver needed - uses pure HTTP requests.
Inputs to collect
Single page mode
- URL: The web page to scrape (required)
- Format: html or md (default: html)
- Output path: Where to save the file (default: current directory with auto-generated name)
- Images: Downloads images by default (use --no-download-images to disable)
Recursive mode (--recursive)
- URL: Starting point for recursive scraping
- Format: html or md
- Output directory: Where to save all scraped pages
- Max depth: How many levels deep to follow links (default: 2)
- Max pages: Maximum total pages to scrape (default: 50)
- Domain filter: Whether to stay within same domain (default: yes)
- Images: Downloads images by default
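The inputs above map naturally onto command-line flags. The following is a minimal sketch of how such flags could be declared with argparse, using only the flag names and defaults stated in this document. The function name build_parser is hypothetical, and the real scrape.py may declare its options differently. The --same-domain and --timeout flags are left out because the document does not say how they are toggled or what their defaults are.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the flags documented in this README."""
    parser = argparse.ArgumentParser(prog="scrape.py")
    parser.add_argument("--url", required=True, help="Page to scrape or crawl start point")
    parser.add_argument("--format", choices=["html", "md"], default="html")
    # No default path: the script is documented as auto-generating a name.
    parser.add_argument("--output", default=None)
    parser.add_argument("--no-download-images", action="store_true")
    parser.add_argument("--recursive", action="store_true")
    parser.add_argument("--max-depth", type=int, default=2)
    parser.add_argument("--max-pages", type=int, default=50)
    parser.add_argument("--rate-limit", type=float, default=0.5)
    parser.add_argument("--no-respect-robots", action="store_true")
    return parser


args = build_parser().parse_args(["--url", "https://example.com", "--format", "md"])
print(args.format, args.max_depth, args.max_pages, args.no_download_images)
```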
Conversation Flow
- Ask user for the URL to scrape
- Ask preferred output format (HTML or Markdown)
- Note: Both formats include text and images by default
- HTML: Preserves original structure with downloaded images
- Markdown: Clean text format with downloaded images in images/ folder
- For recursive mode: Ask max depth and max pages (optional, has sensible defaults)
- Ask where to save (or suggest a default path like /tmp/ or ~/Downloads/)
- Run the script and confirm success
- Show the saved file/directory path
Examples
Single Page Scraping
Save as HTML
{baseDir}/scripts/scrape.py --url "https://docs.openclaw.ai/start/quickstart" --format html --output ~/Downloads/openclaw-quickstart.html
Save as Markdown (with images, default)
{baseDir}/scripts/scrape.py --url "https://en.wikipedia.org/wiki/Web_scraping" --format md --output ~/Documents/web-scraping.md
Result: Creates web-scraping.md + images/ folder with all downloaded images (text + images).
Without downloading images (optional)
{baseDir}/scripts/scrape.py --url "https://example.com" --format md --no-download-images
Result: Only text + image URLs (not downloaded locally).
Auto-generate filename
{baseDir}/scripts/scrape.py --url "https://example.com" --format html
# Saves to: example-com-{timestamp}.html
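One way a name like example-com-{timestamp}.html could be derived from a URL is sketched below. The helper auto_filename and the timestamp layout are assumptions made for illustration; the document only states the overall pattern, not how the script builds it.

```python
import re
from datetime import datetime
from typing import Optional
from urllib.parse import urlparse


def auto_filename(url: str, fmt: str, now: Optional[datetime] = None) -> str:
    """Build a filesystem-safe name such as example-com-<timestamp>.html."""
    parsed = urlparse(url)
    raw = parsed.netloc + parsed.path
    # Collapse every run of non-alphanumeric characters into one hyphen.
    slug = re.sub(r"[^A-Za-z0-9]+", "-", raw).strip("-").lower() or "page"
    # Assumed timestamp layout; the real script may format it differently.
    stamp = (now or datetime.now()).strftime("%Y%m%d-%H%M%S")
    return f"{slug}-{stamp}.{fmt}"


print(auto_filename("https://example.com", "html", datetime(2024, 1, 2, 3, 4, 5)))
```

Passing a fixed datetime, as in the call above, keeps the result reproducible; omitting it uses the current time.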
Recursive Scraping
Basic recursive crawl (depth 2, same domain, with images)
{baseDir}/scripts/scrape.py --url "https://docs.example.com" --format md --recursive --output ~/Downloads/docs-archive
Output structure (text + images for all pages):
docs-archive/
├── index.md
├── getting-started.md
├── api/
│ ├── authentication.md
│ └── endpoints.md
└── images/ # Shared images from all pages
├── logo.png
└── diagram.svg
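The output tree above mirrors the URL paths of the crawled pages. A minimal sketch of such a mapping follows; url_to_relative_path is a hypothetical helper, and the choice of index as the name for directory-style URLs is an assumption based on the index.md entry in the tree.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def url_to_relative_path(url: str, fmt: str) -> str:
    """Map a page URL to a file path relative to the output directory."""
    path = urlparse(url).path
    # Drop empty, "." and ".." segments so the result stays inside the output directory.
    parts = [p for p in path.split("/") if p not in ("", ".", "..")]
    if not parts or path.endswith("/"):
        parts.append("index")
    # Replace any existing extension (.html, .php, ...) with the chosen format.
    return str(PurePosixPath(*parts).with_suffix("." + fmt))


for page in (
    "https://docs.example.com/",
    "https://docs.example.com/getting-started",
    "https://docs.example.com/api/authentication",
):
    print(url_to_relative_path(page, "md"))
```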
Deep crawl with custom limits
{baseDir}/scripts/scrape.py \
--url "https://blog.example.com" \
--format html \
--recursive \
--max-depth 3 \
--max-pages 100 \
--output ~/Archives/blog-backup
Ignore robots.txt (use with caution)
{baseDir}/scripts/scrape.py \
--url "https://example.com" \
--format md \
--recursive \
--no-respect-robots \
--rate-limit 1.0
Faster scraping (reduced rate limit)
{baseDir}/scripts/scrape.py \
--url "https://yoursite.com" \
--format md \
--recursive \
--rate-limit 0.2
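The --rate-limit value is a delay in seconds between requests. As an illustration of that idea (not the script's actual code), the sketch below enforces a minimum interval between consecutive calls. The class name RateLimiter and its injectable clock and sleep parameters are assumptions added so the behavior can be checked without real waiting.

```python
import time
from typing import Callable, Optional


class RateLimiter:
    """Enforce a minimum interval between consecutive requests."""

    def __init__(self, min_interval: float,
                 clock: Callable[[], float] = time.monotonic,
                 sleep: Callable[[float], None] = time.sleep) -> None:
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last: Optional[float] = None  # Time the previous request was released.

    def wait(self) -> float:
        """Sleep just long enough to honor the interval; return seconds slept."""
        now = self._clock()
        delay = 0.0
        if self._last is not None:
            delay = max(0.0, self.min_interval - (now - self._last))
            if delay > 0:
                self._sleep(delay)
        self._last = now + delay
        return delay


limiter = RateLimiter(0.2)
for url in ("https://yoursite.com/a", "https://yoursite.com/b", "https://yoursite.com/c"):
    waited = limiter.wait()  # The first call returns immediately.
    print(f"{url}: waited {waited:.2f}s")
```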
Features
Single Page Mode
- HTML output: Preserves original page structure
- ✅ Clean, readable HTML document
- ✅ All images downloaded to images/ folder
- ✅ Suitable for offline viewing
- Markdown output: Extracts clean text content
- ✅ Auto-downloads images to local images/ directory (default)
- ✅ Converts image URLs to relative paths
- ✅ Clean, readable format for archiving
- ✅ Falls back to original URLs if download fails
- Use --no-download-images flag to keep original URLs only
- Simple and fast: Pure HTTP requests, no browser needed
- Auto filename: Generates safe filename from URL if not specified
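The Markdown behavior listed above (relative paths for downloaded images, original URLs as a fallback) can be sketched as a rewrite pass over image references. The helper rewrite_image_refs and the shape of the downloaded mapping are hypothetical; the script's real implementation is not shown in this document.

```python
import re
from typing import Dict, Optional

# Matches Markdown image syntax: ![alt](url)
IMAGE_REF = re.compile(r"!\[([^\]]*)\]\(([^)\s]+)\)")


def rewrite_image_refs(markdown: str, downloaded: Dict[str, Optional[str]]) -> str:
    """Point image references at local copies where a download succeeded.

    `downloaded` maps each original image URL to its local relative path,
    or to None when the download failed.
    """
    def replace(match: re.Match) -> str:
        alt, url = match.group(1), match.group(2)
        local = downloaded.get(url)
        # Fall back to the original URL when there is no local copy.
        return f"![{alt}]({local or url})"

    return IMAGE_REF.sub(replace, markdown)


page = "![Logo](https://example.com/logo.png) and ![Chart](https://example.com/chart.png)"
results = {
    "https://example.com/logo.png": "images/logo.png",
    "https://example.com/chart.png": None,  # Simulated failed download.
}
print(rewrite_image_refs(page, results))
```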
Recursive Mode (--recursive)
- ✅ Intelligent link discovery: Automatically follows all links on crawled pages
- ✅ Depth control: --max-depth limits how many levels deep to crawl (default: 2)
- ✅ Page limit: --max-pages caps total pages to prevent runaway crawls (default: 50)
- ✅ Domain filtering: --same-domain keeps crawl within starting domain (default: on)
- ✅ robots.txt compliance: Respects site's crawling rules by default
- ✅ Rate limiting: --rate-limit adds delay between requests (default: 0.5s)
- ✅ Smart URL filtering: Skips images, scripts, CSS, and duplicate URLs
- ✅ Progress tracking: Real-time console output with success/fail/skip counts
- ✅ Organized output: Preserves URL structure in directory hierarchy
- ✅ Efficient crawling: Sequential with rate limiting to respect servers
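The depth, page-limit, domain, and URL-filtering rules above combine into a fairly standard breadth-first crawl. The sketch below illustrates that combination under stated assumptions: the crawl function, the get_links callback (which stands in for fetching and parsing a page), and the list of skipped suffixes are all hypothetical and not taken from scrape.py.

```python
from collections import deque
from typing import Callable, Iterable, List
from urllib.parse import urldefrag, urljoin, urlparse

# Assumed set of non-page resources to skip.
SKIPPED_SUFFIXES = (".png", ".jpg", ".jpeg", ".gif", ".svg", ".css", ".js")


def crawl(start_url: str,
          get_links: Callable[[str], Iterable[str]],
          max_depth: int = 2,
          max_pages: int = 50,
          same_domain: bool = True) -> List[str]:
    """Breadth-first crawl that returns visited URLs in visit order."""
    start_host = urlparse(start_url).netloc
    seen = {start_url}
    queue = deque([(start_url, 0)])
    visited: List[str] = []

    while queue and len(visited) < max_pages:
        url, depth = queue.popleft()
        visited.append(url)
        if depth >= max_depth:
            continue  # Do not expand links beyond the depth limit.
        for href in get_links(url):
            # Resolve relative links and drop #fragments so duplicates collapse.
            link = urldefrag(urljoin(url, href)).url
            parsed = urlparse(link)
            if parsed.scheme not in ("http", "https"):
                continue
            if same_domain and parsed.netloc != start_host:
                continue
            if parsed.path.lower().endswith(SKIPPED_SUFFIXES):
                continue
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return visited


# In-memory stand-in for a site: each page maps to the links found on it.
site = {
    "https://docs.example.com/": ["/getting-started", "/logo.png", "https://other.example.org/"],
    "https://docs.example.com/getting-started": ["/api/authentication", "/#top"],
    "https://docs.example.com/api/authentication": ["/api/endpoints"],
}
for page_url in crawl("https://docs.example.com/", lambda u: site.get(u, []), max_depth=2):
    print(page_url)
```

With max_depth=2 the page linked from /api/authentication is never visited, because that page already sits at depth 2 and its links are not expanded.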
Guardrails
Single Page Mode
- Respect robots.txt and site terms of service
- Some sites may block automated access; this tool uses standard HTTP requests
- Large pages with many images may take time to download
Recursive Mode
- Start small: Test with --max-depth 1 --max-pages 10 first
- Respect robots.txt: Default is on; only use --no-respect-robots for your own sites
- Rate limiting: Default 0.5s is polite; don't go below 0.2s for public sites
- Same domain: Strongly recommended to keep --same-domain enabled
- Monitor progress: Watch for high fail rates (may indicate blocking)
- Storage: Recursive crawls can generate many files; ensure sufficient disk space
- Legal: Ensure you have permission to crawl and archive the target site
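To make the robots.txt guardrail concrete, here is a minimal sketch of how a crawler can check a URL against a site's rules using the Python standard library's urllib.robotparser. The robots.txt content, the User-Agent string, and the helper build_robot_rules are made up for illustration; the document does not describe how scrape.py performs this check.

```python
from urllib.robotparser import RobotFileParser


def build_robot_rules(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been downloaded."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return rules


# Example rules, invented for this sketch.
robots_txt = """\
User-agent: *
Disallow: /private/
Crawl-delay: 1
"""
rules = build_robot_rules(robots_txt)
agent = "example-scraper"  # Hypothetical User-Agent string.

print(rules.can_fetch(agent, "https://example.com/docs/"))
print(rules.can_fetch(agent, "https://example.com/private/report"))
print(rules.crawl_delay(agent))
```

A crawler that respects these rules would skip any URL for which can_fetch returns False and could use the Crawl-delay value, when present, as a lower bound for its own rate limit.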
Troubleshooting
- Connection errors: Check your internet connection and URL validity
- 403/blocked: Some sites block scrapers; the tool uses realistic User-Agent headers
- Timeout: Increase the --timeout value for slow-loading pages (in seconds)
- Image download fails: Images will fall back to original URLs
- Missing images: Some sites use JavaScript to load images dynamically (not supported)
