web-scraper
About
This skill scrapes web pages to HTML or Markdown with text and images using minimal dependencies (requests and beautifulsoup4). It's designed for downloading and archiving content locally when given a URL. Key features include automatic image saving and optional recursive link following.
Quick Install
Claude Code
Recommended:
npx skills add agentbay-ai/agentbay-skills -a claude-code
Alternatives:
/plugin add https://github.com/agentbay-ai/agentbay-skills
git clone https://github.com/agentbay-ai/agentbay-skills.git ~/.claude/skills/web-scraper
Copy and paste one of these commands to install this skill.
Documentation
name: web-scraper
description: Scrape web pages and save as HTML or Markdown (with text and images). Minimal dependencies - only requests and beautifulsoup4. Use when the user provides a URL and wants to download/archive the content locally.
homepage: https://requests.readthedocs.io/
metadata:
  {
    "openclaw":
      {
        "emoji": "🕷️",
        "requires": { "bins": ["python3"], "env": [] },
      },
  }
Web Scraper
Fetch web page content (text + images) and save as HTML or Markdown locally.
Minimal dependencies: Only requires requests and beautifulsoup4 - no browser automation.
Default behavior: Downloads images to local images/ directory automatically.
Quick start
Single page
{baseDir}/scripts/scrape.py --url "https://example.com" --format html --output /tmp/page.html
{baseDir}/scripts/scrape.py --url "https://example.com" --format md --output /tmp/page.md
Recursive (follow links)
{baseDir}/scripts/scrape.py --url "https://docs.example.com" --format md --recursive --max-depth 2 --output ~/Downloads/docs-archive
Setup
Requires Python 3.8+ and minimal dependencies:
cd {baseDir}
pip install -r requirements.txt
Or install manually:
pip install requests beautifulsoup4
Note: No browser or driver needed - uses pure HTTP requests.
Inputs to collect
Single page mode
- URL: The web page to scrape (required)
- Format: html or md (default: html)
- Output path: Where to save the file (default: current directory with auto-generated name)
- Images: Downloads images by default (use --no-download-images to disable)
Recursive mode (--recursive)
- URL: Starting point for recursive scraping
- Format: html or md
- Output directory: Where to save all scraped pages
- Max depth: How many levels deep to follow links (default: 2)
- Max pages: Maximum total pages to scrape (default: 50)
- Domain filter: Whether to stay within same domain (default: yes)
- Images: Downloads images by default
Conversation Flow
- Ask user for the URL to scrape
- Ask preferred output format (HTML or Markdown)
  - Note: Both formats include text and images by default
  - HTML: Preserves original structure with downloaded images
  - Markdown: Clean text format with downloaded images in images/ folder
- For recursive mode: Ask max depth and max pages (optional, has sensible defaults)
- Ask where to save (or suggest a default path like /tmp/ or ~/Downloads/)
- Run the script and confirm success
- Show the saved file/directory path
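The flow above ends in a single scrape.py invocation, so it can help to see how the collected answers might map onto flags. The following is a minimal sketch, not part of the skill itself: `build_command` and its parameter names are hypothetical, and only flags that appear elsewhere in this document are used.

```python
from typing import List, Optional


def build_command(
    script: str,
    url: str,
    fmt: str = "html",
    output: Optional[str] = None,
    recursive: bool = False,
    max_depth: Optional[int] = None,
    max_pages: Optional[int] = None,
    download_images: bool = True,
) -> List[str]:
    """Assemble a scrape.py argument list from the answers collected above.

    Hypothetical helper: the flag names come from this document, the
    function itself does not exist in the skill.
    """
    if fmt not in ("html", "md"):
        raise ValueError(f"format must be 'html' or 'md', got {fmt!r}")

    cmd = [script, "--url", url, "--format", fmt]
    if recursive:
        cmd.append("--recursive")
        # Pass limits only when the user chose them; the script has defaults.
        if max_depth is not None:
            cmd += ["--max-depth", str(max_depth)]
        if max_pages is not None:
            cmd += ["--max-pages", str(max_pages)]
    if not download_images:
        cmd.append("--no-download-images")
    if output:
        cmd += ["--output", output]
    return cmd
```

The returned list can be handed to `subprocess.run(cmd, check=True)`, which avoids shell quoting problems with URLs that contain `&` or `?`.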
Examples
Single Page Scraping
Save as HTML
{baseDir}/scripts/scrape.py --url "https://docs.openclaw.ai/start/quickstart" --format html --output ~/Downloads/openclaw-quickstart.html
Save as Markdown (with images, default)
{baseDir}/scripts/scrape.py --url "https://en.wikipedia.org/wiki/Web_scraping" --format md --output ~/Documents/web-scraping.md
Result: Creates web-scraping.md + images/ folder with all downloaded images (text + images).
Without downloading images (optional)
{baseDir}/scripts/scrape.py --url "https://example.com" --format md --no-download-images
Result: Only text + image URLs (not downloaded locally).
Auto-generate filename
{baseDir}/scripts/scrape.py --url "https://example.com" --format html
# Saves to: example-com-{timestamp}.html
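How the script derives that name is not shown here. One plausible approach, sketched below under assumptions, is to slugify the host and path and append a timestamp. The function name `auto_filename` and the timestamp format are hypothetical, not taken from scrape.py.

```python
import re
import time
from typing import Optional
from urllib.parse import urlparse


def auto_filename(url: str, fmt: str, timestamp: Optional[str] = None) -> str:
    """Build a filesystem-safe name such as example-com-<timestamp>.html."""
    parsed = urlparse(url)
    raw = parsed.netloc + parsed.path
    # Collapse every run of non-alphanumeric characters into one hyphen.
    slug = re.sub(r"[^A-Za-z0-9]+", "-", raw).strip("-").lower() or "page"
    if timestamp is None:
        # Assumed format; the real script may use a different one.
        timestamp = time.strftime("%Y%m%d-%H%M%S")
    return f"{slug}-{timestamp}.{fmt}"
```

Passing `timestamp` explicitly keeps the function deterministic, which is convenient for testing.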
Recursive Scraping
Basic recursive crawl (depth 2, same domain, with images)
{baseDir}/scripts/scrape.py --url "https://docs.example.com" --format md --recursive --output ~/Downloads/docs-archive
Output structure (text + images for all pages):
docs-archive/
├── index.md
├── getting-started.md
├── api/
│   ├── authentication.md
│   └── endpoints.md
└── images/              # Shared images from all pages
    ├── logo.png
    └── diagram.svg
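A layout like the one above implies a mapping from each page URL to a relative file path. The sketch below shows one way such a mapping could work; `url_to_relative_path` is a hypothetical name, and the choices to use index as the name for directory-style URLs and to strip .html/.htm are assumptions rather than documented behavior.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def url_to_relative_path(url: str, fmt: str) -> str:
    """Map a page URL to a relative output path that mirrors the URL path."""
    path = urlparse(url).path
    # Ignore empty segments and dot segments so the result stays inside
    # the output directory.
    parts = [p for p in path.split("/") if p not in ("", ".", "..")]
    if not parts or path.endswith("/"):
        parts.append("index")  # assumed name for directory-style URLs

    name = parts[-1]
    for ext in (".html", ".htm"):
        if name.lower().endswith(ext):
            name = name[: -len(ext)]
            break
    parts[-1] = f"{name}.{fmt}"
    return str(PurePosixPath(*parts))
```

For example, `https://docs.example.com/api/authentication` maps to `api/authentication.md` under these assumptions.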
Deep crawl with custom limits
{baseDir}/scripts/scrape.py \
--url "https://blog.example.com" \
--format html \
--recursive \
--max-depth 3 \
--max-pages 100 \
--output ~/Archives/blog-backup
Ignore robots.txt (use with caution)
{baseDir}/scripts/scrape.py \
--url "https://example.com" \
--format md \
--recursive \
--no-respect-robots \
--rate-limit 1.0
Faster scraping (reduced rate limit)
{baseDir}/scripts/scrape.py \
--url "https://yoursite.com" \
--format md \
--recursive \
--rate-limit 0.2
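The --rate-limit value is described as a delay between requests. A minimal sketch of that idea follows; the `RateLimiter` class is hypothetical and is not the script's actual implementation. The clock and sleep functions are injectable so the behavior can be checked without real waiting.

```python
import time
from typing import Callable, Optional


class RateLimiter:
    """Keep at least `delay` seconds between consecutive calls to wait()."""

    def __init__(
        self,
        delay: float = 0.5,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.delay = delay
        self._clock = clock
        self._sleep = sleep
        self._last: Optional[float] = None

    def wait(self) -> float:
        """Sleep only for the remaining part of the delay; return time slept."""
        slept = 0.0
        if self._last is not None:
            remaining = self.delay - (self._clock() - self._last)
            if remaining > 0:
                self._sleep(remaining)
                slept = remaining
        self._last = self._clock()
        return slept
```

Calling `limiter.wait()` before each request means time already spent downloading or parsing counts toward the delay, so the crawl is not slowed more than necessary.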
Features
Single Page Mode
- HTML output: Preserves original page structure
  - ✅ Clean, readable HTML document
  - ✅ All images downloaded to images/ folder
  - ✅ Suitable for offline viewing
- Markdown output: Extracts clean text content
  - ✅ Auto-downloads images to local images/ directory (default)
  - ✅ Converts image URLs to relative paths
  - ✅ Clean, readable format for archiving
  - ✅ Fallback to original URLs if download fails
  - Use --no-download-images flag to keep original URLs only
- Simple and fast: Pure HTTP requests, no browser needed
- Auto filename: Generates safe filename from URL if not specified
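The Markdown behavior listed above (relative paths for downloaded images, original URLs as a fallback) can be illustrated with a small rewrite step. This is a sketch under assumptions: `rewrite_image_links` and the `downloaded` mapping are hypothetical, and the regular expression handles only simple `![alt](url)` image syntax.

```python
import re
from typing import Dict

# Matches simple Markdown images: ![alt text](url)
IMAGE_PATTERN = re.compile(r"!\[([^\]]*)\]\(([^)\s]+)\)")


def rewrite_image_links(markdown: str, downloaded: Dict[str, str]) -> str:
    """Point image links at local copies, keeping the URL when none exists.

    `downloaded` maps an original image URL to its saved relative path,
    e.g. {"https://example.com/logo.png": "images/logo.png"}.
    """

    def swap(match: "re.Match[str]") -> str:
        alt, url = match.group(1), match.group(2)
        # Fall back to the original URL if the download failed or was skipped.
        return f"![{alt}]({downloaded.get(url, url)})"

    return IMAGE_PATTERN.sub(swap, markdown)
```

With an empty `downloaded` mapping the text is returned unchanged, which corresponds to the --no-download-images case.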
Recursive Mode (--recursive)
- ✅ Intelligent link discovery: Automatically follows all links on crawled pages
- ✅ Depth control: --max-depth limits how many levels deep to crawl (default: 2)
- ✅ Page limit: --max-pages caps total pages to prevent runaway crawls (default: 50)
- ✅ Domain filtering: --same-domain keeps crawl within starting domain (default: on)
- ✅ robots.txt compliance: Respects site's crawling rules by default
- ✅ Rate limiting: --rate-limit adds delay between requests (default: 0.5s)
- ✅ Smart URL filtering: Skips images, scripts, CSS, and duplicate URLs
- ✅ Progress tracking: Real-time console output with success/fail/skip counts
- ✅ Organized output: Preserves URL structure in directory hierarchy
- ✅ Efficient crawling: Sequential with rate limiting to respect servers
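The depth, page, domain, and duplicate rules above combine naturally in a breadth-first traversal. The sketch below shows that combination on an in-memory link graph, so it runs without network access. `crawl_order`, `get_links`, and `SKIP_EXTENSIONS` are hypothetical names, and treating the start page as depth 0 is an assumption about how --max-depth is counted.

```python
from collections import deque
from typing import Callable, Iterable, List
from urllib.parse import urldefrag, urlparse

# Assumed list of non-page resources to skip.
SKIP_EXTENSIONS = (".png", ".jpg", ".jpeg", ".gif", ".svg", ".css", ".js")


def crawl_order(
    start: str,
    get_links: Callable[[str], Iterable[str]],
    max_depth: int = 2,
    max_pages: int = 50,
    same_domain: bool = True,
) -> List[str]:
    """Return the pages a sequential breadth-first crawl would visit."""
    start = urldefrag(start)[0]  # drop any #fragment
    start_host = urlparse(start).netloc
    seen = {start}
    queue = deque([(start, 0)])
    visited: List[str] = []

    while queue and len(visited) < max_pages:
        url, depth = queue.popleft()
        visited.append(url)
        if depth >= max_depth:
            continue  # page is saved, but its links are not followed
        for link in get_links(url):
            link = urldefrag(link)[0]
            if link in seen:
                continue  # duplicate URL
            if same_domain and urlparse(link).netloc != start_host:
                continue  # outside the starting domain
            if urlparse(link).path.lower().endswith(SKIP_EXTENSIONS):
                continue  # image, script, or stylesheet
            seen.add(link)
            queue.append((link, depth + 1))
    return visited
```

In a real crawl, `get_links` would fetch the page and extract its anchor targets; here it is left as a parameter so the traversal logic can be examined on its own.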
Guardrails
Single Page Mode
- Respect robots.txt and site terms of service
- Some sites may block automated access; this tool uses standard HTTP requests
- Large pages with many images may take time to download
Recursive Mode
- Start small: Test with --max-depth 1 --max-pages 10 first
- Respect robots.txt: Default is on; only use --no-respect-robots for your own sites
- Rate limiting: Default 0.5s is polite; don't go below 0.2s for public sites
- Same domain: Strongly recommended to keep --same-domain enabled
- Monitor progress: Watch for high fail rates (may indicate blocking)
- Storage: Recursive crawls can generate many files; ensure sufficient disk space
- Legal: Ensure you have permission to crawl and archive the target site
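To make the robots.txt guardrail concrete, the sketch below checks URLs against a set of rules using Python's standard `urllib.robotparser` module. Whether scrape.py uses this module is not stated in this document; `make_robots_checker` is a hypothetical helper, and the rules are passed in as lines so the example needs no network access.

```python
from typing import Callable, Iterable
from urllib.robotparser import RobotFileParser


def make_robots_checker(
    robots_lines: Iterable[str], user_agent: str = "*"
) -> Callable[[str], bool]:
    """Return a function that reports whether a URL may be fetched."""
    parser = RobotFileParser()
    parser.parse(robots_lines)  # parse rules already held in memory

    def allowed(url: str) -> bool:
        return parser.can_fetch(user_agent, url)

    return allowed


# Example rules: everything is allowed except the /private/ subtree.
allowed = make_robots_checker(["User-agent: *", "Disallow: /private/"])
```

For a live site, `RobotFileParser` can instead be pointed at the site's robots.txt with `set_url(...)` followed by `read()`.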
Troubleshooting
- Connection errors: Check your internet connection and URL validity
- 403/blocked: Some sites block scrapers; the tool uses realistic User-Agent headers
- Timeout: Increase --timeout flag for slow-loading pages (value in seconds)
- Image download fails: Images will fall back to original URLs
- Missing images: Some sites use JavaScript to load images dynamically (not supported)
