web-scraper
About
This skill scrapes web pages to HTML or Markdown with text and images using minimal dependencies (requests and beautifulsoup4). It's designed for downloading and archiving content locally when given a URL. Key features include automatic image saving and optional recursive link following.
Quick install
Claude Code
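Recommended:
npx skills add agentbay-ai/agentbay-skills -a claude-code
Alternatively, add it as a plugin:
/plugin add https://github.com/agentbay-ai/agentbay-skills
Or clone the repository:
git clone https://github.com/agentbay-ai/agentbay-skills.git ~/.claude/skills/web-scraper
Copy and paste one of these commands to install the skill in Claude Code.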
Skill documentation
name: web-scraper
description: Scrape web pages and save as HTML or Markdown (with text and images). Minimal dependencies - only requests and beautifulsoup4. Use when the user provides a URL and wants to download/archive the content locally.
homepage: https://requests.readthedocs.io/
metadata:
  {
    "openclaw":
      {
        "emoji": "🕷️",
        "requires": { "bins": ["python3"], "env": [] },
      },
  }
Web Scraper
Fetch web page content (text + images) and save as HTML or Markdown locally.
Minimal dependencies: Only requires requests and beautifulsoup4 - no browser automation.
Default behavior: Downloads images to local images/ directory automatically.
Quick start
Single page
{baseDir}/scripts/scrape.py --url "https://example.com" --format html --output /tmp/page.html
{baseDir}/scripts/scrape.py --url "https://example.com" --format md --output /tmp/page.md
Recursive (follow links)
{baseDir}/scripts/scrape.py --url "https://docs.example.com" --format md --recursive --max-depth 2 --output ~/Downloads/docs-archive
Setup
Requires Python 3.8+ and minimal dependencies:
cd {baseDir}
pip install -r requirements.txt
Or install manually:
pip install requests beautifulsoup4
Note: No browser or driver needed - uses pure HTTP requests.
Inputs to collect
Single page mode
- URL: The web page to scrape (required)
- Format: html or md (default: html)
- Output path: Where to save the file (default: current directory with auto-generated name)
- Images: Downloads images by default (use --no-download-images to disable)
Recursive mode (--recursive)
- URL: Starting point for recursive scraping
- Format: html or md
- Output directory: Where to save all scraped pages
- Max depth: How many levels deep to follow links (default: 2)
- Max pages: Maximum total pages to scrape (default: 50)
- Domain filter: Whether to stay within same domain (default: yes)
- Images: Downloads images by default
Conversation Flow
- Ask user for the URL to scrape
- Ask preferred output format (HTML or Markdown)
  - Note: Both formats include text and images by default
  - HTML: Preserves original structure with downloaded images
  - Markdown: Clean text format with downloaded images in images/ folder
- For recursive mode: Ask max depth and max pages (optional, has sensible defaults)
- Ask where to save (or suggest a default path like /tmp/ or ~/Downloads/)
- Run the script and confirm success
- Show the saved file/directory path
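Once the inputs have been collected, they can be turned into a command line. The sketch below is illustrative only: `build_command` is a hypothetical helper, not part of the skill, and it uses only flags that appear in this document. Whether every flag combination is accepted by scrape.py is an assumption.

```python
import shlex


def build_command(url, fmt="html", output=None, recursive=False,
                  max_depth=None, max_pages=None, download_images=True,
                  script="{baseDir}/scripts/scrape.py"):
    """Assemble the scrape.py argument list from the collected inputs.

    Hypothetical helper: only flags documented in this skill are emitted.
    """
    if fmt not in ("html", "md"):
        raise ValueError("format must be 'html' or 'md'")
    args = [script, "--url", url, "--format", fmt]
    if recursive:
        args.append("--recursive")
        # Leave the limits out when the user accepts the defaults.
        if max_depth is not None:
            args += ["--max-depth", str(max_depth)]
        if max_pages is not None:
            args += ["--max-pages", str(max_pages)]
    if not download_images:
        args.append("--no-download-images")
    if output:
        args += ["--output", output]
    return args


command = build_command("https://docs.example.com", "md", "/tmp/docs",
                        recursive=True, max_depth=2)
# shlex.join quotes each argument so the line can be pasted into a shell.
print(shlex.join(command))
```

Returning a list rather than a string keeps the arguments safe to pass to `subprocess.run` without shell quoting concerns.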
Examples
Single Page Scraping
Save as HTML
{baseDir}/scripts/scrape.py --url "https://docs.openclaw.ai/start/quickstart" --format html --output ~/Downloads/openclaw-quickstart.html
Save as Markdown (with images, default)
{baseDir}/scripts/scrape.py --url "https://en.wikipedia.org/wiki/Web_scraping" --format md --output ~/Documents/web-scraping.md
Result: Creates web-scraping.md + images/ folder with all downloaded images (text + images).
Without downloading images (optional)
{baseDir}/scripts/scrape.py --url "https://example.com" --format md --no-download-images
Result: Only text + image URLs (not downloaded locally).
Auto-generate filename
{baseDir}/scripts/scrape.py --url "https://example.com" --format html
# Saves to: example-com-{timestamp}.html
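A minimal sketch of how such a name could be derived from a URL follows. The helper name `auto_filename`, the slug rules, and the timestamp layout are assumptions made for illustration; the actual logic in scrape.py may differ.

```python
import re
from datetime import datetime
from urllib.parse import urlparse


def auto_filename(url, fmt="html", now=None):
    """Build a filesystem-safe name such as example-com-20240101-120000.html.

    Hypothetical helper: slug rules and timestamp layout are assumptions.
    """
    parsed = urlparse(url)
    # Combine host and path, then collapse anything unsafe into single hyphens.
    raw = parsed.netloc + parsed.path
    slug = re.sub(r"[^A-Za-z0-9]+", "-", raw).strip("-").lower() or "page"
    stamp = (now or datetime.now()).strftime("%Y%m%d-%H%M%S")
    return f"{slug}-{stamp}.{fmt}"


print(auto_filename("https://example.com", "html", datetime(2024, 1, 1, 12, 0, 0)))
# example-com-20240101-120000.html
```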
Recursive Scraping
Basic recursive crawl (depth 2, same domain, with images)
{baseDir}/scripts/scrape.py --url "https://docs.example.com" --format md --recursive --output ~/Downloads/docs-archive
Output structure (text + images for all pages):
docs-archive/
├── index.md
├── getting-started.md
├── api/
│   ├── authentication.md
│   └── endpoints.md
└── images/          # Shared images from all pages
    ├── logo.png
    └── diagram.svg
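The layout above mirrors the URL path of each page. One way to compute such a mapping is sketched below; `url_to_relative_path` is a hypothetical name, and the rules (an index file for directory-style URLs, stripping a trailing .html or .htm) are assumptions rather than the script's documented behavior.

```python
from urllib.parse import urlparse


def url_to_relative_path(url, fmt="md"):
    """Map a page URL to a relative output path that mirrors the URL path."""
    path = urlparse(url).path
    # Ignore empty segments and anything that could escape the output folder.
    parts = [p for p in path.split("/") if p not in ("", ".", "..")]
    if not parts or path.endswith("/"):
        # Directory-style URLs get an index file.
        parts.append("index")
    else:
        # Drop a trailing .html/.htm so the new extension is not doubled.
        for ext in (".html", ".htm"):
            if parts[-1].lower().endswith(ext):
                parts[-1] = parts[-1][: -len(ext)]
                break
    parts[-1] += "." + fmt
    return "/".join(parts)


for page in ("https://docs.example.com/",
             "https://docs.example.com/getting-started",
             "https://docs.example.com/api/authentication.html"):
    print(url_to_relative_path(page))
# index.md
# getting-started.md
# api/authentication.md
```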
Deep crawl with custom limits
{baseDir}/scripts/scrape.py \
--url "https://blog.example.com" \
--format html \
--recursive \
--max-depth 3 \
--max-pages 100 \
--output ~/Archives/blog-backup
Ignore robots.txt (use with caution)
{baseDir}/scripts/scrape.py \
--url "https://example.com" \
--format md \
--recursive \
--no-respect-robots \
--rate-limit 1.0
Faster scraping (reduced rate limit)
{baseDir}/scripts/scrape.py \
--url "https://yoursite.com" \
--format md \
--recursive \
--rate-limit 0.2
Features
Single Page Mode
- HTML output: Preserves original page structure
  - ✅ Clean, readable HTML document
  - ✅ All images downloaded to images/ folder
  - ✅ Suitable for offline viewing
- Markdown output: Extracts clean text content
  - ✅ Auto-downloads images to local images/ directory (default)
  - ✅ Converts image URLs to relative paths
  - ✅ Clean, readable format for archiving
  - ✅ Fallback to original URLs if download fails
  - Use --no-download-images flag to keep original URLs only
- Simple and fast: Pure HTTP requests, no browser needed
- Auto filename: Generates safe filename from URL if not specified
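The conversion of image URLs to relative paths, with a fallback to the original URL, can be illustrated as follows. This is a sketch using only the standard library; the `localize_images` helper and the `downloaded` mapping are assumptions, and the skill itself may do this rewriting differently (for example on the parsed HTML).

```python
import re

# Matches Markdown image syntax: ![alt text](url)
IMAGE_PATTERN = re.compile(r"!\[([^\]]*)\]\(([^)\s]+)\)")


def localize_images(markdown, downloaded):
    """Rewrite Markdown image links to local paths where a download succeeded.

    `downloaded` maps original image URLs to relative paths; URLs missing
    from it (failed downloads) are left unchanged as the fallback.
    """
    def swap(match):
        alt, url = match.group(1), match.group(2)
        return f"![{alt}]({downloaded.get(url, url)})"

    return IMAGE_PATTERN.sub(swap, markdown)


text = "![Logo](https://example.com/logo.png) and ![Chart](https://example.com/chart.svg)"
print(localize_images(text, {"https://example.com/logo.png": "images/logo.png"}))
# ![Logo](images/logo.png) and ![Chart](https://example.com/chart.svg)
```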
Recursive Mode (--recursive)
- ✅ Intelligent link discovery: Automatically follows all links on crawled pages
- ✅ Depth control: --max-depth limits how many levels deep to crawl (default: 2)
- ✅ Page limit: --max-pages caps total pages to prevent runaway crawls (default: 50)
- ✅ Domain filtering: --same-domain keeps crawl within starting domain (default: on)
- ✅ robots.txt compliance: Respects site's crawling rules by default
- ✅ Rate limiting: --rate-limit adds delay between requests (default: 0.5s)
- ✅ Smart URL filtering: Skips images, scripts, CSS, and duplicate URLs
- ✅ Progress tracking: Real-time console output with success/fail/skip counts
- ✅ Organized output: Preserves URL structure in directory hierarchy
- ✅ Efficient crawling: Sequential with rate limiting to respect servers
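How depth control, the page limit, domain filtering, and URL filtering interact is easiest to see in a breadth-first traversal. The sketch below is an assumption about how such a crawl could be structured, not the skill's actual code: `crawl_order`, `get_links`, and the skipped-suffix list are hypothetical, and fetching is delegated to a caller-supplied function so the logic runs offline.

```python
from collections import deque
from urllib.parse import urldefrag, urljoin, urlparse

# Assumed list of asset extensions that are never queued as pages.
SKIPPED_SUFFIXES = (".png", ".jpg", ".jpeg", ".gif", ".svg", ".css", ".js")


def crawl_order(start_url, get_links, max_depth=2, max_pages=50, same_domain=True):
    """Return the URLs a breadth-first crawl would visit, in order.

    `get_links(url)` is a caller-supplied function returning the hrefs found
    on a page; a real crawler would fetch and parse the page there.
    """
    start_host = urlparse(start_url).netloc
    seen = {start_url}
    queue = deque([(start_url, 0)])
    visited = []
    while queue and len(visited) < max_pages:
        url, depth = queue.popleft()
        visited.append(url)
        if depth >= max_depth:
            continue  # Do not expand links beyond the depth limit.
        for href in get_links(url):
            # Resolve relative links and drop #fragments to avoid duplicates.
            link = urldefrag(urljoin(url, href)).url
            parsed = urlparse(link)
            if parsed.scheme not in ("http", "https"):
                continue
            if same_domain and parsed.netloc != start_host:
                continue
            if parsed.path.lower().endswith(SKIPPED_SUFFIXES):
                continue
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return visited


# A tiny in-memory "site" standing in for real HTTP responses.
site = {
    "https://docs.example.com/": ["/a", "/b#top", "/logo.png", "https://other.com/x"],
    "https://docs.example.com/a": ["/c", "/"],
    "https://docs.example.com/c": ["/d"],
}
print(crawl_order("https://docs.example.com/", lambda u: site.get(u, [])))
```

With the default depth of 2, the page at /d is discovered on a depth-2 page and therefore never queued.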
Guardrails
Single Page Mode
- Respect robots.txt and site terms of service
- Some sites may block automated access; this tool uses standard HTTP requests
- Large pages with many images may take time to download
Recursive Mode
- Start small: Test with --max-depth 1 --max-pages 10 first
- Respect robots.txt: Default is on; only use --no-respect-robots for your own sites
- Rate limiting: Default 0.5s is polite; don't go below 0.2s for public sites
- Same domain: Strongly recommended to keep --same-domain enabled
- Monitor progress: Watch for high fail rates (may indicate blocking)
- Storage: Recursive crawls can generate many files; ensure sufficient disk space
- Legal: Ensure you have permission to crawl and archive the target site
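The robots.txt and rate-limiting guardrails can be approximated with the standard library alone. The following is a sketch under stated assumptions: `build_robots` and `RateLimiter` are hypothetical names, the robots.txt content is passed in as already-fetched lines, and the skill's own implementation is not shown in this document.

```python
import time
from urllib.robotparser import RobotFileParser


def build_robots(lines):
    """Parse robots.txt content that has already been fetched."""
    parser = RobotFileParser()
    parser.parse(lines)
    return parser


class RateLimiter:
    """Sleep so consecutive calls are at least `delay` seconds apart."""

    def __init__(self, delay=0.5):
        self.delay = delay
        self._last = None

    def wait(self):
        now = time.monotonic()
        if self._last is not None:
            remaining = self.delay - (now - self._last)
            if remaining > 0:
                time.sleep(remaining)
        self._last = time.monotonic()


robots = build_robots(["User-agent: *", "Disallow: /private/"])
limiter = RateLimiter(delay=0.5)
for url in ["https://example.com/docs", "https://example.com/private/x"]:
    if not robots.can_fetch("*", url):
        print("skip", url)  # Disallowed by robots.txt.
        continue
    limiter.wait()  # The first call returns immediately; later ones pause.
    print("fetch", url)
```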
Troubleshooting
- Connection errors: Check your internet connection and URL validity
- 403/blocked: Some sites block scrapers; the tool uses realistic User-Agent headers
- Timeout: Increase the --timeout value for slow-loading pages (in seconds)
- Image download fails: Images will fall back to original URLs
- Missing images: Some sites use JavaScript to load images dynamically (not supported)
