
web-scraper

agentbay-ai
Updated 2 days ago

About

This skill scrapes web pages to HTML or Markdown with text and images using minimal dependencies (requests and beautifulsoup4). It's designed for downloading and archiving content locally when given a URL. Key features include automatic image saving and optional recursive link following.

Quick install

Claude Code

Recommended (primary method):
npx skills add agentbay-ai/agentbay-skills -a claude-code
Alternative: plugin command
/plugin add https://github.com/agentbay-ai/agentbay-skills
Alternative: Git clone
git clone https://github.com/agentbay-ai/agentbay-skills.git ~/.claude/skills/web-scraper

Copy and paste one of these commands into Claude Code to install the skill.

Skill documentation


name: web-scraper
description: Scrape web pages and save as HTML or Markdown (with text and images). Minimal dependencies - only requests and beautifulsoup4. Use when the user provides a URL and wants to download/archive the content locally.
homepage: https://requests.readthedocs.io/
metadata:
  {
    "openclaw":
      {
        "emoji": "🕷️",
        "requires": { "bins": ["python3"], "env": [] },
      },
  }

Web Scraper

Fetch web page content (text + images) and save as HTML or Markdown locally.

Minimal dependencies: Only requires requests and beautifulsoup4 - no browser automation.

Default behavior: Downloads images to local images/ directory automatically.

Quick start

Single page

{baseDir}/scripts/scrape.py --url "https://example.com" --format html --output /tmp/page.html
{baseDir}/scripts/scrape.py --url "https://example.com" --format md --output /tmp/page.md

Recursive (follow links)

{baseDir}/scripts/scrape.py --url "https://docs.example.com" --format md --recursive --max-depth 2 --output ~/Downloads/docs-archive

Setup

Requires Python 3.8+ and minimal dependencies:

cd {baseDir}
pip install -r requirements.txt

Or install manually:

pip install requests beautifulsoup4

Note: No browser or driver needed - uses pure HTTP requests.
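
Before running the scraper, it can help to confirm that both dependencies are importable. The following is a minimal sketch, not part of the skill itself; the helper name `missing_packages` is hypothetical. It relies on the fact that the `beautifulsoup4` package is imported as `bs4`.

```python
import importlib.util

# Map pip package names to the module names they install.
# "beautifulsoup4" is imported as "bs4".
REQUIRED = {"requests": "requests", "beautifulsoup4": "bs4"}


def missing_packages(required=REQUIRED):
    """Return the pip names of packages whose modules cannot be found."""
    return [
        pip_name
        for pip_name, module_name in required.items()
        if importlib.util.find_spec(module_name) is None
    ]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing packages; install with: pip install " + " ".join(missing))
    else:
        print("All dependencies are available.")
```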

Inputs to collect

Single page mode

  • URL: The web page to scrape (required)
  • Format: html or md (default: html)
  • Output path: Where to save the file (default: current directory with auto-generated name)
  • Images: Downloads images by default (use --no-download-images to disable)

Recursive mode (--recursive)

  • URL: Starting point for recursive scraping
  • Format: html or md
  • Output directory: Where to save all scraped pages
  • Max depth: How many levels deep to follow links (default: 2)
  • Max pages: Maximum total pages to scrape (default: 50)
  • Domain filter: Whether to stay within same domain (default: yes)
  • Images: Downloads images by default

Conversation Flow

  1. Ask user for the URL to scrape
  2. Ask preferred output format (HTML or Markdown)
    • Note: Both formats include text and images by default
    • HTML: Preserves original structure with downloaded images
    • Markdown: Clean text format with downloaded images in images/ folder
  3. For recursive mode: Ask max depth and max pages (optional, has sensible defaults)
  4. Ask where to save (or suggest a default path like /tmp/ or ~/Downloads/)
  5. Run the script and confirm success
  6. Show the saved file/directory path
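
The inputs gathered in steps 1 to 4 map directly onto the command-line flags used throughout this document. The sketch below shows one way to assemble the argument list; `build_command` is a hypothetical helper, and only flags that appear in this document are used.

```python
def build_command(base_dir, url, fmt="html", output=None, recursive=False,
                  max_depth=None, max_pages=None, download_images=True):
    """Build the scrape.py argument list from the collected inputs."""
    if fmt not in ("html", "md"):
        raise ValueError("format must be 'html' or 'md'")

    cmd = [f"{base_dir}/scripts/scrape.py", "--url", url, "--format", fmt]

    if recursive:
        cmd.append("--recursive")
        # Omit the limits when not given so the script's defaults apply.
        if max_depth is not None:
            cmd += ["--max-depth", str(max_depth)]
        if max_pages is not None:
            cmd += ["--max-pages", str(max_pages)]

    if not download_images:
        cmd.append("--no-download-images")

    # Without --output, the script picks an auto-generated name.
    if output is not None:
        cmd += ["--output", output]

    return cmd
```

The resulting list can be passed to `subprocess.run(cmd, check=True)`. Note that a list-form command is not processed by a shell, so a path such as `~/Downloads/page.html` should be expanded with `os.path.expanduser` first.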

Examples

Single Page Scraping

Save as HTML

{baseDir}/scripts/scrape.py --url "https://docs.openclaw.ai/start/quickstart" --format html --output ~/Downloads/openclaw-quickstart.html

Save as Markdown (with images, default)

{baseDir}/scripts/scrape.py --url "https://en.wikipedia.org/wiki/Web_scraping" --format md --output ~/Documents/web-scraping.md

Result: Creates web-scraping.md plus an images/ folder containing the downloaded images.

Without downloading images (optional)

{baseDir}/scripts/scrape.py --url "https://example.com" --format md --no-download-images

Result: Text only; images remain links to their original URLs and nothing is downloaded locally.

Auto-generate filename

{baseDir}/scripts/scrape.py --url "https://example.com" --format html
# Saves to: example-com-{timestamp}.html
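
A name like `example-com-{timestamp}.html` can be derived from the URL's host and path. The sketch below is an illustration only: the function name `auto_filename` and the timestamp format are assumptions, since the source shows just the `{timestamp}` placeholder.

```python
import re
from datetime import datetime
from urllib.parse import urlparse


def auto_filename(url, fmt="html", now=None):
    """Derive a filesystem-safe filename from a URL plus a timestamp."""
    parsed = urlparse(url)
    raw = parsed.netloc + parsed.path
    # Collapse every run of non-alphanumeric characters into one hyphen.
    slug = re.sub(r"[^A-Za-z0-9]+", "-", raw).strip("-").lower() or "page"
    # Assumed timestamp layout; the script's real format may differ.
    stamp = (now or datetime.now()).strftime("%Y%m%d-%H%M%S")
    return f"{slug}-{stamp}.{fmt}"


print(auto_filename("https://example.com", "html", datetime(2024, 1, 2, 3, 4, 5)))
# prints: example-com-20240102-030405.html
```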

Recursive Scraping

Basic recursive crawl (depth 2, same domain, with images)

{baseDir}/scripts/scrape.py --url "https://docs.example.com" --format md --recursive --output ~/Downloads/docs-archive

Output structure (text + images for all pages):

docs-archive/
├── index.md
├── getting-started.md
├── api/
│   ├── authentication.md
│   └── endpoints.md
└── images/              # Shared images from all pages
    ├── logo.png
    └── diagram.svg
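
The layout above suggests that each page's URL path becomes its path inside the output directory. The following sketch shows one plausible mapping; `url_to_relpath` is a hypothetical helper, and details such as using `index` for directory-style URLs are assumptions rather than documented behavior.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def url_to_relpath(url, fmt="md"):
    """Map a page URL to a relative output path that mirrors the URL path."""
    path = urlparse(url).path
    # Drop empty, "." and ".." segments so the result stays inside the output directory.
    parts = [p for p in path.split("/") if p not in ("", ".", "..")]
    if not parts or path.endswith("/"):
        # The site root and directory-style URLs are saved as an index file.
        parts.append("index")
    else:
        # Replace an existing extension such as ".html" with the target format.
        parts[-1] = PurePosixPath(parts[-1]).stem
    return "/".join(parts) + "." + fmt


print(url_to_relpath("https://docs.example.com"))                     # index.md
print(url_to_relpath("https://docs.example.com/api/authentication"))  # api/authentication.md
```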

Deep crawl with custom limits

{baseDir}/scripts/scrape.py \
  --url "https://blog.example.com" \
  --format html \
  --recursive \
  --max-depth 3 \
  --max-pages 100 \
  --output ~/Archives/blog-backup

Ignore robots.txt (use with caution)

{baseDir}/scripts/scrape.py \
  --url "https://example.com" \
  --format md \
  --recursive \
  --no-respect-robots \
  --rate-limit 1.0

Faster scraping (shorter delay between requests)

{baseDir}/scripts/scrape.py \
  --url "https://yoursite.com" \
  --format md \
  --recursive \
  --rate-limit 0.2

Features

Single Page Mode

  • HTML output: Preserves original page structure
    • ✅ Clean, readable HTML document
    • ✅ All images downloaded to images/ folder
    • ✅ Suitable for offline viewing
  • Markdown output: Extracts clean text content
    • Auto-downloads images to local images/ directory (default)
    • ✅ Converts image URLs to relative paths
    • ✅ Clean, readable format for archiving
    • ✅ Fallback to original URLs if download fails
    • Use --no-download-images flag to keep original URLs only
  • Simple and fast: Pure HTTP requests, no browser needed
  • Auto filename: Generates safe filename from URL if not specified

Recursive Mode (--recursive)

  • ✅ Link discovery: Follows the links found on each crawled page, subject to the depth, page, and domain limits below
  • ✅ Depth control: --max-depth limits how many levels deep to crawl (default: 2)
  • ✅ Page limit: --max-pages caps total pages to prevent runaway crawls (default: 50)
  • ✅ Domain filtering: --same-domain keeps crawl within starting domain (default: on)
  • ✅ robots.txt compliance: Respects site's crawling rules by default
  • ✅ Rate limiting: --rate-limit adds delay between requests (default: 0.5s)
  • ✅ Smart URL filtering: Skips images, scripts, CSS, and duplicate URLs
  • ✅ Progress tracking: Real-time console output with success/fail/skip counts
  • ✅ Organized output: Preserves URL structure in directory hierarchy
  • ✅ Efficient crawling: Sequential with rate limiting to respect servers
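
The depth, page, and domain limits listed above combine naturally in a breadth-first traversal. The sketch below illustrates that idea and is not the script's implementation; `crawl` and the `get_links` callback are hypothetical, the start page is treated as depth 0, and rate limiting and robots.txt checks are left out for brevity.

```python
from collections import deque
from urllib.parse import urldefrag, urljoin, urlparse


def crawl(start_url, get_links, max_depth=2, max_pages=50, same_domain=True):
    """Breadth-first crawl that returns the visited URLs in visit order.

    `get_links(url)` is supplied by the caller; it fetches one page and
    returns the href values found on it.
    """
    start_host = urlparse(start_url).netloc
    queue = deque([(start_url, 0)])
    seen = {start_url}  # prevents queuing the same URL twice
    visited = []

    while queue and len(visited) < max_pages:
        url, depth = queue.popleft()
        visited.append(url)
        if depth >= max_depth:
            continue  # do not follow links beyond the depth limit
        for href in get_links(url):
            # Resolve relative links and drop "#fragment" suffixes.
            link = urldefrag(urljoin(url, href)).url
            parsed = urlparse(link)
            if parsed.scheme not in ("http", "https"):
                continue  # skip mailto:, javascript:, and similar
            if same_domain and parsed.netloc != start_host:
                continue
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))

    return visited
```

Breadth-first order means pages closest to the starting URL are saved first, so a low `max_pages` value still captures the top of the site.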

Guardrails

Single Page Mode

  • Respect robots.txt and site terms of service
  • Some sites may block automated access; this tool uses standard HTTP requests
  • Large pages with many images may take time to download

Recursive Mode

  • Start small: Test with --max-depth 1 --max-pages 10 first
  • Respect robots.txt: Default is on; only use --no-respect-robots for your own sites
  • Rate limiting: Default 0.5s is polite; don't go below 0.2s for public sites
  • Same domain: Strongly recommended to keep --same-domain enabled
  • Monitor progress: Watch for high fail rates (may indicate blocking)
  • Storage: Recursive crawls can generate many files; ensure sufficient disk space
  • Legal: Ensure you have permission to crawl and archive the target site
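
Two of these guardrails, robots.txt compliance and the delay between requests, can be illustrated with the standard library alone. This is a sketch under assumptions rather than the script's own logic: `make_robots_checker` and `RateLimiter` are hypothetical names, and the robots.txt text is passed in directly instead of being fetched.

```python
import time
from urllib.robotparser import RobotFileParser


def make_robots_checker(robots_txt, user_agent="*"):
    """Return a function that reports whether a URL may be fetched."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return lambda url: parser.can_fetch(user_agent, url)


class RateLimiter:
    """Enforce a minimum delay (in seconds) between consecutive requests."""

    def __init__(self, delay=0.5, clock=time.monotonic, sleep=time.sleep):
        self.delay = delay
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, if any

    def wait(self):
        """Block until at least `delay` seconds have passed since the last call."""
        if self._last is not None:
            remaining = self.delay - (self._clock() - self._last)
            if remaining > 0:
                self._sleep(remaining)
        self._last = self._clock()


allowed = make_robots_checker("User-agent: *\nDisallow: /private/\n")
print(allowed("https://example.com/docs/"))         # True
print(allowed("https://example.com/private/page"))  # False
```

In a crawl loop, calling `limiter.wait()` before each request and skipping any URL for which `allowed(url)` is false gives the default behavior described above.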

Troubleshooting

  • Connection errors: Check your internet connection and URL validity
  • 403/blocked: Some sites block scrapers; the tool uses realistic User-Agent headers
  • Timeout: Increase the --timeout value (in seconds) for slow-loading pages
  • Image download fails: Images will fall back to original URLs
  • Missing images: Some sites use JavaScript to load images dynamically (not supported)

GitHub repository

agentbay-ai/agentbay-skills
Path: web-scraper
