SKILL·416024

web-scraper

Name: web-scraper
Author: agentbay-ai

Updated 1 month ago

9 views

Category: Documentation (general)

About

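This skill scrapes web pages into HTML or Markdown, including text and images, using minimal dependencies (requests and beautifulsoup4). Given a URL, it downloads the content and archives it locally. Key features include automatic image saving and optional recursive link following.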

Documentation

name: web-scraper
description: Scrape web pages and save as HTML or Markdown (with text and images). Minimal dependencies - only requests and beautifulsoup4. Use when the user provides a URL and wants to download/archive the content locally.
homepage: https://requests.readthedocs.io/
metadata:
  {
    "openclaw":
      {
        "emoji": "🕷️",
        "requires": { "bins": ["python3"], "env": [] },
      },
  }

Web Scraper

Fetch web page content (text + images) and save as HTML or Markdown locally.

Minimal dependencies: Only requires requests and beautifulsoup4 - no browser automation.

Default behavior: Downloads images to local images/ directory automatically.

Quick start

Single page

{baseDir}/scripts/scrape.py --url "https://example.com" --format html --output /tmp/page.html
{baseDir}/scripts/scrape.py --url "https://example.com" --format md --output /tmp/page.md

Recursive (follow links)

{baseDir}/scripts/scrape.py --url "https://docs.example.com" --format md --recursive --max-depth 2 --output ~/Downloads/docs-archive

Setup

Requires Python 3.8+ and minimal dependencies:

cd {baseDir}
pip install -r requirements.txt

Or install manually:

pip install requests beautifulsoup4

Note: No browser or driver needed - uses pure HTTP requests.

Inputs to collect

Single page mode

URL: The web page to scrape (required)
Format: html or md (default: html)
Output path: Where to save the file (default: current directory with auto-generated name)
Images: Downloads images by default (use --no-download-images to disable)

Recursive mode (--recursive)

URL: Starting point for recursive scraping
Format: html or md
Output directory: Where to save all scraped pages
Max depth: How many levels deep to follow links (default: 2)
Max pages: Maximum total pages to scrape (default: 50)
Domain filter: Whether to stay within same domain (default: yes)
Images: Downloads images by default
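
The inputs above map onto command-line flags. The following is a minimal sketch of how such a parser might be declared with argparse; it is not the actual contents of scrape.py. The function name build_parser and the dest names are hypothetical. The flag names and defaults are taken from this document. --same-domain and --timeout are left out because this document does not state how the former is disabled or what the latter defaults to.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented flags and defaults."""
    p = argparse.ArgumentParser(prog="scrape.py")
    p.add_argument("--url", required=True, help="page to scrape, or crawl start point")
    p.add_argument("--format", choices=["html", "md"], default="html")
    p.add_argument("--output", default=None, help="file (single page) or directory (recursive)")
    # Images are downloaded unless this flag is passed.
    p.add_argument("--no-download-images", dest="download_images", action="store_false")
    p.add_argument("--recursive", action="store_true")
    p.add_argument("--max-depth", type=int, default=2)
    p.add_argument("--max-pages", type=int, default=50)
    # robots.txt is respected unless this flag is passed.
    p.add_argument("--no-respect-robots", dest="respect_robots", action="store_false")
    p.add_argument("--rate-limit", type=float, default=0.5, help="delay between requests in seconds")
    return p


args = build_parser().parse_args(["--url", "https://example.com", "--recursive"])
print(args.format, args.max_depth, args.max_pages, args.rate_limit)
```

With only --url and --recursive given, every other option falls back to the defaults listed above.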

Conversation Flow

1. Ask the user for the URL to scrape
2. Ask for the preferred output format (HTML or Markdown)
- Note: Both formats include text and images by default
- HTML: Preserves original structure with downloaded images
- Markdown: Clean text format with downloaded images in images/ folder
3. For recursive mode: Ask for max depth and max pages (optional; both have sensible defaults)
4. Ask where to save (or suggest a default path like /tmp/ or ~/Downloads/)
5. Run the script and confirm success
6. Show the saved file/directory path

Examples

Single Page Scraping

Save as HTML

{baseDir}/scripts/scrape.py --url "https://docs.openclaw.ai/start/quickstart" --format html --output ~/Downloads/openclaw-quickstart.html

Save as Markdown (with images, default)

{baseDir}/scripts/scrape.py --url "https://en.wikipedia.org/wiki/Web_scraping" --format md --output ~/Documents/web-scraping.md

Result: Creates web-scraping.md plus an images/ folder containing the downloaded images.

Without downloading images (optional)

{baseDir}/scripts/scrape.py --url "https://example.com" --format md --no-download-images

Result: Text only; images are referenced by their original URLs and are not downloaded locally.

Auto-generate filename

{baseDir}/scripts/scrape.py --url "https://example.com" --format html
# Saves to: example-com-{timestamp}.html
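
One way such an auto-generated name could be derived is sketched below. This is an assumption for illustration, not the script's actual logic. The function name auto_filename, the inclusion of the URL path in the slug, and the timestamp layout are all hypothetical; this document only shows the pattern example-com-{timestamp}.html.

```python
import re
from datetime import datetime
from urllib.parse import urlparse


def auto_filename(url: str, fmt: str, stamp: str = "") -> str:
    """Build a filesystem-safe name such as example-com-<stamp>.html."""
    parts = urlparse(url)
    raw = parts.netloc + parts.path
    # Collapse every run of characters outside [A-Za-z0-9] into one hyphen.
    slug = re.sub(r"[^A-Za-z0-9]+", "-", raw).strip("-").lower() or "page"
    # The timestamp layout is an assumption; the document only shows {timestamp}.
    stamp = stamp or datetime.now().strftime("%Y%m%d-%H%M%S")
    return f"{slug}-{stamp}.{fmt}"


print(auto_filename("https://example.com", "html", stamp="20240101-120000"))
# prints: example-com-20240101-120000.html
```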

Recursive Scraping

Basic recursive crawl (depth 2, same domain, with images)

{baseDir}/scripts/scrape.py --url "https://docs.example.com" --format md --recursive --output ~/Downloads/docs-archive

Output structure (text + images for all pages):

docs-archive/
├── index.md
├── getting-started.md
├── api/
│   ├── authentication.md
│   └── endpoints.md
└── images/              # Shared images from all pages
    ├── logo.png
    └── diagram.svg
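
The layout above mirrors the URL path of each page. Below is a minimal sketch of one possible mapping from a page URL to its output path. The function name url_to_relpath and the rule that root and directory-style URLs become index files are assumptions; the tool's real naming rules are not given here.

```python
import re
from urllib.parse import urlparse


def url_to_relpath(url: str, fmt: str) -> str:
    """Map a page URL to a relative output path that mirrors the URL path."""
    path = urlparse(url).path
    # Drop empty segments and dot segments so the result stays inside the output dir.
    segments = [s for s in path.split("/") if s and s not in (".", "..")]
    if not segments or path.endswith("/"):
        # Root and directory-style URLs get an index file (assumption).
        segments.append("index")
    # Strip a trailing .html/.htm before adding the output extension.
    stem = re.sub(r"\.html?$", "", segments[-1], flags=re.IGNORECASE) or "index"
    segments[-1] = f"{stem}.{fmt}"
    return "/".join(segments)


for u in ("https://docs.example.com", "https://docs.example.com/api/authentication"):
    print(url_to_relpath(u, "md"))
```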

Deep crawl with custom limits

{baseDir}/scripts/scrape.py \
  --url "https://blog.example.com" \
  --format html \
  --recursive \
  --max-depth 3 \
  --max-pages 100 \
  --output ~/Archives/blog-backup

Ignore robots.txt (use with caution)

{baseDir}/scripts/scrape.py \
  --url "https://example.com" \
  --format md \
  --recursive \
  --no-respect-robots \
  --rate-limit 1.0

Faster scraping (shorter delay between requests)

{baseDir}/scripts/scrape.py \
  --url "https://yoursite.com" \
  --format md \
  --recursive \
  --rate-limit 0.2

Features

Single Page Mode

HTML output: Preserves original page structure
- ✅ Clean, readable HTML document
- ✅ All images downloaded to images/ folder
- ✅ Suitable for offline viewing
Markdown output: Extracts clean text content
- ✅ Auto-downloads images to local images/ directory (default)
- ✅ Converts image URLs to relative paths
- ✅ Clean, readable format for archiving
- ✅ Fallback to original URLs if download fails
- Use --no-download-images flag to keep original URLs only
Simple and fast: Pure HTTP requests, no browser needed
Auto filename: Generates safe filename from URL if not specified
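
The Markdown image handling described above (relative paths, with a fallback to the original URL) could look roughly like the sketch below. It is illustrative only. The function name rewrite_images and the downloaded set of successfully saved image URLs are hypothetical, and the sketch names each local file after the last URL path segment, which may differ from what the tool does.

```python
import posixpath
import re
from urllib.parse import urljoin, urlparse

# Matches Markdown images of the form ![alt](src)
IMG_RE = re.compile(r"!\[([^\]]*)\]\(([^)\s]+)\)")


def rewrite_images(markdown: str, page_url: str, downloaded: set) -> str:
    """Point Markdown image links at images/<name> when the download succeeded."""

    def repl(match):
        alt, src = match.group(1), match.group(2)
        absolute = urljoin(page_url, src)  # resolve a relative src against the page URL
        if absolute in downloaded:
            name = posixpath.basename(urlparse(absolute).path)
            return f"![{alt}](images/{name})"
        # Fallback: keep the original (absolute) URL when the download failed.
        return f"![{alt}]({absolute})"

    return IMG_RE.sub(repl, markdown)


text = "![Logo](/static/logo.png) and ![Chart](https://cdn.example.com/chart.svg)"
ok = {"https://example.com/static/logo.png"}
print(rewrite_images(text, "https://example.com/docs/", ok))
```

Here the logo was downloaded and is rewritten to images/logo.png, while the chart keeps its remote URL.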

Recursive Mode (`--recursive`)

✅ Link discovery: Automatically follows links found on crawled pages, subject to the depth, page, and domain limits below
✅ Depth control: --max-depth limits how many levels deep to crawl (default: 2)
✅ Page limit: --max-pages caps total pages to prevent runaway crawls (default: 50)
✅ Domain filtering: --same-domain keeps crawl within starting domain (default: on)
✅ robots.txt compliance: Respects site's crawling rules by default
✅ Rate limiting: --rate-limit adds delay between requests (default: 0.5s)
✅ Smart URL filtering: Skips images, scripts, CSS, and duplicate URLs
✅ Progress tracking: Real-time console output with success/fail/skip counts
✅ Organized output: Preserves URL structure in directory hierarchy
✅ Efficient crawling: Sequential with rate limiting to respect servers
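
The depth, page, domain, and duplicate controls listed above can be combined in a breadth-first loop. The sketch below is one possible shape for it, not the tool's implementation. The names crawl, get_links, and SKIP_EXTENSIONS are hypothetical, and get_links stands in for fetching and parsing a page so the example runs without network access. It also assumes the start page counts as depth 0.

```python
from collections import deque
from urllib.parse import urldefrag, urljoin, urlparse

# Assumed list of non-page resources to skip.
SKIP_EXTENSIONS = (".png", ".jpg", ".jpeg", ".gif", ".svg", ".css", ".js")


def crawl(start, get_links, max_depth=2, max_pages=50, same_domain=True):
    """Breadth-first crawl; get_links(url) returns the hrefs found on that page."""
    start_host = urlparse(start).netloc
    seen = {start}
    queue = deque([(start, 0)])
    visited = []
    while queue and len(visited) < max_pages:
        url, depth = queue.popleft()
        visited.append(url)
        if depth >= max_depth:
            continue  # do not expand links beyond the depth limit
        for href in get_links(url):
            link, _fragment = urldefrag(urljoin(url, href))
            parsed = urlparse(link)
            if parsed.scheme not in ("http", "https"):
                continue
            if parsed.path.lower().endswith(SKIP_EXTENSIONS):
                continue  # images, scripts, stylesheets
            if same_domain and parsed.netloc != start_host:
                continue
            if link not in seen:  # skip duplicates
                seen.add(link)
                queue.append((link, depth + 1))
    return visited


site = {
    "https://a.test/": ["/one", "/two", "/logo.png", "https://b.test/x", "/one#top"],
    "https://a.test/one": ["/deep"],
    "https://a.test/deep": ["/deeper"],
}
print(crawl("https://a.test/", lambda u: site.get(u, [])))
```

A real crawler would replace the lambda with a function that downloads the page and extracts its anchor tags.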

Guardrails

Single Page Mode

Respect robots.txt and site terms of service
Some sites may block automated access; this tool uses standard HTTP requests
Large pages with many images may take time to download

Recursive Mode

Start small: Test with --max-depth 1 --max-pages 10 first
Respect robots.txt: Default is on; only use --no-respect-robots for your own sites
Rate limiting: Default 0.5s is polite; don't go below 0.2s for public sites
Same domain: Strongly recommended to keep --same-domain enabled
Monitor progress: Watch for high fail rates (may indicate blocking)
Storage: Recursive crawls can generate many files; ensure sufficient disk space
Legal: Ensure you have permission to crawl and archive the target site
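
The robots.txt and rate-limiting guardrails above can be illustrated with the standard library alone. This is a sketch under stated assumptions, not the tool's code. The names make_robots and Throttle are hypothetical, and the robots.txt rules are supplied inline rather than fetched from a site.

```python
import time
from urllib.robotparser import RobotFileParser


def make_robots(lines):
    """Parse robots.txt content that has already been fetched."""
    rp = RobotFileParser()
    rp.parse(lines)
    return rp


class Throttle:
    """Ensure at least `delay` seconds pass between consecutive requests."""

    def __init__(self, delay=0.5, clock=time.monotonic, sleep=time.sleep):
        self.delay, self.clock, self.sleep = delay, clock, sleep
        self.last = None  # time of the previous request, if any

    def wait(self):
        now = self.clock()
        if self.last is not None:
            remaining = self.delay - (now - self.last)
            if remaining > 0:
                self.sleep(remaining)
                now = self.clock()
        self.last = now


robots = make_robots(["User-agent: *", "Disallow: /private/"])
throttle = Throttle(delay=0.5)
for url in ["https://example.com/docs", "https://example.com/private/keys"]:
    if not robots.can_fetch("*", url):
        print("skip (robots.txt):", url)
        continue
    throttle.wait()
    print("fetch:", url)
```

The clock and sleep parameters are injectable so the delay logic can be tested without actually waiting.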

Troubleshooting

Connection errors: Check your internet connection and URL validity
403/blocked: Some sites block scrapers; the tool uses realistic User-Agent headers
Timeout: Increase --timeout flag for slow-loading pages (value in seconds)
Image download fails: Images will fall back to original URLs
Missing images: Some sites use JavaScript to load images dynamically (not supported)

GitHub repository

agentbay-ai/agentbay-skills

Path: web-scraper

FAQ

What is the web-scraper skill?

web-scraper is a Claude Skill by agentbay-ai. Skills package instructions and resources that Claude loads on demand, so Claude can perform web-scraper-related tasks without extra prompting.

How do I install web-scraper?

Use the install commands on this page: add web-scraper to Claude Code as a plugin, or clone its repository into your skills directory, then restart Claude so it picks up the skill.

What category does web-scraper belong to?

web-scraper is in the Documentation category, tagged general.

Is web-scraper free to use?

Yes. web-scraper is listed on AIMCP and free to install. It runs inside Claude, so no separate service account is required to use the skill itself.
