rotate-scraping-proxies

pjt222
Updated 2 days ago

About

This skill provides proxy rotation for web scraping when client-side stealth techniques are insufficient. It allows developers to choose between datacenter, residential, and mobile proxy pools while managing session stickiness, costs, and health monitoring. Use it as an escalation step for legitimate scraping campaigns that face blocking despite proper rate limiting and robots.txt compliance.

Quick Install

Claude Code

Recommended (primary method)
npx skills add pjt222/agent-almanac -a claude-code
Plugin command (alternative)
/plugin add https://github.com/pjt222/agent-almanac
Git clone (alternative)
git clone https://github.com/pjt222/agent-almanac.git ~/.claude/skills/rotate-scraping-proxies

Copy and paste this command into Claude Code to install the skill

Skill Documentation

Rotate Scraping Proxies

Network-layer escalation for scraping when client-side stealth exhausted. Proxy rotation = last resort, not default. Expensive, ethically charged, easily misused. Skill teaches when not to use as much as how.

When Use

  • headless-web-scraping (Fetcher → StealthyFetcher → DynamicFetcher) tried, target still returns 403/429/geo-blocks
  • Rate limit ≥ 3s intervals, robots.txt permits path
  • User-Agent and TLS fingerprint realistic (not default python-requests)
  • Scraping legitimate: public data, no auth bypass, no paywall bypass, no personal data without legal basis
  • Can budget proxy traffic, accept ops complexity

Do not use when: public API exists (use it), ToS forbids automation, would bypass geo-licensing, goal = fraud/credential stuffing/sneaker bots/piracy.

Inputs

  • Required: Target URLs, legal basis for scraping
  • Required: Proxy pool credentials (env, never hard-code)
  • Required: Pool type — datacenter, residential, mobile
  • Optional: Geo targeting (country/region/city)
  • Optional: Rotation granularity — per-request (default) or sticky
  • Optional: Daily traffic/spend cap
  • Optional: Rate limit delay (default 1s, even with rotation)

Steps

Step 1: Pre-flight Legality and Ethics Check

Gate workflow on documented legal+ethical review. Skip = biggest source of harm.

# Inputs to confirm before writing any code:
# 1. Is the data public (no login required)?
# 2. Does robots.txt permit the path?
# 3. Does the site's ToS prohibit automated access? (read it)
# 4. Would the scraping process personal data? If yes, what is the legal basis?
# 5. Could this access circumvent geo-licensing, paywalls, or auth?
# 6. Is there a public API or data dump that would make scraping unnecessary?
# 7. Have you contacted the site owner if scope is large?

Got: Every question has defensible written answer. First "no" or "unknown" stops procedure.

If fail:

  • ToS forbids automation — stop; contact owner or use API/licensed dataset
  • Personal data, no legal basis — stop; engage privacy counsel
  • Bypass auth or geo-licensing — never proceed
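
The gate above can be sketched as a small function that stops at the first check that is failed, unknown, or undocumented. This is a minimal sketch: the `Check` type, the `preflight_gate` name, and the sample answers are hypothetical and not part of the skill itself.

```python
from typing import NamedTuple, Optional


class Check(NamedTuple):
    question: str
    passed: Optional[bool]  # True = cleared, False = failed, None = unknown
    written_answer: str     # the documented justification


def preflight_gate(checks):
    """Return (ok, reason); stop at the first check that is not clearly cleared."""
    for check in checks:
        if check.passed is not True:
            state = "unknown" if check.passed is None else "failed"
            return False, f"{state}: {check.question}"
        if not check.written_answer.strip():
            return False, f"undocumented: {check.question}"
    return True, None


# Hypothetical answers for illustration only.
checks = [
    Check("Data is public (no login)", True, "Product pages load without auth."),
    Check("robots.txt permits the path", True, "No Disallow rule matches /products/."),
    Check("ToS allows automated access", None, ""),  # not yet read, so unknown
]

ok, reason = preflight_gate(checks)
print(ok, reason)
```

Treating "unknown" the same as "failed" keeps the default safe: no scraping code runs until someone has written an answer down.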

Step 2: Choose Pool Type

Different pools = different cost, detectability, ethics. Pick cheapest tier that solves block.

| Pool type | Detectability | Cost | Best for |
|---|---|---|---|
| Datacenter | High (easily blocked by Cloudflare/Akamai) | $ | Sites with no real anti-bot, geo-shifting only |
| Residential | Low (real ISP IPs) | $$$ | Sites that block datacenter ASNs |
| Mobile | Very low (carrier-grade NAT, shared with thousands) | $$$$ | Sites that even block residential (rare) |

Ethical caveat for residential and mobile: route traffic through real consumer connections. Operator consent model varies — some pay users, some bundle exit-node consent into "free VPN" EULAs users do not read. Prefer providers with audited, opt-in consent. Would not be comfortable with stranger sending scraping traffic through your home router? Do not send yours through theirs.

Got: Documented choice with cheapest viable tier, brief note on why higher tiers rejected (or needed).

If fail:

  • Datacenter blocked, residential over budget — narrow scope (fewer URLs, slower) before upgrade tier
  • No provider with documented opt-in consent — reconsider whether scraping necessary
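
One way to record the tier decision is a helper that walks the tiers cheapest first and returns the first one that is not blocked, fits the budget, and (for residential/mobile) has a verified consent model. A sketch only: `choose_pool` and its three set arguments are hypothetical names, and the inputs come from your own testing and provider review.

```python
# Tiers ordered cheapest first, following the pool table.
TIERS = ["datacenter", "residential", "mobile"]
NEEDS_CONSENT_REVIEW = {"residential", "mobile"}


def choose_pool(blocked, affordable, consent_verified):
    """Return the cheapest viable tier, or None if no tier qualifies."""
    for tier in TIERS:
        if tier in blocked or tier not in affordable:
            continue
        if tier in NEEDS_CONSENT_REVIEW and tier not in consent_verified:
            continue
        return tier
    return None


# Datacenter is blocked, residential fits the budget and its consent model was checked.
choice = choose_pool(
    blocked={"datacenter"},
    affordable={"datacenter", "residential"},
    consent_verified={"residential"},
)
print(choice)
```

A `None` result maps to the failure cases above: narrow scope, or reconsider whether scraping is necessary.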

Step 3: Integrate Rotation with Scrapling

Wire proxy into scrapling fetchers. Read creds from env vars — never hard-code, never commit .env.

import os
import random
from scrapling import Fetcher, StealthyFetcher

# Pattern A: provider-managed rotating endpoint (one URL, provider rotates per request)
PROXY_URL = os.environ["SCRAPING_PROXY_URL"]  # format: http://<user>:<password>@<proxy-host>:7777

fetcher = StealthyFetcher()
fetcher.configure(
    headless=True,
    timeout=60,
    network_idle=True,
    proxy=PROXY_URL,
)

# Pattern B: explicit pool, rotate yourself
POOL = os.environ["SCRAPING_PROXY_POOL"].split(",")  # comma-separated URLs

def fetch_with_rotation(url):
    proxy = random.choice(POOL)
    fetcher = StealthyFetcher()
    fetcher.configure(headless=True, timeout=60, proxy=proxy)
    return fetcher.get(url)

Got: Requests succeed, egress IP varies between calls. Hit IP-echo (https://api.ipify.org) to confirm before real scrape.
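
The IP-echo confirmation can be sketched with the standard library alone, independent of the scraping client. Assumptions: `ip_via_proxy` and `egress_variance` are hypothetical helper names, and the demo uses a stand-in instead of a real network call so it runs offline.

```python
import urllib.request


def ip_via_proxy(proxy_url, echo_url="https://api.ipify.org", timeout=10):
    """Fetch the egress IP an echo service sees when routed through proxy_url."""
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    opener = urllib.request.build_opener(handler)
    with opener.open(echo_url, timeout=timeout) as response:
        return response.read().decode("ascii").strip()


def egress_variance(get_ip, samples=5):
    """Call get_ip() `samples` times and return the set of distinct IPs seen."""
    return {get_ip() for _ in range(samples)}


# Offline demonstration: a stand-in replaces the network call.
fake_ips = iter(["203.0.113.5", "203.0.113.9", "203.0.113.5"])
seen = egress_variance(lambda: next(fake_ips), samples=3)
print(f"{len(seen)} distinct egress IPs")
```

In a real check, pass something like `lambda: ip_via_proxy(PROXY_URL)` as `get_ip`; a single distinct IP across several samples suggests the endpoint is sticky.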

If fail:

  • 407 Proxy Authentication Required — wrong creds or password URL-encoding broke (re-encode special chars)
  • Same IP every call — provider endpoint sticky by default; check docs for -rotating or per-request flag
  • Massive latency increase — expected; rotation adds 200–2000ms per request
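
For the 407 case, percent-encoding the credentials before assembling the proxy URL avoids breakage from characters such as `@`, `:` or `/` in a password. A minimal sketch: `build_proxy_url`, the host `proxy.example.test`, and the sample credentials are placeholders, not any real provider's values.

```python
from urllib.parse import quote


def build_proxy_url(user, password, host, port, scheme="http"):
    """Assemble a proxy URL with percent-encoded credentials."""
    safe_user = quote(user, safe="")          # encode every reserved character
    safe_password = quote(password, safe="")
    return f"{scheme}://{safe_user}:{safe_password}@{host}:{port}"


# Placeholder values; real ones should come from environment variables.
url = build_proxy_url("user", "p@ss:w/rd", "proxy.example.test", 7777)
print(url)
```

Storing user, password, and host as separate environment variables and joining them this way also keeps the raw password out of any single copy-pasted URL.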

Step 4: Sticky Sessions and Pool Health

Decide rotation granularity per workload, then keep pool healthy.

import time  # needed by fetch_with_backoff below

# Sticky session for stateful flows (login, multi-page checkout-like crawls)
# Most providers expose a session ID via the username:
#   user-session-abc123:<password>@<proxy-host>:7777
# All requests with the same session ID exit through the same IP for ~10 min.

# Per-request rotation for anonymous bulk scraping (default)

# Pool health check — call before bulk run
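def check_pool(pool, sample_size=5):
    """Probe a random sample of the pool and return the proxies that respond."""
    sample = random.sample(pool, min(sample_size, len(pool)))
    alive = []
    for proxy in sample:
        try:
            # Configure first, then fetch, matching the Step 3 usage;
            # chaining .get() onto configure() assumes it returns the fetcher.
            fetcher = StealthyFetcher()
            fetcher.configure(proxy=proxy, timeout=10)
            r = fetcher.get("https://api.ipify.org")
            if r.status == 200:
                alive.append(proxy)
        except Exception:
            # Treat any error as a dead proxy and move on to the next one.
            continue
    return alive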

# Backoff on transient proxy failures
def fetch_with_backoff(url, max_attempts=3):
    for attempt in range(max_attempts):
        try:
            r = fetch_with_rotation(url)
            if r.status not in (407, 502, 503):
                return r
        except Exception:
            pass
        time.sleep(2 ** attempt)
    return None

Got: Stateful flows preserve cookies; bulk anonymous shows IP variance; dead proxies skipped not looped.

If fail:

  • Login breaks mid-flow — rotation inside session; switch to sticky-session creds
  • All proxies in sample fail health — pool exhausted or creds expired; rotate creds or contact provider

Step 5: Monitoring, Cost Control, Kill Switch

Proxy traffic = per-GB + per-request cost. Runaway scrapers = runaway invoices. Always include limits + abort.

import time

class ScrapeBudget:
    def __init__(self, max_requests, max_duration_seconds, max_failures):
        self.max_requests = max_requests
        self.max_duration = max_duration_seconds
        self.max_failures = max_failures
        self.requests = 0
        self.failures = 0
        self.start = time.monotonic()

    def allow(self):
        if self.requests >= self.max_requests:
            return False, "request cap reached"
        if time.monotonic() - self.start >= self.max_duration:
            return False, "time cap reached"
        if self.failures >= self.max_failures:
            return False, "failure cap reached (circuit breaker)"
        return True, None

    def record(self, success):
        self.requests += 1
        if not success:
            self.failures += 1

budget = ScrapeBudget(max_requests=1000, max_duration_seconds=3600, max_failures=20)

for url in target_urls:
    ok, reason = budget.allow()
    if not ok:
        print(f"Aborting: {reason}")
        break
    response = fetch_with_backoff(url)
    budget.record(success=response is not None)
    time.sleep(1)  # rate limiting still applies even with rotation

Got: Budget caps trigger before runaway cost. Logs show per-proxy success rate so bad egress IP can be identified, excluded.

If fail:

  • Failure rate climbs above 20% — pause; site detected rotation pattern (e.g. all IPs share subnet); switch pool type or stop
  • Cost-per-record exceeds expectations 5x — cache aggressive, dedupe URLs, batch where possible
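
Per-proxy success rates can be tracked with a small counter kept alongside the budget. A sketch under assumptions: `ProxyStats` is hypothetical, the 20% threshold mirrors the failure rate mentioned above, and proxies are keyed by a label rather than the full URL so credentials never reach the logs.

```python
from collections import defaultdict


class ProxyStats:
    def __init__(self, max_failure_rate=0.2, min_requests=5):
        self.max_failure_rate = max_failure_rate
        self.min_requests = min_requests  # ignore proxies with too few samples
        self.counts = defaultdict(lambda: {"ok": 0, "fail": 0})

    def record(self, proxy_label, success):
        self.counts[proxy_label]["ok" if success else "fail"] += 1

    def failure_rate(self, proxy_label):
        c = self.counts.get(proxy_label, {"ok": 0, "fail": 0})
        total = c["ok"] + c["fail"]
        return c["fail"] / total if total else 0.0

    def bad_proxies(self):
        """Labels whose failure rate exceeds the threshold, given enough samples."""
        return sorted(
            label
            for label, c in self.counts.items()
            if c["ok"] + c["fail"] >= self.min_requests
            and self.failure_rate(label) > self.max_failure_rate
        )


stats = ProxyStats()
for _ in range(5):
    stats.record("proxy-a", True)
for outcome in (True, True, False, False, False):
    stats.record("proxy-b", outcome)
print(stats.bad_proxies())
```

Labels returned by `bad_proxies()` are candidates to drop from the pool before the next batch.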

Checks

  • Step 1 legality check documented in writing before code runs
  • No proxy creds, pool URLs, session IDs in tracked files (grep gateway., proxy=, provider hostname)
  • .env (or equiv) in .gitignore
  • Pool choice justified: cheapest viable tier, consent model verified for residential/mobile
  • IP variance confirmed against echo endpoint before real run
  • Stateful flows use sticky sessions; bulk anonymous use per-request
  • Budget caps (requests, duration, failures) wired and tested
  • Rate limit (≥1s) preserved — rotation not excuse to flood
  • robots.txt still respected — rotation does not override
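
The credential check can be partly automated by scanning file contents for URLs that embed `user:password@`. A rough sketch: the regex and `find_credential_lines` are hypothetical and will miss credentials stored in other forms. It returns line numbers only, so the secret itself is never echoed.

```python
import re

# Matches scheme://user:password@host, but not plain scheme://host:port/path.
CREDENTIAL_URL = re.compile(
    r"\b[a-z][a-z0-9+.-]*://[^\s/:@]+:[^\s/@]+@[^\s/]+", re.IGNORECASE
)


def find_credential_lines(text):
    """Return 1-based line numbers that contain a URL with inline credentials."""
    return [
        lineno
        for lineno, line in enumerate(text.splitlines(), start=1)
        if CREDENTIAL_URL.search(line)
    ]


sample = (
    'PROXY = "http://alice:s3cret@proxy.example.test:7777"\n'
    'PAGE = "https://example.test:8080/page"\n'
)
print(find_credential_lines(sample))
```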

Pitfalls

  • Rotate before stealth exhausted: site often does not need new IP — needs realistic User-Agent, TLS fingerprint, slower cadence. Try StealthyFetcher and rate limit first; rotation expensive, unethical to deploy unnecessarily.
  • Hard-coded creds: pasting proxy URL into source leaks to git, container images, stack traces. Read from env vars or secrets manager.
  • Rotate mid-session: per-request rotation breaks any flow with cookies, CSRF, cart state. Use sticky for stateful work.
  • Treat rotation as "ethical anonymity": rotation hides you, does not make harmful scraping ethical. ToS, copyright, privacy law, rate-limit ethics still apply.
  • Use residential for high-risk activity: credential stuffing, sneaker bots, geo-pirating streams, fraud — out of scope. Stop if your case looks like this.
  • Ignore robots.txt because "we have rotation now": rotation does not grant permission. Directive is directive.
  • No kill switch: unsupervised loop on metered pool = four-figure invoice overnight. Cap requests, duration, failures.
  • Residential pool with opaque consent: some providers source exit nodes from "free VPN" EULAs real users never read. Pay premium for audited, opt-in consent.
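
The robots.txt pitfall can be guarded against in code with Python's standard `urllib.robotparser`, checking each URL before it is queued. A minimal sketch: `path_allowed`, the sample rules, and the `example-bot` user agent are illustrative; a real run would load the target site's actual robots.txt.

```python
from urllib.robotparser import RobotFileParser


def path_allowed(robots_txt, user_agent, url):
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Illustrative rules only.
ROBOTS = """\
User-agent: *
Disallow: /private/
"""

allowed = path_allowed(ROBOTS, "example-bot", "https://example.test/products/1")
denied = path_allowed(ROBOTS, "example-bot", "https://example.test/private/x")
print(allowed, denied)
```

The check is identical whether or not a proxy is in use, which is the point: rotation changes the egress IP, not the permission.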

GitHub Repository

pjt222/agent-almanac
Path: i18n/caveman/skills/rotate-scraping-proxies
