rotate-scraping-proxies
About
This skill implements provider-neutral proxy rotation to overcome blocking when client-side stealth techniques fail. It allows selecting from datacenter, residential, or mobile proxy pools with session stickiness for stateful workflows. Developers should use it only for legitimate traffic while monitoring costs and adhering to ethical limits.
Quick install
Claude Code
Recommended: npx skills add pjt222/agent-almanac -a claude-code
/plugin add https://github.com/pjt222/agent-almanac
git clone https://github.com/pjt222/agent-almanac.git ~/.claude/skills/rotate-scraping-proxies
Copy and paste one of these commands to install the skill.
Skill documentation
Rotate Scraping Proxies
Network-layer escalation for scraping where client-side stealth is exhausted. Proxy rotation is a last resort, not a default — expensive, ethically charged, and easily misused. This skill teaches when not to use it as much as how.
When to Use
- `headless-web-scraping` (Fetcher → StealthyFetcher → DynamicFetcher) tried and target still returns 403/429/geo-blocks
- Rate limiting already at 3+ second intervals and `robots.txt` permits the path
- User-Agent and TLS fingerprint already realistic (not default `python-requests`)
- Scraping is legitimate: public data, no auth circumvention, no paywall bypass, no personal data harvested without legal basis
- You can budget proxy traffic and accept operational complexity
Do not use when: a public API exists (use it), the site's ToS forbids automated access, you would circumvent geo-licensing, or the goal is fraud / credential stuffing / sneaker bots / content piracy.
Inputs
- Required: Target URLs and the legal basis for scraping them
- Required: Proxy pool credentials (read from environment, never hard-coded)
- Required: Pool type — datacenter, residential, or mobile
- Optional: Geographic targeting (country / region / city)
- Optional: Rotation granularity — per-request (default) or sticky session
- Optional: Daily traffic / spend cap
- Optional: Rate limit delay in seconds (default: 1, even with rotation)
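The inputs above can be gathered in one place before any fetcher is built. The following is a minimal sketch, assuming the inputs arrive as environment variables. Only `SCRAPING_PROXY_URL` appears in this skill; every other variable name, and the `ScrapeConfig` class itself, is hypothetical.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

VALID_POOL_TYPES = {"datacenter", "residential", "mobile"}


@dataclass(frozen=True)
class ScrapeConfig:
    proxy_url: str
    pool_type: str
    sticky_session: bool = False
    daily_request_cap: Optional[int] = None
    delay_seconds: float = 1.0
    country: Optional[str] = None


def load_config(env: Mapping[str, str] = os.environ) -> ScrapeConfig:
    """Build a config from environment variables; fail fast on missing required inputs."""
    proxy_url = env.get("SCRAPING_PROXY_URL", "")
    if not proxy_url:
        raise ValueError("SCRAPING_PROXY_URL is not set")

    # Hypothetical variable names below; adapt to your own conventions.
    pool_type = env.get("SCRAPING_POOL_TYPE", "").lower()
    if pool_type not in VALID_POOL_TYPES:
        raise ValueError(f"SCRAPING_POOL_TYPE must be one of {sorted(VALID_POOL_TYPES)}")

    cap = env.get("SCRAPING_DAILY_CAP")
    return ScrapeConfig(
        proxy_url=proxy_url,
        pool_type=pool_type,
        sticky_session=env.get("SCRAPING_STICKY", "0") == "1",
        daily_request_cap=int(cap) if cap else None,
        delay_seconds=float(env.get("SCRAPING_DELAY_SECONDS", "1")),
        country=env.get("SCRAPING_COUNTRY") or None,
    )
```

Passing a plain dict instead of `os.environ` makes the loader easy to exercise in tests without touching real credentials.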
Procedure
Step 1: Pre-flight Legality and Ethics Check
Gate the workflow on a documented legal and ethical review. Skipping this is the single biggest source of harm.
# Inputs to confirm before writing any code:
# 1. Is the data public (no login required)?
# 2. Does robots.txt permit the path?
# 3. Does the site's ToS prohibit automated access? (read it)
# 4. Would the scraping process personal data? If yes, what is the legal basis?
# 5. Could this access circumvent geo-licensing, paywalls, or auth?
# 6. Is there a public API or data dump that would make scraping unnecessary?
# 7. Have you contacted the site owner if scope is large?
Got: Every question has a defensible written answer. The first "no" or "unknown" stops the procedure until resolved.
If fail:
- ToS forbids automated access — do not proceed; contact the site owner or use an official API or licensed dataset
- Personal data with no legal basis — do not proceed; engage privacy counsel
- Circumvents auth or geo-licensing — do not proceed under any circumstances
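Question 2 of the checklist is the one item that can be checked mechanically. Here is a small sketch using the standard library's `urllib.robotparser`. The helper name `path_allowed`, the sample robots.txt, and the user-agent string are illustrative assumptions. The sketch parses text you have already downloaded, so it runs offline.

```python
from urllib.robotparser import RobotFileParser


def path_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Illustrative robots.txt content, not taken from any real site.
EXAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""

if __name__ == "__main__":
    for url in ("https://example.com/public/page", "https://example.com/private/page"):
        print(url, path_allowed(EXAMPLE_ROBOTS, "my-scraper", url))
```

A `False` result here is a stop condition for the whole procedure, not a signal to escalate to proxies.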
Step 2: Choose a Pool Type
Different pool types have different cost, detectability, and ethics. Pick the cheapest tier that solves your block.
| Pool type | Detectability | Cost | Best for |
|---|---|---|---|
| Datacenter | High (easily blocked by Cloudflare/Akamai) | $ | Sites with no real anti-bot, geo-shifting only |
| Residential | Low (real ISP IPs) | $$$ | Sites that block datacenter ASNs |
| Mobile | Very low (carrier-grade NAT, shared with thousands) | $$$$ | Sites that even block residential (rare) |
Ethical caveat for residential and mobile: these pools route your traffic through real consumer connections. Operator consent models vary — some pay users, some bundle exit-node consent into "free VPN" EULAs that users do not read. Prefer providers with audited, opt-in consent. If you would not be comfortable with a stranger sending scraping traffic through your home router, do not send yours through theirs.
Got: A documented choice with the cheapest viable tier and a brief note on why higher tiers were rejected (or why a higher tier is needed).
If fail:
- Datacenter blocked but residential over budget — narrow scraping scope (fewer URLs, slower cadence) before upgrading the tier
- Cannot find a provider with documented opt-in consent — reconsider whether the scraping is necessary
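The "cheapest tier that solves your block" rule can be written down so the choice is reproducible. This is a sketch under the assumption that you record which tiers the target has already blocked. The function name and the `max_tier` budget ceiling are hypothetical.

```python
from typing import Iterable, Optional

# Ordered cheapest to most expensive, matching the pool-type table.
TIERS = ("datacenter", "residential", "mobile")


def cheapest_viable_tier(blocked: Iterable[str], max_tier: str = "mobile") -> Optional[str]:
    """Pick the cheapest tier not known to be blocked, up to a budget ceiling.

    Returns None when every affordable tier is blocked, which means the scope
    should be narrowed or the scrape reconsidered rather than the tier raised.
    """
    blocked_set = set(blocked)
    ceiling = TIERS.index(max_tier)
    for tier in TIERS[: ceiling + 1]:
        if tier not in blocked_set:
            return tier
    return None
```

For example, `cheapest_viable_tier(["datacenter"], max_tier="datacenter")` returns `None`, which corresponds to the "datacenter blocked but residential over budget" case above.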
Step 3: Integrate Rotation with Scrapling
Wire the proxy into scrapling fetchers. Read credentials from environment
variables — never hard-code, never commit a .env.
import os
import random
from scrapling import Fetcher, StealthyFetcher

# Pattern A: provider-managed rotating endpoint (one URL, provider rotates per request)
PROXY_URL = os.environ["SCRAPING_PROXY_URL"]  # e.g. http://user:PASSWORD@PROXY_HOST:7777
fetcher = StealthyFetcher()
fetcher.configure(
    headless=True,
    timeout=60,
    network_idle=True,
    proxy=PROXY_URL,
)

# Pattern B: explicit pool, rotate yourself
POOL = os.environ["SCRAPING_PROXY_POOL"].split(",")  # comma-separated URLs

def fetch_with_rotation(url):
    proxy = random.choice(POOL)  # pick a random proxy for this request
    fetcher = StealthyFetcher()
    fetcher.configure(headless=True, timeout=60, proxy=proxy)
    return fetcher.get(url)
Got: Requests succeed and the egress IP varies between calls. Confirm by
hitting an IP-echo endpoint (e.g. https://api.ipify.org) before running the
real scrape.
If fail:
- 407 Proxy Authentication Required: credentials wrong or password URL-encoding broke (re-encode special characters)
- Same IP on every call: provider endpoint may be sticky by default; check docs for a `-rotating` or per-request flag
- Massive latency increase: expected; rotation adds 200–2000ms per request
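The 407 case caused by unencoded special characters can be avoided by building the proxy URL from separate parts. This is a minimal sketch using `urllib.parse.quote`. The helper name, its parameters, and the `-session-<id>` username suffix are assumptions. Session syntax differs between providers, so check your provider's documentation.

```python
from typing import Optional
from urllib.parse import quote


def build_proxy_url(
    user: str,
    password: str,
    host: str,
    port: int,
    session_id: Optional[str] = None,
    scheme: str = "http",
) -> str:
    """Assemble a proxy URL with percent-encoded credentials.

    If session_id is given, it is appended to the username using a
    "-session-<id>" suffix (an assumed, provider-specific convention).
    """
    if session_id:
        user = f"{user}-session-{session_id}"
    # safe="" forces encoding of "@", ":" and "/" inside the credentials.
    return f"{scheme}://{quote(user, safe='')}:{quote(password, safe='')}@{host}:{port}"
```

In practice the parts would come from environment variables or a secrets manager, never from literals in source.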
Step 4: Sticky Sessions and Pool Health
Decide rotation granularity per workload, then keep the pool healthy.
# Relies on random, StealthyFetcher, and fetch_with_rotation from Step 3
import time

# Sticky session for stateful flows (login, multi-page checkout-like crawls)
# Most providers expose a session ID via the username:
#   user-session-abc123:PASSWORD@PROXY_HOST:7777
# All requests with the same session ID exit through the same IP for ~10 min.

# Per-request rotation for anonymous bulk scraping (default)

# Pool health check: call before bulk run
def check_pool(pool, sample_size=5):
    sample = random.sample(pool, min(sample_size, len(pool)))
    alive = []
    for proxy in sample:
        try:
            # Configure and fetch in separate steps, as in fetch_with_rotation
            fetcher = StealthyFetcher()
            fetcher.configure(proxy=proxy, timeout=10)
            r = fetcher.get("https://api.ipify.org")
            if r.status == 200:
                alive.append(proxy)
        except Exception:
            pass  # treat any error as a dead proxy
    return alive

# Backoff on transient proxy failures
def fetch_with_backoff(url, max_attempts=3):
    for attempt in range(max_attempts):
        try:
            r = fetch_with_rotation(url)
            if r.status not in (407, 502, 503):
                return r
        except Exception:
            pass  # fall through to the backoff sleep
        time.sleep(2 ** attempt)  # waits 1s, 2s, 4s
    return None
Got: Stateful flows preserve cookies across requests; bulk anonymous scraping shows IP variance across requests; dead proxies skipped instead of looping.
If fail:
- Login breaks mid-flow — rotation happening inside the session; switch to sticky-session credentials
- All proxies in sample fail health check — pool exhausted or credentials expired; rotate credentials or contact provider
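Skipping dead proxies needs per-proxy bookkeeping rather than a plain random pick. The sketch below is one possible way to do it, not part of any library. The `ProxyPool` class and its failure threshold are hypothetical. It also keeps per-proxy success counts, so a bad egress IP can be spotted.

```python
import random
from collections import defaultdict


class ProxyPool:
    """Pick proxies at random, skipping any with too many consecutive failures."""

    def __init__(self, proxies, max_consecutive_failures=3):
        self.proxies = list(proxies)
        self.max_consecutive_failures = max_consecutive_failures
        self.consecutive_failures = defaultdict(int)
        self.stats = defaultdict(lambda: {"ok": 0, "fail": 0})

    def healthy(self):
        """Proxies still below the consecutive-failure threshold."""
        return [
            p for p in self.proxies
            if self.consecutive_failures[p] < self.max_consecutive_failures
        ]

    def pick(self):
        candidates = self.healthy()
        if not candidates:
            raise RuntimeError("no healthy proxies left in pool")
        return random.choice(candidates)

    def report(self, proxy, success):
        """Record the outcome of one request made through proxy."""
        if success:
            self.stats[proxy]["ok"] += 1
            self.consecutive_failures[proxy] = 0  # a success resets the streak
        else:
            self.stats[proxy]["fail"] += 1
            self.consecutive_failures[proxy] += 1

    def success_rate(self, proxy):
        """Fraction of successful requests, or None if the proxy was never used."""
        s = self.stats[proxy]
        total = s["ok"] + s["fail"]
        return s["ok"] / total if total else None
```

A caller would use `pool.pick()` in place of `random.choice(POOL)` and call `pool.report(proxy, success)` after each request. A `RuntimeError` from `pick()` is a natural trigger for the "pool exhausted" branch above.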
Step 5: Monitoring, Cost Control, and Kill Switch
Proxy traffic has a per-GB cost and a per-request cost. Runaway scrapers generate runaway invoices. Always include limits and an abort.
import time

class ScrapeBudget:
    def __init__(self, max_requests, max_duration_seconds, max_failures):
        self.max_requests = max_requests
        self.max_duration = max_duration_seconds
        self.max_failures = max_failures
        self.requests = 0
        self.failures = 0
        self.start = time.monotonic()

    def allow(self):
        if self.requests >= self.max_requests:
            return False, "request cap reached"
        if time.monotonic() - self.start >= self.max_duration:
            return False, "time cap reached"
        if self.failures >= self.max_failures:
            return False, "failure cap reached (circuit breaker)"
        return True, None

    def record(self, success):
        self.requests += 1
        if not success:
            self.failures += 1

budget = ScrapeBudget(max_requests=1000, max_duration_seconds=3600, max_failures=20)

# target_urls is the list of target URLs from Inputs
for url in target_urls:
    ok, reason = budget.allow()
    if not ok:
        print(f"Aborting: {reason}")
        break
    response = fetch_with_backoff(url)
    budget.record(success=response is not None)
    time.sleep(1)  # rate limiting still applies even with rotation
Got: Budget caps trigger before runaway cost. Logs show per-proxy success rate so a bad egress IP can be identified and excluded.
If fail:
- Failure rate climbs above 20% — pause; the site has detected the rotation pattern (e.g. all your IPs share a subnet); switch pool type or stop
- Cost-per-record exceeds expectations by 5x — cache aggressively, deduplicate URLs, batch where possible
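Request counts alone do not capture the per-GB side of the bill. The following sketch tracks estimated spend from response sizes. The `TrafficBudget` class, the price, and the cap are all hypothetical. It assumes billing per decimal gigabyte (10^9 bytes), which may not match your provider.

```python
class TrafficBudget:
    """Estimate proxy spend from bytes transferred and flag when a cap is hit."""

    BYTES_PER_GB = 10**9  # assumption: decimal GB; some providers bill per GiB

    def __init__(self, price_per_gb, max_spend):
        self.price_per_gb = price_per_gb
        self.max_spend = max_spend
        self.bytes_used = 0
        self.records = 0

    def add(self, response_bytes, records=1):
        """Record one response: its size and how many useful records it yielded."""
        self.bytes_used += response_bytes
        self.records += records

    @property
    def spend(self):
        return self.bytes_used / self.BYTES_PER_GB * self.price_per_gb

    def cost_per_record(self):
        """Estimated spend per extracted record, or None before any record exists."""
        return self.spend / self.records if self.records else None

    def exhausted(self):
        return self.spend >= self.max_spend
```

Checking `exhausted()` alongside `ScrapeBudget.allow()` in the main loop gives a spend-based abort in addition to the request, time, and failure caps.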
Validation
- Step 1 legality check is documented in writing before any code runs
- No proxy credentials, pool URLs, or session IDs in tracked files (grep for `gateway.`, `proxy=`, the provider hostname)
- `.env` (or equivalent) is in `.gitignore`
- Pool choice justified: cheapest viable tier, with consent model verified for residential/mobile
- IP variance confirmed against an echo endpoint before the real run
- Stateful flows use sticky sessions; bulk anonymous use per-request
- Budget caps (requests, duration, failures) wired and tested
- Rate limiting (≥1s) preserved — rotation is not an excuse to flood
- `robots.txt` still respected: rotation does not override it
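The IP-variance item in this checklist can be automated with a few sampled calls. This is a sketch with hypothetical helper names. The `fetch_ip` callable stands in for whatever function requests the echo endpoint through your proxy setup and returns the response body as text.

```python
from typing import Callable, Set


def distinct_egress_ips(fetch_ip: Callable[[], str], samples: int = 5) -> Set[str]:
    """Call fetch_ip several times and collect the distinct IPs it reports."""
    return {fetch_ip().strip() for _ in range(samples)}


def rotation_confirmed(
    fetch_ip: Callable[[], str], samples: int = 5, min_distinct: int = 2
) -> bool:
    """True if at least min_distinct different egress IPs were observed."""
    return len(distinct_egress_ips(fetch_ip, samples)) >= min_distinct
```

Keep the sample count small, since every probe is billed proxy traffic like any other request.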
Pitfalls
- Rotating before stealth is exhausted: the site often does not need a new IP; it needs a realistic User-Agent, TLS fingerprint, and slower cadence. Try `StealthyFetcher` and rate limiting first; rotation is expensive and unethical to deploy unnecessarily.
- Hard-coded credentials: pasting the proxy URL into source leaks it to git, container images, and stack traces. Read from environment variables or a secrets manager.
- Rotating mid-session: per-request rotation breaks any flow that depends on cookies, CSRF tokens, or shopping-cart state. Use sticky sessions for stateful work.
- Treating rotation as "ethical anonymity": rotation hides you from the target, but does not make harmful scraping ethical. ToS, copyright, privacy law, and rate-limit ethics still apply unchanged.
- Using residential proxies for high-risk activity: credential stuffing, sneaker botting, geo-pirating streaming content, fraud — explicitly out of scope. If your use case looks like this, stop.
- Ignoring `robots.txt` because "we have rotation now": rotation does not grant permission. The directive is the directive.
- No kill switch: an unsupervised loop on a metered proxy pool turns into a four-figure invoice overnight. Always cap requests, duration, and failures.
- Choosing a residential pool with opaque consent: some providers source exit nodes from "free VPN" EULAs that real users never read. Pay the premium for an audited, opt-in consent model.
Related Skills
- headless-web-scraping — parent skill; always start there. Use this skill only as escalation.
- use-graphql-api — prefer official APIs to scraping when one exists.
- deploy-searxng — self-hosted search avoids scraping search engines entirely.
- configure-reverse-proxy — opposite network direction (serving instead of fetching), useful neighbor reference.
- security-audit-codebase — run after integrating credentials to confirm none leaked into the repo.
