rotate-scraping-proxies
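About
This skill provides proxy rotation for web scraping when standard stealth techniques fail, supporting access to datacenter, residential, and mobile proxy pools along with session management and monitoring. It is designed as a last-resort escalation after headless scraping approaches have been blocked, and it emphasizes ethical use and cost awareness. Developers should implement it only when they face persistent 403/429 errors and their scraping traffic is legitimate.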
Quick install
Claude Code
Recommended: npx skills add pjt222/agent-almanac -a claude-code
/plugin add https://github.com/pjt222/agent-almanac
git clone https://github.com/pjt222/agent-almanac.git ~/.claude/skills/rotate-scraping-proxies
Copy and paste the command into Claude Code to install the skill.
Documentation
Rotate Scraping Proxies
Network-layer escalation when client stealth exhausted. Last resort, not default — expensive, ethically charged, easily misused. Skill teaches when not to use as much as how.
Use When
- headless-web-scraping (Fetcher → StealthyFetcher → DynamicFetcher) tried + still 403/429/geo-block
- Rate limit ≥3s + robots.txt permits
- UA + TLS fingerprint realistic (not default python-requests)
- Scrape legit: public data, no auth bypass, no paywall, no personal data w/o legal basis
- Budget for proxy traffic + accept ops complexity
Don't use → public API exists, ToS forbids automation, geo-license circumvention, fraud|cred-stuff|sneaker-bot|piracy.
In
- Required: Target URLs + legal basis
- Required: Proxy creds (env var, never hard-code)
- Required: Pool type — datacenter|residential|mobile
- Optional: Geo target (country|region|city)
- Optional: Rotation granularity — per-req (default)|sticky session
- Optional: Daily traffic|spend cap
- Optional: Rate delay s (default 1, even w/ rotation)
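The inputs above could be collected into one validated object before any request is sent. A minimal sketch: `ScrapeInputs` and `load_inputs` are illustrative names, and every env var except `SCRAPING_PROXY_URL` is a hypothetical name that does not come from this skill.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

POOL_TYPES = ("datacenter", "residential", "mobile")


@dataclass(frozen=True)
class ScrapeInputs:
    target_urls: tuple
    legal_basis: str
    proxy_url: str              # read from env, never hard-coded
    pool_type: str
    geo: Optional[str] = None
    sticky: bool = False        # False = per-request rotation (default)
    rate_delay_s: float = 1.0   # default 1s, even with rotation


def load_inputs(target_urls, legal_basis, env: Mapping[str, str] = os.environ):
    """Build ScrapeInputs from arguments plus environment; raise on a missing required input."""
    if not target_urls:
        raise ValueError("target URLs required")
    if not legal_basis.strip():
        raise ValueError("legal basis required")
    proxy_url = env.get("SCRAPING_PROXY_URL", "")
    if not proxy_url:
        raise ValueError("SCRAPING_PROXY_URL not set")
    pool_type = env.get("SCRAPING_POOL_TYPE", "")  # hypothetical variable name
    if pool_type not in POOL_TYPES:
        raise ValueError(f"pool type must be one of {POOL_TYPES}")
    return ScrapeInputs(
        target_urls=tuple(target_urls),
        legal_basis=legal_basis,
        proxy_url=proxy_url,
        pool_type=pool_type,
        geo=env.get("SCRAPING_GEO") or None,                      # hypothetical variable name
        sticky=env.get("SCRAPING_STICKY", "") == "1",             # hypothetical variable name
        rate_delay_s=float(env.get("SCRAPING_RATE_DELAY", "1")),  # hypothetical variable name
    )
```

Failing fast here keeps a missing legal basis or an unknown pool type from surfacing only after paid traffic has started.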
Do
Step 1: Pre-flight Legality+Ethics
Gate workflow on documented review. Skip = biggest harm source.
# Inputs to confirm before writing any code:
# 1. Is the data public (no login required)?
# 2. Does robots.txt permit the path?
# 3. Does the site's ToS prohibit automated access? (read it)
# 4. Would the scraping process personal data? If yes, what is the legal basis?
# 5. Could this access circumvent geo-licensing, paywalls, or auth?
# 6. Is there a public API or data dump that would make scraping unnecessary?
# 7. Have you contacted the site owner if scope is large?
→ Every Q has defensible written ans. First "no"|"unknown" stops proc.
If err:
- ToS forbids → don't proceed; contact owner|use API|licensed dataset
- Personal data no basis → don't proceed; engage privacy counsel
- Auth|geo-license bypass → don't proceed under any circumstances
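The gate above can be sketched as a small check that refuses to continue until every answer is both positive and written down. This is an illustration, not part of the skill: the question keys are rephrased so that `True` always means "safe to proceed", and `Answer`, `preflight`, and `robots_permits` are hypothetical names. The robots.txt helper uses the standard-library `urllib.robotparser` on text you have already downloaded.

```python
from dataclasses import dataclass
from urllib.robotparser import RobotFileParser

QUESTIONS = (
    "data_is_public",
    "robots_permits_path",
    "tos_allows_automation",
    "personal_data_absent_or_has_basis",
    "no_geo_paywall_auth_bypass",
    "no_api_or_dump_available",
    "owner_contacted_if_large",
)


@dataclass
class Answer:
    ok: object        # True, False, or None for "unknown"
    note: str = ""    # the written justification


def preflight(answers):
    """Return (True, None) only if every question has ok=True and a written note.

    The first "no", "unknown", or undocumented answer stops the process.
    """
    for q in QUESTIONS:
        a = answers.get(q)
        if a is None or a.ok is not True:
            return False, f"{q}: no or unknown"
        if not a.note.strip():
            return False, f"{q}: missing written answer"
    return True, None


def robots_permits(robots_txt, user_agent, url):
    """Check one URL against already-fetched robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

A passing `preflight` only records that the review happened; it does not replace reading the ToS or getting legal advice.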
Step 2: Pool Type
Tiers differ in cost, detectability, ethics. Pick cheapest tier that solves the block.
| Pool type | Detectability | Cost | Best for |
|---|---|---|---|
| Datacenter | High (easily blocked by Cloudflare/Akamai) | $ | Sites with no real anti-bot, geo-shifting only |
| Residential | Low (real ISP IPs) | $$$ | Sites that block datacenter ASNs |
| Mobile | Very low (carrier-grade NAT, shared with thousands) | $$$$ | Sites that block even residential (rare) |
Ethical caveat residential+mobile: routes via real consumer connections. Provider consent varies — some pay, some bundle exit-node consent into "free VPN" EULAs unread. Prefer audited opt-in. If wouldn't be comfortable w/ stranger sending scrape via your home router → don't send via theirs.
→ Documented choice + cheapest viable + brief why higher rejected (or needed).
If err:
- Datacenter blocked, residential over budget → narrow scope (fewer URLs, slower) before upgrade
- No documented opt-in consent → reconsider whether scrape needed at all
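One way to make the "cheapest viable + why" record concrete is a function that walks the tiers from cheapest to most expensive and writes down a reason for each. A sketch under assumptions: `choose_pool` and its three set arguments are hypothetical, and the tier order is taken from the cost column of the table above.

```python
TIERS = ("datacenter", "residential", "mobile")  # cheapest first
NEEDS_CONSENT = {"residential", "mobile"}        # tiers that exit via consumer connections


def choose_pool(blocked, affordable, consent_verified):
    """Return (tier, reasons) for the cheapest viable tier, or (None, reasons).

    blocked: tiers already tried and blocked by the target
    affordable: tiers within budget
    consent_verified: tiers whose provider documents opt-in exit-node consent
    """
    reasons = []
    for tier in TIERS:
        if tier in blocked:
            reasons.append(f"{tier}: blocked by target")
        elif tier not in affordable:
            # Higher tiers cost more, so stop instead of escalating.
            reasons.append(f"{tier}: over budget, narrow scope before upgrading")
            return None, reasons
        elif tier in NEEDS_CONSENT and tier not in consent_verified:
            reasons.append(f"{tier}: no documented opt-in consent")
            return None, reasons
        else:
            reasons.append(f"{tier}: chosen as cheapest viable")
            return tier, reasons
    return None, reasons
```

The returned `reasons` list doubles as the documented justification this step asks for.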
Step 3: Integrate Rotation w/ Scrapling
Wire proxy → scrapling fetcher. Read creds from env, never hard-code, never commit .env.
import os
import random
from scrapling import Fetcher, StealthyFetcher

# Verify the configure()/get() keyword names against your installed scrapling version.

# Pattern A: provider-managed rotating endpoint (one URL, provider rotates per request)
PROXY_URL = os.environ["SCRAPING_PROXY_URL"]  # http://user:<password>@<host>:7777
fetcher = StealthyFetcher()
fetcher.configure(
    headless=True,
    timeout=60,
    network_idle=True,
    proxy=PROXY_URL,
)

# Pattern B: explicit pool, rotate yourself
POOL = os.environ["SCRAPING_PROXY_POOL"].split(",")  # comma-separated URLs

def fetch_with_rotation(url):
    proxy = random.choice(POOL)  # random exit per request
    fetcher = StealthyFetcher()
    fetcher.configure(headless=True, timeout=60, proxy=proxy)
    return fetcher.get(url)
→ Reqs succeed + egress IP varies. Confirm via IP echo (https://api.ipify.org) before real scrape.
If err:
- 407 Proxy Auth Required → wrong creds|URL-encoding broke pwd (re-encode special chars)
- Same IP every call → endpoint sticky default; check docs for -rotating or per-req flag
- Massive latency → expected; rotation adds 200–2000ms/req
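The 407 case above often comes from a password containing `@`, `:` or `/`. A sketch of building the proxy URL from separate secrets with `urllib.parse.quote`: the four `SCRAPING_PROXY_*` component variables are hypothetical names (this skill itself only uses `SCRAPING_PROXY_URL`), as are both function names.

```python
import os
from urllib.parse import quote, urlsplit


def build_proxy_url(host, port, user, password, scheme="http"):
    """Percent-encode user and password so characters like @ : / survive in the URL."""
    return f"{scheme}://{quote(user, safe='')}:{quote(password, safe='')}@{host}:{port}"


def proxy_url_from_env(env=os.environ):
    # Hypothetical variable names, one secret per component.
    return build_proxy_url(
        env["SCRAPING_PROXY_HOST"],
        env["SCRAPING_PROXY_PORT"],
        env["SCRAPING_PROXY_USER"],
        env["SCRAPING_PROXY_PASS"],
    )


def looks_well_formed(url):
    """Sanity check: the URL parses with a host, a port, and credentials."""
    parts = urlsplit(url)
    return bool(parts.hostname and parts.port and parts.username and parts.password)
```

Passing `safe=''` matters: by default `quote` leaves `/` unencoded, which would still split the URL in the wrong place.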
Step 4: Sticky Sessions + Pool Health
Match rotation granularity to workload + keep pool healthy.
import random
import time

# Sticky session for stateful flows (login, multi-page checkout-like crawls)
# Most providers expose a session ID via the username:
#   user-session-abc123:<password>@<host>:7777
# All requests with the same session ID exit through the same IP for ~10 min.

# Per-request rotation for anonymous bulk scraping (default)

# Pool health check: call before bulk run
def check_pool(pool, sample_size=5):
    sample = random.sample(pool, min(sample_size, len(pool)))
    alive = []
    for proxy in sample:
        try:
            fetcher = StealthyFetcher()
            fetcher.configure(proxy=proxy, timeout=10)  # same two-step pattern as Step 3
            r = fetcher.get("https://api.ipify.org")
            if r.status == 200:
                alive.append(proxy)
        except Exception:
            pass  # unreachable proxy: leave it out of the alive list
    return alive

# Backoff on transient proxy failures
def fetch_with_backoff(url, max_attempts=3):
    for attempt in range(max_attempts):
        try:
            r = fetch_with_rotation(url)
            if r.status not in (407, 502, 503):
                return r
        except Exception:
            pass  # treat as transient and retry
        if attempt < max_attempts - 1:
            time.sleep(2 ** attempt)  # 1s, 2s, ... between attempts
    return None
→ Stateful preserves cookies; bulk shows IP variance; dead proxies skipped not looped.
If err:
- Login breaks mid-flow → rotating in session; switch to sticky-session creds
- All sample fail health → pool exhausted|creds expired; rotate|contact provider
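For an explicit pool (Pattern B), the stateful-versus-bulk split can also be handled on the client side. The sketch below pins each session ID to one proxy by hashing it and falls back to random choice for bulk requests. It is an alternative to the provider session-ID username, not a description of it, and `pick_proxy` is a hypothetical helper.

```python
import hashlib
import random


def pick_proxy(pool, session_id=None, rng=random):
    """Stateful flow: the same session_id always maps to the same proxy.
    Bulk flow (session_id=None): a random proxy per request."""
    if not pool:
        raise ValueError("empty proxy pool")
    if session_id is None:
        return rng.choice(pool)
    digest = hashlib.sha256(session_id.encode("utf-8")).digest()
    return pool[int.from_bytes(digest[:8], "big") % len(pool)]
```

The pin only holds while the pool list is unchanged; removing a dead proxy can remap existing sessions, so finish or restart stateful flows after editing the pool.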
Step 5: Monitor + Cost + Kill Switch
Per-GB + per-req cost. Runaway scrape → runaway invoice. Always cap+abort.
import time

class ScrapeBudget:
    def __init__(self, max_requests, max_duration_seconds, max_failures):
        self.max_requests = max_requests
        self.max_duration = max_duration_seconds
        self.max_failures = max_failures
        self.requests = 0
        self.failures = 0
        self.start = time.monotonic()

    def allow(self):
        if self.requests >= self.max_requests:
            return False, "request cap reached"
        if time.monotonic() - self.start >= self.max_duration:
            return False, "time cap reached"
        if self.failures >= self.max_failures:
            return False, "failure cap reached (circuit breaker)"
        return True, None

    def record(self, success):
        self.requests += 1
        if not success:
            self.failures += 1

budget = ScrapeBudget(max_requests=1000, max_duration_seconds=3600, max_failures=20)

# target_urls: the Target URLs input; fetch_with_backoff comes from Step 4
for url in target_urls:
    ok, reason = budget.allow()
    if not ok:
        print(f"Aborting: {reason}")
        break
    response = fetch_with_backoff(url)
    budget.record(success=response is not None)
    time.sleep(1)  # rate limiting still applies even with rotation
→ Caps trigger before runaway. Logs show per-proxy success → identify+exclude bad IP.
If err:
- Fail rate >20% → pause; site detected pattern (shared subnet); switch type|stop
- Cost-per-record 5x → cache, dedup URLs, batch
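The budget class above counts totals only. Per-proxy success logging and the 20% failure threshold could be tracked with something like the following sketch, where `ProxyStats` and its `min_samples` guard are assumptions added for illustration.

```python
from collections import defaultdict


class ProxyStats:
    """Count successes and failures per proxy so bad exits can be excluded."""

    def __init__(self, max_fail_rate=0.20, min_samples=5):
        self.max_fail_rate = max_fail_rate
        self.min_samples = min_samples  # avoid judging a proxy on one or two requests
        self.counts = defaultdict(lambda: [0, 0])  # label -> [successes, failures]

    def record(self, proxy_label, success):
        # Key by a label or host, not the full credentialed URL, so logs stay clean.
        self.counts[proxy_label][0 if success else 1] += 1

    def fail_rate(self, proxy_label=None):
        """Failure rate for one proxy, or overall when proxy_label is None."""
        if proxy_label is not None:
            rows = [self.counts.get(proxy_label, [0, 0])]
        else:
            rows = list(self.counts.values())
        ok = sum(r[0] for r in rows)
        bad = sum(r[1] for r in rows)
        return bad / (ok + bad) if ok + bad else 0.0

    def bad_proxies(self):
        """Proxies with enough samples and a failure rate above the threshold."""
        return sorted(
            label for label, (ok, bad) in self.counts.items()
            if ok + bad >= self.min_samples and bad / (ok + bad) > self.max_fail_rate
        )

    def should_pause(self):
        """True once the overall failure rate exceeds the threshold."""
        total = sum(ok + bad for ok, bad in self.counts.values())
        return total >= self.min_samples and self.fail_rate() > self.max_fail_rate
```

Calling `record` next to `budget.record` in the loop and checking `should_pause()` before each request gives the pause behaviour described above. Using it requires `fetch_with_rotation` to report which proxy it chose, which the Step 3 version does not do.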
Check
- Step 1 legality documented written before code
- No proxy creds|pool URLs|session IDs in tracked files (grep gateway., proxy=, hostname)
- .env in .gitignore
- Pool justified: cheapest viable + consent verified for residential|mobile
- IP variance confirmed vs echo before real run
- Stateful → sticky; bulk → per-req
- Budget caps (req|dur|fail) wired+tested
- Rate limit (≥1s) preserved — rotation ≠ flood excuse
- robots.txt respected; rotation doesn't override
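The "no creds in tracked files" check can be partly automated. A rough sketch that flags any line containing a URL with embedded `user:password@`: the regex and `find_embedded_creds` are illustrative only and will miss credentials stored in other forms, so treat a clean result as necessary, not sufficient.

```python
import re

# scheme://user:password@host  (credentials embedded in a URL)
CRED_URL = re.compile(r"\b[a-z][a-z0-9+.-]*://[^\s/:@]+:[^\s/@]+@[^\s/]+", re.IGNORECASE)


def find_embedded_creds(text):
    """Return (line_number, line) pairs whose text contains a URL with user:password@."""
    return [
        (n, line.strip())
        for n, line in enumerate(text.splitlines(), start=1)
        if CRED_URL.search(line)
    ]
```

Run it over the contents of each file listed by `git ls-files`; reading the files is left to the caller.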
Traps
- Rotate before stealth exhausted: Site needs realistic UA, TLS, slower cadence, not new IP. Try StealthyFetcher+rate first.
- Hard-coded creds: Source file leaks → git, images, traces. Always env|secrets manager.
- Rotate mid-session: Per-req breaks cookies|CSRF|cart. Sticky for stateful.
- "Ethical anonymity" myth: Rotation hides you from target → doesn't make harmful scrape ethical. ToS, copyright, privacy law, rate-ethics still apply.
- Residential for high-risk: Cred stuff, sneaker, geo-pirate, fraud → out of scope. Stop.
- Ignore robots.txt because rotation: Doesn't grant permission. Directive=directive.
- No kill switch: Unsupervised loop on metered pool → 4-figure invoice overnight. Always cap.
- Opaque consent residential: Some src exit nodes from "free VPN" EULAs unread. Pay premium for audited opt-in.
Related
- headless-web-scraping — parent; always start there
- use-graphql-api — prefer official APIs
- deploy-searxng — self-host search → no scrape
- configure-reverse-proxy — opposite direction reference
- security-audit-codebase — run after creds → confirm no leak
