
rotate-scraping-proxies

pjt222
Updated 2 days ago
Category: Development. Tags: ai, api, data

About

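This skill implements provider-neutral proxy rotation to overcome blocks when client-side stealth techniques fail. It lets you choose among datacenter, residential, and mobile proxy pools and supports session persistence for stateful workflows. Developers should use it only for legitimate traffic, while monitoring cost and respecting ethical limits.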

Quick Install

Claude Code

Recommended (default)
npx skills add pjt222/agent-almanac -a claude-code
Plugin command (alternative)
/plugin add https://github.com/pjt222/agent-almanac
Git clone (alternative)
git clone https://github.com/pjt222/agent-almanac.git ~/.claude/skills/rotate-scraping-proxies

Copy and paste one of these commands into Claude Code to install the skill.

Documentation

Rotate Scraping Proxies

Network-layer escalation for scraping when client-side stealth is exhausted. Proxy rotation is a last resort, not a default: it is expensive, ethically charged, and easily misused. This skill teaches when not to use it as much as how to use it.

When to Use

  • headless-web-scraping (Fetcher → StealthyFetcher → DynamicFetcher) has been tried and the target still returns 403, 429, or geo-blocks
  • Rate limiting is already at 3+ second intervals and robots.txt permits the path
  • User-Agent and TLS fingerprint are already realistic (not the default python-requests)
  • Scraping is legitimate: public data, no auth circumvention, no paywall bypass, no personal data harvested without legal basis
  • You can budget proxy traffic and accept operational complexity

Do not use when: a public API exists (use it), the site's ToS forbids automated access, you would circumvent geo-licensing, or the goal is fraud / credential stuffing / sneaker bots / content piracy.

Inputs

  • Required: Target URLs and the legal basis for scraping them
  • Required: Proxy pool credentials (read from environment, never hard-coded)
  • Required: Pool type — datacenter, residential, or mobile
  • Optional: Geographic targeting (country / region / city)
  • Optional: Rotation granularity — per-request (default) or sticky session
  • Optional: Daily traffic / spend cap
  • Optional: Rate limit delay in seconds (default: 1, even with rotation)
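
The inputs above can be collected into one validated object before any request is made. The following is a minimal sketch, not part of the skill itself: only `SCRAPING_PROXY_URL` appears in this document, and every other variable name (`SCRAPING_POOL_TYPE`, `SCRAPING_LEGAL_BASIS`, `SCRAPING_GEO`, `SCRAPING_STICKY`, `SCRAPING_DAILY_CAP`, `SCRAPING_DELAY_SECONDS`) and the `ScrapeInputs` class are hypothetical names chosen for illustration.

```python
import os
from dataclasses import dataclass
from typing import Optional

VALID_POOL_TYPES = {"datacenter", "residential", "mobile"}


@dataclass(frozen=True)
class ScrapeInputs:
    """Hypothetical container for the required and optional inputs."""
    proxy_url: str
    pool_type: str
    legal_basis: str
    geo: Optional[str] = None
    sticky_session: bool = False          # False means per-request rotation
    daily_request_cap: Optional[int] = None
    delay_seconds: float = 1.0            # default delay of 1 second


def load_inputs(env=os.environ) -> ScrapeInputs:
    """Build inputs from environment variables; fail loudly on missing required values."""
    proxy_url = env.get("SCRAPING_PROXY_URL", "")
    pool_type = env.get("SCRAPING_POOL_TYPE", "").lower()
    legal_basis = env.get("SCRAPING_LEGAL_BASIS", "")
    if not proxy_url:
        raise ValueError("SCRAPING_PROXY_URL is required")
    if pool_type not in VALID_POOL_TYPES:
        raise ValueError(f"SCRAPING_POOL_TYPE must be one of {sorted(VALID_POOL_TYPES)}")
    if not legal_basis:
        raise ValueError("SCRAPING_LEGAL_BASIS is required")
    cap = env.get("SCRAPING_DAILY_CAP")
    return ScrapeInputs(
        proxy_url=proxy_url,
        pool_type=pool_type,
        legal_basis=legal_basis,
        geo=env.get("SCRAPING_GEO") or None,
        sticky_session=env.get("SCRAPING_STICKY", "0") == "1",
        daily_request_cap=int(cap) if cap else None,
        delay_seconds=float(env.get("SCRAPING_DELAY_SECONDS", "1")),
    )
```

Passing a plain dict as `env` keeps the loader easy to exercise in tests without touching the real environment.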

Procedure

Step 1: Pre-flight Legality and Ethics Check

Gate the workflow on a documented legal and ethical review. Skipping this is the single biggest source of harm.

# Inputs to confirm before writing any code:
# 1. Is the data public (no login required)?
# 2. Does robots.txt permit the path?
# 3. Does the site's ToS prohibit automated access? (read it)
# 4. Would the scraping process personal data? If yes, what is the legal basis?
# 5. Could this access circumvent geo-licensing, paywalls, or auth?
# 6. Is there a public API or data dump that would make scraping unnecessary?
# 7. Have you contacted the site owner if scope is large?

Got: Every question has a defensible written answer. The first "no" or "unknown" stops the procedure until resolved.

If fail:

  • ToS forbids automated access — do not proceed; contact the site owner or use an official API or licensed dataset
  • Personal data with no legal basis — do not proceed; engage privacy counsel
  • Circumvents auth or geo-licensing — do not proceed under any circumstances
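
Two of the pre-flight questions lend themselves to a mechanical check. The sketch below assumes you record each answer as `True` (cleared), `False`, or `None` (unknown), and uses the standard-library `urllib.robotparser` for the robots.txt question. The function names `first_blocker` and `path_allowed` and the check names are hypothetical; the other questions still need a written, human answer.

```python
from urllib.robotparser import RobotFileParser


def first_blocker(checks):
    """Return the name of the first check not explicitly cleared (True), else None."""
    for name, cleared in checks.items():
        if cleared is not True:
            return name
    return None


def path_allowed(robots_txt, user_agent, url):
    """Evaluate already-downloaded robots.txt text for one URL and user agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


robots = "User-agent: *\nDisallow: /private/\n"
checks = {
    "data_is_public": True,
    "robots_permits_path": path_allowed(robots, "my-bot", "https://example.com/public/page"),
    "tos_allows_automation": None,  # unknown: not yet read, so the procedure stops here
}
blocker = first_blocker(checks)
if blocker is not None:
    print(f"Stop: unresolved pre-flight check '{blocker}'")
```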

Step 2: Choose a Pool Type

Different pool types have different cost, detectability, and ethics. Pick the cheapest tier that solves your block.

| Pool type | Detectability | Cost | Best for |
|---|---|---|---|
| Datacenter | High (easily blocked by Cloudflare/Akamai) | $ | Sites with no real anti-bot, geo-shifting only |
| Residential | Low (real ISP IPs) | $$$ | Sites that block datacenter ASNs |
| Mobile | Very low (carrier-grade NAT, shared with thousands) | $$$$ | Sites that even block residential (rare) |

Ethical caveat for residential and mobile: these pools route your traffic through real consumer connections. Operator consent models vary — some pay users, some bundle exit-node consent into "free VPN" EULAs that users do not read. Prefer providers with audited, opt-in consent. If you would not be comfortable with a stranger sending scraping traffic through your home router, do not send yours through theirs.

Got: A documented choice with the cheapest viable tier and a brief note on why higher tiers were rejected (or why a higher tier is needed).

If fail:

  • Datacenter blocked but residential over budget — narrow scraping scope (fewer URLs, slower cadence) before upgrading the tier
  • Cannot find a provider with documented opt-in consent — reconsider whether the scraping is necessary
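
The "cheapest tier that solves your block" rule can be written down as a small helper. This is an illustrative sketch only: the tier order comes from the cost column of the table in this step, while the function name `cheapest_viable_tier` and its parameters are hypothetical.

```python
TIERS = ["datacenter", "residential", "mobile"]  # cheapest first


def cheapest_viable_tier(blocked, max_tier="mobile"):
    """Return the cheapest tier not known to be blocked, within the budget ceiling.

    blocked:  set of tier names the target is known to reject
    max_tier: most expensive tier the budget allows
    Returns None when every affordable tier is blocked; in that case narrow
    the scraping scope before upgrading the tier.
    """
    affordable = TIERS[: TIERS.index(max_tier) + 1]
    for tier in affordable:
        if tier not in blocked:
            return tier
    return None


print(cheapest_viable_tier({"datacenter"}))                         # residential
print(cheapest_viable_tier({"datacenter"}, max_tier="datacenter"))  # None
```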

Step 3: Integrate Rotation with Scrapling

Wire the proxy into scrapling fetchers. Read credentials from environment variables — never hard-code, never commit a .env.

import os
import random
from scrapling import Fetcher, StealthyFetcher

# Pattern A: provider-managed rotating endpoint (one URL, provider rotates per request)
PROXY_URL = os.environ["SCRAPING_PROXY_URL"]  # http://user:[email protected]:7777

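# The configure()/get() calls below follow this skill's convention. Fetcher method
# names differ between scrapling releases, so match them to your installed version.
fetcher = StealthyFetcher()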
fetcher.configure(
    headless=True,
    timeout=60,
    network_idle=True,
    proxy=PROXY_URL,
)

# Pattern B: explicit pool, rotate yourself
POOL = os.environ["SCRAPING_PROXY_POOL"].split(",")  # comma-separated URLs

def fetch_with_rotation(url):
    proxy = random.choice(POOL)
    fetcher = StealthyFetcher()
    fetcher.configure(headless=True, timeout=60, proxy=proxy)
    return fetcher.get(url)

Got: Requests succeed and the egress IP varies between calls. Confirm by hitting an IP-echo endpoint (e.g. https://api.ipify.org) before running the real scrape.

If fail:

  • 407 Proxy Authentication Required — credentials wrong or password URL-encoding broke (re-encode special characters)
  • Same IP on every call — provider endpoint may be sticky by default; check docs for a -rotating or per-request flag
  • Massive latency increase — expected; rotation adds 200–2000ms per request
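
The 407 case caused by unencoded special characters can be avoided by percent-encoding the credentials when the proxy URL is assembled. A minimal sketch using the standard-library `urllib.parse.quote`; the helper name `build_proxy_url` and the host `proxy.example.com` are placeholders, not values from any provider.

```python
from urllib.parse import quote, unquote, urlsplit


def build_proxy_url(user, password, host, port, scheme="http"):
    """Assemble a proxy URL with percent-encoded credentials.

    safe="" makes quote() encode every reserved character, including
    '@', ':' and '/', which would otherwise break URL parsing.
    """
    return f"{scheme}://{quote(user, safe='')}:{quote(password, safe='')}@{host}:{port}"


url = build_proxy_url("user", "p@ss:w/rd", "proxy.example.com", 7777)
parts = urlsplit(url)
# The parsed password is still percent-encoded; unquote() recovers the original.
assert unquote(parts.password) == "p@ss:w/rd"
assert parts.hostname == "proxy.example.com"
```

In practice the user and password would come from environment variables, and the assembled URL would never be logged.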

Step 4: Sticky Sessions and Pool Health

Decide rotation granularity per workload, then keep the pool healthy.

import time

# Sticky session for stateful flows (login, multi-page checkout-like crawls)
# Most providers expose a session ID via the username:
#   user-session-abc123:[email protected]:7777
# All requests with the same session ID exit through the same IP for ~10 min.

# Per-request rotation for anonymous bulk scraping (default)

# Pool health check: call before bulk run
def check_pool(pool, sample_size=5):
    """Return the sampled proxies that can reach the IP-echo endpoint."""
    sample = random.sample(pool, min(sample_size, len(pool)))
    alive = []
    for proxy in sample:
        try:
            # Configure first, then fetch, as in Step 3 (configure() is not chained)
            fetcher = StealthyFetcher()
            fetcher.configure(proxy=proxy, timeout=10)
            r = fetcher.get("https://api.ipify.org")
            if r.status == 200:
                alive.append(proxy)
        except Exception:
            pass  # treat any error as a dead proxy
    return alive

# Backoff on transient proxy failures
def fetch_with_backoff(url, max_attempts=3):
    """Retry with exponential backoff; return None if every attempt fails."""
    for attempt in range(max_attempts):
        try:
            r = fetch_with_rotation(url)
            if r.status not in (407, 502, 503):
                return r
        except Exception:
            pass  # network or proxy error: fall through to the backoff
        if attempt < max_attempts - 1:
            time.sleep(2 ** attempt)  # 1s, 2s, ... between attempts
    return None

Got: Stateful flows preserve cookies across requests; bulk anonymous scraping shows IP variance across requests; dead proxies skipped instead of looping.

If fail:

  • Login breaks mid-flow — rotation happening inside the session; switch to sticky-session credentials
  • All proxies in sample fail health check — pool exhausted or credentials expired; rotate credentials or contact provider
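
Sticky-session credentials can be generated per workflow rather than typed by hand. The sketch below assumes the `user-session-<id>` username convention mentioned in this step. Providers differ, so treat the format as an assumption and check your provider's documentation. The helper name `sticky_proxy_url` and the host `proxy.example.com` are hypothetical.

```python
import uuid
from urllib.parse import quote


def sticky_proxy_url(user, password, host, port, session_id=None):
    """Return (proxy_url, session_id) for a sticky session.

    Assumes the provider pins the exit IP to a session ID embedded in the
    username as '<user>-session-<id>'; adjust to your provider's format.
    """
    session_id = session_id or uuid.uuid4().hex[:8]
    username = f"{user}-session-{session_id}"
    url = f"http://{quote(username, safe='')}:{quote(password, safe='')}@{host}:{port}"
    return url, session_id


# Reuse one URL for every request in a stateful flow so cookies and IP stay paired.
login_proxy, session = sticky_proxy_url("user", "pass", "proxy.example.com", 7777)
```

Generating a fresh session ID per flow keeps unrelated flows from sharing an exit IP by accident.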

Step 5: Monitoring, Cost Control, and Kill Switch

Proxy traffic has a per-GB cost and a per-request cost. Runaway scrapers generate runaway invoices. Always include limits and an abort.

import time

class ScrapeBudget:
    def __init__(self, max_requests, max_duration_seconds, max_failures):
        self.max_requests = max_requests
        self.max_duration = max_duration_seconds
        self.max_failures = max_failures
        self.requests = 0
        self.failures = 0
        self.start = time.monotonic()

    def allow(self):
        if self.requests >= self.max_requests:
            return False, "request cap reached"
        if time.monotonic() - self.start >= self.max_duration:
            return False, "time cap reached"
        if self.failures >= self.max_failures:
            return False, "failure cap reached (circuit breaker)"
        return True, None

    def record(self, success):
        self.requests += 1
        if not success:
            self.failures += 1

budget = ScrapeBudget(max_requests=1000, max_duration_seconds=3600, max_failures=20)

for url in target_urls:
    ok, reason = budget.allow()
    if not ok:
        print(f"Aborting: {reason}")
        break
    response = fetch_with_backoff(url)
    budget.record(success=response is not None)
    time.sleep(1)  # rate limiting still applies even with rotation

Got: Budget caps trigger before runaway cost. Logs show per-proxy success rate so a bad egress IP can be identified and excluded.

If fail:

  • Failure rate climbs above 20% — pause; the site has detected the rotation pattern (e.g. all your IPs share a subnet); switch pool type or stop
  • Cost-per-record exceeds expectations by 5x — cache aggressively, deduplicate URLs, batch where possible
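
Per-proxy success rates need their own counter, since `ScrapeBudget` only tracks totals. A minimal sketch follows; the `ProxyStats` class and its thresholds are hypothetical, with `min_rate=0.8` mirroring the 20% failure threshold above. Key the counters by a non-secret label (for example a pool index), not by the credential-bearing proxy URL, so that logs stay clean.

```python
from collections import defaultdict


class ProxyStats:
    """Track success counts per proxy label to spot bad egress IPs."""

    def __init__(self):
        self._ok = defaultdict(int)
        self._total = defaultdict(int)

    def record(self, label, success):
        self._total[label] += 1
        if success:
            self._ok[label] += 1

    def success_rate(self, label):
        """Return the success fraction, or None if the label was never used."""
        total = self._total.get(label, 0)
        if total == 0:
            return None
        return self._ok.get(label, 0) / total

    def unhealthy(self, min_requests=5, min_rate=0.8):
        """Labels with enough traffic to judge and a success rate below min_rate."""
        return sorted(
            label
            for label, total in self._total.items()
            if total >= min_requests and self._ok.get(label, 0) / total < min_rate
        )


stats = ProxyStats()
for outcome in (True, True, False, False, False):
    stats.record("proxy-b", outcome)
print(stats.unhealthy())  # ['proxy-b']
```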

Validation

  • Step 1 legality check is documented in writing before any code runs
  • No proxy credentials, pool URLs, or session IDs in tracked files (grep for gateway., proxy=, the provider hostname)
  • .env (or equivalent) is in .gitignore
  • Pool choice justified: cheapest viable tier, with consent model verified for residential/mobile
  • IP variance confirmed against an echo endpoint before the real run
  • Stateful flows use sticky sessions; bulk anonymous use per-request
  • Budget caps (requests, duration, failures) wired and tested
  • Rate limiting (≥1s) preserved — rotation is not an excuse to flood
  • robots.txt still respected — rotation does not override it
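
The credential check in this list can be partly automated. The sketch below flags any URL that embeds `user:password@` credentials in a piece of text. It complements a grep for the provider hostname rather than replacing it. The regex and the name `find_credential_urls` are illustrative assumptions, and `proxy.example.com` is a placeholder host.

```python
import re

# scheme://user:password@host, stopping at whitespace, quotes, or a path slash
CREDENTIAL_URL_RE = re.compile(
    r"[a-z][a-z0-9+.-]*://[^\s/:@\"']+:[^\s/@\"']+@[^\s/\"'<>]+",
    re.IGNORECASE,
)


def find_credential_urls(text):
    """Return (line_number, matched_url) pairs for URLs with inline credentials."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for match in CREDENTIAL_URL_RE.finditer(line):
            hits.append((lineno, match.group(0)))
    return hits


sample = (
    'PROXY_URL = os.environ["SCRAPING_PROXY_URL"]\n'
    'proxy = "http://user:secret@proxy.example.com:7777"\n'
)
for lineno, url in find_credential_urls(sample):
    print(f"line {lineno}: inline proxy credentials found")
```

Running this over the contents of tracked files before a commit gives an early warning. It will not catch credentials split across variables or stored in other formats.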

Pitfalls

  • Rotating before stealth is exhausted: the site often does not need a new IP — it needs a realistic User-Agent, TLS fingerprint, and slower cadence. Try StealthyFetcher and rate limiting first; rotation is expensive and unethical to deploy unnecessarily.
  • Hard-coded credentials: pasting the proxy URL into source leaks it to git, container images, and stack traces. Read from environment variables or a secrets manager.
  • Rotating mid-session: per-request rotation breaks any flow that depends on cookies, CSRF tokens, or shopping-cart state. Use sticky sessions for stateful work.
  • Treating rotation as "ethical anonymity": rotation hides you from the target, but does not make harmful scraping ethical. ToS, copyright, privacy law, and rate-limit ethics still apply unchanged.
  • Using residential proxies for high-risk activity: credential stuffing, sneaker botting, geo-pirating streaming content, fraud — explicitly out of scope. If your use case looks like this, stop.
  • Ignoring robots.txt because "we have rotation now": rotation does not grant permission. The directive is the directive.
  • No kill switch: an unsupervised loop on a metered proxy pool turns into a four-figure invoice overnight. Always cap requests, duration, and failures.
  • Choosing a residential pool with opaque consent: some providers source exit nodes from "free VPN" EULAs that real users never read. Pay the premium for an audited, opt-in consent model.

Related Skills


GitHub Repository

pjt222/agent-almanac
Path: i18n/caveman-lite/skills/rotate-scraping-proxies
