
rotate-scraping-proxies

pjt222
Updated yesterday

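About

This skill provides proxy rotation for web scraping when standard stealth techniques fail, with support for datacenter, residential, and mobile proxy pools plus session management and monitoring. It is designed as a last-resort escalation after headless scraping approaches have been blocked, and it emphasizes ethical use and cost awareness. Developers should implement it only when they face persistent 403/429 errors and their scraping traffic is legitimate.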

Quick Install

Claude Code

Recommended (default)
npx skills add pjt222/agent-almanac -a claude-code
Plugin command (alternative)
/plugin add https://github.com/pjt222/agent-almanac
Git clone (alternative)
git clone https://github.com/pjt222/agent-almanac.git ~/.claude/skills/rotate-scraping-proxies

Copy and paste one of these commands into Claude Code to install the skill.

Documentation

Rotate Scraping Proxies

Network-layer escalation when client stealth exhausted. Last resort, not default — expensive, ethically charged, easily misused. Skill teaches when not to use as much as how.

Use When

  • headless-web-scraping (Fetcher → StealthyFetcher → DynamicFetcher) tried + still 403/429/geo-block
  • Already throttled to ≥3s between reqs + robots.txt permits path
  • UA + TLS fingerprint realistic (not default python-requests)
  • Scrape legit: public data, no auth bypass, no paywall, no personal data w/o legal basis
  • Budget for proxy traffic + accept ops complexity

Don't use → public API exists, ToS forbids automation, geo-license circumvention, fraud|cred-stuff|sneaker-bot|piracy.

In

  • Required: Target URLs + legal basis
  • Required: Proxy creds (env var, never hard-code)
  • Required: Pool type — datacenter|residential|mobile
  • Optional: Geo target (country|region|city)
  • Optional: Rotation granularity — per-req (default)|sticky session
  • Optional: Daily traffic|spend cap
  • Optional: Rate delay s (default 1, even w/ rotation)
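
The inputs above can be captured in one validated config object so a run cannot start with a missing legal basis or a sub-1s delay. A minimal sketch, assuming a hypothetical `ScrapeInputs` dataclass and a hypothetical `SCRAPING_POOL_TYPE` env var (only `SCRAPING_PROXY_URL` appears in this skill):

```python
import os
from dataclasses import dataclass
from typing import Optional

POOL_TYPES = ("datacenter", "residential", "mobile")


@dataclass(frozen=True)
class ScrapeInputs:
    """Hypothetical container for the required and optional inputs listed above."""
    target_urls: tuple
    legal_basis: str
    proxy_url: str                          # read from env, never a literal
    pool_type: str
    geo_target: Optional[str] = None
    sticky_session: bool = False            # per-request rotation is the default
    daily_request_cap: Optional[int] = None
    rate_delay_s: float = 1.0               # default 1s, even with rotation

    def __post_init__(self):
        if not self.target_urls:
            raise ValueError("at least one target URL is required")
        if not self.legal_basis.strip():
            raise ValueError("legal basis must be written down first")
        if not self.proxy_url:
            raise ValueError("proxy credentials are required (set via env)")
        if self.pool_type not in POOL_TYPES:
            raise ValueError(f"pool_type must be one of {POOL_TYPES}")
        if self.rate_delay_s < 1.0:
            raise ValueError("rate delay must stay >= 1s even with rotation")


def inputs_from_env(target_urls, legal_basis, env=os.environ):
    # SCRAPING_POOL_TYPE is an assumed variable name, not part of the skill.
    return ScrapeInputs(
        target_urls=tuple(target_urls),
        legal_basis=legal_basis,
        proxy_url=env.get("SCRAPING_PROXY_URL", ""),
        pool_type=env.get("SCRAPING_POOL_TYPE", ""),
    )
```

Passing `env` explicitly keeps the loader testable without touching the real environment.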

Do

Step 1: Pre-flight Legality+Ethics

Gate workflow on documented review. Skip = biggest harm source.

# Inputs to confirm before writing any code:
# 1. Is the data public (no login required)?
# 2. Does robots.txt permit the path?
# 3. Does the site's ToS prohibit automated access? (read it)
# 4. Would the scraping process personal data? If yes, what is the legal basis?
# 5. Could this access circumvent geo-licensing, paywalls, or auth?
# 6. Is there a public API or data dump that would make scraping unnecessary?
# 7. Have you contacted the site owner if scope is large?

→ Every Q has defensible written ans. First unfavorable|"unknown" ans stops proc (Q3, Q5, Q6: "yes" is the stopping ans).

If err:

  • ToS forbids → don't proceed; contact owner|use API|licensed dataset
  • Personal data no basis → don't proceed; engage privacy counsel
  • Auth|geo-license bypass → don't proceed under any circumstances
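
The gate above can be made mechanical so code cannot run before the review is written. A minimal sketch, assuming a hypothetical `preflight_gate` helper; the question wording is condensed from the checklist, and the "allowed answer" for each question is an interpretation of it, not something the skill defines:

```python
# (question, answers that let the workflow continue)
PREFLIGHT = (
    ("Data is public (no login required)", ("yes",)),
    ("robots.txt permits the path", ("yes",)),
    ("ToS prohibits automated access", ("no",)),
    ("Personal data processed without a legal basis", ("no",)),
    ("Access circumvents geo-licensing, paywalls, or auth", ("no",)),
    ("Public API or data dump makes scraping unnecessary", ("no",)),
    ("Site owner contacted if scope is large", ("yes", "n/a")),
)


def preflight_gate(answers):
    """answers maps question -> (answer, written_note).

    Returns (True, msg) only if every question has an allowed answer AND a
    non-empty written note; a missing question counts as "unknown".
    """
    for question, allowed in PREFLIGHT:
        answer, note = answers.get(question, ("unknown", ""))
        if answer not in allowed:
            return False, f"stop: {question!r} answered {answer!r}"
        if not note.strip():
            return False, f"stop: {question!r} has no written justification"
    return True, "all checks documented"
```

The function returns on the first failing question, which mirrors the rule that the first unfavorable or unknown answer stops the process.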

Step 2: Pool Type

Pools differ in cost, detectability, ethics. Pick cheapest tier that solves block.

| Pool type | Detectability | Cost | Best for |
|---|---|---|---|
| Datacenter | High (easily blocked by Cloudflare/Akamai) | $ | Sites with no real anti-bot, geo-shifting only |
| Residential | Low (real ISP IPs) | $$$ | Sites that block datacenter ASNs |
| Mobile | Very low (carrier-grade NAT, shared with thousands) | $$$$ | Sites that even block residential (rare) |

Ethical caveat residential+mobile: traffic routes via real consumer connections. Exit-node owner consent varies by provider: some pay participants, some bundle exit-node consent into "free VPN" EULAs nobody reads. Prefer audited opt-in. If you wouldn't be comfortable w/ stranger sending scrape via your home router → don't send via theirs.

→ Documented choice + cheapest viable + brief why higher rejected (or needed).

If err:

  • Datacenter blocked, residential over budget → narrow scope (fewer URLs, slower) before upgrade
  • No documented opt-in consent → reconsider whether scrape needed at all
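
The "cheapest viable tier + documented rejections" rule can be sketched as a small selector. This assumes a hypothetical `choose_pool` helper; the three input sets are things you establish yourself (testing, provider docs, budget), nothing here detects them:

```python
TIERS = ("datacenter", "residential", "mobile")  # cheapest first, per the table


def choose_pool(blocked, consent_verified, affordable):
    """Return (tier, rejected) for the cheapest viable tier, or (None, rejected).

    blocked:          tiers the target is known to block
    consent_verified: tiers whose provider documents opt-in exit-node consent
    affordable:       tiers within the traffic/spend cap
    """
    rejected = {}
    for tier in TIERS:
        if tier in blocked:
            rejected[tier] = "blocked by target"
        elif tier not in affordable:
            rejected[tier] = "over budget: narrow scope before upgrading"
        elif tier != "datacenter" and tier not in consent_verified:
            rejected[tier] = "no documented opt-in consent"
        else:
            return tier, rejected
    return None, rejected
```

The `rejected` dict doubles as the written record of why cheaper tiers were passed over. A `None` result means narrow scope or reconsider the scrape, not reach for a pricier pool.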

Step 3: Integrate Rotation w/ Scrapling

Wire proxy → scrapling fetcher. Read creds from env, never hard-code, never commit .env.

import os
import random
from scrapling import Fetcher, StealthyFetcher

# NOTE: calls below assume Scrapling's StealthyFetcher.fetch(url, ...) form, with
# browser options passed per call. Argument names and units can differ between
# releases, so verify against the installed version.

# Pattern A: provider-managed rotating endpoint (one URL, provider rotates per request)
PROXY_URL = os.environ["SCRAPING_PROXY_URL"]  # http://<user>:<password>@<proxy-host>:7777

def fetch_via_gateway(url):
    return StealthyFetcher.fetch(
        url,
        headless=True,
        timeout=60_000,  # assumed to be milliseconds for browser-based fetchers
        network_idle=True,
        proxy=PROXY_URL,
    )

# Pattern B: explicit pool, rotate yourself
POOL = os.environ["SCRAPING_PROXY_POOL"].split(",")  # comma-separated URLs

def fetch_with_rotation(url):
    proxy = random.choice(POOL)  # new random exit per request
    return StealthyFetcher.fetch(url, headless=True, timeout=60_000, proxy=proxy)

→ Reqs succeed + egress IP varies. Confirm via IP echo (https://api.ipify.org) before real scrape.
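
The IP-echo confirmation can be scripted rather than eyeballed. A minimal sketch, assuming a hypothetical `fetch_text` callable (url → response body as `str`) that wraps whichever fetcher is in use; the threshold of two distinct IPs is an arbitrary illustration:

```python
def egress_ips(fetch_text, samples=5, echo_url="https://api.ipify.org"):
    """Ask the echo service several times and collect the reported egress IPs.

    fetch_text is a caller-supplied function (hypothetical seam): url -> str.
    """
    return [fetch_text(echo_url).strip() for _ in range(samples)]


def rotation_confirmed(ips, min_distinct=2):
    # With per-request rotation, repeated calls should not all share one IP.
    return len(set(ips)) >= min_distinct
```

Injecting the fetch function keeps the check independent of any particular scraping library and lets it be tested offline.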

If err:

  • 407 Proxy Auth Required → wrong creds|URL-encoding broke pwd (re-encode special chars)
  • Same IP every call → endpoint sticky default; check docs for -rotating or per-req flag
  • Massive latency → expected; rotation adds 200–2000ms/req
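
The 407 case caused by special characters in the password can be avoided by percent-encoding credentials when the proxy URL is assembled. A minimal sketch using only the standard library; `build_proxy_url` and the host `proxy.example` are illustrative names, not part of any provider's API:

```python
from urllib.parse import quote


def build_proxy_url(user, password, host, port, scheme="http"):
    """Assemble scheme://user:password@host:port with credentials percent-encoded.

    quote(..., safe="") also encodes "@", ":" and "/", which would otherwise
    be misread as URL delimiters.
    """
    return (
        f"{scheme}://{quote(user, safe='')}:{quote(password, safe='')}"
        f"@{host}:{port}"
    )


# Example (values would normally come from env vars or a secrets manager):
# build_proxy_url("user", "p@ss:w/rd", "proxy.example", 7777)
```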

Step 4: Sticky Sessions + Pool Health

Match rotation granularity to workload + keep pool healthy.

import random
import time

# Reuses StealthyFetcher and fetch_with_rotation from Step 3.

# Sticky session for stateful flows (login, multi-page checkout-like crawls)
# Most providers expose a session ID via the username:
#   user-session-abc123:<password>@<proxy-host>:7777
# All requests with the same session ID exit through the same IP for ~10 min.

# Per-request rotation for anonymous bulk scraping (default)

# Pool health check — call before bulk run
def check_pool(pool, sample_size=5):
    sample = random.sample(pool, min(sample_size, len(pool)))
    alive = []
    for proxy in sample:
        try:
            # Assumes StealthyFetcher.fetch(url, ...) with timeout in milliseconds;
            # verify against the installed Scrapling version.
            r = StealthyFetcher.fetch(
                "https://api.ipify.org", proxy=proxy, timeout=10_000
            )
            if r.status == 200:
                alive.append(proxy)
        except Exception:
            pass  # unreachable proxy: leave it out of the alive list
    return alive

# Backoff on transient proxy failures
def fetch_with_backoff(url, max_attempts=3):
    for attempt in range(max_attempts):
        try:
            r = fetch_with_rotation(url)
            if r.status not in (407, 502, 503):
                return r
        except Exception:
            pass
        if attempt < max_attempts - 1:
            time.sleep(2 ** attempt)  # 1s, 2s, ... between attempts; no sleep after the last
    return None

→ Stateful preserves cookies; bulk shows IP variance; dead proxies skipped not looped.

If err:

  • Login breaks mid-flow → rotating in session; switch to sticky-session creds
  • All sample fail health → pool exhausted|creds expired; rotate|contact provider

Step 5: Monitor + Cost + Kill Switch

Proxies bill per-GB + per-req. Runaway scrape → runaway invoice. Always cap+abort.

import time

class ScrapeBudget:
    def __init__(self, max_requests, max_duration_seconds, max_failures):
        self.max_requests = max_requests
        self.max_duration = max_duration_seconds
        self.max_failures = max_failures
        self.requests = 0
        self.failures = 0
        self.start = time.monotonic()

    def allow(self):
        if self.requests >= self.max_requests:
            return False, "request cap reached"
        if time.monotonic() - self.start >= self.max_duration:
            return False, "time cap reached"
        if self.failures >= self.max_failures:
            return False, "failure cap reached (circuit breaker)"
        return True, None

    def record(self, success):
        self.requests += 1
        if not success:
            self.failures += 1

budget = ScrapeBudget(max_requests=1000, max_duration_seconds=3600, max_failures=20)

for url in target_urls:
    ok, reason = budget.allow()
    if not ok:
        print(f"Aborting: {reason}")
        break
    response = fetch_with_backoff(url)
    budget.record(success=response is not None)
    time.sleep(1)  # rate limiting still applies even with rotation

→ Caps trigger before runaway. Logs show per-proxy success → identify+exclude bad IP.

If err:

  • Fail rate >20% → pause; site detected pattern (shared subnet); switch type|stop
  • Cost-per-record 5x → cache, dedup URLs, batch
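
The budget loop above counts failures globally. Spotting a bad IP and applying the >20% pause rule both need per-proxy counts. A minimal sketch, assuming a hypothetical `ProxyStats` class; `min_samples` is an added assumption so a single early failure does not trip the pause:

```python
from collections import defaultdict


class ProxyStats:
    """Success/failure counts per proxy label.

    Key by a label (e.g. pool index), never the full URL, since that
    carries credentials and would leak into logs.
    """

    def __init__(self, max_fail_rate=0.20, min_samples=5):
        self.max_fail_rate = max_fail_rate
        self.min_samples = min_samples
        self.ok = defaultdict(int)
        self.failed = defaultdict(int)

    def record(self, label, success):
        (self.ok if success else self.failed)[label] += 1

    def _counts(self, label=None):
        if label is None:  # whole pool
            return sum(self.ok.values()), sum(self.failed.values())
        return self.ok[label], self.failed[label]

    def fail_rate(self, label=None):
        ok, failed = self._counts(label)
        total = ok + failed
        return failed / total if total else 0.0

    def _over_limit(self, label=None):
        ok, failed = self._counts(label)
        enough = ok + failed >= self.min_samples
        return enough and self.fail_rate(label) > self.max_fail_rate

    def bad_proxies(self):
        labels = set(self.ok) | set(self.failed)
        return sorted(label for label in labels if self._over_limit(label))

    def should_pause(self):
        return self._over_limit()
```

In the scrape loop, call `record(label, success)` next to `budget.record(...)`, drop anything in `bad_proxies()` from the pool, and stop when `should_pause()` returns true.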

Check

  • Step 1 legality documented written before code
  • No proxy creds|pool URLs|session IDs in tracked files (grep gateway., proxy=, hostname)
  • .env in .gitignore
  • Pool justified: cheapest viable + consent verified for residential|mobile
  • IP variance confirmed vs echo before real run
  • Stateful → sticky; bulk → per-req
  • Budget caps (req|dur|fail) wired+tested
  • Rate limit (≥1s) preserved — rotation ≠ flood excuse
  • robots.txt respected — rotation doesn't override
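
The last two checks can be automated with the standard library's robots.txt parser. A minimal sketch; the function names are illustrative, and the robots.txt text is passed in as a string so fetching it stays the caller's job:

```python
from urllib.robotparser import RobotFileParser


def allowed_by_robots(robots_txt, url, user_agent="*"):
    """True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


def effective_delay(robots_txt, floor_s=1.0, user_agent="*"):
    """Seconds to wait between requests: the larger of the 1s floor and any
    Crawl-delay the site declares. Rotation never lowers this."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    crawl_delay = parser.crawl_delay(user_agent)  # None if not declared
    return max(floor_s, float(crawl_delay or 0))
```

Run both before each bulk job, not once per project, since sites change their robots.txt.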

Traps

  • Rotate before stealth exhausted: Site needs realistic UA, TLS, slower cadence — not new IP. Try StealthyFetcher+rate first.
  • Hard-coded creds: Source file leaks → git, images, traces. Always env|secrets manager.
  • Rotate mid-session: Per-req breaks cookies|CSRF|cart. Sticky for stateful.
  • "Ethical anonymity" myth: Rotation hides you from target → doesn't make harmful scrape ethical. ToS, copyright, privacy law, rate-ethics still apply.
  • Residential for high-risk: Cred stuff, sneaker, geo-pirate, fraud → out of scope. Stop.
  • Ignore robots.txt because rotation: Doesn't grant permission. Directive=directive.
  • No kill switch: Unsupervised loop on metered pool → 4-figure invoice overnight. Always cap.
  • Opaque consent residential: Some src exit nodes from "free VPN" EULAs unread. Pay premium for audited opt-in.
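
The hard-coded-creds trap can be caught before commit with a simple scan for URLs that embed `user:password@`. A minimal sketch; the regex is a heuristic that will miss obfuscated secrets, and `find_inline_credentials` is an illustrative name:

```python
import re

# scheme://user:password@host : the shape of a proxy URL with inline credentials
CRED_URL = re.compile(
    r"\b[a-z][a-z0-9+.-]*://[^\s/:@]+:[^\s/@]+@[^\s/]+",
    re.IGNORECASE,
)


def find_inline_credentials(text):
    """Return 1-based line numbers that contain a credentialed URL."""
    return [
        lineno
        for lineno, line in enumerate(text.splitlines(), start=1)
        if CRED_URL.search(line)
    ]
```

A plain `host:port` URL does not match, because the pattern requires an `@` after the colon-separated pair and before any slash.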


GitHub Repository

pjt222/agent-almanac
Path: i18n/caveman-ultra/skills/rotate-scraping-proxies
