
rotate-scraping-proxies

pjt222
Updated 2 days ago

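About

This skill implements provider-neutral proxy rotation for getting past blocks once client-side stealth techniques have failed. It lets you choose among datacenter, residential, and mobile proxy pools, and supports session persistence for stateful workflows. Developers should use it only for legitimate traffic, while monitoring cost and staying within ethical limits.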

Quick Install

Claude Code

Recommended (main)
npx skills add pjt222/agent-almanac -a claude-code
Plugin command (alternative)
/plugin add https://github.com/pjt222/agent-almanac
Git clone (alternative)
git clone https://github.com/pjt222/agent-almanac.git ~/.claude/skills/rotate-scraping-proxies

Copy and paste this command into Claude Code to install the skill.

Documentation

Rotate Scraping Proxies

Network-layer escalation for scraping where client-side stealth is exhausted. Proxy rotation is a last resort, not a default — expensive, ethically charged, and easily misused. This skill teaches when not to use it as much as how.

When to Use

  • headless-web-scraping (Fetcher → StealthyFetcher → DynamicFetcher) tried and target still returns 403/429/geo-blocks
  • Rate limiting already at 3+ second intervals and robots.txt permits the path
  • User-Agent and TLS fingerprint already realistic (not default python-requests)
  • Scraping is legitimate: public data, no auth circumvention, no paywall bypass, no personal data harvested without legal basis
  • You can budget proxy traffic and accept operational complexity

Do not use when: a public API exists (use it), the site's ToS forbids automated access, you would circumvent geo-licensing, or the goal is fraud / credential stuffing / sneaker bots / content piracy.

Inputs

  • Required: Target URLs and the legal basis for scraping them
  • Required: Proxy pool credentials (read from environment, never hard-coded)
  • Required: Pool type — datacenter, residential, or mobile
  • Optional: Geographic targeting (country / region / city)
  • Optional: Rotation granularity — per-request (default) or sticky session
  • Optional: Daily traffic / spend cap
  • Optional: Rate limit delay in seconds (default: 1, even with rotation)
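
The inputs above could be collected into one validated object before any request is sent. The following is a minimal sketch, not part of the skill itself: only SCRAPING_PROXY_URL appears in this document, while the other environment variable names, the ScrapeInputs class, and load_inputs are hypothetical. Target URLs are left out and would be passed separately.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

POOL_TYPES = {"datacenter", "residential", "mobile"}
ROTATION_MODES = {"per-request", "sticky"}


@dataclass(frozen=True)
class ScrapeInputs:
    proxy_url: str                            # required, read from the environment only
    pool_type: str                            # required: datacenter / residential / mobile
    legal_basis: str                          # required: written justification
    geo: Optional[str] = None                 # optional: country / region / city
    rotation: str = "per-request"             # optional: per-request (default) or sticky
    daily_request_cap: Optional[int] = None   # optional traffic cap
    delay_seconds: float = 1.0                # optional: default 1, even with rotation


def load_inputs(legal_basis: str, env: Mapping[str, str] = os.environ) -> ScrapeInputs:
    """Build ScrapeInputs from environment variables, failing fast on bad values."""
    proxy_url = env.get("SCRAPING_PROXY_URL", "")
    pool_type = env.get("SCRAPING_POOL_TYPE", "")           # hypothetical variable name
    rotation = env.get("SCRAPING_ROTATION", "per-request")  # hypothetical variable name
    if not proxy_url:
        raise ValueError("SCRAPING_PROXY_URL is not set")
    if pool_type not in POOL_TYPES:
        raise ValueError(f"pool type must be one of {sorted(POOL_TYPES)}")
    if rotation not in ROTATION_MODES:
        raise ValueError(f"rotation must be one of {sorted(ROTATION_MODES)}")
    if not legal_basis.strip():
        raise ValueError("a written legal basis is required")
    cap = env.get("SCRAPING_DAILY_CAP")                     # hypothetical variable name
    return ScrapeInputs(
        proxy_url=proxy_url,
        pool_type=pool_type,
        legal_basis=legal_basis,
        geo=env.get("SCRAPING_GEO"),                        # hypothetical variable name
        rotation=rotation,
        daily_request_cap=int(cap) if cap else None,
        delay_seconds=float(env.get("SCRAPING_DELAY_SECONDS", "1")),
    )
```

Failing at load time keeps a missing legal basis or a mistyped pool type from surfacing only after paid traffic has started.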

Procedure

Step 1: Pre-flight Legality and Ethics Check

Gate the workflow on a documented legal and ethical review. Skipping this is the single biggest source of harm.

# Inputs to confirm before writing any code:
# 1. Is the data public (no login required)?
# 2. Does robots.txt permit the path?
# 3. Does the site's ToS prohibit automated access? (read it)
# 4. Would the scraping process personal data? If yes, what is the legal basis?
# 5. Could this access circumvent geo-licensing, paywalls, or auth?
# 6. Is there a public API or data dump that would make scraping unnecessary?
# 7. Have you contacted the site owner if scope is large?

Got: Every question has a defensible written answer. The first "no" or "unknown" stops the procedure until resolved.

If fail:

  • ToS forbids automated access — do not proceed; contact the site owner or use an official API or licensed dataset
  • Personal data with no legal basis — do not proceed; engage privacy counsel
  • Circumvents auth or geo-licensing — do not proceed under any circumstances
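
One way to make the gate hard to skip is to encode the checklist as data and refuse to run while any answer is missing. This is a sketch under assumptions: the question keys, preflight_blockers, and robots_allows are hypothetical names, and question 7 (contacting the site owner) is left as a manual step. The robots.txt check uses the standard library parser on content you have already downloaded.

```python
from urllib.robotparser import RobotFileParser

# Answers that allow the run to continue, keyed by the Step 1 questions.
REQUIRED_ANSWERS = {
    "data_is_public": "yes",
    "robots_permits_path": "yes",
    "tos_prohibits_automation": "no",
    "circumvents_geo_paywall_or_auth": "no",
    "public_api_or_dump_exists": "no",
}


def preflight_blockers(answers):
    """Return the questions whose answer is missing, 'unknown', or disqualifying."""
    blockers = []
    for question, required in REQUIRED_ANSWERS.items():
        if answers.get(question, "unknown") != required:
            blockers.append(question)
    # Personal data passes only with an explicit "no", or "yes" plus a legal basis.
    personal = answers.get("processes_personal_data", "unknown")
    has_basis = bool(answers.get("legal_basis"))
    if not (personal == "no" or (personal == "yes" and has_basis)):
        blockers.append("processes_personal_data")
    return blockers


def robots_allows(robots_txt, user_agent, url):
    """Check a URL against robots.txt content that was already downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

An empty list from preflight_blockers means every question has a recorded answer; any entry in it stops the procedure until resolved.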

Step 2: Choose a Pool Type

Different pool types have different cost, detectability, and ethics. Pick the cheapest tier that solves your block.

| Pool type | Detectability | Cost | Best for |
|---|---|---|---|
| Datacenter | High (easily blocked by Cloudflare/Akamai) | $ | Sites with no real anti-bot, geo-shifting only |
| Residential | Low (real ISP IPs) | $$$ | Sites that block datacenter ASNs |
| Mobile | Very low (carrier-grade NAT, shared with thousands) | $$$$ | Sites that even block residential (rare) |

Ethical caveat for residential and mobile: these pools route your traffic through real consumer connections. Operator consent models vary — some pay users, some bundle exit-node consent into "free VPN" EULAs that users do not read. Prefer providers with audited, opt-in consent. If you would not be comfortable with a stranger sending scraping traffic through your home router, do not send yours through theirs.

Got: A documented choice with the cheapest viable tier and a brief note on why higher tiers were rejected (or why a higher tier is needed).

If fail:

  • Datacenter blocked but residential over budget — narrow scraping scope (fewer URLs, slower cadence) before upgrading the tier
  • Cannot find a provider with documented opt-in consent — reconsider whether the scraping is necessary
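
The "cheapest viable tier" rule can be written down as a small function, which also gives you something to paste into the documented choice. This is an illustrative sketch: choose_pool and its parameters are hypothetical, and the ordering simply follows the cost column of the table in this step.

```python
# Tiers ordered from cheapest to most expensive, following the Step 2 table.
TIERS = ["datacenter", "residential", "mobile"]


def choose_pool(blocked, max_tier="mobile"):
    """Return the cheapest tier that is not blocked and within budget, else None."""
    affordable = TIERS[: TIERS.index(max_tier) + 1]
    for tier in affordable:
        if tier not in blocked:
            return tier
    return None  # nothing viable: narrow the scope or stop
```

A None result corresponds to the failure cases above: narrow the scraping scope before raising the budget ceiling.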

Step 3: Integrate Rotation with Scrapling

Wire the proxy into scrapling fetchers. Read credentials from environment variables — never hard-code, never commit a .env.

import os
import random
from scrapling import Fetcher, StealthyFetcher

# Pattern A: provider-managed rotating endpoint (one URL, provider rotates per request)
PROXY_URL = os.environ["SCRAPING_PROXY_URL"]  # e.g. http://<user>:<password>@<proxy-host>:7777

fetcher = StealthyFetcher()
fetcher.configure(
    headless=True,
    timeout=60,
    network_idle=True,
    proxy=PROXY_URL,
)

# Pattern B: explicit pool, rotate yourself
POOL = os.environ["SCRAPING_PROXY_POOL"].split(",")  # comma-separated URLs

def fetch_with_rotation(url):
    proxy = random.choice(POOL)
    fetcher = StealthyFetcher()
    fetcher.configure(headless=True, timeout=60, proxy=proxy)
    return fetcher.get(url)

Got: Requests succeed and the egress IP varies between calls. Confirm by hitting an IP-echo endpoint (e.g. https://api.ipify.org) before running the real scrape.

If fail:

  • 407 Proxy Authentication Required — credentials wrong or password URL-encoding broke (re-encode special characters)
  • Same IP on every call — provider endpoint may be sticky by default; check docs for a -rotating or per-request flag
  • Massive latency increase — expected; rotation adds 200–2000ms per request
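
The 407 case caused by special characters in a password can usually be avoided by percent-encoding the credentials before assembling the URL. A minimal sketch using only the standard library follows; build_proxy_url is a hypothetical helper, and the host and port are placeholders.

```python
from urllib.parse import quote


def build_proxy_url(user, password, host, port, scheme="http"):
    """Assemble a proxy URL with the user and password percent-encoded."""
    # safe="" also encodes "/", "@" and ":" so they cannot be read as URL delimiters.
    return f"{scheme}://{quote(user, safe='')}:{quote(password, safe='')}@{host}:{port}"
```

In practice the user and password would come from environment variables, and the resulting string would be passed wherever the Step 3 code uses the proxy URL.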

Step 4: Sticky Sessions and Pool Health

Decide rotation granularity per workload, then keep the pool healthy.

import random
import time

from scrapling import StealthyFetcher

# Sticky session for stateful flows (login, multi-page checkout-like crawls)
# Most providers expose a session ID via the username:
#   user-session-abc123:<password>@<proxy-host>:7777
# All requests with the same session ID exit through the same IP for ~10 min.

# Per-request rotation for anonymous bulk scraping (default)

# Pool health check: call before a bulk run
def check_pool(pool, sample_size=5):
    """Return the sampled proxies that answered an IP-echo request with 200."""
    sample = random.sample(pool, min(sample_size, len(pool)))
    alive = []
    for proxy in sample:
        try:
            # Same configure-then-get pattern as Step 3 (no chaining on configure)
            fetcher = StealthyFetcher()
            fetcher.configure(proxy=proxy, timeout=10)
            r = fetcher.get("https://api.ipify.org")
            if r.status == 200:
                alive.append(proxy)
        except Exception:
            pass  # any error counts as a dead proxy
    return alive

# Backoff on transient proxy failures (fetch_with_rotation is defined in Step 3)
def fetch_with_backoff(url, max_attempts=3):
    for attempt in range(max_attempts):
        try:
            r = fetch_with_rotation(url)
            if r.status not in (407, 502, 503):
                return r
        except Exception:
            pass  # network or proxy error: retry through another proxy
        if attempt < max_attempts - 1:
            time.sleep(2 ** attempt)  # 1s, 2s, ... between attempts
    return None

Got: Stateful flows preserve cookies across requests; bulk anonymous scraping shows IP variance across requests; dead proxies skipped instead of looping.

If fail:

  • Login breaks mid-flow — rotation happening inside the session; switch to sticky-session credentials
  • All proxies in sample fail health check — pool exhausted or credentials expired; rotate credentials or contact provider
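
To get "dead proxies skipped instead of looping", the pool needs to remember which proxies have failed. One possible shape is sketched below; ProxyPool, its method names, and the three-failure threshold are assumptions for illustration, not part of scrapling or any provider SDK.

```python
import random


class ProxyPool:
    """Random rotation over a pool, dropping proxies that fail repeatedly."""

    def __init__(self, proxies, max_failures=3, rng=None):
        self._failures = {proxy: 0 for proxy in proxies}
        self._max_failures = max_failures
        self._rng = rng or random.Random()

    def alive(self):
        """Proxies whose consecutive failure count is still under the limit."""
        return [p for p, n in self._failures.items() if n < self._max_failures]

    def pick(self):
        """Return a random live proxy, or raise if the pool is exhausted."""
        candidates = self.alive()
        if not candidates:
            raise RuntimeError("proxy pool exhausted")
        return self._rng.choice(candidates)

    def report(self, proxy, success):
        """Reset the failure count on success, increment it on failure."""
        self._failures[proxy] = 0 if success else self._failures[proxy] + 1
```

A caller would pick() a proxy, make the request, and report() the outcome; the RuntimeError maps to the "pool exhausted or credentials expired" case above.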

Step 5: Monitoring, Cost Control, and Kill Switch

Proxy traffic is metered: you pay per GB, per request, or both, depending on the provider. Runaway scrapers generate runaway invoices. Always include limits and an abort.

import time

class ScrapeBudget:
    def __init__(self, max_requests, max_duration_seconds, max_failures):
        self.max_requests = max_requests
        self.max_duration = max_duration_seconds
        self.max_failures = max_failures
        self.requests = 0
        self.failures = 0
        self.start = time.monotonic()

    def allow(self):
        if self.requests >= self.max_requests:
            return False, "request cap reached"
        if time.monotonic() - self.start >= self.max_duration:
            return False, "time cap reached"
        if self.failures >= self.max_failures:
            return False, "failure cap reached (circuit breaker)"
        return True, None

    def record(self, success):
        self.requests += 1
        if not success:
            self.failures += 1

budget = ScrapeBudget(max_requests=1000, max_duration_seconds=3600, max_failures=20)

# target_urls is your own list of URLs; fetch_with_backoff is defined in Step 4.
for url in target_urls:
    ok, reason = budget.allow()
    if not ok:
        print(f"Aborting: {reason}")
        break
    response = fetch_with_backoff(url)
    budget.record(success=response is not None)
    time.sleep(1)  # rate limiting still applies even with rotation

Got: Budget caps trigger before runaway cost. Logs show per-proxy success rate so a bad egress IP can be identified and excluded.

If fail:

  • Failure rate climbs above 20% — pause; the site has detected the rotation pattern (e.g. all your IPs share a subnet); switch pool type or stop
  • Cost-per-record exceeds expectations by 5x — cache aggressively, deduplicate URLs, batch where possible
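
The ScrapeBudget class counts failures in aggregate, so per-proxy success rates need their own bookkeeping. A minimal sketch follows, assuming the fetch function is adapted to report which proxy it used; ProxyStats, the 20% threshold default, and min_requests are illustrative choices.

```python
from collections import defaultdict


class ProxyStats:
    """Count successes and failures per proxy and for the whole run."""

    def __init__(self):
        self._counts = defaultdict(lambda: [0, 0])  # proxy -> [successes, failures]

    def record(self, proxy, success):
        self._counts[proxy][0 if success else 1] += 1

    def failure_rate(self, proxy=None):
        """Failure rate for one proxy, or for the whole run when proxy is None."""
        if proxy is not None:
            rows = [self._counts.get(proxy, [0, 0])]
        else:
            rows = list(self._counts.values())
        failures = sum(row[1] for row in rows)
        total = sum(row[0] + row[1] for row in rows)
        return failures / total if total else 0.0

    def bad_proxies(self, threshold=0.2, min_requests=5):
        """Proxies with enough traffic whose failure rate exceeds the threshold."""
        return sorted(
            proxy
            for proxy, (ok, bad) in self._counts.items()
            if ok + bad >= min_requests and bad / (ok + bad) > threshold
        )
```

Checking failure_rate() inside the loop gives the "above 20%, pause" signal, and bad_proxies() names the egress IPs worth excluding.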

Validation

  • Step 1 legality check is documented in writing before any code runs
  • No proxy credentials, pool URLs, or session IDs in tracked files (grep for gateway., proxy=, the provider hostname)
  • .env (or equivalent) is in .gitignore
  • Pool choice justified: cheapest viable tier, with consent model verified for residential/mobile
  • IP variance confirmed against an echo endpoint before the real run
  • Stateful flows use sticky sessions; bulk anonymous use per-request
  • Budget caps (requests, duration, failures) wired and tested
  • Rate limiting (≥1s) preserved — rotation is not an excuse to flood
  • robots.txt still respected — rotation does not override it
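
The credential check in the list above can be automated with a scan for URLs that embed a user and password. This is a rough sketch rather than a full secret scanner: the regular expression and find_credential_leaks are assumptions, and it only catches the scheme://user:password@host shape.

```python
import re

# Matches scheme://user:password@host, the shape of a proxy URL with embedded credentials.
CREDENTIAL_URL = re.compile(r"[a-zA-Z][a-zA-Z0-9+.-]*://[^\s/:@]+:[^\s/@]+@[^\s/]+")


def find_credential_leaks(text):
    """Return (line_number, line) pairs that contain a URL with embedded credentials."""
    return [
        (number, line.strip())
        for number, line in enumerate(text.splitlines(), start=1)
        if CREDENTIAL_URL.search(line)
    ]
```

Running it over each tracked file before a commit complements, but does not replace, keeping .env in .gitignore.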

Pitfalls

  • Rotating before stealth is exhausted: the site often does not need a new IP — it needs a realistic User-Agent, TLS fingerprint, and slower cadence. Try StealthyFetcher and rate limiting first; rotation is expensive and unethical to deploy unnecessarily.
  • Hard-coded credentials: pasting the proxy URL into source leaks it to git, container images, and stack traces. Read from environment variables or a secrets manager.
  • Rotating mid-session: per-request rotation breaks any flow that depends on cookies, CSRF tokens, or shopping-cart state. Use sticky sessions for stateful work.
  • Treating rotation as "ethical anonymity": rotation hides you from the target, but does not make harmful scraping ethical. ToS, copyright, privacy law, and rate-limit ethics still apply unchanged.
  • Using residential proxies for high-risk activity: credential stuffing, sneaker botting, geo-pirating streaming content, fraud — explicitly out of scope. If your use case looks like this, stop.
  • Ignoring robots.txt because "we have rotation now": rotation does not grant permission. The directive is the directive.
  • No kill switch: an unsupervised loop on a metered proxy pool turns into a four-figure invoice overnight. Always cap requests, duration, and failures.
  • Choosing a residential pool with opaque consent: some providers source exit nodes from "free VPN" EULAs that real users never read. Pay the premium for an audited, opt-in consent model.


GitHub Repository

pjt222/agent-almanac
Path: i18n/caveman-lite/skills/rotate-scraping-proxies