rotate-scraping-proxies

Name: rotate-scraping-proxies
Author: pjt222

Updated 1 month ago

Category: Development. Tags: ai, api, data

About

This skill implements provider-neutral proxy rotation to get past blocks when client-side stealth techniques fail. It lets you choose between datacenter, residential, or mobile proxy pools, with session persistence for stateful workflows. Developers should use it only for legitimate traffic, while monitoring costs and respecting ethical limits.

Quick install

Claude Code (recommended)

Primary:

npx skills add pjt222/agent-almanac -a claude-code

Alternative (plugin command):

/plugin add https://github.com/pjt222/agent-almanac

Alternative (git clone):

git clone https://github.com/pjt222/agent-almanac.git ~/.claude/skills/rotate-scraping-proxies

Copy and paste the command into Claude Code to install this skill.

Documentation

Rotate Scraping Proxies

Network-layer escalation for scraping where client-side stealth is exhausted. Proxy rotation is a last resort, not a default — expensive, ethically charged, and easily misused. This skill teaches when not to use it as much as how.

When to Use

headless-web-scraping (Fetcher → StealthyFetcher → DynamicFetcher) tried and target still returns 403/429/geo-blocks
Rate limiting already at 3+ second intervals and robots.txt permits the path
User-Agent and TLS fingerprint already realistic (not default python-requests)
Scraping is legitimate: public data, no auth circumvention, no paywall bypass, no personal data harvested without legal basis
You can budget proxy traffic and accept operational complexity

Do not use when: a public API exists (use it), the site's ToS forbids automated access, you would circumvent geo-licensing, or the goal is fraud / credential stuffing / sneaker bots / content piracy.

Inputs

Required: Target URLs and the legal basis for scraping them
Required: Proxy pool credentials (read from environment, never hard-coded)
Required: Pool type — datacenter, residential, or mobile
Optional: Geographic targeting (country / region / city)
Optional: Rotation granularity — per-request (default) or sticky session
Optional: Daily traffic / spend cap
Optional: Rate limit delay in seconds (default: 1, even with rotation)
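
The inputs above could be collected in one validated object before any request is made. The following is a minimal sketch, not part of the skill itself: only SCRAPING_PROXY_URL appears in this document, while ScrapeInputs, load_inputs, and the other variable names (SCRAPING_POOL_TYPE, SCRAPING_GEO, SCRAPING_ROTATION, SCRAPING_DELAY_SECONDS) are hypothetical.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

POOL_TYPES = {"datacenter", "residential", "mobile"}


@dataclass(frozen=True)
class ScrapeInputs:
    proxy_url: str
    pool_type: str
    geo: Optional[str] = None
    sticky: bool = False          # False means per-request rotation (the default)
    delay_seconds: float = 1.0    # rate limit delay, kept even with rotation


def load_inputs(env: Mapping[str, str] = os.environ) -> ScrapeInputs:
    """Build ScrapeInputs from environment variables, failing fast on bad values."""
    proxy_url = env.get("SCRAPING_PROXY_URL", "")
    if not proxy_url:
        raise ValueError("SCRAPING_PROXY_URL is required")
    # Hypothetical variable names from here on; adapt them to your deployment.
    pool_type = env.get("SCRAPING_POOL_TYPE", "").lower()
    if pool_type not in POOL_TYPES:
        raise ValueError(f"SCRAPING_POOL_TYPE must be one of {sorted(POOL_TYPES)}")
    delay = float(env.get("SCRAPING_DELAY_SECONDS", "1"))
    if delay < 1:
        raise ValueError("delay must stay at or above 1 second, even with rotation")
    return ScrapeInputs(
        proxy_url=proxy_url,
        pool_type=pool_type,
        geo=env.get("SCRAPING_GEO") or None,
        sticky=env.get("SCRAPING_ROTATION", "per-request") == "sticky",
        delay_seconds=delay,
    )
```

Passing a plain dict instead of os.environ makes the loader easy to exercise in tests without touching real credentials.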

Procedure

Step 1: Pre-flight Legality and Ethics Check

Gate the workflow on a documented legal and ethical review. Skipping this is the single biggest source of harm.

# Inputs to confirm before writing any code:
# 1. Is the data public (no login required)?
# 2. Does robots.txt permit the path?
# 3. Does the site's ToS prohibit automated access? (read it)
# 4. Would the scraping process personal data? If yes, what is the legal basis?
# 5. Could this access circumvent geo-licensing, paywalls, or auth?
# 6. Is there a public API or data dump that would make scraping unnecessary?
# 7. Have you contacted the site owner if scope is large?

Got: Every question has a defensible written answer. The first unfavorable or "unknown" answer stops the procedure until resolved.

If fail:

ToS forbids automated access — do not proceed; contact the site owner or use an official API or licensed dataset
Personal data with no legal basis — do not proceed; engage privacy counsel
Circumvents auth or geo-licensing — do not proceed under any circumstances
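
The gate in Step 1 can be made mechanical so that code cannot run past an unresolved question. A minimal sketch follows, assuming each question is recorded with an explicit verdict and a written note; Check, first_blocker, and robots_allows are hypothetical names, and the robots.txt check uses the standard library's urllib.robotparser on text you have already downloaded.

```python
from dataclasses import dataclass
from typing import List, Optional
from urllib import robotparser


@dataclass
class Check:
    question: str
    acceptable: Optional[bool]  # True = fine to proceed, False = blocker, None = unknown
    note: str = ""              # the written justification


def first_blocker(checks: List[Check]) -> Optional[Check]:
    """Return the first check that is failed, unknown, or missing its written note."""
    for check in checks:
        if check.acceptable is not True or not check.note.strip():
            return check
    return None


def robots_allows(robots_txt: str, user_agent: str, url: str) -> bool:
    """Evaluate a robots.txt body (already fetched) for one user agent and URL."""
    parser = robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

A run would proceed only when first_blocker returns None; an unknown answer is treated the same as a failed one.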

Step 2: Choose a Pool Type

Different pool types have different cost, detectability, and ethics. Pick the cheapest tier that solves your block.

Pool type	Detectability	Cost	Best for
Datacenter	High (easily blocked by Cloudflare/Akamai)	$	Sites with no real anti-bot, geo-shifting only
Residential	Low (real ISP IPs)	$$$	Sites that block datacenter ASNs
Mobile	Very low (carrier-grade NAT, shared with thousands)	$$$$	Sites that even block residential (rare)

Ethical caveat for residential and mobile: these pools route your traffic through real consumer connections. Operator consent models vary — some pay users, some bundle exit-node consent into "free VPN" EULAs that users do not read. Prefer providers with audited, opt-in consent. If you would not be comfortable with a stranger sending scraping traffic through your home router, do not send yours through theirs.

Got: A documented choice with the cheapest viable tier and a brief note on why higher tiers were rejected (or why a higher tier is needed).

If fail:

Datacenter blocked but residential over budget — narrow scraping scope (fewer URLs, slower cadence) before upgrading the tier
Cannot find a provider with documented opt-in consent — reconsider whether the scraping is necessary
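
The "cheapest tier that solves your block" rule can be written down as a small selection function. This is a sketch under stated assumptions: the tier order mirrors the cost column of the table above, and cheapest_viable_tier with its parameters is a hypothetical helper, not something the skill ships.

```python
TIERS = ["datacenter", "residential", "mobile"]  # cheapest first, per the table
CONSENT_REQUIRED = {"residential", "mobile"}     # tiers that exit through consumer connections


def cheapest_viable_tier(blocked, consent_verified, max_tier="mobile"):
    """Return the cheapest tier that is unblocked, within budget, and consent-verified.

    blocked: tiers the target is known to block.
    consent_verified: tiers for which the provider's opt-in consent model was checked.
    max_tier: the most expensive tier the budget allows.
    Returns None when nothing qualifies, which means: narrow scope or stop.
    """
    affordable = TIERS[: TIERS.index(max_tier) + 1]
    for tier in affordable:
        if tier in blocked:
            continue
        if tier in CONSENT_REQUIRED and tier not in consent_verified:
            continue
        return tier
    return None
```

A None result corresponds to the "If fail" branches above: reduce scope before upgrading, or reconsider whether the scrape is necessary.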

Step 3: Integrate Rotation with Scrapling

Wire the proxy into scrapling fetchers. Read credentials from environment variables — never hard-code, never commit a .env.

import os
import random
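# Confirm the configure()/get() calls below against your installed Scrapling
# version; class and option names have changed between Scrapling releases.
from scrapling import StealthyFetcher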

# Pattern A: provider-managed rotating endpoint (one URL, provider rotates per request)
PROXY_URL = os.environ["SCRAPING_PROXY_URL"]  # e.g. http://USER:PASSWORD@PROXY_HOST:7777

fetcher = StealthyFetcher()
fetcher.configure(
    headless=True,
    timeout=60,
    network_idle=True,
    proxy=PROXY_URL,
)

# Pattern B: explicit pool, rotate yourself
# Comma-separated proxy URLs; strip whitespace and skip empty entries
POOL = [p.strip() for p in os.environ["SCRAPING_PROXY_POOL"].split(",") if p.strip()]

def fetch_with_rotation(url):
    proxy = random.choice(POOL)
    fetcher = StealthyFetcher()
    fetcher.configure(headless=True, timeout=60, proxy=proxy)
    return fetcher.get(url)

Got: Requests succeed and the egress IP varies between calls. Confirm by hitting an IP-echo endpoint (e.g. https://api.ipify.org) before running the real scrape.
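
The IP-echo confirmation can be scripted so the real scrape only starts once variance is observed. A minimal sketch, assuming you wrap your own fetch call in a function that returns the echoed IP as text; distinct_egress_ips and rotation_confirmed are hypothetical helper names, and no particular fetcher API is assumed.

```python
from typing import Callable, Set


def distinct_egress_ips(fetch_ip: Callable[[], str], attempts: int = 5) -> Set[str]:
    """Call fetch_ip `attempts` times and return the distinct IPs it reported."""
    seen = set()
    for _ in range(attempts):
        seen.add(fetch_ip().strip())  # echo services often append a newline
    return seen


def rotation_confirmed(ips: Set[str], minimum_distinct: int = 2) -> bool:
    """True when at least `minimum_distinct` different egress IPs were seen."""
    return len(ips) >= minimum_distinct
```

With a small pool, two consecutive calls can legitimately land on the same IP, so a handful of attempts gives a more reliable signal than a single pair.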

If fail:

407 Proxy Authentication Required — credentials wrong or password URL-encoding broke (re-encode special characters)
Same IP on every call — provider endpoint may be sticky by default; check docs for a -rotating or per-request flag
Massive latency increase — expected; rotation adds 200–2000ms per request
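
The 407 case caused by broken URL encoding can be avoided by assembling the proxy URL from separate parts and percent-encoding the credentials. A minimal sketch using only urllib.parse; build_proxy_url is a hypothetical helper and the host name in the usage note is a placeholder.

```python
from urllib.parse import quote


def build_proxy_url(user, password, host, port, scheme="http"):
    """Assemble scheme://user:password@host:port with percent-encoded credentials.

    safe="" makes quote() encode every reserved character, including
    '@', ':' and '/', which would otherwise be parsed as URL structure.
    """
    return f"{scheme}://{quote(user, safe='')}:{quote(password, safe='')}@{host}:{port}"
```

For example, build_proxy_url("user", "p@ss:w/rd", "proxy.example", 7777) returns "http://user:p%40ss%3Aw%2Frd@proxy.example:7777". The parts would still come from environment variables or a secrets manager, never from source.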

Step 4: Sticky Sessions and Pool Health

Decide rotation granularity per workload, then keep the pool healthy.

# Sticky session for stateful flows (login, multi-page checkout-like crawls)
# Most providers expose a session ID via the username:
#   user-session-abc123:PASSWORD@PROXY_HOST:7777
# All requests with the same session ID exit through the same IP for ~10 min.

# Per-request rotation for anonymous bulk scraping (default)

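import time  # used by fetch_with_backoff below

# Pool health check: call before a bulk run
def check_pool(pool, sample_size=5):
    """Probe a random sample of proxies and return the ones that respond."""
    sample = random.sample(pool, min(sample_size, len(pool)))
    alive = []
    for proxy in sample:
        try:
            fetcher = StealthyFetcher()
            # configure() is called as a statement, as in Step 3, not chained
            fetcher.configure(proxy=proxy, timeout=10)
            r = fetcher.get("https://api.ipify.org")
            if r.status == 200:
                alive.append(proxy)
        except Exception:
            pass  # an unreachable proxy is simply left out of the result
    return alive

# Backoff on transient proxy failures
def fetch_with_backoff(url, max_attempts=3):
    """Retry with exponential backoff; return None once attempts run out."""
    for attempt in range(max_attempts):
        try:
            r = fetch_with_rotation(url)
            if r.status not in (407, 502, 503):
                return r
        except Exception:
            pass  # treat as transient and retry through another proxy
        if attempt < max_attempts - 1:
            time.sleep(2 ** attempt)  # 1s, then 2s; no sleep after the last attempt
    return None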

Got: Stateful flows preserve cookies across requests; bulk anonymous scraping shows IP variance across requests; dead proxies skipped instead of looping.

If fail:

Login breaks mid-flow — rotation happening inside the session; switch to sticky-session credentials
All proxies in sample fail health check — pool exhausted or credentials expired; rotate credentials or contact provider
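
Sticky-session credentials can be generated rather than typed by hand, so that one flow keeps one session ID and a new flow gets a fresh one. This is a sketch only: the "-session-<id>" username suffix follows the pattern quoted in the comments above, but the exact syntax and session lifetime are provider-specific assumptions, and StickySession is a hypothetical helper.

```python
import time
import uuid


class StickySession:
    """Hand out one session-tagged proxy username until the TTL expires."""

    def __init__(self, base_user, ttl_seconds=600, clock=time.monotonic):
        self._base_user = base_user
        self._ttl = ttl_seconds      # ~10 min, per the note above; check your provider
        self._clock = clock          # injectable so the TTL can be tested
        self._session_id = None
        self._started = 0.0

    def username(self):
        """Return the current session username, starting a new session if needed."""
        now = self._clock()
        if self._session_id is None or now - self._started >= self._ttl:
            self._session_id = uuid.uuid4().hex[:8]
            self._started = now
        return f"{self._base_user}-session-{self._session_id}"

    def reset(self):
        """Force a new session, for example after a login flow completes."""
        self._session_id = None
```

A stateful flow should finish well inside the TTL, because a renewed session ID means a new exit IP and the same mid-flow breakage that sticky sessions are meant to prevent.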

Step 5: Monitoring, Cost Control, and Kill Switch

Proxy traffic is metered, typically per GB of bandwidth and on some plans per request. Runaway scrapers generate runaway invoices. Always include limits and an abort.

import time

class ScrapeBudget:
    def __init__(self, max_requests, max_duration_seconds, max_failures):
        self.max_requests = max_requests
        self.max_duration = max_duration_seconds
        self.max_failures = max_failures
        self.requests = 0
        self.failures = 0
        self.start = time.monotonic()

    def allow(self):
        if self.requests >= self.max_requests:
            return False, "request cap reached"
        if time.monotonic() - self.start >= self.max_duration:
            return False, "time cap reached"
        if self.failures >= self.max_failures:
            return False, "failure cap reached (circuit breaker)"
        return True, None

    def record(self, success):
        self.requests += 1
        if not success:
            self.failures += 1

budget = ScrapeBudget(max_requests=1000, max_duration_seconds=3600, max_failures=20)

for url in target_urls:  # target_urls: the list of target URLs from Inputs
    ok, reason = budget.allow()
    if not ok:
        print(f"Aborting: {reason}")
        break
    response = fetch_with_backoff(url)
    budget.record(success=response is not None)
    time.sleep(1)  # rate limiting still applies even with rotation

Got: Budget caps trigger before runaway cost. Logs show per-proxy success rate so a bad egress IP can be identified and excluded.
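
The per-proxy success rate mentioned above is not produced by ScrapeBudget itself, which only counts totals. One way to get it is a small tally keyed by a credential-free proxy label. A minimal sketch; ProxyStats and proxy_label are hypothetical names, and the 0.8 threshold mirrors the 20% failure figure used in this step.

```python
from urllib.parse import urlsplit


def proxy_label(proxy_url):
    """Return host:port only, so credentials never reach the logs."""
    parts = urlsplit(proxy_url)
    return f"{parts.hostname}:{parts.port}"


class ProxyStats:
    """Count successes and failures per proxy label."""

    def __init__(self):
        self._counts = {}  # label -> [successes, failures]

    def record(self, label, success):
        counts = self._counts.setdefault(label, [0, 0])
        counts[0 if success else 1] += 1

    def success_rate(self, label):
        """Return the success fraction, or None if the proxy was never used."""
        counts = self._counts.get(label)
        if not counts or sum(counts) == 0:
            return None
        return counts[0] / sum(counts)

    def bad_proxies(self, min_requests=5, min_success_rate=0.8):
        """Labels with enough traffic to judge and a success rate below the floor."""
        bad = []
        for label, (ok, failed) in self._counts.items():
            total = ok + failed
            if total >= min_requests and ok / total < min_success_rate:
                bad.append(label)
        return sorted(bad)
```

Recording by proxy requires the fetch helper to report which proxy it chose, for example by returning the label alongside the response.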

If fail:

Failure rate climbs above 20% — pause; the site has detected the rotation pattern (e.g. all your IPs share a subnet); switch pool type or stop
Cost-per-record exceeds expectations by 5x — cache aggressively, deduplicate URLs, batch where possible
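
Deduplicating URLs before the run is the cheapest of the cost fixes listed above, since every duplicate is paid proxy traffic. A minimal sketch with the standard library; normalize_url and dedupe_urls are hypothetical helpers, and sorting query parameters assumes the target treats parameter order as insignificant.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def normalize_url(url):
    """Lowercase scheme and host, sort query parameters, drop the fragment."""
    parts = urlsplit(url)
    query = urlencode(sorted(parse_qsl(parts.query, keep_blank_values=True)))
    path = parts.path or "/"
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, query, ""))


def dedupe_urls(urls):
    """Keep the first occurrence of each normalized URL, preserving input order."""
    seen = set()
    unique = []
    for url in urls:
        key = normalize_url(url)
        if key not in seen:
            seen.add(key)
            unique.append(url)
    return unique
```

The same normalized key can serve as a cache key, so a page already fetched in an earlier run is not requested through the proxy again.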

Validation

Step 1 legality check is documented in writing before any code runs
No proxy credentials, pool URLs, or session IDs in tracked files (grep for gateway., proxy=, the provider hostname)
.env (or equivalent) is in .gitignore
Pool choice justified: cheapest viable tier, with consent model verified for residential/mobile
IP variance confirmed against an echo endpoint before the real run
Stateful flows use sticky sessions; bulk anonymous use per-request
Budget caps (requests, duration, failures) wired and tested
Rate limiting (≥1s) preserved — rotation is not an excuse to flood
robots.txt still respected — rotation does not override it
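
The credential check in this list can be automated alongside the manual grep. The sketch below flags any line containing a URL with embedded user:password credentials; find_credential_urls and the regular expression are illustrative assumptions, and a password with an unencoded "/" would slip past this pattern.

```python
import re

# scheme://user:password@host, with no whitespace or "/" inside the credentials
CREDENTIAL_URL = re.compile(
    r"\b[a-z][a-z0-9+.-]*://[^\s/:@]+:[^\s/@]+@[^\s/]+",
    re.IGNORECASE,
)


def find_credential_urls(text):
    """Return (line_number, line) pairs that contain a URL with embedded credentials."""
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        if CREDENTIAL_URL.search(line):
            hits.append((number, line.strip()))
    return hits
```

Running it over the contents of each tracked file and failing the check on any hit gives a repeatable version of the "no credentials in tracked files" item.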

Pitfalls

Rotating before stealth is exhausted: the site often does not need a new IP — it needs a realistic User-Agent, TLS fingerprint, and slower cadence. Try StealthyFetcher and rate limiting first; rotation is expensive and unethical to deploy unnecessarily.
Hard-coded credentials: pasting the proxy URL into source leaks it to git, container images, and stack traces. Read from environment variables or a secrets manager.
Rotating mid-session: per-request rotation breaks any flow that depends on cookies, CSRF tokens, or shopping-cart state. Use sticky sessions for stateful work.
Treating rotation as "ethical anonymity": rotation hides you from the target, but does not make harmful scraping ethical. ToS, copyright, privacy law, and rate-limit ethics still apply unchanged.
Using residential proxies for high-risk activity: credential stuffing, sneaker botting, geo-pirating streaming content, fraud — explicitly out of scope. If your use case looks like this, stop.
Ignoring robots.txt because "we have rotation now": rotation does not grant permission. The directive is the directive.
No kill switch: an unsupervised loop on a metered proxy pool turns into a four-figure invoice overnight. Always cap requests, duration, and failures.
Choosing a residential pool with opaque consent: some providers source exit nodes from "free VPN" EULAs that real users never read. Pay the premium for an audited, opt-in consent model.

Related Skills

headless-web-scraping — parent skill; always start there. Use this skill only as escalation.
use-graphql-api — prefer official APIs to scraping when one exists.
deploy-searxng — self-hosted search avoids scraping search engines entirely.
configure-reverse-proxy — opposite network direction (serving instead of fetching), useful neighbor reference.
security-audit-codebase — run after integrating credentials to confirm none leaked into the repo.

GitHub repository

pjt222/agent-almanac

Path: i18n/caveman-lite/skills/rotate-scraping-proxies
