SKILL·DF2339

rotate-scraping-proxies

Name: rotate-scraping-proxies
Author: pjt222

Updated 1 month ago

Category: Development. Tags: ai, api, data

About

This skill provides proxy rotation for web scraping when standard stealth techniques fail, offering access to datacenter, residential, and mobile proxy pools with session management and monitoring. It is designed as a last-resort escalation after headless scraping approaches are blocked, with emphasis on ethical use and cost awareness. Developers should implement it only when facing persistent 403/429 errors and when their scraping traffic is legitimate.

Quick Install

Claude Code (Recommended)

Primary

npx skills add pjt222/agent-almanac -a claude-code

Plugin Command (Alternative)

/plugin add https://github.com/pjt222/agent-almanac

Git Clone (Alternative)

git clone https://github.com/pjt222/agent-almanac.git ~/.claude/skills/rotate-scraping-proxies

Copy and paste this command into Claude Code to install this skill

Documentation

Rotate Scraping Proxies

Network-layer escalation when client stealth exhausted. Last resort, not default — expensive, ethically charged, easily misused. Skill teaches when not to use as much as how.

Use When

headless-web-scraping (Fetcher → StealthyFetcher → DynamicFetcher) tried + still 403/429/geo-block
Rate limit ≥3s + robots.txt permits
UA + TLS fingerprint realistic (not default python-requests)
Scrape legit: public data, no auth bypass, no paywall, no personal data w/o legal basis
Budget for proxy traffic + accept ops complexity

Don't use → public API exists, ToS forbids automation, geo-license circumvention, fraud|cred-stuff|sneaker-bot|piracy.

In

Required: Target URLs + legal basis
Required: Proxy creds (env var, never hard-code)
Required: Pool type — datacenter|residential|mobile
Optional: Geo target (country|region|city)
Optional: Rotation granularity — per-req (default)|sticky session
Optional: Daily traffic|spend cap
Optional: Rate delay s (default 1, even w/ rotation)
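
The inputs above could be captured in one validated object so a run cannot start with a missing legal basis or an unknown pool type. This is a minimal sketch: `ScrapeConfig` and its field names are hypothetical and are not part of the skill or of Scrapling. Proxy credentials are deliberately left out, since they should be read from the environment at the point of use.

```python
from dataclasses import dataclass
from typing import Optional

POOL_TYPES = ("datacenter", "residential", "mobile")
GRANULARITIES = ("per-request", "sticky")


@dataclass(frozen=True)
class ScrapeConfig:
    """Hypothetical container for the inputs listed above."""
    target_urls: tuple
    legal_basis: str
    pool_type: str
    geo_target: Optional[str] = None
    rotation: str = "per-request"       # default per the list above
    daily_request_cap: Optional[int] = None
    rate_delay_s: float = 1.0           # default 1 s, even with rotation

    def __post_init__(self):
        # Fail early: required inputs must be present and well-formed.
        if not self.target_urls:
            raise ValueError("at least one target URL is required")
        if not self.legal_basis.strip():
            raise ValueError("a written legal basis is required")
        if self.pool_type not in POOL_TYPES:
            raise ValueError(f"pool_type must be one of {POOL_TYPES}")
        if self.rotation not in GRANULARITIES:
            raise ValueError(f"rotation must be one of {GRANULARITIES}")
        if self.rate_delay_s <= 0:
            raise ValueError("rate_delay_s must be positive")
```

Validation in `__post_init__` means an invalid config raises at construction time, before any request is sent.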

Do

Step 1: Pre-flight Legality+Ethics

Gate the workflow on a documented review. Skipping this step = biggest source of harm.

# Inputs to confirm before writing any code:
# 1. Is the data public (no login required)?
# 2. Does robots.txt permit the path?
# 3. Does the site's ToS prohibit automated access? (read it)
# 4. Would the scraping process personal data? If yes, what is the legal basis?
# 5. Could this access circumvent geo-licensing, paywalls, or auth?
# 6. Is there a public API or data dump that would make scraping unnecessary?
# 7. Have you contacted the site owner if scope is large?

→ Every Q has a defensible written answer. First unfavorable or "unknown" answer stops the process.

If err:

ToS forbids → don't proceed; contact owner|use API|licensed dataset
Personal data no basis → don't proceed; engage privacy counsel
Auth|geo-license bypass → don't proceed under any circumstances
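
The pre-flight gate can be sketched as data plus one check, so an unanswered question blocks the run instead of being skipped. The key names below are hypothetical labels for the seven questions, and mapping each to a required "yes" or "no" is an assumption about how a team might record the review. The robots.txt helper uses the standard-library `urllib.robotparser` on text that has already been downloaded.

```python
from urllib.robotparser import RobotFileParser

# Each question maps to the answer that lets the workflow continue.
# Keys are hypothetical labels for the seven questions above.
REQUIRED_ANSWERS = {
    "data_is_public": "yes",
    "robots_permits_path": "yes",
    "tos_prohibits_automation": "no",
    "personal_data_without_legal_basis": "no",
    "circumvents_geo_paywall_or_auth": "no",
    "public_api_or_dump_exists": "no",
    "owner_contacted_if_large_scope": "yes",
}


def preflight(answers):
    """Return (ok, blockers). Missing or 'unknown' answers count as blockers."""
    blockers = [
        key for key, required in REQUIRED_ANSWERS.items()
        if answers.get(key, "unknown") != required
    ]
    return (not blockers, blockers)


def robots_permits(robots_txt, user_agent, url):
    """Check a URL against robots.txt text that was already downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

Calling `preflight` with an incomplete dict returns the unanswered keys as blockers, which matches the rule that "unknown" stops the process.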

Step 2: Pool Type

Pools differ in cost, detectability, ethics. Pick the cheapest tier that solves the block.

Pool type	Detectability	Cost	Best for
Datacenter	High (easily blocked by Cloudflare/Akamai)	$	Sites with no real anti-bot, geo-shifting only
Residential	Low (real ISP IPs)	$$$	Sites that block datacenter ASNs
Mobile	Very low (carrier-grade NAT, shared with thousands)	$$$$	Sites that even block residential (rare)

Ethical caveat for residential + mobile: traffic routes via real consumer connections. How providers obtain exit-node consent varies: some pay users, some bundle consent into "free VPN" EULAs nobody reads. Prefer audited opt-in. If you wouldn't be comfortable w/ a stranger sending scrape traffic via your home router → don't send yours via theirs.

→ Documented choice + cheapest viable + brief why higher rejected (or needed).

If err:

Datacenter blocked, residential over budget → narrow scope (fewer URLs, slower) before upgrade
No documented opt-in consent → reconsider whether scrape needed at all
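
The "cheapest viable tier" rule can be sketched as a small selection function. `choose_pool` and its parameters are hypothetical; the tier order follows the cost column of the table above, and the consent check applies only to residential and mobile, as the caveat describes.

```python
# Tiers ordered cheapest first, matching the table above.
TIERS = ("datacenter", "residential", "mobile")


def choose_pool(blocked_tiers, consent_verified, max_tier):
    """Return the cheapest tier that is not blocked, is within budget, and
    (for residential/mobile) has verified opt-in consent.

    Returns None when nothing qualifies, meaning: narrow scope or stop.
    """
    budget_limit = TIERS.index(max_tier)
    for index, tier in enumerate(TIERS):
        if index > budget_limit:
            break  # everything from here on is over budget
        if tier in blocked_tiers:
            continue  # this tier is known not to solve the block
        if tier != "datacenter" and tier not in consent_verified:
            continue  # consumer exit nodes without documented consent
        return tier
    return None
```

A `None` result is the signal to narrow scope or reconsider the scrape, not to jump to a more expensive tier.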

Step 3: Integrate Rotation w/ Scrapling

Wire proxy → scrapling fetcher. Read creds from env, never hard-code, never commit .env.

import os
import random
from scrapling import Fetcher, StealthyFetcher

# Pattern A: provider-managed rotating endpoint (one URL, provider rotates per request)
PROXY_URL = os.environ["SCRAPING_PROXY_URL"]  # format: http://<user>:<password>@<proxy-host>:7777

fetcher = StealthyFetcher()
fetcher.configure(
    headless=True,
    timeout=60,
    network_idle=True,
    proxy=PROXY_URL,
)

# Pattern B: explicit pool, rotate yourself
# Comma-separated URLs; strip whitespace and drop empty entries
POOL = [p.strip() for p in os.environ["SCRAPING_PROXY_POOL"].split(",") if p.strip()]

def fetch_with_rotation(url):
    proxy = random.choice(POOL)  # uniform random pick per request
    fetcher = StealthyFetcher()
    fetcher.configure(headless=True, timeout=60, proxy=proxy)
    return fetcher.get(url)

→ Reqs succeed + egress IP varies. Confirm via IP echo (https://api.ipify.org) before real scrape.
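
The IP-echo confirmation could look like the sketch below. The function names are hypothetical, and the fetch step is injected as a plain callable so the check makes no assumption about any fetcher API; in practice the callable would request https://api.ipify.org through the proxy and return the body text.

```python
def egress_ip_variance(fetch_ip, attempts=5):
    """Call fetch_ip() several times and return the set of distinct IPs seen.

    fetch_ip is any zero-argument callable that returns the egress IP as
    text, for example a wrapper around a request to an IP echo service
    made through the proxy.
    """
    seen = set()
    for _ in range(attempts):
        ip = fetch_ip().strip()
        if ip:  # ignore empty responses
            seen.add(ip)
    return seen


def rotation_confirmed(ips, minimum_distinct=2):
    """True when at least minimum_distinct different egress IPs were observed."""
    return len(ips) >= minimum_distinct
```

A single distinct IP after several attempts suggests the endpoint is sticky by default.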

If err:

407 Proxy Auth Required → wrong creds|URL-encoding broke pwd (re-encode special chars)
Same IP every call → endpoint sticky default; check docs for -rotating or per-req flag
Massive latency → expected; rotation adds 200–2000ms/req
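
For the 407 case caused by special characters in the password, one option is to build the proxy URL from separate parts and percent-encode the credentials with the standard library. This is a sketch: the function names and the four environment variable names are assumptions, not something the skill or a provider defines.

```python
import os
from urllib.parse import quote


def build_proxy_url(user, password, host, port, scheme="http"):
    """Percent-encode credentials so characters like @ : / # survive URL parsing."""
    return (
        f"{scheme}://{quote(user, safe='')}:{quote(password, safe='')}"
        f"@{host}:{port}"
    )


def proxy_url_from_env(env=os.environ):
    """Read the four parts from environment variables (names are assumptions)."""
    return build_proxy_url(
        env["SCRAPING_PROXY_USER"],
        env["SCRAPING_PROXY_PASS"],
        env["SCRAPING_PROXY_HOST"],
        env["SCRAPING_PROXY_PORT"],
    )
```

Passing `safe=''` to `quote` matters here: by default `/` is left unencoded, which would break a password containing a slash.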

Step 4: Sticky Sessions + Pool Health

Match rotation granularity to the workload + keep the pool healthy.

# Sticky session for stateful flows (login, multi-page checkout-like crawls)
# Most providers expose a session ID via the username:
#   user-session-abc123:<password>@<proxy-host>:7777
# All requests with the same session ID exit through the same IP for ~10 min.

# Per-request rotation for anonymous bulk scraping (default)

import time  # needed by fetch_with_backoff below

# Pool health check: call before a bulk run
def check_pool(pool, sample_size=5):
    sample = random.sample(pool, min(sample_size, len(pool)))
    alive = []
    for proxy in sample:
        try:
            fetcher = StealthyFetcher()
            fetcher.configure(proxy=proxy, timeout=10)  # same pattern as Step 3
            r = fetcher.get("https://api.ipify.org")
            if r.status == 200:
                alive.append(proxy)
        except Exception:
            pass  # unreachable proxy: leave it out of the alive list
    return alive

# Backoff on transient proxy failures
def fetch_with_backoff(url, max_attempts=3):
    for attempt in range(max_attempts):
        try:
            r = fetch_with_rotation(url)
            if r.status not in (407, 502, 503):
                return r
        except Exception:
            pass  # treat network errors like a retryable status
        if attempt < max_attempts - 1:
            time.sleep(2 ** attempt)  # 1s, 2s, ... between attempts
    return None

→ Stateful preserves cookies; bulk shows IP variance; dead proxies skipped not looped.

If err:

Login breaks mid-flow → rotating in session; switch to sticky-session creds
All sample fail health → pool exhausted|creds expired; rotate|contact provider
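
A sticky-session URL could be built as sketched below. The `user-session-<id>` username convention is taken from the comment in this step; the exact syntax is provider-specific, so treat it as an assumption and check the provider's docs. The function names are hypothetical.

```python
import secrets
from urllib.parse import quote


def new_session_id(nbytes=4):
    """Random hex token; generate one per stateful flow."""
    return secrets.token_hex(nbytes)


def sticky_proxy_url(user, password, host, port, session_id):
    """Embed the session ID in the username ('user-session-<id>').

    The username syntax is an assumption; providers differ.
    """
    username = f"{user}-session-{session_id}"
    return (
        f"http://{quote(username, safe='')}:{quote(password, safe='')}"
        f"@{host}:{port}"
    )
```

Reusing one session ID for every request in a login or multi-page flow keeps the exit IP stable for as long as the provider holds the session; a new ID starts a fresh IP.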

Step 5: Monitor + Cost + Kill Switch

Proxies bill per GB and/or per request. Runaway scrape → runaway invoice. Always cap + abort.

import time

class ScrapeBudget:
    def __init__(self, max_requests, max_duration_seconds, max_failures):
        self.max_requests = max_requests
        self.max_duration = max_duration_seconds
        self.max_failures = max_failures
        self.requests = 0
        self.failures = 0
        self.start = time.monotonic()

    def allow(self):
        if self.requests >= self.max_requests:
            return False, "request cap reached"
        if time.monotonic() - self.start >= self.max_duration:
            return False, "time cap reached"
        if self.failures >= self.max_failures:
            return False, "failure cap reached (circuit breaker)"
        return True, None

    def record(self, success):
        self.requests += 1
        if not success:
            self.failures += 1

budget = ScrapeBudget(max_requests=1000, max_duration_seconds=3600, max_failures=20)

for url in target_urls:
    ok, reason = budget.allow()
    if not ok:
        print(f"Aborting: {reason}")
        break
    response = fetch_with_backoff(url)
    budget.record(success=response is not None)
    time.sleep(1)  # rate limiting still applies even with rotation

→ Caps trigger before runaway. Logs show per-proxy success → identify+exclude bad IP.

If err:

Fail rate >20% → pause; site detected pattern (shared subnet); switch type|stop
Cost-per-record 5x → cache, dedup URLs, batch
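
Per-proxy success tracking, which the expected outcome above relies on, is not shown in the budget code. A minimal sketch follows; `ProxyStats` is a hypothetical helper, and the 0.2 threshold mirrors the ">20%" failure rate mentioned above. Key the stats by a proxy label or host rather than the full URL so credentials never reach the logs.

```python
from collections import defaultdict


class ProxyStats:
    """Track success and failure counts per proxy label."""

    def __init__(self, min_samples=5, max_failure_rate=0.2):
        self.min_samples = min_samples
        self.max_failure_rate = max_failure_rate
        self.counts = defaultdict(lambda: {"ok": 0, "fail": 0})

    def record(self, proxy, success):
        self.counts[proxy]["ok" if success else "fail"] += 1

    def failure_rate(self, proxy):
        c = self.counts.get(proxy, {"ok": 0, "fail": 0})
        total = c["ok"] + c["fail"]
        return c["fail"] / total if total else 0.0

    def bad_proxies(self):
        """Proxies with enough samples and a failure rate above the threshold."""
        return sorted(
            p for p, c in self.counts.items()
            if c["ok"] + c["fail"] >= self.min_samples
            and self.failure_rate(p) > self.max_failure_rate
        )

    def healthy(self, pool):
        """Pool entries that are not currently flagged as bad."""
        bad = set(self.bad_proxies())
        return [p for p in pool if p not in bad]
```

The `min_samples` floor avoids excluding a proxy after a single unlucky request.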

Check

Step 1 legality documented written before code
No proxy creds|pool URLs|session IDs in tracked files (grep gateway., proxy=, hostname)
.env in .gitignore
Pool justified: cheapest viable + consent verified for residential|mobile
IP variance confirmed vs echo before real run
Stateful → sticky; bulk → per-req
Budget caps (req|dur|fail) wired+tested
Rate limit (≥1s) preserved — rotation ≠ flood excuse
robots.txt respected — rotation doesn't override
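
The "no creds in tracked files" check can be partly automated. The sketch below flags lines that contain a URL with embedded `user:password@` credentials; it is a heuristic with a hypothetical function name, not a replacement for a real secret scanner or for the grep patterns listed above.

```python
import re

# Matches scheme://user:password@host, the shape of a proxy URL with
# embedded credentials. A heuristic, not a full secret scanner.
CREDENTIAL_URL = re.compile(
    r"\b[a-z][a-z0-9+.-]*://[^\s/:@]+:[^\s/@]+@[^\s/]+", re.IGNORECASE
)


def find_credential_urls(text):
    """Return (line_number, line) pairs that look like URLs with credentials."""
    return [
        (number, line.strip())
        for number, line in enumerate(text.splitlines(), start=1)
        if CREDENTIAL_URL.search(line)
    ]
```

One way to use it is to read each file reported by `git ls-files` and fail the check if any file returns a non-empty list.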

Traps

Rotate before stealth exhausted: Site needs realistic UA, TLS, slower cadence — not new IP. Try StealthyFetcher+rate first.
Hard-coded creds: Source file leaks → git, images, traces. Always env|secrets manager.
Rotate mid-session: Per-req breaks cookies|CSRF|cart. Sticky for stateful.
"Ethical anonymity" myth: Rotation hides you from target → doesn't make harmful scrape ethical. ToS, copyright, privacy law, rate-ethics still apply.
Residential for high-risk: Cred stuff, sneaker, geo-pirate, fraud → out of scope. Stop.
Ignore robots.txt because rotation: Doesn't grant permission. Directive=directive.
No kill switch: Unsupervised loop on metered pool → 4-figure invoice overnight. Always cap.
Opaque consent residential: Some src exit nodes from "free VPN" EULAs unread. Pay premium for audited opt-in.

Related

headless-web-scraping — parent; always start there
use-graphql-api — prefer official APIs
deploy-searxng — self-host search → no scrape
configure-reverse-proxy — opposite direction reference
security-audit-codebase — run after creds → confirm no leak

GitHub Repository

pjt222/agent-almanac

Path: i18n/caveman-ultra/skills/rotate-scraping-proxies

Tags: agents, agentskills, ai-assisted-development, claude-code, skills, teams

FAQ

What is the rotate-scraping-proxies skill?

rotate-scraping-proxies is a Claude Skill by pjt222. Skills package instructions and resources that Claude loads on demand, so Claude can perform rotate-scraping-proxies-related tasks without extra prompting.

How do I install rotate-scraping-proxies?

Use the install commands on this page: add rotate-scraping-proxies to Claude Code as a plugin, or clone its repository into your skills directory, then restart Claude so it picks up the skill.

What category does rotate-scraping-proxies belong to?

rotate-scraping-proxies is in the Development category, tagged ai, api and data.

Is rotate-scraping-proxies free to use?

Yes. rotate-scraping-proxies is listed on AIMCP and free to install. It runs inside Claude, so no separate service account is required to use the skill itself.
