rotate-scraping-proxies

pjt222

About

This skill provides proxy rotation for web scraping when conventional stealth techniques fail, giving access to datacenter, residential, and mobile proxy pools with session management and monitoring. It is designed as the last escalation step after headless scraping approaches have been blocked, and it emphasizes ethical use and cost awareness. Developers should implement it only when they receive persistent 403/429 errors and their scraping traffic is legitimate.

Quick Install

Claude Code

Recommended (primary)
npx skills add pjt222/agent-almanac -a claude-code
Plugin command (alternative)
/plugin add https://github.com/pjt222/agent-almanac
Git clone (alternative)
git clone https://github.com/pjt222/agent-almanac.git ~/.claude/skills/rotate-scraping-proxies

Documentation

Rotate Scraping Proxies

Network-layer escalation when client stealth exhausted. Last resort, not default — expensive, ethically charged, easily misused. Skill teaches when not to use as much as how.

Use When

  • headless-web-scraping (Fetcher → StealthyFetcher → DynamicFetcher) tried + still 403/429/geo-block
  • Rate limit ≥3s + robots.txt permits
  • UA + TLS fingerprint realistic (not default python-requests)
  • Scrape legit: public data, no auth bypass, no paywall, no personal data w/o legal basis
  • Budget for proxy traffic + accept ops complexity

Don't use → public API exists, ToS forbids automation, geo-license circumvention, fraud|cred-stuff|sneaker-bot|piracy.

In

  • Required: Target URLs + legal basis
  • Required: Proxy creds (env var, never hard-code)
  • Required: Pool type — datacenter|residential|mobile
  • Optional: Geo target (country|region|city)
  • Optional: Rotation granularity — per-req (default)|sticky session
  • Optional: Daily traffic|spend cap
  • Optional: Rate delay s (default 1, even w/ rotation)
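
The inputs above could be collected and validated in one place before any request is made. A minimal sketch: the `ScrapeInputs` and `load_inputs` names and the field names are hypothetical, and `SCRAPING_PROXY_URL` is the env var used later in Step 3. The ≥1s floor on the delay follows the Check list.

```python
import os
from dataclasses import dataclass
from typing import Optional

VALID_POOL_TYPES = {"datacenter", "residential", "mobile"}
VALID_ROTATION = {"per-request", "sticky"}


@dataclass(frozen=True)
class ScrapeInputs:
    # Required inputs
    target_urls: tuple
    legal_basis: str
    proxy_url: str
    pool_type: str
    # Optional inputs with the defaults listed above
    geo_target: Optional[str] = None
    rotation: str = "per-request"
    max_requests_per_day: Optional[int] = None
    rate_delay_seconds: float = 1.0

    def __post_init__(self):
        if not self.target_urls:
            raise ValueError("at least one target URL is required")
        if not self.legal_basis.strip():
            raise ValueError("a written legal basis is required")
        if not self.proxy_url:
            raise ValueError("proxy credentials missing (expected in an env var)")
        if self.pool_type not in VALID_POOL_TYPES:
            raise ValueError(f"pool_type must be one of {sorted(VALID_POOL_TYPES)}")
        if self.rotation not in VALID_ROTATION:
            raise ValueError(f"rotation must be one of {sorted(VALID_ROTATION)}")
        if self.rate_delay_seconds < 1.0:
            raise ValueError("rate delay must stay at 1s or more, even with rotation")


def load_inputs(target_urls, legal_basis, pool_type, env=os.environ, **optional):
    """Build ScrapeInputs, reading the proxy URL from the environment only."""
    return ScrapeInputs(
        target_urls=tuple(target_urls),
        legal_basis=legal_basis,
        proxy_url=env.get("SCRAPING_PROXY_URL", ""),
        pool_type=pool_type,
        **optional,
    )
```

Failing fast on a missing legal basis or an unknown pool type keeps the gate from Step 1 enforceable in code rather than by convention.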

Do

Step 1: Pre-flight Legality+Ethics

Gate workflow on documented review. Skip = biggest harm source.

# Inputs to confirm before writing any code:
# 1. Is the data public (no login required)?
# 2. Does robots.txt permit the path?
# 3. Does the site's ToS prohibit automated access? (read it)
# 4. Would the scraping process personal data? If yes, what is the legal basis?
# 5. Could this access circumvent geo-licensing, paywalls, or auth?
# 6. Is there a public API or data dump that would make scraping unnecessary?
# 7. Have you contacted the site owner if scope is large?

→ Every Q has defensible written ans. First unfavorable|"unknown" ans stops proc (Q1, Q2, Q7: "no"; Q3, Q5, Q6: "yes"; Q4: "yes" w/o legal basis).

If err:

  • ToS forbids → don't proceed; contact owner|use API|licensed dataset
  • Personal data no basis → don't proceed; engage privacy counsel
  • Auth|geo-license bypass → don't proceed under any circumstances
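
Part of the pre-flight gate can be checked mechanically. The sketch below uses the standard-library `urllib.robotparser` for the robots.txt question and a small gate that stops at the first item lacking a favorable, written answer. The question keys and the `preflight_passes` helper are hypothetical names for illustration. The ToS and legal-basis questions still need a human reading.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical keys, phrased so that True is always the favorable answer.
PREFLIGHT_QUESTIONS = (
    "data_is_public",
    "robots_permits_path",
    "tos_permits_automation",
    "personal_data_absent_or_has_legal_basis",
    "no_geo_paywall_or_auth_circumvention",
    "no_public_api_or_data_dump",
    "owner_contacted_if_large_scope",
)


def robots_allows(robots_txt: str, user_agent: str, url: str) -> bool:
    """Check a URL against robots.txt content that was already downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


def preflight_passes(answers: dict) -> tuple:
    """answers maps each question key to (favorable: bool, written_note: str).

    Returns (True, None), or (False, first_blocking_question).
    A missing answer counts as "unknown" and blocks.
    """
    for question in PREFLIGHT_QUESTIONS:
        favorable, note = answers.get(question, (False, ""))
        if not favorable or not note.strip():
            return False, question
    return True, None
```

Requiring a non-empty note next to each answer mirrors the "documented review" requirement: a bare `True` without a written justification does not pass.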

Step 2: Pool Type

Diff cost, detect, ethics. Cheapest tier solving block.

| Pool type | Detectability | Cost | Best for |
|---|---|---|---|
| Datacenter | High (easily blocked by Cloudflare/Akamai) | $ | Sites with no real anti-bot, geo-shifting only |
| Residential | Low (real ISP IPs) | $$$ | Sites that block datacenter ASNs |
| Mobile | Very low (carrier-grade NAT, shared with thousands) | $$$$ | Sites that even block residential (rare) |

Ethical caveat residential+mobile: routes via real consumer connections. Provider consent varies — some pay, some bundle exit-node consent into "free VPN" EULAs unread. Prefer audited opt-in. If wouldn't be comfortable w/ stranger sending scrape via your home router → don't send via theirs.

→ Documented choice + cheapest viable + brief why higher rejected (or needed).

If err:

  • Datacenter blocked, residential over budget → narrow scope (fewer URLs, slower) before upgrade
  • No documented opt-in consent → reconsider whether scrape needed at all
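
The "cheapest viable tier" rule can be written down as a small selection function. This is only a sketch of the decision logic: `choose_pool` and its parameters are hypothetical. Which tiers count as blocked, and whether provider consent has been verified, are facts the operator has to supply.

```python
POOL_TIERS = ("datacenter", "residential", "mobile")  # ordered cheapest first
CONSENT_REQUIRED = {"residential", "mobile"}  # traffic exits via consumer connections


def choose_pool(blocked, consent_verified, max_tier="mobile"):
    """Return the cheapest tier that is not blocked, within budget, and ethically usable.

    blocked          -- set of tier names already observed to be blocked by the target
    consent_verified -- True only if the provider's opt-in consent is documented
    max_tier         -- most expensive tier the budget allows
    Returns None when no tier qualifies: narrow scope or reconsider the scrape.
    """
    ceiling = POOL_TIERS.index(max_tier)
    for tier in POOL_TIERS[: ceiling + 1]:
        if tier in blocked:
            continue
        if tier in CONSENT_REQUIRED and not consent_verified:
            continue
        return tier
    return None
```

A `None` result corresponds to the two failure branches above: over budget means narrowing scope, and unverified consent means reconsidering whether the scrape is needed at all.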

Step 3: Integrate Rotation w/ Scrapling

Wire proxy → scrapling fetcher. Read creds from env, never hard-code, never commit .env.

import os
import random
from scrapling import Fetcher, StealthyFetcher

# Pattern A: provider-managed rotating endpoint (one URL, provider rotates per request)
PROXY_URL = os.environ["SCRAPING_PROXY_URL"]  # format: http://<user>:<password>@<host>:7777

fetcher = StealthyFetcher()
fetcher.configure(
    headless=True,
    timeout=60,
    network_idle=True,
    proxy=PROXY_URL,
)

# Pattern B: explicit pool, rotate yourself
# Comma-separated URLs; strip whitespace and drop empty entries
POOL = [p.strip() for p in os.environ["SCRAPING_PROXY_POOL"].split(",") if p.strip()]

def fetch_with_rotation(url):
    proxy = random.choice(POOL)
    fetcher = StealthyFetcher()
    fetcher.configure(headless=True, timeout=60, proxy=proxy)
    return fetcher.get(url)

→ Reqs succeed + egress IP varies. Confirm via IP echo (https://api.ipify.org) before real scrape.
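
The IP-echo confirmation can be reduced to counting distinct egress addresses over a few calls. In this sketch `fetch_ip` is a hypothetical zero-argument callable supplied by the caller, for example a wrapper that fetches the echo URL through the configured proxy and returns the body as a string. Injecting it keeps the check independent of any particular fetcher API.

```python
from collections import Counter


def egress_ip_counts(fetch_ip, attempts=5):
    """Call fetch_ip() several times and count how often each egress IP appears."""
    return Counter(fetch_ip().strip() for _ in range(attempts))


def rotation_confirmed(counts, min_distinct=2):
    """Rotation is considered working once at least min_distinct IPs were seen."""
    return len(counts) >= min_distinct
```

A single distinct IP across all attempts points at the sticky-endpoint case listed in the failure branches of this step.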

If err:

  • 407 Proxy Auth Required → wrong creds|URL-encoding broke pwd (re-encode special chars)
  • Same IP every call → endpoint sticky default; check docs for -rotating or per-req flag
  • Massive latency → expected; rotation adds 200–2000ms/req
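
For the 407 case caused by special characters in the password, percent-encoding the credentials before assembling the proxy URL usually resolves it. A minimal sketch using the standard-library `urllib.parse.quote`. The env var names `SCRAPING_PROXY_USER`, `SCRAPING_PROXY_PASS`, `SCRAPING_PROXY_HOST` and `SCRAPING_PROXY_PORT` are assumptions for illustration, not something a provider defines.

```python
import os
from urllib.parse import quote


def build_proxy_url(user, password, host, port, scheme="http"):
    """Assemble a proxy URL with user and password percent-encoded.

    safe="" forces characters such as '@', ':' and '/' to be encoded,
    so they cannot be mistaken for URL delimiters.
    """
    return f"{scheme}://{quote(user, safe='')}:{quote(password, safe='')}@{host}:{port}"


def proxy_url_from_env(env=os.environ):
    """Read the (hypothetical) credential variables and build the URL."""
    return build_proxy_url(
        env["SCRAPING_PROXY_USER"],
        env["SCRAPING_PROXY_PASS"],
        env["SCRAPING_PROXY_HOST"],
        env["SCRAPING_PROXY_PORT"],
    )


# build_proxy_url("user", "p@ss:w/rd", "proxy.example", 7777)
# returns "http://user:p%40ss%3Aw%2Frd@proxy.example:7777"
```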

Step 4: Sticky Sessions + Pool Health

Granularity per workload + keep pool healthy.

# Sticky session for stateful flows (login, multi-page checkout-like crawls)
# Most providers expose a session ID via the username:
#   user-session-abc123:<password>@<host>:7777
# All requests with the same session ID exit through the same IP for ~10 min.

# Per-request rotation for anonymous bulk scraping (default)

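import random
import time

# Pool health check: call before bulk run
def check_pool(pool, sample_size=5):
    """Return the proxies from a random sample that answer an IP-echo request."""
    sample = random.sample(pool, min(sample_size, len(pool)))
    alive = []
    for proxy in sample:
        try:
            # Configure, then fetch: same two-step shape as Step 3,
            # without relying on configure() returning the fetcher.
            fetcher = StealthyFetcher()
            fetcher.configure(proxy=proxy, timeout=10)
            r = fetcher.get("https://api.ipify.org")
            if r.status == 200:
                alive.append(proxy)
        except Exception:
            continue  # dead or misconfigured proxy: skip it
    return alive

# Backoff on transient proxy failures
def fetch_with_backoff(url, max_attempts=3):
    """Retry with exponential backoff (1s, 2s, ...); return None if every attempt fails."""
    for attempt in range(max_attempts):
        try:
            r = fetch_with_rotation(url)  # defined in Step 3
            if r.status not in (407, 502, 503):
                return r
        except Exception:
            pass  # treat connection errors like a transient proxy failure
        if attempt < max_attempts - 1:
            time.sleep(2 ** attempt)  # no pointless sleep after the final attempt
    return None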

→ Stateful preserves cookies; bulk shows IP variance; dead proxies skipped not looped.

If err:

  • Login breaks mid-flow → rotating in session; switch to sticky-session creds
  • All sample fail health → pool exhausted|creds expired; rotate|contact provider

Step 5: Monitor + Cost + Kill Switch

Per-GB + per-req cost. Runaway scrape → runaway invoice. Always cap+abort.

import time

class ScrapeBudget:
    def __init__(self, max_requests, max_duration_seconds, max_failures):
        self.max_requests = max_requests
        self.max_duration = max_duration_seconds
        self.max_failures = max_failures
        self.requests = 0
        self.failures = 0
        self.start = time.monotonic()

    def allow(self):
        if self.requests >= self.max_requests:
            return False, "request cap reached"
        if time.monotonic() - self.start >= self.max_duration:
            return False, "time cap reached"
        if self.failures >= self.max_failures:
            return False, "failure cap reached (circuit breaker)"
        return True, None

    def record(self, success):
        self.requests += 1
        if not success:
            self.failures += 1

budget = ScrapeBudget(max_requests=1000, max_duration_seconds=3600, max_failures=20)

# target_urls: the URL list cleared in Step 1; fetch_with_backoff comes from Step 4
for url in target_urls:
    ok, reason = budget.allow()
    if not ok:
        print(f"Aborting: {reason}")
        break
    response = fetch_with_backoff(url)
    budget.record(success=response is not None)
    time.sleep(1)  # rate limiting still applies even with rotation

→ Caps trigger before runaway. Logs show per-proxy success → identify+exclude bad IP.

If err:

  • Fail rate >20% → pause; site detected pattern (shared subnet); switch type|stop
  • Cost-per-record 5x → cache, dedup URLs, batch

Check

  • Step 1 legality documented written before code
  • No proxy creds|pool URLs|session IDs in tracked files (grep gateway., proxy=, hostname)
  • .env in .gitignore
  • Pool justified: cheapest viable + consent verified for residential|mobile
  • IP variance confirmed vs echo before real run
  • Stateful → sticky; bulk → per-req
  • Budget caps (req|dur|fail) wired+tested
  • Rate limit (≥1s) preserved — rotation ≠ flood excuse
  • robots.txt respected — rotation doesn't override
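
The credential check in the list above could be automated with a small scan of a working tree. The two patterns below are assumptions about what a leak tends to look like (a `scheme://user:password@` URL, or a string literal assigned to `proxy=`). They are a starting point, not a complete secret scanner, and `find_hardcoded_proxies` is a hypothetical helper.

```python
import re
from pathlib import Path

SUSPECT_PATTERNS = (
    # scheme://user:password@host (embedded credentials)
    re.compile(r"[a-z][a-z0-9+.-]*://[^\s/:@]+:[^\s/@]+@", re.IGNORECASE),
    # proxy="..." or proxy='...' (string literal instead of an env lookup)
    re.compile(r"proxy\s*=\s*[\"']"),
)


def find_hardcoded_proxies(root, suffixes=(".py", ".toml", ".yaml", ".yml", ".json")):
    """Return (path, line_number) pairs for lines that look like hard-coded proxies."""
    hits = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file() or path.suffix not in suffixes:
            continue
        text = path.read_text(errors="ignore")
        for lineno, line in enumerate(text.splitlines(), start=1):
            if any(pattern.search(line) for pattern in SUSPECT_PATTERNS):
                hits.append((str(path), lineno))
    return hits
```

Running this in CI or a pre-commit hook, and failing when the returned list is non-empty, turns the checklist item into something that is enforced.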

Traps

  • Rotate before stealth exhausted: Site needs realistic UA, TLS, slower cadence — not new IP. Try StealthyFetcher+rate first.
  • Hard-coded creds: Source file leaks → git, images, traces. Always env|secrets manager.
  • Rotate mid-session: Per-req breaks cookies|CSRF|cart. Sticky for stateful.
  • "Ethical anonymity" myth: Rotation hides you from target → doesn't make harmful scrape ethical. ToS, copyright, privacy law, rate-ethics still apply.
  • Residential for high-risk: Cred stuff, sneaker, geo-pirate, fraud → out of scope. Stop.
  • Ignore robots.txt because rotation: Doesn't grant permission. Directive=directive.
  • No kill switch: Unsupervised loop on metered pool → 4-figure invoice overnight. Always cap.
  • Opaque consent residential: Some src exit nodes from "free VPN" EULAs unread. Pay premium for audited opt-in.

GitHub Repository

pjt222/agent-almanac
Path: i18n/caveman-ultra/skills/rotate-scraping-proxies
