Proxy Support¶
scrappy supports SOCKS5 and HTTP/HTTPS proxies for distributed scraping and rate-limit avoidance. Proxies are configured at startup, health-checked via TCP dial, and rotated through the pool.
Configuration¶
Proxy resolution follows a precedence chain:
--proxy CLI flag > config.toml proxy field > SCRAPPY_PROXIES env var
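The precedence chain amounts to "first non-empty source wins". A minimal sketch, assuming each source has already been read into a string; the `resolveProxy` name and signature are hypothetical and not taken from the scrappy codebase:

```go
package main

import (
	"fmt"
	"os"
)

// resolveProxy returns the first non-empty value in precedence order:
// CLI flag, then config file value, then environment variable.
// Hypothetical helper for illustration only.
func resolveProxy(cliFlag, configValue, envValue string) string {
	for _, candidate := range []string{cliFlag, configValue, envValue} {
		if candidate != "" {
			return candidate
		}
	}
	return "" // no proxy configured anywhere
}

func main() {
	env := os.Getenv("SCRAPPY_PROXIES")
	// No CLI flag given, so the config file value wins over the env var.
	fmt.Println(resolveProxy("", "socks5://cfg-proxy:1080", env))
}
```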
CLI flag¶
# Single SOCKS5 proxy
scrappy --sites linkedin --proxy socks5://user:pass@proxy:1080
# Multiple proxies (round-robin)
scrappy --sites indeed --proxy socks5://proxy1:1080,socks5://proxy2:1080
# HTTP proxy
scrappy --sites remoteok --proxy http://proxy:8080
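A comma-separated value like the one above has to be split and validated before use. The following is a sketch of one way to do that, not scrappy's actual parser; `splitProxies` is a hypothetical name, and rejecting schemes other than socks5, http, and https is an assumption based on the schemes this page lists:

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// splitProxies parses a comma-separated proxy list into URLs.
// Empty entries (for example from a trailing comma) are skipped.
func splitProxies(raw string) ([]*url.URL, error) {
	var out []*url.URL
	for _, part := range strings.Split(raw, ",") {
		part = strings.TrimSpace(part)
		if part == "" {
			continue
		}
		u, err := url.Parse(part)
		if err != nil {
			return nil, fmt.Errorf("invalid proxy entry: %w", err)
		}
		switch u.Scheme {
		case "socks5", "http", "https":
			out = append(out, u)
		default:
			return nil, fmt.Errorf("unsupported proxy scheme %q", u.Scheme)
		}
	}
	return out, nil
}

func main() {
	proxies, err := splitProxies("socks5://proxy1:1080, socks5://proxy2:1080")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, p := range proxies {
		fmt.Println(p.Scheme, p.Host)
	}
}
```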
Config file¶
proxy = "socks5://user:pass@proxy:1080"
Environment variable¶
export SCRAPPY_PROXIES="socks5://user:pass@proxy:1080"
Per-request rotation¶
Two env vars control rotation frequency:
# Rotate proxy every N requests
export SCRAPPY_PROXY_ROTATE_EVERY_N=50
# Keep proxy sticky for N-request window
export SCRAPPY_PROXY_STICKY_WINDOW_N=20
Both are applied by smartRT.maybeRotateProxy(), which advances the proxy index atomically.
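To illustrate rotating every N requests with an atomic counter, here is a minimal sketch. The `rotator` type is a hypothetical stand-in for smartRT's rotation state, and it models only the rotate-every-N setting, since this page does not specify how the sticky window interacts with it:

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// rotator is a hypothetical stand-in for the rotation state kept by smartRT.
type rotator struct {
	requests uint64   // total requests seen; accessed atomically
	everyN   uint64   // switch proxy after this many requests; 0 disables rotation
	proxies  []string // proxy URLs in rotation order
}

// next returns the proxy for the current request. The index is derived from
// a single atomic counter, so concurrent callers need no additional locking.
func (r *rotator) next() string {
	if len(r.proxies) == 0 {
		return ""
	}
	n := atomic.AddUint64(&r.requests, 1) - 1 // zero-based request number
	if r.everyN == 0 {
		return r.proxies[0]
	}
	idx := (n / r.everyN) % uint64(len(r.proxies))
	return r.proxies[idx]
}

func main() {
	r := &rotator{everyN: 2, proxies: []string{"socks5://proxy1:1080", "socks5://proxy2:1080"}}
	for i := 0; i < 5; i++ {
		fmt.Println(i, r.next())
	}
}
```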
Health checks — internal/proxy/pool.go¶
Before any scraping, the CLI performs a TCP-dial health check on every proxy:
conn, dialErr := net.DialTimeout("tcp", net.JoinHostPort(host, port), 500*time.Millisecond)
- Timeout: 500ms per proxy.
- Port defaults: HTTP→80, SOCKS5→1080.
- Unhealthy proxies are logged and excluded from the pool.
- The filtered healthy list is set as the SCRAPPY_PROXIES env var for the engine.
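The filter-then-export step can be sketched as follows. This is an illustration under assumptions: `filterHealthy` is a hypothetical name, the health check is injected as a function so the example runs without a network, and joining the survivors with commas assumes the env var uses the same format as the CLI flag:

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// filterHealthy keeps the proxies for which check reports true,
// preserving their original order. Hypothetical helper.
func filterHealthy(proxies []string, check func(string) bool) []string {
	healthy := make([]string, 0, len(proxies))
	for _, p := range proxies {
		if check(p) {
			healthy = append(healthy, p)
			continue
		}
		// A real logger should redact credentials before printing the URL.
		fmt.Fprintln(os.Stderr, "proxy unhealthy, excluded from pool")
	}
	return healthy
}

func main() {
	proxies := []string{"socks5://proxy1:1080", "socks5://proxy2:1080"}
	// Stand-in check for illustration; a real one would TCP-dial the proxy.
	check := func(p string) bool { return strings.Contains(p, "proxy1") }

	healthy := filterHealthy(proxies, check)
	if err := os.Setenv("SCRAPPY_PROXIES", strings.Join(healthy, ",")); err != nil {
		fmt.Fprintln(os.Stderr, "setenv:", err)
		return
	}
	fmt.Println(os.Getenv("SCRAPPY_PROXIES"))
}
```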
Pool — proxy rotation manager¶
type Pool struct {
    proxies []*ProxyURL
    idx     int
    mu      sync.Mutex
}
- Next(): returns the next healthy proxy in round-robin order. Skips unhealthy ones. Returns empty string if all proxies are dead.
- MarkUnhealthy(raw): marks a proxy as unhealthy. Called when NewHTTPClient detects unreachable proxies during parseProxyList.
- MarkAllHealthy(): resets all proxies to healthy at the start of a new run.
- Probe(ctx, px): performs a HEAD request to https://httpbin.org/ip through the proxy to verify it can make outbound HTTP requests.
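The round-robin behaviour described for Next(), MarkUnhealthy(), and MarkAllHealthy() could look roughly like this. It is a simplified sketch with hypothetical lowercase types (`pool`, `proxyEntry`), not the code in internal/proxy/pool.go:

```go
package main

import (
	"fmt"
	"sync"
)

type proxyEntry struct {
	raw     string
	healthy bool
}

type pool struct {
	mu      sync.Mutex
	proxies []*proxyEntry
	idx     int // position where the next scan starts
}

func newPool(raws ...string) *pool {
	p := &pool{}
	for _, r := range raws {
		p.proxies = append(p.proxies, &proxyEntry{raw: r, healthy: true})
	}
	return p
}

// Next scans at most one full lap from the cursor and returns the first
// healthy proxy, or "" if every proxy is marked unhealthy.
func (p *pool) Next() string {
	p.mu.Lock()
	defer p.mu.Unlock()
	for i := 0; i < len(p.proxies); i++ {
		e := p.proxies[(p.idx+i)%len(p.proxies)]
		if e.healthy {
			p.idx = (p.idx + i + 1) % len(p.proxies)
			return e.raw
		}
	}
	return ""
}

// MarkUnhealthy flags every entry whose raw URL matches.
func (p *pool) MarkUnhealthy(raw string) {
	p.mu.Lock()
	defer p.mu.Unlock()
	for _, e := range p.proxies {
		if e.raw == raw {
			e.healthy = false
		}
	}
}

// MarkAllHealthy resets every entry to healthy.
func (p *pool) MarkAllHealthy() {
	p.mu.Lock()
	defer p.mu.Unlock()
	for _, e := range p.proxies {
		e.healthy = true
	}
}

func main() {
	p := newPool("a", "b", "c")
	p.MarkUnhealthy("b")
	for i := 0; i < 4; i++ {
		fmt.Println(p.Next()) // "b" is skipped on every lap
	}
}
```

Holding one mutex for the whole scan keeps the cursor update and the health check consistent, at the cost of serializing callers; for a small proxy list that trade-off is usually acceptable.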
ProxyURL — health state¶
type ProxyURL struct {
    Raw      string // original URL
    Scheme   string // socks5 | http | https
    HostPort string
    Healthy  bool   // thread-safe via RWMutex
}
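The struct excerpt does not show the RWMutex that the comment mentions. One common way to guard such a flag is sketched below; the `proxyState` type and its accessor names are hypothetical and only illustrate the pattern:

```go
package main

import (
	"fmt"
	"sync"
)

// proxyState guards a health flag with an RWMutex so that many readers
// can check it concurrently while writers get exclusive access.
type proxyState struct {
	mu      sync.RWMutex
	healthy bool
}

// IsHealthy takes the read lock, which does not block other readers.
func (s *proxyState) IsHealthy() bool {
	s.mu.RLock()
	defer s.mu.RUnlock()
	return s.healthy
}

// SetHealthy takes the write lock and updates the flag.
func (s *proxyState) SetHealthy(v bool) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.healthy = v
}

func main() {
	s := &proxyState{}
	var wg sync.WaitGroup
	for i := 0; i < 8; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			s.SetHealthy(true)
			_ = s.IsHealthy()
		}()
	}
	wg.Wait()
	fmt.Println(s.IsHealthy())
}
```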
HTTP client integration — internal/util/http.go¶
The NewHTTPClient function builds an http.Client with proxy-aware transport:
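func NewHTTPClient(opts ClientOptions) *http.Client {
    // base is the underlying *http.Transport; its construction is omitted from this excerpt.
    proxyList := parseProxyList(opts.ProxyURL)
    if len(proxyList) > 0 {
        // Start on the first proxy; smartRT rotates through the rest.
        base.Proxy = http.ProxyURL(proxyList[0])
    }
    rt := &smartRT{base: base, opts: opts, proxyList: proxyList}
    return &http.Client{Transport: rt}
}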
smartRT — retry-aware transport¶
The smartRT wrapper around http.Transport provides:
- Proxy rotation: maybeRotateProxy() advances the proxy index based on the ProxyRotateEveryN and ProxyStickyWindowN counters.
- Retry logic: retries on 429 (rate-limit) and 5xx (server errors) with exponential backoff + jitter:
  delay = min(BaseDelay(300ms) * 2^attempt, MaxDelay(4s)) + jitter
  Per the retryDelay code in this document, jitter is a random 0-119ms plus 300ms on a 429 or 80ms on a 5xx. Permanent errors (NXDOMAIN, TLS cert failures, unreachable networks) fail immediately, since retrying won't help.
- User-agent rotation: cycles through a pool of modern browser UAs.
- Cookie jar reset: resets the cookie jar every N requests to avoid session tracking.
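The retry decision described above can be sketched as a small classifier. This is an illustration, not smartRT's code: `shouldRetry` and `isPermanent` are hypothetical names, treating non-permanent transport errors as retryable is an assumption, and the sketch covers only NXDOMAIN and certificate errors, leaving out unreachable-network detection:

```go
package main

import (
	"crypto/x509"
	"errors"
	"fmt"
	"net"
	"net/http"
)

// isPermanent reports whether err is a failure that retrying cannot fix.
func isPermanent(err error) bool {
	var dnsErr *net.DNSError
	if errors.As(err, &dnsErr) && dnsErr.IsNotFound {
		return true // NXDOMAIN: the host does not exist
	}
	var unknownCA x509.UnknownAuthorityError
	var invalidCert x509.CertificateInvalidError
	var hostErr x509.HostnameError
	return errors.As(err, &unknownCA) ||
		errors.As(err, &invalidCert) ||
		errors.As(err, &hostErr)
}

// shouldRetry decides whether a request is worth another attempt.
func shouldRetry(status int, err error) bool {
	if err != nil {
		// Assumption: transport errors are retried unless known to be permanent.
		return !isPermanent(err)
	}
	return status == http.StatusTooManyRequests || status >= 500
}

func main() {
	fmt.Println(shouldRetry(http.StatusTooManyRequests, nil))
	fmt.Println(shouldRetry(http.StatusNotFound, nil))
	nx := &net.DNSError{Err: "no such host", Name: "example.invalid", IsNotFound: true}
	fmt.Println(shouldRetry(0, nx))
}
```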
Proxy reachability during parseProxyList¶
func proxyReachable(u *url.URL) bool {
// TCP dial to host:port with 500ms timeout
}
Note: for SOCKS5, TCP dial only confirms the port is open, not that a SOCKS handshake succeeds. Full SOCKS handshake validation requires a SOCKS library and is not attempted here.
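The stub above leaves out the dial itself. A possible body, combining the 500ms timeout with the port defaults listed under health checks, is sketched here. `defaultPort` and `dialAddr` are hypothetical helpers, and returning false for schemes without a documented default port is an assumption:

```go
package main

import (
	"fmt"
	"net"
	"net/url"
	"time"
)

// defaultPort returns the documented default port for a proxy scheme.
// Only the defaults stated on this page are covered.
func defaultPort(scheme string) (string, bool) {
	switch scheme {
	case "http":
		return "80", true
	case "socks5":
		return "1080", true
	}
	return "", false
}

// dialAddr builds host:port for the proxy, filling in a default port
// when the URL does not carry one.
func dialAddr(u *url.URL) (string, bool) {
	port := u.Port()
	if port == "" {
		var ok bool
		if port, ok = defaultPort(u.Scheme); !ok {
			return "", false
		}
	}
	return net.JoinHostPort(u.Hostname(), port), true
}

// proxyReachable reports whether a TCP connection to the proxy can be
// opened within 500ms. It does not attempt a SOCKS or HTTP handshake.
func proxyReachable(u *url.URL) bool {
	addr, ok := dialAddr(u)
	if !ok {
		return false
	}
	conn, err := net.DialTimeout("tcp", addr, 500*time.Millisecond)
	if err != nil {
		return false
	}
	conn.Close()
	return true
}

func main() {
	u, err := url.Parse("socks5://user:pass@127.0.0.1")
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	addr, _ := dialAddr(u)
	fmt.Println(addr, proxyReachable(u))
}
```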
Retry delay with jitter¶
func (s *smartRT) retryDelay(attempt int, resp *http.Response) time.Duration {
    d := s.opts.BaseDelay * time.Duration(1<<attempt)
    if d > s.opts.MaxDelay {
        d = s.opts.MaxDelay
    }
    jitter := time.Duration(rand.Intn(120)) * time.Millisecond
    if resp != nil {
        if resp.StatusCode == http.StatusTooManyRequests {
            jitter += 300 * time.Millisecond
        }
        if resp.StatusCode >= 500 {
            jitter += 80 * time.Millisecond
        }
    }
    return d + jitter
}
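Base delay: 300ms, max delay: 4s. Before jitter, and assuming the attempt counter starts at 0, the first retry waits ~300ms, the second ~600ms, the third ~1.2s, and the fourth ~2.4s. Later retries are capped at 4s.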
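The schedule can be checked by computing the deterministic part of the delay on its own. The sketch below hardcodes the 300ms base and 4s cap stated above and omits jitter; `backoff` is a hypothetical helper, and the early return for large attempt values is an added guard against integer overflow, not something taken from the source:

```go
package main

import (
	"fmt"
	"time"
)

const (
	baseDelay = 300 * time.Millisecond
	maxDelay  = 4 * time.Second
)

// backoff returns the capped exponential delay for a zero-based attempt,
// without jitter.
func backoff(attempt int) time.Duration {
	if attempt < 0 {
		attempt = 0
	}
	if attempt > 16 {
		// Far past the cap already; returning early avoids overflow in the shift.
		return maxDelay
	}
	d := baseDelay * time.Duration(1<<attempt)
	if d > maxDelay {
		d = maxDelay
	}
	return d
}

func main() {
	for attempt := 0; attempt <= 5; attempt++ {
		fmt.Printf("attempt %d: %v\n", attempt, backoff(attempt))
	}
}
```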
Best practices¶
- SOCKS5 proxies are preferred for higher anonymity.
- Set --proxy at the CLI for one-off proxy use; use SCRAPPY_PROXIES env var or config.toml for persistent setups.
- Proxy credentials in URLs are redacted in logs via url.URL.Redacted().
- Use multiple proxies to distribute rate limits across IPs. scrappy round-robins through the pool.
- TCP-dial health check is fast (500ms timeout) but only confirms the port is open. For production, a periodic Probe() through the proxy is more reliable.
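A probe of the kind recommended in the last point could be sketched as below. It is not the Pool.Probe implementation: `newProxyClient` and `probe` are hypothetical helpers, the 5-second timeout and the "status below 400 counts as success" rule are assumptions, and the proxy address is a placeholder. It relies on the standard library's http.Transport, which accepts http, https, and socks5 proxy URLs:

```go
package main

import (
	"context"
	"fmt"
	"net/http"
	"net/url"
	"time"
)

// newProxyClient builds a client that sends every request through the proxy.
func newProxyClient(rawProxy string, timeout time.Duration) (*http.Client, error) {
	u, err := url.Parse(rawProxy)
	if err != nil {
		return nil, fmt.Errorf("parse proxy URL: %w", err)
	}
	return &http.Client{
		Transport: &http.Transport{Proxy: http.ProxyURL(u)},
		Timeout:   timeout,
	}, nil
}

// probe issues a HEAD request to target through client and returns an
// error if the request fails or the status is 400 or above.
func probe(ctx context.Context, client *http.Client, target string) error {
	req, err := http.NewRequestWithContext(ctx, http.MethodHead, target, nil)
	if err != nil {
		return err
	}
	resp, err := client.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode >= 400 {
		return fmt.Errorf("probe returned status %d", resp.StatusCode)
	}
	return nil
}

func main() {
	// Placeholder proxy address; replace with a real proxy to try this out.
	client, err := newProxyClient("socks5://127.0.0.1:1080", 5*time.Second)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	if err := probe(ctx, client, "https://httpbin.org/ip"); err != nil {
		fmt.Println("proxy failed probe:", err)
		return
	}
	fmt.Println("proxy ok")
}
```

Running such a probe on a timer, and marking a proxy unhealthy after a failure, would catch proxies whose port is open but which cannot relay traffic.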