Rate Limiting¶
scrappy limits outbound traffic at two levels: per-site controls (a token-bucket limiter and a concurrency semaphore for each job board) and a global concurrency semaphore shared by all sites. On top of this, the HTTP client adds retry logic with exponential backoff and jitter.
Per-site rate limiting — internal/rate/rate.go¶
Token bucket pool¶
type Pool struct {
mu sync.Mutex
pool map[string]*rate.Limiter
}
Uses golang.org/x/time/rate token-bucket limiters, one per hostname/site:
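func (p *Pool) Get(key string, rps int) *rate.Limiter {
	p.mu.Lock() // guard the map against concurrent scrapers
	defer p.mu.Unlock()
	if lim, ok := p.pool[key]; ok {
		return lim // an existing limiter keeps its original rate
	}
	if p.pool == nil {
		p.pool = make(map[string]*rate.Limiter) // make the zero-value Pool usable
	}
	lim := rate.NewLimiter(rate.Limit(rps), rps)
	p.pool[key] = lim
	return lim
}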
- Key: site name or hostname (e.g. "linkedin", "indeed").
- Burst: equal to the RPS value (so a 5 RPS limiter allows bursts of 5).
- Wait(ctx, key, rps): blocks until a token is available, respecting context cancellation.
The pool lazily creates limiters — only sites that are actually scraped get a limiter allocated.
Per-site semaphores¶
In addition to the token bucket, the engine builds per-site semaphores from --site-rps:
func buildSiteSemaphores(input model.ScraperInput) map[model.Site]chan struct{} {
out := make(map[model.Site]chan struct{})
for site, rps := range input.SiteRPS {
capN := rps
if capN <= 0 { capN = 1 }
if capN > 8 { capN = 8 }
out[site] = make(chan struct{}, capN)
}
return out
}
Usage: linkedin:1,indeed:10 limits LinkedIn to 1 concurrent request and Indeed to 8, because the requested 10 exceeds the upper clamp. Every capacity is clamped between 1 and 8.
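The clamping can be checked with a small standalone sketch. It assumes the flag value has the `site:rps,site:rps` shape shown above; `parseSiteRPS` and `semaphoreCap` are hypothetical helper names used only for illustration, and the real flag parsing in scrappy may differ.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseSiteRPS is a hypothetical helper that parses "site:rps,site:rps"
// into a map. It is not taken from the scrappy source.
func parseSiteRPS(s string) (map[string]int, error) {
	out := make(map[string]int)
	for _, pair := range strings.Split(s, ",") {
		name, val, ok := strings.Cut(strings.TrimSpace(pair), ":")
		if !ok {
			return nil, fmt.Errorf("invalid pair %q", pair)
		}
		rps, err := strconv.Atoi(val)
		if err != nil {
			return nil, fmt.Errorf("invalid rps in %q: %w", pair, err)
		}
		out[name] = rps
	}
	return out, nil
}

// semaphoreCap mirrors the clamp in buildSiteSemaphores: at least 1, at most 8.
func semaphoreCap(rps int) int {
	if rps <= 0 {
		return 1
	}
	if rps > 8 {
		return 8
	}
	return rps
}

func main() {
	siteRPS, err := parseSiteRPS("linkedin:1,indeed:10")
	if err != nil {
		panic(err)
	}
	for _, site := range []string{"linkedin", "indeed"} {
		sem := make(chan struct{}, semaphoreCap(siteRPS[site]))
		fmt.Printf("%s: %d concurrent slots\n", site, cap(sem))
	}
}
```

With this input the LinkedIn semaphore gets 1 slot and the Indeed semaphore gets 8, not 10.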
Global rate limiting — pkg/scrappy/engine.go¶
Global concurrency semaphore¶
The engine uses a channel-based semaphore for global concurrency. The size is determined by:
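- Memory cap (--memory-cap): takes precedence when set and maps to a fixed tier:

| Memory cap | Global concurrency |
|---|---|
| ≤ 256 MB | 3 |
| ≤ 512 MB | 5 |
| ≤ 1 GB | 8 |
| > 1 GB | 12 |

- Max RPS (--max-rps): used when no memory cap is set. Clamped between 2 and 16.
- Default: 8 when neither flag is set.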
func globalConcurrency(input model.ScraperInput) int {
if input.MemoryCapMB > 0 {
switch {
case input.MemoryCapMB <= 256: return 3
case input.MemoryCapMB <= 512: return 5
case input.MemoryCapMB <= 1024: return 8
default: return 12
}
}
if input.MaxRPS > 0 {
if input.MaxRPS < 2 { return 2 }
if input.MaxRPS > 16 { return 16 }
return input.MaxRPS
}
return 8
}
Site-level concurrency¶
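Each site goroutine holds a global semaphore slot plus a slot in its own site-specific semaphore. This means:
- Global: at most globalConcurrency(input) sites are active at once (8 by default).
- Site: at most N concurrent requests to a single site, where N is the clamped --site-rps value.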
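The two-level acquisition can be sketched in isolation. This is a minimal illustration under stated assumptions, not the engine's code: `runTasks` is a hypothetical function, and the 1 ms sleep stands in for a real request. It measures the peak number of tasks holding both slots at once.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// runTasks starts n goroutines that each acquire a global slot and then a
// site slot before doing work. It returns the peak number of tasks that
// held both slots at the same time.
func runTasks(n, globalCap, siteCap int) int64 {
	global := make(chan struct{}, globalCap)
	site := make(chan struct{}, siteCap)
	var inFlight, peak int64
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			global <- struct{}{} // acquire a global slot
			defer func() { <-global }()
			site <- struct{}{} // acquire a site slot
			defer func() { <-site }()

			cur := atomic.AddInt64(&inFlight, 1)
			for { // record a new peak if this task raised it
				old := atomic.LoadInt64(&peak)
				if cur <= old || atomic.CompareAndSwapInt64(&peak, old, cur) {
					break
				}
			}
			time.Sleep(time.Millisecond) // stand-in for the HTTP request
			atomic.AddInt64(&inFlight, -1)
		}()
	}
	wg.Wait()
	return atomic.LoadInt64(&peak)
}

func main() {
	fmt.Println("peak concurrency:", runTasks(20, 8, 2))
}
```

The peak never exceeds the smaller of the two capacities. One consequence of this pattern is that a task waiting for a site slot still occupies a global slot, so a tightly limited site can hold up capacity that other sites could use.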
Retry logic — internal/util/http.go¶
The smartRT transport implements retry with exponential backoff and jitter:
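func (s *smartRT) RoundTrip(req *http.Request) (*http.Response, error) {
	attempts := s.opts.Retries + 1
	if !isRetryableMethod(req.Method) {
		attempts = 1 // only retry GET/HEAD/OPTIONS
	}
	var lastErr error
	for i := 0; i < attempts; i++ {
		s.maybeRotateProxy()
		resp, err := s.base.RoundTrip(req)
		if err == nil && !isRetryableStatus(resp.StatusCode) {
			return resp, nil // success or non-retryable status
		}
		if err != nil {
			lastErr = err
		} else {
			lastErr = fmt.Errorf("retryable status %d", resp.StatusCode)
		}
		if resp != nil {
			resp.Body.Close() // release the connection; headers stay readable
		}
		if i == attempts-1 {
			break // out of attempts: skip the final sleep
		}
		time.Sleep(s.retryDelay(i, resp)) // retry with backoff
	}
	// Simplified for this excerpt: report the last failure once attempts run out.
	return nil, lastErr
}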
Retryable status codes¶
func isRetryableStatus(code int) bool {
if code >= 500 && code < 600 {
return true
}
return code == http.StatusTooManyRequests
}
- 429 (Too Many Requests): retried.
- 5xx (Server Error): retried.
- Other 4xx (Client Error): not retried; 401, 403, and 406 are treated as permanent conditions.
Retryable methods¶
func isRetryableMethod(method string) bool {
switch strings.ToUpper(method) {
case http.MethodGet, http.MethodHead, http.MethodOptions:
return true
default:
return false
}
}
All other methods, including POST, PUT, PATCH, and DELETE, are never retried, which avoids duplicate side effects.
Permanent errors (no retry)¶
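// Sketch of the checks (imports: crypto/tls, errors, net, syscall); exact detection may differ.
func isPermanentError(err error) bool {
	var dnsErr *net.DNSError
	if errors.As(err, &dnsErr) && dnsErr.IsNotFound {
		return true // DNS NXDOMAIN: domain doesn't exist
	}
	var certErr *tls.CertificateVerificationError
	if errors.As(err, &certErr) {
		return true // TLS/certificate errors
	}
	return errors.Is(err, syscall.ENETUNREACH) // network unreachable
}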
Backoff formula¶
delay = min(BaseDelay × 2^(attempt−1) + jitter, 4s), with BaseDelay = 300ms and attempts counted from 1
| Attempt | Base delay | Max jitter | Range before status bonus |
|---|---|---|---|
| 1 | 300ms | 120ms | ~300-420ms |
| 2 | 600ms | 120ms | ~600-720ms |
| 3 | 1.2s | 120ms | ~1.2-1.32s |
| 4 | 2.4s | 120ms | ~2.4-2.52s |
Status-based bonuses, added on top of the random jitter:
- 429 (rate limit): +300ms
- 5xx: +80ms
- 403/401/406: +600ms
The total delay is capped at 4 seconds.
Repeated 429 detection¶
If the same request gets two consecutive 429 responses, the transport fails permanently:
if resp.StatusCode == http.StatusTooManyRequests && saw429 {
return nil, fmt.Errorf("permanent: rate limited (repeated 429)")
}
This stops the transport from spending its remaining retries on a server that is actively rate limiting it.
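The effect of the `saw429` flag can be seen by replaying scripted status codes through a simplified loop. This is a sketch under assumptions: `replay` is a hypothetical function, and resetting the flag after a non-429 response (so that only consecutive 429s count) is a guess about the real transport, which the excerpt above does not show.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
)

var errRepeated429 = errors.New("permanent: rate limited (repeated 429)")

// replay walks a scripted list of status codes as if each were the response
// to one attempt. It returns how many attempts were consumed and the error,
// if any, that ended the sequence.
func replay(statuses []int) (int, error) {
	saw429 := false
	for i, code := range statuses {
		if code == http.StatusTooManyRequests {
			if saw429 {
				return i + 1, errRepeated429 // second 429 in a row: give up
			}
			saw429 = true
			continue // first 429: retry
		}
		saw429 = false // assumption: only consecutive 429s count
		if code < 500 {
			return i + 1, nil // non-retryable status ends the loop
		}
		// 5xx: fall through and retry
	}
	return len(statuses), errors.New("retries exhausted")
}

func main() {
	for _, seq := range [][]int{{429, 200}, {429, 429, 200}, {429, 503, 429, 200}} {
		n, err := replay(seq)
		fmt.Printf("%v -> attempts=%d err=%v\n", seq, n, err)
	}
}
```

A single 429 followed by a 200 succeeds on the second attempt, while two 429s in a row stop the sequence before the 200 is ever requested.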
suggestRPS — adaptive rate suggestion¶
After each scrape, the engine calls suggestRPS() to propose an adjusted RPS for the site:
func suggestRPS(current int, err error) int {
if current <= 0 { current = 3 }
if err == nil {
if current < 10 { return current + 1 } // increase on success
return current
}
if containsAny(err.Error(), "429", "rate", "too many requests", "captcha") {
if current > 1 { return current - 1 } // decrease on rate-limit
return 1
}
return current
}
- On success: gradually increase RPS (capped at 10).
- On rate-limit or captcha: decrease RPS (minimum 1).
- Stored in RunTelemetry.SuggestedSiteRPS for inspection but not automatically applied (the operator chooses whether to adopt the suggestion).
Usage¶
# Global max RPS
scrappy --sites linkedin,indeed --search "golang" --max-rps 5
# Per-site RPS
scrappy --sites linkedin,indeed --search "golang" --site-rps "linkedin:1,indeed:10"
# Memory-constrained (automatically sets global concurrency)
scrappy --sites all --search "engineer" --memory-cap 512MB
# Both memory cap and per-site limits
scrappy --sites all --search "engineer" --memory-cap 1GB --site-rps "linkedin:1"