CLI Reference
Usage
scrappy [flags]
scrappy <command> [flags]
Commands
The root command (scrappy with no subcommand) runs a scrape — this is the
default invocation. Two subcommands are registered:
| Command | Description |
|---|---|
| doctor | Diagnose and fix setup issues (config, env, network) |
| setup | Interactive setup wizard |
Flags
Flags are grouped by purpose (matching scrappy --help). Defaults shown are
the cobra flag defaults; most can be overridden by config.toml or env vars.
Scraping flags
| Flag | Default | Description |
|---|---|---|
| --search | "" | Comma-separated terms (e.g. "software engineer,AI Engineer") |
| --sites | all 141 | Comma-separated site names (empty = all 141) |
| --location | "" | Comma-separated locations (e.g. "Remote,New York") |
| --results-wanted | 0 | Max results total (0 = unlimited) |
| --timeout | 600 | Scrape timeout in seconds |
| --proxy | env | Comma-separated proxy URLs (socks5://, http://); TCP-dial health-checked at startup |
| --memory-cap | "" | Memory budget, e.g. 512MB or 1GB (0 = unlimited) |
| --max-rps | 0 | Global max requests per second |
| --site-rps | "" | Per-site RPS overrides, e.g. linkedin:1,indeed:10 |
| --site-results-wanted | "" | Per-site result caps, e.g. indeed:5000,linkedin:1000 |
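The site:value syntax shared by --site-rps and --site-results-wanted can be illustrated with a small Python sketch. This is not scrappy's own parser: the helper name parse_site_map is hypothetical, and the choice to accept only integer values is an assumption based on the examples in the table above.

```python
def parse_site_map(spec: str) -> dict[str, int]:
    """Parse a 'site:value,site:value' string into a dict.

    Hypothetical helper that mirrors the documented flag syntax;
    it is not taken from scrappy's source.
    """
    result: dict[str, int] = {}
    for entry in spec.split(","):
        entry = entry.strip()
        if not entry:
            continue  # tolerate an empty spec and trailing commas
        site, sep, value = entry.partition(":")
        if not sep or not site.strip():
            raise ValueError(f"expected site:value, got {entry!r}")
        # Assumption: values are whole numbers, as in the documented examples.
        result[site.strip()] = int(value)
    return result


print(parse_site_map("linkedin:1,indeed:10"))
# {'linkedin': 1, 'indeed': 10}
```

An empty string yields an empty dict, which matches the "" flag default meaning "no overrides".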
Filter flags
| Flag | Default | Description |
|---|---|---|
| --email | false | Only jobs with ≥ 1 email address |
| --is-remote | false | Only jobs flagged as remote |
| --remote-only | false | Only truly remote jobs (no location filter) |
| --job-type | "" | One of: fulltime, parttime, contract, internship |
| --hours-old | 0 | Jobs posted within N hours (0 = no filter) |
| --since | "" | Jobs posted on or after date (RFC3339 or YYYY-MM-DD) |
| --min-score | 0 | Quality score floor (0-100) |
| --enforce-annual-salary | false | Normalize salaries to yearly amounts |
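The two date forms accepted by --since (RFC3339 or YYYY-MM-DD) can be handled by a single parser. The Python sketch below shows one way to do it. The function name parse_since is hypothetical, and treating a bare date as midnight UTC is an assumption; the source does not say which time zone scrappy applies.

```python
from datetime import datetime, timezone


def parse_since(value: str) -> datetime:
    """Parse YYYY-MM-DD or an RFC3339 timestamp into an aware UTC datetime."""
    text = value.strip()
    # datetime.fromisoformat accepts a trailing "Z" only from Python 3.11 on,
    # so rewrite it as an explicit UTC offset first.
    if text.endswith(("Z", "z")):
        text = text[:-1] + "+00:00"
    parsed = datetime.fromisoformat(text)
    if parsed.tzinfo is None:
        # Assumption: a value without an offset is interpreted as UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed.astimezone(timezone.utc)


print(parse_since("2024-01-15").isoformat())
# 2024-01-15T00:00:00+00:00
```

A posting would then pass the filter when its timestamp is greater than or equal to the parsed value, matching the "on or after" wording.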
Output flags
| Flag | Default | Description |
|---|---|---|
| --format | jsonl | One of: jsonl, csv, xlsx, parquet |
| --out | stdout | Output file path |
| --csv-emails-only | false | CSV: one row per email instead of one row per job |
| --json-pretty | auto | Pretty-print JSON (stdout only) |
| --json-minify | false | Force minified JSON even on stdout |
GitHub flags
| Flag | Default | Description |
|---|---|---|
| --github-scrape | false | Discover emails from GitHub orgs/repos instead of job scraping (requires --search, see GitHub Discovery) |
Verification flag
| Flag | Default | Description |
|---|---|---|
| --verify-concurrency | 5 | MX-lookup concurrency. 0 skips MX verification entirely (useful when DNS is unavailable or to speed up large runs). |
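To make the semantics of --verify-concurrency concrete, here is a minimal Python sketch of a bounded worker pool in which a value of 0 skips verification altogether. It is an illustration only, not scrappy's implementation: verify_domains is a hypothetical name, and the lookup callable is a stand-in for a real MX query that the caller would supply.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable, Optional


def verify_domains(
    domains: Iterable[str],
    lookup: Callable[[str], bool],
    concurrency: int = 5,
) -> Dict[str, Optional[bool]]:
    """Run `lookup` over each domain with at most `concurrency` workers.

    With concurrency 0 nothing is looked up and every domain maps to
    None (unverified). `lookup` is a hypothetical MX-check callable.
    """
    unique = list(dict.fromkeys(domains))  # de-duplicate, keep first-seen order
    if concurrency <= 0:
        return {domain: None for domain in unique}
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = list(pool.map(lookup, unique))
    return dict(zip(unique, results))


# Stand-in lookup for demonstration; a real one would query DNS.
fake_lookup = lambda domain: domain.endswith(".com")
print(verify_domains(["a.com", "b.invalid"], fake_lookup, concurrency=2))
# {'a.com': True, 'b.invalid': False}
```

Passing the lookup in as a parameter keeps the sketch runnable without network access, which is also the situation in which the table above suggests setting the flag to 0.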
Setup & debug flags
| Flag | Default | Description |
|---|---|---|
| --config | auto | Path to config.toml (CWD → ~/.scrappy/config.toml) |
| --log-level | "" | One of: DEBUG, INFO, WARN, ERROR |
| --non-interactive | false | Disable interactive wizard (for scripts/CI) |
| --interactive | false | Force wizard mode |
| --dedup | true | Deduplicate jobs by URL across sites |
| --dedup-by-company | false | Keep only one posting per company |
| --version | | Print version and exit |
| --version-json | false | Print version info as JSON and exit |
| --help | | Print help |
Examples
# Basic scrape
scrappy --sites remoteok --search "golang" --results-wanted 50
# Multiple sites with output file
scrappy --sites linkedin,indeed,remoteok \
--search "software engineer" \
--location "Remote" \
--results-wanted 200 \
--format csv \
--out ./results.csv
# With proxy
scrappy --sites linkedin \
--search "ai engineer" \
--proxy socks5://user:pass@proxy:1080
# All sites with quality filter
scrappy --search "python" \
--results-wanted 10 \
--min-score 50 \
--format jsonl
# Debug mode
scrappy --sites remoteok --search "rust" --log-level DEBUG
# Diagnose setup
scrappy doctor
# Setup wizard
scrappy setup
# Discover emails from GitHub (see 013-GitHub-Discovery.md)
GITHUB_TOKEN=ghp_xxx scrappy --github-scrape --search "torvalds" --out emails.csv
# Skip MX verification (DNS unavailable, or faster bulk runs)
scrappy --sites remoteok --search "golang" --verify-concurrency 0
Environment Variables
| Variable | Description |
|---|---|
| SCRAPPY_PROXIES | Comma-separated SOCKS5/HTTP proxy URLs (lowest precedence) |
| SCRAPPY_LOG_LEVEL | Default log level: DEBUG, INFO, WARN, or ERROR |
| SCRAPPY_INDEED_API_KEY | Indeed API key (paid) |
| SCRAPPY_DICE_API_KEY | Dice API key |
| SCRAPPY_GREENHOUSE_SEEDS | Company slugs for Greenhouse |
| SCRAPPY_ATS_MAX_SEEDS | Max ATS company seeds per provider (default: 20) |
| SCRAPPY_INDEED_CO | Indeed company override |
| SCRAPPY_PROXY_ROTATE_EVERY_N | Rotate proxy every N requests |
| SCRAPPY_PROXY_STICKY_WINDOW_N | Proxy stickiness window |
| GITHUB_TOKEN | GitHub API token (for --github-scrape) |
| ADZUNA_APP_ID / ADZUNA_APP_KEY | Adzuna API credentials |
| CAREERJET_AFFID | Careerjet affiliate ID |
| INFOJOBS_CLIENT_ID / INFOJOBS_CLIENT_SECRET | InfoJobs API credentials |
| FINDWORK_API_KEY | Findwork API key |
| ARBEITSAGENTUR_API_KEY | Arbeitsagentur API key |
Proxy precedence: --proxy CLI flag > proxy field in config.toml > SCRAPPY_PROXIES env.
See .env.example for the full list.
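The proxy precedence stated above (CLI flag, then config.toml, then SCRAPPY_PROXIES) can be sketched in Python as a first-non-empty lookup. The names resolve_proxies and split_proxies are hypothetical, and two behaviours are assumptions rather than documented facts: an empty source falls through to the next one, and sources are never merged.

```python
import os
from typing import List, Mapping, Optional


def split_proxies(raw: Optional[str]) -> List[str]:
    """Split a comma-separated proxy string, dropping blank entries."""
    if not raw:
        return []
    return [part.strip() for part in raw.split(",") if part.strip()]


def resolve_proxies(
    cli_flag: Optional[str],
    config_value: Optional[str],
    env: Mapping[str, str] = os.environ,
) -> List[str]:
    """Return proxies from the highest-precedence source that has any."""
    # Order follows the documented precedence: CLI > config.toml > env.
    for candidate in (cli_flag, config_value, env.get("SCRAPPY_PROXIES")):
        proxies = split_proxies(candidate)
        if proxies:
            return proxies  # assumption: sources are not merged
    return []


demo_env = {"SCRAPPY_PROXIES": "http://env-proxy:8080"}
print(resolve_proxies(None, "http://cfg-proxy:3128", demo_env))
# ['http://cfg-proxy:3128']
```

The host names in the demo are placeholders. Passing the environment as a mapping makes the function easy to exercise without touching the real process environment.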