GitHub Email Discovery

The GitHubDiscoverer extracts email addresses from commit author data on public GitHub repositories. It is separate from job scraping and is invoked via the --github-scrape CLI flag.

Usage

# Basic discovery from a user's repos
scrappy --github-scrape --search "torvalds"

# With GitHub token (higher rate limit)
GITHUB_TOKEN=ghp_xxx scrappy --github-scrape --search "torvalds"

# Output to file (default: github_emails.csv)
scrappy --github-scrape --search "linux-foundation" --out emails.csv

The --search value is treated as a GitHub username or organization name.

Discovery flow

The DiscoverFromUser method runs a three-phase pipeline:

OrgsForUser(login)
  → ReposForOrg(org) per org
    → EmailsFromRepo(owner, repo) per repo
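
The flow above can be sketched as a pair of nested loops. This is a minimal illustration, not the actual GitHubDiscoverer code: the `source` struct, the `discover` function, and the canned data in `fakeSource` are hypothetical names introduced here, and the network calls are replaced by injected functions so that the control flow is visible.

```go
package main

import (
	"context"
	"fmt"
	"strings"
)

// source groups the three phases as injectable functions.
// Hypothetical: the field names are illustrative, not the real API.
type source struct {
	orgsForUser    func(ctx context.Context, login string) ([]string, error)
	reposForOrg    func(ctx context.Context, org string) ([]string, error)
	emailsFromRepo func(ctx context.Context, owner, repo string, maxCommits int) ([]string, error)
}

// discover walks user -> orgs -> repos -> emails and returns the
// emails found, keyed by "owner/repo".
func discover(ctx context.Context, s source, login string, maxCommits int) (map[string][]string, error) {
	orgs, err := s.orgsForUser(ctx, login)
	if err != nil {
		return nil, fmt.Errorf("orgs for %s: %w", login, err)
	}
	out := make(map[string][]string)
	for _, org := range orgs {
		repos, err := s.reposForOrg(ctx, org)
		if err != nil {
			return nil, fmt.Errorf("repos for %s: %w", org, err)
		}
		for _, full := range repos {
			owner, repo, ok := strings.Cut(full, "/")
			if !ok {
				continue // skip malformed full names
			}
			emails, err := s.emailsFromRepo(ctx, owner, repo, maxCommits)
			if err != nil {
				return nil, fmt.Errorf("emails for %s: %w", full, err)
			}
			if len(emails) > 0 {
				out[full] = emails
			}
		}
	}
	return out, nil
}

// fakeSource returns canned data so the sketch runs without network access.
func fakeSource() source {
	return source{
		orgsForUser: func(ctx context.Context, login string) ([]string, error) {
			return []string{"acme"}, nil
		},
		reposForOrg: func(ctx context.Context, org string) ([]string, error) {
			return []string{org + "/widgets", org + "/empty"}, nil
		},
		emailsFromRepo: func(ctx context.Context, owner, repo string, maxCommits int) ([]string, error) {
			if repo == "widgets" {
				return []string{"dev@example.com"}, nil
			}
			return nil, nil
		},
	}
}

// demo runs the pipeline against the canned data.
func demo() (map[string][]string, error) {
	return discover(context.Background(), fakeSource(), "alice", 30)
}

func main() {
	res, err := demo()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(res)
}
```

The sketch aborts on the first error; whether the real pipeline does that or skips a failing org or repo is not stated here.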

Phase 1: `OrgsForUser(ctx, login)`

url := fmt.Sprintf("https://api.github.com/users/%s/orgs", login)

Returns the logins of the organizations the user belongs to. The call uses the public GitHub API, so only organizations where the user's membership is public are returned.
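
As a rough illustration of this phase, the sketch below builds the endpoint URL and decodes a response body. `orgsURL` and `decodeOrgLogins` are hypothetical helpers, not functions from scrappy; the only assumption about the API is that each organization object carries a `login` field. Escaping the login with `url.PathEscape` is a defensive choice made for this example.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/url"
)

// orgsURL builds the Phase 1 endpoint, escaping the login so that
// unexpected characters cannot alter the request path.
func orgsURL(login string) string {
	return fmt.Sprintf("https://api.github.com/users/%s/orgs", url.PathEscape(login))
}

// decodeOrgLogins extracts the "login" field from each organization
// object in the JSON array returned by the endpoint.
func decodeOrgLogins(body []byte) ([]string, error) {
	var orgs []struct {
		Login string `json:"login"`
	}
	if err := json.Unmarshal(body, &orgs); err != nil {
		return nil, fmt.Errorf("decode orgs: %w", err)
	}
	logins := make([]string, 0, len(orgs))
	for _, o := range orgs {
		if o.Login != "" {
			logins = append(logins, o.Login)
		}
	}
	return logins, nil
}

func main() {
	fmt.Println(orgsURL("octocat"))
	// Sample payload trimmed to the fields used here.
	logins, err := decodeOrgLogins([]byte(`[{"login":"github","id":1},{"login":"acme","id":2}]`))
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(logins)
}
```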

Phase 2: `ReposForOrg(ctx, org)`

url := fmt.Sprintf("https://api.github.com/orgs/%s/repos?per_page=50&sort=pushed&type=public", org)

Returns non-fork, public repository full names (owner/repo), sorted by most recently pushed.
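
The `type=public` query parameter does not exclude forks, so the non-fork filter presumably happens client-side. A minimal sketch of such a filter follows, assuming the response is decoded into the `full_name`, `fork`, and `private` fields of each repository object; the `repo` struct and `publicNonForks` function are hypothetical names.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// repo holds the subset of repository fields this sketch needs.
type repo struct {
	FullName string `json:"full_name"`
	Fork     bool   `json:"fork"`
	Private  bool   `json:"private"`
}

// publicNonForks returns the full names of repositories that are
// neither forks nor private, preserving the API's ordering.
func publicNonForks(body []byte) ([]string, error) {
	var repos []repo
	if err := json.Unmarshal(body, &repos); err != nil {
		return nil, fmt.Errorf("decode repos: %w", err)
	}
	names := make([]string, 0, len(repos))
	for _, r := range repos {
		if r.Fork || r.Private || r.FullName == "" {
			continue
		}
		names = append(names, r.FullName)
	}
	return names, nil
}

func main() {
	// Sample payload trimmed to the fields used here.
	body := []byte(`[
		{"full_name":"acme/widgets","fork":false,"private":false},
		{"full_name":"acme/linux","fork":true,"private":false}
	]`)
	names, err := publicNonForks(body)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(names)
}
```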

Phase 3: `EmailsFromRepo(ctx, owner, repo, maxCommits)`

url := fmt.Sprintf("https://api.github.com/repos/%s/%s/commits?per_page=%d", owner, repo, maxCommits)

Extracts unique personal emails from recent commits. Key behavior:

Reads both commit.author.email (the commit metadata) and author.email (the GitHub user object).
Filters out GitHub's noreply addresses (@users.noreply.github.com, @noreply.github.com).
Filters out bot addresses (-bot, [bot], -ci, -automate suffixes).
Default: 30 commits per repo, max 100.
Also called directly for org/repo lookups when DiscoverFromUser returns nothing.
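
The filtering rules above can be sketched as a small predicate plus a de-duplication pass. This is an illustration under stated assumptions, not scrappy's implementation: `isPersonalEmail` and `uniquePersonal` are hypothetical names, the bot markers are assumed to be suffixes of the local part (the text before `@`), and addresses are compared case-insensitively.

```go
package main

import (
	"fmt"
	"strings"
)

// botSuffixes lists the local-part suffixes treated as automation accounts.
var botSuffixes = []string{"-bot", "[bot]", "-ci", "-automate"}

// isPersonalEmail reports whether addr looks like a real personal address:
// it rejects malformed input, GitHub noreply domains, and bot-like local parts.
func isPersonalEmail(addr string) bool {
	addr = strings.ToLower(strings.TrimSpace(addr))
	local, domain, ok := strings.Cut(addr, "@")
	if !ok || local == "" || domain == "" {
		return false
	}
	switch domain {
	case "users.noreply.github.com", "noreply.github.com":
		return false
	}
	for _, suf := range botSuffixes {
		if strings.HasSuffix(local, suf) {
			return false
		}
	}
	return true
}

// uniquePersonal returns the personal addresses in first-seen order,
// lowercased and with duplicates removed.
func uniquePersonal(addrs []string) []string {
	seen := make(map[string]bool)
	out := []string{}
	for _, a := range addrs {
		norm := strings.ToLower(strings.TrimSpace(a))
		if seen[norm] || !isPersonalEmail(norm) {
			continue
		}
		seen[norm] = true
		out = append(out, norm)
	}
	return out
}

func main() {
	fmt.Println(uniquePersonal([]string{
		"Dev@Example.com",
		"dev@example.com",
		"12345+octocat@users.noreply.github.com",
		"deploy-bot@example.com",
	}))
}
```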

API endpoints used

| Endpoint | Purpose | Auth |
| --- | --- | --- |
| `GET /users/{login}/orgs` | List user's organizations | Optional |
| `GET /orgs/{org}/public_members` | List org members | Optional |
| `GET /orgs/{org}/repos` | List org repos | Optional |
| `GET /users/{login}/repos` | List user's repos | Optional |
| `GET /repos/{owner}/{repo}/commits` | Commit history | Optional |
| `GET /repos/{owner}/{repo}/commits?author={login}` | Author-filtered commits | Optional |

Headers set on all requests:

Accept: application/vnd.github+json
X-GitHub-Api-Version: 2022-11-28
Authorization: Bearer <token>   // only when GITHUB_TOKEN is set
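
A request carrying these headers could be built roughly as follows. `newGitHubRequest` and `demoHeaders` are hypothetical helpers written for this example; only the header names and values come from the list above.

```go
package main

import (
	"context"
	"fmt"
	"net/http"
	"os"
)

// newGitHubRequest builds a GET request with the headers listed above.
// The Authorization header is added only when token is non-empty.
func newGitHubRequest(ctx context.Context, url, token string) (*http.Request, error) {
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("Accept", "application/vnd.github+json")
	req.Header.Set("X-GitHub-Api-Version", "2022-11-28")
	if token != "" {
		req.Header.Set("Authorization", "Bearer "+token)
	}
	return req, nil
}

// demoHeaders builds a request for a fixed URL and returns its headers.
func demoHeaders(token string) (http.Header, error) {
	req, err := newGitHubRequest(context.Background(), "https://api.github.com/users/octocat/orgs", token)
	if err != nil {
		return nil, err
	}
	return req.Header, nil
}

func main() {
	h, err := demoHeaders(os.Getenv("GITHUB_TOKEN"))
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	// Report whether a token was attached without printing it.
	fmt.Println("Accept:", h.Get("Accept"))
	fmt.Println("authenticated:", h.Get("Authorization") != "")
}
```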

Rate limiting

| Auth | Requests per hour |
| --- | --- |
| Unauthenticated | 60 |
| With token | 5,000 |

Unauthenticated requests (those without an Authorization header) are rate limited per source IP address. With a GITHUB_TOKEN, requests count against the authenticated user's quota rather than against the individual token.

The discoverer detects rate limiting by checking for HTTP 403:

if resp.StatusCode == http.StatusForbidden {
    return nil, fmt.Errorf("github_rate_limited (status %d)", resp.StatusCode)
}

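Error messages include the status code for diagnostics. Note that a 403 check alone is a coarse signal: GitHub also returns 403 for other forbidden requests, and it may report an exceeded rate limit with 429 instead.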

Caching

Results are cached in an in-memory LRU cache (capacity: 256 entries):

type ghLRU struct {
    mu    sync.Mutex
    data  map[string]*list.Element
    order *list.List
    cap   int
}

Cache key: GitHub login or owner/repo.
Eviction: least-recently-used.
Thread-safe via sync.Mutex.
Empty results (no emails found) are also cached to avoid re-scanning.
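
To make the cache behavior concrete, here is a minimal sketch of `get` and `put` operations over the struct shown above. The `entry` type, the method names, the `newGhLRU` constructor, and the `[]string` value type are assumptions made for illustration; only the struct fields come from the source.

```go
package main

import (
	"container/list"
	"fmt"
	"sync"
)

// entry is the value stored in each list element (hypothetical type).
type entry struct {
	key string
	val []string
}

type ghLRU struct {
	mu    sync.Mutex
	data  map[string]*list.Element
	order *list.List // front = most recently used
	cap   int
}

func newGhLRU(capacity int) *ghLRU {
	return &ghLRU{data: make(map[string]*list.Element), order: list.New(), cap: capacity}
}

// get returns the cached value and marks the key as recently used.
// A hit with a nil slice means "scanned before, nothing found".
func (c *ghLRU) get(key string) ([]string, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	el, ok := c.data[key]
	if !ok {
		return nil, false
	}
	c.order.MoveToFront(el)
	return el.Value.(*entry).val, true
}

// put stores val under key and evicts the least-recently-used entry
// when the cache grows past its capacity.
func (c *ghLRU) put(key string, val []string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if el, ok := c.data[key]; ok {
		el.Value.(*entry).val = val
		c.order.MoveToFront(el)
		return
	}
	c.data[key] = c.order.PushFront(&entry{key: key, val: val})
	if c.order.Len() > c.cap {
		oldest := c.order.Back()
		c.order.Remove(oldest)
		delete(c.data, oldest.Value.(*entry).key)
	}
}

func main() {
	c := newGhLRU(2)
	c.put("acme/widgets", []string{"dev@example.com"})
	c.put("acme/empty", nil) // empty results are cached too
	c.get("acme/widgets")    // touch, so acme/empty becomes the oldest
	c.put("acme/tools", nil) // evicts acme/empty
	_, ok := c.get("acme/empty")
	fmt.Println("acme/empty still cached:", ok)
}
```

Returning a separate boolean from `get` is what allows an empty result to be distinguished from a cache miss.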

CLI integration

The --github-scrape flag diverts execution to runGitHubScrape() in cmd/scrappy/main.go:

func runGitHubScrape(cfg *cliConfig) error {
    g := internalemail.NewGitHubDiscoverer(token)
    result, err := g.DiscoverFromUser(ctx, login, 30)
    // ...
}

Output is always CSV with two columns, repo and email. If DiscoverFromUser returns no results, for example because the search term is an organization rather than a user, the CLI treats the term as an org and scans its repos directly:

if len(result) == 0 {
    repos, repoErr := g.ReposForOrg(ctx, login)
    // ...
}
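
Writing the two-column output could look like the sketch below, which uses the standard `encoding/csv` package. `writeCSV` and `renderCSV` are hypothetical helpers; the header row and the sorting by repository name are assumptions made so that the output is deterministic, not documented behavior.

```go
package main

import (
	"encoding/csv"
	"fmt"
	"io"
	"os"
	"sort"
	"strings"
)

// writeCSV emits one "repo,email" row per discovered address,
// preceded by a header row (an assumption of this sketch).
func writeCSV(w io.Writer, rows map[string][]string) error {
	cw := csv.NewWriter(w)
	if err := cw.Write([]string{"repo", "email"}); err != nil {
		return err
	}
	// Sort repository names because map iteration order is random.
	repos := make([]string, 0, len(rows))
	for r := range rows {
		repos = append(repos, r)
	}
	sort.Strings(repos)
	for _, r := range repos {
		for _, e := range rows[r] {
			if err := cw.Write([]string{r, e}); err != nil {
				return err
			}
		}
	}
	cw.Flush()
	return cw.Error()
}

// renderCSV returns the CSV as a string, which is convenient for testing.
func renderCSV(rows map[string][]string) (string, error) {
	var sb strings.Builder
	err := writeCSV(&sb, rows)
	return sb.String(), err
}

func main() {
	rows := map[string][]string{
		"acme/widgets": {"dev@example.com", "ops@example.com"},
		"acme/tools":   {"dev@example.com"},
	}
	if err := writeCSV(os.Stdout, rows); err != nil {
		fmt.Fprintln(os.Stderr, "error:", err)
		os.Exit(1)
	}
}
```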

Privacy considerations

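GitHub's "Keep my email address private" setting makes GitHub use a noreply address for web-based operations such as browser edits and merges. It does not rewrite existing commits, and commits pushed from a local Git client carry whatever user.email was configured there. The commit author metadata (commit.author.email in the API) therefore still contains the real address for commits made before the setting was enabled or authored locally with a real address. scrappy reads this API-level commit metadata, not the web UI, so it can extract real addresses even when privacy mode is enabled.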

This is not specific to scrappy: the author email is part of the Git commit object itself, so it is visible to anyone who clones the repository or queries the commits API.