Email Extraction Pipeline
scrappy uses a multi-stage pipeline to extract, verify, and enrich email addresses from job postings. The pipeline runs in the engine's post-processing step and operates on every JobPost before export.
Pipeline stages
Raw HTML description
→ ExtractFromHTML() (mailto: hrefs + standard emails)
→ Extract() (deobfuscated patterns + standard regex)
→ CompanyPageEnricher / MultiPageCompanyEnricher
→ MXVerifier (DNS MX record check)
→ SMTPVerifier (optional RCPT TO check)
→ Pattern permuter (first.last@domain inference)
1. Text Extraction — internal/email/extract.go
Extract(text string) []Email
Scans plain text for email-like strings using a regex, then validates each candidate through multiple gates.
Regex pattern:
mailRegex = regexp.MustCompile(
`[a-zA-Z0-9._%+\-]+(?:---[a-zA-Z0-9._%+\-]+)*@[a-zA-Z0-9.\-]+\.[a-zA-Z]{2,}\b`,
)
Validation gates (in order):
- Over-consumption guard (forOverconsumption): rejects matches where the regex consumed adjacent text (e.g. acme.com.jobs, where .jobs is an adjacent field, not part of the email). If the match was over-consumed, tryShortenEmail attempts to recover the real email by stripping the last domain segment.
- strictlyValidEmail(addr): RFC 5321-style validation via Go's mail.ParseAddress, plus:
  - Max 64-char local part, 255-char domain.
  - No consecutive/leading/trailing dots.
  - Domain must have at least one dot.
- TLD whitelist: the last segment must be in validTLDs, a comprehensive map of 200+ real-world TLDs (gTLDs, ccTLDs, new-gTLDs). This rejects regex false positives like support@mercor.comps.
- Blocked domain list: rejects throwaway mail providers (guerrillamail.com, mailinator.com, etc.), platform routing addresses (indeed.com, linkedin.com), and invalid suffixes (.local, .arpa, .test).
- Role address detection: identifies info@, admin@, support@, sales@, etc. and marks them with Role: true.
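The structural, TLD, and role gates above can be sketched roughly as follows. This is a minimal illustration, not the project's code: the function names strictlyValid, hasValidTLD, and isRole are hypothetical, and the two small maps are stand-ins for the much larger real lists.

```go
package main

import (
	"net/mail"
	"strings"
)

// validTLDs is a tiny stand-in for the real whitelist (assumption: the
// actual map holds 200+ entries).
var validTLDs = map[string]bool{"com": true, "org": true, "net": true, "io": true}

// rolePrefixes lists a few role local parts; the real list may differ.
var rolePrefixes = map[string]bool{"info": true, "admin": true, "support": true, "sales": true}

// strictlyValid applies the structural checks: parseable address, length
// limits, dot placement, and at least one dot in the domain.
func strictlyValid(addr string) bool {
	parsed, err := mail.ParseAddress(addr)
	if err != nil || parsed.Address != addr {
		return false // unparseable, or carries a display name / angle brackets
	}
	at := strings.LastIndex(addr, "@")
	local, domain := addr[:at], addr[at+1:]
	if len(local) > 64 || len(domain) > 255 {
		return false
	}
	for _, part := range []string{local, domain} {
		if strings.HasPrefix(part, ".") || strings.HasSuffix(part, ".") || strings.Contains(part, "..") {
			return false
		}
	}
	return strings.Contains(domain, ".")
}

// hasValidTLD reports whether the last domain label is whitelisted.
func hasValidTLD(addr string) bool {
	tld := addr[strings.LastIndex(addr, ".")+1:]
	return validTLDs[strings.ToLower(tld)]
}

// isRole reports whether the local part is a generic role mailbox.
func isRole(addr string) bool {
	local, _, found := strings.Cut(addr, "@")
	return found && rolePrefixes[strings.ToLower(local)]
}
```

Running the cheap structural check before the map lookups keeps the later gates free of bounds checks on malformed input.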
Deobfuscation detects common anti-harvesting patterns:
obfuscatedRegex = regexp.MustCompile(
`(?i)([a-zA-Z0-9._%+\-]+)\s*(?:\[at\]|\(at\)|\bat\b)\s*` +
`([a-zA-Z0-9.\-]+)\s*(?:\[dot\]|\(dot\)|\bdot\b)\s*([a-zA-Z]{2,})`,
)
Handles: name [at] domain [dot] com, name(at)domain(dot)com, name AT domain DOT com.
HTML entity normalization decodes entity-encoded characters (e.g. &#64; → @, &#46; → .) before regex matching.
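As a rough sketch of how the two steps might combine, the snippet below decodes HTML entities with the standard library and then rebuilds addresses from obfuscated matches. The helper names normalizeEntities and deobfuscate are hypothetical; only obfuscatedRegex is taken from the section above.

```go
package main

import (
	"html"
	"regexp"
	"strings"
)

// obfuscatedRegex is copied from the section above.
var obfuscatedRegex = regexp.MustCompile(
	`(?i)([a-zA-Z0-9._%+\-]+)\s*(?:\[at\]|\(at\)|\bat\b)\s*` +
		`([a-zA-Z0-9.\-]+)\s*(?:\[dot\]|\(dot\)|\bdot\b)\s*([a-zA-Z]{2,})`,
)

// normalizeEntities decodes named and numeric HTML entities so that an
// entity-encoded "@" or "." becomes the literal character.
func normalizeEntities(text string) string {
	return html.UnescapeString(text)
}

// deobfuscate (hypothetical helper) rebuilds "local@domain.tld" from every
// obfuscated match in text and lower-cases the result.
func deobfuscate(text string) []string {
	text = normalizeEntities(text)
	var out []string
	for _, m := range obfuscatedRegex.FindAllStringSubmatch(text, -1) {
		out = append(out, strings.ToLower(m[1]+"@"+m[2]+"."+m[3]))
	}
	return out
}
```

Because the bare-word alternatives (at, dot) also occur in ordinary prose, candidates produced this way would still need to pass the validation gates before being trusted.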
ExtractFromHTML(html string) []Email
For raw HTML (before HTML stripping), this function:
- Extracts mailto: href attributes from <a> tags.
- Strips query parameters from mailto URLs (?subject=...).
- Also runs the standard Extract() on the HTML text.
- Deduplicates results across both methods.
2. Company Page Enrichment — internal/email/company_crawl.go
CompanyPageEnricher (single page)
Fetches the company's website URL (from JobPost.CompanyURL) and runs Extract() on its content. MX-verified results are added to the job's email list with Source: "company_page".
MultiPageCompanyEnricher (multi-page)
A superset that probes multiple pages on the company domain:
Default page paths (in hit-rate order):
/ → /about → /about-us → /about/team → /company/about
→ /team → /people → /leadership → /our-team
→ /contact → /contact-us → /get-in-touch
→ /careers → /careers/team → /jobs
Subdomain probes (only when the host is a bare domain):
about.<domain> → team.<domain> → careers.<domain>
→ contact.<domain> → people.<domain>
Each page is fetched, Extract() is run on the body, and candidates pass through MX verification. Non-fatal errors (404, timeout) are logged and do not prevent partial results from being returned. Concurrency is bounded by a channel-based semaphore (default 3).
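A channel-based semaphore of the kind described here can be sketched as below. The names fetchAll and errNotFound are hypothetical, and the fetch callback stands in for "download the page, run Extract(), MX-verify"; none of this is taken from the project's source.

```go
package main

import (
	"errors"
	"log"
	"sync"
)

// errNotFound is a sample non-fatal error for exercising fetchAll.
var errNotFound = errors.New("page not found")

// fetchAll calls fetch for every page with at most limit calls in flight
// and returns whatever the successful calls produced.
func fetchAll(pages []string, limit int, fetch func(string) ([]string, error)) []string {
	if limit < 1 {
		limit = 1 // an unbuffered semaphore would block forever
	}
	sem := make(chan struct{}, limit) // buffered channel used as a semaphore
	var (
		mu  sync.Mutex
		wg  sync.WaitGroup
		out []string
	)
	for _, p := range pages {
		wg.Add(1)
		go func(p string) {
			defer wg.Done()
			sem <- struct{}{}        // acquire a slot
			defer func() { <-sem }() // release it
			emails, err := fetch(p)
			if err != nil {
				log.Printf("skipping %s: %v", p, err) // non-fatal
				return
			}
			mu.Lock()
			out = append(out, emails...)
			mu.Unlock()
		}(p)
	}
	wg.Wait()
	return out
}
```

Result order depends on goroutine scheduling, so callers that need the hit-rate ordering of the page list would have to sort or index the results afterwards.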
3. MX Verification — internal/email/smtp_verify.go
MXVerifier
Performs DNS MX record lookup to verify that a domain accepts mail:
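func (v *MXVerifier) Verify(ctx context.Context, addr string) bool {
	d := domainFrom(addr)
	// Bound each lookup to 10 seconds.
	lookupCtx, cancel := context.WithTimeout(ctx, 10*time.Second)
	defer cancel()
	mxs, err := v.Resolver.LookupMX(lookupCtx, d)
	return err == nil && len(mxs) > 0
}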
- 10-second timeout per lookup.
- Supports a LookupMX stub for testing.
- Nil resolver with no stub returns true (safe mode for offline/test).
- Returns (verified bool, reason string) via VerifyEmail() for diagnostics.
SMTPVerifier
Deeper verification using the AfterShip/email-verifier library. It connects to the mail server and issues an SMTP RCPT TO command without sending a message, then reports the outcome as:
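type SMTPResult struct {
	Deliverable bool   // RCPT TO returned 250
	CatchAll    bool   // server accepts any mailbox
	HasMX       bool   // domain has at least one MX record
	HostExists  bool   // mail server host exists
	Reason      string // human-readable explanation of the outcome
	Free        bool   // address belongs to a free mail provider
	RoleAccount bool   // local part is a role mailbox (info@, admin@, ...)
	Disposable  bool   // domain is a throwaway mail provider
}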
Key caveat: Gmail and Outlook return 250 for any well-formed address, so a 250 from these providers does not confirm that the mailbox exists. The verifier reports CatchAll: true in that case.
Configured with:
- 3 concurrent workers (adjustable via WithConcurrency).
- 10-second connect/operation timeouts.
- Customizable EHLO/MAIL FROM identity.
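One way a caller might interpret an SMTPResult, given the catch-all caveat, is sketched below. The classify function and its labels are hypothetical and not part of the project; the struct simply mirrors the one shown above.

```go
package main

// SMTPResult mirrors the struct shown above.
type SMTPResult struct {
	Deliverable bool
	CatchAll    bool
	HasMX       bool
	HostExists  bool
	Reason      string
	Free        bool
	RoleAccount bool
	Disposable  bool
}

// classify (hypothetical) maps a result to a coarse label. The order of the
// cases matters: a catch-all server makes Deliverable uninformative, so it
// is checked before Deliverable.
func classify(r SMTPResult) string {
	switch {
	case !r.HasMX || !r.HostExists:
		return "undeliverable"
	case r.Disposable:
		return "disposable"
	case r.CatchAll:
		return "unknown" // a 250 proves nothing on a catch-all server
	case r.Deliverable:
		return "deliverable"
	default:
		return "undeliverable"
	}
}
```

Treating catch-all results as "unknown" rather than "deliverable" keeps downstream quality scoring from over-counting addresses that were never really confirmed.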
4. GitHub Discovery — internal/email/github_discover.go
The GitHubDiscoverer extracts email addresses from commit author data on public GitHub repositories. Used by the --github-scrape CLI flag.
See GitHub Discovery for full details.
5. Pattern Permutation — internal/email/pattern.go
Permute(first, last, domain string, patterns []string) []string
Generates candidate email addresses from a person's name and domain, using common corporate email patterns:
{first}.{last} → john.doe@acme.com
{f}{last} → jdoe@acme.com
{first}{last} → johndoe@acme.com
{first} → john@acme.com
{first}_{last} → john_doe@acme.com
{f}.{last} → j.doe@acme.com
{last}.{first} → doe.john@acme.com
{first}-{last} → john-doe@acme.com
{first}{l} → johnd@acme.com
{f}{l} → jd@acme.com
Patterns are ordered by statistical hit rate (based on real-world email provider data).
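Placeholder expansion of this kind can be sketched with strings.NewReplacer. The code below is an illustration only: expand, initial, and the lower-case permute are hypothetical names, and lower-casing and trimming the inputs is an assumption about how names are normalized.

```go
package main

import "strings"

// initial returns the first character of s, or "" for an empty string.
func initial(s string) string {
	if s == "" {
		return ""
	}
	return string([]rune(s)[:1])
}

// expand fills one pattern such as "{f}{last}" for a given name.
func expand(pattern, first, last string) string {
	first = strings.ToLower(strings.TrimSpace(first))
	last = strings.ToLower(strings.TrimSpace(last))
	return strings.NewReplacer(
		"{first}", first,
		"{last}", last,
		"{f}", initial(first),
		"{l}", initial(last),
	).Replace(pattern)
}

// permute builds one candidate address per pattern, preserving the order
// of patterns so the most likely candidates come first.
func permute(first, last, domain string, patterns []string) []string {
	out := make([]string, 0, len(patterns))
	for _, p := range patterns {
		out = append(out, expand(p, first, last)+"@"+strings.ToLower(domain))
	}
	return out
}
```

For example, permute("John", "Doe", "acme.com", []string{"{first}.{last}", "{f}{last}"}) yields john.doe@acme.com followed by jdoe@acme.com.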
InferPattern(known map[string][2]string) string
Given two or more known (email → firstName, lastName) samples, infers the most likely pattern for the domain:
InferPattern(map[string][2]string{
"john.doe@acme.com": {"john", "doe"},
"jane.smith@acme.com": {"jane", "smith"},
})
// Returns: "{first}.{last}"
If multiple patterns match (possible with short names), the first match in CommonPatterns() order wins.
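The inference and its tie-breaking rule could be sketched as follows. This version requires every sample to match a pattern, which is an assumption; the real InferPattern may instead pick the pattern matching the most samples. The names commonPatterns, render, and inferPattern are hypothetical stand-ins.

```go
package main

import "strings"

// commonPatterns stands in for CommonPatterns(); order decides ties.
var commonPatterns = []string{
	"{first}.{last}", "{f}{last}", "{first}{last}", "{first}",
	"{first}_{last}", "{f}.{last}", "{last}.{first}", "{first}-{last}",
	"{first}{l}", "{f}{l}",
}

// render fills a pattern's placeholders for a lower-cased name.
func render(pattern, first, last string) string {
	first, last = strings.ToLower(first), strings.ToLower(last)
	f, l := "", ""
	if first != "" {
		f = string([]rune(first)[:1])
	}
	if last != "" {
		l = string([]rune(last)[:1])
	}
	return strings.NewReplacer(
		"{first}", first, "{last}", last, "{f}", f, "{l}", l,
	).Replace(pattern)
}

// inferPattern returns the first pattern that reproduces the local part of
// every known address, or "" when no pattern fits all samples.
func inferPattern(known map[string][2]string) string {
	if len(known) == 0 {
		return ""
	}
	for _, p := range commonPatterns {
		matchesAll := true
		for addr, name := range known {
			local, _, _ := strings.Cut(strings.ToLower(addr), "@")
			if render(p, name[0], name[1]) != local {
				matchesAll = false
				break
			}
		}
		if matchesAll {
			return p
		}
	}
	return ""
}
```

Iterating patterns in the outer loop is what makes the first match in list order win, regardless of Go's randomized map iteration over the samples.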
How the pipeline fits together (in the engine)
The engine calls enrichJobEmails() for every job:
- Deduplicate: remove any emails already set by the scraper.
- Text extraction: run Extract() on the job description + company description.
- HTML extraction: run ExtractFromHTML() on the raw HTML before it is stripped.
- Company page enrichment: if CompanyURL is set, probe it for additional emails.
- MX verification: for every email collected, do a DNS MX lookup (bounded to 5 concurrent lookups by default).
- Domain derivation: set JobPost.Domain from the first email's domain or CompanyURL.
- Blocked domain filtering: reject throwaway providers and platform addresses at every stage.
The result is that each JobPost carries a list of model.Email with Addr, Verified, Source, and Role fields, ready for export and quality scoring.
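The domain derivation step can be illustrated with a small helper. This is a sketch under assumptions: deriveDomain is a hypothetical name, and stripping a leading "www." from the company host is a choice made here for illustration, not something the pipeline is documented to do.

```go
package main

import (
	"net/url"
	"strings"
)

// deriveDomain (hypothetical) prefers the domain of the first email and
// falls back to the host of companyURL.
func deriveDomain(emails []string, companyURL string) string {
	if len(emails) > 0 {
		first := emails[0]
		if at := strings.LastIndex(first, "@"); at >= 0 && at < len(first)-1 {
			return strings.ToLower(first[at+1:])
		}
	}
	u, err := url.Parse(companyURL)
	if err != nil {
		return ""
	}
	// Hostname() is empty when the URL has no scheme (e.g. "acme.com"),
	// so such inputs yield "" in this sketch.
	return strings.TrimPrefix(strings.ToLower(u.Hostname()), "www.")
}
```

Preferring the email's domain makes sense when the company site and its mail domain differ, since later permutation and verification steps operate on the mail domain.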