Export Formats

scrappy supports four export formats. The format is selected with --format, and output is written to the path given by --out (or to stdout for JSONL).


1. JSONL — internal/export/jsonl.go

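Default format. One JSON object per line, each representing a complete JobPost. The sample record below is pretty-printed for readability; in the actual output it occupies a single line.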

scrappy --sites remoteok --search "golang" --format jsonl --out jobs.jsonl
{
  "id": "remoteok-123",
  "title": "Go Developer",
  "company_name": "Acme Corp",
  "job_url": "https://remoteok.com/...",
  "site": "remoteok",
  "emails": [{"addr": "jobs@acme.com", "verified": true, "source": "company_page", "role": false}],
  "quality_score": 72,
  "compensation": {"interval": "yearly", "min_amount": 120000, "max_amount": 180000, "currency": "USD"}
}

Schema: every JobPost field is serialized as-is via encoding/json. The output is suitable for streaming ingestion into data pipelines, BigQuery, or Elasticsearch.
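Because each line is an independent JSON document, a consumer can process the file one record at a time without loading it whole. The following is a minimal sketch of such a reader, not part of scrappy itself; the helper name `iter_jobs` and the second sample record are made up for illustration, and only the `id`, `site`, and `quality_score` fields from the example above are used.

```python
import io
import json


def iter_jobs(stream):
    """Yield one decoded JobPost dict per non-empty line of a JSONL stream."""
    for line in stream:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines, e.g. a trailing newline
        yield json.loads(line)


# Two in-memory records standing in for jobs.jsonl; values are illustrative.
sample = io.StringIO(
    '{"id": "remoteok-123", "site": "remoteok", "quality_score": 72}\n'
    '{"id": "remoteok-456", "site": "remoteok", "quality_score": 41}\n'
)

# Keep only the ids of jobs scoring above 50, reading one line at a time.
high_quality = [
    job["id"] for job in iter_jobs(sample) if job.get("quality_score", 0) > 50
]
print(high_quality)
```

With a real export, `open("jobs.jsonl", encoding="utf-8")` would take the place of the in-memory sample.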


2. CSV — internal/export/csv.go

Standard flat table with 34 columns. One row per job.

scrappy --sites linkedin --search "engineer" --format csv --out jobs.csv

Columns:

| Column | Description |
| --- | --- |
| site | Source site name |
| title | Job title |
| company_name | Employer name |
| location | City, State, Country (formatted) |
| is_remote | true/false |
| job_type | fulltime/parttime/contract/internship |
| date_posted | RFC 3339 timestamp |
| description | Plain text (HTML stripped) |
| job_url | URL on the job board |
| emails | Semicolon-separated addresses |
| emails_verified | Semicolon-separated boolean strings |
| email_source | Semicolon-separated source labels |
| apply_method | easy_apply/email/direct_url/external_url |
| seniority | entry/mid/senior/lead |
| department | eng/data/product/... |
| company_url | Company website |
| job_url_direct | Direct apply URL |
| company_industry | Industry classification |
| company_logo | Logo URL |
| company_revenue | Revenue string |
| company_num_employees | Employee count |
| company_addresses | Office locations |
| company_description | Company blurb |
| skills | Semicolon-separated list |
| experience_range | e.g. "3-5 years" |
| company_rating | Float rating |
| company_reviews_count | Number of reviews |
| vacancy_count | Open positions |
| work_from_home_type | Remote type label |
| quality_score | 0-100 score |
| salary_interval | yearly/monthly/weekly/daily/hourly |
| salary_min | Minimum salary amount |
| salary_max | Maximum salary amount |
| salary_currency | ISO currency code |
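The semicolon-separated columns need to be split again on the consumer side. The sketch below shows one way to do that with the standard library. It assumes that `emails`, `emails_verified`, and `email_source` are positionally aligned and that booleans are written as "true"/"false"; the column table does not confirm either point. The helper names, the second address, and the `description` source label are hypothetical.

```python
import csv
import io


def split_multi(cell):
    """Split a semicolon-separated cell into a list, dropping empty entries."""
    return [part.strip() for part in cell.split(";") if part.strip()]


def emails_for_row(row):
    """Rebuild one dict per address from the three parallel email columns.

    Assumes the columns are positionally aligned and that booleans are
    written as "true"/"false"; neither is confirmed by the column table.
    """
    addrs = split_multi(row.get("emails", ""))
    verified = split_multi(row.get("emails_verified", ""))
    sources = split_multi(row.get("email_source", ""))
    return [
        {"addr": a, "verified": v.lower() == "true", "source": s}
        for a, v, s in zip(addrs, verified, sources)
    ]


# A cut-down export with 5 of the 34 columns; values are illustrative.
sample = io.StringIO(
    "site,title,emails,emails_verified,email_source\n"
    "remoteok,Go Developer,jobs@acme.com;hr@acme.com,true;false,company_page;description\n"
)

rows = list(csv.DictReader(sample))
contacts = emails_for_row(rows[0])
print(contacts)
```

The same `split_multi` helper applies to the `skills` column.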

CSV Emails-Only — internal/export/csv_emails_only.go

A separate variant, activated with --csv-emails-only, writes one row per email address instead of one row per job:

scrappy --sites remoteok --search "golang" --format csv --csv-emails-only --out emails.csv

Columns: email, verified, source, role, site, job_id, title, company_name, job_url

Useful for outreach workflows where each email needs independent handling (CRM imports, mail merge, recruiter contact lists).

Rows are deduplicated by (email, job_id), so an address that appears more than once on the same job is written only once.
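The flattening and deduplication described above can be sketched as follows. This is an illustration in Python, not the Go code in csv_emails_only.go. It assumes the input records look like the JSONL example, that `job_id` comes from the JobPost `id` field, and that addresses are compared exactly; the second job id is made up.

```python
# Column order as listed for the emails-only variant.
COLUMNS = ["email", "verified", "source", "role", "site",
           "job_id", "title", "company_name", "job_url"]


def email_rows(jobs):
    """Flatten jobs into one row per (email, job_id), skipping duplicates."""
    seen = set()
    rows = []
    for job in jobs:
        for email in job.get("emails") or []:
            key = (email["addr"], job["id"])
            if key in seen:
                continue  # same address already emitted for this job
            seen.add(key)
            rows.append({
                "email": email["addr"],
                "verified": email.get("verified", False),
                "source": email.get("source", ""),
                "role": email.get("role", False),
                "site": job.get("site", ""),
                "job_id": job["id"],  # assumption: job_id maps to JobPost id
                "title": job.get("title", ""),
                "company_name": job.get("company_name", ""),
                "job_url": job.get("job_url", ""),
            })
    return rows


addr = {"addr": "jobs@acme.com", "verified": True,
        "source": "company_page", "role": False}
jobs = [
    {"id": "remoteok-123", "site": "remoteok", "emails": [addr, addr]},
    {"id": "remoteok-456", "site": "remoteok", "emails": [addr]},
]

rows = email_rows(jobs)
print([(r["email"], r["job_id"]) for r in rows])
```

The repeated address on the first job collapses to one row, while the same address on a different job is kept because the key includes job_id.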


3. XLSX — internal/export/xlsx.go

Excel spreadsheet with a single sheet named jobs. Same 34 columns as CSV, written via the excelize library.

scrappy --sites indeed --search "developer" --format xlsx --out jobs.xlsx

Multi-value fields (emails, skills) are semicolon-delimited strings within single cells. The sheet contains no formulas or formatting, only plain tabular data for spreadsheet consumption.


4. Parquet — internal/export/parquet.go

Columnar storage format optimized for analytical workloads. Uses the parquet-go library with Snappy compression.

scrappy --sites linkedin --search "data engineer" --format parquet --out jobs.parquet

Schema: 34 typed columns (string, boolean, int64). String columns use PLAIN_DICTIONARY encoding for efficient storage of repeated values. Row group size is 128 MiB.

| Parquet column | Type |
| --- | --- |
| site | UTF8 |
| title | UTF8 |
| company_name | UTF8 |
| is_remote | BOOLEAN |
| quality_score | INT64 |
| company_reviews_count | INT64 |
| vacancy_count | INT64 |
| (all others) | UTF8 |

Ideal for loading into Pandas, Spark, or DuckDB for analysis:

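import pandas as pd

df = pd.read_parquet("jobs.parquet")  # needs a Parquet engine such as pyarrow
# Count jobs per site, keeping only those with a quality score above 50
print(df[df["quality_score"] > 50].groupby("site").size())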

Format selection guideline

| Format | Best for |
| --- | --- |
| JSONL | Streaming, pipelines, BigQuery |
| CSV | Spreadsheets, email outreach (with --csv-emails-only) |
| XLSX | Excel users, quick visual scan |
| Parquet | Analytical queries, large datasets, Pandas/Spark |
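For scripted runs, the flags used on this page can be assembled and sanity-checked before invoking the CLI. The helper below is a hypothetical wrapper, not something scrappy ships. It uses only the flags shown above and encodes two assumptions drawn from the examples: that --csv-emails-only is meaningful only with --format csv, and that --out may be omitted only for JSONL. Whether scrappy itself enforces either rule is not stated here.

```python
FORMATS = ("jsonl", "csv", "xlsx", "parquet")


def build_args(sites, search, fmt="jsonl", out=None, emails_only=False):
    """Return an argv list for scrappy using only the documented flags."""
    if fmt not in FORMATS:
        raise ValueError(f"unsupported format: {fmt!r}")
    if emails_only and fmt != "csv":
        # Assumption: the emails-only variant is a CSV-specific option.
        raise ValueError("--csv-emails-only requires --format csv")
    if out is None and fmt != "jsonl":
        # Assumption: only JSONL falls back to stdout when --out is absent.
        raise ValueError(f"--out is required for {fmt}")

    args = ["scrappy", "--sites", sites, "--search", search, "--format", fmt]
    if emails_only:
        args.append("--csv-emails-only")
    if out is not None:
        args += ["--out", out]
    return args


print(build_args("remoteok", "golang", fmt="csv",
                 out="emails.csv", emails_only=True))
```

The returned list can be passed to `subprocess.run` as is, which avoids shell quoting problems with multi-word search terms.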