Structured, real-world job data for training recruitment and labor market models
Synthetic job data has a ceiling. Models trained on it learn patterns that don't reflect how employers actually write job descriptions, structure requirements, or signal seniority. CleanJobData provides real job listings sourced directly from employer applicant tracking systems (ATS): consistently normalized, commercially licensed, and available at scale. Whether you're training a job matching model, a salary prediction system, a skills extraction pipeline, or a labor market forecasting tool, you're working with data that reflects how the actual market behaves.
Benefits
✓ Real employer data, not synthetic: listings sourced directly from Greenhouse, Lever, Ashby, and Workable. Models trained here learn from how actual companies hire.
✓ Consistent schema across all sources: every listing maps to the same object. No per-source preprocessing before your training pipeline sees the data.
✓ Salary fields parsed and normalized: min, max, and currency where the employer provides them. Useful signal for pay prediction and benchmarking models.
✓ Seniority labels from ATS metadata, not just title parsing. Meaningful signal for experience-level classification tasks.
✓ Location with lat/lng and geo hierarchy: city, state, country, timezone. Structured for geospatial ML tasks.
✓ Company metadata (headcount, industry, and funding stage where available) adds organizational context to training examples.
✓ Commercial use permitted: standard terms allow AI and ML product development.
Features
• 1M+ active job listings updated continuously
• Normalized salary fields (min, max, currency) across all sources
• Seniority field from ATS metadata: entry, mid, senior, staff, principal
• Location with lat/lng, city ID, country code, and timezone
• Employment type: full-time, part-time, contract, internship
• Historical data available for trend analysis (contact us for details)
• REST API and CSV export options
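As a rough illustration of what a single normalized listing could look like once loaded into a training pipeline, here is a minimal Python sketch. The class name, field names, and the `salary_midpoint` helper are hypothetical and chosen for readability; only the seniority and employment-type values come from the lists above. Check the actual API response before relying on any of these names.

```python
from dataclasses import dataclass
from typing import Literal, Optional

# Value sets taken from the feature list; the type names are illustrative.
Seniority = Literal["entry", "mid", "senior", "staff", "principal"]
EmploymentType = Literal["full-time", "part-time", "contract", "internship"]


@dataclass
class JobListing:
    """Hypothetical in-memory shape for one normalized listing."""
    id: str
    title: str
    seniority: Optional[Seniority] = None
    employment_type: Optional[EmploymentType] = None
    # Salary is only present where the employer provides it.
    salary_min: Optional[float] = None
    salary_max: Optional[float] = None
    salary_currency: Optional[str] = None
    lat: Optional[float] = None
    lng: Optional[float] = None
    country_code: Optional[str] = None
    timezone: Optional[str] = None

    def salary_midpoint(self) -> Optional[float]:
        """Midpoint of the salary range, or None if either bound is missing."""
        if self.salary_min is None or self.salary_max is None:
            return None
        return (self.salary_min + self.salary_max) / 2
```

Keeping every optional field explicitly `Optional` makes missing salary or location data visible to downstream code instead of silently defaulting to zero.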
Frequently Asked Questions
Can I use CleanJobData for commercial AI products?
Yes. Our standard terms permit use in commercial AI and ML products, including job matching algorithms, salary prediction models, and recruitment automation tools. Review the full terms at cleanjobdata.com/terms.
Is this real data or synthetic?
Real data, sourced directly from employer career pages via Greenhouse, Lever, Ashby, and Workable. These are live job postings from companies actively hiring, not generated or augmented examples.
What's the best way to build a training dataset?
Use the REST API with date-range filters to pull a snapshot, or run regular syncs with the cursor parameter to build an ongoing dataset. For large historical pulls, contact us at cleanjobdata.com/support to discuss custom data delivery options.
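A snapshot pull with a date range and cursor pagination might be structured like the Python sketch below. It is an illustration under assumptions, not the documented client: the query parameter names `posted_after` and `posted_before`, the response keys `results` and `next_cursor`, and the `fetch_page` callable are all hypothetical. Only the existence of date-range filters and a `cursor` parameter comes from the text above.

```python
from typing import Callable, Dict, Iterator, Optional

# Hypothetical page shape: {"results": [...], "next_cursor": "abc" or None}
Page = Dict[str, object]
# fetch_page takes query parameters and returns one decoded page.
# In practice it would wrap an authenticated HTTP GET against the REST API.
FetchPage = Callable[[Dict[str, str]], Page]


def iter_listings(
    fetch_page: FetchPage, posted_after: str, posted_before: str
) -> Iterator[dict]:
    """Yield every listing in the date range, following cursors until exhausted."""
    cursor: Optional[str] = None
    while True:
        # Assumed filter names; substitute the real date-range parameters.
        query = {"posted_after": posted_after, "posted_before": posted_before}
        if cursor:
            query["cursor"] = cursor
        page = fetch_page(query)
        yield from page.get("results", [])
        cursor = page.get("next_cursor")
        if not cursor:
            break
```

Passing the HTTP call in as `fetch_page` keeps the pagination loop testable offline and leaves authentication, retries, and rate limiting to the caller.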
Does the data include job descriptions?
Yes. The full job description text is included in the detail endpoint response, while the list endpoint returns only a summary. For training NLP models on job description text, use the detail endpoint or batch-fetch by ID.
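Collecting description text for a set of listing IDs could look something like the following Python sketch. The `fetch_batch` callable, the batch size, and the `id` and `description` field names are assumptions for illustration; the real detail endpoint may use different names and batch limits.

```python
from typing import Callable, Dict, Iterator, List, Sequence

# fetch_batch takes a list of listing IDs and returns their detail records.
# Hypothetical record shape: {"id": "...", "description": "..."}
FetchBatch = Callable[[List[str]], List[dict]]


def chunked(items: Sequence[str], size: int) -> Iterator[List[str]]:
    """Split items into consecutive lists of at most `size` elements."""
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


def collect_descriptions(
    ids: Sequence[str], fetch_batch: FetchBatch, batch_size: int = 50
) -> Dict[str, str]:
    """Map each listing ID to its full description text, one batch at a time."""
    descriptions: Dict[str, str] = {}
    for batch in chunked(ids, batch_size):
        for detail in fetch_batch(batch):
            descriptions[detail["id"]] = detail["description"]
    return descriptions
```

A typical flow would gather IDs from the list endpoint first, then pass them here to build the text corpus.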
How do you handle data freshness for ongoing training pipelines?
Use the max_age filter combined with cursor pagination to pull only new listings since your last sync. This keeps your training dataset current without re-downloading the full index on each run.
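One way to structure such an incremental sync is sketched below in Python. The `max_age` and `cursor` parameter names come from the answer above; everything else is assumed, including the `results` and `next_cursor` response keys, the `id` field, and the `fetch_page` callable. The unit and format of `max_age` are not stated here, so the sketch passes the value through unchanged. Because consecutive sync windows may overlap, it also skips IDs that were already stored.

```python
from typing import Callable, Dict, List, Optional, Set

# fetch_page takes query parameters and returns one decoded page.
# Hypothetical page shape: {"results": [...], "next_cursor": "abc" or None}
FetchPage = Callable[[Dict[str, str]], dict]


def sync_new_listings(
    fetch_page: FetchPage, max_age: str, seen_ids: Set[str]
) -> List[dict]:
    """Return listings not yet in seen_ids, and record their IDs in seen_ids."""
    new_listings: List[dict] = []
    cursor: Optional[str] = None
    while True:
        query = {"max_age": max_age}
        if cursor:
            query["cursor"] = cursor
        page = fetch_page(query)
        for listing in page.get("results", []):
            # Overlapping windows can return listings from the previous run.
            if listing["id"] not in seen_ids:
                seen_ids.add(listing["id"])
                new_listings.append(listing)
        cursor = page.get("next_cursor")
        if not cursor:
            return new_listings
```

Persisting `seen_ids` between runs (for example in a database table) is what lets each sync append only new rows to the training dataset.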