Extracting Salary Data from Job Descriptions
CleanJobData Engineering
Pay transparency is becoming a legal requirement in many regions, but salary information is still often buried in unstructured blocks of text.
The Challenge
Salary information can appear in many formats:
- "$120,000 - $150,000 per year"
- "£50k - £70k"
- "Up to $200/hr"
- "Competitive salary + equity"
Our Extraction Pipeline
We combine source-specific metadata with regular-expression patterns to extract structured pay data.
1. Structured Field Detection
Some ATS platforms (like Ashby) provide structured compensation fields. We prioritize these as they are the most accurate.
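As a rough illustration of this "structured first" priority, the sketch below checks for a structured compensation object before falling back to text parsing. The field names (`compensation`, `min`, `max`, `currency`, `description`) and the `source` label are hypothetical and do not reflect any particular ATS schema.

```python
from typing import Callable, Optional


def extract_salary(
    job: dict, parse_text: Callable[[str], Optional[dict]]
) -> Optional[dict]:
    """Prefer a structured compensation field; fall back to text parsing."""
    # "compensation" is an assumed field name, used only for illustration.
    structured = job.get("compensation")
    if structured:
        return {
            "salary_min": structured.get("min"),
            "salary_max": structured.get("max"),
            "salary_currency": structured.get("currency"),
            "source": "structured",
        }
    # No structured data: hand the free-text description to the parser.
    parsed = parse_text(job.get("description", ""))
    if parsed is not None:
        parsed = {**parsed, "source": "regex"}
    return parsed
```

Recording where each value came from makes it possible to weigh structured values more heavily than parsed ones later on.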
2. Regex-Based Parsing
For unstructured text, we run a series of patterns to identify:
- Currency symbols ($, £, €, etc.)
- Numeric ranges
- Pay periods (hourly, monthly, annual)
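A simplified sketch of such a pattern is shown here. It is not the production pattern set: it only knows three currency symbols, a `k` suffix, and a handful of period words, and names like `SALARY_RE` and `parse_salary` are made up for this example. It also treats a single figure such as "Up to $200/hr" as both the low and the high value, which a real pipeline would likely handle separately.

```python
import re
from typing import Optional

SALARY_RE = re.compile(
    r"""
    (?P<currency>[$£€])\s*                       # currency symbol
    (?P<low>\d[\d,]*(?:\.\d+)?)(?P<low_k>k)?     # first amount, optional "k"
    (?:\s*(?:-|–|to)\s*                          # range separator
        [$£€]?\s*
        (?P<high>\d[\d,]*(?:\.\d+)?)(?P<high_k>k)?
    )?
    (?:\s*(?:/|per\s+)\s*                        # "/hr", "per year", ...
        (?P<period>hr|hour|month|mo|year|yr|annum)
    )?
    """,
    re.IGNORECASE | re.VERBOSE,
)

PERIODS = {
    "hr": "hourly", "hour": "hourly",
    "mo": "monthly", "month": "monthly",
    "yr": "annual", "year": "annual", "annum": "annual",
}


def _to_number(digits: str, k_suffix: Optional[str]) -> float:
    """Convert '120,000' or '50' (+ 'k') to a float."""
    value = float(digits.replace(",", ""))
    return value * 1000 if k_suffix else value


def parse_salary(text: str) -> Optional[dict]:
    """Return symbol, low, high, and period, or None if nothing matched."""
    m = SALARY_RE.search(text)
    if m is None:
        return None
    low = _to_number(m["low"], m["low_k"])
    high = _to_number(m["high"], m["high_k"]) if m["high"] else low
    period = PERIODS.get((m["period"] or "").lower())
    return {"symbol": m["currency"], "low": low, "high": high, "period": period}
```

Run against the formats listed earlier, "£50k - £70k" yields a low of 50000.0 and a high of 70000.0 with no period, while "Competitive salary + equity" returns `None` because it contains no currency symbol.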
3. Normalization
Once extracted, we normalize the values:
- Annualization: Hourly and monthly rates are converted to annual equivalents for easy comparison.
- Currency Mapping: We map the detected currency symbol to its ISO 4217 currency code (for example, "£" becomes "GBP").
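The two normalization steps can be sketched as follows. The multipliers are assumptions made for this example (40 hours a week for 52 weeks, so 2,080 hours a year, and 12 months a year), as is the choice to treat a missing pay period as annual. The symbol table is also deliberately naive: "$" is mapped to USD here even though several currencies share that symbol.

```python
from typing import Optional

# Assumed multipliers: 40 h/week * 52 weeks = 2080 hours; 12 months per year.
ANNUAL_MULTIPLIER = {"hourly": 2080, "monthly": 12, "annual": 1}

# Naive symbol-to-ISO 4217 mapping; "$" is ambiguous in practice.
SYMBOL_TO_ISO = {"$": "USD", "£": "GBP", "€": "EUR"}


def normalize(symbol: str, low: float, high: float, period: Optional[str]) -> dict:
    """Annualize a pay range and attach an ISO currency code."""
    # Assumption: a range with no stated period is already annual.
    factor = ANNUAL_MULTIPLIER.get(period or "annual", 1)
    return {
        "salary_min": low * factor,
        "salary_max": high * factor,
        "salary_currency": SYMBOL_TO_ISO.get(symbol),
    }
```

Under these assumptions, a rate of $200 per hour annualizes to 200 × 2,080 = 416,000 USD.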
Using the Data
Our API exposes these as three clean fields:
- salary_min: The lower bound of the range.
- salary_max: The upper bound.
- salary_currency: The ISO code (e.g., "USD").
This allows you to build powerful filters like "Jobs paying over $100k" without having to parse descriptions yourself.
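For example, a filter like this could be written over a list of job records that have already been fetched. The function name `jobs_paying_over` is hypothetical, and comparing against `salary_min` (rather than `salary_max`) is one possible reading of "paying over"; only the three field names come from the API description above.

```python
from typing import Iterable, List


def jobs_paying_over(
    jobs: Iterable[dict], threshold: float, currency: str = "USD"
) -> List[dict]:
    """Keep jobs whose lower salary bound exceeds the threshold."""
    return [
        job
        for job in jobs
        if job.get("salary_currency") == currency
        and job.get("salary_min") is not None
        and job["salary_min"] > threshold
    ]


jobs = [
    {"title": "Backend Engineer", "salary_min": 120000, "salary_max": 150000, "salary_currency": "USD"},
    {"title": "Designer", "salary_min": 50000, "salary_max": 70000, "salary_currency": "GBP"},
    {"title": "Support Agent", "salary_min": None, "salary_max": None, "salary_currency": None},
]
high_paying = jobs_paying_over(jobs, 100_000)
```

Filtering on currency first avoids comparing amounts in different currencies against the same threshold.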