How to Normalize Job Locations at Scale
Hiring data is notoriously messy. One of the biggest challenges in building a high-quality job board is location normalization.
The Problem: Messy Geo Data
When you scrape job postings or pull from ATS APIs, you'll encounter a wide variety of location formats:
- Abbreviations: "NYC", "SF", "LA"
- Redundant info: "San Francisco, California, United States, North America"
- Vague terms: "Remote", "Hybrid", "Anywhere"
- Non-standard names: "The Big Apple"
If you store these as raw strings, a filter on "California" will never match a job whose location reads "San Francisco", because nothing in the string ties the city to its state.
The Solution: Canonical Geo IDs
At CleanJobData, we solve this by mapping every location string to a canonical ID in our global database of 200,000+ locations.
1. Parsing the String
We use a multi-step parser that identifies city, state, and country components. When a name is ambiguous, source-specific hints break the tie (e.g., if the job is from a UK-based employer, we bias towards UK cities).
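To make the parsing step concrete, here is a minimal sketch in Python. It is not CleanJobData's actual parser: the lookup tables, the `ParsedLocation` type, and the `source_country` parameter are all hypothetical stand-ins, and a real system would query a full gazetteer rather than a few hard-coded dictionaries.

```python
from dataclasses import dataclass
from typing import Optional

# Toy lookup tables (assumptions for illustration only).
ALIASES = {"nyc": "New York", "sf": "San Francisco",
           "la": "Los Angeles", "the big apple": "New York"}
STATES = {"california": "CA", "ca": "CA", "massachusetts": "MA", "ma": "MA"}
COUNTRIES = {"united states": "US", "usa": "US", "us": "US",
             "united kingdom": "GB", "uk": "GB"}
# City name -> candidate (country, state) pairs; ambiguous names have several.
CITIES = {
    "San Francisco": [("US", "CA")],
    "Los Angeles": [("US", "CA")],
    "New York": [("US", "NY")],
    "Cambridge": [("US", "MA"), ("GB", None)],
}


@dataclass
class ParsedLocation:
    city: Optional[str] = None
    state: Optional[str] = None
    country: Optional[str] = None


def parse_location(raw: str, source_country: Optional[str] = None) -> ParsedLocation:
    parsed = ParsedLocation()
    # Step 1: classify each comma-separated part as city, state, or country.
    for part in (p.strip().lower() for p in raw.split(",")):
        name = ALIASES.get(part, part.title())
        if name in CITIES and parsed.city is None:
            parsed.city = name
        elif part in STATES and parsed.state is None:
            parsed.state = STATES[part]
        elif part in COUNTRIES and parsed.country is None:
            parsed.country = COUNTRIES[part]
        # Unrecognized parts (e.g. "North America") are ignored.

    # Step 2: disambiguate the city. Explicit state/country in the string
    # wins; the source hint only breaks ties among remaining candidates.
    if parsed.city:
        candidates = CITIES[parsed.city]
        if parsed.state:
            candidates = [c for c in candidates if c[1] == parsed.state] or candidates
        if parsed.country:
            candidates = [c for c in candidates if c[0] == parsed.country] or candidates
        hinted = [c for c in candidates if c[0] == source_country]
        country, state = (hinted or candidates)[0]
        parsed.country = parsed.country or country
        parsed.state = parsed.state or state
    return parsed
```

With these toy tables, `parse_location("Cambridge", source_country="GB")` resolves to the UK city, while `parse_location("Cambridge, MA", source_country="GB")` still resolves to Massachusetts, because components found in the string take precedence over the hint.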
2. Resolution Hierarchy
We resolve each location to the most specific level the input supports, falling back from city to state to country:
- City ID: The most specific match.
- State/Province ID: If the city is unknown but the state is clear.
- Country ID: The fallback for country-wide roles.
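The fallback order above can be sketched as a chain of lookups. The ID format (`city:us-ca-san-francisco` and so on), the `Resolution` type, and the three tables are invented for this example; they only illustrate the idea of returning the most specific ID available.

```python
from typing import NamedTuple, Optional


class Resolution(NamedTuple):
    level: str   # "city", "state", or "country"
    geo_id: str


# Hypothetical ID tables; a real database would hold 200,000+ entries.
CITY_IDS = {("US", "CA", "San Francisco"): "city:us-ca-san-francisco"}
STATE_IDS = {("US", "CA"): "state:us-ca"}
COUNTRY_IDS = {"US": "country:us"}


def resolve(city: Optional[str], state: Optional[str],
            country: Optional[str]) -> Optional[Resolution]:
    # Try the most specific level first and fall back one level at a time.
    if city is not None:
        geo_id = CITY_IDS.get((country, state, city))
        if geo_id:
            return Resolution("city", geo_id)
    if state is not None:
        geo_id = STATE_IDS.get((country, state))
        if geo_id:
            return Resolution("state", geo_id)
    if country is not None:
        geo_id = COUNTRY_IDS.get(country)
        if geo_id:
            return Resolution("country", geo_id)
    return None  # Nothing recognizable; leave the job unresolved.
```

A city that is missing from the table, such as `resolve("Fresno", "CA", "US")` here, falls back to the state-level ID instead of failing outright.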
3. Handling Remote
We treat "Remote" as a first-class property, not just a location name. We normalize it into a boolean flag and a work-setting category.
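One way to pull the work setting out of the location string before geo parsing is shown below. This is a sketch under assumptions: the three categories (`onsite`, `hybrid`, `remote`), the keyword list, and the rule that hybrid roles are not flagged as remote are choices made for this example, not a description of our production rules.

```python
import re
from dataclasses import dataclass
from enum import Enum


class WorkSetting(Enum):
    ONSITE = "onsite"
    HYBRID = "hybrid"
    REMOTE = "remote"


# Keyword patterns (illustrative, not exhaustive).
_REMOTE = re.compile(r"\b(remote|anywhere|work from home|wfh)\b", re.IGNORECASE)
_HYBRID = re.compile(r"\bhybrid\b", re.IGNORECASE)


@dataclass
class WorkMode:
    is_remote: bool
    work_setting: WorkSetting
    location_text: str  # What remains to hand to the geo parser.


def extract_work_mode(raw: str) -> WorkMode:
    # "Hybrid" takes priority, so "Hybrid remote" is classified as hybrid.
    if _HYBRID.search(raw):
        setting = WorkSetting.HYBRID
    elif _REMOTE.search(raw):
        setting = WorkSetting.REMOTE
    else:
        setting = WorkSetting.ONSITE
    # Remove the keywords, then trim leftover separators and brackets.
    remainder = _HYBRID.sub("", _REMOTE.sub("", raw)).strip(" -,()/|")
    return WorkMode(setting is WorkSetting.REMOTE, setting, remainder)
```

For example, `extract_work_mode("Remote - United States")` yields a remote flag plus the residual text `"United States"`, which can still be resolved to a country ID.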
Why It Matters
By using structured IDs instead of strings, you can:
- Build reliable regional filters.
- Create SEO-friendly landing pages for specific cities.
- Perform labor market analytics with high precision.
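As an illustration of the first point, the sketch below filters jobs by region once each job carries IDs for every level of its location. The `Job` fields and the ID strings are hypothetical and do not reflect the actual CleanJobData response schema.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Job:
    title: str
    city_id: Optional[str]
    state_id: Optional[str]
    country_id: Optional[str]


def jobs_in_region(jobs: List[Job], geo_id: str) -> List[Job]:
    # A job matches if the requested ID appears at any level of its location.
    return [j for j in jobs if geo_id in (j.city_id, j.state_id, j.country_id)]


jobs = [
    Job("Backend Engineer", "city:us-ca-san-francisco", "state:us-ca", "country:us"),
    Job("Data Analyst", "city:us-ca-los-angeles", "state:us-ca", "country:us"),
    Job("Product Manager", "city:gb-london", None, "country:gb"),
]

california_jobs = jobs_in_region(jobs, "state:us-ca")
```

Here `california_jobs` contains the San Francisco and Los Angeles roles but not the London one, which is the behavior a raw-string comparison cannot provide.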
Our API handles all of this for you. Every job object returned by CleanJobData includes a normalized location object with canonical IDs.