Why location normalization is harder than it looks (and how we solved it)
CleanJobData Engineering
If you've ever scraped a job board, you know that the "Location" field is a wild west. You might see "San Francisco", "SF", "San Fran/Remote", or even just "North America".
The Problem with Strings
Strings are great for humans but terrible for filters. If your user wants to see jobs in "California", a case-insensitive substring search for "CA" will miss a listing labelled only "San Francisco" and will also match "Canada".
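To make the failure concrete, here is a minimal sketch of the kind of filter a free-text field invites. The sample locations and the naive_filter helper are invented for illustration, not taken from our pipeline.

```python
# Hypothetical raw location strings, as they might come off a job board.
raw_locations = [
    "San Francisco",
    "SF",
    "San Fran/Remote",
    "Toronto, Canada",
    "Sacramento, CA",
]


def naive_filter(locations, needle):
    """Case-insensitive substring match on the raw location string."""
    needle = needle.lower()
    return [loc for loc in locations if needle in loc.lower()]


matches = naive_filter(raw_locations, "CA")
print(matches)
```

The three San Francisco variants are dropped because none of them contains the letters "ca", while the Toronto listing is kept because "Canada" does.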
Our Solution
We built a geographic resolution engine that:
- Tokenizes the location string.
- Scores potential matches against a database of 200k+ cities.
- Validates matches using country hints from the job source.
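The three steps above can be sketched roughly as follows. This is a toy illustration under stated assumptions, not the production engine: the three-row GAZETTEER stands in for the real city database, all IDs and aliases are made up, scoring is a simple exact-alias count, and validation is modelled as discarding candidates whose country disagrees with the hint. The names City, tokenize, score, and resolve are hypothetical.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class City:
    city_id: int
    state_id: int
    country_id: int
    country_code: str
    names: frozenset  # lowercase names and aliases


# Tiny stand-in for the city database; every ID and alias here is invented.
GAZETTEER = [
    City(1, 10, 100, "US", frozenset({"san francisco", "sf", "san fran"})),
    City(2, 20, 200, "CA", frozenset({"vancouver"})),  # Vancouver, Canada
    City(3, 30, 100, "US", frozenset({"vancouver"})),  # Vancouver, Washington
]


def tokenize(raw):
    """Split on common separators, then lowercase and trim each piece."""
    parts = re.split(r"[/,;|()]+", raw.lower())
    return [p.strip() for p in parts if p.strip()]


def score(tokens, city):
    """Count the tokens that exactly match one of the city's known names."""
    return sum(1 for t in tokens if t in city.names)


def resolve(raw, country_hint=None):
    """Return the single best-scoring city, or None if nothing matches or candidates tie."""
    tokens = tokenize(raw)
    candidates = [(score(tokens, c), c) for c in GAZETTEER]
    candidates = [(s, c) for s, c in candidates if s > 0]
    if country_hint:
        # Validation step: drop candidates that contradict the source's country.
        hint = country_hint.upper()
        candidates = [(s, c) for s, c in candidates if c.country_code == hint]
    if not candidates:
        return None
    top = max(s for s, _ in candidates)
    winners = [c for s, c in candidates if s == top]
    # Refuse to guess when two cities are equally plausible.
    return winners[0] if len(winners) == 1 else None
```

In this sketch, resolve("San Fran/Remote") finds the San Francisco row, resolve("Vancouver") returns None because the two Vancouvers tie, and resolve("Vancouver", country_hint="CA") settles on the Canadian one. A region-only value such as "North America" resolves to nothing here, since the toy gazetteer holds only cities.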
With this engine we can attach a city_id, state_id, and country_id to every listing, so a regional filter becomes an exact match on an ID instead of a guess about a string.
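As a rough illustration of what filtering on IDs looks like, the sketch below selects listings by state_id. The Listing shape, the sample rows, and the ID values (including CALIFORNIA_STATE_ID) are assumptions made for the example; only the three field names come from the text above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Listing:
    title: str
    raw_location: str
    city_id: Optional[int]
    state_id: Optional[int]
    country_id: Optional[int]


CALIFORNIA_STATE_ID = 10  # hypothetical ID, for illustration only

listings = [
    Listing("Backend Engineer", "SF", 1, 10, 100),
    Listing("Data Analyst", "San Fran/Remote", 1, 10, 100),
    Listing("Support Lead", "Toronto, Canada", 5, 50, 200),
]


def in_state(items, state_id):
    """Exact match on the resolved state ID; the raw string is never inspected."""
    return [item for item in items if item.state_id == state_id]


titles = [item.title for item in in_state(listings, CALIFORNIA_STATE_ID)]
print(titles)
```

Both San Francisco spellings land in the result and the Canadian listing does not, because the comparison never touches the original text.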