Why location normalization is harder than it looks (and how we solved it)

CleanJobData Engineering

If you've ever scraped a job board, you know that the "Location" field is a wild west. You might see "San Francisco", "SF", "San Fran/Remote", or even just "North America".

The Problem with Strings

Strings are great for humans but terrible for filters. If your user wants to see jobs in "California", a case-insensitive substring search for "CA" will miss a listing tagged only "San Francisco" and will also match "Canada".
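A minimal sketch of that failure mode, assuming a naive case-insensitive substring filter. The listings and the naive_filter helper are made up for illustration and are not taken from our pipeline:

```python
# Hypothetical raw location strings, as they might arrive from a job board.
listings = [
    "San Francisco",
    "Toronto, Canada",
    "Sacramento, CA",
]


def naive_filter(locations, query):
    """Case-insensitive substring match, the only filter raw strings allow."""
    q = query.lower()
    return [loc for loc in locations if q in loc.lower()]


matches = naive_filter(listings, "CA")
print(matches)
# "San Francisco" is missing (it contains no "ca"),
# while "Toronto, Canada" is a false positive.
```

Only one of the two California listings comes back, alongside a Canadian one.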

Our Solution

We built a geographic resolution engine that:

  1. Tokenizes the location string.
  2. Scores potential matches against a database of 200k+ cities.
  3. Validates matches using country hints from the job source.
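The three steps can be sketched roughly as follows. This is a toy illustration under stated assumptions, not the production engine: the four-row GAZETTEER stands in for the real city database, the IDs are invented, the token-overlap score and the 0.75 threshold are arbitrary choices, and the names City, tokenize, score, and resolve are hypothetical.

```python
import re
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class City:
    city_id: int
    state_id: int
    country_id: int
    name: str
    aliases: Tuple[str, ...]
    country_code: str


# Hypothetical gazetteer rows; all IDs are invented for this sketch.
GAZETTEER = [
    City(1, 10, 100, "san francisco", ("sf", "san fran"), "US"),
    City(2, 20, 300, "london", (), "GB"),
    City(3, 40, 200, "london", (), "CA"),  # London, Ontario
    City(4, 40, 200, "toronto", (), "CA"),
]


def tokenize(text):
    """Step 1: lowercase and split on anything that is not a letter or digit."""
    return re.findall(r"[a-z0-9]+", text.lower())


def score(tokens, city):
    """Step 2: best fraction of a city name's (or alias's) tokens found in the input."""
    best = 0.0
    for label in (city.name,) + city.aliases:
        label_tokens = set(tokenize(label))
        best = max(best, len(label_tokens & tokens) / len(label_tokens))
    return best


def resolve(text, country_hint=None, threshold=0.75) -> Optional[City]:
    tokens = set(tokenize(text))
    scored = [(score(tokens, city), city) for city in GAZETTEER]
    # Step 3: drop candidates that contradict the source's country hint.
    if country_hint is not None:
        scored = [(s, c) for s, c in scored if c.country_code == country_hint]
    scored = [(s, c) for s, c in scored if s >= threshold]
    if not scored:
        return None
    return max(scored, key=lambda pair: pair[0])[1]


print(resolve("San Fran/Remote", "US"))
print(resolve("London", "CA"))
print(resolve("North America", "US"))
```

In this sketch the country hint is what separates the two Londons, and an input such as "North America" resolves to no city at all rather than to a wrong one.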

This lets us attach a city_id, state_id, and country_id to every listing, so a regional filter becomes an exact ID comparison instead of a guess about what a string means.
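As a rough illustration of what filtering on resolved IDs looks like, assuming a hypothetical Listing record and invented ID values (only the field names city_id, state_id, and country_id come from the text above):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Listing:
    title: str
    city_id: int
    state_id: int
    country_id: int


CALIFORNIA = 10  # invented state_id for this sketch

jobs = [
    Listing("Backend Engineer", city_id=1, state_id=10, country_id=100),  # San Francisco
    Listing("Data Analyst", city_id=4, state_id=40, country_id=200),      # Toronto
    Listing("SRE", city_id=5, state_id=10, country_id=100),               # Sacramento
]


def in_state(listings, state_id):
    """Exact match on the resolved state, regardless of the original string."""
    return [job for job in listings if job.state_id == state_id]


california_titles = [job.title for job in in_state(jobs, CALIFORNIA)]
print(california_titles)
```

Both California listings match and the Toronto one does not, whatever the original location strings said.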