In this fourth article, we'll look at how to clean text data with the fuzzywuzzy library from SeatGeek.
Use case
The road data I downloaded from OpenStreetMap had some obvious errors among the street names, mostly misspellings. For example, there was "Aljuneid Avenue 1" when the correct spelling is "Aljunied". This was problematic because (1) misspellings make our ultimate goal of classification difficult, and (2) a misspelled road can't be unified with correctly spelled roads that share its name, like "Aljunied Avenue 2", giving us more work to do. I could have gone through the list manually, but it would have been time-consuming.
My solution was to get a better list from outside OpenStreetMap, and match the less correct road names to it using a library called fuzzywuzzy, for fuzzy string matching. Here's how it works:
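Something like this minimal sketch (the reference list here is just a small toy example built from the names in this article, and the score shown is illustrative):

from fuzzywuzzy import process

correct_names = ["Aljunied Avenue 1", "Aljunied Avenue 2", "Aljunied Road"]

# Returns a (closest match, similarity score) tuple
process.extractOne("Aljuneid Avenue 1", correct_names)
# -> ('Aljunied Avenue 1', <score just under 100>)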
The first element of the return tuple indicates the closest match in the reference list, and the second number is a score showing how close it is. An exact match is 100.
Sometimes, when the correct road name wasn’t in the reference set either, the score would be pretty low – which is as it should be!
I decided to set a cutoff of 90: above it, I would automatically accept the match fuzzywuzzy came up with, and below it, I would manually review the road name to decide what it should be.
Using fuzzywuzzy in Pandas
So what we want is to apply process.extractOne() to the roadname column of our dataframe. This was my first attempt:
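Roughly, it looked like the following sketch, where df is the road dataframe and correct_names is the canonical list (both are placeholder names, not the exact original code):

from fuzzywuzzy import process

# Fuzzy-match every road name against the full canonical list, row by row
df["match"] = df["roadname"].apply(
    lambda name: process.extractOne(name, correct_names)
)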
It took forever! The reason is that extractOne does a pairwise comparison of every name in the dataframe with every correct name in the canonical list. But when the name is already correct, which is the majority of the time, we don't actually need all those pairwise comparisons. So I added a preliminary test to see whether the roadname is in the list of correct names, and that cut down the time considerably.
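In code, the faster version looked something like this (again a sketch with the same placeholder names):

# Skip the expensive fuzzy match when the name is already correct;
# a set makes the membership test cheap
correct_set = set(correct_names)

def best_match(name):
    if name in correct_set:
        return (name, 100)  # already correct, no pairwise comparisons needed
    return process.extractOne(name, correct_names)

df["match"] = df["roadname"].apply(best_match)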
You can add other checks too: for example, I would only accept a >90 match if the matched name had the same number of words as the original, as in the sketch below. Use whatever makes sense for your particular use case.
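That word-count guard could be bolted onto the sketch above, with anything that fails it flagged for manual review:

def corrected_name(name):
    if name in correct_set:
        return name
    match, score = process.extractOne(name, correct_names)
    # Only trust a >90 match if it has the same number of words as the original
    if score > 90 and len(match.split()) == len(name.split()):
        return match
    return None  # None marks a row for manual review

df["corrected"] = df["roadname"].apply(corrected_name)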
Conclusion
After getting the corrected dataframe, I went into OpenStreetMap and edited most of the incorrect road names, so hopefully Singapore street names are mostly correctly spelled now. The fuzzywuzzy library was a big help in cutting down the number of roads I needed to review manually, so I recommend adding it to your data cleaning arsenal.