Big Data’s Dirty Little Secret

In my youth we lived on the corner of a tiny street in suburban Philadelphia called Shepherd’s Lane. At the other end of the street, less than half a city block away, the sign read Shepards Lane. The discrepancy bothered me at the time (though not enough to compel me to find out which was the “correct” spelling).

A store by any other name

Years later I thought of this while analyzing a massive table of client data from several thousand retail stores. By eyeballing the data I realized that we had a pretty serious “data thesaurus” issue: in many instances the same physical site was listed under several slightly different spellings or misspellings. Before we could do any meaningful analysis, we needed to consolidate all the records that called the same physical store by different names in the database.

Spotting that two names refer to the same store is intuitively obvious to a human analyst, but surprisingly challenging for a computer, primarily because of the potential number and randomness of the variations. We could have developed a reference sub-table of each potential alternative spelling, but we had neither the budget nor the time for that. And since this was a non-recurring analysis, it wouldn’t have been worth it. We cleaned the data manually, iteratively, on the fly.
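For readers curious what an automated first pass might look like, here is a minimal sketch in Python using the standard library's difflib. It is not what we did on that project. The function names, the sample store names, and the 0.85 similarity cutoff are all assumptions chosen for illustration, and a real dataset would still need a human to review the groupings.

```python
from difflib import SequenceMatcher


def normalize(name: str) -> str:
    """Lowercase the name and keep only letters, digits, and single spaces."""
    kept = "".join(ch for ch in name.lower() if ch.isalnum() or ch.isspace())
    return " ".join(kept.split())


def group_variants(names, threshold=0.85):
    """Greedily place each name in the first group whose representative is similar enough.

    threshold is an illustrative cutoff (hypothetical), not a recommended value.
    """
    groups = []  # list of (normalized representative, [original names])
    for name in names:
        key = normalize(name)
        for rep, members in groups:
            if SequenceMatcher(None, key, rep).ratio() >= threshold:
                members.append(name)
                break
        else:
            # No existing group was close enough, so start a new one.
            groups.append((key, [name]))
    return [members for _, members in groups]


# Hypothetical sample data, borrowing the street-sign example.
stores = ["Shepherd's Lane", "Shepards Lane", "SHEPHERDS LANE", "Main Street"]
groups = group_variants(stores)
for members in groups:
    print(members)
```

A greedy pass like this depends on the order of the input and on the cutoff: set it too low and distinct stores get merged, too high and variants slip through. That sensitivity is part of why the manual approach can still win for a one-off analysis.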

How data scientists spend their time

A recent article in the New York Times, “For Big-Data Scientists, ‘Janitor Work’ Is Key Hurdle to Insights,” discusses a key processing step called data wrangling or data munging: basically all the work you have to go through after you collect the data, but before you start to actually analyze it for meaning and implications.

Data scientists are said to spend between 50 and 80 percent of their time collecting and cleaning data before it can be explored to yield useful nuggets. Anyone who has worked professionally with data knows this — but, as the article points out, “It is something that is not appreciated by data civilians.” (Love that term!)
