Big Data’s Dirty Little Secret

In my youth we lived on the corner of a tiny street in suburban Philadelphia called Shepherd’s Lane. At the other end of the street, less than half a city block away, the sign read Shepards Lane. The discrepancy bothered me at the time (though not enough to compel me to find out which was the “correct” spelling).

A store by any other name

Years later I thought of this while analyzing a massive table of client data from several thousand retail stores. By eyeballing the data I realized that we had a pretty serious “data thesaurus” issue: in many instances, the same physical site was listed under several slightly different spellings or misspellings. Before we could do any meaningful analysis, we needed to consolidate all the records that referred to the same physical store under different names in the database.

Spotting such duplicates is intuitively obvious to a human analyst, but surprisingly challenging for a computer, primarily because of the sheer number and unpredictability of the variations. We could have developed a reference sub-table listing each potential alternative spelling for every store, but we had neither the budget nor the time for that. And since this was a non-recurring analysis, it wouldn’t have been worth it. So we cleaned the data manually, iteratively, on the fly.
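For readers who would rather automate a first pass, here is a minimal sketch of approximate name matching in Python. To be clear, this is not what we did (we cleaned by hand). The store names, the function names, and the 0.85 similarity threshold are all hypothetical, chosen only for illustration. The sketch uses the standard library's difflib.SequenceMatcher to score how alike two normalized names are, then groups the names that score above the threshold.

```python
import re
from difflib import SequenceMatcher


def normalize(name: str) -> str:
    """Lowercase, drop apostrophes, turn other punctuation into spaces."""
    name = name.lower().replace("’", "").replace("'", "")
    name = re.sub(r"[^a-z0-9 ]+", " ", name)
    return " ".join(name.split())


def similarity(a: str, b: str) -> float:
    """Similarity score between 0.0 and 1.0 for two normalized names."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def group_variants(names, threshold=0.85):
    """Greedy grouping: a name joins the first group whose first
    member is similar enough; otherwise it starts a new group."""
    groups = []
    for name in names:
        for group in groups:
            if similarity(name, group[0]) >= threshold:
                group.append(name)
                break
        else:
            groups.append([name])
    return groups


# Hypothetical store names, invented for this example.
names = [
    "Shepherd's Lane Market",
    "Shepards Lane Market",
    "Shepherds Lane Mkt",
    "Elm Street Grocery",
]

groups = group_variants(names)

# Map every variant to the first name seen in its group.
canonical = {variant: group[0] for group in groups for variant in group}

for variant, store in canonical.items():
    print(f"{variant!r} -> {store!r}")
```

The threshold is a judgment call: set it too low and genuinely different stores get merged, too high and obvious variants stay apart. The grouping is also greedy, comparing each name only against the first member of each group, so a human analyst should still review the result before trusting it.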

How data scientists spend their time

A recent article in the New York Times, “For Big-Data Scientists, ‘Janitor Work’ Is Key Hurdle to Insights,” discusses a key processing step called data wrangling or data munging: basically all the work you have to do after you collect the data, but before you start to actually analyze it for meaning and implications.

Data scientists are said to spend between 50 and 80 percent of their time collecting and cleaning data before it can be explored to yield useful nuggets. Anyone who has worked professionally with data knows this — but, as the article points out, “It is something that is not appreciated by data civilians.” (Love that term!)

  • About this site

    COMPETING IN THE KNOWLEDGE ECONOMY is written by Timothy Powell, an independent researcher and consultant in knowledge strategy. Tim is president of The Knowledge Agency® (TKA) and serves on the faculty of Columbia University's Information and Knowledge Strategy (IKNS) graduate program.


    "During my more than three decades in business, I have served more than 100 organizations, ranging from Fortune 500s to government agencies to start-ups. I document my observations here with the intention that they may help you achieve your goals, both professional and personal.

    "These are my opinions, offered for your information only. They are not intended to substitute for professional advice."



    COMPETING IN THE KNOWLEDGE ECONOMY is sponsored by the Knowledge Value Chain® (KVC), a methodology that increases the value and ROI of Data, Information, Knowledge, and Intelligence.

    The contents herein are original, except where otherwise noted. All original contents are Copyright © TW Powell Co. All rights reserved.

    All KVC trademarks, trade names, designs, processes, manuals, and related materials are owned and deployed worldwide exclusively by The Knowledge Agency®. Reg. U.S. Pat. & TM Off.


    E SCIENTIA COPIA. Knowledge is the Engine of Value.