Big Data’s Dirty Little Secret

In my youth we lived on the corner of a tiny street in suburban Philadelphia called Shepherd’s Lane. At the other end of the street, less than half a city block away, the sign read Shepards Lane. The discrepancy bothered me at the time (though not enough to compel me to find out which was the “correct” spelling).

A store by any other name

Years later I thought of this while analyzing a massive table of client data from several thousand retail stores. Eyeballing the data revealed a pretty serious “data thesaurus” issue: in many instances, a single physical site was listed under several slightly different spellings and misspellings. Before we could do any meaningful analysis, we needed to consolidate all the records that called the same physical store by different names in the database.

This is intuitively obvious to a human analyst, but surprisingly challenging for a computer, primarily because of the potential number and randomness of the variations. We could have developed a reference sub-table mapping each potential alternative spelling to a canonical name — but we had neither the budget nor the time for that. And since this was a non-recurring analysis, it wouldn’t have been worth it. We cleaned the data manually, iteratively, on the fly.
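For readers who want a feel for what that iterative cleanup looks like in code, here is a minimal sketch of one common approach: fuzzy string matching, which groups variant spellings under the first-seen “canonical” name. The store names and the 0.8 similarity threshold are hypothetical, and real projects tune the threshold (and review borderline matches by hand) rather than trusting a single pass.

```python
from difflib import SequenceMatcher

# Hypothetical variant spellings: three versions of one store, two of another.
raw_names = [
    "Shepherd's Lane Market",
    "Shepards Lane Market",
    "Shephards Ln Market",
    "Oak Street Grocery",
    "Oak St. Grocery",
]

def similarity(a: str, b: str) -> float:
    """Similarity ratio in [0, 1]; 1.0 means identical after lowercasing."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def cluster(names, threshold=0.8):
    """Greedy single pass: each name joins the first canonical name it
    resembles closely enough, otherwise it starts a new cluster."""
    canonical = []   # first-seen spelling serves as the cluster label
    mapping = {}
    for name in names:
        for canon in canonical:
            if similarity(name, canon) >= threshold:
                mapping[name] = canon
                break
        else:
            canonical.append(name)
            mapping[name] = name
    return mapping

mapping = cluster(raw_names)
for raw, canon in mapping.items():
    print(f"{raw!r} -> {canon!r}")
```

The greedy pass is order-dependent and will make mistakes on genuinely similar but distinct names — which is exactly why this step stays partly manual.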

How data scientists spend their time

A recent article in the New York Times, “For Big-Data Scientists, ‘Janitor Work’ Is Key Hurdle to Insights,” discusses a key processing step called data wrangling or data munging — basically all the work you have to go through after you collect the data, but before you start to actually analyze it for meaning and implications.

Data scientists are said to spend between 50 and 80 percent of their time collecting and cleaning data before it can be explored to yield useful nuggets. Anyone who has worked professionally with data knows this — but, as the article points out, “It is something that is not appreciated by data civilians.” (Love that term!)

Misunderstood and neglected

And why would they?  Wrangling (which I call Processing) is the un-fun, un-photogenic part of big data.  Nothing at all like the TV ads that depict massive data streams fusing seamlessly into a miraculous nexus of intuition, inspiration, and invention.  (IBM’s “ingredients are just data” ads, I’m looking at you.)

On page 58 of The Knowledge Value Chain® Handbook I describe Processing as “the most misunderstood — and most frequently neglected — step in the KVC.”

If the KVC model is accurate, you can’t do analysis without doing the processing first…any more than you’d want to put crude oil into a car’s gas tank before it’s been refined into gasoline. It won’t work to get you where you want to go (i.e., to a meaningful analysis and results).

Jeffrey Heer, professor of computer science at the University of Washington, concurs when he says bluntly, “It’s an absolute myth that you can send an algorithm over raw data and have insights pop up.”

Stupid, wild, and messy

Data in itself is unrelentingly literal and stupid.  It comes in different formats, it has holes in different places, it’s kind of wild — and won’t willingly submit to analysis before it’s been tamed.

Data is messy.  The “bigger” the data — as measured by its velocity, variability, and volume — the bigger the mess.

The processing step is where the data is tamed.  It’s largely a manual effort, though the Times article describes some technology initiatives that attempt to clear this hurdle.

Technology to the rescue?

ClearStory Data (“Now You See It”) provides storyboards that integrate data from internal and external sources, and are updated in real time.  You can, for example, link data from Salesforce with demographic and business census data for a geographic region.

Paxata (“Adaptive Data Preparation for Everyone”) claims to be “the first self-service data preparation application for all analysts.”  They’ve created a network of customer evangelists (“PaxPros”) willing to share their experiences and tips.

Trifacta (“transforms raw data into actionable data”) uses Predictive Interaction™, which is essentially a visualization of alternative data transformations presented to an analyst.  The analyst interactively teaches the platform which of the suggested transformations are optimal for the analytic situation.
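To make the general idea concrete — and this is a generic illustration, not Trifacta’s actual algorithm or API — a system can propose several candidate transformations for a messy column, score each by how many values it brings into the desired shape, and present them ranked for the analyst to accept or reject. The sample values, the candidate transforms, and the phone-number target are all invented for the sketch.

```python
import re

# Hypothetical messy column: phone numbers in three different formats.
raw = ["  (215) 555-0101", "215-555-0102  ", "215.555.0103"]

# Candidate transformations the system might suggest.
candidates = {
    "strip whitespace": str.strip,
    "lowercase": str.lower,
    "digits only": lambda s: re.sub(r"\D", "", s),
}

# The analyst's (implied) goal: a bare 10-digit phone number.
target = re.compile(r"^\d{10}$")

def score(transform):
    """Fraction of raw values the transform brings into the target shape."""
    return sum(bool(target.match(transform(v))) for v in raw) / len(raw)

# Rank suggestions best-first; an interactive tool would show these to the
# analyst, whose choices then inform future suggestions.
ranked = sorted(candidates, key=lambda name: score(candidates[name]), reverse=True)
print(ranked)
```

The real products add visualization and learning from the analyst’s feedback; the ranking loop above is just the skeleton of the interaction.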

It depends

Though each of these fits generally into the Processing category, what each solution actually accomplishes is quite different. That’s OK, since the type of processing required depends heavily on both the nature of the data that precedes it and the analysis that follows it. Each vendor’s site has case studies that give a hands-on view of what the tools can do, and each offers a free demo.

Sorry to take you into the weeds this time, but that’s the nature of the Processing step — the devil really is in the details. And if you fail to get this step right, the resulting analysis, recommendations, and actions taken are likely to be flawed as well.
