Bad Night for Big Data

I have a nightmarish pet scenario: as we as a society gain non-stop access to ever-increasing data, we risk actually getting progressively dumber, because we lose the ability to process and analyze that data sufficiently.

My idea got a workout this week on election night, when the polling industry, most of which had predicted a Clinton victory by a single- or even double-digit percentage-point margin, got it monumentally wrong.

When we hear on TV every ten minutes about how Watson is curing cancer, among other breathless hype about Big Data, an error of this stunning magnitude seems at first paradoxical.

But the more you think about it, the more it makes a perverse kind of sense.

“Dewey Defeats Truman”

Embarrassing election errors are nothing new: witness the iconic photo of President-elect Truman gleefully displaying the newspaper headline “Dewey Defeats Truman” the day after the 1948 election.

People claimed then that the error was due to a combination of slow reporting and the print-era need to prepare headlines hours in advance of publication.

What IS new is that polls are now easier and cheaper to field, and as a natural consequence there is a proliferation of them. And, as they are invariably deemed newsworthy, they feed the hungry news-cycle monster. They generate eyeballs and click-bait — and they’re fun, especially when your own pick is ahead.

Especially toward the close of this 18-month campaign, it seemed like a new poll was appearing every other day. We became so collectively absorbed in the twitching poll dashboards that we neglected the fleeting opportunity to discuss in any depth the serious challenges facing our country and our society.

Let’s figure this out

I’m confident that over the coming weeks we will see a vast, rolling post-mortem on how things went so wrong — discuss amongst yourselves — and please tell us what you came up with. It’s way too important not to figure this out.

Some of the early hypotheses include:

  • SAMPLING BIAS. The sample selection was skewed by cord-cutting, the tectonic shift in the US from phone landlines to wireless.
  • THE NEWS TEAM BUBBLE. Most of the major media are based in cities on the east and west coasts (New York, Washington, Atlanta, LA). People talking to like-minded people creates an echo-chamber effect, where differing perspectives tend to remain largely unheard, much less tolerated.
  • THE CHAOS EFFECT. Voter behavior is complex, and can be influenced by small, apparently non-related events — like public leaks of hacked email threads. One commentator compared it to weather forecasting in this regard, evoking chaos theory.

Not so fast, would be my retort on this last one. The weather is inanimate, and will happen regardless of what we forecast about it. In elections, by contrast, forecasts are used to allocate resources in real time. A forecast that Wisconsin would vote Democratic caused the Clinton campaign not to show up there even once after the convention — leading them to lose by a small margin. This gives new meaning to the term false positive.

As I write this, I note that intelligence expert John McGonagle, who is also my colleague and friend, has blogged about this.

Data is not deterministic

The New York Times’ Upshot column — one of my must-reads these days — was a bellwether of this. Nate Cohn’s September 20 headline says it all: “We Gave Four Good Pollsters the Same Raw Data. They Had Four Different Results.” These results, which included the Upshot’s own analysis, projected two larger and two smaller Clinton wins and one small Trump win — all from the identical data set of Florida voters.

Trump won Florida by about one percentage point, and only one expert team (Stanford/Columbia/Microsoft) called that right. Cohn attributes this primarily to two factors: (1) the team’s use of voting history, rather than stated intention to vote, to indicate likelihood of voting, a key factor when you realize only a little more than half of registered voters actually voted, and (2) their use of statistical modeling in weighting voter characteristics from the interviewed sample.

The key point here is — data by itself is not deterministic. Data does not “decide” anything by itself — the processing and analysis that follow are essential elements of the “value” equation — in this case, measured by whether or not you got it right.
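
To make that concrete, here is a toy sketch of how one set of interviews can yield opposite calls depending on who counts as a "likely voter." Everything in it is hypothetical: the ten respondents are invented, and the names Respondent, RESPONDENTS and clinton_margin are mine. It is not the Upshot's data or any pollster's actual method, just the general idea of screening on stated intention versus voting history.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Respondent:
    choice: str            # "Clinton" or "Trump"
    says_will_vote: bool   # stated intention to vote
    voted_last_time: bool  # voting history


# Invented respondents, for illustration only (not real polling data).
RESPONDENTS: List[Respondent] = (
    [Respondent("Clinton", True, True)] * 2
    + [Respondent("Clinton", True, False)] * 3
    + [Respondent("Trump", True, True)] * 3
    + [Respondent("Trump", False, True)] * 1
    + [Respondent("Clinton", False, False)] * 1
)


def clinton_margin(
    respondents: List[Respondent],
    is_likely_voter: Callable[[Respondent], bool],
) -> float:
    """Clinton-minus-Trump margin, in points, among screened respondents."""
    likely = [r for r in respondents if is_likely_voter(r)]
    if not likely:
        raise ValueError("no respondents pass the likely-voter screen")
    clinton = sum(r.choice == "Clinton" for r in likely)
    trump = sum(r.choice == "Trump" for r in likely)
    return 100.0 * (clinton - trump) / len(likely)


# Same raw data, two different likely-voter screens.
by_intention = clinton_margin(RESPONDENTS, lambda r: r.says_will_vote)
by_history = clinton_margin(RESPONDENTS, lambda r: r.voted_last_time)

print(f"Screen on stated intention: Clinton {by_intention:+.1f}")
print(f"Screen on voting history:   Clinton {by_history:+.1f}")
```

With these made-up respondents, the stated-intention screen shows a Clinton lead and the voting-history screen shows a Trump lead. Nothing about the raw answers changed; only the analyst's judgment about who will turn out did.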

Data blindness

In my recurring nightmare, we get dumber as — and even because — we get more data. In KVC terms, we cycle endlessly around the bottom levels of the chain without gaining enough momentum to leap up a level or two and see what it all means. I have termed this “data blindness,” a variation of which I think describes the Election Projection Debacle of 2016.

Of course, there is much hand-wringing, even some falling on swords (albeit soft ones). Forecaster Dr. Sam Wang of Princeton went on national TV to fulfill his promise that he would eat a bug if his forecasts were wrong. Other forecasters are busy back-spinning their stories to explain how they actually warned people (somewhere in the fine print) that they could be wrong.

We should neither stop forecasting nor exercise the option, however tempting, of dismissing the entire forecasting industry out-of-hand. We should commit to doing much better — at polling, at critically analyzing the results, and at communicating what those polls signify — and what they do not.
