Big Data, Correlation Or Causation?

Gordon Crovitz wrote about Big Data in the Wall Street Journal (25 March 2013) this week.

He cites an interesting notion from the book “Big Data: A Revolution That Will Transform How We Live, Work, and Think”: that in processing the massive amounts of data we are capturing today, society will “shed some of its obsession for causality in exchange for simple correlation.”

The idea is that in the effort to speed up decision-making, we will, to some extent or even to a great extent, not have the time and resources for the scientific method to determine why something is happening. Instead, we will settle for knowing what is happening, based on the massive data pouring in.

Seeing the trends in the data is a big step ahead of being overwhelmed, possibly drowning in data, and not knowing what to make of it. Still, it is important that we validate what we think we are seeing by scientifically testing it and determining whether there is a real reason for what is going on.

Correlating loads of data can make for interesting conclusions, like when Google Flu Trends predicts outbreaks (before the CDC) by combing through millions of searches for things like cough medicine. But correlations can be spurious: for example, when a new cough medicine comes out and people are just looking up information about it, there is no real outbreak of the flu. (Maybe not the best example, but you get the point.)
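To make the cough-medicine scenario concrete, here is a minimal sketch in Python. All of the numbers are made up for illustration (the weekly flu cases, the search counts, the size of the product-launch bump, and the alarm threshold are assumptions, not real Google or CDC data). It shows how searches that track flu cases closely can stop doing so once an unrelated event inflates them, and how a naive search-volume alarm would then flag weeks with no real outbreak.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical weekly flu cases (illustrative values only).
flu_cases = [10, 12, 15, 30, 45, 40, 20, 12]

# Assumption: searches driven by illness scale with the number of cases.
organic_searches = [50 * c for c in flu_cases]

# Assumption: a new cough medicine launches in weeks 0 and 1 and
# generates extra searches that have nothing to do with the flu.
launch_bump = [2000, 2000, 0, 0, 0, 0, 0, 0]
observed_searches = [o + b for o, b in zip(organic_searches, launch_bump)]

r_organic = pearson(flu_cases, organic_searches)
r_observed = pearson(flu_cases, observed_searches)

# A naive alarm that looks only at search volume (hypothetical threshold).
ALARM_THRESHOLD = 2000
alarm_weeks = [w for w, s in enumerate(observed_searches) if s >= ALARM_THRESHOLD]
outbreak_weeks = [w for w, c in enumerate(flu_cases) if c >= 40]
false_alarms = [w for w in alarm_weeks if w not in outbreak_weeks]

print(f"correlation without launch: {r_organic:.2f}")
print(f"correlation with launch:    {r_observed:.2f}")
print(f"alarm weeks: {alarm_weeks}, real outbreak weeks: {outbreak_weeks}")
print(f"false alarms: {false_alarms}")
```

In this toy data the alarm fires in the launch weeks as well as the real peak weeks. Nothing in the search counts alone tells the two apart; that takes knowing why people were searching.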

Also, just knowing that something is happening, like an epidemic, global warming, flight delays, or whatever, is helpful for situational awareness. But without knowing why it’s happening (i.e. the root cause), how can we really address the issue and fix it?

It is good to know when data is pointing us to a new reality, because then at least we can take some action to keep ourselves from getting sick or waiting endlessly in the airport. But if we want to cure the disease or fix the airlines, then we have to go deeper, find out the cause, and attack it to make it right.

Correlation is good for a quick reaction, but causation is necessary for long-term prevention and improvement.

Computing resources can be used not just to sift through petabytes of data points (e.g. to come up with neighborhood crime statistics), but also to help test various causal factors (e.g. socio-economic conditions, community investment, law enforcement efforts, etc.) by processing the results of true scientific testing, with proper controls, analysis, and well-supported conclusions.
