The Problem with Modeling Big Data

So "Big Data" has been a buzzword for a while now.  Lots of companies are now saying that they have a "Big Data" problem and they're looking for solutions to make sense of their data.

Let's back up a second and understand what "Big Data" means.  Having a lot of data isn't the same thing as having "Big Data".  Big Data refers to data that contains useful or actionable information.  Big Data allows you (or, more appropriately, an algorithm) to make predictions with high accuracy.  Big Data contains the answers to your questions.  (Just having a lot of data doesn't mean that any of those things are true.)

We're at a turning point regarding data modeling.

Historically, data modeling has been required for two reasons:

  1. The amount of data was very small compared to what is available today.
  2. Calculations were done manually.  Pen and paper.  Chalk and blackboard. 

For example, F=m*a is a model of how a mass moves when a force is applied to it.  Newton came up with this model based on Galileo's experiments, i.e. on data.  As a model, F=m*a worked phenomenally well for over 300 years, until subtle irregularities were noticed, prompting the theories of both Relativity and Quantum Mechanics.  Even now, F=m*a handles the majority of real-world problems without the need to invoke Relativity or Quantum Mechanics.
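
To make the model-based approach concrete, here is a minimal sketch in Python of how F=m*a turns a handful of inputs into a prediction.  The function and the numbers are invented purely for illustration:

    # Predict distance travelled under a constant force, using the model F = m*a.
    # All numbers below are made up for illustration.
    def predict_distance(force_n, mass_kg, initial_speed_mps, time_s):
        acceleration = force_n / mass_kg          # a = F / m
        return initial_speed_mps * time_s + 0.5 * acceleration * time_s ** 2

    # A 2 kg ball pushed by a constant 4 N force for 3 seconds, starting from rest:
    print(predict_distance(4.0, 2.0, 0.0, 3.0))   # 9.0 metres

The model compresses countless experiments into one small formula; the no-model approach discussed below skips that compression and uses the recorded experiments directly.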

As the number of data points grew beyond people's capability to work out the mechanics of every particle, statistical modeling was used to predict what happens to, say, a gas at different temperatures and pressures.  To be fair, these experiments couldn't measure the attributes of every particle anyway.  Instead, bulk measurements were used to describe and predict.  This brought the number of data points back down to human-manageable levels, at the cost of giving up predictions about the motion of each individual particle.

More modern experiments gather much more data.  For example, the Data Centre for the Large Hadron Collider at CERN "processes about one petabyte of data every day".  Surely this must be "Big Data".  No.  This is just lots of data.  Scientists use this data to validate existing models of how the universe works, rather than to predict what the next collision will look like.  If the data proves the existing models wrong, then scientists must come up with new models of how the universe works.  This is the progressive nature of science, with a bent towards understanding why things work the way they work.  (Of course, physicists use data to create models, too, especially for complex phenomena!)

On the other hand, Big Data, especially as used in business, can make accurate predictions about, for example, what a customer will buy, or the likelihood of a project succeeding. 

Unfortunately, a common rookie mistake among data scientists is to model this data.  This defeats the point of having Big Data in the first place!

Let's look at the problem again through Galileo's experiments.  Imagine if Galileo had been able to run millions and millions of experiments rolling balls down inclines and dropping objects off the Leaning Tower of Pisa.  If this were possible, then he wouldn't have needed Newton's laws of motion (i.e. F=m*a) to predict where a ball of a specified mass would be at a specified time if he rolled it down an incline of a specified grade at a specified initial speed.  He would simply search his very large experimenter's notebook for the closest matching initial conditions.  Then, he would run his index finger across that experiment's results looking for the distance travelled at that specific time (or the closest match).  If his data set were exceptionally large, then he would probably have an exact match for all those conditions!

In the above example, no one modeled the data.  Instead, the raw data was used to make a prediction - and very likely an accurate one!
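
To make the "notebook lookup" concrete, here is a minimal sketch of the idea in Python.  The records and field names are invented for illustration; the point is that the prediction comes straight from the closest recorded experiment, with no fitted model in between:

    # Prediction by lookup: find the recorded experiment whose initial conditions
    # are closest to the query, and return its observed outcome.
    # The records and field names below are invented for illustration.
    experiments = [
        {"mass": 1.0, "grade": 10.0, "speed": 0.0, "time": 2.0, "distance": 3.4},
        {"mass": 1.0, "grade": 15.0, "speed": 0.0, "time": 2.0, "distance": 5.1},
        {"mass": 2.0, "grade": 10.0, "speed": 1.0, "time": 3.0, "distance": 9.8},
        # ... millions more rows in the very large experimenter's notebook
    ]

    def lookup_distance(mass, grade, speed, time):
        """Return the observed distance from the closest matching past experiment."""
        def mismatch(row):
            return (abs(row["mass"] - mass) + abs(row["grade"] - grade)
                    + abs(row["speed"] - speed) + abs(row["time"] - time))
        return min(experiments, key=mismatch)["distance"]

    print(lookup_distance(mass=1.0, grade=14.0, speed=0.0, time=2.0))  # 5.1

With millions of rows, the closest match is usually so close that the answer is effectively exact - no formula required.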

Compare that to a human-made model.  By definition, a model doesn't represent the real-world scenario completely.

Human-made models are incomplete, error-prone, and perhaps wrong. 

I'm being generous; I suspect that most models are wrong.  If they briefly appear correct, then it is good luck - not good science.  Think about it.  Nobel Prizes are awarded to people who come up with accurate models of how things work.  Do you think you can find that rare Nobel Prize-caliber data scientist to work for your company?

Someone may object to this no-model method by saying, "Well, that doesn't tell us why the ball moves!"  This is a valid point.  Relying on Big Data alone will not give you a deeper understanding of why things happen.  My counter-argument is threefold:

First, the traditional modeling methods don't give you a deeper understanding, either.  Take statistics, for example.  It starts off with a data model and ends up with a possible spread of results.  It does not reveal why things happen.  F=m*a doesn't reveal why the ball moves, either.  (It would take Einstein's General Relativity to explain why the ball rolls downhill, i.e. the curvature of spacetime.)  To get an automated deeper understanding of the data, machine intelligence is needed.

Second, let's understand why we want to know the "why".  It is so that we can make better decisions about which actions to take to maximize our desired results.  Well, if using Big Data without modeling can make better predictions, and machine intelligence such as the General Artificial Intelligence using Software (GAIuS) framework can leverage those predictions to automatically make better decisions, then there is little reason for us to chase the "why".  Cut out the middleman.  Just use it and get your desired results.

Third, if the "why" is truly required, then the no-model method of GAIuS can provide that with help from a data scientist.  To make predictions, GAIuS looks at all the similar past scenarios.  It can return what is common among all those scenarios to the data scientist to review.  It is then up to the data scientist to make that leap of understanding to answer the "why".  This leap is made shorter by providing only the relevant information, rather than a statistical dump.

If your dataset has a lot of useful or actionable information, don't cause unnecessary problems by modeling it.  Keep it clean.  Use the data directly. Frameworks such as GAIuS by COGNITUUM can help you do this. 

On the other hand, if your dataset is small and there are no plans to grow it, then modeling may be a good option.  Keep in mind, though, that if your dataset is too small, it doesn't matter how much you model it - your results will not be accurate.

You may get lucky with a few predictions, but this won't last.  Instead, use GAIuS' machine learning from the beginning.  It will improve automatically as more data is provided.