Big data hit the mainstream over the past year or so. I know this because the BBC has produced several programmes covering it. What I’ve heard is that there is no clear definition of what big data is or why it is important. When I ask people if they have big data, they overwhelmingly say “yes,” whether they have a thousand or many millions of rows of data or observations. So who is right? It depends.
Nowadays, statistical software, including software designed to maximise the power of the desktop like JMP Pro, can easily handle data sets with millions of rows. What is more important is the number of columns. Very tall and very wide data sets are truly big. These may require standard statistical methods such as sampling to build useful models, bringing model building within the power of a desktop computer.
So if big data is easily manageable, what are the real challenges faced by today's analysts, engineers and scientists? We surveyed delegates at the two model-building seminars held recently in Marlow and Edinburgh and uncovered an interesting finding: All of the delegates had messy data.
Make the most of messy data
You have messy data if you have missing data, empty cells, outliers or wrong entries. Traditional statistical methods, such as logistic and linear regression, throw out rows where cells are missing, resulting in a poorer model. Outliers also throw the model off, making it less useful.
John Sall discussed a new way of dealing with messy data called "Informative Missing" in his blog post. This takes the use of missing data beyond imputation to a new realm: Missing data might actually be informing you of something that is important and so must be included in your model. An example would be a loan applicant leaving part of their application blank in order to hide a poor credit history; this would be a critical finding for a credit analyst to model. If you are working in a manufacturing setting, data might be missing because the result was literally off-the-scale, which could be useful information to capture in the model. If you are modelling the activity of substances based on their chemical properties, you might have missing data for, say, decomposition temperature if the material was not seen to decompose over the measured temperature range; so if you include this information in the model, you would get better predictions of activity.
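To make the idea concrete, here is a minimal sketch of the "missing as information" pattern using scikit-learn rather than JMP (the data and column are hypothetical): impute the missing values and add an indicator column, so the model can learn from the missingness itself.

```python
# Sketch of treating missingness as information (not JMP's
# implementation): impute missing values and append a 0/1 indicator
# column so a downstream model can use "was missing" as a predictor.
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical data: decomposition temperature, missing where the
# material was never seen to decompose over the measured range.
X = np.array([[210.0], [np.nan], [185.0], [np.nan]])

imputer = SimpleImputer(strategy="mean", add_indicator=True)
X_t = imputer.fit_transform(X)
# X_t now has two columns: the mean-imputed temperature and a 0/1
# flag marking which rows were originally missing.
print(X_t.shape)  # (4, 2)
```

A model fitted on both columns can then pick up any signal carried by the missingness, such as "did not decompose", instead of silently dropping those rows.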
There is a new class of modelling techniques called shrinkage methods that are designed to provide you with a model that predicts well and has the smallest number of variables, even when you have strong correlations between input variables. The Generalised Regression personality allows you to use these methods from within the Fit Model platform. Used along with Informative Missing, it has the added benefit of using all rows of data when building the model -- even with messy data.
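As a rough illustration of what a shrinkage method does, the sketch below uses scikit-learn's elastic net, which is similar in spirit to the penalised fits in the Generalised Regression personality, though not the same implementation. The data here is simulated: two near-duplicate inputs and one irrelevant one.

```python
# Hedged sketch of shrinkage (penalised) regression: the elastic net
# penalty shrinks coefficients and can zero some out, even when input
# variables are strongly correlated.
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.01, size=200)   # near-duplicate of x1
x3 = rng.normal(size=200)                    # irrelevant input
X = np.column_stack([x1, x2, x3])
y = 3.0 * x1 + rng.normal(scale=0.1, size=200)

model = ElasticNet(alpha=0.1, l1_ratio=0.9).fit(X, y)
print(model.coef_)  # irrelevant x3 is shrunk towards zero
```

The point is the small model: the penalty splits the signal between the two correlated inputs rather than blowing their coefficients up, and pushes the irrelevant input towards zero.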
Decision tree-based methods are good for dealing with outliers, because the point at which the split occurs is not biased by them. Thanks to this robustness and Informative Missing, JMP Pro users are telling us that they are building useful models without having to clean their data first: With robust modelling techniques, you might be able to skip data cleansing and still produce a good model. Now that is truly revolutionary. Decision trees also have the added advantage of being visual, allowing you to explain your findings to execs.
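A small simulated comparison (again using scikit-learn, not JMP) shows the robustness in action: one wild response value drags a linear fit far from the data, while a shallow tree isolates the outlier in its own leaf and leaves predictions elsewhere largely intact.

```python
# Sketch of why tree splits resist outliers: a split depends on how
# the data partitions, so one gross y-value ends up in its own leaf,
# while the same value drags an ordinary linear fit badly off course.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

X = np.arange(20, dtype=float).reshape(-1, 1)
y = X.ravel().copy()   # true relationship: y = x
y[-1] = 1000.0         # a single gross outlier

lin = LinearRegression().fit(X, y)
tree = DecisionTreeRegressor(max_depth=2).fit(X, y)

# Prediction at x=0 (true value 0): the outlier pulls the line far
# below zero, while the tree's low-range leaf stays near the data.
print(lin.predict([[0.0]]), tree.predict([[0.0]]))
```

This is only a toy example, but it mirrors what users report: the tree still produces sensible predictions for the bulk of the data without any manual outlier removal.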
What do I do if I have messy data?
JMP Pro is the software designed to deal with your messy data.
We will be running an exclusive, hands-on workshop in the UK for new users of this software on 12 June, so if you would like to join us in Marlow, let me know.
If you would like your managers to see how JMP Pro deals with these problems, you can ask them to join the webcast on 3 April when we will be showing two case studies.