Building predictive models from data tables? Watch out for these 10 villains. Image courtesy of prettysleepy on Pixabay.
I've been immersing myself in the world of data science competitions, in which predictive performance on a blinded holdout set ranks you and your teammates on a leaderboard, with prizes and street cred on the line. To succeed in these contests, you have to be very efficient and smart with your time, since there are hundreds of small decisions and different directions you can take along the way. If you mess up or waste time, somebody passes you. Sometimes, it even feels like there are dangers purposely trying to trip you up. Kind of sounds like real life, doesn't it?
For every good practice and bit of deftness with data I have learned, I have probably made ten times as many mistakes. Over time, I've gathered a list of 10 dangers to constantly watch out for and battle with. As I describe in a previous post, data never comes to us clean, and these troubles are always lurking. While the list focuses on tabular data, many of the problems carry over to images, text, and other kinds of structured data.
These dangers are so real and pervasive that they can take on a wicked character of their own, and in fact, for fun I've associated each with a supervillain. In the table below, see if you can picture these evildoers deviously inflicting their dangers on your data tables.
Jabba the Hutt from Star Wars, happy to squish and slime you under all his disgusting blubber
a. What kind of response is it?
b. For binary or nominal, are the proportions representative of the population you want to predict?
c. For continuous, does its distribution make sense?
1. Inspect response distribution
2. Consider weighting, oversampling, or undersampling; the best choice can depend on your desired performance metrics
Distribution, Cols > New Columns to create a weight variable
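If you'd rather script this remedy outside JMP, here is a minimal Python sketch of the weighting and oversampling step, using only the standard library. The function names and the toy label list are illustrative assumptions, not part of any particular package:

```python
# Sketch: derive class-balancing weights and oversample the minority class.
# Names (class_weights, oversample_minority) and the toy labels are illustrative.
import random
from collections import Counter

def class_weights(labels):
    """Weight per class so that each class contributes equal total weight."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * cnt) for cls, cnt in counts.items()}

def oversample_minority(rows, labels, seed=0):
    """Duplicate minority-class rows (with replacement) up to the majority count."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    by_class = {}
    for row, lab in zip(rows, labels):
        by_class.setdefault(lab, []).append(row)
    balanced = []
    for lab, members in by_class.items():
        balanced.extend(members)
        balanced.extend(rng.choice(members) for _ in range(target - len(members)))
    return balanced

# A 90/10 imbalanced binary response:
labels = ["no"] * 90 + ["yes"] * 10
w = class_weights(labels)
# Each class now carries equal total weight: 90 * w["no"] == 10 * w["yes"]
```

Whether weights or resampling serve you better depends on the model and the metric; weighted log-loss and oversampled accuracy can behave quite differently.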
9. Training-Test Inconsistencies
Eye of Sauron from Lord of the Rings, ever watching you to find your innermost contradictions and exploit them
a. Are distributions of the response and predictors similar across training, validation, and test splits?
b. Will future test sets stay within scope, or is there a real need for extrapolation?
1. Create a new variable indicating splits; then use it to dynamically explore distributions
2. Use low-dimensional projections to compare structures of subsets
Fit Y by X, with Y = all variables and X = the split indicator, to create mosaic plots for nominal variables and empirical cumulative distribution comparisons for continuous variables; Multivariate
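The empirical-CDF comparison above can also be scripted. Below is a hedged Python sketch of the two-sample Kolmogorov-Smirnov statistic (the maximum gap between two empirical CDFs) applied to a variable across splits; the sample data and thresholds are made up for illustration:

```python
# Sketch: compare a continuous variable's distribution across train/test
# splits via the two-sample KS statistic. Data and thresholds are illustrative.
def ks_statistic(a, b):
    """Max absolute difference between the empirical CDFs of two samples."""
    a, b = sorted(a), sorted(b)
    i = j = 0
    gap = 0.0
    for v in sorted(set(a) | set(b)):
        while i < len(a) and a[i] <= v:
            i += 1
        while j < len(b) and b[j] <= v:
            j += 1
        gap = max(gap, abs(i / len(a) - j / len(b)))
    return gap

train = [0.1 * k for k in range(100)]               # uniform on [0, 10)
test_ok = [0.1 * k + 0.05 for k in range(100)]      # same range, tiny offset
test_shifted = [0.1 * k + 5.0 for k in range(100)]  # shifted distribution

assert ks_statistic(train, test_ok) < 0.1       # splits look consistent
assert ks_statistic(train, test_shifted) > 0.4  # split mismatch: investigate
```

A large KS gap on any important predictor is exactly the kind of training-test inconsistency the Eye of Sauron will exploit.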
10. Data Leakage
Ursula from The Little Mermaid, can completely blind you with her jet black ink, leaking it into undesirable places
a. Are any features in the training, validation, and test splits inappropriately using information from the response?
b. Are ID columns predictive?
1. Look for unexpected relationships and for any clues that may indicate a feature is leaky
2. Engineer new features with simple functions to further check for leaks
Fit Y by X, Graph Builder, Multivariate, Clustering, Select Columns > Right Click > New Formula Column
How in the world can we efficiently deal with these troublemakers, especially when they conspire together and we don't have time to write and debug custom code? The final column contains the JMP routines I use every day to handle them: my secret, indispensable weapon. With inspiration from a famous small superhero from the 1950s and the Apple Macintosh's forever-game-changing user interface, JMP has been steadily improving for 30 years under a guiding philosophy of super-tight integration of statistics and graphics, driven by a mouse.