10 dangers in your tabular data

russ_wolfinger · Apr 24, 2018 01:08 PM

Building predictive models from data tables? Watch out for these 10 villains. Image courtesy of prettysleepy on Pixabay.

I've been immersing myself in the world of data science competitions, in which predictive performance on a blinded holdout set ranks you and your teammates on a leaderboard, with prizes and street cred on the line. To succeed in these contests, you have to be very efficient and smart with your time, since there are hundreds of small decisions and different directions you can take along the way. If you mess up or waste time, somebody passes you. Sometimes, it even feels like there are dangers purposely trying to trip you up. Kind of sounds like real life, doesn't it?

For every good practice and bit of deftness with data I have learned, I have probably made 10 times as many mistakes. Over time, I've gathered a list of 10 dangers to constantly watch out for and battle. As I describe in a previous post, data never comes to us clean, and these troubles are always lurking. While the list focuses on tabular data, many of the problems carry over to images, text, and other kinds of structural data.

These dangers are so real and pervasive that they can take on a wicked character of their own, and in fact, for fun I've associated each with a supervillain. In the table below, see if you can picture these evildoers deviously personifying their dangers on your data tables.

Table columns, in order: Danger, Supervillain, Questions, Answers, JMP Platforms

1. Improper Shape

Mystique, the alluringly deceptive shapeshifter from the X-Men

a. Is the table in wide form, not tall or stacked?

b. Do rows correspond to distinct observations to be trained and/or predicted and columns to targets and features?

1. Visual inspection

2. Perform data manipulation and reshaping as necessary

Transpose, Split, Stack, Join, Update
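
For readers working outside JMP, here is a rough pandas sketch of the same reshaping idea: splitting a stacked (tall) table into wide form so that each row is one observation. The table and its column names (`subject`, `feature`, `value`) are made up for illustration.

```python
import pandas as pd

# Hypothetical stacked (tall) table: one row per (subject, feature) pair.
tall = pd.DataFrame({
    "subject": ["s1", "s1", "s2", "s2"],
    "feature": ["height", "weight", "height", "weight"],
    "value": [170.0, 65.0, 182.0, 80.0],
})

# Split into wide form: one row per subject, one column per feature.
wide = tall.pivot(index="subject", columns="feature", values="value").reset_index()
wide.columns.name = None  # drop the leftover "feature" axis label

# Stack back to tall form to confirm the round trip keeps every value.
restacked = wide.melt(id_vars="subject", var_name="feature", value_name="value")

print(wide)
```

The wide table is the shape most modeling routines expect; the round trip back to tall form is just a quick sanity check that no cells were lost.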

2. Incorrect Values

Joker from Batman, always looking to play bad tricks on you

a. Are data values correct?

b. What may have caused mistakes?

1. Visual scan

2. Univariate, bivariate, multivariate plots and stats

3. Check and verify original data sources as necessary

Distribution, Graph Builder, Multivariate
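
As a code-based stand-in for a univariate scan, the following pandas sketch summarizes a numeric column and counts the levels of a categorical one. The data and the plausible age range of 0 to 120 are assumptions chosen only for illustration.

```python
import pandas as pd

# Made-up table with two planted problems: impossible ages and
# inconsistent spellings of the same state.
df = pd.DataFrame({
    "age": [34, 29, -1, 41, 230],
    "state": ["NC", "nc", "SC", "NC", "N.C."],
})

# Univariate summary: the min and max often expose impossible values.
summary = df["age"].describe()

# Rows outside an assumed plausible range for this column.
bad_age = df[(df["age"] < 0) | (df["age"] > 120)]

# Frequency counts reveal several spellings of what is likely one category.
state_counts = df["state"].value_counts()

print(summary)
print(bad_age)
print(state_counts)
```

Flagged rows are candidates to check against the original data source, not values to delete automatically.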

3. Missing Values

Lord Voldemort from Harry Potter; He-Who-Must-Not-Be-Named can mysteriously remove Values-That-Should-Be-There

a. Are values missing at random, or are there systematic patterns?

b. Are missing value patterns predictive?

1. Carefully study degree and patterns of missingness; engineer new features as needed

2. Create missing value indicators and check if they are correlated with the response

3. Note: Imputation typically is not necessary for gradient boosted trees

Analyze > Screening > Explore Missing Values, Cols > Recode, Graph Builder
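
The missing value indicator check in answer 2 can be sketched in pandas roughly as follows. The columns `income`, `age`, and `y` are hypothetical, and the data is constructed so that missingness in `income` lines up exactly with the response.

```python
import numpy as np
import pandas as pd

# Made-up data: income is missing exactly where y == 1.
df = pd.DataFrame({
    "income": [50.0, np.nan, 62.0, np.nan, 48.0, np.nan, 55.0, 70.0],
    "age": [31, 45, np.nan, 52, 28, 60, 39, 44],
    "y": [0, 1, 0, 1, 0, 1, 0, 0],
})
features = ["income", "age"]

# Degree of missingness per column.
missing_rate = df[features].isna().mean()

# One 0/1 indicator column per feature.
indicators = df[features].isna().astype(int).add_suffix("_missing")

# Correlation of each indicator with the response.
indicator_corr = indicators.corrwith(df["y"])

print(missing_rate)
print(indicator_corr)
```

A strong correlation like the one for `income_missing` here suggests the missingness pattern is itself predictive and worth keeping as a feature.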

4. Invalid Predictors

Mr. Burns from The Simpsons, selfishly scheming and plotting in ways that make your life miserable

a. Are any columns infeasible or inappropriate to use for future prediction?

b. Do all potential predictors make sense scientifically and ethically?

1. Scan through all columns and understand the meaning of each

2. Either delete invalid columns or move them to the beginning or end of the table so they are not selected later

Cols > Delete, Cols > Reorder, Move columns with mouse in left side panel

5. High Cardinality

Magneto from the X-Men, controls anything containing metal and can throw a lot of pieces at you all at once

a. Which categorical predictors have a high number of levels?

b. What kind of numerical encodings make sense?

c. Should low-frequency categories be collapsed?

1. Inspect number and frequency counts of levels

2. Recode levels as appropriate

3. Compute numerical encodings, being especially careful if they use target information to avoid overfitting

Distribution, Cols > Recode
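
One common way to keep a target-based encoding from overfitting is to compute it out of fold, so that no row's encoding uses its own response value. The sketch below is one such approach in pandas, not necessarily the one used in JMP; the data, the `min_count` threshold, and the fold assignment are all assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Made-up data: a categorical predictor and a binary response.
df = pd.DataFrame({
    "city": ["A", "A", "A", "B", "B", "B", "C", "D"],
    "y":    [1,   0,   1,   0,   0,   1,   1,   0],
})

# Collapse levels seen fewer than min_count times into "Other".
min_count = 2
counts = df["city"].value_counts()
rare = counts[counts < min_count].index
df["city_collapsed"] = df["city"].where(~df["city"].isin(rare), "Other")

# Out-of-fold target encoding: each row is encoded with level means
# computed only from rows in the other folds.
n_folds = 4
fold = np.arange(len(df)) % n_folds  # simple deterministic fold assignment
global_mean = df["y"].mean()
df["city_te"] = np.nan
for k in range(n_folds):
    held_out = fold == k
    means = df[~held_out].groupby("city_collapsed")["y"].mean()
    encoded = df.loc[held_out, "city_collapsed"].map(means)
    # Levels unseen in the other folds fall back to the global mean.
    df.loc[held_out, "city_te"] = encoded.fillna(global_mean).to_numpy()

print(df)
```

With real data you would typically shuffle rows before assigning folds and apply the same fold logic inside any cross-validation loop.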

6. Inliers and Outliers

Winged Monkeys from The Wizard of Oz, constantly flying inside and outside, carrying out the evil biddings of the Wicked Witch

a. Are extreme values within reasonable bounds?

b. Are any values unexpectedly repeated too frequently?

1. Inspect histograms and quantiles

2. Verify unusual values are correct

3. Use log transform to assess really skewed continuous distributions

4. Mahalanobis distances

Distribution, Multivariate, Analyze > Screening > Explore Outliers
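
The checks above can be approximated in NumPy and pandas as in the sketch below, which covers Mahalanobis distances, a log transform of a skewed column, and a count of suspiciously repeated values. All data is simulated, with one outlier and one repeated code (999.0) planted on purpose.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Simulated two-column table with one planted multivariate outlier.
X = pd.DataFrame(rng.normal(size=(200, 2)), columns=["x1", "x2"])
X.loc[0] = [8.0, -8.0]

# Mahalanobis distance of each row from the column means.
values = X.to_numpy()
centered = values - values.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(values, rowvar=False))
d2 = np.einsum("ij,jk,ik->i", centered, cov_inv, centered)
X["mahalanobis"] = np.sqrt(d2)

# Log transform to assess a heavily skewed column.
skewed = pd.Series(rng.lognormal(size=200))
raw_skew = skewed.skew()
log_skew = np.log(skewed).skew()

# Inliers: share of rows taken by the single most frequent value.
codes = pd.Series([12.5, 999.0, 3.1, 999.0, 999.0, 7.8])
top_value_share = codes.value_counts(normalize=True).iloc[0]

print(X["mahalanobis"].idxmax(), raw_skew, log_skew, top_value_share)
```

A value that takes up half of a continuous column, like 999.0 here, is often a placeholder for missing data rather than a real measurement.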

7. Duplicated Data

Siamese Cats from Lady and the Tramp, with doubly evil intentions and an annoyingly catchy song

a. Are any rows or columns accidentally duplicated?

b. Are any columns isomorphic?

c. Are ID values unique for each row?

1. Visual scan

2. Two-way clustering, scatterplot matrix

Summary, Sort, Clustering (for rows), Multivariate (for columns)
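
A minimal pandas sketch of the three duplication questions follows. The table is made up, and `is_isomorphic` is a hypothetical helper that treats two columns as isomorphic when their levels map one-to-one onto each other.

```python
import pandas as pd

# Made-up table: rows 2 and 3 are identical, and color_code is just a
# relabeling of color.
df = pd.DataFrame({
    "id": [1, 2, 3, 3],
    "color": ["red", "blue", "red", "red"],
    "color_code": [10, 20, 10, 10],
    "size": [5, 7, 6, 6],
})

# Fully duplicated rows (all copies are kept for inspection).
dup_rows = df[df.duplicated(keep=False)]

# Are ID values unique for each row?
ids_unique = df["id"].is_unique


def is_isomorphic(a, b):
    """Return True when the levels of a and b map one-to-one."""
    pairs = pd.DataFrame({"a": a, "b": b}).drop_duplicates()
    return bool(pairs["a"].is_unique and pairs["b"].is_unique)


print(dup_rows)
print(ids_unique)
print(is_isomorphic(df["color"], df["color_code"]))
print(is_isomorphic(df["color"], df["size"]))
```

Isomorphic columns carry the same information, so usually only one of them needs to stay in the model.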

8. Response Imbalance

Jabba the Hutt from Star Wars, happy to squish and slime you under all his disgusting blubber

a. What kind of response is it?

b. For binary or nominal, are the proportions representative of the population you want to predict?

c. For continuous, does its distribution make sense?

1. Inspect response distribution

2. Consider weighting, oversampling, undersampling; can depend on desired performance metrics

Distribution, Cols > New Columns to create a weight variable
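
A weight variable for an imbalanced binary response can be sketched as below. The weighting scheme (row count divided by the number of classes times the class count) is one common choice and an assumption here; whether to weight at all depends on the performance metric you care about.

```python
import pandas as pd

# Made-up binary response with an 80/20 imbalance.
y = pd.Series([0, 0, 0, 0, 0, 0, 0, 0, 1, 1], name="y")

# Inspect the response distribution.
class_share = y.value_counts(normalize=True)

# Balanced weights: n / (n_classes * class_count), so each class
# contributes the same total weight.
class_count = y.map(y.value_counts())
weight = len(y) / (y.nunique() * class_count)

print(class_share)
print(weight.groupby(y).sum())
```

With these weights each class sums to the same total, and the weights overall still sum to the number of rows.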

9. Training-Test Inconsistencies

Eye of Sauron from Lord of the Rings, ever watching you to find your innermost contradictions and exploit them

a. Are distributions of the response and predictors similar across training, validation, and test splits?

b. Will future test sets stay within scope, or is there a real need for extrapolation?

1. Create a new variable indicating splits; then use it to dynamically explore distributions

2. Use low-dimensional projections to compare structures of subsets

Fit Y by X (Y is all variables and X is the split indicator; this creates mosaic plots for nominal variables and empirical cumulative distribution comparisons for continuous variables), Multivariate
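
The empirical cumulative distribution comparison can be reduced to a single number per column: the largest vertical gap between the training and test ECDFs (the two-sample Kolmogorov-Smirnov statistic). The sketch below computes it with NumPy on simulated data in which `x2` is deliberately shifted in the test split; `max_ecdf_gap` is a hypothetical helper name.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# Simulated splits: x1 matches across splits, x2 is shifted in the test set.
train = pd.DataFrame({"x1": rng.normal(0, 1, 500), "x2": rng.normal(0, 1, 500)})
test = pd.DataFrame({"x1": rng.normal(0, 1, 500), "x2": rng.normal(1.5, 1, 500)})

# New variable indicating the split.
both = pd.concat(
    [train.assign(split="train"), test.assign(split="test")],
    ignore_index=True,
)


def max_ecdf_gap(a, b):
    """Largest vertical gap between the empirical CDFs of a and b."""
    a, b = np.sort(a), np.sort(b)
    grid = np.concatenate([a, b])
    cdf_a = np.searchsorted(a, grid, side="right") / len(a)
    cdf_b = np.searchsorted(b, grid, side="right") / len(b)
    return float(np.abs(cdf_a - cdf_b).max())


is_train = both["split"] == "train"
gaps = {
    col: max_ecdf_gap(
        both.loc[is_train, col].to_numpy(),
        both.loc[~is_train, col].to_numpy(),
    )
    for col in ["x1", "x2"]
}
print(gaps)
```

Sorting columns by this gap gives a quick shortlist of predictors to plot and inspect more closely.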

10. Data Leakage

Ursula from The Little Mermaid, can completely blind you with her jet black ink, leaking it into undesirable places

a. Are any features in the training, validation, and test splits inappropriately using information from the response?

b. Are ID columns predictive?

1. Look for unexpected relationships and for any clues that may indicate a feature is leaky

2. Engineer new features with simple functions to further check for leaks

Fit Y by X, Graph Builder, Multivariate, Clustering, Select Columns > Right Click > New Formula Column
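
To check whether an ID column is predictive, one simple screen is to correlate the ranks of each candidate column with the response. The sketch below uses simulated data in which the positives were appended to the table last, so the row ID tracks the response; the column names are made up.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
n = 400

# Simulated table: "noise" is a harmless feature, while "id" leaks because
# all positive rows were added at the end of the table.
df = pd.DataFrame({
    "id": np.arange(n),
    "noise": rng.normal(size=n),
})
df["y"] = (df["id"] >= 300).astype(int)

# Correlate the ranks of each column with the response and sort by strength.
leak_scan = (
    df[["id", "noise"]]
    .rank()
    .corrwith(df["y"])
    .abs()
    .sort_values(ascending=False)
)
print(leak_scan)
```

An ID column near the top of such a list is a clue that row order or ID assignment encodes the response and deserves a closer look before modeling.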

How in the world can we efficiently deal with these troublemakers, especially when they conspire together and we do not have sufficient time to write and debug custom code? The final column contains the JMP routines that I use every day to handle them; they are my secret, indispensable weapon. With inspiration from a famous small superhero from the 1950s and the Apple Macintosh's forever-game-changing user interface, JMP has been steadily improving for 30 years under a principal guiding philosophy of super-tight integration of statistics and graphics, driven by a mouse.