Danger
|
Supervillain
|
Questions
|
Answers
|
JMP Platforms
|
1. Improper Shape
|
Mystique, the alluringly deceptive shapeshifter from the X-Men
|
a. Is the table in wide form, not tall or stacked?
b. Do rows correspond to distinct observations used for training and/or prediction, and columns to targets and features?
|
1. Visual inspection
2. Perform data manipulation and reshaping as necessary
|
Transpose, Split, Stack, Join, Update
|
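Outside of JMP, the reshaping in answer 2 can be sketched in a few lines of Python. This is a minimal illustration only: the helper `tall_to_wide`, the column names, and the toy records are all hypothetical. It pivots a tall table of (id, variable, value) records into a wide table with one row per observation.

```python
def tall_to_wide(records, id_key, name_key, value_key):
    """Pivot tall (id, variable, value) records into one dict per id."""
    wide = {}
    for rec in records:
        # Create the row for this id on first sight, then add the variable.
        row = wide.setdefault(rec[id_key], {id_key: rec[id_key]})
        row[rec[name_key]] = rec[value_key]
    return list(wide.values())

# Hypothetical toy data in tall (stacked) form
tall = [
    {"patient": 1, "measure": "age", "value": 34},
    {"patient": 1, "measure": "bmi", "value": 22.5},
    {"patient": 2, "measure": "age", "value": 51},
    {"patient": 2, "measure": "bmi", "value": 27.1},
]

wide = tall_to_wide(tall, "patient", "measure", "value")
for row in wide:
    print(row)
```

Each resulting row is one observation, with the former `measure` levels as feature columns.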
2. Incorrect Values
|
Joker from Batman, always looking to play bad tricks on you
|
a. Are data values correct?
b. What may have caused mistakes?
|
1. Visual scan
2. Univariate, bivariate, multivariate plots and stats
3. Check and verify original data sources as necessary
|
Distribution, Graph Builder, Multivariate
|
3. Missing Values
|
Lord Voldemort from Harry Potter: He-Who-Must-Not-Be-Named can mysteriously remove Values-That-Should-Be-There
|
a. Are values missing at random, or are there systematic patterns?
b. Are missing value patterns predictive?
|
1. Carefully study degree and patterns of missingness; engineer new features as needed
2. Create missing value indicators and check if they are correlated with the response
3. Note: Imputation typically is not necessary for gradient boosted trees
|
Analyze > Screening > Explore Missing Values, Cols > Recode, Graph Builder
|
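As a rough illustration of answer 2 outside of JMP, the Python sketch below builds a missing-value indicator for one feature and compares the mean response between rows where the feature is missing and rows where it is present. The column names `income` and `default`, the toy values, and both helper functions are hypothetical.

```python
from statistics import mean

def missing_indicator(values):
    """Return 1 where a value is missing (None), else 0."""
    return [1 if v is None else 0 for v in values]

def mean_response_by_missingness(values, response):
    """Mean of the response for rows with and without a missing value."""
    flags = missing_indicator(values)
    missing = [y for f, y in zip(flags, response) if f == 1]
    present = [y for f, y in zip(flags, response) if f == 0]
    return {
        "missing": mean(missing) if missing else None,
        "present": mean(present) if present else None,
    }

# Hypothetical toy data: None marks a missing value
income = [52000, None, 61000, None, 48000, 75000, None, 58000]
default = [0, 1, 0, 1, 0, 0, 1, 1]

rates = mean_response_by_missingness(income, default)
print(rates)
```

A large gap between the two means suggests the missingness pattern itself carries signal, in which case the indicator is worth keeping as a feature.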
4. Invalid Predictors
|
Mr. Burns from The Simpsons, selfishly scheming and plotting in ways that make your life miserable
|
a. Are any columns infeasible or inappropriate to use for future prediction?
b. Do all potential predictors make sense scientifically and ethically?
|
1. Scan through all columns and understand the meaning of each
2. Either delete invalid columns or move them to the beginning or end of the table so they are not selected later
|
Cols > Delete, Cols > Reorder, Move columns with mouse in left side panel
|
5. High Cardinality
|
Magneto from the X-Men, who controls anything containing metal and can throw a lot of pieces at you all at once
|
a. Which categorical predictors have a high number of levels?
b. What kind of numerical encodings make sense?
c. Should low-frequency categories be collapsed?
|
1. Inspect number and frequency counts of levels
2. Recode levels as appropriate
3. Compute numerical encodings; if they use target information, take special care to avoid overfitting
|
Distribution, Cols > Recode
|
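The recoding and encoding steps above can be sketched in plain Python. This is one possible approach under stated assumptions, not a prescribed method: `collapse_rare_levels` and `out_of_fold_target_encode` are hypothetical helpers, the data are made up, and folds are assigned by row position for reproducibility (in practice they would normally be randomized). The target encoding for each row uses only rows from other folds, which is one way to limit the overfitting risk mentioned in answer 3.

```python
from collections import Counter

def collapse_rare_levels(levels, min_count=2, other="Other"):
    """Replace levels seen fewer than min_count times with a single label."""
    counts = Counter(levels)
    return [lv if counts[lv] >= min_count else other for lv in levels]

def out_of_fold_target_encode(levels, target, n_folds=2):
    """Encode each row by the mean target of its level in the OTHER folds."""
    n = len(levels)
    folds = [i % n_folds for i in range(n)]  # assumption: positional folds
    encoded = []
    for i in range(n):
        others = [j for j in range(n) if folds[j] != folds[i]]
        same_level = [target[j] for j in others if levels[j] == levels[i]]
        if same_level:
            encoded.append(sum(same_level) / len(same_level))
        else:
            # Level unseen outside this fold: fall back to the out-of-fold mean.
            encoded.append(sum(target[j] for j in others) / len(others))
    return encoded

# Hypothetical toy data
city = ["A", "A", "B", "B", "A", "C", "D", "B"]
y = [1, 0, 1, 1, 1, 0, 1, 0]

collapsed = collapse_rare_levels(city, min_count=2)
encoded = out_of_fold_target_encode(collapsed, y, n_folds=2)
print(collapsed)
print(encoded)
```

Because a row never contributes to its own encoding, the encoded column cannot simply memorize the response for rare levels.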
6. Inliers and Outliers
|
Winged Monkeys from The Wizard of Oz, constantly flying inside and outside, carrying out the evil biddings of the Wicked Witch
|
a. Are extreme values within reasonable bounds?
b. Are any values unexpectedly repeated too frequently?
|
1. Inspect histograms and quantiles
2. Verify unusual values are correct
3. Use a log transform to assess highly skewed continuous distributions
4. Compute Mahalanobis distances to flag multivariate outliers
|
Distribution, Multivariate, Analyze > Screening > Explore Outliers
|
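Both questions in this row can be approximated with simple checks in Python. The sketch below is illustrative only: the helpers `iqr_outliers` and `frequent_values`, the 1.5 multiplier, the 20% share threshold, and the toy `amounts` column (where 999 plays the role of a suspicious placeholder code) are all assumptions.

```python
import math
from collections import Counter
from statistics import quantiles

def iqr_outliers(values, k=1.5):
    """Values lying more than k * IQR outside the first or third quartile."""
    q1, _, q3 = quantiles(values, n=4)
    low, high = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    return [v for v in values if v < low or v > high]

def frequent_values(values, max_share=0.2):
    """Values whose share of all rows exceeds max_share (possible inliers)."""
    counts = Counter(values)
    n = len(values)
    return {v: c for v, c in counts.items() if c / n > max_share}

# Hypothetical toy data
amounts = [12, 15, 14, 999, 13, 999, 16, 999, 11, 5000]

raw_flags = iqr_outliers(amounts)                         # raw scale
log_flags = iqr_outliers([math.log(v) for v in amounts])  # log scale
repeats = frequent_values(amounts)
print(raw_flags, log_flags, repeats)
```

Comparing the raw-scale and log-scale flags shows how much of the apparent extremeness is due to skew, while the repeated-value check surfaces values such as placeholder codes that sit inside the normal range.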
7. Duplicated Data
|
Siamese Cats from Lady and the Tramp, with doubly evil intentions and an annoyingly catchy song
|
a. Are any rows or columns accidentally duplicated?
b. Are any columns isomorphic?
c. Are ID values unique for each row?
|
1. Visual scan
2. Two-way clustering, scatterplot matrix
|
Summary, Sort, Clustering (for rows), Multivariate (for columns)
|
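The three questions in this row translate into short programmatic checks. The following Python sketch is a minimal illustration with hypothetical helper names and toy data. It interprets "isomorphic" columns as columns whose levels map one-to-one onto each other, which is an assumption about the intended meaning.

```python
from collections import Counter

def duplicated_ids(ids):
    """ID values that appear on more than one row."""
    counts = Counter(ids)
    return sorted(k for k, c in counts.items() if c > 1)

def duplicated_rows(rows):
    """Indices of rows that exactly repeat an earlier row."""
    seen, dups = set(), []
    for i, row in enumerate(rows):
        key = tuple(row)
        if key in seen:
            dups.append(i)
        else:
            seen.add(key)
    return dups

def isomorphic_columns(col_a, col_b):
    """True if the levels of the two columns map one-to-one onto each other."""
    pairs = set(zip(col_a, col_b))
    return len(pairs) == len(set(col_a)) == len(set(col_b))

# Hypothetical toy table: (id, color, color_code)
rows = [
    (101, "red", "R"),
    (102, "blue", "B"),
    (103, "red", "R"),
    (102, "blue", "B"),
    (104, "green", "G"),
]

dup_ids = duplicated_ids([r[0] for r in rows])
dup_rows = duplicated_rows(rows)
same_info = isomorphic_columns([r[1] for r in rows], [r[2] for r in rows])
print(dup_ids, dup_rows, same_info)
```

Two columns that pass the one-to-one check carry the same information under different labels, so only one of them is needed as a predictor.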
8. Response Imbalance
|
Jabba the Hutt from Star Wars, happy to squish and slime you under all his disgusting blubber
|
a. What kind of response is it?
b. For binary or nominal, are the proportions representative of the population you want to predict?
c. For continuous, does its distribution make sense?
|
1. Inspect response distribution
2. Consider weighting, oversampling, undersampling; can depend on desired performance metrics
|
Distribution, Cols > New Columns to create a weight variable
|
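One way to build the weight variable mentioned above is an inverse-frequency heuristic, sketched here in Python. The helper `balanced_class_weights` and the toy labels are hypothetical, and this is only one of several reasonable weighting schemes. Each class receives the weight n / (k * n_c), where n is the number of rows, k the number of classes, and n_c the class count.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Inverse-frequency weight per class: n / (k * class_count)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * cnt) for cls, cnt in counts.items()}

# Hypothetical toy response with an 80/20 imbalance
response = ["no"] * 8 + ["yes"] * 2

weights = balanced_class_weights(response)
row_weights = [weights[label] for label in response]
print(weights)
```

With these weights every class contributes the same total weight, and the weights still sum to the number of rows. Whether that is desirable depends on the performance metric and on whether the observed proportions are representative of the population to be predicted.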
9. Training-Test Inconsistencies
|
Eye of Sauron from The Lord of the Rings, ever watching you to find your innermost contradictions and exploit them
|
a. Are distributions of the response and predictors similar across training, validation, and test splits?
b. Will future test sets stay within scope, or is there a real need for extrapolation?
|
1. Create a new variable indicating splits; then use it to dynamically explore distributions
2. Use low-dimensional projections to compare structures of subsets
|
Fit Y by X, with Y set to all variables and X to the split indicator, to create mosaic plots for nominal variables and empirical cumulative distribution comparisons for continuous variables; Multivariate
|
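The empirical cumulative distribution comparison for a continuous variable can also be summarized numerically. The Python sketch below computes the largest vertical gap between the empirical CDFs of two splits, which is the two-sample Kolmogorov-Smirnov statistic. The helper names and the toy `train` and `test` values are hypothetical.

```python
from bisect import bisect_right

def ecdf(sample):
    """Return a function giving the share of the sample that is <= x."""
    xs = sorted(sample)
    n = len(xs)
    return lambda x: bisect_right(xs, x) / n

def ks_statistic(a, b):
    """Largest absolute gap between the two empirical CDFs."""
    fa, fb = ecdf(a), ecdf(b)
    # Both CDFs are step functions, so the gap peaks at an observed value.
    return max(abs(fa(x) - fb(x)) for x in set(a) | set(b))

# Hypothetical toy feature values in two splits
train = [1, 2, 3, 4, 5, 6, 7, 8]
test = [5, 6, 7, 8, 9, 10, 11, 12]

gap = ks_statistic(train, test)
print(gap)
```

A gap near 0 means the two splits have similar distributions for that variable, while a gap near 1 means they barely overlap and predictions on the test split would require extrapolation.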
10. Data Leakage
|
Ursula from The Little Mermaid, can completely blind you with her jet black ink, leaking it into undesirable places
|
a. Are any features in the training, validation, and test splits inappropriately using information from the response?
b. Are ID columns predictive?
|
1. Look for unexpected relationships and for any clues that may indicate a feature is leaky
2. Engineer new features with simple functions to further check for leaks
|
Fit Y by X, Graph Builder, Multivariate, Clustering, Select Columns > Right Click > New Formula Column
|
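Question b can be checked quickly by correlating the ID column with the response. The Python sketch below is a minimal illustration with a hypothetical `pearson` helper and made-up data, in which rows were sorted by outcome before IDs were assigned, so the ID ends up carrying response information.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Hypothetical toy data: IDs assigned after sorting rows by the response
row_id = list(range(1, 11))
response = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

r = pearson(row_id, response)
print(round(r, 3))
```

An ID column should normally be unrelated to the response, so a strong correlation like this one is a clue that row order or ID assignment leaks information and deserves investigation before modeling.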