billboard
Level I

Does the partition platform require equal variance?

I have highly right-skewed data (see distribution Leaks/mi Yr attached): only about 400 leaks on ~4,000,000 segments of pipe. I use the Partition platform to identify factors with differences in leak rates for our model. But my data indicates unequal variance (see attached RefService2023wLeaks - FitY by X, which fails 4 of 5 variance tests). Does the Partition platform require equal variance? By the Central Limit Theorem, can I assume this large data set will give accurate results for significant differences in means (see attached Partition example)?

Thank you

2 REPLIES
Byron_JMP
Staff

Re: Does the partition platform require equal variance?

First, this looks like a really fun data set.

The modeling method that Partition uses has very few assumptions. It's a very good entry-level data mining tool, especially when you have a lot of data to work with.

 

Also look at Analyze > Screening > Predictor Screening. Just enter your response variable and as many "X" variables as you like. Maybe adjust the number of trees to 500, just for fun.

It builds out a bunch of trees with a random selection of columns and rows for each tree, so it's super robust to bad data and outliers, as well as to X's that are very correlated with your response. Again, very few assumptions.

JMP Systems Engineer, Health and Life Sciences (Pharma)
Victor_G
Super User

Re: Does the partition platform require equal variance?

To expand on the excellent response from @Byron_JMP: a Decision Tree is a very simple yet powerful algorithm. It has several advantages that make it an effective choice for data mining: it is easy to understand and interpret, handles categorical and numerical inputs and outputs, can handle missing values, is robust to outliers, can detect non-linear relationships, and supports predictor selection and ranking.

It doesn't rely on traditional parametric assumptions, but there are a few points you should pay attention to:

  • Check multicollinearity and correlations among your predictors: At each step the Decision Tree evaluates all predictors and splits on the one that best separates your Y values. When predictors are correlated, some of them may be left out and never selected, because they share information with a predictor that was already used.
  • Beware of overfitting: Decision Trees are prone to overfitting, since a tree can be grown very deep until it perfectly classifies (or, for a numerical response, fits) the training data. It is recommended to use a validation strategy (K-fold cross-validation, Leave-One-Out, or a validation set, since data quantity doesn't seem to be an issue in your case) and to make sure the model is not too complex for the task at hand: setting a maximum depth (number of splits) and a minimum split size helps reduce the risk of overfitting.
  • Training data sample and representativeness: Because Decision Trees are greedy algorithms (each split is chosen to maximize information gain, or reduce entropy, at that step), they are very sensitive to the training data used. Small changes in the data can change the rules the tree produces. Data representativeness and noise should be addressed to avoid overfitting and to ensure the results generalize.
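
To make the greedy splitting and the minimum split size concrete, here is a minimal Python sketch of how one split could be chosen for a numerical response. It is a generic CART-style illustration that picks the cut point with the largest reduction in the sum of squared errors; it is not JMP's Partition implementation, and the function names (sse, best_split), the min_size parameter, and the toy data are assumptions made up for this example.

```python
from statistics import fmean


def sse(values):
    """Sum of squared deviations from the mean (0.0 for an empty list)."""
    if not values:
        return 0.0
    m = fmean(values)
    return sum((v - m) ** 2 for v in values)


def best_split(x, y, min_size=1):
    """Return (cut, sse_reduction) for the best single split on x, or None.

    Rows with x below the cut go left, the rest go right. Candidate splits
    that would leave fewer than min_size rows on either side are skipped.
    """
    pairs = sorted(zip(x, y))
    ys = [p[1] for p in pairs]
    total = sse(ys)
    best = None
    for i in range(1, len(pairs)):
        if pairs[i][0] == pairs[i - 1][0]:
            continue  # no cut point exists between equal x values
        if i < min_size or len(pairs) - i < min_size:
            continue  # minimum split size: a simple guard against overfitting
        reduction = total - sse(ys[:i]) - sse(ys[i:])
        if best is None or reduction > best[1]:
            cut = (pairs[i - 1][0] + pairs[i][0]) / 2
            best = (cut, reduction)
    return best


# Hypothetical toy data: leak counts rise once pipe age passes about 30 years.
age = [5, 10, 15, 20, 25, 35, 40, 45, 50, 55]
leaks = [0, 0, 1, 0, 0, 3, 4, 3, 5, 4]
cut, gain = best_split(age, leaks, min_size=2)
print(cut, gain)
```

The search is greedy because it only looks for the best cut at the current step. Note that the criterion compares group means and never assumes the two sides have equal variance. Raising min_size rules out splits that isolate a handful of rows.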

 

As suggested by @Byron_JMP, it is often worthwhile to use a Random Forest (available through the Bootstrap Forest platform in JMP Pro, or through Predictor Screening in JMP). This technique reduces some drawbacks of a single Decision Tree through two mechanisms. First, it builds many independent trees in parallel, each trained on a bootstrapped sample, which reduces sensitivity to the training set, improves generalization, and enables validation through "Out-Of-Bag" sample results. Second, each split considers only a subset of the predictors, which gives every predictor, even one correlated with others, a comparable chance of being selected. This leads to more diverse trees and an ensemble that generalizes well.
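
The two sampling mechanisms can be sketched in a few lines of Python. This is only an illustration of the general Random Forest idea, not the Bootstrap Forest or Predictor Screening code; the helper names (bootstrap_rows, candidate_predictors), the predictor names, and the row and tree counts are hypothetical.

```python
import random


def bootstrap_rows(n_rows, rng):
    """Draw n_rows row indices with replacement; return (in_bag, out_of_bag)."""
    in_bag = [rng.randrange(n_rows) for _ in range(n_rows)]
    chosen = set(in_bag)
    out_of_bag = [i for i in range(n_rows) if i not in chosen]
    return in_bag, out_of_bag


def candidate_predictors(predictors, k, rng):
    """Pick k predictors at random, without replacement, for one split."""
    return rng.sample(predictors, k)


rng = random.Random(0)  # fixed seed so the sketch is reproducible
predictors = ["material", "diameter", "age", "pressure", "soil"]  # hypothetical

n_rows, n_trees = 1000, 5
oob_fractions = []
for _ in range(n_trees):
    # Each tree is trained on its own bootstrap sample of the rows ...
    in_bag, oob = bootstrap_rows(n_rows, rng)
    # ... and each split only looks at a random subset of the predictors.
    split_candidates = candidate_predictors(predictors, 2, rng)
    oob_fractions.append(len(oob) / n_rows)

print(oob_fractions)
```

Each tree leaves out roughly a third of the rows (about 1/e of them on average), and those Out-Of-Bag rows can serve as a built-in validation set for that tree.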

 

Hope this complementary answer will help you,

Victor GUILLER
Data & Analytics