maryam_nourmand
Level III

data preprocessing

Hello
I want to apply non-parametric methods such as Decision Tree, Random Forest, Naive Bayes, SVM, Neural Network, and Boosted Tree to my dataset. What preprocessing steps should I perform before applying these models?


3 REPLIES
Victor_G
Super User

Re: data preprocessing

Hi @maryam_nourmand,

 

When talking about preprocessing, there are usually several steps to prepare the data for analysis:

     
     0. Partitioning your data: Split your data into training, validation, and test sets, or plan your validation (K-fold cross-validation, Leave-One-Out, ...) and test strategy. This should be done first, so that any data transformation you apply afterwards does not lead to data leakage, i.e. information from the validation and test sets (data ranges, distributions, imputation statistics, ...) finding its way into model training through the preprocessing steps. You can use Make Validation Column (jmp.com) for this step.
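For readers who want to mirror this step outside JMP, here is a minimal sketch of the same idea in Python with scikit-learn; the predictors X and the categorical response y are synthetic stand-ins for your own data:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, train_test_split

# Synthetic stand-ins: 200 rows, 5 predictors, binary response.
rng = np.random.default_rng(42)
X = rng.random((200, 5))
y = rng.integers(0, 2, size=200)

# Hold out a test set FIRST, before any preprocessing,
# so that no information from test rows leaks into training.
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Plan K-fold cross-validation on the remaining rows; stratification
# keeps the class proportions similar across folds.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for fold, (train_idx, val_idx) in enumerate(cv.split(X_trainval, y_trainval)):
    print(f"Fold {fold}: {len(train_idx)} train rows, {len(val_idx)} validation rows")
```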

 

  1. Data cleaning: Identify and correct errors or inconsistencies in the data set to ensure that the data is of high quality and suitable for analysis or model training (a minimal Python sketch covering steps 1 to 3 follows this list).
    • Missing values: you can exclude rows, impute values, or do nothing. Take care to analyze your missing values with the Explore Missing Values (jmp.com) platform to check whether they are MAR (Missing At Random), MCAR (Missing Completely At Random) or MNAR (Missing Not At Random). Depending on the type and patterns of missing values you have, some options may be preferable to others.
      • Tree-based models are much more flexible with missing values than the other algorithms you mention, and can treat a missing value as informative: Informative Missing (jmp.com).
      • For the Neural Networks platform, you can check the "Informative Missing" option when launching the platform to enable missing value imputation and coding. If you do not check this option, rows with missing values are excluded from the analysis.
      • Naive Bayes doesn't have a problem with missing data when there are only a few missing values.
      • SVM will by default exclude rows with at least one missing value.
    • Outliers: you can exclude them, balance/weight them (with frequency or through transformations), or do nothing. You can detect them with a large variety of methods in the Explore Outliers (jmp.com) platform.
      • Tree-based models are not very sensitive to outliers: since they split the data into slices/ranges, an extreme value simply falls into the highest or lowest slice without changing the cutoff values, so the partitioning is unchanged: https://datascience.stackexchange.com/questions/37394/are-decision-trees-robust-to-outliers
      • Neural Networks are sensitive to outliers, so you need to check for them and handle them.
      • Naive Bayes and SVM may be sensitive to outliers.
    • Duplicates: you can exclude them, reduce their frequency, or do nothing. You can detect them (and other patterns) with the Explore Patterns (jmp.com) platform.
      • Duplicates may affect all algorithms, as they influence learning by favoring or reinforcing the specific patterns found in duplicated points. They can also lead to overfitting and a false evaluation of model performance if they are not taken into account when splitting the data into training, validation, and test sets. Machine Learning methodology differs here from statistical modeling, where replicates are used to evaluate variance and uncertainty.
  2. Variable scaling/normalization: Bring all feature values into a similar range without distorting differences within each feature, so that features with large ranges do not dominate those with smaller ranges/variation.
    This is helpful for algorithms based on distances or similarities, like SVM, KNN, Neural Networks, and (linear & logistic) regression, and is already done by JMP by default (example with KNN: https://community.jmp.com/t5/India-JMP-Users-Group-Library/Predictive-modelling-with-machine-learnin...)
    Tree-based models and probability-based algorithms like Naive Bayes may not require scaling.
  3. Handling categorical data: When coding yourself, you need to encode categorical data using label encoding, one-hot encoding, dummy encoding, etc. Fortunately, JMP already does this for you, so there is no need to do it manually (example with regression models):
    Statistical Details for Nominal Effects Coding (jmp.com)
    Ordinal Factors (jmp.com)
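For readers working outside JMP, here is a minimal Python/scikit-learn sketch of the cleaning, scaling, and encoding steps above; the data set and the column names ("age", "income", "city") are purely hypothetical:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical data set with a few typical defects.
df = pd.DataFrame({
    "age":    [25, 32, None, 47, 47, 190],            # missing value + gross outlier
    "income": [40e3, 52e3, 48e3, None, None, 61e3],
    "city":   ["Paris", "Lyon", "Paris", np.nan, np.nan, "Lyon"],
})

# 1. Data cleaning: drop exact duplicate rows ...
df = df.drop_duplicates()
# ... and flag (rather than silently delete) gross outliers for review.
print("Rows flagged as outliers:\n", df[df["age"] > 120])

# 2./3. Imputation, scaling, and encoding. Scaling matters for SVM/KNN/
# Neural Networks; tree-based models and Naive Bayes can do without it.
preprocess = ColumnTransformer([
    ("num", Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ]), ["age", "income"]),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ]), ["city"]),
])
X_ready = preprocess.fit_transform(df)  # for illustration only; see note below
print(X_ready.shape)
```

Per step 0, in a real workflow this transformer would be fitted on the training folds only, for example by placing it inside a Pipeline together with the model, so that cross-validation refits it for each fold.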

 

Similar questions and answers from Stack Exchange:
machine learning - What algorithms need feature scaling, beside from SVM? - Cross Validated (stackex...

How do outliers and missing values impact these classifiers? - Data Science Stack Exchange

 

As I already wrote in another topic (and on LinkedIn), if you want a first quick test with an ML algorithm, try Random Forest (a minimal sketch follows this list):

  • Robust to outliers
  • Low sensitivity to noise (bagging helps reduce the variance)
  • Less prone to overfitting, thanks to bootstrapping and the "Out-Of-Bag" performance validation it enables (the samples not used in each bootstrap sample)
  • Good generalization performance
  • Easily handles non-linear situations, thanks to the ensemble of decision trees that can model non-linear boundaries with "if-else" rules
  • Controls multicollinearity among variables with the random feature subset selection at each node
  • Low tunability: performance is not very sensitive to hyperparameter tuning (see my latest post about Random Forest hyperparameter tuning with DoE)
  • Ability to handle sparse data
  • Easier interpretability compared to other algorithms (with the use of "built-in" feature importance)
  • Relatively fast and efficient computation (depending on the number and depth of trees used): since individual trees are independent of each other, the training process can be parallelized.
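To illustrate the Out-Of-Bag validation and the built-in feature importance mentioned above, here is a minimal scikit-learn sketch on synthetic data (the same ideas apply in JMP's Bootstrap Forest platform; the Python code is just an analogy, not JMP's implementation):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic classification data standing in for a real data set.
X, y = make_classification(n_samples=500, n_features=8,
                           n_informative=4, random_state=42)

# oob_score=True uses, for each tree, the rows left out of its
# bootstrap sample as a "free" validation set.
rf = RandomForestClassifier(n_estimators=300, oob_score=True,
                            n_jobs=-1,  # trees are independent, so train in parallel
                            random_state=42)
rf.fit(X, y)

print(f"Out-Of-Bag accuracy: {rf.oob_score_:.3f}")
# Built-in feature importance (mean decrease in impurity).
for i, imp in enumerate(rf.feature_importances_):
    print(f"Feature {i}: importance {imp:.3f}")
```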

 

So if you're looking for an accurate, robust, versatile, scalable algorithm with a low tendency to overfit, Random Forest is a good starting point and benchmark for the analysis of your datasets. Always start with simple algorithms and a good validation strategy to evaluate and compare your models, before moving to complex algorithms like Neural Networks, which may need several layers of validation to build and fine-tune.

 

I hope this long answer will help you,

Victor GUILLER

"It is not unusual for a well-designed experiment to analyze itself" (Box, Hunter and Hunter)
maryam_nourmand
Level III

Re: data preprocessing

Hello,
Thanks for your answer.
If I want to use K-fold cross-validation and my response is categorical, when I select columns for Y, should I put all of my predictors and my response variable there, or just my response variable?

Victor_G
Super User

Re: data preprocessing

Hello @maryam_nourmand,

 

If you're speaking about the use of Make Validation Column with K-fold cross-validation, you can put all your predictors (X variables) in the Stratification Columns panel (to make sure the folds are homogeneous regarding the distributions of the X variables) and your response(s) in the Y panel. JMP will order and distribute the rows between folds based on the response values, so that the folds are also homogeneous regarding the response(s), avoiding biased learning and evaluation during cross-validation: Launch the Make Validation Column Platform (jmp.com)

If several rows correspond to the same ID/group, preserve this structure by putting the ID/group column(s) in the "Grouping Columns" panel, so that rows from the same group never end up in different folds.
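Outside JMP, the analogous idea can be sketched with scikit-learn's fold generators; the response, predictors, and group IDs below are synthetic:

```python
import numpy as np
from sklearn.model_selection import StratifiedGroupKFold, StratifiedKFold

rng = np.random.default_rng(1)
X = rng.random((120, 4))                  # synthetic predictors
y = rng.integers(0, 3, size=120)          # categorical response, 3 classes
groups = np.repeat(np.arange(40), 3)      # 3 rows per ID/group

# Stratified folds: class proportions of y stay similar across folds.
for train_idx, val_idx in StratifiedKFold(n_splits=5, shuffle=True,
                                          random_state=1).split(X, y):
    pass  # fit and evaluate a model per fold here

# Group-aware stratified folds: rows sharing an ID never straddle folds.
sgkf = StratifiedGroupKFold(n_splits=5, shuffle=True, random_state=1)
for train_idx, val_idx in sgkf.split(X, y, groups):
    assert set(groups[train_idx]).isdisjoint(groups[val_idx])
```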

 

Hope this will answer your question,

 

Victor GUILLER

"It is not unusual for a well-designed experiment to analyze itself" (Box, Hunter and Hunter)