maryam_nourmand
Level III

data preprocessing

Hello
I want to apply non-parametric methods such as Decision Tree, Random Forest, Naive Bayes, SVM, Neural Network, and Boosted Tree to my dataset. What preprocessing steps should I perform before applying these models?

Victor_G
Super User

Re: data preprocessing

Hi @maryam_nourmand,

 

When talking about pre-processing, there are usually several steps to prepare the data for analysis:

     
     0. Partitioning your data: Split your data into training, validation and test sets, or plan your validation (K-fold cross-validation, Leave-One-Out, ...) and test strategy. This should be done first, so that any data transformation you do afterwards cannot cause data leakage, i.e. information from the validation and test sets (data ranges, distributions, imputation parameters, ...) seeping into model training through the preprocessing steps. You can use Make Validation Column (jmp.com) for this step; a code sketch of this "split first, then preprocess" order follows after this list.

 

  1. Data cleaning: Identify and correct errors or inconsistencies in the data set to ensure that the data is of high quality and suitable for analysis or model training.
    • Missing values: you can exclude rows, impute the data, or do nothing. Take care to analyze your missing values with the Explore Missing Values (jmp.com) platform to check whether they are MAR (Missing At Random), MCAR (Missing Completely At Random) or MNAR (Missing Not At Random). Depending on the type and pattern of missingness, some options may be preferable to others.
      • Tree-based models are much more flexible about missing values than the other algorithms you mention, and can treat a missing value as informative: Informative Missing (jmp.com).
      • For the Neural Networks platform, you can check the "Informative Missing" option when launching the platform to enable missing value imputation and coding. If you do not check this option, rows with missing values are excluded from the analysis.
      • Naive Bayes has no problem with missing data when there are only a few missing values.
      • SVM will by default exclude any row with at least one missing value.
    • Outliers: you can exclude them, balance/weight them (with frequencies or through transformations), or do nothing. You can detect them with a large variety of methods in the Explore Outliers (jmp.com) platform.
      • Tree-based models are not very sensitive to outliers: since they split the data into slices/ranges, an outlier lands in the highest or lowest slice without changing the cutoff values: https://datascience.stackexchange.com/questions/37394/are-decision-trees-robust-to-outliers
      • Neural Networks are sensitive to outliers, so you need to check for them and handle them.
      • Naive Bayes and SVM may be sensitive to outliers.
    • Duplicates: you can exclude them, reduce their frequency, or do nothing. You can detect them (and other patterns) with the Explore Patterns (jmp.com) platform; see the short screening sketch after this list.
      • Duplicates can affect all algorithms, since they bias the learning by favoring or reinforcing the specific patterns found in the duplicated points. They can also lead to overfitting and a false evaluation of model performance if they are not taken into account when splitting the data into training, validation and test sets. Machine Learning methodology differs here from statistical modeling, where replicates are used to evaluate variance and uncertainty.
  2. Variable scaling/normalization: Bring all features/variables into a similar range without distorting the differences within each one, so that features with large ranges do not dominate those with smaller ranges/variation.
    This is helpful for algorithms based on distances or similarities, like SVM, KNN, Neural Networks, and (linear & logistic) regression, and JMP already does it by default (example with KNN: https://community.jmp.com/t5/India-JMP-Users-Group-Library/Predictive-modelling-with-machine-learnin...)
    Tree-based models and probability-based algorithms like Naive Bayes may not require scaling.
  3. Handling categorical data: When coding yourself, you would need to encode categorical data with label encoding, one-hot encoding, dummy encoding, etc. Fortunately, JMP already does this for you, no need to do it yourself (examples with regression models):
    Statistical Details for Nominal Effects Coding (jmp.com)
    Ordinal Factors (jmp.com)
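
To make the leakage point from step 0 concrete, here is a minimal scikit-learn sketch (not JMP/JSL, which handles all of this through its platforms) of the "split first, then fit all preprocessing on the training set only" workflow. The toy data frame, column names and imputation/scaling choices are illustrative assumptions, not recommendations:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data standing in for your table
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "x1": rng.normal(size=200),
    "x2": rng.normal(size=200),
    "material": rng.choice(["A", "B", "C"], size=200),
    "target": rng.choice(["pass", "fail"], size=200),
})
df.loc[::25, "x1"] = np.nan  # sprinkle in a few missing values

num_cols, cat_cols = ["x1", "x2"], ["material"]
X, y = df[num_cols + cat_cols], df["target"]

# Step 0: partition BEFORE any data-dependent transformation
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Steps 1-3: imputation, scaling and encoding, fit on the training set only
preprocess = ColumnTransformer([
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), num_cols),
    ("cat", Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                      ("encode", OneHotEncoder(handle_unknown="ignore"))]), cat_cols),
])
X_train_t = preprocess.fit_transform(X_train)  # parameters learned from training data
X_test_t = preprocess.transform(X_test)        # reused as-is, never re-fit on test data
```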
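
And a short screening sketch for the duplicates and outliers points above, reusing the df from the previous block. The 1.5*IQR rule and the column checked are arbitrary illustrative choices, loosely analogous to what Explore Outliers and Explore Patterns give you interactively:

```python
# Reusing df from the sketch above; column and threshold are arbitrary choices
dupes = df[df.duplicated()]                 # exact duplicate rows
print(f"{len(dupes)} duplicate rows found")
df = df.drop_duplicates()                   # or keep them, but split by group
                                            # so copies never straddle train/test

# Simple 1.5*IQR rule on one numeric column
q1, q3 = df["x1"].quantile(0.25), df["x1"].quantile(0.75)
iqr = q3 - q1
outliers = df[(df["x1"] < q1 - 1.5 * iqr) | (df["x1"] > q3 + 1.5 * iqr)]
print(f"{len(outliers)} potential outliers flagged in x1")
```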

 

Similar questions and answers from Stack Exchange:
machine learning - What algorithms need feature scaling, beside from SVM? - Cross Validated (stackex...

How do outliers and missing values impact these classifiers? - Data Science Stack Exchange

 

So as I already wrote in another topic (and on LinkedIn), if you want to have a first quick test with an ML algorithm, try Random Forest:

  • Robust to outliers
  • Low sensitivity to noise (bagging helps reduce the variance)
  • Less prone to overfitting, thanks to bootstrapping and the built-in "Out-Of-Bag" performance validation (each tree is evaluated on the samples its bootstrap sample left out); see the sketch after this list
  • Good generalization performance
  • Easily handles non-linear situations, since the ensemble of decision trees can model non-linear boundaries with "if-else" rules
  • Controls multicollinearity among variables through the random feature subset selection at each node
  • Low tunability: performance is not very sensitive to hyperparameter tuning (see my latest post about Random Forest hyperparameter tuning with DoE)
  • Ability to handle sparse data
  • Easier interpretability compared to other algorithms (with the use of "built-in" Feature Importance)
  • Relatively fast and efficient computation (depending on the number of trees and the depth used): since the individual trees are independent of each other, the training process can be parallelized.
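
As a hedged illustration of the Out-Of-Bag point above, here is a scikit-learn sketch reusing X_train_t and y_train from the preprocessing block earlier; the hyperparameter values are arbitrary defaults, not tuned recommendations:

```python
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(
    n_estimators=500,   # many independent trees, so training parallelizes (n_jobs)
    oob_score=True,     # score each tree on the samples its bootstrap left out
    n_jobs=-1,
    random_state=42,
)
rf.fit(X_train_t, y_train)

print("Out-Of-Bag accuracy:", rf.oob_score_)            # built-in validation estimate
print("Feature importances:", rf.feature_importances_)  # built-in interpretability aid
```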

 

So if you're looking for an accurate, robust, versatile, scalable algorithm with a low tendency to overfit, Random Forest can be a good starting point and benchmark. Always start with simple algorithms and a good validation strategy to evaluate and compare your models, before moving on to complex algorithms like Neural Networks, which may need several layers of validation to build and fine-tune them.

 

I hope this long answer will help you,

Victor GUILLER
Scientific Expertise Engineer
L'Oréal - Data & Analytics
maryam_nourmand
Level III

Re: data preprocessing

Hello,
thanks for your answer.
If I want to use K-fold cross-validation and my response is categorical,
when I select the columns for Y, should I put all of my predictors and my response variable in it, or just my response variable?

Victor_G
Super User

Re: data preprocessing

Hello @maryam_nourmand,

 

If you're asking about the use of Make Validation Column with K-fold cross-validation methods: put all your predictors (X variables) in the Stratification Columns panel (to make sure the folds are homogeneous regarding the distributions of the X variables), and your response(s) in the Y panel. JMP will order and distribute the rows between folds based on the response values, ensuring that the folds are also homogeneous regarding the response(s) and avoiding biased learning and evaluation during cross-validation: Launch the Make Validation Column Platform (jmp.com)

If several rows correspond to the same ID/group, preserve this data structure by putting the ID/group column(s) in the "Grouping Columns" panel; a rough code analogue of both ideas follows below.
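
To make this concrete, here is a hedged scikit-learn analogue (not JMP/JSL) of K-fold cross-validation with folds stratified on a categorical response, plus the grouped variant for repeated IDs. It reuses rf, X_train_t and y_train from the sketches in my earlier reply; the "id" column is a hypothetical example of a grouping variable:

```python
from sklearn.model_selection import StratifiedKFold, GroupKFold, cross_val_score

# Stratified folds: each fold preserves the class proportions of the
# categorical response, analogous to putting the response in the Y panel.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(rf, X_train_t, y_train, cv=cv)
print("Accuracy per fold:", scores)

# Grouped folds: rows sharing an ID never straddle two folds, analogous to
# the "Grouping Columns" panel (the "id" column here is hypothetical).
gkf = GroupKFold(n_splits=5)
# scores = cross_val_score(rf, X_train_t, y_train, cv=gkf, groups=df["id"])
```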

 

Hope this will answer your question,

 

Victor GUILLER
Scientific Expertise Engineer
L'Oréal - Data & Analytics