Hello
I want to apply non-parametric methods such as Decision Tree, Random Forest, Naive Bayes, SVM, Neural Network, and Boosted Tree on my dataset. What preprocessing steps should I perform before applying these models?
Hi @maryam_nourmand,
Pre-processing usually involves several steps to prepare the data for analysis:
0. Partitioning your data: Split your data into training, validation and test sets, or plan your validation (K-fold cross-validation, Leave-One-Out, ...) and test strategy. Do this first so that any data transformation you apply afterwards does not cause data leakage, that is, information from the validation and test sets (data ranges, distributions, imputed values, ...) reaching model training through the preprocessing steps. You can use Make Validation Column (jmp.com) for this step.
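To make the leakage point concrete outside of JMP, here is a minimal Python sketch using only the standard library. The helper names (`split_rows`, `fit_standardizer`, `standardize`), the 60/20/20 proportions and the toy data are all assumptions made for illustration; this is not what Make Validation Column does internally. The idea it shows is that the scaling parameters are learned from the training rows only and then reused unchanged on the validation and test rows.

```python
import random
import statistics

def split_rows(rows, train_frac=0.6, val_frac=0.2, seed=0):
    """Shuffle rows once and cut them into training, validation and test sets."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_train = round(len(shuffled) * train_frac)
    n_val = round(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test

def fit_standardizer(values):
    """Learn centering and scaling parameters from the given values."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values) or 1.0  # guard against zero variance
    return mean, std

def standardize(values, mean, std):
    """Apply previously learned parameters; nothing is re-estimated here."""
    return [(v - mean) / std for v in values]

# Hypothetical single numeric predictor, for illustration only.
x = [float(i) for i in range(100)]

# Step 1: partition first.
train, val, test = split_rows(x)

# Step 2: learn the transformation on the training set only.
mean, std = fit_standardizer(train)

# Step 3: apply the same parameters to every partition.
train_z = standardize(train, mean, std)
val_z = standardize(val, mean, std)
test_z = standardize(test, mean, std)
```

Fitting the standardizer on all 100 rows before splitting would let the validation and test ranges influence training, which is the leakage described above.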
Similar questions and answers from Stack Exchange:
machine learning - What algorithms need feature scaling, beside from SVM? - Cross Validated (stackex...
How do outliers and missing values impact these classifiers? - Data Science Stack Exchange
As I already wrote in another topic (and on LinkedIn), if you want a quick first test with an ML algorithm, try Random Forest.
If you're looking for an accurate, robust, versatile and scalable algorithm with a low tendency to overfit, Random Forest can be a good starting point and benchmark. Always start with simple algorithms and a good validation strategy to evaluate and compare your models before moving to complex algorithms like Neural Networks, which might need several layers of validation (to build and fine-tune them).
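For readers who also work outside JMP, a baseline of this kind could be sketched in Python with scikit-learn as follows. The synthetic dataset from `make_classification`, the number of trees and the 5-fold setting are assumptions chosen only to keep the example self-contained; they are not recommendations from this thread.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic binary classification data, used here only as a stand-in.
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=5, random_state=0
)

# Random Forest with mostly default settings as a first benchmark.
model = RandomForestClassifier(n_estimators=200, random_state=0)

# Stratified 5-fold cross-validation keeps class proportions similar per fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")

print("accuracy per fold:", scores.round(3))
print("mean accuracy:", round(scores.mean(), 3))
```

The cross-validated score then serves as the reference that any more complex model has to beat under the same validation scheme.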
I hope this long answer will help you,
Hello
Thanks for your answer.
If I want to use K-fold cross-validation and my response type is categorical, when I select columns for Y, should I put all of my predictors and my response variable into it, or just my response variable?
Hello @maryam_nourmand,
If you're asking about using Make Validation Column with K-fold cross-validation, you can put all your predictors (X variables) in the stratification columns panel, so that the folds are homogeneous with respect to the distributions of the X variables, and your response(s) in the Y panel. JMP will order and distribute the rows between folds based on the response values so that the folds are homogeneous with respect to the response(s), which avoids biased learning and evaluation during cross-validation: Launch the Make Validation Column Platform (jmp.com)
If you have several rows corresponding to the same ID/group, preserve this data structure by putting the ID/group column(s) in the "Grouping Columns" panel.
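As a rough illustration of the two ideas (stratifying folds on a categorical response, and keeping rows of the same group together), here is a small standard-library Python sketch. The functions `stratified_folds` and `grouped_folds` and the toy data are hypothetical and written for this example; they are not JMP's algorithm, only a simple way to see what homogeneous folds and grouped folds mean.

```python
import random
from collections import Counter, defaultdict

def stratified_folds(labels, k=5, seed=0):
    """Assign each row a fold id (0..k-1) with similar class proportions per fold."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [None] * len(labels)
    offset = 0
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Deal the rows of this class round-robin, continuing from the last class.
        for pos, idx in enumerate(indices):
            folds[idx] = (offset + pos) % k
        offset += len(indices)
    return folds

def grouped_folds(group_ids, k=5, seed=0):
    """Assign whole groups to folds so rows sharing an ID stay in one fold."""
    rng = random.Random(seed)
    groups = sorted(set(group_ids))
    rng.shuffle(groups)
    fold_of_group = {g: i % k for i, g in enumerate(groups)}
    return [fold_of_group[g] for g in group_ids]

# Hypothetical categorical response: 30 "yes" rows and 70 "no" rows.
labels = ["yes"] * 30 + ["no"] * 70
folds = stratified_folds(labels, k=5)
per_fold = Counter(zip(folds, labels))
for f in range(5):
    print(f, per_fold[(f, "yes")], per_fold[(f, "no")])

# Hypothetical grouping column: 25 IDs with 4 rows each.
group_ids = [i // 4 for i in range(100)]
gfolds = grouped_folds(group_ids, k=5)
```

Each fold ends up with the same yes/no mix as the full table, and every ID appears in exactly one fold, so no group contributes rows to both the training and the held-out side of a split.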
Hope this will answer your question,