Preprocessing Features

Lu · Dec 20, 2019 04:22 AM

Hello,

I would like to preprocess the features in my data set before starting the analysis. I am looking for help with the steps to follow.

What comes first after data cleaning: outlier removal, scaling (normalization of the data), or missing data imputation?

Which order would you suggest?

Regards

Lu

dale_lehman · Dec 21, 2019 09:08 AM

I'll take a stab at this, although I think you probably need to be more specific to get better advice. I'm not sure how you are distinguishing "cleaning" data from "preprocessing" but I would advise against the steps you list. I never recommend removing outliers - at least until fairly well into an analysis when you have some understanding of why the outliers are, in fact, outliers. I also would not recommend normalization or rescaling data as a general practice (though it may become useful in particular contexts) - many JMP analysis platforms automatically normalize data where it is helpful. Similarly, imputing missing data is unnecessary in most JMP platforms as it is done automatically (usually by just checking a box to include missing). There are times you will want to impute missing values more carefully, perhaps using your own methodology (e.g., building a regression model to impute missing data), but this again will depend on the context. So, I don't recommend doing any of the things you list as an automatic step. Instead, I'd begin by graphically examining your data to make sure you understand what is being measured, what types of relationships seem to exist and are potentially important, and to ascertain whether some "preprocessing" is a good idea.
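To make the "build a regression model to impute missing data" idea concrete, here is a minimal sketch in plain Python (not JSL, and not what any JMP platform does internally). It assumes a single predictor and a straight-line fit on the complete rows; the function name `impute_by_regression` and the age and blood pressure values are made up for illustration.

```python
from statistics import mean

def impute_by_regression(x, y):
    """Fill missing y values (None) using a least-squares line fitted on complete pairs."""
    pairs = [(xi, yi) for xi, yi in zip(x, y) if yi is not None]
    mx = mean(xi for xi, _ in pairs)
    my = mean(yi for _, yi in pairs)
    sxx = sum((xi - mx) ** 2 for xi, _ in pairs)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in pairs)
    slope = sxy / sxx
    intercept = my - slope * mx
    # Keep observed values; replace only the missing ones with the fitted value.
    return [yi if yi is not None else intercept + slope * xi for xi, yi in zip(x, y)]

# Hypothetical data: systolic blood pressure is missing for one patient.
age = [20, 30, 40, 50, 60]
sbp = [110, 115, None, 125, 130]
filled = impute_by_regression(age, sbp)
print(filled)
```

Whether a straight line is a sensible imputation model depends on the data, which is one more reason to look at the plots first.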

Lu · Dec 25, 2019 09:32 AM

Thanks Dale for the response,

Indeed, it is important to examine your data graphically before preprocessing them.

I want to compare the predictive performance of several machine learning models (distance-based models such as KNN, tree-based models such as Random Forest, etc.) on my data. Distance-based models treat features as if they were on the same scale, but physiological variables such as pH and heart rate are measured on very different scales. Therefore I want to scale them by Z-normalization. Moreover, the clustering algorithm (k-means clustering) does not impute missing values automatically in JMP Pro. That's why I wanted to preprocess my data first.
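For reference, Z-normalization replaces each value by (value - mean) / standard deviation, so every feature ends up with mean 0 and standard deviation 1. A small Python sketch of that calculation follows; the helper name `z_normalize` and the pH and heart rate values are hypothetical, and the sample standard deviation is an assumption (some tools use the population version).

```python
from statistics import mean, stdev

def z_normalize(values):
    """Return (v - mean) / sd for each value, using the sample standard deviation."""
    m = mean(values)
    s = stdev(values)
    return [(v - m) / s for v in values]

# Hypothetical measurements for five patients.
ph = [7.31, 7.35, 7.40, 7.42, 7.47]
heart_rate = [58, 72, 85, 96, 110]

ph_z = z_normalize(ph)
hr_z = z_normalize(heart_rate)

# After scaling, both features are unitless and on a comparable scale.
print([round(v, 2) for v in ph_z])
print([round(v, 2) for v in hr_z])
```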

dale_lehman · Dec 25, 2019 04:25 PM

I believe standardization is done for clustering. The JMP documentation says that for KNN, each variable is scaled by its standard deviation - precisely to avoid the type of problem you are alluding to. So, I still think no additional preprocessing is required for you to do what you want - compare a number of different predictive models.
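A rough Python sketch of why scaling each variable by its standard deviation matters for nearest-neighbor methods is shown here. It assumes plain Euclidean distance and made-up patient values (pH, heart rate); it is not JMP's implementation, and `nearest_neighbor` is a hypothetical helper.

```python
import math
from statistics import stdev

def nearest_neighbor(rows, query_index, scale=None):
    """Index of the row closest to rows[query_index] by Euclidean distance.

    If scale is given, each column difference is divided by scale[j] first.
    """
    n_cols = len(rows[0])
    scale = scale or [1.0] * n_cols
    q = rows[query_index]
    best, best_d = None, math.inf
    for i, r in enumerate(rows):
        if i == query_index:
            continue
        d = math.sqrt(sum(((r[j] - q[j]) / scale[j]) ** 2 for j in range(n_cols)))
        if d < best_d:
            best, best_d = i, d
    return best

# Hypothetical patients: (pH, heart rate)
patients = [(7.40, 70), (7.10, 72), (7.41, 90), (7.38, 110)]
sds = [stdev(col) for col in zip(*patients)]  # one standard deviation per column

raw_nn = nearest_neighbor(patients, 0)                # heart rate dominates the distance
scaled_nn = nearest_neighbor(patients, 0, scale=sds)  # pH now carries comparable weight
print(raw_nn, scaled_nn)
```

On the raw values, the patient with pH 7.10 looks closest to the first patient because the heart rates are similar; after dividing by the standard deviations, the patient with pH 7.41 is closest. Subtracting the mean as well would not change these distances, since the means cancel in the differences.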
