FR60
Level IV

Data leakage

Hi,

I have read several posts on data leakage between the test and training groups. The common suggestions to avoid leakage are:

 

1. Split the database into training and test sets before any normalization, standardization, or missing value imputation.

 

2. In the case of a time series, avoid a random split. Decide on a time cut point and use it to separate the training and test sets (see the sketch below).
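
For illustration, here is a minimal Python/scikit-learn sketch of both points. The data frame, file name, cut date, and the "target" and "date" columns are all hypothetical:

# Sketch only: split FIRST, then fit preprocessing on the training data.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

df = pd.read_csv("data.csv")                        # hypothetical file
X = df.drop(columns=["target", "date"])             # hypothetical column names
y = df["target"]

# 1. Split before any normalization; fit the scaler on training rows only.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
scaler = MinMaxScaler().fit(X_train)                # min/max from training data only
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)                 # test set reuses the training min/max

# 2. For a time series, split on a time cut point instead of at random.
cut = pd.Timestamp("2020-01-01")                    # hypothetical cut point
train, test = df[df["date"] < cut], df[df["date"] >= cut]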

 

Any comments, suggestions, or useful readings?

 

Felice

11 REPLIES
FR60
Level IV

Re: Data leakage

Hi Diedrich,

Thank you very much for your comment.

Regarding training-test leakage, the suggestion reported in several blogs is to split the data before doing any preprocessing. For example, if we want to normalize all predictors between 0 and 1, we should first split the data, then calculate the min/max using only the training data, and then use those values to normalize the validation and test data as well. The same applies to other similar kinds of data preparation, such as z-standardization or missing value imputation. Of course, if cross-validation is used, then a more elaborate calculation needs to be implemented, following the same methodology within each fold.
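
To make the cross-validation remark concrete: outside JMP, one common way to get this per-fold refitting right automatically is a scikit-learn Pipeline. A minimal sketch, using synthetic data purely for illustration:

# Sketch: a Pipeline refits the imputer and scaler inside each CV fold,
# so nothing learned from a held-out fold leaks into the preprocessing.
from sklearn.datasets import make_classification
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=200, random_state=0)  # synthetic example data
pipe = make_pipeline(
    SimpleImputer(strategy="median"),   # imputation fit on the training folds only
    MinMaxScaler(),                     # min/max fit on the training folds only
    LogisticRegression(max_iter=1000),
)
print(cross_val_score(pipe, X, y, cv=5))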

Now let's move to the question. In your message you wrote:

 "I can see now why it might be important to standardize/normalize the data after the split, makes sense. It's my understanding JMP typically does this behind the scenes during calculations anyway and then just re-scales it for when it's made into a report"

Does this mean that when using JMP we don't need to worry about training-test leakage, because it is avoided internally by JMP's algorithms?

If so, that would be great news for me.

Can you please confirm?

 

Thanks, Felice

SDF1
Super User

Re: Data leakage

Hi @FR60,

 

  Unfortunately, I can't confirm that JMP does that for every platform or under every circumstance.

 

  I do know that in several modeling platforms in JMP, when you get the estimates report, there is often one for scaled estimates. You can also see this in the Fit Y by X platform. There, if you fit a line, a polynomial, or some other function to the data, JMP returns a formula in which it centers the fit function. So, if the fit is y = a*x + b*x^2 + c, the function JMP actually returns is y = a*(x - x0) + b*(x - x1)^2 + c, where x0 and x1 are constants. It does this unless you explicitly tell it not to.
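
Just to illustrate the centering idea outside JMP (a generic sketch with made-up data, not JMP's internal code):

# Sketch: when x is far from zero, the raw powers x and x^2 are nearly
# collinear, so the design matrix is badly conditioned; centering fixes this.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(100, 110, 50)
y = 2.0 * x + 0.5 * x**2 + rng.normal(0, 1, x.size)

x0 = x.mean()
print(np.linalg.cond(np.vander(x, 3)))        # huge condition number (raw x)
print(np.linalg.cond(np.vander(x - x0, 3)))   # far smaller (centered x)
coef_centered = np.polyfit(x - x0, y, 2)      # coefficients of the centered fit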

 

  It's my understanding that JMP does it this way because the estimates for a, b, and c are more accurate when the data are centered than when they are not. I wouldn't be surprised if JMP does this in every platform, but I don't know. JMP also encodes all the data when generating a DOE: even if you enter your real values, on the back end the program uses coded values like -1, 0, and 1 for the low, mid, and high settings. It doesn't actually use the real numbers you might put in.
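
For the coding point, the usual textbook transformation from real units to coded -1..+1 units looks something like this (a generic sketch, not JMP's internal code):

# Sketch: map a factor's real range [low, high] onto coded units [-1, +1],
# the convention DOE calculations typically use behind the scenes.
def to_coded(value, low, high):
    center = (low + high) / 2.0
    half_range = (high - low) / 2.0
    return (value - center) / half_range

print(to_coded(100, 100, 200))   # -1.0 (low setting)
print(to_coded(150, 100, 200))   #  0.0 (mid setting)
print(to_coded(200, 100, 200))   #  1.0 (high setting)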

 

  To get full confirmation on the topic, I'd write to a JMP staff person directly, or talk with the Systems Engineer who supports your area. It would be good to know whether this is a platform-dependent thing or done across the board in JMP.

 

Thanks!

DS