Hi
I have read several posts on data leakage between the training and test sets. The common suggestions for avoiding leakage are:
1 Split the data into training and test sets before any normalization, standardization, or missing value imputation.
2 For time series, avoid a random split. Pick a time cut point and use it to separate the training and test sets.
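As a rough illustration of the two suggestions above, here is a minimal Python sketch using only the standard library. The row layout (`t`, `x`), the helper names, and the cutoff value are all made up for the example. The point is that the imputation mean and the scaling parameters are learned from the training rows only and then reused unchanged on the test rows.

```python
from statistics import mean, pstdev

def time_split(rows, cutoff):
    """Rows with time < cutoff go to training, the rest to test."""
    train = [r for r in rows if r["t"] < cutoff]
    test = [r for r in rows if r["t"] >= cutoff]
    return train, test

def fit_scaler(train, key):
    """Learn imputation and scaling parameters from training rows only."""
    observed = [r[key] for r in train if r[key] is not None]
    mu = mean(observed)
    sigma = pstdev(observed) or 1.0  # guard against zero spread
    return mu, sigma

def apply_scaler(rows, key, mu, sigma):
    """Impute missing values with the training mean, then standardize."""
    return [((r[key] if r[key] is not None else mu) - mu) / sigma for r in rows]

# Hypothetical data: t is a time index, x a measurement with gaps.
rows = [
    {"t": 1, "x": 10.0}, {"t": 2, "x": 12.0}, {"t": 3, "x": None},
    {"t": 4, "x": 14.0}, {"t": 5, "x": 30.0}, {"t": 6, "x": None},
]
train, test = time_split(rows, cutoff=5)
mu, sigma = fit_scaler(train, "x")          # test rows are never looked at here
train_z = apply_scaler(train, "x", mu, sigma)
test_z = apply_scaler(test, "x", mu, sigma)  # same parameters, no refitting
```

Had the mean been computed on all six rows, the unusually high test value at t = 5 would have shifted the scaling of the training data, which is the kind of contamination the suggestions are meant to avoid.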
Any comments, suggestions, or useful readings?
Felice
My 2 cents:
If you have time series data, I tend to think your item 2 is the bigger issue, unless you are doing a lot of interpolation. Consider the sources of noise in your data. If any of them span multiple rows, or if rows are linked to the rows before and after (for example, monitoring a tank level over time, when it takes more than one row for the tank to reach a new steady state), then you cannot split the data by row and expect the validation or test sets to be independent of the training set.
Instead you might consider splitting the data into testing and validation in larger chunks, for example:
In either case you might consider excluding transition periods between groups.
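One possible way to picture chunk-wise splitting with excluded transition periods is the sketch below. It is only an assumption about how this could be set up, not a description of any JMP feature: the function name `block_split` and the parameters `block_size`, `test_every`, and `gap` are hypothetical.

```python
def block_split(n_rows, block_size, test_every, gap):
    """Assign each row index to 'train', 'test', or 'excluded'.

    Rows are grouped into contiguous blocks of block_size rows, and
    every test_every-th block becomes test data. Any training row
    within `gap` rows of a test row is excluded as a transition period.
    """
    labels = []
    for i in range(n_rows):
        block = i // block_size
        is_test = block % test_every == test_every - 1
        labels.append("test" if is_test else "train")

    test_idx = [i for i, lab in enumerate(labels) if lab == "test"]
    for i in test_idx:
        for j in range(max(0, i - gap), min(n_rows, i + gap + 1)):
            if labels[j] == "train":
                labels[j] = "excluded"
    return labels

# 20 rows, blocks of 5, every 2nd block held out, 1-row buffer on each side.
labels = block_split(n_rows=20, block_size=5, test_every=2, gap=1)
```

With these settings rows 5 to 9 and 15 to 19 are held out, and rows 4, 10, and 14 are dropped as buffers. The right `gap` depends on how long the process takes to settle, which is process knowledge rather than something the code can decide.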
My personal experience: since starting to use at least one of these methods, validation and testing group statistics represent much more reasonable expectations for the predictive power of models.
Here's my limited experience.
Modeling is quite complex (choosing the right model, data preparation, data splitting, meta data tuning ...).
A good, robust model is independent of the splitting, the meta data, the model type ...
Both the training and test sets need to be representative of the process, but the test set doesn't need to be as comprehensive. A training/test split is necessary to judge the result, as @ih already mentioned.
For splitting, the stratify option can be used to control the split. But for both modeling and splitting, you need process know-how to do it well.
For complex models, and when the result is critical, I try different models, different meta data, and different validation columns. When there is no major difference between them, there is no issue. This is quite easy to do in JMP (Pro).
In the end, only this test tells you whether there is leakage. But honestly, I wouldn't mind leakage if I could be sure of having a good model.
I wonder if we are confusing two things: leakage and overfitting. I think they are quite different. Overfitting is always a potential problem with predictive modeling, and the usual way to prevent it is to have validation, and if you have enough data, also a separate test data set. From my experience, overfitting is less of a problem with JMP than it might otherwise be, since a number of the other tuning parameters also help prevent it. For example, with decision trees, setting a relatively high minimum split size will help avoid overfitting. Still it is a good idea to have separate test data to evaluate whether your model is capable of producing realistic predictions on new data. I think this is especially important for time series data.
Leakage is an entirely different issue, in my view. Leakage is a mis-specified model. You are using factors that would only be known at the same time as what you are trying to predict. For example, if you are trying to predict customer retention and one of your factors is this year's purchases, then you will only have data for retained customers and it will be missing for those you don't retain. In that case, the factor of this year's purchases will perfectly predict customer retention, but that is not a useful model. That is what I believe leakage is.
I don't think leakage is prevented or detected by having different training/validation/test data sets. You may detect leakage in any of those 3: a perfect model (or nearly so) on the training data would be enough to suspect there might be leakage. I think the best prevention of leakage is careful consideration of the factors and understanding how and when they are measured. If a model appears too good to be true, then it might be due to leakage (or, it could just be a very good model). But I don't think it is a question answered by any technical devices.
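To make the customer retention example concrete, here is a small Python sketch with invented numbers. The column names and the threshold of 100 are hypothetical. It shows that a rule based on the leaky factor scores perfectly on every subset of the data, so comparing training and test performance does not reveal the problem.

```python
# Invented data: this_year_purchases is missing exactly when the
# customer was not retained, which is what makes it a leaky factor.
customers = [
    {"last_year_purchases": 120.0, "this_year_purchases": 80.0, "retained": True},
    {"last_year_purchases": 40.0,  "this_year_purchases": None, "retained": False},
    {"last_year_purchases": 200.0, "this_year_purchases": None, "retained": False},
    {"last_year_purchases": 60.0,  "this_year_purchases": 15.0, "retained": True},
    {"last_year_purchases": 150.0, "this_year_purchases": 95.0, "retained": True},
    {"last_year_purchases": 30.0,  "this_year_purchases": None, "retained": False},
]

def accuracy(rows, predict):
    """Fraction of rows where the rule matches the retained flag."""
    return sum(predict(r) == r["retained"] for r in rows) / len(rows)

def leaky_rule(row):
    # Uses information only known once retention is already decided.
    return row["this_year_purchases"] is not None

def honest_rule(row):
    # Uses only information available at prediction time.
    return row["last_year_purchases"] >= 100.0

first_half, second_half = customers[:3], customers[3:]
leaky_scores = [accuracy(part, leaky_rule) for part in (first_half, second_half)]
honest_score = accuracy(customers, honest_rule)
```

The leaky rule is perfect on both halves, while the honest rule is not. Nothing in the split flags the leaky factor. Only knowing when each column is measured does.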
I am interested to hear what others have to say about this.
Dear @dale_lehman , yes, you are right. I mixed them up, at least formally, because I was thinking only about the overall goal of having a good and robust model. Avoiding leakage is defined as the aim of not having already-seen data in the test set. I found some additional explanation here:
But how do you do this intentionally, when we usually divide the training and test sets randomly? I tend to think in terms of process models, where the training and test sets both come from the same process. For forecasting, it may be different.
The article you link to is very confused about data leakage. It is mixing up the idea of differential performance on training and validation/test data (which is overfitting) with lack of a clear boundary between training and test data (which is one way to think of leakage). The two concepts are, I believe, distinct. When your model fits the training data and not the test data, that is an example of overfitting. Leakage would not account for that - in fact, if there is leakage, it would appear in the training data as well. Further, if the test data is randomly selected, then leakage should occur in the training data whenever it is in the test data (and vice versa). It results from having informative features in the data that are being used to predict a response, but which could not be known when the prediction is being made.
Things are a bit less clear with time series data, since the test data would not be a random selection of the entire data set (it might be a random time period). Most time series models will use past data to forecast the future - leakage would be if you used later data to forecast earlier data. Overfitting would be using a time period to forecast a subset of the time period used to build the model - i.e., not keeping the time periods separate. Even if overfitting and leakage are avoided in a time series model, there is an assumption that the time period used to build the model is sufficiently similar to the period you are trying to forecast. If the periods are not similar (e.g., if forces change that make the model no longer relevant), then I would say the model is not very good, but I wouldn't call that leakage and probably wouldn't call that overfitting either.
Hi @FR60 ,
My thoughts on the specific points you bring up:
1. If you are going to standardize and/or normalize the data, I think doing it before splitting the data set would be best. As for missing value imputation, you'll want to be careful about that. Sometimes missing data is informative and sometimes not. Also, the way in which JMP imputes the values might not be appropriate for your specific data. It might impute the missing data incorrectly and introduce an artificial behavior in the data that will influence the model incorrectly.
2. For the time series, if you're really using a time series to forecast and predict future events in time, I don't think you'd be splitting the data set -- at least not with the time series platforms in JMP, they do not have a Validation column in those platforms. But, if you wanted to have an independent data set to compare different forecasting models on to see which one best fits future, unused data, then yes, I'd split the data set at some point and use that to generate forecast models and save those formulas to the data table and see how well they predict the unused data. You could maybe use a simple Fit Y by X and look at the R^2 value to see which model predicts the outcome best, or look at the residuals between actual and modeled data to see which has the smallest sum of residuals or something like that.
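A minimal sketch of comparing forecasts on unused data might look like the following. It is plain Python rather than JMP, the series is invented, and the two forecasting rules (repeat the last value, extend the average step) are stand-ins for whatever models are actually being compared. It uses the sum of squared residuals rather than a plain sum, since positive and negative residuals would otherwise cancel.

```python
def sse(actual, predicted):
    """Sum of squared residuals between actual and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted))

def r_squared(actual, predicted):
    """1 - SSE/SST, computed against the mean of the actual values."""
    mean_a = sum(actual) / len(actual)
    sst = sum((a - mean_a) ** 2 for a in actual)
    return 1.0 - sse(actual, predicted) / sst

# Invented series, cut at a time point: the last three values are unused data.
series = [10.0, 12.0, 14.0, 16.0, 18.0, 20.0, 22.0, 24.0]
history, holdout = series[:5], series[5:]

# Stand-in model 1: repeat the last observed value.
naive = [history[-1]] * len(holdout)

# Stand-in model 2: extend the average step seen in the history.
step = (history[-1] - history[0]) / (len(history) - 1)
drift = [history[-1] + step * (k + 1) for k in range(len(holdout))]

scores = {
    "naive": (sse(holdout, naive), r_squared(holdout, naive)),
    "drift": (sse(holdout, drift), r_squared(holdout, drift)),
}
```

Both forecasts are built from `history` only, so the comparison on `holdout` reflects how each would have done on genuinely future values. Note that R^2 on held-out data can go negative when a forecast is worse than simply predicting the holdout mean.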
In general, though, what I do is split the data table 80/20 into training and test data sets. I then separate out the test data into a completely different data table so that the models I build have no chance of using the test data. This is more than simply "hiding" and "excluding" data and eliminates any possibility of an algorithm accidentally using the test data. With the "training" set data table, I then either create a validation column or just use the validation portion option of a platform, again splitting it 80/20 into training/validation in order to reduce overfitting. After generating multiple different models with different platforms, I then use the test data set to see which model best predicts the unused/unseen data.
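Outside of JMP, the nested 80/20 split described above could be sketched roughly as follows. The helper name `split` and the seeds are arbitrary choices for the example, and the integers simply stand in for table rows.

```python
import random

def split(rows, frac, seed):
    """Shuffle a copy of rows and return (first frac of them, the remainder)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(len(shuffled) * frac))
    return shuffled[:cut], shuffled[cut:]

rows = list(range(100))  # stand-in for 100 table rows

# First split: set the test rows aside completely.
development, test = split(rows, 0.8, seed=1)

# Second split: only the remaining rows are divided into training/validation.
train, validation = split(development, 0.8, seed=2)
```

Because the second split only ever sees `development`, the test rows cannot end up in the training or validation sets, which mirrors moving them to a separate data table.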
Hope this helps,
DS
I found this tutorial on Kaggle very interesting.
Again, it stresses that you should not apply any normalization or missing value imputation to the data before splitting it into training and test sets.
https://www.kaggle.com/alexisbcook/data-leakage
Ciao Felice
Another post on data leakage during data preprocessing.
https://machinelearningmastery.com/data-preparation-without-data-leakage/
Hi @FR60 ,
I can see where some confusion might have come in with regard to this. There are two kinds of splits and leakage that you're referring to, as when you referenced the article at Kaggle, and also @dale_lehman commented on this as well.
One split is making sure that the data used to train the model doesn't also include data known only at the outcome. That would be the "target leakage" Kaggle is talking about. The other is the "train-test contamination", where you want to make sure the test data doesn't find its way into the training or validation sets. The second one was the one I was referring to more -- my data sets typically don't contain columns that are known after the predictor variables, so "target leakage" is not something I commonly have to worry about, whereas "train-test contamination" is.
For JMP, maybe it's easier to think about the data splits as either a row-wise split == "train-test contamination" vs a column-wise split == "target leakage". Depending on your data structure, you might be doing only one or both.
I can see now why it might be important to standardize/normalize the data after the split; that makes sense. It's my understanding that JMP typically does this behind the scenes during calculations anyway and then just re-scales the results for the report.
Coming back to your original post, I guess it depends on how your data needs to be split and how it's structured -- is it structured vertically in time or horizontally? This might determine whether you do a row-wise or column-wise split for the time data. But I would apply similar logic when deciding where to cut the time series. If there is a point in time where outcome results might get fed into the model as predictors, then you definitely need to cut before then. I'm not sure how you'd do this in the right way for the "train-test contamination" issue. If you're trying to do forecasting or time-series analysis, JMP actually works pretty well using all the data for that. You get to decide how many periods to predict forward based on historical time data, and the output has confidence intervals and so forth.
Good discussion!
DS