What inspired this wish list request?
When building complex models on a large number of collinear or correlated variables, several approaches are available. One of them is to transform the variables into principal components and then build a model on these components, which avoids collinearity issues, reduces dimensionality, and keeps a large part of the initial information.
A common problem in predictive modeling is data leakage: information from the validation and/or test set is used during pre-processing and/or model building.
This can happen in the PCA platform if validation and test rows are not hidden and excluded: if the PCA is run on the whole dataset and models are then built on the principal components of the training rows, the models still contain some information from the validation and test sets. Model evaluation can then be biased, because performance is assessed too optimistically.
What is the improvement you would like to see?
As in other platforms (see Fit Model below), I would like a validation role in the PCA platform launch dialog, so that the PCA is computed on the training rows only and then applied to the validation and test rows.
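To make the requested behavior concrete, here is a minimal sketch outside JMP, in Python with NumPy. It is not how JMP implements PCA; the function names `fit_pca` and `apply_pca`, the simulated data, and the 0/1/2 coding of the validation column are all assumptions made for illustration. The columns are standardized with training statistics, which is analogous to a PCA on correlations.

```python
import numpy as np

def fit_pca(X_train, n_components):
    """Estimate centering, scaling and loadings from the training rows only."""
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0, ddof=1)
    Z = (X_train - mean) / std
    # Rows of Vt are the principal directions of the standardized training data.
    _, _, Vt = np.linalg.svd(Z, full_matrices=False)
    return mean, std, Vt[:n_components].T

def apply_pca(X, mean, std, loadings):
    """Project any rows using parameters that were estimated on training rows."""
    return ((X - mean) / std) @ loadings

# Hypothetical data: 5 correlated columns driven by 2 latent factors.
rng = np.random.default_rng(0)
latent = rng.normal(size=(100, 2))
X = latent @ rng.normal(size=(2, 5)) + 0.1 * rng.normal(size=(100, 5))

# Hypothetical validation column: 0 = training, 1 = validation, 2 = test.
validation = np.repeat([0, 1, 2], [60, 20, 20])

# The PCA is fitted on training rows, then scores are computed for all rows.
mean, std, loadings = fit_pca(X[validation == 0], n_components=2)
scores = apply_pca(X, mean, std, loadings)
```

With this split, changing the validation or test rows has no effect on the loadings, which is the property a validation role in the PCA platform would guarantee.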
Why is this idea important?
A validation role for PCA would make users more aware of this data leakage risk when they build predictive models. Splitting the dataset into training, validation, and test sets before any transformation or modeling is also a best practice, so a validation role would encourage users to follow it. Finally, it would make PCA easier to use reliably as a pre-processing step for predictive models, instead of excluding rows manually or by script.