Hi @mjz5448,
As stated by @statman, Predictor Screening is a nice data mining tool to explore possibly important factors based on historical data, but it won't replace a proper study done with DoE and clear objectives.
Predictor Screening is a method based on the Random Forest algorithm that helps you identify important variables for your response(s) in your historical data, ranked by their calculated feature importance. Depending on the representativeness of your data, the coverage of the experimental space, missing values, outliers, interactions, and correlations or multicollinearity between your factors, this platform may have shortcomings and many unknowns, and it shouldn't be trusted blindly.
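To make the mechanics a bit more concrete, here is a minimal sketch of a Random-Forest-based importance ranking outside JMP (assuming Python with scikit-learn and made-up column names; this only illustrates the general idea, not JMP's exact implementation):

```python
# Minimal sketch (assumed Python/scikit-learn, synthetic data): a Random Forest
# ranks predictors by how much each one reduces prediction error across its trees.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)
n = 500
# Hypothetical factors: X1 drives the response, X2 weakly, X3 is pure noise.
X = pd.DataFrame({
    "X1": rng.normal(size=n),
    "X2": rng.normal(size=n),
    "X3": rng.normal(size=n),
})
y = 3.0 * X["X1"] + 0.5 * X["X2"] + rng.normal(scale=0.5, size=n)

rf = RandomForestRegressor(n_estimators=300, random_state=1).fit(X, y)
importance = pd.Series(rf.feature_importances_, index=X.columns).sort_values(ascending=False)
print(importance)  # expected: X1 >> X2 > X3, the kind of ranking Predictor Screening reports
```

With that picture in mind, here are the main caveats to keep in mind when using the platform on historical data: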
- The insights you can get from this analysis are limited to the historical experimental space, which might be too narrow and/or not representative enough for a good understanding of the system. You can see this as Production experimental space (historical data) vs. R&D (DoE): in production, only small variations of the factors are allowed (and some are fixed, so you won't be able to determine their importance) in order to create robust, good-quality products, whereas in R&D the objective is different: explore an experimental space and optimize the factors of a system, using larger ranges for the factors considered.
- The Predictor screening platform doesn't give you a model (unlike the Bootstrap Forest available in JMP Pro), only feature/factor importances, so you don't have access to individual trees and branches, which could help identify interactions between factors.
- Since you don't know the quality of the underlying model (through R², RMSE or other metrics relevant to your objectives), you don't know how much variability it explains (or how accurate it is), nor whether the feature importance calculations are adequate for your problem (the first sketch after this list shows the kind of quality metric that is missing).
- Since all factors are ranked, the threshold between important and unimportant factors is not trivial to set. To help with this, you can add a random factor (a column of random values): real factors ranked at or below it are unlikely to matter. An example of this technique is shown here with the Diamonds dataset from JMP, with Price as the response, adding a random factor and omitting Carat Weight as a possible input (since it is a very strong predictor, I wanted an example where it is hard to sort out the other predictors); a code sketch of the same trick is shown after this list:
- This platform doesn't give you any indication of possible correlations or multicollinearity between factors. You can use the Multivariate platform to explore correlations between inputs, between outputs, and between inputs and outputs. You can also fit auxiliary regression models (each factor regressed on the others) to calculate VIF scores and spot multicollinearity; a sketch of this calculation is included after this list.
- High-cardinality categorical factors may also bias the feature importance calculations. The higher the cardinality (number of levels), the stronger the bias: such a factor splits the data into many small groups that the trees can fit almost by chance, so the model tends to overfit it and assign it an inflated importance instead of generalizing.
Here is an example with the same dataset and same conditions, but adding a random categorical factor with 11 levels (from A to K, named "Shuffle[Column10]"). This random categorical feature with high cardinality is ranked higher than the random numerical variable seen before (the last sketch after this list reproduces the same effect on synthetic data):
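To illustrate the random-factor cutoff (and the kind of model quality metric the platform hides), here is a minimal sketch assuming Python with scikit-learn on synthetic data, not the actual Diamonds table; the factor names are made up:

```python
# Sketch of the random-factor cutoff (assumed Python/scikit-learn, synthetic data):
# factors ranked near or below the random column are unlikely to be important.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(7)
n = 1000
X = pd.DataFrame({
    "FactorA": rng.normal(size=n),
    "FactorB": rng.normal(size=n),
    "FactorC": rng.normal(size=n),   # has no effect on the response
})
y = 2.0 * X["FactorA"] + 1.0 * X["FactorB"] + rng.normal(size=n)

X["Random"] = rng.normal(size=n)     # the random benchmark factor

rf = RandomForestRegressor(n_estimators=500, oob_score=True, random_state=7).fit(X, y)
ranking = pd.Series(rf.feature_importances_, index=X.columns).sort_values(ascending=False)
print(ranking)        # expected: FactorA > FactorB, with FactorC landing near "Random"
print(rf.oob_score_)  # out-of-bag R²: the quality check Predictor Screening does not report
```

Any real factor that cannot clearly beat the random column should not drive your factor selection on its own.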
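For the multicollinearity point, here is a sketch of the VIF calculation (assuming Python with statsmodels and illustrative factor names): each factor is regressed on the others, and its VIF is 1 / (1 - R²) of that auxiliary regression.

```python
# Sketch (assumed Python/statsmodels, synthetic data): VIF per factor from auxiliary regressions.
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(3)
n = 300
temperature = rng.normal(size=n)
factors = pd.DataFrame({
    "Temperature": temperature,
    "Pressure":    0.9 * temperature + 0.1 * rng.normal(size=n),  # nearly collinear with Temperature
    "Speed":       rng.normal(size=n),                            # independent factor
})

Xc = sm.add_constant(factors)  # add an intercept before computing VIFs
vif = pd.Series(
    [variance_inflation_factor(Xc.values, i) for i in range(1, Xc.shape[1])],
    index=factors.columns,
)
print(vif)  # common rule of thumb: VIF above ~5-10 signals problematic multicollinearity
```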
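Finally, the cardinality bias can be reproduced on synthetic data as well (again a sketch assuming Python with scikit-learn, with integer-coded random categories rather than JMP's native categorical handling): a pure-noise factor with many levels tends to receive a higher impurity-based importance than a pure-noise factor with few levels.

```python
# Sketch (assumed Python/scikit-learn, synthetic data): impurity-based importances
# tend to be inflated for noise factors that offer many possible splits.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(5)
n = 1000
X = pd.DataFrame({
    "Signal":         rng.normal(size=n),           # the only real predictor
    "NoiseBinary":    rng.integers(0, 2, size=n),   # 2-level random factor
    "NoiseElevenLvl": rng.integers(0, 11, size=n),  # 11-level random factor, integer-coded
})
y = 2.0 * X["Signal"] + rng.normal(size=n)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=5)
rf = RandomForestRegressor(n_estimators=500, random_state=5).fit(X_train, y_train)

print(pd.Series(rf.feature_importances_, index=X.columns).sort_values(ascending=False))
# Typical outcome: NoiseElevenLvl ranks above NoiseBinary although both are pure noise.

# One common cross-check: permutation importance on held-out data largely removes this bias.
pi = permutation_importance(rf, X_test, y_test, n_repeats=10, random_state=5)
print(pd.Series(pi.importances_mean, index=X.columns))  # both noise factors drop to near zero
```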
DoE has a broader scope than an exploratory data analysis/data mining technique; it is more of an end-to-end methodology: from defining your objectives, responses and factors in order to create the most informative dataset (while staying within your experimental budget), to analyzing the results and refining the model so that you know how to proceed next or reach your objectives. Always use domain expertise (together with other tools and visualizations, if you have historical data) to guide the selection of factors for your DoE.
I would encourage you to use the Augment Design platform if you have historical data that is relevant for your study (in terms of factors): you can always expand or reduce the factors' ranges if needed, and you have a lot of flexibility in how to augment your initial historical dataset into a larger design. As a good practice (and particularly relevant when augmenting historical data into a DoE), make sure to check the "Group new runs into separate block" option in this platform. Using Augment Design can help you save some experimental budget in the initial screening phase and reallocate the saved runs to the optimization phase (by augmenting the design a second time with a Space-Filling method in a narrower experimental space, or by defining a more precise model with higher-order terms on only the relevant factors).
I hope this complementary answer will help you understand the differences,
Victor GUILLER
"It is not unusual for a well-designed experiment to analyze itself" (Box, Hunter and Hunter)