Sep 21, 2012 12:02 PM
| Last Modified: Jul 12, 2019 6:55 AM
Several clients of Predictum, our analytical solutions company, have asked us to help them evaluate the capabilities of JMP Pro on their own data and analytical challenges, and we've done so with excellent results.
We’ve seen that the advanced modelling capabilities in JMP Pro give insights over and above those available from conventional modelling methods, especially when there is a risk of overfitting or when the predictors are multicollinear. JMP Pro also offers several useful tools for formatting data in a way that makes subsequent analysis efficient. To describe the advantages of JMP Pro, a brief discussion of these statistical and data formatting issues is needed.
Overfitting and multicollinearity are two common problems with big data sets. Overfitting means that a model fits the noise in the data used to build it and therefore predicts new data poorly; the risk is greatest when a data set has fewer observations than predictors and when the model is not checked with cross-validation. Multicollinearity occurs when two or more predictors are correlated with each other. Both overfitting and multicollinearity prevent adequate estimation of coefficients in linear least squares regression, which is the most common predictive modelling technique used in traditional statistics.
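To make the multicollinearity problem concrete, here is a minimal Python sketch (not JMP code) that computes the variance inflation factor (VIF) for a pair of predictors. With only two predictors, the VIF of each is 1 / (1 - r²), where r is their correlation; a large value means the least squares coefficient is poorly determined. The variable names and data values are made up for illustration.

```python
import math

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def variance_inflation_factor(x1, x2):
    """VIF for a model with exactly two predictors: 1 / (1 - r^2)."""
    r = pearson_r(x1, x2)
    return 1.0 / (1.0 - r * r)

# Hypothetical process measurements (illustrative values only).
temperature = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
pressure = [1.1, 1.9, 3.2, 3.9, 5.1, 6.0]   # tracks temperature closely
flow_rate = [3.0, 1.0, 4.0, 1.0, 5.0, 2.0]  # only weakly related

vif_collinear = variance_inflation_factor(temperature, pressure)
vif_independent = variance_inflation_factor(temperature, flow_rate)

print(f"VIF, temperature vs pressure:  {vif_collinear:.1f}")
print(f"VIF, temperature vs flow rate: {vif_independent:.2f}")
```

The nearly collinear pair produces a VIF in the hundreds, while the weakly related pair stays close to 1, the value for uncorrelated predictors.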
JMP Pro offers many tools for addressing overfitting and multicollinearity. Partial least squares (PLS) regression is an excellent predictive modelling technique that overcomes overfitting and multicollinearity, especially when the predictors are continuous. Some of the most useful features of PLS regression, including a choice of validation methods as well as provisions for imputing missing data, are available only in JMP Pro.
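As a rough illustration of why PLS copes with these conditions, the following Python sketch fits a PLS model with a single latent factor to a data set that has more predictors than observations and perfectly collinear columns. It is a simplified, one-response, one-factor version of the algorithm written for this post, not the JMP Pro implementation, and the data are invented.

```python
import math

def pls1_one_component(X, y):
    """Fit a one-factor PLS regression for a single response.

    X is a list of rows, y a list of responses. Returns the predictor
    means, the response mean and the regression coefficients.
    """
    n, p = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    Xc = [[row[j] - x_mean[j] for j in range(p)] for row in X]
    yc = [v - y_mean for v in y]

    # Weight vector: direction in predictor space with maximal covariance with y.
    w = [sum(Xc[i][j] * yc[i] for i in range(n)) for j in range(p)]
    norm = math.sqrt(sum(v * v for v in w))
    if norm == 0.0:
        raise ValueError("predictors have no covariance with the response")
    w = [v / norm for v in w]

    # Scores: each observation projected onto the weight vector.
    t = [sum(Xc[i][j] * w[j] for j in range(p)) for i in range(n)]

    # Regress the response on the single score vector.
    q = sum(ti * yi for ti, yi in zip(t, yc)) / sum(ti * ti for ti in t)
    coef = [q * wj for wj in w]
    return x_mean, y_mean, coef

def predict(model, row):
    """Predict the response for one new row of predictor values."""
    x_mean, y_mean, coef = model
    return y_mean + sum(c * (v - m) for c, v, m in zip(coef, row, x_mean))

# Illustrative data: 4 observations, 6 predictors that are all multiples
# of one underlying variable z, so ordinary least squares has no unique solution.
z = [1.0, 2.0, 3.0, 4.0]
scales = [1.0, 2.0, 0.5, 3.0, 1.5, 2.5]
X = [[s * zi for s in scales] for zi in z]
y = [2.0 * zi + 1.0 for zi in z]

model = pls1_one_component(X, y)
predicted = predict(model, [s * 5.0 for s in scales])
print(f"prediction for z = 5: {predicted:.3f}")
```

Because the six predictors are compressed into one score before any regression is done, the fit stays stable even though the predictor matrix is rank deficient. In practice the number of factors is a tuning choice, which is where validation comes in.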
Boosted trees and bootstrap forests are good predictive modelling techniques when the predictors are categorical, and they can be used to counter overfitting and multicollinearity. Both of these advanced techniques are also available only in JMP Pro. Neural networks are another good predictive modelling technique that, used with validation, can guard against overfitting. The neural platform is found in regular JMP, but JMP Pro adds many advanced features and more flexibility for building advanced neural networks.
For partial least squares regression and neural networks, a key feature called validation deserves some further mention. Validation is a method of honest assessment, which allows you to measure and compare the predictive ability of models built using various techniques. It involves splitting the data set into partitions (e.g., a training set and a testing set), fitting the model with the training set, predicting the responses in the testing set and measuring the predictive ability of the model.
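The split, fit, predict and measure steps can be sketched in a few lines of Python. This is an illustration only: the simple straight-line model, the 25% test fraction, the seed and the data are assumptions chosen for the example, not anything prescribed by JMP Pro.

```python
import math
import random

def split_rows(n_rows, test_fraction, seed):
    """Randomly partition row indices into a training set and a testing set."""
    rng = random.Random(seed)
    indices = list(range(n_rows))
    rng.shuffle(indices)
    n_test = max(1, int(round(n_rows * test_fraction)))
    return indices[n_test:], indices[:n_test]

def fit_line(xs, ys):
    """Least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    return slope, mean_y - slope * mean_x

def rmse(model, xs, ys):
    """Root mean squared prediction error of the fitted line on (xs, ys)."""
    slope, intercept = model
    squared = [(y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys)]
    return math.sqrt(sum(squared) / len(squared))

# Illustrative data: a linear trend plus a small alternating disturbance.
xs = [float(i) for i in range(20)]
ys = [3.0 + 2.0 * x + (0.5 if i % 2 == 0 else -0.5) for i, x in enumerate(xs)]

train_idx, test_idx = split_rows(len(xs), test_fraction=0.25, seed=1)
model = fit_line([xs[i] for i in train_idx], [ys[i] for i in train_idx])

train_rmse = rmse(model, [xs[i] for i in train_idx], [ys[i] for i in train_idx])
test_rmse = rmse(model, [xs[i] for i in test_idx], [ys[i] for i in test_idx])
print(f"training RMSE: {train_rmse:.3f}, testing RMSE: {test_rmse:.3f}")
```

The testing RMSE is the honest figure, because those rows played no part in estimating the slope and intercept.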
Different variations of validation are available, including K-fold cross-validation, which splits a data set into K partitions and uses each partition in turn as the testing set while combining the remaining partitions into the training set. The predictive accuracy of the model is then the average of the predictive accuracies measured on the K testing sets. Another popular and effective validation strategy is to assign observations to training, validation and test sets. Fixing the validation role for each observation creates a level playing field for comparing different modelling methods.
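A minimal Python sketch of K-fold cross-validation follows. The folds are built once and reused for both candidate models, which is the same idea as fixing each observation's validation role. The two models (a constant mean and a straight line), the interleaved fold assignment and the data are assumptions made for illustration; real analyses typically shuffle the rows before assigning folds.

```python
import math

def kfold_indices(n_rows, k):
    """Deal row indices into k folds so every row is tested exactly once."""
    return [list(range(start, n_rows, k)) for start in range(k)]

def fit_mean(xs, ys):
    """Baseline model: always predict the training mean."""
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y

def fit_line(xs, ys):
    """Least squares straight line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    return lambda x: mean_y + slope * (x - mean_x)

def rmse(predict, xs, ys):
    """Root mean squared error of a prediction function on (xs, ys)."""
    return math.sqrt(sum((y - predict(x)) ** 2 for x, y in zip(xs, ys)) / len(xs))

def cross_validate(fit, xs, ys, folds):
    """Average testing RMSE over the folds, refitting the model each time."""
    scores = []
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in range(len(xs)) if i not in held_out]
        predict = fit([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        scores.append(rmse(predict, [xs[i] for i in test_idx], [ys[i] for i in test_idx]))
    return sum(scores) / len(scores)

# Illustrative data: a linear trend plus a small alternating disturbance.
xs = [float(i) for i in range(12)]
ys = [1.0 + 0.5 * x + (0.2 if i % 2 == 0 else -0.2) for i, x in enumerate(xs)]

folds = kfold_indices(len(xs), k=4)          # same folds for every model
cv_mean = cross_validate(fit_mean, xs, ys, folds)
cv_line = cross_validate(fit_line, xs, ys, folds)
print(f"4-fold RMSE, mean model: {cv_mean:.3f}")
print(f"4-fold RMSE, line model: {cv_line:.3f}")
```

Because both models are scored on identical partitions, the difference in cross-validated RMSE reflects the models rather than the luck of the split.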
Validation is a valuable technique for assessing the value of a model, and some or all of the validation-related features are available only in JMP Pro.
We have used the advanced capabilities in JMP Pro for a variety of applications, including semiconductor fabs, chemical operations and call-center performance. We have institutionalized these methods with JMP Pro and, in some cases, combined regular JMP with SAS on a server doing the heavy analytical work. Using the PLS platform in JMP Pro, we helped one customer decide which facility layout lent itself to the best identification of problem tools. We have also used these tools to study test and yield data to identify combinations of process settings and tool configurations that were problematic. Our clients are very pleased with the results.