Level: Intermediate
Philip Ramsey, Principal Lecturer, University of New Hampshire; and Owner, North Haven Group
Chris Gotwalt, JMP Director of Statistical Research and Development, SAS
Statistical modeling has two distinct goals: explanation and prediction. Explanatory models often predict poorly (Shmueli, 2010). Analyses of designed experiments (DOE) are often explanatory, yet the experimental goals are predictive. DOE is a best practice for product and process development, where the aim is to predict future performance. Predictive modeling requires partitioning the data into training and validation sets, with the validation set used to assess predictive models. Most DOEs have too few observations to form a validation set, which precludes direct assessment of prediction performance. We demonstrate a “balanced auto-validation” technique that uses the original data to create two copies of that data, one a training set and the other a validation set. The sets differ in their row weightings. The weights are Dirichlet distributed and “balanced”: observations that contribute more to the training set contribute less to the validation set, and vice versa. The approach requires only copying the data and creating a formula column for the weights. The technique is general, allowing one to apply predictive modeling techniques to the smaller data sets common in laboratory and manufacturing studies. Two biopharma process development case studies are used for demonstration. Both cases pair a definitive screening design with a large validation set. JMP is used to demonstrate the analyses.
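The weighting idea can be sketched outside JMP in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the presenters' JMP formula column: it assumes the paired weights are generated from a single uniform draw u per row as -log(u) for training and -log(1 - u) for validation, which makes each set of weights exponentially distributed (so their normalized values are Dirichlet distributed) and makes a row that counts heavily in training count lightly in validation. The function names, the synthetic 13-run three-level data, and the weighted least squares model are all hypothetical.

```python
import numpy as np

def balanced_weights(n_rows, seed=0):
    """Return (train, validation) row weights that are anti-correlated.

    Assumption: one uniform draw u per row; -log(u) and -log(1 - u) are both
    Exp(1), and normalizing either vector gives Dirichlet(1, ..., 1) weights.
    """
    rng = np.random.default_rng(seed)
    u = np.clip(rng.uniform(size=n_rows), 1e-12, 1 - 1e-12)  # avoid log(0)
    w_train = -np.log(u)
    w_valid = -np.log1p(-u)  # equals -log(1 - u)
    return w_train, w_valid

def weighted_fit_and_score(X, y, w_train, w_valid):
    """Fit weighted least squares on the training weights, then return the
    coefficients and the validation-weighted mean squared error."""
    X1 = np.column_stack([np.ones(len(y)), X])  # add an intercept column
    sw = np.sqrt(w_train)
    beta, *_ = np.linalg.lstsq(X1 * sw[:, None], y * sw, rcond=None)
    resid = y - X1 @ beta
    return beta, np.sum(w_valid * resid**2) / np.sum(w_valid)

# Synthetic stand-in for a small designed experiment (hypothetical data).
rng = np.random.default_rng(1)
X = rng.choice([-1.0, 0.0, 1.0], size=(13, 3))
y = 2.0 + 1.5 * X[:, 0] + rng.normal(scale=0.1, size=13)

w_train, w_valid = balanced_weights(len(y), seed=42)
beta, val_mse = weighted_fit_and_score(X, y, w_train, w_valid)
print(beta, val_mse)
```

In practice one would presumably redraw the weights many times (different seeds) and summarize which terms or models perform well across the draws, rather than rely on a single weighting.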