All Models Are Wrong, But Simulation Helps Identify the Best of the Bunch (US 2018 145)

Level: Intermediate
Robert Anderson, JMP Senior Statistical Consultant, SAS



Correctly identifying the best possible model and determining which factors are genuinely important are vital tasks, but they are never easy. Holdback validation is often used to suppress overfitting and to avoid including non-genuine terms in a model. It is not foolproof, however, especially with small data sets: the model you obtain often depends on how the rows are assigned to the training and validation sets, so a single validation column cannot be relied on to point to the “best” model. When many different validation columns are used, a clearer picture starts to emerge. Using the Simulate function in JMP Pro and some simulated data sets, this presentation will demonstrate how refitting models with multiple validation columns allows the most frequently occurring and most likely model to be identified. It will also demonstrate that this approach works even for data sets with as few as 30 rows.
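
The idea of refitting under many validation columns and tallying the selected models can be illustrated outside JMP. The following is a minimal Python sketch, not the JMP Pro Simulate workflow itself. It rests on several assumptions that the abstract does not specify: the data are simulated with 30 rows and five candidate factors, of which only the first two are active; each "validation column" is a random 20/10 training/validation split; and the model chosen for a split is the subset of factors with the smallest validation sum of squared errors. All function names (`solve`, `fit_ols`, `validation_sse`, `best_model_for_split`, `run_simulation`) are hypothetical.

```python
import itertools
import random
from collections import Counter


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def fit_ols(rows, y, terms):
    """Least-squares coefficients (intercept first) for the chosen factors."""
    design = [[1.0] + [row[t] for t in terms] for row in rows]
    p = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(p)] for i in range(p)]
    xty = [sum(d[i] * yi for d, yi in zip(design, y)) for i in range(p)]
    return solve(xtx, xty)


def validation_sse(rows, y, terms, coef):
    """Sum of squared prediction errors on the held-back rows."""
    pred = [coef[0] + sum(c * row[t] for c, t in zip(coef[1:], terms)) for row in rows]
    return sum((yi - pi) ** 2 for yi, pi in zip(y, pred))


def best_model_for_split(x, y, train, valid, n_factors):
    """Return the factor subset with the lowest validation SSE for one split."""
    xt, yt = [x[i] for i in train], [y[i] for i in train]
    xv, yv = [x[i] for i in valid], [y[i] for i in valid]
    best = None
    for k in range(n_factors + 1):
        for terms in itertools.combinations(range(n_factors), k):
            sse = validation_sse(xv, yv, terms, fit_ols(xt, yt, terms))
            if best is None or sse < best[0]:
                best = (sse, terms)
    return best[1]


def run_simulation(n_splits=200, n_rows=30, n_factors=5, n_valid=10, seed=1):
    """Tally which model is selected under many random validation columns."""
    rng = random.Random(seed)
    # Assumed simulated data: only factors 0 and 1 are genuinely active.
    x = [[rng.gauss(0, 1) for _ in range(n_factors)] for _ in range(n_rows)]
    y = [3 * row[0] + 2 * row[1] + rng.gauss(0, 1) for row in x]
    tally = Counter()
    for _ in range(n_splits):
        idx = list(range(n_rows))
        rng.shuffle(idx)  # one shuffle plays the role of one validation column
        valid, train = idx[:n_valid], idx[n_valid:]
        tally[best_model_for_split(x, y, train, valid, n_factors)] += 1
    return tally


if __name__ == "__main__":
    for model, count in run_simulation().most_common(5):
        print(model, count)
```

Running the script prints the most frequently selected factor subsets with their counts. Any single split may favor a model with spurious extra terms, so the point of the tally is to look at how often each model recurs across splits instead of trusting one split. The exhaustive subset search is used here only because five factors make it cheap; a stepwise or penalized fit could be substituted.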