frankderuyck
Level VI

Validation column

I am working on a logistic regression in JMP Pro. Using the validation column option, the data set is split into fixed training and validation parts (I used no test set); I understand that holdback validation is used? Is there also a possibility to choose cross-validation?


Re: Validation column

Yes, the validation column is a way to define hold out sets for training, validation, and testing.

 

Using hold out sets is a form of cross-validation. If you mean another way of defining hold out sets, such as K-fold cross-validation in the case of small data sets, that is not available in the Nominal Logistic or Ordinal Logistic platforms. It is available in the Model Launch outline once you launch Generalized Regression.
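JMP builds the K-fold assignments internally, but the idea behind a K-fold validation column is easy to see in a language-neutral sketch. Here is a minimal pure-Python illustration (function names and the seed are mine, not anything from JMP): each row gets a fold label, and each fold takes one turn as the validation set while the rest train.

```python
# Sketch of 5-fold cross-validation assignments, analogous in spirit to a
# K-fold validation column in JMP (pure Python, illustrative only).
import random

def kfold_assignments(n_rows, k=5, seed=42):
    """Assign each row a fold label 0..k-1, then shuffle the labels."""
    folds = [i % k for i in range(n_rows)]
    random.Random(seed).shuffle(folds)
    return folds

def kfold_splits(folds, k=5):
    """Yield (train_indices, validation_indices), one pair per fold."""
    for fold in range(k):
        train = [i for i, f in enumerate(folds) if f != fold]
        valid = [i for i, f in enumerate(folds) if f == fold]
        yield train, valid

folds = kfold_assignments(100, k=5)
for train, valid in kfold_splits(folds):
    # every row is validated exactly once across the 5 folds
    assert len(train) == 80 and len(valid) == 20
```

Because every row serves in validation exactly once, K-fold uses all the data for both fitting and assessment, which is why it suits small data sets.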

frankderuyck
Level VI

Re: Validation column

Hi Marc, I think there is some confusion; let me check:

I understood that cross-validation is like K-means & jackknife, where all data are used in the validation process and contribute to building the model through internal cyclic validation. In this case a test set is necessary to check model performance on new data.

Holdback validation sets aside a fraction of the data set to validate model performance. Since the held-back data don't contribute to building the model, a test set is not required here, right?
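The holdback idea can be sketched the same way: a validation column simply labels each row, and the modeling platform fits only on the "Training" rows. This is a minimal pure-Python illustration (the function name and fraction are mine, chosen for the example):

```python
# Sketch of a fixed holdback validation column: a per-row label, with the
# validation fraction held out of model fitting (pure Python, illustrative only).
import random

def holdback_column(n_rows, validation_fraction=0.25, seed=1):
    """Return a shuffled list of 'Training'/'Validation' labels, one per row."""
    n_valid = round(n_rows * validation_fraction)
    labels = ["Validation"] * n_valid + ["Training"] * (n_rows - n_valid)
    random.Random(seed).shuffle(labels)
    return labels

col = holdback_column(200, validation_fraction=0.25)
```

The "Validation" rows never touch the fitting step, which is what makes them an honest check on the model.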

Re: Validation column

I know that there is confusion. K-Means is an unsupervised learning method to fit clusters. The jackknife is a technique to estimate a standard error independent of the model. Honest assessment is an approach to select and evaluate among candidate models in lieu of future observations. Cross-validation is generally used for honest assessment. It is generally accomplished either by holding out subsets of data (the large data set case) or by K-fold cross-validation (the small data set case).

 

See Hastie, Trevor, Robert Tibshirani, and Jerome Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Second Edition, Springer, Section 7.10: Cross-Validation.
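To make the jackknife distinction concrete: it recomputes a statistic leaving out one observation at a time and uses the spread of those leave-one-out values to estimate a standard error. A minimal pure-Python sketch (function names are mine; for the sample mean, this reproduces the classic s/√n):

```python
# Sketch of the jackknife standard-error estimate: recompute the statistic
# with each observation left out in turn (pure Python, illustrative only).
import math

def jackknife_se(data, statistic):
    """Jackknife standard error of `statistic` over `data`."""
    n = len(data)
    loo = [statistic(data[:i] + data[i + 1:]) for i in range(n)]
    mean_loo = sum(loo) / n
    return math.sqrt((n - 1) / n * sum((v - mean_loo) ** 2 for v in loo))

mean = lambda xs: sum(xs) / len(xs)
se = jackknife_se([2.0, 4.0, 6.0, 8.0], mean)  # matches s/sqrt(n) for the mean
```

Note the contrast with cross-validation: the jackknife estimates uncertainty of a statistic, while cross-validation estimates predictive performance of a model.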

frankderuyck
Level VI

Re: Validation column

In my reply above I meant K-fold, not K-means.

So I went to Generalized Regression, made a validation column, and selected K-fold with 5 folds. Which estimation method should I use for my nominal logistic fit (three continuous factors): Lasso, Elastic Net..? Is there a rule of thumb for selecting the estimation method?

Re: Validation column

The options are largely a matter of personal preference. Lasso is for variable selection. Elastic Net combines the Lasso and Ridge penalties, so it is also used for variable selection. Ridge is for shrinking estimates to avoid over-fitting.

 

The Penalized Estimation Methods are documented in JMP Help.
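For readers who want the distinction between the three penalties at a glance, here is a small sketch of the penalty terms themselves, using the usual elastic-net parameterization (lambda for overall penalty strength, alpha for the Lasso/Ridge mix; the function names and the example coefficients are mine, not JMP's):

```python
# Sketch of the three penalty terms behind the estimation methods discussed
# above (pure Python, illustrative only; beta is a coefficient vector).
def ridge_penalty(beta, lam):
    # L2 penalty: shrinks all estimates toward zero, none exactly zero
    return lam * sum(b * b for b in beta)

def lasso_penalty(beta, lam):
    # L1 penalty: can shrink some coefficients exactly to zero,
    # which is why Lasso performs variable selection
    return lam * sum(abs(b) for b in beta)

def elastic_net_penalty(beta, lam, alpha=0.5):
    # Mixes Lasso (alpha = 1) and Ridge (alpha = 0) penalties
    return lam * (alpha * sum(abs(b) for b in beta)
                  + (1 - alpha) * sum(b * b for b in beta) / 2)

beta = [1.0, -2.0, 0.5]  # hypothetical fitted coefficients
```

The fitting algorithm minimizes the model's negative log-likelihood plus one of these terms; the L1 part is what drives coefficients exactly to zero, which is the "variable selection" behavior mentioned above.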