Thanks for the example. Let's look at it closely, from a couple of perspectives. Both should lead to the same conclusion: there is no gold in the data.
First perspective: split the data into training and validation sets. Judging from the result of your script, one should not attempt to use the fitted model.
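If you want to try that first perspective outside JMP, here is a minimal Python sketch of the idea. The file name "example.csv" and the response column "y" are made up, since the original data is not shown here; this is just an illustration of holding out a validation set and checking the validation AUC.

```python
# Minimal sketch of the train/validation check; "example.csv", "y", and the
# predictor columns are placeholders for the original (unshown) data.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

df = pd.read_csv("example.csv")
X, y = df.drop(columns="y"), df["y"]

# Hold out 30% of the rows as a validation set, stratified on the response.
X_tr, X_va, y_tr, y_va = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y
)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# If this hovers around 0.5, the fitted model is no better than guessing
# on data it has not seen, and should not be used.
print("validation AUC:", roc_auc_score(y_va, model.predict_proba(X_va)[:, 1]))
```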
Second perspective: just fit the good old logistic regression without a train/validation split, and see whether there are warning signs.
First warning sign: the Whole Model Test, which says there is a good chance the model is no better than random guessing.
Second warning sign: the Effect Tests and parameter significance. Here are the Effect Tests. Only two show up red. If those two are the gold, the rest is rubble, and rubble makes us think it is gold. Shall we remove it?
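As a rough analogue of this second perspective outside JMP, both warning signs can be read off a plain logistic regression fit. Here is a hedged Python sketch using statsmodels, with the same made-up file and column names as above; the whole-model likelihood-ratio test plays the role of JMP's Whole Model Test, and the per-term p-values play the role of the Effect Tests.

```python
# Sketch of the second perspective: fit the plain logistic regression and look
# at the same two warning signs. File and column names are placeholders.
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("example.csv")
X = sm.add_constant(df.drop(columns="y"))
fit = sm.Logit(df["y"], X).fit(disp=0)

# First warning sign: the whole-model likelihood-ratio test. A large p-value
# means the full model may be no better than random guessing.
print("whole-model LR test p-value:", fit.llr_pvalue)

# Second warning sign: per-term p-values, the analogue of JMP's Effect Tests.
print(fit.pvalues.sort_values())
```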
Ok, now what comes out the other end of the sluice box is the following Whole Model Test and Effect Tests. Pay attention to these p-values: they are around 0.05 and 0.10, on the fence by usual standards. Should we keep them? If we remove them, we are back to random guessing. Let's pretend we really want to keep them, and see how it goes.
How about ROC and AUC?
So AUC is 0.59. Gold? I would guess it is on the fence by usual standards as well: the model fit says the model is on the fence, so everything derived from it should be too. Need proof? We can get some. We can bootstrap the AUC value, if you have JMP Pro. Right-click on the AUC value and select Bootstrap. I gave it 10000 draws, and 0.5 falls within the range of the bootstrap samples. So it really comes down to how much one wants to believe that those two remaining factors are gold.
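If you do not have JMP Pro, the same bootstrap idea can be sketched in a few lines of Python: resample the rows with replacement, refit, recompute AUC, and repeat. The 10000 draws mirror the number used above; the file and column names are still placeholders, and this is only a rough stand-in for what JMP's Bootstrap does when it recomputes the report on each draw.

```python
# Sketch of bootstrapping AUC without JMP Pro.
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

df = pd.read_csv("example.csv")
X, y = df.drop(columns="y"), df["y"]
rng = np.random.default_rng(1)

aucs = []
for _ in range(10000):
    idx = rng.integers(0, len(df), len(df))      # bootstrap resample of rows
    Xb, yb = X.iloc[idx], y.iloc[idx]
    if yb.nunique() < 2:                         # skip degenerate resamples
        continue
    m = LogisticRegression(max_iter=1000).fit(Xb, yb)
    aucs.append(roc_auc_score(yb, m.predict_proba(Xb)[:, 1]))

# If 0.5 sits inside (or near) this interval, the AUC is "on the fence".
print("bootstrap 95% interval for AUC:", np.percentile(aucs, [2.5, 97.5]))
```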
What I have done is put a bootstrap confidence interval around AUC. We could make confidence intervals around the ROC curve as well, and then we might see that the nice half shell is not that impressive at all. That would take a lot of effort, though. Let's just look at one point on the ROC instead: the point corresponding to the 0.5 cutoff threshold. Here are the steps.
First, turn on the Confusion Matrix option.
Look at the highlighted numbers; they are the True Positive Rate and False Positive Rate corresponding to the 0.5 cutoff threshold.
We can find this point on the ROC curve, marked with a cross in the following screenshot.
Now bootstrap those highlighted numbers. I get this table, in which column "0" is the bootstrapped False Positive Rate and column "1" is the bootstrapped True Positive Rate.
Plot column "1" vs. column "0" and overlay it on the ROC curve. Here is what I get; I also added a diagonal line as a reference.
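For readers outside JMP, here is a hedged Python sketch of the same one-point ROC bootstrap: bootstrap the 0.5-cutoff True Positive Rate and False Positive Rate, then overlay the cloud of points on the ROC curve with a diagonal reference line. File and column names are placeholders as before, and refitting on each resample is only an approximation of what JMP's Bootstrap does.

```python
# Sketch of the one-point ROC bootstrap at the 0.5 cutoff.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve

df = pd.read_csv("example.csv")
X, y = df.drop(columns="y"), df["y"]
model = LogisticRegression(max_iter=1000).fit(X, y)
fpr, tpr, _ = roc_curve(y, model.predict_proba(X)[:, 1])

rng = np.random.default_rng(1)
points = []
for _ in range(10000):
    idx = rng.integers(0, len(df), len(df))      # bootstrap resample of rows
    Xb, yb = X.iloc[idx], y.iloc[idx]
    if yb.nunique() < 2:
        continue
    mb = LogisticRegression(max_iter=1000).fit(Xb, yb)
    pred = (mb.predict_proba(Xb)[:, 1] >= 0.5).astype(int)
    yv = yb.to_numpy()
    points.append((pred[yv == 0].mean(),         # False Positive Rate ("0")
                   pred[yv == 1].mean()))        # True Positive Rate ("1")

points = np.array(points)
plt.plot(fpr, tpr, label="ROC")
plt.scatter(points[:, 0], points[:, 1], s=2, alpha=0.1, label="0.5-cutoff bootstrap")
plt.plot([0, 1], [0, 1], "k--", label="diagonal reference")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.legend()
plt.show()
```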
What the bootstrap sample says is that the 0.5-threshold point on the ROC has a good chance of sitting above the diagonal line, but there is a non-negligible chance it falls below it. It is on the fence, as we should have concluded already from the model.
So what I want to conclude from this example is that we should really nail down the model before concluding that the model overfits. Including random noise as effects will make it overfit in this case.