Hi @ih,
You are both certainly right that I cannot get around my own thoughts and subjective influence on the process and decisions, which makes the entirety of it somewhat subjective and prone to bias. It would be nice to remove as much of that as possible, so as to limit any one person's bias on the matter at hand. After all, we're after the best-performing model, not what someone wants or wishes to be the best-performing model.
It's also true that I don't solely use R2 as a metric for the models I generate. It is, however, one of the more reliable metrics for driving the tuning process of some of the model fits, for example with boosted trees and bootstrap forest. Both of those methods have some randomness involved in the process, which is good, but fine-tuning the parameters of the model can get rather unwieldy -- it's very easy to generate large tuning tables since there are several parameters, making the hyperparameter space very large. To help that process a little, the Gaussian Fit platform can reduce the parameter space over iterative tuning fits, and R2 works as a very effective stand-in metric with that platform.
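For anyone who wants to see the general idea outside of JMP, here is a rough Python sketch of a surrogate-assisted search that uses validation R2 as the objective. It uses scikit-learn on synthetic data, and the hyperparameter names and ranges are just placeholders -- this is the same spirit as driving tuning through a Gaussian process, not the Gaussian Fit platform itself.

```python
# Hedged sketch: surrogate-assisted tuning of a boosted tree model, using
# validation R^2 as the objective. Illustrative only -- not the JMP workflow.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

def val_r2(params):
    """Fit a boosted tree with the given hyperparameters, return validation R^2."""
    lr, depth, n_est = params
    model = GradientBoostingRegressor(
        learning_rate=lr, max_depth=int(depth), n_estimators=int(n_est), random_state=0
    )
    model.fit(X_tr, y_tr)
    return model.score(X_val, y_val)  # .score() is R^2 for regressors

# Sample a modest set of hyperparameter combinations instead of a huge tuning table.
samples = np.column_stack([
    rng.uniform(0.01, 0.3, 30),   # learning_rate (placeholder range)
    rng.integers(2, 8, 30),       # max_depth
    rng.integers(50, 500, 30),    # n_estimators
])
scores = np.array([val_r2(p) for p in samples])

# Fit a Gaussian-process surrogate of R^2 over the sampled points and use it to
# suggest where the next round of fits should concentrate. In practice you would
# scale each hyperparameter to [0, 1] before fitting the GP.
gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
gp.fit(samples, scores)
candidates = np.column_stack([
    rng.uniform(0.01, 0.3, 2000),
    rng.integers(2, 8, 2000),
    rng.integers(50, 500, 2000),
])
best = candidates[np.argmax(gp.predict(candidates))]
print("suggested next region:", best, "best sampled R2:", scores.max())
```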
The bootstrap option for report diagnostics is a very helpful tool that I routinely use. It also pairs well with the null factor from the autovalidation add-in to test which factors (out of a large set) truly should be kept in the model -- a sort of pre-modeling factor reduction technique.
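The null-factor idea can be mimicked in plain Python as well: append a column of pure noise, fit a tree ensemble, and keep only the real factors whose importance beats the noise column. This is just the screening principle, not the autovalidation add-in itself, and all names and thresholds here are illustrative.

```python
# Hedged sketch of null-factor screening: a factor that cannot beat pure noise
# probably does not belong in the model.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)
X, y = make_regression(n_samples=400, n_features=15, n_informative=5,
                       noise=5.0, random_state=1)

# Add a "null factor": a column that by construction carries no signal.
X_aug = np.column_stack([X, rng.normal(size=X.shape[0])])

forest = RandomForestRegressor(n_estimators=300, random_state=1)
forest.fit(X_aug, y)

importances = forest.feature_importances_
null_importance = importances[-1]          # importance of the noise column
keep = np.where(importances[:-1] > null_importance)[0]
print("factors retained after null-factor screening:", keep)
```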
As mentioned before, the models tend to perform better when the training and validation R2 approach similar values. Models at the other extreme, with high training and low validation R2, routinely perform far worse overall. Setting the desirability functions and levels (or cutoffs) can be a very subjective step and can heavily influence the outcome; what would be nicer is an algorithm that optimizes that process for you. Of course, the end decision still requires the user to make the call -- to assess whether the result has physical meaning, not just mathematical meaning.

An analogy would be spectral data, say Raman or photoelectron spectra, which often have line shapes that are Lorentzian or Gaussian (or a mix). Mathematically, it's possible to have a negative sigma (standard deviation) since sigma is squared in the function, but physically this has zero meaning, so any fit result from an algorithm that suggests a negative sigma can be discarded -- there is no physical basis for a measurement to have a negative standard deviation. Similarly, one can obtain R2 values that are negative, or validation R2 values greater than those of the training set, but those instances should be major red flags and not considered in further analysis.
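A small sketch of those red-flag checks, for what it's worth. R2 = 1 - SS_res/SS_tot, so it goes negative whenever the model predicts worse than simply using the mean; the gap thresholds below are purely illustrative, not recommended cutoffs.

```python
# Hedged sketch: compute training and validation R^2 explicitly and flag the
# cases discussed above (negative R^2, validation well above training, or a
# large train/validation gap). Thresholds are placeholders.
import numpy as np

def r2(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot; negative whenever the model is worse than
    simply predicting the mean of y_true."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot

def flag_fit(r2_train, r2_valid, gap_tol=0.05):
    """Return warnings for fits that should be discarded or re-examined."""
    flags = []
    if r2_train < 0 or r2_valid < 0:
        flags.append("negative R2: model is worse than the mean -- discard")
    if r2_valid > r2_train + gap_tol:
        flags.append("validation R2 exceeds training R2 -- likely an artifact")
    if r2_train - r2_valid > 0.2:        # illustrative overfit cutoff
        flags.append("large train/validation gap -- probable overfit")
    return flags or ["train and validation R2 are consistent"]

print(flag_fit(r2_train=0.92, r2_valid=0.88))
print(flag_fit(r2_train=0.95, r2_valid=0.55))
```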
In the end, this method is used to help narrow down the hyperparameter space so that finding a more stable global minimum is faster and easier than generating a massive tuning table with hundreds of thousands of rows to test all possible combinations. It still only provides recommendations for the tuning parameters, and one must still assess whether those findings make sense. The model is still iteratively generated under random starting conditions, but each iteration narrows in on a more stable set of parameters. This approach is also helpful with the other modeling platforms that have many tuning parameters.
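To make the "narrowing" concrete, here is a minimal zoom-in loop: sample a range, keep the best points by validation R2, shrink the range around them, and repeat. It is a stand-in for the iterative tuning described above, not the actual JMP workflow, and the ranges and round counts are arbitrary.

```python
# Hedged sketch of iterative narrowing of a hyperparameter box using
# validation R^2. Illustrative only.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=2)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=2)
rng = np.random.default_rng(2)

low, high = np.array([0.01, 2.0]), np.array([0.3, 8.0])  # learning_rate, max_depth
for round_ in range(3):                                   # a few narrowing rounds
    params = rng.uniform(low, high, size=(15, 2))
    scores = []
    for lr, depth in params:
        m = GradientBoostingRegressor(learning_rate=lr, max_depth=int(depth),
                                      n_estimators=200, random_state=0)
        m.fit(X_tr, y_tr)
        scores.append(m.score(X_val, y_val))              # validation R^2
    top = params[np.argsort(scores)[-5:]]                 # keep the 5 best points
    low, high = top.min(axis=0), top.max(axis=0)          # shrink the search box
    print(f"round {round_}: best R2 = {max(scores):.3f}, "
          f"new range = {np.round(low, 3)} .. {np.round(high, 3)}")
```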
Thanks!
DS