I'm using the Gaussian Process model to fit a few hundred computer simulation data points. The simulations are deterministic, so the response never changes if the same inputs are used. I used the space-filling Fast Flexible Filling design to generate the table. There are 7 factors: 3 are entered as categorical and the other 4 are continuous. The problem is that no matter how large I make the training design, the prediction equation is not very accurate when tested against a validation set of another couple hundred runs. The error in the training set used for the fit is less than 1%, but the error in the validation set can be as high as 4.5%. I'm looking for an error of at most 1-1.5%.
I've also tried Fit Model with a response surface model effect. After crossing the factors a couple more times, the actual vs. predicted plot becomes a near-perfect 45-degree line, but the prediction equation isn't very accurate on the validation set either. The Gaussian model is supposed to be the best choice for computer simulation data.
Does anyone know what limitations the Gaussian model has? Is my 1-1.5% error target unrealistic? The validation data set is generated in the same way, with Fast Flexible Filling. Nothing unusual stands out in the validation data. Many of the points that have 4% error have factor values that are very similar to other data points that have less than 1% error. I would think that with a smoothing fit, which is what the Gaussian model does, the equation would predict everything very well.
Also, not sure how significant it is but I'm using version 12 of standard JMP. Is the Gaussian model in JMP Pro more powerful? Are there other statistical analysis software packages that can handle Gaussian modeling better than JMP?
Thanks for any help, I appreciate it!
(Probably I'm not telling you anything you are not already aware of).
I would check out Brad's excellent talk:
By design, the Gaussian Process (GP) "interpolates the data", so the predicted value will always agree with the simulated or computed ('measured') value at each point in the factor space where there is a run. So the focus shifts away from using validation to avoid overfitting (which is about getting the right balance between 'signal' and 'noise'), to the ability (or otherwise . . . ) of the GP to reasonably interpolate at points at which you don't have runs.
On this basis, I would expect the error for the training set to be (nearly) zero, and the error for the validation set to be bigger, potentially much bigger. I also thought that the GP will only fit continuous factors, so I'm not sure how you handled the categorical ones.
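To make the interpolation point concrete, here is a minimal sketch in plain Python (not JMP). It assumes a 1-D stand-in `simulator`, a Gaussian correlation with a fixed, hand-picked `theta`, a constant mean, and no nugget; all of these names and values are illustrative and not from the original post. It only shows that the error at the design points is essentially zero while the error at new points is not.

```python
import math

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(M[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x

def gauss_corr(a, b, theta):
    """Gaussian correlation between two scalar inputs."""
    return math.exp(-theta * (a - b) ** 2)

def fit_gp(xs, ys, theta):
    """Return a predictor for a zero-nugget GP with a constant mean."""
    n = len(xs)
    R = [[gauss_corr(xs[i], xs[j], theta) for j in range(n)] for i in range(n)]
    # Generalized least squares estimate of the constant mean.
    mu = sum(solve(R, ys)) / sum(solve(R, [1.0] * n))
    w = solve(R, [y - mu for y in ys])
    def predict(x):
        return mu + sum(gauss_corr(x, xi, theta) * wi for xi, wi in zip(xs, w))
    return predict

def simulator(x):
    """Hypothetical deterministic simulator used only for this sketch."""
    return math.sin(2 * math.pi * x)

xs = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]        # training design
ys = [simulator(x) for x in xs]
x_val = [0.1, 0.3, 0.5, 0.7, 0.9]          # validation points between the runs

predict = fit_gp(xs, ys, theta=25.0)       # theta is assumed, not estimated
train_err = max(abs(predict(x) - y) for x, y in zip(xs, ys))
val_err = max(abs(predict(x) - simulator(x)) for x in x_val)
print(f"max training error:   {train_err:.2e}")
print(f"max validation error: {val_err:.2e}")
```

The training error here says nothing about prediction quality; only the validation error does, which is why a gap between the two is expected.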
To add to Ian's reply: although every point is fit perfectly by a Gaussian Process, there may still be some bias that looks a lot like random variation. You can get a sense of how much bias there is by looking at the Actual by Jackknife Predicted plot. Draw a line between the lower left and upper right corners of the plot. The tighter the points cluster around that line, the lower the bias.
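For readers who want to see what a jackknife prediction is, here is a rough sketch in plain Python (not JMP). Each run is predicted from a GP fitted to all the other runs. Holding `theta` fixed across the refits, the 1-D test function, and the helper names are all assumptions made for this illustration; the thread does not say how JMP computes the plot.

```python
import math

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(M[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x

def gp_predict(xs, ys, theta, x_new):
    """Zero-nugget GP prediction at x_new (Gaussian correlation, constant mean)."""
    n = len(xs)
    R = [[math.exp(-theta * (xs[i] - xs[j]) ** 2) for j in range(n)] for i in range(n)]
    mu = sum(solve(R, ys)) / sum(solve(R, [1.0] * n))
    w = solve(R, [y - mu for y in ys])
    return mu + sum(math.exp(-theta * (x_new - xi) ** 2) * wi for xi, wi in zip(xs, w))

def jackknife_predictions(xs, ys, theta):
    """Predict each run from a GP fitted to all the other runs (theta held fixed)."""
    preds = []
    for i in range(len(xs)):
        xs_rest = xs[:i] + xs[i + 1:]
        ys_rest = ys[:i] + ys[i + 1:]
        preds.append(gp_predict(xs_rest, ys_rest, theta, xs[i]))
    return preds

xs = [i / 7 for i in range(8)]                      # hypothetical 1-D design
ys = [math.sin(2 * math.pi * x) for x in xs]        # hypothetical simulator output
jack = jackknife_predictions(xs, ys, theta=25.0)
for y, p in zip(ys, jack):
    print(f"actual={y:+.3f}  jackknife={p:+.3f}  diff={y - p:+.3f}")
```

Unlike the ordinary predictions at the design points, these leave-one-out predictions do not reproduce the data, so their scatter around the diagonal gives a rough picture of how well the model interpolates between runs.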
Neural nets (NN) are also very good for building surrogate models for computer experiments. Try several numbers of hidden nodes using the TanH activation function. A word of caution: you will get a different result every time you run the same model unless you set a random seed. Check out the random seed reset add-in in the link below.
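The random seed point can be sketched in plain Python (this is not JMP or its add-in). A neural net fit starts from randomly drawn weights, so fixing the seed fixes the starting point. The functions `init_weights` and `predict` and the seed values are hypothetical names chosen for this illustration, and the sketch only draws the starting weights of a one-hidden-layer TanH network; it does not train anything.

```python
import math
import random

def init_weights(n_inputs, n_hidden, seed=None):
    """Draw random starting weights for a one-hidden-layer TanH network."""
    rng = random.Random(seed)  # seed=None uses system entropy, so draws vary per run
    hidden = [[rng.gauss(0, 1) for _ in range(n_inputs + 1)] for _ in range(n_hidden)]
    output = [rng.gauss(0, 1) for _ in range(n_hidden + 1)]
    return hidden, output

def predict(weights, x):
    """Forward pass: TanH hidden layer followed by a linear output node."""
    hidden, output = weights
    h = [math.tanh(w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))) for w in hidden]
    return output[0] + sum(wo * hi for wo, hi in zip(output[1:], h))

x = [0.2, -0.5, 1.0, 0.3]                      # one run with 4 continuous factors
same_a = predict(init_weights(4, 3, seed=1234), x)
same_b = predict(init_weights(4, 3, seed=1234), x)
other = predict(init_weights(4, 3, seed=999), x)
print("same seed, same result:", same_a == same_b)
print("different seed, same result:", same_a == other)
```

Recording the seed alongside the model settings is what makes a comparison between different numbers of hidden nodes repeatable.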
I have a question related to this post. I am setting up a Latin hypercube design and fitting a Gaussian Process. When I compare the Gaussian and Cubic correlation structures by checking the predicted values at the design points against the simulated values, the two are exactly the same for the Cubic correlation function, but not for the Gaussian. Is this explained by the bias described in billw@JMP's post? Or is it due to round-off error? In theory there should be perfect interpolation for both types of correlation function, or am I wrong on this one?
Any help would be highly appreciated!
I would refer you to the books section under Help for a clearer explanation of the differences between Gaussian and Cubic. Below are a few things I found there.
Note: The estimated parameters can be different due to different starting points in the minimization routine, the choice of correlation type, and the inclusion of a nugget parameter.
Correlation Type: Choose the correlation structure for the model. The platform fits a spatial correlation model to the data, where the correlation of the response between two observations decreases as the values of the independent variables become more distant.
Gaussian: restricts the correlation between two points to always be nonzero, no matter the distance between the points.
Cubic: allows the correlation between two points to be zero for points that are far enough apart. This method is a generalization of a cubic spline.
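As a rough illustration of the difference described in the quoted text, here is a sketch of the two correlation shapes in plain Python for a single factor. The Gaussian form `exp(-theta * d^2)` is standard. The piecewise cubic form below is one common parameterization and is an assumption here; check the JMP documentation for the exact formula the platform uses.

```python
import math

def gaussian_corr(d, theta):
    """Gaussian correlation: positive for every finite distance d."""
    return math.exp(-theta * d * d)

def cubic_corr(d, theta):
    """Piecewise cubic correlation (assumed form): exactly zero once |d| >= 1/theta."""
    u = abs(d) * theta
    if u <= 0.5:
        return 1 - 6 * u ** 2 + 6 * u ** 3
    if u <= 1.0:
        return 2 * (1 - u) ** 3
    return 0.0

theta = 2.0  # hypothetical value chosen for the illustration
for d in [0.0, 0.1, 0.25, 0.4, 0.5, 1.0, 3.0]:
    print(f"d={d:4.2f}  gaussian={gaussian_corr(d, theta):.6f}  cubic={cubic_corr(d, theta):.6f}")
```

With the cubic form, runs farther apart than `1/theta` have no influence on each other, whereas with the Gaussian form every run keeps some (possibly tiny) influence on every prediction.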
Thanks for the answer, billw@JMP.
I have been looking into the books in the Help section, yet the description of the correlation functions is quite brief.
I thought that, no matter what correlation structure is used, the Gaussian Process model always interpolates the design points perfectly. Apparently this is not the case for the Gaussian correlation structure, but I could not find a clear explanation for this in the Help section books... I will look into other sources to clarify this.
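One mechanism that can break exact interpolation is the nugget parameter mentioned in the documentation note quoted earlier. Whether a nugget was actually included in the fit described here is not confirmed anywhere in this thread, so treat the following plain-Python sketch as an assumption-laden illustration. It uses a zero-mean GP for brevity, a fixed `theta`, and a hypothetical test function, and compares predictions at the design points with and without a nugget added to the diagonal of the correlation matrix.

```python
import math

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(M[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x

def predict_at_design(xs, ys, theta, nugget):
    """Predictions at the design points for a zero-mean GP with Gaussian correlation."""
    n = len(xs)
    R = [[math.exp(-theta * (xs[i] - xs[j]) ** 2) for j in range(n)] for i in range(n)]
    # The nugget is added to the diagonal only when solving for the weights.
    K = [[R[i][j] + (nugget if i == j else 0.0) for j in range(n)] for i in range(n)]
    w = solve(K, ys)
    return [sum(R[i][j] * w[j] for j in range(n)) for i in range(n)]

xs = [i / 7 for i in range(8)]                    # hypothetical 1-D design
ys = [math.sin(2 * math.pi * x) for x in xs]      # hypothetical simulator output

exact = predict_at_design(xs, ys, theta=25.0, nugget=0.0)
smoothed = predict_at_design(xs, ys, theta=25.0, nugget=1e-3)
err_exact = max(abs(p - y) for p, y in zip(exact, ys))
err_smoothed = max(abs(p - y) for p, y in zip(smoothed, ys))
print(f"max error at design points, no nugget:   {err_exact:.2e}")
print(f"max error at design points, with nugget: {err_smoothed:.2e}")
```

Without a nugget the predictions reproduce the data up to round-off; with even a small nugget they no longer do, which would look like imperfect interpolation at the design points.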
Thanks for your help!