Created:
Aug 8, 2021 12:42 PM
| Last Modified: Jun 10, 2023 4:34 PM(2435 views)
I plotted my observed vs. predicted values and added a line of fit with the RMSE. However, the value JMP calculates and displays is significantly lower than what I calculate and expect (0.46 vs. 2.07).
I'm not sure what you are trying to do. In the data set you show, is column 1 the actual values and column 2 the predicted values? The differences between them are called the residuals.
I don't see what you are trying to model. Typically, you will have at least one independent variable and at least one dependent variable, and the RMSE is an estimate of the standard deviation of the residuals around the model fit. I don't know which is which in your data set. But if I model these two columns, I get completely different results from yours.
There are many ways to look at residuals, mainly to determine whether your model meets the fundamental assumptions of quantitative analysis: residuals that are normally and independently distributed with mean 0 and constant variance, NID(0, variance).
Summary of Fit
RSquare                      0.125645
RSquare Adj                  0.084009
Root Mean Square Error       1.804222
Mean of Response             5.251277
Observations (or Sum Wgts)   23
"All models are wrong, some are useful" G.E.P. Box
I noticed this interesting calculation as well when I used Python to model my experiment. I run a model, JMP gives me an RMSE, and then I save a column of the predicted values. I used the formula below to do the calculation in Excel, which gives me a different RMSE from JMP's. Could you please tell me what is wrong with my calculation?
I just realized I made a mistake by ignoring the degrees of freedom. The formula above is not right: the sum of squared residuals should be divided by the residual degrees of freedom instead of n, and then the square root taken.
The error degrees of freedom equal the number of observations (23) minus the number of parameters estimated (2). This script shows how you can calculate the RMSE and compares it to the result from the Bivariate platform. I'm curious why you calculate RMSE with Excel when JMP gives you the answer.
Names Default to Here( 1 );

// duplicate example in discussion
dt = New Table( "RMSE Example",
	Add Rows( 23 ),
	New Script(
		"Source",
		Open(
			"https://community.jmp.com/t5/Discussions/Is-JMP-miscalculating-my-RMSE/m-p/583546",
			HTML Table( 1, Column Names( 0 ), Data Starts( 1 ) )
		)
	),
	New Column( "X",
		Numeric,
		"Continuous",
		Format( "Best", 12 ),
		Set Values(
			[6.343, 6.259, 6.139, 5.266, 5.451, 6.492, 6.52, 5.784, 6.154, 6.103,
			6.122, 5.541, 5.181, 5.355, 5.674, 5.642, 5.136, 5.512, 5.9, 5.65, 5.55,
			5.6, 4.6]
		)
	),
	New Column( "Y",
		Numeric,
		"Continuous",
		Format( "Best", 12 ),
		Set Values(
			[1.607119, 6.177004, 0.429539, 6.05368, 8.40834, 5.226375, 5.440272,
			5.990638, 6.07527, 5.965431, 0.445846, 7.000725, 5.931456, 5.626706,
			5.849083, 5.859607, 5.14744, 5.693088, 5.690088, 5.792552, 5.513773,
			5.659517, 5.195817]
		)
	)
);

// calculate RMSE with Bivariate platform
obj = Bivariate( Y( :Y ), X( :X ), Fit Line( 1 ) );
bivRMSE = (obj << Report)["Summary of Fit"][NumberColBox(1)] << Get( 3 );

// calculate RMSE using direct linear regression
yData = dt:Y << Get As Matrix;
xData = dt:X << Get As Matrix;
{ estimate, se, diagnostic } = Linear Regression( yData, xData );
predY = estimate[1] + estimate[2] * xData;
df = N Row( yData ) - 2; // error df = n - 2 (intercept and slope are estimated)
manRMSE = Sqrt( Sum( (yData - predY)^2 ) / df );

// compare results
Show( bivRMSE, manRMSE );
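For anyone doing the check outside JMP, here is a rough Python sketch of the same comparison. It is not part of the script above; it reuses the X and Y values from that script, fits the line with the closed-form least-squares formulas, and contrasts dividing the sum of squared residuals by n with dividing by the error degrees of freedom (n - 2). The names `fit_line`, `rmse_over_n`, and `rmse_over_df` are made up for illustration.

```python
import math

# X and Y values copied from the JSL script above.
x = [6.343, 6.259, 6.139, 5.266, 5.451, 6.492, 6.52, 5.784, 6.154, 6.103,
     6.122, 5.541, 5.181, 5.355, 5.674, 5.642, 5.136, 5.512, 5.9, 5.65, 5.55,
     5.6, 4.6]
y = [1.607119, 6.177004, 0.429539, 6.05368, 8.40834, 5.226375, 5.440272,
     5.990638, 6.07527, 5.965431, 0.445846, 7.000725, 5.931456, 5.626706,
     5.849083, 5.859607, 5.14744, 5.693088, 5.690088, 5.792552, 5.513773,
     5.659517, 5.195817]


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x (hypothetical helper)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((xi - mean_x) ** 2 for xi in xs)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


intercept, slope = fit_line(x, y)
n = len(y)

# Sum of squared residuals around the fitted line.
sse = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))

rmse_over_n = math.sqrt(sse / n)         # divides by n (the mistaken version)
rmse_over_df = math.sqrt(sse / (n - 2))  # divides by error df = n - 2

print(f"sqrt(SSE / n)       = {rmse_over_n:.6f}")
print(f"sqrt(SSE / (n - 2)) = {rmse_over_df:.6f}")
```

The two numbers differ only by the factor sqrt(n / (n - 2)), so the gap shrinks as the number of observations grows.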
Created:
Aug 8, 2021 06:07 PM
| Last Modified: Aug 8, 2021 3:10 PM(2404 views)
| Posted in reply to message from rummeln 08-08-2021
Expanding on @statman's comments and assuming you were trying to predict the first column from the second, it looks like you fit your residuals with a second linear model. This would mean you created a type of ensemble model from whatever your first model was and a linear term. This would further reduce your error, giving you a smaller RMSE. I believe instead you likely wanted to use the Model Comparison platform under Analyze > Predictive Modeling > Model Comparison.
This reports a RASE (root average squared error) that is closer to your expected RMSE.
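To make the distinction concrete, here is a small Python sketch (not JSL) under stated assumptions: the `observed` and `predicted` values are made-up numbers, not the data from this thread. It computes (a) the error of the predictions as they stand, sqrt(SSE / n), and (b) the RMSE around a new least-squares line fitted to observed vs. predicted, which is what a fit line on that plot would summarize.

```python
import math

# Hypothetical observed values and predictions from some first model
# (made-up numbers for illustration only; not the data from this thread).
observed = [2.0, 4.1, 5.9, 8.2, 9.8, 12.1]
predicted = [3.0, 4.5, 6.0, 7.5, 9.0, 10.5]
n = len(observed)

# (a) Error of the predictions as they stand: RASE = sqrt(SSE / n).
sse_direct = sum((o - p) ** 2 for o, p in zip(observed, predicted))
rase = math.sqrt(sse_direct / n)

# (b) What a fit line on the observed-vs-predicted plot does:
# regress observed on predicted, then measure error around that new line.
mean_p = sum(predicted) / n
mean_o = sum(observed) / n
slope = (sum((p - mean_p) * (o - mean_o) for o, p in zip(observed, predicted))
         / sum((p - mean_p) ** 2 for p in predicted))
intercept = mean_o - slope * mean_p
sse_refit = sum((o - (intercept + slope * p)) ** 2
                for o, p in zip(observed, predicted))
rmse_refit = math.sqrt(sse_refit / (n - 2))  # error df = n - 2

print(f"RASE of original predictions: {rase:.4f}")
print(f"RMSE of the fitted line:      {rmse_refit:.4f}")
```

The refit sum of squares can never exceed the direct one, because the line with intercept 0 and slope 1 is one of the candidates that least squares considers. With these made-up numbers the refit RMSE also comes out lower than the RASE, the same pattern as in the original question.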
As an example, consider the iris sample data set. If you use some method to predict Sepal length and predict it to be a function of Sepal length as shown in the equation below, you will get a reasonable model.
But, the slope, or parameter estimate for Sepal width, is not quite right (or at least it does not match what is observed in the data).
Thus, you can fit a second model from the output of the first using a linear model to improve the fit (this is because there is a pattern in the residuals).
Saving the predicted formula gives a new predicted formula column based on another predicted formula column. Note how the slope of the blue line changes.
In this case you will get the same result as if you just modeled Sepal length from Petal length using a linear model, but that would not always be true depending on what was used for the first model.
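Why the results coincide when the first model is itself linear can be checked with a short Python sketch (not JSL). Everything in it is an assumption for illustration: the `x` and `y` values are made up rather than taken from the Iris table, and the first model's coefficients (4.0 and 0.2) are arbitrary "not quite right" choices.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x (hypothetical helper)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
             / sum((a - mean_x) ** 2 for a in xs))
    return mean_y - slope * mean_x, slope


# Hypothetical predictor and response (made-up values, not the Iris data).
x = [1.4, 1.5, 4.7, 4.5, 6.0, 5.1]
y = [5.1, 4.9, 7.0, 6.4, 6.3, 5.8]

# First model: a linear formula with deliberately imperfect coefficients.
first_pred = [4.0 + 0.2 * xi for xi in x]

# Second model: linear fit of y on the first model's predictions.
a2, b2 = fit_line(first_pred, y)
stacked_pred = [a2 + b2 * p for p in first_pred]

# Direct linear fit of y on x.
a1, b1 = fit_line(x, y)
direct_pred = [a1 + b1 * xi for xi in x]

# The two sets of predictions agree up to floating-point rounding.
max_gap = max(abs(s - d) for s, d in zip(stacked_pred, direct_pred))
print(f"largest difference between the two sets of predictions: {max_gap:.2e}")
```

The second fit simply rescales and shifts the first model's output, and a rescaled, shifted linear function of x is still a linear function of x. If the first model were nonlinear (a tree or a neural network, say), the stacked result would generally differ from a direct linear fit.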
Bringing up the Model Comparison shows the improvement from the second model: