lloyd_tripp
Level II

Including Pre-Defined Error in Analysis

Hi All,

 

I have limited datasets that I would like to analyze (Bivariate or Fit Model). The problem is that I'd like to include error in the values a priori, so as not to overestimate the significance of individual data points given the limited dataset.

 

I know how to include a column for the predefined error and have it reported in Graph Builder, but I'm unsure how to go about this in the analysis platforms.

 

 


peng_liu
Staff

Re: Including Pre-Defined Error in Analysis

Based on your description, here is my understanding: you have a small sample, some observations look a bit extreme, you suspect those extreme observations are due to some kind of randomness, and you would like to introduce that randomness into the existing observations.

I am not sure whether that would help you "suppress" the influence of extreme observations, but here are some ways of handling them.

  1. Consider down-weighting the extreme observations, i.e., using a lower weight on them. If all your observations are given weight 1, give the extreme ones a fraction of one.
  2. Consider using a different distribution, e.g. a t-distribution or lognormal distribution, depending on your situation.
  3. Consider the fractional weighted bootstrap (right-click the statistic you are interested in, choose the Bootstrap menu, and operate accordingly), which lets you mimic repeated sampling and see how the result varies.
  4. Consider using simulation (right-click the statistic you are interested in, choose the Simulate menu, and operate accordingly). This is probably the closest to what you have in mind: you simulate samples based on your observations, adding whatever uncertainties, and see how the result varies. However, your assumption about the introduced uncertainties then becomes a new source of influence, and I am not sure how that will impact your analysis and conclusion.
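JMP's Bootstrap menu handles the fractional-weight resampling internally; just to illustrate the idea behind option 3, here is a minimal Python sketch (the data values are hypothetical). Instead of resampling whole rows, each replicate draws continuous weights that average 1 and recomputes the weighted statistic, so no observation is ever dropped entirely:

```python
import numpy as np

rng = np.random.default_rng(42)
data = np.array([10.2, 9.8, 10.5, 14.1, 10.0, 9.7])  # hypothetical small sample

# Fractional-weight (Bayesian) bootstrap: draw Dirichlet weights summing to n
# (mean weight = 1) and compute the weighted statistic for each replicate.
n = len(data)
boot_means = []
for _ in range(5000):
    w = rng.dirichlet(np.ones(n)) * n  # fractional weights, never exactly zero
    boot_means.append(np.average(data, weights=w))
boot_means = np.array(boot_means)

# Spread of the statistic under resampling, rather than a single point value
print(np.percentile(boot_means, [2.5, 97.5]))
```

The width of that percentile interval gives a resampling-based picture of how stable the mean is, which is often more informative for a six-point sample than a single p-value.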
P_Bartell
Level VIII

Re: Including Pre-Defined Error in Analysis

Maybe I'm not interpreting your original post the way you intended, but here's how I'm comprehending your thinking: you have values for a response and/or independent variables, and you'd like to sprinkle some quantified 'error' on each of these values and THEN complete your analysis, effectively changing the values to some subsequent value? Before I offer pathways to accomplish this, I want to make sure I understand what you are trying to do before analysis. I'm not sure why you'd want to do this. Maybe you can share your thinking over and above '...so as to not overestimate the significance of individual data points...'. Essentially, isn't this making up new data? Aside from the math, is this ethical in the context of what you are trying to do?

lloyd_tripp
Level II

Re: Including Pre-Defined Error in Analysis

I guess I didn't explain what I wanted to do very well.

 

We have an instrument that we've previously characterized as giving values with a 1-sigma error of ±x units. It's very difficult for us to collect data with this instrument, so we have a limited sample size. Although the few points per group may show a difference, I worry the p-value would be deflated, since we aren't capturing the measurement error in this limited dataset.

 

Is there a way to incorporate an a priori measurement error into the dataset? Or should I not try to incorporate this error?

 

P_Bartell
Level VIII

Re: Including Pre-Defined Error in Analysis

Thanks for the additional clarification.

 

Here's my take. When you wrote, "I worry the p-value would be deflated since we aren't capturing the measurement error...", you are onto something, but keep in mind the relationship of any observed value from a measurement system to its two components of variance. You are indeed 'capturing' the measurement error in your data, because any time you observe a measurement, it has two sources of variation that are additive: variance of the product + variance of the measurement system = variance observed. So I think what you are trying to do is come up with a way to subtract the variance of the measurement system from the variance of the product and somehow estimate a mean for each response. With only single measurements this is impossible, since you can't estimate the mean of the product by simple subtraction of variances.
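That additivity is easy to see numerically. A quick Python sketch (hypothetical standard deviations, just to illustrate the decomposition): simulate a "true" product value per part, add independent measurement noise, and check that the observed variance is the sum of the two components.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical process: true part-to-part spread plus measurement noise
true_sd, meas_sd = 2.0, 1.5
product = rng.normal(100.0, true_sd, 100_000)           # variance of the product
observed = product + rng.normal(0.0, meas_sd, 100_000)  # + measurement system

# Variances of independent sources add:
# Var(observed) ~ Var(product) + Var(measurement) = 4.0 + 2.25 = 6.25
print(observed.var(), product.var() + meas_sd**2)
```

Note that each row of `observed` is a single number; nothing in that one number tells you how much of it is product and how much is measurement noise, which is why subtracting variances cannot recover a per-observation mean.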

 

You have discovered one of the basic challenges of statistical thinking and inference: the variation of a measurement system can sometimes swamp the variation in a product or process, and statistical inferences about the product/process become more problematic as the disparity between the variance of the measurement system and the variance of the product/process increases. In lay terms, it's harder for a signal (the product/process of interest) to rise above the noise (the measurement system).

 

So where do you go from here? One obvious recommendation is to bite the bullet, get replicate measurements from the system, and use the average of each response for modeling purposes; this will help beat down the measurement system variation with respect to estimating the mean of each response. But it sounds like that's not feasible? A fallback position could be to build a model with the data you have, then use bootstrapping within JMP to examine the parameter estimates and their distributions from a practical point of view. A last idea: within the JMP Prediction Profiler and its Simulation capabilities, spend some time simulating the system's behavior. There is a capability to 'add' variation to the simulation results over and above the factor variation instructions, so this can serve as a surrogate for adding measurement system variation to the mean of each response.
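To illustrate the bootstrapping idea outside of JMP: fit the model once, then resample the rows with replacement, refit, and look at the spread of the parameter estimates. A minimal Python sketch with a hypothetical six-point dataset and a straight-line fit:

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical small dataset (six points, roughly y = 2x)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.3, 11.9])

slope, intercept = np.polyfit(x, y, 1)

# Pair bootstrap: resample (x, y) rows with replacement, refit, and look at
# the spread of the slope instead of trusting one p-value from n = 6 points.
slopes = []
for _ in range(2000):
    idx = rng.integers(0, len(x), len(x))
    if len(np.unique(x[idx])) < 2:  # skip degenerate resamples (constant x)
        continue
    b, a = np.polyfit(x[idx], y[idx], 1)
    slopes.append(b)

print(slope, np.percentile(slopes, [2.5, 97.5]))
```

If that percentile interval for the slope is wide, the dataset simply doesn't pin the parameter down, no matter what a single p-value says.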

 

Hope this helps a bit?

statman
Super User

Re: Including Pre-Defined Error in Analysis

Hmmm, in reading the dilemma, I have a completely different understanding of the issue (although I may also be misunderstanding).  Here are my thoughts:

Lloyd has a situation where the data he is collecting comes from a narrow inference space and therefore does not contain a reasonable estimate of the true variability (it is not representative). Since the estimate of the variation in the study (the mean square error, MSE) is potentially deflated, the corresponding F or t values will be inflated and p-values will appear more significant than they are in reality. I want to applaud you for understanding this potential hazard. As for what you can do via JMP, I don't have great advice. I always think it is a good idea to compare the MSE of your sample data set with estimates that come from other samples of the same process (especially if those come from more representative sampling); I have always done this with my own calculations. I would guess there is a way to replace the MSE in an ANOVA with your own value, but I personally do not know how to do this in JMP. Could you add values to the sample data set that would inflate the MSE to match the more representative error estimate? Do you really need to use p-values? I might suggest you stay with graphical techniques and skip the quantitative approach.
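The "replace the MSE with your own value" calculation is easy to do by hand outside of any platform. A hedged Python sketch with hypothetical data: compare the pooled variance estimated from the sample itself against an external error estimate (e.g. a sigma from a prior MSA study), and compute the two-group test with the external sigma treated as known, which turns the t-test into a z-test:

```python
import numpy as np
from math import erfc, sqrt

# Two hypothetical small groups measured on the same instrument
a = np.array([10.1, 10.4, 10.2])
b = np.array([10.9, 11.1, 10.8])

# Pooled variance from the sample itself (what an ordinary t-test uses)
s2_pooled = (a.var(ddof=1) * (len(a) - 1) + b.var(ddof=1) * (len(b) - 1)) \
            / (len(a) + len(b) - 2)

# External, more representative error estimate (hypothetical sigma from MSA)
sigma_ext = 0.3

# Standard errors of the group-mean difference under each variance estimate
se_int = sqrt(s2_pooled * (1 / len(a) + 1 / len(b)))
se_ext = sqrt(sigma_ext**2 * (1 / len(a) + 1 / len(b)))

diff = b.mean() - a.mean()
z_ext = diff / se_ext
p_ext = erfc(abs(z_ext) / sqrt(2))  # two-sided normal p-value, sigma known

print(s2_pooled, sigma_ext**2, p_ext)
```

When the narrow sample makes `s2_pooled` smaller than the representative `sigma_ext**2`, the external-sigma p-value is larger, i.e. more conservative, which is exactly the correction statman describes.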

"All models are wrong, some are useful" G.E.P. Box

Re: Including Pre-Defined Error in Analysis

Is the issue with the data that you do not always get a measurement or that the measurement is limited? If it is the latter issue, then your data is censored. You know that the actual value is beyond a limit. The measurement in such a case might be a lower bound or an upper bound. If this description sounds like your issue, then we can help you analyze censored data with JMP. That is to say, you can include all the data, not just values that are in a range.
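For intuition on what a censored analysis does: exact readings contribute their density to the likelihood, while a reading pegged at a limit contributes only the probability of being beyond that limit. JMP's censored-fit platforms maximize this likelihood properly; here is a crude, self-contained Python sketch with hypothetical values and an assumed upper limit of 15.0, using a grid search just to show the mechanics:

```python
import numpy as np
from math import erf, sqrt, log, pi

# Hypothetical readings: values pegged at the instrument's upper limit of
# 15.0 are right-censored -- we only know the true value is >= 15.0.
exact = np.array([12.1, 13.4, 11.8, 14.2])
n_censored = 2    # readings that pegged at the limit
limit = 15.0

def norm_logpdf(x, mu, s):
    return -0.5 * ((x - mu) / s)**2 - log(s * sqrt(2 * pi))

def norm_logsf(x, mu, s):  # log P(X > x) for a normal, floored to avoid log(0)
    return log(max(1e-12, 0.5 * (1 - erf((x - mu) / (s * sqrt(2))))))

def loglik(mu, s):
    ll = sum(norm_logpdf(v, mu, s) for v in exact)  # exact observations
    ll += n_censored * norm_logsf(limit, mu, s)     # censored contributions
    return ll

# Crude grid-search MLE (a real fit would use a proper optimizer)
mus = np.linspace(10, 18, 161)
sigmas = np.linspace(0.5, 4, 71)
best = max((loglik(m, s), m, s) for m in mus for s in sigmas)
print(best[1], best[2])  # estimated mean and sd including censored points
```

The estimated mean comes out above the naive average of the exact readings, because the likelihood accounts for the two observations known to lie beyond the limit instead of discarding them.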

lloyd_tripp
Level II

Re: Including Pre-Defined Error in Analysis

We always get a measurement, and we are within the measurement range of the instrument. The issue is that it's very costly to collect even a single data point. We have done prior MSA studies, so we've always been using ±0.3 units as a back-of-the-napkin error on any data point we collect. I'd like to know if there's a more rigorous way to include this error in any analysis.