Does k means sample size estimate require a normal distribution?

Twoolman

Occasional Contributor

Joined:

Dec 27, 2016

Hi everyone.

I'm using the k means sample size estimator under DOE | Sample Size and Power.

Do the means that I submit to this function need to come from a normally distributed dataset in order to be used effectively? I don't think that is the case, especially since k means in JMP only lets me submit a maximum of 10 groups, and some of my datasets have more than that (in that situation I randomly submit groups from that dataset to the k means function after calculating that set's standard deviation).

Thanks in advance.

10 REPLIES
markbailey

Staff

Joined:

Jun 23, 2011

This sample size calculator is based on the one-way analysis of variance (ANOVA) model assuming constant variance. The response Y is modeled as a combination of a constant term (intercept or mean response in this parameterization), the X = k group effects, and the errors. (Note that the group effects sum to zero.) The errors are assumed to be normally distributed with a mean = 0 and a variance that is independent of the groups.
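Written out, the model described above takes the following form. The symbols ($\mu$, $\alpha_i$, $\varepsilon_{ij}$, $n_i$, $\sigma^2$) are illustrative labels in a common textbook notation, not taken from JMP's documentation:

```latex
Y_{ij} = \mu + \alpha_i + \varepsilon_{ij},
\qquad i = 1, \dots, k, \quad j = 1, \dots, n_i,
\qquad \sum_{i=1}^{k} \alpha_i = 0,
\qquad \varepsilon_{ij} \sim \mathcal{N}(0, \sigma^2) \ \text{independently}.
```

Here $Y_{ij}$ is observation $j$ in group $i$, $\mu$ is the constant term, $\alpha_i$ is the effect of group $i$, and $\sigma^2$ is the error variance, which does not depend on $i$.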

It doesn't matter what k is.

If the variance depends on the factor level (group), then use the Welch ANOVA.

If the errors are not normally distributed, then use a non-parametric method or resampling method.

Twoolman

Occasional Contributor

Joined:

Dec 27, 2016

Thanks for the quick reply. Can I use Welch's ANOVA as a method for telling me the minimum sample size per group within a dataset that contains multiple groups?

It might help if I step back and explain what I am doing. I have observations by surgeon for multiple types of surgeries. For a specific surgery (a dataset) I might have 10 different surgeons (10 different groups) who performed that surgery and whom I want to compare against each other. Each surgeon has performed a different number of surgeries of a given type. I'm trying to determine the minimum number of surgeries (observation events) per group (per surgeon) required to include that surgeon (group) in the analysis.

Thanks!

Tom

markbailey

Staff

Joined:

Jun 23, 2011

I think that you misunderstand the use of the sample size calculators. Their purpose is a prospective determination given a correct model (e.g., normal distribution with mean = 0 and constant variance), a known variance, a known minimum difference to detect, the level of significance (acceptable type I error rate), and the desired power (1 - acceptable type II error rate). This determination is very useful in a designed experiment.
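To make those inputs concrete, here is a minimal sketch of a prospective sample size calculation for the simplest possible case: two groups, a known common standard deviation, and a two-sided z-test. This is the textbook normal approximation, not the k means calculation that JMP performs, and the function name `n_per_group` and the example values are assumptions chosen only for illustration.

```python
import math
from statistics import NormalDist


def n_per_group(sigma, delta, alpha, power):
    """Approximate sample size per group for a two-sided, two-sample z-test.

    sigma: known common standard deviation
    delta: minimum difference between the two means worth detecting
    alpha: acceptable type I error rate
    power: desired power (1 - acceptable type II error rate)
    """
    z = NormalDist()  # standard normal distribution
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_power = z.inv_cdf(power)
    n = 2 * ((z_alpha + z_power) * sigma / delta) ** 2
    return math.ceil(n)


# Illustrative values only: detect a difference of one standard deviation.
print(n_per_group(sigma=1.0, delta=1.0, alpha=0.05, power=0.80))  # 16
print(n_per_group(sigma=1.0, delta=1.0, alpha=0.20, power=0.80))  # 10
```

Every argument has to be fixed before any data are collected, which is why the calculation belongs to the planning stage of a designed experiment.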

Your case sounds like an observational study. The surgeries happen regardless of whether anyone is studying them. Why would you pre-select groups to be in or out of your study by number of observations? Excluding a group can only decrease the power in your test (one-way ANOVA). Excluding a group can also bias your estimates, depending on the reason for the exclusion. You might exclude the most informative group this way.

If someone asked me the best way to compare groups A and B with 20 observations, I would tell them to observe 10 of each. On the other hand, if they told me that they have 17 observations of group A and 3 observations of group B, I would tell them to test the data that they have. It won't be as informative as the former design, but it is all they have. (BTW, it won't bias the estimates but it will inflate the standard errors.) If they exclude group B because of the small number of observations, what would they have left?
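The cost of the unbalanced split can be put in numbers. A minimal sketch, assuming both groups share a known standard deviation (sigma = 1 is an arbitrary illustrative value, and `se_of_difference` is a hypothetical helper name):

```python
import math


def se_of_difference(sigma, n_a, n_b):
    """Standard error of (mean A - mean B) for a common standard deviation."""
    return sigma * math.sqrt(1 / n_a + 1 / n_b)


balanced = se_of_difference(1.0, 10, 10)   # 10 observations of each group
unbalanced = se_of_difference(1.0, 17, 3)  # 17 of group A, 3 of group B

print(f"balanced:   {balanced:.3f}")               # 0.447
print(f"unbalanced: {unbalanced:.3f}")             # 0.626
print(f"ratio:      {unbalanced / balanced:.2f}")  # 1.40
```

With the same 20 observations, the 17/3 split gives a standard error about 40% larger than the 10/10 split. The estimate of the difference is still centered on the true value; it is only less precise.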

markbailey

Staff

Joined:

Jun 23, 2011

Also consider the response for each observation. The one-way ANOVA is for a measurement that uses JMP's continuous modeling type. If the data are otherwise, such as counts or categories, then none of the sample size calculators determine the correct sample size. In such cases, you can use the Generalized Linear Models platform or one of the Logistic Regression platforms from Fit Model.

Twoolman

Occasional Contributor

Joined:

Dec 27, 2016

Sorry I didn't specify some of the further details.

The alpha we're willing to accept is 0.20, with a power of 0.80. This becomes an experiment because we don't necessarily know the actual number of surgeries performed by each surgeon. This can happen when we are working with an insurance company and they only give us data for surgeons who performed surgeries at hospitals that were part of their insurance network. So a given surgeon may have done 100 tonsillectomies in 2016, but we are given only 25% of those observations.

This therefore becomes a type of experiment: under these conditions, where we are essentially getting only a random sample of events per surgeon, how do we define the minimum number of observations per surgeon?

I agree with you that when we are working with a specific hospital and have access to 100% of the surgeon data, we can simply use all observations and not be concerned about an experimental cutoff.

markbailey

Staff

Joined:

Jun 23, 2011

I am trying to help and not start a conflict based on semantics, but your analysis is still based on observational data. The fact that a surgeon's inclusion in the analysis depends on affiliation with a particular insurance network has to do with the definition of the population and the limitation of the inference from the data. Even if affiliation is a predictor/factor, experimental units (surgeons) were not randomly assigned to treatments (insurance network affiliation). An experiment is a study in which you determine (that is, establish physically) the conditions or treatments, randomly assign experimental units to treatments, and measure the response for a run.

While in some sense the surgeons included in your analysis are a random sample, they do not constitute a 'simple random sample.' Such a sample assumes that all possible samples of a given size had an equal probability of being selected from the population. That is not the case here.

Why do you need to exclude surgeons based on the number of surgeries? What is the purpose of a minimum number of surgeries?

If the number of surgeries increases, then the power of the test increases. If the number of surgeries decreases, then the power of the test decreases. If the number of surgeries is zero, then the power is zero (the type II error risk is 1).

Again, I am trying to understand and help.

Twoolman

Occasional Contributor

Joined:

Dec 27, 2016

Thanks, Mark. Your comments are extremely helpful and appreciated.

The concern we have with having a minimum number of surgeries per surgeon is that we are scoring the surgeons based on performance metrics such as readmission %, surgical site re-infection %, etc. We want to be fair and unbiased when performing that scoring analysis. It is possible that we may have 100% of Surgeon X's surgeries but only 25% of Surgeon Y's surgeries in our dataset, based on network affiliation. Would it bias our scoring results in this case? Note that the performance metrics above are all averages based on the observations that we do have.

Thanks!

markbailey

Staff

Joined:

Jun 23, 2011

Solution

Bias has to do with the omission of informative data. Is the estimate independent of or is it conditional on the nature of the omission (e.g., network affiliation)? Is the estimate independent of the insurance network affiliation status? If the omitted observations are non-informative, then your estimates will be unbiased. This issue has everything to do with your question of the analysis and the definition of the population. So if your scope includes only one insurance network and the estimates and inferences will not apply to any other insurance network, then you do not have to worry about bias.
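The difference between non-informative and informative omission can be seen in a small simulation. This is only a sketch: the 10% true readmission rate, the inclusion probabilities, and the helper name `observed_rate` are all made up for illustration and do not come from any real data.

```python
import random


def observed_rate(outcomes, keep_prob):
    """Mean of the outcomes that survive the omission step.

    keep_prob(outcome) is the probability that a record with this outcome
    (1 = readmitted, 0 = not readmitted) appears in the data we receive.
    """
    kept = [y for y in outcomes if random.random() < keep_prob(y)]
    return sum(kept) / len(kept)


random.seed(1)
true_rate = 0.10  # hypothetical true readmission rate
outcomes = [1 if random.random() < true_rate else 0 for _ in range(200_000)]

# Non-informative omission: every record has the same 25% chance of inclusion.
non_informative = observed_rate(outcomes, lambda y: 0.25)

# Informative omission: readmitted cases are half as likely to be included.
informative = observed_rate(outcomes, lambda y: 0.125 if y == 1 else 0.25)

print(f"true rate:                {true_rate:.3f}")
print(f"non-informative omission: {non_informative:.3f}")
print(f"informative omission:     {informative:.3f}")
```

When inclusion does not depend on the outcome, the observed rate stays close to the true 10% even though three quarters of the records are missing. When inclusion depends on the outcome, the observed rate settles near 5%, and collecting more records of the same kind does not repair it.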

Sample size determines the standard error of the estimates. The estimates might be of population parameters, model parameters, or future responses (scoring). The smaller groups will exhibit larger standard errors (confidence, prediction, or tolerance intervals).
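As a rough illustration of how the interval widens for smaller groups, the sketch below computes a normal-approximation (Wald) confidence interval for a proportion such as a readmission rate. The surgeons, counts, and the helper name `wald_interval` are hypothetical, and the Wald interval is a crude choice for small counts; it is used here only because it is short.

```python
import math
from statistics import NormalDist


def wald_interval(events, n, confidence=0.95):
    """Normal-approximation (Wald) interval for a proportion, clipped to [0, 1]."""
    p = events / n
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    half_width = z * math.sqrt(p * (1 - p) / n)
    return max(0.0, p - half_width), min(1.0, p + half_width)


# Hypothetical surgeons with the same observed rate (20%) but different counts.
for label, events, n in [("Surgeon X", 20, 100), ("Surgeon Y", 5, 25)]:
    low, high = wald_interval(events, n)
    print(f"{label}: {events}/{n} -> {low:.3f} to {high:.3f}")
```

Both surgeons have the same point estimate, but the interval based on 25 surgeries is twice as wide as the one based on 100, because the standard error scales with one over the square root of the number of observations.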

Bias and variance are separate attributes of the estimates although we sometimes trade some of one (small increase in bias) for the benefit of the other (large decrease in variance) for overall better accuracy.

I do not see how the omission of some groups based solely on an arbitrary criterion of sample size will introduce bias. (As I said above, some rules for omission can bias the estimates.) I know that such an omission will decrease the amount of evidence and necessarily increase the width of the intervals of the estimates. Your scoring will be less accurate.

These comments are very general and confined to the identification and collection of the data. (No less important, though.) A lot depends on the analysis method, too.

Twoolman

Occasional Contributor

Joined:

Dec 27, 2016

Thank you, Mark. That has been extremely helpful.