Does k means sample size estimate require a normal distribution?

Dec 27, 2016 6:16 PM
(4673 views)

Hi everyone.

I'm using the k means sample size estimator under DOE | Sample Size and Power.

Do the means that I submit using this function need to be from a normally distributed dataset in order to be used effectively? I don't think that is the case, especially since k means in JMP only lets me submit a maximum of 10 groups, and some of my datasets have more than that (in this situation I randomly submit groups from that dataset to the k means function after calculating that set's standard deviation).

Thanks in advance.



10 REPLIES


This sample size calculator is based on the one-way analysis of variance (ANOVA) model assuming constant variance. The response Y is modeled as a combination of a constant term (intercept or mean response in this parameterization), the X = k group effects, and the errors. (Note that the group effects sum to zero.) The errors are assumed to be normally distributed with a mean = 0 and a variance that is independent of the groups.
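Written out, the model described above takes the standard one-way ANOVA form. The symbols are a conventional choice for illustration, not JMP's own notation: $y_{ij}$ is observation $j$ in group $i$, $\mu$ is the constant term, $\alpha_i$ is the effect of group $i$, and $\sigma^2$ is the common error variance.

$$
y_{ij} = \mu + \alpha_i + \varepsilon_{ij}, \qquad \sum_{i=1}^{k} \alpha_i = 0, \qquad \varepsilon_{ij} \sim N(0, \sigma^2)
$$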

It doesn't matter what k is.

If the variance depends on the factor level (group), then use the Welch ANOVA.

If the errors are not normally distributed, then use a non-parametric method or resampling method.

Learn it once, use it forever!


Thanks for the quick reply. Can I use Welch's ANOVA as a method for telling me the minimum sample size per group within a dataset that contains multiple groups?

It might help if I step back and explain what I am doing. I have observations by surgeon for multiple types of surgeries. For a specific surgery (a dataset) I might have 10 different surgeons (10 different groups) who performed that surgery and whom I want to compare against each other. Each surgeon in a given surgery has a different number of surgeries performed. I'm trying to determine the minimum number of surgeries (observation events) by group (by surgeon) in order to include that surgeon (group) in the analysis.

Thanks!

Tom


I think that you misunderstand the use of the sample size calculators. Their purpose is a **prospective determination** given a correct model (e.g., normal distribution with mean = 0 and constant variance), a known variance, a known minimum difference to detect, the level of significance (acceptable type I error rate), and the desired power (1 - acceptable type II error rate). This determination is very useful in a **designed experiment**.
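As a rough illustration of how those inputs combine, here is a minimal sketch of a prospective sample size calculation. It is a simplification, not what JMP computes: it assumes a two-sided comparison of only two groups with a known standard deviation and uses the normal (z) approximation, whereas the k means calculator is based on the one-way ANOVA F test. The function name and the example values are hypothetical.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(sigma, delta, alpha=0.05, power=0.80):
    """Approximate observations per group for a two-sided, two-group z-test.

    sigma: known common standard deviation (assumed)
    delta: minimum difference in means worth detecting
    alpha: acceptable type I error rate
    power: 1 - acceptable type II error rate
    """
    z = NormalDist()  # standard normal distribution
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_beta = z.inv_cdf(power)
    return ceil(2 * ((z_alpha + z_beta) * sigma / delta) ** 2)


# Hypothetical inputs: detect a half-sigma difference.
print(n_per_group(sigma=1.0, delta=0.5))              # alpha = 0.05
print(n_per_group(sigma=1.0, delta=0.5, alpha=0.20))  # looser alpha needs fewer
```

Every input is fixed before any data are collected, which is what makes the calculation prospective.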

Your case sounds like an **observational study**. The surgeries happen regardless of whether anyone is studying them. Why would you pre-select groups to be in or out of your study by number of observations? Excluding a group can only decrease the power in your test (one-way ANOVA). Excluding a group can bias your estimates. You might exclude the most informative group this way.

If someone asked me the best way to compare groups A and B with 20 observations, I would tell them to observe 10 of each. On the other hand, if they told me that they have 17 observations of group A and 3 observations of group B, I would tell them to test the data that they have. It won't be as informative as the former design, but it is all they have. (BTW, it won't bias the estimates but it will inflate the standard errors.) If they exclude group B because of the small number of observations, what would they have left?
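The cost of the 17/3 split can be made concrete with the standard error of the difference between two group means, sigma * sqrt(1/n_A + 1/n_B). This is a minimal sketch assuming a common, known standard deviation of 1.0 (an illustrative value, not from any data set); the function name is hypothetical.

```python
from math import sqrt


def se_mean_difference(sigma, n_a, n_b):
    """Standard error of (mean_A - mean_B) with common standard deviation sigma."""
    return sigma * sqrt(1 / n_a + 1 / n_b)


balanced = se_mean_difference(1.0, 10, 10)
unbalanced = se_mean_difference(1.0, 17, 3)
print(round(balanced, 3), round(unbalanced, 3))  # 0.447 0.626
```

With the same 20 observations, the unbalanced split gives a standard error about 40% larger, but the estimate of the difference is still unbiased.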

Learn it once, use it forever!


Also consider the response for each observation. The one-way ANOVA is for a **measurement** that uses JMP's **continuous** *modeling type*. If the data are otherwise, such as **counts** or **categories**, then none of the sample size calculators determine the correct sample size. In such cases, you can use the **Generalized Linear Models** platform or one of the **Logistic Regression** platforms from **Fit Model**.

Learn it once, use it forever!


Sorry I didn't specify some of the further details.

The alpha we're willing to accept is 0.20, with a power of 0.80. This becomes an experiment because we don't necessarily know the actual number of surgeries performed by each surgeon. This can happen when we are working with an insurance company and they are only giving us data for surgeons who were performing surgeries at hospitals that were part of their insurance network. So a given surgeon may have done 100 tonsillectomy surgeries in 2016, but we are only given 25% of those observations.

This therefore becomes a type of experiment in that in these conditions where we are essentially only getting a random sample of events per surgeon, how do we define the cutoff of what is the minimum number of observations per surgeon?

I agree with you in that when we are working with a specific hospital and have access to 100% of all surgeon data, we can simply use all observations and not be concerned about an experimental cutoff limit.


I am trying to help and not start a conflict based on semantics, but your analysis is still based on observational data. That inclusion in the analysis depends on a surgeon's affiliation with a particular insurance network has to do with the definition of the population and the limits on inference from the data. Even if affiliation is a predictor/factor, experimental units (surgeons) were not randomly assigned to treatments (insurance network affiliation). An experiment is a study in which you determine (that is, establish physically) the conditions or treatments, randomly assign experimental units to treatments, and measure the response for a run.

While in some sense the surgeons included in your analysis are a random sample, they do not constitute a 'simple random sample.' Such a sample assumes that all possible samples of a given size had an equal probability of being selected from the population. That is not the case here.

Why do you need to exclude surgeons based on the number of surgeries? What is the purpose of a minimum number of surgeries?

If the number of surgeries increases, then the power of the test increases. If the number of surgeries decreases, then the power of the test decreases. If the number of surgeries is zero, then the power is zero (the type II error risk is 1).
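The relationship between the number of observations and power can be sketched numerically. This is a simplified illustration, not the one-way ANOVA calculation: it assumes a two-sided z-test comparing two groups of equal size with a known standard deviation, at the alpha of 0.20 mentioned earlier in the thread. The function name, effect size, and group sizes are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def power_two_group(n, sigma, delta, alpha=0.20):
    """Power of a two-sided z-test for two groups of n observations each (n >= 1)."""
    z = NormalDist()  # standard normal distribution
    crit = z.inv_cdf(1 - alpha / 2)          # two-sided critical value
    shift = delta / (sigma * sqrt(2 / n))    # true difference in standard-error units
    return z.cdf(shift - crit) + z.cdf(-shift - crit)


# Hypothetical setting: true difference of half a standard deviation.
sizes = (5, 10, 20, 40)
powers = [power_two_group(n, sigma=1.0, delta=0.5) for n in sizes]
for n, p in zip(sizes, powers):
    print(n, round(p, 2))
```

Under these assumptions the power rises steadily with the number of observations per group, so any cutoff that discards observations lowers it.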

Again, I am trying to understand and help.

Learn it once, use it forever!


Thanks, Mark. Your comments are extremely helpful and appreciated.

The concern we have with having a minimum number of surgeries per surgeon is that we are scoring the surgeons based on performance metrics such as readmission %, surgical site re-infection %, etc. We want to be fair and unbiased when performing that scoring analysis. It is possible that we may have 100% of Surgeon X's surgeries but only 25% of Surgeon Y's surgeries based on network affiliation in our dataset. Would it bias our scoring results in this case? Note that the performance metrics above are all averages based on the observations that we do have.

Thanks!


Bias has to do with the omission of informative data. Is the estimate independent of or is it conditional on the nature of the omission (e.g., network affiliation)? Is the estimate independent of the insurance network affiliation status? If the omitted observations are non-informative, then your estimates will be unbiased. This issue has everything to do with your question of the analysis and the definition of the population. So if your scope includes only one insurance network and the estimates and inferences will not apply to any other insurance network, then you do not have to worry about bias.

Sample size determines the standard error of the estimates. The estimates might be of population parameters, model parameters, or future responses (scoring). The smaller groups will exhibit larger standard errors (confidence, prediction, or tolerance intervals).
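For a percentage metric such as a readmission rate, the effect of sample size on the standard error can be sketched with the usual binomial formula sqrt(p(1 - p)/n). The 10% rate and the surgery counts below are made-up values for illustration, and the function name is hypothetical.

```python
from math import sqrt


def se_proportion(p, n):
    """Standard error of an observed proportion p based on n independent cases."""
    return sqrt(p * (1 - p) / n)


# Hypothetical surgeon with a 10% readmission rate.
se_small = se_proportion(0.10, 25)   # only a quarter of the surgeries observed
se_large = se_proportion(0.10, 100)  # all surgeries observed
print(round(se_small, 3), round(se_large, 3))  # 0.06 0.03
```

Observing a quarter of the cases doubles the standard error, which widens the interval around the score without shifting its expected value, provided the omitted cases are non-informative.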

Bias and variance are separate attributes of the estimates although we sometimes trade some of one (small increase in bias) for the benefit of the other (large decrease in variance) for overall better accuracy.
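The trade-off can be stated with the standard decomposition of mean squared error for an estimator $\hat{\theta}$ of a parameter $\theta$ (the notation is a conventional choice, not taken from the thread):

$$
\mathrm{MSE}(\hat{\theta}) = \mathrm{E}\big[(\hat{\theta} - \theta)^2\big] = \mathrm{Bias}(\hat{\theta})^2 + \mathrm{Var}(\hat{\theta})
$$

A small increase in the squared bias term is worthwhile whenever it buys a larger decrease in the variance term.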

I do not see how the omission of some groups based solely on an arbitrary criterion of sample size will introduce bias. (As I said above, some rules for omission can bias the estimates.) I know that such an omission will decrease the amount of evidence and necessarily increase the width of the intervals of the estimates. Your scoring will be less accurate.

These comments are very general and confined to the identification and collection of the data. (No less important, though.) A lot depends on the analysis method, too.

Learn it once, use it forever!


Thank you, Mark. That has been extremely helpful.