(English) Comparative analysis between two groups when N=4 million or more
Dear JMP Technical Support,
Thank you for your continued support. I am a JMP user, and I would like to run a statistical analysis on data collected from a cancer statistics website.
We tabulated the data to determine whether certain rare cancers are more likely than other cancers to be followed by a secondary cancer, and obtained the following case counts.
                                   Duplicates
                                   Yes       No
Certain rare cancers               827       12204
Other than certain rare cancers    129634    4490934
(Total                             130461    4503138)
I have two questions.
Question 1
We plan to compare two groups: patients with a certain rare cancer and patients with any other cancer.
When I use JMP to run the chi-square analysis you previously explained to me, it seems that I need to create more than 4.5 million rows, one per case. I would therefore like to run Fisher's exact test or the two-sample Kolmogorov-Smirnov test in JMP instead. Can JMP handle an analysis with more than 4.5 million cases?
Question 2
I would like to check whether there is any age bias between the two groups, but on the cancer statistics website age is not given as a number. It is grouped into blocks such as 15-19 years old (please see the two attached Excel documents).
In that case, can JMP be used to check that there is no age bias?
Those are my two questions. I apologize for the trouble, but I would appreciate any information you can give me.
This post was originally written in Japanese and has been translated for your convenience. When you reply, it will also be translated back to Japanese.
Re: (Japanese) Comparative analysis between two groups when N=4 million or more
Hi @ChenShu07733,
Welcome to the Community!
The first question that comes to mind when reading your post is why you would like to use all the data instead of selecting a representative subset.
Processing millions of rows is demanding for any computer and software, and I'm not sure how much value that many rows add. Selecting a representative subset can make the analysis easier and quicker with essentially the same outcome, and you can use some of the remaining, non-selected rows to validate your analysis. On the other hand, depending on your objectives, using so many rows can strongly affect the outcome of a statistical test: with a very large sample size the confidence intervals become extremely narrow, so the test will report a statistically significant result almost every time (depending on the effect size you want to detect and the significance threshold you set).
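To make the sample-size point concrete, here is a minimal Python sketch (not JMP output). It applies a Pearson chi-square test to the same pair of proportions at three group sizes. The proportions 2.9% and 2.8% and the group sizes are hypothetical values chosen only for illustration, and the two groups are assumed to be equal in size.

```python
import math

def chi_square_two_groups(p1, p2, n1, n2):
    """Pearson chi-square (1 df, no continuity correction) for two groups
    with event proportions p1 and p2 and group sizes n1 and n2."""
    a, b = p1 * n1, (1 - p1) * n1   # group 1: events, non-events
    c, d = p2 * n2, (1 - p2) * n2   # group 2: events, non-events
    n = n1 + n2
    stat = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # For 1 degree of freedom the upper-tail probability is erfc(sqrt(stat / 2)).
    return stat, math.erfc(math.sqrt(stat / 2))

# Hypothetical proportions: a difference of only 0.1 percentage points.
p1, p2 = 0.029, 0.028
results = []
for n_per_group in (1_000, 100_000, 2_000_000):
    stat, p_value = chi_square_two_groups(p1, p2, n_per_group, n_per_group)
    results.append((n_per_group, stat, p_value))
    print(f"n per group = {n_per_group:>9,}: chi-square = {stat:8.3f}, p = {p_value:.3g}")
```

The difference between the groups is identical in all three runs; only the p-value changes. This is why it helps to report an effect size (difference in proportions, risk ratio, or odds ratio) alongside the p-value.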
I would recommend taking extra time to think about your objectives and about the sample size your analysis requires (which depends on the alpha/Type I error risk you accept and on the difference/effect size you want to detect), using the Power Explorers for Hypothesis Tests.
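As a rough idea of what such a sample-size calculation involves, here is a minimal Python sketch using the common normal-approximation formula for comparing two proportions. It is not JMP's Power Explorer and may not match its results exactly. It assumes a two-sided test and two groups of equal size; the function name `n_per_group` and the example proportions (6% versus 3%) are hypothetical.

```python
import math
from statistics import NormalDist

def n_per_group(p1, p2, alpha=0.05, power=0.80):
    """Approximate sample size per group needed to detect p1 vs p2
    (two-sided test, equal group sizes, normal approximation)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # critical value for alpha
    z_beta = NormalDist().inv_cdf(power)            # quantile for the target power
    p_bar = (p1 + p2) / 2
    numerator = (z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return math.ceil(numerator / (p1 - p2) ** 2)

# Hypothetical effect sizes: a large difference and a very small one.
n_large_effect = n_per_group(0.06, 0.03)
n_small_effect = n_per_group(0.029, 0.028)
print(f"6.0% vs 3.0%: about {n_large_effect:,} per group")
print(f"2.9% vs 2.8%: about {n_small_effect:,} per group")
```

The smaller the difference you want to detect, the larger the required sample, so deciding which difference is practically meaningful comes first. With groups as unequal in size as those in the question, an unequal-allocation formula or the Power Explorer itself would be more appropriate.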
Concerning age, I don't think this is a problem: when selecting a sample from the population data, make sure the proportion of individuals in each age block is the same in your sample as in the population.
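One way to keep the age-block proportions the same is proportional (stratified) allocation. The Python sketch below splits a target sample size across age blocks in proportion to the population counts, using largest-remainder rounding so the parts add up exactly. The counts per age block are hypothetical placeholders (the real ones are in the attached Excel files), and `proportional_allocation` is an illustrative helper, not a JMP function.

```python
def proportional_allocation(block_counts, sample_size):
    """Split sample_size across strata in proportion to block_counts,
    using largest-remainder rounding so the parts sum to sample_size."""
    total = sum(block_counts.values())
    exact = {k: sample_size * v / total for k, v in block_counts.items()}
    alloc = {k: int(x) for k, x in exact.items()}      # round down first
    leftover = sample_size - sum(alloc.values())
    # Hand the leftover units to the blocks with the largest fractional parts.
    by_fraction = sorted(exact, key=lambda k: exact[k] - alloc[k], reverse=True)
    for k in by_fraction[:leftover]:
        alloc[k] += 1
    return alloc

# Hypothetical population counts per age block (placeholders only).
population = {"15-19": 1_250, "20-24": 3_410, "25-29": 5_340}
sample = proportional_allocation(population, 1_000)
for block, n in sample.items():
    share_pop = population[block] / sum(population.values())
    print(f"{block}: sample {n} ({n / 1_000:.1%}), population share {share_pop:.1%}")
```

Within each block you would then draw that many rows at random, so the sample's age distribution matches the population's to within rounding.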
Some reading on statistical testing and sample size:
Type I and Type II Errors and Statistical Power - StatPearls - NCBI Bookshelf
Statistical analysis: sample size and power estimations | BJA Education | Oxford Academic
I hope this answer helps,
"It is not unusual for a well-designed experiment to analyze itself" (Box, Hunter and Hunter)
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Get Direct Link
- Report Inappropriate Content
Re: (Japanese) Comparative analysis between two groups when N=4 million or more
Regarding question 1, I think you can analyze it by entering the counts in summarized form: one row for each combination of group and outcome, with the number of cases in a frequency column.
Select Analyze > Fit Y by X and specify columns for Y, X, and frequency.
However, as Victor mentioned, because the data set is so large, the p-value will be small even if the difference in proportions is small, so care must be taken in interpreting it.
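To show that the test needs only the four counts and not 4.5 million individual rows, here is a minimal Python sketch (not JMP output) that computes the Pearson chi-square statistic directly from the table in the question. It assumes no continuity correction and prints the two proportions so that the size of the difference can be judged alongside the p-value.

```python
import math

# Counts from the question: [secondary cancer yes, secondary cancer no]
table = [
    [827, 12204],        # certain rare cancers
    [129634, 4490934],   # other than certain rare cancers
]

def chi_square_2x2(t):
    """Pearson chi-square (no continuity correction) for a 2x2 table of counts."""
    row = [sum(r) for r in t]
    col = [t[0][j] + t[1][j] for j in range(2)]
    n = sum(row)
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = row[i] * col[j] / n
            stat += (t[i][j] - expected) ** 2 / expected
    # For 1 degree of freedom the upper-tail probability is erfc(sqrt(stat / 2)).
    return stat, math.erfc(math.sqrt(stat / 2))

stat, p_value = chi_square_2x2(table)
prop_rare = table[0][0] / sum(table[0])
prop_other = table[1][0] / sum(table[1])
print(f"proportion with secondary cancer: rare = {prop_rare:.4f}, other = {prop_other:.4f}")
print(f"risk ratio = {prop_rare / prop_other:.2f}")
print(f"chi-square = {stat:.1f}, p = {p_value:.3g}")
```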