I am trying to perform a normality test on a series of datasets similar to the one shown in the figure attached, and later compare them using ANOVA. However, when I perform the normality test, I keep getting what I considered to be a ‘false negative.’ As you can see from the screenshot of the journal, the dataset as a whole resembles a normal distribution, even though there are only about 7-8 bins being populated. My best guess is that because there are ‘so many’ empty bins in between the populated ones, JMP is interpreting the dataset as multimodal. I would like it to perform the fit/test on the overall set as shown graphically.
Whichever way JMP is interpreting the distribution of the data, that interpretation also affects the conclusions drawn from ANOVA. I would be very grateful if anyone could provide some insight as to how to perform the statistical analysis on these data (i.e. test each dataset for normality and then compare them using ANOVA).
You have a large sample so these tests are extremely sensitive to a departure from (perfect) normality. The observations, though, at either end depart from linearity in the normal quantile plot, indicating that the distribution is right-skewed. Why do you think that there is no departure?
Is this example one of the samples to be analyzed in the ANOVA? How do the other samples look?
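To illustrate the point about sample size, here is a minimal Python sketch (standard library only). It uses the Jarque-Bera statistic, chosen only because its p-value has a simple closed form; it is not necessarily the test JMP reports. The counts in `shape` are hypothetical, not the poster's data. The same mildly right-skewed shape is replicated at larger and larger sample sizes, so skewness and kurtosis stay fixed while the test statistic grows in proportion to n.

```python
import math

def jarque_bera(values):
    """Return (statistic, approximate p-value) for the Jarque-Bera test."""
    n = len(values)
    mean = sum(values) / n
    # Central moments of order 2, 3 and 4 (population form, divided by n).
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    skew = m3 / m2 ** 1.5
    excess_kurt = m4 / m2 ** 2 - 3.0
    stat = n / 6.0 * (skew ** 2 + excess_kurt ** 2 / 4.0)
    # Asymptotically chi-square with 2 degrees of freedom,
    # whose survival function is exp(-x / 2).
    p_value = math.exp(-stat / 2.0)
    return stat, p_value

# Hypothetical, mildly right-skewed shape: value -> relative count.
shape = {1: 1, 2: 4, 3: 8, 4: 10, 5: 8, 6: 5, 7: 3, 8: 1}

def sample_with_copies(copies):
    """Build a sample with the same shape but `copies` times as many points."""
    return [v for v, c in shape.items() for _ in range(c * copies)]

for copies in (1, 10, 100):
    data = sample_with_copies(copies)
    stat, p = jarque_bera(data)
    print(f"n={len(data):5d}  JB={stat:8.2f}  p={p:.4g}")
```

The shape of the distribution is identical in all three runs; only n changes. That is why a histogram that "looks somewhat normal" can still fail a formal test once the sample is large.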
Thank you for the quick response. While I agree the distribution is skewed, I did not expect it to completely fail the normality test given that the set looks somewhat normally distributed--but yes, I do not have a quantitative way to justify that at the moment. As I recall, I also tried fewer points in each bin and still got a failed normality test. The other datasets look the same.
Basically, I'm trying to compare these sets and determine which ones are different when compared to an ideal case scenario. I'm open to other alternatives. ANOVA is the approach I'm familiar with.
@garibay90 Your data may also be resolution limited, as evidenced by the apparent "gaps" in between the bars in your bar chart, and even more clearly by the "chattered" pattern of the points in your normal probability plot. This data doesn't look particularly well behaved for ANOVA, unless this data set is composed of multiple groups (samples)! In that case you need to run the normality assessment separately per group first, and then you can run the ANOVA. But even when you run the ANOVA, as mentioned, your analysis will be 'statistically biased' by your extremely large sample size. Hypothesis tests in general (including ANOVA) have much higher power to reject the null hypothesis as sample size increases. For ANOVA, you are testing the null hypothesis that the offset of each group mean from the grand mean is the same for all treatment groups (that is, that the group means are equal). If you reject the null, then all you can assert is that at least one group mean differs from the others. Note: another central assumption here is that the variance is the same from group to group. You can verify this assumption by using Fit Y by X > Unequal Variances, and looking at the p-values associated with the hypothesis tests which JMP automatically produces.
Principally though, ANOVA is a test comparing the means between groups (taking into account the overall variance between groups, which is assumed to be equal from group to group).
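For readers who want to see what the ANOVA comparison computes, here is a minimal Python sketch of the one-way F statistic (standard library only). The three groups are made-up numbers, not the poster's data. The per-group sample variances printed at the end are only a crude look at the equal-variance assumption, not a substitute for the formal tests JMP reports under Unequal Variances.

```python
from statistics import mean, variance

def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of samples."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    # Between-group sum of squares: group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Within-group sum of squares: observations around their own group mean.
    ss_within = sum(sum((x - mean(g)) ** 2 for x in g) for g in groups)
    df_between = k - 1
    df_within = n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within

# Hypothetical samples, one list per group.
groups = [
    [4.1, 3.9, 4.3, 4.0, 4.2],
    [4.6, 4.4, 4.8, 4.5, 4.7],
    [4.0, 4.2, 3.8, 4.1, 3.9],
]

f_stat, df_b, df_w = one_way_anova_f(groups)
print("F =", f_stat, "with df =", (df_b, df_w))
print("group variances:", [variance(g) for g in groups])
```

A large F means the group means are spread out relative to the scatter within groups. Because the within-group mean square is estimated more precisely as samples grow, even small differences in means become significant with very large groups.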
Thank you! Yes, I agree with what you are saying. This is only one dataset. I have multiple ones like the one shown in the figure.
Your data is partitioned into bins as shown by the normal probability plot. Did you create this binning arrangement or did the data come to you in this binned fashion? If you created the binning have you tried using the raw data, before binning?
Data are not binned. It is actually a collection of numbers and JMP is doing the binning. Attached is another figure of the journal that shows the distribution of the populated bins clearly.
@garibay90 I agree with @P_Bartell, but when you say "It is actually a collection of numbers and JMP is doing the binning" you are right but at the same time not completely right. Yes, the histogram does the automatic binning for you, and you can adjust that binning dynamically by using the "grabber" tool. BUT the data is the data, and the data shows, at some 'reasonable' bin size - and quite strongly corroborated by the N-Q plot - that the data is binned, that is, in my parlance, "partitioned" into separate "groups." My word choice is probably not correct statistical language but hopefully we are on the same page. I believe what @P_Bartell is saying is: you need to address the question of "why is your data structured like this?" Answering this question is probably of principal importance to running ANOVA. The nature and context of your data need to be understood first, then you can apply various inferential statistical techniques to test various hypotheses (e.g. on the mean, the variance, etc).
@PatrickGiuliano Thanks! Now I understand what you guys are getting at. In that case yes, the raw data generated are binned. Basically, my output is two columns: the bin and the counts. Not sure there is a way around this one. However, I went ahead and wrote a script to generate as many values of each bin as listed in the counts column. This is the set of numbers I am later inputting into JMP to perform the normality test. I was hoping JMP would generate a different set of bins so my distribution would appear/be interpreted as more continuous. I can attach an example dataset if needed. Thanks for all the input!
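The poster's script is not shown in the thread, so the following is only a hypothetical Python sketch of the expansion step described above (standard library only; the `bins` and `counts` values are invented). It also shows that the mean and standard deviation can be computed directly from the two columns, and that expanding the counts does not add any resolution: the expanded data still has only as many distinct values as there are bins.

```python
import math
from statistics import stdev

def expand_counts(bins, counts):
    """Repeat each bin value as many times as its count."""
    values = []
    for b, c in zip(bins, counts):
        values.extend([b] * c)
    return values

def weighted_mean_sd(bins, counts):
    """Mean and sample standard deviation straight from bin/count columns."""
    n = sum(counts)
    mean = sum(b * c for b, c in zip(bins, counts)) / n
    ss = sum(c * (b - mean) ** 2 for b, c in zip(bins, counts))
    return mean, math.sqrt(ss / (n - 1))

# Hypothetical two-column output: bin value and number of observations.
bins = [10.0, 10.5, 11.0, 11.5, 12.0]
counts = [2, 5, 9, 5, 2]

values = expand_counts(bins, counts)
mean, sd = weighted_mean_sd(bins, counts)

print("expanded sample size:", len(values))
print("distinct values after expansion:", len(set(values)))
print("mean and SD from counts:", mean, sd)
print("SD from expanded values:", stdev(values))
```

Since the expanded column contains only the bin values themselves, any software will see the same stepped pattern in a normal quantile plot, whatever histogram bin width is chosen.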