(This script is a new version that provides By group processing. Finally! Note that the p-value reported in the first version is no longer available.)
This script adds the two-tailed outlier test by Grubbs to the Distribution platform. The normal quantile plot and Goodness of Fit test are opened to help assess the assumption that the sample was drawn from a normal population.
Simply open the data table with the numeric variable to be evaluated, then open and run the script. Select the data column and click Y, Data. Specify the desired level of significance (alpha is 0.05 by default). This example is based on the variable height in the Big Class data table in the Sample Data folder.
Click OK.
The pattern of the markers in the normal quantile plot appears to be linear, and none of the markers is outside the region bounded by the dotted red curves. The sample and, therefore, the population are judged to be normal in this case. The Grubbs test at the bottom of the platform is not significant at the specified level.
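For readers who want to see what a two-tailed Grubbs test compares, here is a minimal Python sketch of the textbook form of the test. It is not taken from the JSL script and may differ from what the script computes. The function names and the sample data are hypothetical, and the t quantile is assumed to be supplied by the reader (from a table or a statistics package) rather than computed here.

```python
import math
import statistics


def grubbs_statistic(values):
    """Two-sided Grubbs statistic: largest absolute deviation from the mean, in sample SDs."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation (n - 1 denominator)
    return max(abs(x - mean) for x in values) / sd


def grubbs_critical(n, t_crit):
    """Critical value for n observations.

    Assumption: t_crit is the upper alpha / (2 * n) quantile of a
    t distribution with n - 2 degrees of freedom, supplied by the caller.
    """
    return (n - 1) / math.sqrt(n) * math.sqrt(t_crit**2 / (n - 2 + t_crit**2))


# Made-up data, for illustration only.
sample = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
g = grubbs_statistic(sample)
```

The most extreme value is flagged as an outlier when `g` exceeds `grubbs_critical(len(sample), t_crit)` for the chosen alpha.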
Now the same analysis is performed using a By variable. This example uses sex as the grouping variable.
Click OK.
Hi Mark
Thanks for posting this.
It would be great if it could be done BY a grouping variable.
BR, Marianne
If the data set happens to have any missing values, this code incorrectly calculates N for the Grubbs test. This can be problematic if the number of missing values is rather large.
See the following correction for lines 76 and 110:
n = N Rows( yVal ) - N Missing( yVal );
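To illustrate why the row count and the sample size differ when values are missing, here is a small Python sketch (not JSL, and not part of the script). The helper name and the column values are made up, and missing values are assumed to appear as None or NaN.

```python
import math


def non_missing(values):
    """Keep only entries that are neither None nor NaN."""
    return [x for x in values if x is not None and not math.isnan(x)]


# Made-up column with two missing entries.
column = [61.0, None, 65.0, float("nan"), 70.0]

n_rows = len(column)        # counts every row, including the missing ones
n = len(non_missing(column))  # the sample size the test should use
```

Using `n_rows` in place of `n` would overstate the sample size in both the test statistic's degrees of freedom and the critical value.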
The script breaks if the By variable is numeric. A simple way to fix this is to change bCol to the character data type after the line
dt = Current Data Table();
If( N Items( bCol ),
bCol[1] << set data type(character);
/* You could also fix the script to work with numeric By variables without changing them, but this seemed simpler, and you can always change it back to numeric at the end. */
);
Cheers
Gunter
Hi, thanks for the script.
It would be helpful to also include in the output which value was actually considered an outlier
(or at least provide an indication whether it concerns the lowest or the highest value).
Thank you for the script. I second Jan's recommendation to have some method to indicate the values that are detected to be outliers.
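One possible way to report the suspect value is sketched below in Python. This is an illustration of the idea only, not a change to the JSL script. The function name and the data are hypothetical, and ties between the two ends are assumed to resolve to the highest value.

```python
import statistics


def grubbs_suspect(values):
    """Return (value, side, G) for the observation farthest from the mean."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation
    low, high = min(values), max(values)
    # Only the two extremes can be the farthest point from the mean.
    if mean - low > high - mean:
        value, side = low, "lowest"
    else:
        value, side = high, "highest"  # ties fall here by assumption
    return value, side, abs(value - mean) / sd


# Made-up data, for illustration only.
value, side, g = grubbs_suspect([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
```

The returned value is only a candidate: it counts as an outlier only if `g` exceeds the critical value for the chosen alpha.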