Essentially, what we are doing is accounting for over-estimation by the median/MAD. In our example data set, using the median/MAD in the winsorization process pulls the outlier point in too far.
x_orig = [4.5, 4.9, 5.6, 4.2, 6.2, 5.2, 9.9];
median and MAD estimates:
median = 5.2; sigma = 1.05 (here 1.5 × MAD, with MAD = 0.7);
winsorizing with these estimates (clipping anything beyond median + 1.5·sigma = 6.775) changes the data to:
x_firstpass = [4.5, 4.9, 5.6, 4.2, 6.2, 5.2, 6.775];
This data gives us a better approximation of the mean and sigma:
adjusted mean = 5.34; sigma = 1.04;
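A minimal sketch of this first pass (numpy assumed; the sigma = 1.5 × MAD scale and the 1.5·sigma cutoff are inferred from the numbers above, since 1.5 × 0.7 = 1.05 and 5.2 + 1.5 × 1.05 = 6.775):

```python
import numpy as np

x_orig = np.array([4.5, 4.9, 5.6, 4.2, 6.2, 5.2, 9.9])

# Robust first-pass estimates: median for location, MAD for scale.
med = np.median(x_orig)                 # 5.2
mad = np.median(np.abs(x_orig - med))   # 0.7
sigma = 1.5 * mad                       # 1.05 (1.5 * MAD is an assumption)

# Winsorize: pull anything beyond med +/- 1.5*sigma back to the cutoff.
cut_hi = med + 1.5 * sigma              # 6.775
cut_lo = med - 1.5 * sigma
x_firstpass = np.clip(x_orig, cut_lo, cut_hi)   # 9.9 -> 6.775
```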
If we use these better estimates to winsorize the original data set, 9.9 is re-clipped at the new, slightly looser cutoff:
x_secondpass = [4.5, 4.9, 5.6, 4.2, 6.2, 5.2, 6.905];
However, if we winsorize x_firstpass instead of x_orig, the code never sees the outlier point 9.9 and has nothing to adjust, because 6.775 isn't an outlier when mean = 5.34 and sigma = 1.04.
thus:
x_secondpass = [4.5, 4.9, 5.6, 4.2, 6.2, 5.2, 6.775];
The mean and sigma therefore come out the same:
adjusted mean = 5.34; sigma = 1.04;
sigma - sigma_old = 0
and the convergence test wrongly reports that the iteration has finished.
tl;dr - If you are trying to find the optimum zero point, your code needs to be robust enough to adjust in both directions: always re-winsorize from the original data so a point clipped too hard in an earlier pass can move back out as the estimates improve.
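Here is a sketch of the whole loop, contrasting the two update rules. All names are mine; the k = 1.5 cutoff and the 1.5·MAD seed are inferred from the numbers above, and the per-pass sigma is taken as the sample standard deviation (the walkthrough's 1.04 suggests a different estimator), so the exact values will drift slightly from the example, but the failure mode is the same:

```python
import numpy as np

def winsorize(data, loc, sigma, k=1.5):
    """Pull every point beyond loc +/- k*sigma back to the cutoff."""
    return np.clip(data, loc - k * sigma, loc + k * sigma)

def robust_mean(x_orig, tol=1e-6, max_iter=50, from_original=True):
    # Seed with the robust median / 1.5*MAD first pass from above.
    loc = np.median(x_orig)
    sigma = 1.5 * np.median(np.abs(x_orig - loc))
    x = winsorize(x_orig, loc, sigma)
    for _ in range(max_iter):
        sigma_old = sigma
        loc = np.mean(x)
        sigma = np.std(x, ddof=1)   # per-pass scale estimate (assumption)
        # The crucial choice: re-clip the ORIGINAL data, not the already
        # clipped copy. From x_orig, 9.9 is re-winsorized at every new
        # cutoff; from x, once no point crosses the current cutoff the
        # array freezes, sigma stops moving, and the test below fires
        # even though the outlier was never properly handled.
        x = winsorize(x_orig if from_original else x, loc, sigma)
        if abs(sigma - sigma_old) < tol:
            break
    return x, loc, sigma

x_orig = np.array([4.5, 4.9, 5.6, 4.2, 6.2, 5.2, 9.9])
print(robust_mean(x_orig, from_original=True))   # keeps re-clipping 9.9
print(robust_mean(x_orig, from_original=False))  # can stall prematurely
```

Re-clipping from x_orig also gives convergence a clean meaning: the fixed point is a cutoff that reproduces itself on the original data, rather than an array that merely stopped moving.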