chfields
Level II

Is there a way to automate/script the Winsor process of outlier filtering?

Is there a way to automate/script the Winsor process of outlier filtering? I found the two add-ins from Brady, but those do not exclude the outlier rows; they only pertain to enhancing the Summary function.

Imputation Addin

Extended Summary Add-in

17 REPLIES
mikedriscoll
Level VI

Re: Is there a way to automate/script the Winsor process of outlier filtering?

Thanks!

chfields
Level II

Re: Is there a way to automate/script the Winsor process of outlier filtering?

I don't exactly get the x = x_orig step.

It seems like this would create an infinite loop, but it works, so I cannot argue with that.

thanks!

msharp
Super User (Alumni)

Re: Is there a way to automate/script the Winsor process of outlier filtering?

Essentially, what we are doing is accounting for over-estimation by the median/MAD. In our example data set, using the median/MAD in the winsorization process adjusts the outlier point too much.

x_orig = [4.5, 4.9, 5.6, 4.2, 6.2, 5.2, 9.9];

Median and MAD estimates:

mean = 5.2 (the median); sigma = 1.05 (1.5 × MAD, with MAD = 0.7);


using these estimates changes the data to:

x_firstpass = [4.5, 4.9, 5.6, 4.2, 6.2, 5.2, 6.775];

This data gives us a better approximation of our mean and sigma:

adjusted mean = 5.34; sigma = 1.04;


If we use the better estimates to winsorize the data set, it would lead to:

x_secondpass = [4.5, 4.9, 5.6, 4.2, 6.2, 5.2, 6.905];


However, if we use x instead of x_orig, the code won't find the outlier point 9.9 to adjust it accordingly, b/c x already holds 6.775, which isn't an outlier when the mean = 5.34 and sigma = 1.04 (the upper limit is 5.34 + 1.5 × 1.04 ≈ 6.905).

Thus:

x_secondpass = [4.5, 4.9, 5.6, 4.2, 6.2, 5.2, 6.775];

mean and sigma will be the same:

adjusted mean = 5.34; sigma = 1.04;


sigma - sigma_old = 0

and our convergence check will believe it's finished.


tl;dr - If you are trying to find the optimum zero point, your code needs to be robust enough to run in both directions.
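
For anyone who wants to replay the numbers above outside JMP, here is a minimal Python sketch (not the JSL script from this thread). The function names and the clip_original flag are hypothetical. The constants are assumptions inferred from the worked numbers: a starting sigma of 1.5 × MAD, limits at ±1.5 sigma, and an updated sigma of 1.134 × the sample standard deviation of the winsorized data.

```python
import statistics


def winsorize(values, center, sigma, n_sigma=1.5):
    """Pull every value outside center +/- n_sigma*sigma back to the limit."""
    lo, hi = center - n_sigma * sigma, center + n_sigma * sigma
    return [min(max(v, lo), hi) for v in values]


def iterative_winsorize(x_orig, clip_original=True, tol=1e-6, max_iter=100):
    """Iterate winsorization until sigma stops changing.

    clip_original=True  re-clips x_orig on every pass (the x = x_orig step).
    clip_original=False re-clips the already winsorized data instead.
    """
    # Starting estimates: median and 1.5 * MAD (assumed constants, see lead-in).
    center = statistics.median(x_orig)
    mad = statistics.median([abs(v - center) for v in x_orig])
    sigma = 1.5 * mad

    x = list(x_orig)
    for _ in range(max_iter):
        source = x_orig if clip_original else x
        x = winsorize(source, center, sigma)
        new_center = statistics.mean(x)
        new_sigma = 1.134 * statistics.stdev(x)  # assumed correction factor
        converged = abs(new_sigma - sigma) < tol
        center, sigma = new_center, new_sigma
        if converged:
            break
    return x, center, sigma


x_orig = [4.5, 4.9, 5.6, 4.2, 6.2, 5.2, 9.9]

# Re-clipping the winsorized data: the limit can never move back out.
stuck, stuck_mean, stuck_sigma = iterative_winsorize(x_orig, clip_original=False)
print(max(stuck), round(stuck_mean, 3), round(stuck_sigma, 3))

# Re-clipping the original data: the limit keeps adjusting until it settles.
free, free_mean, free_sigma = iterative_winsorize(x_orig, clip_original=True)
print(max(free), round(free_mean, 3), round(free_sigma, 3))
```

With clip_original=False the loop stops as soon as the second pass finds nothing left to clip, which is the false convergence described above. With clip_original=True the 9.9 point is re-evaluated against the updated limits on every pass.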

mikedriscoll
Level VI

Re: Is there a way to automate/script the Winsor process of outlier filtering?

For the example in the AMC tech brief upon which the code in this thread is based, what is the tuning constant k?  I've seen the theoretical equations that relate to the tuning constant, but I haven't seen anything that describes winsorizing +/- n*sigma, and sigma_new = x * stddev(data), where (I would guess that) n and x are related to the tuning constant.

Thanks,
Mike

msharp
Super User (Alumni)

Re: Is there a way to automate/script the Winsor process of outlier filtering?

It would probably be worth your time to read some wiki articles.  Winsorization is just a process, similar to trimming. 

Wikipedia:

Winsorising or Winsorisation is the transformation of statistics by limiting extreme values in the statistical data to reduce the effect of possibly spurious outliers.

We only need a cutoff, such as a percentile; here we used 1.5 sigma.

In the article referenced, they use the median to winsorize, then used the winsorized data to better predict the center and spread.  They then winsorize again to even better predict the center and spread.  Here we winsorize to get a robust accurate mean/std dev.

The tuning constant is used in Huber's M-estimates.  This process is the opposite: instead, you use M-estimates to come up with a robust, accurate mean/std dev to then winsorize the data.  When K approaches infinity, the M-estimate mean = mean.  When K approaches 0, the M-estimate mean = median.

You can see page 16 of this presentation http://www.bauer.uh.edu/rsusmel/phd/ec1-25.pdf

So to answer your question the tuning constant K is not used b/c we didn't use Huber's M-estimates.
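
To see the limiting behaviour of K mentioned above, here is a rough Python sketch of a Huber-type location estimate on the example data from earlier in the thread. It is only an illustration under stated assumptions: the function name huber_location is hypothetical, the scale is held fixed at 1.4826 × MAD, and the estimate is found by repeatedly averaging the data clipped at mu ± K × scale. It is not the JMP Robust Mean implementation and is not expected to reproduce JMP's numbers.

```python
import statistics


def huber_location(values, k, tol=1e-9, max_iter=500):
    """Huber-type location estimate with a fixed scale (illustrative sketch)."""
    mu = statistics.median(values)
    mad = statistics.median([abs(v - mu) for v in values])
    scale = 1.4826 * mad  # assumed fixed scale; not re-estimated in the loop

    for _ in range(max_iter):
        lo, hi = mu - k * scale, mu + k * scale
        # Average the data after clipping at the current limits.
        new_mu = statistics.mean([min(max(v, lo), hi) for v in values])
        if abs(new_mu - mu) < tol:
            return new_mu
        mu = new_mu
    return mu


x = [4.5, 4.9, 5.6, 4.2, 6.2, 5.2, 9.9]

print(huber_location(x, k=1e6), statistics.mean(x))     # very large K: behaves like the mean
print(huber_location(x, k=1e-3), statistics.median(x))  # very small K: behaves like the median
print(huber_location(x, k=1.5))                         # in between
```

With a very large K nothing is clipped, so the estimate is the ordinary mean. With a very small K almost everything is clipped to a narrow band around the current estimate, so only the count of points on each side matters and the estimate sits at the median.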

mikedriscoll
Level VI

Re: Is there a way to automate/script the Winsor process of outlier filtering?

Thanks, that makes a lot of sense. I guess I misunderstood the part in the AMC brief referring to Huber... I thought they meant Huber M-estimation, but I agree it is just iteratively winsorizing.

Thanks for that link. It is helpful.

chfields
Level II

Re: Is there a way to automate/script the Winsor process of outlier filtering?

This code did it for me!!  Thanks!!!

Kevin_Anderson
Level VI

Re: Is there a way to automate/script the Winsor process of outlier filtering?

CHFields: to answer your original question: Yes, there is a way to automate outlier filtering.  In fact, there is a very large (if not infinite) number of ways.

The msharp script (original data mean of 5.785 with JMP Robust Mean/Huber Center of 5.735) yields a Winsorized mean of 5.339.  If you really want to go down the rabbit hole, a 12A One-Step Hampel W-Estimator (a robust mean recommended by Andrews, D.F., et al. (1972); Robust Estimates of Location: Survey and Advances; Princeton University Press) yields 5.256 on the same data.

Is it concerning that all these measures of location are so different on the same data?  Not really.  Like St. Gregory the Miracle Worker said, "That which is different is not the same."  Thoughtful researchers will not use the mean without at least some form of a reasonable rejection procedure.  It would be great if that reasonable procedure were easily implemented in a JMP function that could be called from a script, but without that being an option, the Winsorizing script from msharp would be a great alternative, as long as you identify the method used and also heed the earlier "poor taste and bad practice" warnings.

Have a robust day!