Solved: Re: Automatically selecting outliers

lmaley · Nov 26, 2018 10:59 AM

I have 7000 rows, that consists of multiple batches. I am doing a distribution by batch and using the outlier box plot to look for outliers. But I have 100+ batches, is there a way to have JMP automatically select which rows are outliers?

ron_horne · Nov 27, 2018 02:02 PM

now see if it works for you...

the script you sent was not adapted to your data at all...

View solution in original post

Dan_Obermiller · Nov 26, 2018 11:19 AM

You might want to look under Analyze > Screening > Explore Outliers.

Dan Obermiller

txnelson · Nov 26, 2018 8:25 AM

I am not sure of exactly what you are looking for. If you want to select the rows for a given distribution, when viewing the output, you can simply drag across the displayed outlier data points in the outlier box plot, and it will select the outlier data points.

But if you want to go from column to column and select outlier rows, a simple script could do that, however, remember that row selection goes across all columns, and what you would end up will are selected rows where any of the columns are selected.

What you may really want, is a simple script that colors all of the individual cells that are outliers. Maybe red if they are to far above the mean, and orange if they are outliers below the mean.

Can you please clarify exactly what you are looking for?

See Dan's reply for a better answer.

Jim

lmaley · Nov 26, 2018 01:37 PM

The screening outliers does not work if I need a variable in the by. What I want to do is my Y is in one column. Another column is labeled batch. I put batch in by and make the number of distributions as I do batches. So say 100 distributions. I want a script to highlight all the outliers in all 100 distributions. Ideally it will only highlight the rows so I can pick my own color since I use colors for different thing. Make more sense now?

Mark_Bailey · Nov 26, 2018 11:56 AM

If you had searched the File Exchange portion of the JMP Community for 'outlier,' you would have found these two utility scripts:

Of course, the statistical basis for each of these tests must be suited to your situation and criterion.

lmaley · Nov 26, 2018 02:00 PM

I tried the Grubb's script. This just shows the outliers and does not highlight the rows right? I need something that will highlight the rows.

ron_horne · Nov 26, 2018 03:14 PM

Hi,

perhaps here is a solution:

https://community.jmp.com/t5/Discussions/How-to-extract-month-in-words-from-a-date-value-how-to-chan...

Ron

lmaley · Nov 26, 2018 03:29 PM

I can't get this to work with my data, I changed the height and weight to my columns and removed the two lines where it opened the table and embedded it into my table, but it says evaluations are done but will not do anything.

I also need something that uses the by and this for each batch individually.

ron_horne · Nov 26, 2018 07:40 PM

try to split the table by batch and group by unique value (so you do not lose any rows) then run my script.
otherwise you need a different strategy than the one i offer.
does it make sense?

ron_horne · Nov 27, 2018 08:17 AM

let me try again

it is not a sleek solution but steps are clear and easy to modify

Names Default To Here( 1 );

// make a sample data table
dt = New Table( "Data",
	add rows( 10000 ),
	New Column( "it", Numeric, Continues, Format( "Best", 8 ), Formula( Row() ) ),
	New Column( "Batch" ,Numeric, Ordinal, Format( "Best", 8 ), 	formula( random integer (110) )),
	New Column( "y" ,Numeric, Continues, Format( "Best", 8 ), 	formula( random log normal (:Batch / 10, 5) )),
	New Column( "Out" ,Numeric, Ordinal, Format( "Best", 8 ), formula (0)),

);

dt << run formulas;
column (dt, "it") << delete formula;
column (dt, "Batch") << delete formula;
column (dt, "y") << delete formula;
column (dt, "Out") << delete formula;

// now we get to work

// extract the list of the by variable levels
byvarlist = Associative Array( dt:Batch ) << get keys;

// run across each group and locate the outliers
For( i = 1, i <= N Items( byvarlist ), i++,

// produce a subset data table for each group using selection method
dt << select where (dt:batch == byvarlist[i]); 
subdt = dt << Subset(
	Selected Rows( 1 ),
); 

dt << clear select;

// indicate outliers within the group
subdt:Out << set formula(
	If(
		:y > Col Quantile( :y, 0.75 ) + (Col Quantile( :y, 0.75 ) -
		Col Quantile( :y, 0.25 )) * 1.5, 1,
		:y < Col Quantile( :y, 0.25 ) - (Col Quantile( :y, 0.75 ) -
		Col Quantile( :y, 0.25 )) * 1.5, 1,
		0
	)
);
subdt << run formulas;

// prepare table for updating original by removing formulas and columns that are not needed
column (subdt, "Out") << delete formula;

keep = {"it", "Out"};
todelete = subdt << get column names( "string" );
For( icol = 1, icol <= N Items( keep ), icol++,
	Remove From( todelete, Contains( todelete, Keep[icol] ), 1 )
);

subdt << delete columns( todelete );

// update
dt << Update(
	With( subdt ),
	Match Columns( :it = :it )
);

Close( subdt, No Save );

);

// now the original data has all the outliers indicated in a column
selected = dt << select where (:Out == 1);
selected << Markers( 11 );
selected << Marker size ( 8 );// not working???
selected << Row Colors( red );// not working???
dt << clear select;