Auto-suggest helps you quickly narrow down your search results by suggesting possible matches as you type.

- JMP User Community
- :
- Discussions
- :
- How would I best exclude thousands of rows of data based on a single column show...

- Subscribe to RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Printer Friendly Page

Highlighted
##
How would I best exclude thousands of rows of data based on a single column showing a value to be an outlier?

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Get Direct Link
- Email to a Friend
- Report Inappropriate Content

Feb 7, 2020 11:45 AM
(1189 views)

1.2 million rows x 16 columns

Column for ambient temperature shows many observations (by day) that were classified as outliers, above the box and whisker top tail in the graph builder. This is unfortunate, but likely due to the placement of the ambient temperature measurement probe on a moving vehicle.

Goal is to properly ID and suppress rows of data having ambient temperature outliers, before using a neural network to model the data to compare performance of widget A to widget B.

I've read a few threads here, but cannot find a similar question with solution.

5 REPLIES 5

Highlighted
##
Re: How would I best exclude thousands of rows of data based on a single column showing a value to be an outlier?

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Get Direct Link
- Email to a Friend
- Report Inappropriate Content

Easy-peezy

Bring up your outlier box

Drag your cursor over all of the outlier points which will select all of them in your data table.

Now go to the RowState column in your data table and go to one of the rows that has been selected.

While hovering on the selected row, right click on your mouse and select "Exclude"

All selected rows will be Excluded

You may want to select Hide and Exclude

Jim

Highlighted
##
Re: How would I best exclude thousands of rows of data based on a single column showing a value to be an outlier?

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Get Direct Link
- Email to a Friend
- Report Inappropriate Content

Hi @Victor60

i have once written a script that earmarks rows if they are outliers in any of the groups in any of the variables. i have updated it to include different days in once column and a couple of columns of temperature. it produces a new binary column indicating outliers for each original column and one more indicating whether the observation is an outlier in any of the variables.

lets us know if it works.

ron

```
Names Default To Here( 1 );
// make a sample data table
dt = New Table( "Data",
add rows( 10000 ),
New Column( "it", Numeric, Ordinal, Format( "Best", 8 ), Formula( Row() ) ),
New Column( "Day", Numeric, Ordinal, Format( "Best", 8 ), formula( Random Integer( 110 ) ) ),
New Column( "temp1", Numeric, Continues, Format( "Best", 8 ), formula( Random Lognormal( :Day / 10, 5 ) ) ),
New Column( "temp2", Numeric, Continues, Format( "Best", 8 ), formula( Random Lognormal( :Day / 10, 5 ) ) ),
New Column( "Out",
Numeric,
Ordinal,
Format( "Best", 8 ),
Set Property( "Value Colors", {0 = -13977687, 1 = -4222943} ),
Color Cell by Value,
formula( 0 )
),
New Column( "Outtemp1",
Numeric,
Ordinal,
Format( "Best", 8 ),
Set Property( "Value Colors", {0 = -13977687, 1 = -4222943} ),
Color Cell by Value,
formula( 0 )
),
New Column( "Outtemp2",
Numeric,
Ordinal,
Format( "Best", 8 ),
Set Property( "Value Colors", {0 = -13977687, 1 = -4222943} ),
Color Cell by Value,
formula( 0 )
),
);
dt << run formulas;
Column( dt, "it" ) << delete formula;
Column( dt, "Day" ) << delete formula;
Column( dt, "temp1" ) << delete formula;
Column( dt, "temp2" ) << delete formula;
Column( dt, "Out" ) << delete formula;
Column( dt, "Outtemp1" ) << delete formula;
Column( dt, "Outtemp2" ) << delete formula;
// now we get to work
// make a list of days
byvarlist = Associative Array( dt:Day ) << get keys;
// make a list of columns
cols = dt << get column names( Continuous ); //List all Numeric columns
// loop through the days
For( i = 1, i <= N Items( byvarlist ), i++,
// select each Day
dt << select where( dt:Day == byvarlist[i] );
// subset each Day
subdt = dt << Subset( Selected Rows( 1 ) );
// clear selection for smooth workflow
dt << clear select;
// fined outliers in each variable within each Day
For( ii = 1, ii <= N Items( cols ), ii++,
out = Eval(
Parse(
"subdt << get rows where(
subdt:" || Char( cols[ii] ) || " > Col Quantile( subdt:" || Char( cols[ii] ) ||
", 0.75 ) + 1.5 * (Col Quantile( subdt:" || Char( cols[ii] ) || ", 0.75 ) - Col Quantile( subdt:" || Char( cols[ii] ) ||
", 0.25 )) | subdt:" || Char( cols[ii] ) || " <
Col Quantile( subdt:" || Char( cols[ii] ) ||
", 0.25 ) - 1.5 * (Col Quantile( subdt:" || Char( cols[ii] ) || ", 0.75 ) - Col Quantile( subdt:" || Char( cols[ii] ) ||
", 0.25 )));"
)
)`;
Eval( Parse( "column (subdt, \!"Out" || Char( cols[ii] || "\!" )[" || Char( out ) || "] = 1;" ) ) );
// also indicate the global outlier
Column( subdt, "Out" )[out] = 1;
//wait (0.00001);
);
// update original data table with new information for the Day
subdt << delete columns( "Day" );
subdt << delete columns( cols );
dt << Update( With( subdt ), Match Columns( :it = :it ) );
Close( subdt, No Save );
//Wait( 0.00001 );
);
dt << color by column( :Out );
```

Highlighted
##
Re: How would I best exclude thousands of rows of data based on a single column showing a value to be an outlier?

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Get Direct Link
- Email to a Friend
- Report Inappropriate Content

I'd like to suggest a little different approach.

Maybe try scripting the Explore Outliers platform, like this:

`dt=current data table();`

obj=dt<<Explore Outliers(
Y( :DATA_COLUMN ),
Q( 0.5 ),
Quantile Range Outliers ,
Show only columns with outliers( 1 )
);
obj << Exclude rows( :DATA_COLUMN );

To set this up, use the Explore Outliers platform to get your conditions nailed down, then just save the script to the script window.

Substitute your conditions for mine, and you're good to go.

You could also, just select the outliers, then invert the selection and then make a subset of the data you want to keep.

JMP Systems Engineer, Pharm and BioPharm Sciences

Highlighted
##
Re: How would I best exclude thousands of rows of data based on a single column showing a value to be an outlier?

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Get Direct Link
- Email to a Friend
- Report Inappropriate Content

Amazing. Thanks for insights.

Highlighted
##
Re: How would I best exclude thousands of rows of data based on a single column showing a value to be an outlier?

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Get Direct Link
- Email to a Friend
- Report Inappropriate Content

Wow! Thanks for the education!

Article Labels

There are no labels assigned to this post.