Solved: Re: Remove rows that have missing values in a changing range of columns

cxu04 · Oct 7, 2021 05:29 PM

I have a data table where I want to remove the rows that have no values in all the columns for a variable range of columns.

For example, for the data table below:

dt =

Sample	Product Type	Col1	Col2	Col3	Col4	Col5
10001	A	0.0122	0.0223	0.01546	0.8742	23.12
10003	B
10004	A	0.123	0.124	0.1240	0.9712	23.78

I would want to drop the row with sample '10003' because it's entries in Col1 - Col5 are empty. I know this post is very similar, but it's not very helpful for me because the number of columns that I want to check are empty are usually greater than 15 and the column names can/will change, so it's not practical to manually type out this code and adjust the column names

dt << Select Where(
     Is Empty(:Col1) &
     Is Empty(:Col2) &
     is Empty(:Col3) & ... etc.
) << Delete Rows;

I am able to pretty easily get all the columns that I need to check are empty in a list because they are/will all be continuous with

cols = dt << Get Column Names(Continuous);

But this is where I get stuck and am unsure how to proceed. Do I use a for loop to go through each column and check if it's empty? But I only want to drop rows where ALL the columns are empty in the range of columns. If I was using Python, I would just use pandas.DataFrame.dropna(axis=0). How would I do this in JMP?

ErraticAttack · Oct 7, 2021 8:21 PM

There is a JMP builtin for this -- Missing Data Pattern. No need to reinvent the wheel

Names Default To Here( 1 );
dt = Open( "$SAMPLE_DATA/Cities.jmp" );

// column names to search for missing values
names = {"CO", "SO2", "NO", "Lead"};

// you can uncomment the following line to use all 'Continuous' columns
//names = dt << Get Column Names( "Continuous" );

// turn the list into a 'Columns( a, b, c )' expression
columns = Substitute( names, {}, Expr( Columns ) );

// run the missing data pattern
Eval( Eval Expr( table = dt << Missing Data Pattern( Expr( name expr( columns ) ) ) ) );

// select rows where all values are missing
Eval( Eval Expr( table << Select Where( :Number of Columns Missing == N Items( Expr( names ) ) ) ) );
Close( table, No Save );
rows = dt << Get Selected Rows();

// if any rows are selected then they have missing values for all columns given, delete
If( N Rows( rows ) > 0, dt << Delete Rows( rows ) );

Jordan

View solution in original post

txnelson · Oct 7, 2021 06:45 PM

Here is one way of handling your issue. Other Community Members may have better methods, but this one seems to always be able to work. I am using the Sample data table called Blood Pressure.

What the script below does, is to generate the JSL statement required to select the rows where if you sum up all of the continuous columns in the data table, and the sum is a missing value, then select the row. So the target JSL that needs to be built is

dt << select where(ismissing(sum(as column(\!"BP 8M\!"),as column(\!"BP 8M\!"),as column(\!"BP 8M\!"),as column(\!"BP 8M\!"),as column(\!"BP 8M\!"),as column(\!"BP 8M\!"),as column(\!"BP 8M\!"),as column(\!"BP 8M\!"),as column(\!"BP 8M\!")))==1);

The script creates this code in a string variable, and then it is executed using

Eval( Parse( ...........);

In the example, I create 3 new rows in the data table, which by default have missing values. The script then runs and finds those 3 rows.

Names Default To Here( 1 );

dt = 
// Open Data Table: Blood Pressure.jmp
// → Data Table( "Blood Pressure" )
Open( "$SAMPLE_DATA/Blood Pressure.jmp" );
dt << add rows(3);

colNames = dt << get column names( continuous, string );
theExpr = "dt << select where(ismissing(sum(as column(\!"" || colNames[1] || "\!")";
For( i = 2, i <= N Items( colNames ), i++,
	theExpr = theExpr || ",as column(\!"" || colNames[1] || "\!")"
);
theExpr = theExpr || "))==1);";
Eval( Parse( theExpr ) )
;

Jim

ErraticAttack · Oct 7, 2021 8:21 PM

There is a JMP builtin for this -- Missing Data Pattern. No need to reinvent the wheel

Names Default To Here( 1 );
dt = Open( "$SAMPLE_DATA/Cities.jmp" );

// column names to search for missing values
names = {"CO", "SO2", "NO", "Lead"};

// you can uncomment the following line to use all 'Continuous' columns
//names = dt << Get Column Names( "Continuous" );

// turn the list into a 'Columns( a, b, c )' expression
columns = Substitute( names, {}, Expr( Columns ) );

// run the missing data pattern
Eval( Eval Expr( table = dt << Missing Data Pattern( Expr( name expr( columns ) ) ) ) );

// select rows where all values are missing
Eval( Eval Expr( table << Select Where( :Number of Columns Missing == N Items( Expr( names ) ) ) ) );
Close( table, No Save );
rows = dt << Get Selected Rows();

// if any rows are selected then they have missing values for all columns given, delete
If( N Rows( rows ) > 0, dt << Delete Rows( rows ) );

Jordan

gzmorgan0 · Oct 11, 2021 03:23 AM

Hi @cxu04,

Both @txnelson and @ErraticAttack provided excellent solutions. @ErraticAttack has the added benefit to select the criteria for exclusions, 1 column missing or all missing and any value in between. Below, I have attached an alternate solution using matrix functions. As an FYI, some JMP platforms analyses and graphs have options for excluding or including missing values as well as imputation.

Names Default To Here( 1 );

// Open Data Table: Blood Pressure.jmp
// → Data Table( "Blood Pressure" )
dt = Open( "$SAMPLE_DATA/Blood Pressure.jmp" );
dt << add rows( 3 );

colNames = dt << get column names( continuous, string );


mymat = dt << get as matrix( colNames );

//this method counts the number of missing values in your column range for each row
//you can set the threshold number for exclusions, misslimit
//   any value is missing,   missLimit = 0
//   all are missing,        misslimit = nitems( colNames ) -1
//   20% +                   misslimit = Ceiling( nitems( colNames)*0.2 ) ) - 1
//

misslimit  = 3;

xx = Loc( V Sum( Is Missing( Transpose( mymat ) ) ) > misslimit ) ;  //get the row numbers where more than misslimit are missing.

dt << select rows( xx );
dt << exclude();
dt << clear select;

//look at the log output to get an idea what these functions are doing
show( mymat, VSum(IsMissing( Transpose(mymat) )), xx);

Here is a portion of the JMP LOG output for the show() command.

show( mymat, VSum(IsMissing( Transpose(mymat) )), xx);
/*:

mymat = 
[	182.595915155309 173.550216520736 180.116643270986 173.797442210887 178.416197148117 181.152265967985 171.278300921568 177.570953487335 170.514896659448, 
	172.512940709438 180.718560564297 180.917546679043 170.201828509985 178.839136233913 176.084997418848 175.161843431476 184.884491085702 187.603078247176, 
	180.774421425438 188.500473854652 177.384986937421 188.441549950731 175.346240386252 181.924943930192 183.12980493925 182.794049100295 180.121927541929, 
	180.620679187184 176.718084823659 182.381646560932 176.337805749776 172.773758285362 183.608276048664 183.04930217403 187.31279033708 182.931206367403, 
	184.202187190097 180.174755903154 176.277045698247 171.91229023858 174.678173508279 175.578954629626 170.259254710046 189.822176301411 182.694806601573, 
	171.439595422176 181.462075626649 179.517397097984 183.021068043181 175.683661608275 174.15485890414 184.878981656321 182.387718653956 184.169322994033, 
	174.589230394192 180.330100756984 180.187858248718 183.498290114926 179.622232533597 172.866924129507 184.16481869131 184.653326274105 180.368840760416, 
	179.109260746268 172.886222586229 182.904992672503 187.881911714776 192.773275959188 169.793497264777 179.098192996471 184.653672194875 170.403578559484, 
	177.787439187438 192.050244302843 176.785646502378 184.924564594592 172.499108408691 179.464417655732 186.765084181627 184.384583007231 177.427685029706, 
	180.687489407623 187.037431530346 181.618651921524 181.868587651411 177.245563534856 176.983573711354 183.864008024749 183.43593738674 184.394933320607, 
	181.804016028632 185.501158177984 178.209310795898 177.186342937671 176.573252982367 177.20643071091 182.640175499696 190.972901519754 174.986928330607, 
	178.085353285743 183.877324525276 186.575178728884 174.027593116806 174.72327532089 176.115218469266 179.80577644674 181.46178488602 184.070139882305, 
	186.313144806757 179.931613365546 180.989181249718 171.059060559811 172.873089144473 181.319863949352 179.038752187802 181.459302956398 189.242086812316, 
	175.347853139956 180.493132509235 177.2578428812 184.517143208058 171.169474537616 181.880353653926 175.727537977595 184.604441269869 169.540004898628, 
	181.103289178403 174.944992509985 180.642613381863 179.143677395064 185.445956950321 183.622763290159 183.636564430071 173.249587056935 185.21824057543, 
	181.288352739913 177.663853371957 178.850703910826 178.211632365764 181.685844613992 178.639264502305 188.483289884327 183.125809166649 173.182049720424, 
	178.777832744608 174.717037207769 188.986597488081 190.546847185169 178.055520662495 176.991735583422 182.282596238225 187.951725739917 170.940897197153, 
	176.763152449944 186.677919707096 172.45356859494 179.666182707754 182.726891244295 177.924284755616 183.200306257339 190.139279669601 177.823031925273, 
	175.768608854708 178.12326354079 182.404679249467 176.614597739946 178.320419876662 177.143465976767 175.856743742269 183.549834391574 183.621024712236, 
	184.167672392284 171.230704316198 182.335148506991 181.334797218108 186.395858538357 183.421182544757 183.517577768589 171.907000056173 178.774574166082, 
	. . . . . . . . ., 
	. . . . . . . . ., 
	. . . . . . . . .];
V Sum(Is Missing(mymat`)) = [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 9 9 9];
xx = [21 22 23];

ron_horne · Oct 11, 2021 05:03 AM

all good suggestions here by @txnelson , @ErraticAttack and @gzmorgan0 .

May i offer what i would consider the most basic usage of JSL commands that i think is robust enough to achieve the task under different circumstances.

essentially, i did it manually and made just the minimum script amendments to generalize it in JSL.

would like to know what others think.


dt = Open( "$SAMPLE_DATA/Missing Data Pattern.jmp" );
// this data set actually doesn't have any row that is completely missing. 
dt << add rows(3); // now it does


// now the work begins
colNames = dt << get column names( continuous, string );

// make a missing data pattern table
missingdt = dt << Missing Data Pattern(
	columns( eval (colNames) ),
	Count Missing Value Codes,
	Add Value Colors Property,
	Output Table( "Missing Data Pattern" )
);


// the next two steps should be fast enough in almost any size of data set that i do not think it is worth writing it as a condition for the cases that no row is completely missing.
// select the row representing all missing. if no such row exists nothing will be selected.
missingdt << select where (:Number of columns missing == N items (colNames));
wait (0);
// delete the selected rows from the original table (if anything was selected)
dt << delete rows;
close (missingdt, no save);