cancel
Showing results for 
Show  only  | Search instead for 
Did you mean: 
Try the Materials Informatics Toolkit, which is designed to easily handle SMILES data. This and other helpful add-ins are available in the JMP® Marketplace
Choose Language Hide Translation Bar

script to select all cols that contain a list of values?

I've got a large dt (70k cols, character) that I'm digging through. Most cols have a wide range of values, but a noticeable population of them have only two nuisance values. I want to identify all these cols and then dump them.

Any advice?

 

These nuisance cols will contain ONLY the terms "**skipped**" or "dne".

I've already dumped all cols that contain a single record value.

 

LambdaMarmoset7_0-1695858814574.png

 

2 REPLIES 2
txnelson
Super User

Re: script to select all cols that contain a list of values?

Here is a sample script

Names Default To Here( 1 );

// Create a sample data table to be used for illustration 
dt = New Table( "Example", add rows( 20 ) );
dt << delete columns( :Column 1 );
For( i = 1, i <= 100, i++,
	If( randomuniform( 0, 1 ) > .95,
		If( Random Uniform( 0, 1 ) > .5,
			dt << New Column( "col" || Char( i ),
				character,
				values( {"**skipped**"} )
			),
			dt << New Column( "col" || Char( i ),
				character,
				values( {"dne"} )
			)
		),
		dt << New Column( "col" || Char( i ),
			character,
			set each value( Char( Random Uniform( 0, 100 ) ) )
		)
	)
);

// Below is the simple script that will build a list of columns to be deleted,
// and then delete them
colNames = dt << get column names(string,character);

colsToDelete = {};
For Each( {name}, colnames,
	If(
		N Rows(
			(dt << get rows where(
				as Column( name ) == "**skipped**" | As Column( name ) ==
				"dne"
			))
		),
		Insert Into( colsToDelete, name )
	)
);

dt << delete columns( colsToDelete );
Jim
jthi
Super User

Re: script to select all cols that contain a list of values?

Depending on your data and the speed you need, you might want to change how you get the list of unique values, but below is one option

Names Default To Here(1);

dt = New Table("Untitled 4",
	Add Rows(3),
	Compress File When Saved(1),
	New Column("Column 1", Character, "Nominal", Set Values({"dne", "**skipped**", "**skipped**"})),
	New Column("Column 2", Character, "Nominal", Set Values({"dne", "a", "b"})),
	New Column("Column 3", Character, "Nominal", Set Values({"**skipped**", "c", "d"})),
	New Column("Column 4", Character, "Nominal", Set Values({"dne", "dne", "**skipped**"}))
);

aa_colmask = ["dne" => 1, "**skipped**" => 1]; // we don't need the values, just keys

cols_to_remove = {};

For Each({colname}, dt << Get Column Names("String"),
	// depending on how much data you have, you might want to use different unique value techniques
	// Summarize, Associative Array and Summary are few options.
	// You can also first check if there are exactly 2 unique values, then check those
	aa_values = Associative Array(Column(dt, colname)); // I use AA here as it is the easiest option (not fastest!)
	aa_values << Remove(aa_colmask);
	If(N Items(aa_values) == 0,
		Insert Into(cols_to_remove, colname);
	);
);

dt << Delete Columns(cols_to_remove);
-Jarmo