script to select all cols that contain a list of values?

LambdaMarmoset7 · Sep 27, 2023 07:54 PM

I've got a large dt (70k cols, character) that I'm digging through. Most cols have a wide range of values, but a noticeable population of them have only two nuisance values. I want to identify all these cols and then dump them.

Any advice?

These nuisance cols will contain ONLY the terms "**skipped**" or "dne".

I've already dumped all cols that contain a single record value.

txnelson · Sep 27, 2023 09:33 PM

Here is a sample script

Names Default To Here( 1 );

// Create a sample data table to be used for illustration 
dt = New Table( "Example", add rows( 20 ) );
dt << delete columns( :Column 1 );
For( i = 1, i <= 100, i++,
	If( randomuniform( 0, 1 ) > .95,
		If( Random Uniform( 0, 1 ) > .5,
			dt << New Column( "col" || Char( i ),
				character,
				values( {"**skipped**"} )
			),
			dt << New Column( "col" || Char( i ),
				character,
				values( {"dne"} )
			)
		),
		dt << New Column( "col" || Char( i ),
			character,
			set each value( Char( Random Uniform( 0, 100 ) ) )
		)
	)
);

// Below is the simple script that will build a list of columns to be deleted,
// and then delete them
colNames = dt << get column names(string,character);

colsToDelete = {};
For Each( {name}, colnames,
	If(
		N Rows(
			(dt << get rows where(
				as Column( name ) == "**skipped**" | As Column( name ) ==
				"dne"
			))
		),
		Insert Into( colsToDelete, name )
	)
);

dt << delete columns( colsToDelete );

Jim

jthi · Sep 28, 2023 01:49 AM

Depending on your data and the speed you need, you might want to change how you get the list of unique values, but below is one option

Names Default To Here(1);

dt = New Table("Untitled 4",
	Add Rows(3),
	Compress File When Saved(1),
	New Column("Column 1", Character, "Nominal", Set Values({"dne", "**skipped**", "**skipped**"})),
	New Column("Column 2", Character, "Nominal", Set Values({"dne", "a", "b"})),
	New Column("Column 3", Character, "Nominal", Set Values({"**skipped**", "c", "d"})),
	New Column("Column 4", Character, "Nominal", Set Values({"dne", "dne", "**skipped**"}))
);

aa_colmask = ["dne" => 1, "**skipped**" => 1]; // we don't need the values, just keys

cols_to_remove = {};

For Each({colname}, dt << Get Column Names("String"),
	// depending on how much data you have, you might want to use different unique value techniques
	// Summarize, Associative Array and Summary are few options.
	// You can also first check if there are exactly 2 unique values, then check those
	aa_values = Associative Array(Column(dt, colname)); // I use AA here as it is the easiest option (not fastest!)
	aa_values << Remove(aa_colmask);
	If(N Items(aa_values) == 0,
		Insert Into(cols_to_remove, colname);
	);
);

dt << Delete Columns(cols_to_remove);

-Jarmo

script to select all cols that contain a list of values?

Re: script to select all cols that contain a list of values?

Re: script to select all cols that contain a list of values?