remove duplicate Columns

Report Inappropriate Content · May 28, 2017 03:15 AM

Hi:

I have a hugue data table with 400+ columns and 50,000 rows. I have a 30-40 duplicate columns and I would like to remove them from my data table. any ideas?

txnelson · May 28, 2017 08:00 AM

I am not aware of any builtin JMP method to do this. However, you can use the RSquare statistic to determine if the columns are identical.

If you are going to do this interactively, the continuous columns can be identified by using

Analyze==>Multivariate Methods==>Multivariate

It will give you a corellation matrix that you can output into a data table, and then use it to identify which columns have a correlation of 1.

The ordinal/nominal columns can be identifyied by creating a "Make Combined Data Table" from the RSquare table in Fit Y by X. Once again, an RSquare of 1.0 will give you the clue that the columns are identical.

Now if you are going to script a solution, the above can be fairly easily automated.

Maybe another community member will have a better solution

Jim

WHTseng · Jul 7, 2017 6:04 AM

Hi,

I encounter the same problem today, and here is my script to remove the duplicated columns.

It really takes time if you have a table with over thousands of rows. There should be a more efficient way to do it, hope someone can figure it out and share.

Names Default To Here( 1 );
	
col_name = dt << get column names;
dup_col = {}; //store the duplicated column name
same_c = 0; //to record how many rows are the same while doing columns comparison

/* When the elements are the same in the same row for 2 columns, same_c will be added by 1*/
/*If the same_c is equal to the total row number, then the 2 columns are the same. Insert the column name into dup_col*/  
For( i = 1, i <= (N Cols( dt ) - 1), i++,
	For( k = i + 1, k <= N Cols( dt ), k++,
		(For Each Row( If( :(col_name[i])[Row()] == :(col_name[k])[Row()], same_c++, Wait( 0 ) ) ) ; 
			If( same_c == N Rows( dt ),	Insert Into( dup_col, col_name[k] ), Wait( 0 )) ; 
			same_c = 0 ; )
	)
);

dt << Delete Column( Eval( dup_col ) );
// delete the duplicated columns

Jeff_Perkinson · Jul 7, 2017 2:09 PM

You can use an MD5 checksum, generated by the Blob MD5() function to generate a hash of the values in a column and then compare checksums across columns.

You can use an associative array to check for duplicate md5 checksums as they are generated to avoid having to compare every checksum with every other checksum yourself.

dt = New Table( "Some Duplicate Columns",
	Add Rows( 1000 ),
	New Column( "Column 1", Numeric, "Continuous", Format( "Best", 12 ), Formula( Random Normal() ) ),
	New Column( "Copy of Column 1", Numeric, "Continuous", Format( "Best", 12 ), Formula( :Column 1 ) ), 

	New Column( "Not a copy of Column 1",
		Numeric,
		"Continuous",
		Format( "Best", 12 ),
		Formula( If( Row() == 1, 1, :Column 1 ) )
	),
	New Column( "Character copy of Column 1", Character, "Nominal", Formula( Char( :Column 1 ) ) ),
	New Column( "Character copy of Copy of Column 1",
		Character,
		"Nominal",
		Formula( Char( :Copy of Column 1 ) ),

	)
);

//get values into matrix (numeric columns) or list (character columns)
c1 = :column 1 << get values;

c2 = :Copy of Column 1 << get values;

//avoid having to check column data types when converting the values to blob
//by converting matrix or list to character
//
//if you're not lazy you could use the Matrix to Blob() function for numeric columns
//and the Char to Blob() function for character columns
char_c1 = Char( c1 );
char_c2 = Char( c2 );

c1_md5 = Blob MD5( Char To Blob( char_c1 ) );
c2_md5 = Blob MD5( Char To Blob( char_c2 ) );

Show( c1_md5 == c2_md5 );

//check the whole data table using hex version of md5 as keys of associative array

aa = Associative Array();

For( i = 1, i <= N Col( dt ), i++,
	md5 = Hex( Blob MD5( Char To Blob( Char( Column( dt, i ) << get values ) ) ) );
	
	//check to see if the associate array already has a key named with this md5
	If( aa << Contains( md5 ),
	
		//aa does contain md5
		Print( aa[md5] || " is a duplicate of " || (Column( dt, i ) << get name) ),
		
		//aa does not contain md5
		//store a key named with this md5 with a value of the column name
		aa[md5] = Column( dt, i ) << get name;
	);
);

-Jeff

Mark_Bailey · Jul 7, 2017 09:20 AM

Why do you have so many duplicate columns? Can this result be prevented?

ian_jmp · Jul 10, 2017 05:23 AM

Here's a brute force (and untested) alternative that may or may not be 'better' for the intended use:

NamesDefaultToHere(1);

dt = CurrentDataTable();

// Make a matrix 'mat' holding pairwise equality of column values
n = NCol(dt);
mat = J(n, n, .);
for(c1 = 1, c1 <=n, c1++,
	for(c2 = c1+1, c2 <= n, c2++,
		c1Vals = Column(dt, c1) << getValues;
		c2Vals = Column(dt, c2) << getValues;
		mat[c1, c2] = all(c1Vals == c2Vals);
	);
);

// Convert 'mat' to a table
cols = dt << getColumnNames("String");
dtn = dt << getName;
dt2 = AsTable(mat, << columnNames(cols));
dt2 << setName("Duplicate Columns in "||dtn);
dt2 << NewColumn("Column", Character, Values(cols));
dt2 << moveSelectedColumns({"Column"}, ToFirst);

remove duplicate Columns

Re: remove duplicate Columns

Re: remove duplicate Columns

Re: remove duplicate Columns

Re: remove duplicate Columns

Re: remove duplicate Columns

Recommended Articles

Get Going with JMP: Essentials for Using JMP

Distribution new features for JMP 17

Analytics with Confidence 2: Models That Don't Generalise