- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Get Direct Link
- Report Inappropriate Content
remove duplicate Columns
Hi:
I have a hugue data table with 400+ columns and 50,000 rows. I have a 30-40 duplicate columns and I would like to remove them from my data table. any ideas?
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Get Direct Link
- Report Inappropriate Content
Re: remove duplicate Columns
I am not aware of any builtin JMP method to do this. However, you can use the RSquare statistic to determine if the columns are identical.
If you are going to do this interactively, the continuous columns can be identified by using
Analyze==>Multivariate Methods==>Multivariate
It will give you a corellation matrix that you can output into a data table, and then use it to identify which columns have a correlation of 1.
The ordinal/nominal columns can be identifyied by creating a "Make Combined Data Table" from the RSquare table in Fit Y by X. Once again, an RSquare of 1.0 will give you the clue that the columns are identical.
Now if you are going to script a solution, the above can be fairly easily automated.
Maybe another community member will have a better solution
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Get Direct Link
- Report Inappropriate Content
Re: remove duplicate Columns
Hi,
I encounter the same problem today, and here is my script to remove the duplicated columns.
It really takes time if you have a table with over thousands of rows. There should be a more efficient way to do it, hope someone can figure it out and share.
Names Default To Here( 1 );
col_name = dt << get column names;
dup_col = {}; //store the duplicated column name
same_c = 0; //to record how many rows are the same while doing columns comparison
/* When the elements are the same in the same row for 2 columns, same_c will be added by 1*/
/*If the same_c is equal to the total row number, then the 2 columns are the same. Insert the column name into dup_col*/
For( i = 1, i <= (N Cols( dt ) - 1), i++,
For( k = i + 1, k <= N Cols( dt ), k++,
(For Each Row( If( :(col_name[i])[Row()] == :(col_name[k])[Row()], same_c++, Wait( 0 ) ) ) ;
If( same_c == N Rows( dt ), Insert Into( dup_col, col_name[k] ), Wait( 0 )) ;
same_c = 0 ; )
)
);
dt << Delete Column( Eval( dup_col ) );
// delete the duplicated columns
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Get Direct Link
- Report Inappropriate Content
Re: remove duplicate Columns
You can use an MD5 checksum, generated by the Blob MD5() function to generate a hash of the values in a column and then compare checksums across columns.
You can use an associative array to check for duplicate md5 checksums as they are generated to avoid having to compare every checksum with every other checksum yourself.
dt = New Table( "Some Duplicate Columns",
Add Rows( 1000 ),
New Column( "Column 1", Numeric, "Continuous", Format( "Best", 12 ), Formula( Random Normal() ) ),
New Column( "Copy of Column 1", Numeric, "Continuous", Format( "Best", 12 ), Formula( :Column 1 ) ),
New Column( "Not a copy of Column 1",
Numeric,
"Continuous",
Format( "Best", 12 ),
Formula( If( Row() == 1, 1, :Column 1 ) )
),
New Column( "Character copy of Column 1", Character, "Nominal", Formula( Char( :Column 1 ) ) ),
New Column( "Character copy of Copy of Column 1",
Character,
"Nominal",
Formula( Char( :Copy of Column 1 ) ),
)
);
//get values into matrix (numeric columns) or list (character columns)
c1 = :column 1 << get values;
c2 = :Copy of Column 1 << get values;
//avoid having to check column data types when converting the values to blob
//by converting matrix or list to character
//
//if you're not lazy you could use the Matrix to Blob() function for numeric columns
//and the Char to Blob() function for character columns
char_c1 = Char( c1 );
char_c2 = Char( c2 );
c1_md5 = Blob MD5( Char To Blob( char_c1 ) );
c2_md5 = Blob MD5( Char To Blob( char_c2 ) );
Show( c1_md5 == c2_md5 );
//check the whole data table using hex version of md5 as keys of associative array
aa = Associative Array();
For( i = 1, i <= N Col( dt ), i++,
md5 = Hex( Blob MD5( Char To Blob( Char( Column( dt, i ) << get values ) ) ) );
//check to see if the associate array already has a key named with this md5
If( aa << Contains( md5 ),
//aa does contain md5
Print( aa[md5] || " is a duplicate of " || (Column( dt, i ) << get name) ),
//aa does not contain md5
//store a key named with this md5 with a value of the column name
aa[md5] = Column( dt, i ) << get name;
);
);
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Get Direct Link
- Report Inappropriate Content
Re: remove duplicate Columns
Why do you have so many duplicate columns? Can this result be prevented?
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Get Direct Link
- Report Inappropriate Content
Re: remove duplicate Columns
Here's a brute force (and untested) alternative that may or may not be 'better' for the intended use:
NamesDefaultToHere(1);
dt = CurrentDataTable();
// Make a matrix 'mat' holding pairwise equality of column values
n = NCol(dt);
mat = J(n, n, .);
for(c1 = 1, c1 <=n, c1++,
for(c2 = c1+1, c2 <= n, c2++,
c1Vals = Column(dt, c1) << getValues;
c2Vals = Column(dt, c2) << getValues;
mat[c1, c2] = all(c1Vals == c2Vals);
);
);
// Convert 'mat' to a table
cols = dt << getColumnNames("String");
dtn = dt << getName;
dt2 = AsTable(mat, << columnNames(cols));
dt2 << setName("Duplicate Columns in "||dtn);
dt2 << NewColumn("Column", Character, Values(cols));
dt2 << moveSelectedColumns({"Column"}, ToFirst);