Discussions

robot · Oct 31, 2013 04:48 PM

Hi,

I am using JMP10. I am trying to create a script to delete duplicate rows of data in a data table. An example of my attempt is as follows:

// Start.

// Create data table.

dt = open("$SAMPLE_DATA/Big Class.jmp");

dt << Select Randomly(5);

subdt = dt << Subset(Output Table Name("subset"), Invisible);

dt << Concatenate(subdt, Append to First Table);

Close(subdt, No Save);

dt << Clear Select;

// Find and delete duplicate rows for bigClass.

icol = dt << New Column( "Index", Numeric, Continuous, Formula( Sequence( 1, N Row( dt ), 1, 1 ) ) );

repcol = dt << New Column( "Repeat Data",

Numeric,

Continuous,

Formula( If( :Index == Col Minimum( :Index, :name, :age, :sex, :height, :weight ), 1, 0 ) )

);

dt << Select Where( As Column(repcol) == 0 );

dt << Delete Rows();

dt << Delete Columns( {icol, repcol} );

// Find and delete duplicate rows for generic data table.

colnames = dt << Get Column Names();

icol = dt << New Column( "Index", Numeric, Continuous, Formula( Sequence( 1, N Row( dt ), 1, 1 ) ) );

repcol = dt << New Column( "Repeat Data",

Numeric,

Continuous,

Formula( If( :Index == Col Minimum( :Index, /* Insert colnames here. */ ), 1, 0 ) )

);

dt << Select Where( As Column(repcol) == 0 );

dt << Delete Rows();

dt << Delete Columns( {icol, repcol} );

// End.

My problem is I would like to make this script run on a generic data table. How can I insert my colnames list into my formula? Is there a more clever way to do this?

Thanks!

ms · Oct 31, 2013 07:10 PM

Brilliant idea to use Col Min() to find duplicates!

Below is an alternative approach based on your idea that appears to work. It is simpler because there is no need for any temporary formula columns.

// Create example data table.

Open( "$SAMPLE_DATA/Big Class.jmp" ) << Concatenate( Open( "$SAMPLE_DATA/Big Class.jmp" ), Append to First Table );

// Delete duplicates from generic table

dt = Current Data Table();

col_min = Parse( "Col Min( Row(), :Name(\!"" || Concat Items( dt << get column names( string ), "\!"), :Name(\!"" ) || "\!"))" );

duplicates = [];

For Each Row( If( col_min != Row(), duplicates |/= Row() ) );

dt << delete rows( duplicates );

View solution in original post

ms · Oct 31, 2013 07:10 PM

Brilliant idea to use Col Min() to find duplicates!

Below is an alternative approach based on your idea that appears to work. It is simpler because there is no need for any temporary formula columns.

// Create example data table.

Open( "$SAMPLE_DATA/Big Class.jmp" ) << Concatenate( Open( "$SAMPLE_DATA/Big Class.jmp" ), Append to First Table );

// Delete duplicates from generic table

dt = Current Data Table();

col_min = Parse( "Col Min( Row(), :Name(\!"" || Concat Items( dt << get column names( string ), "\!"), :Name(\!"" ) || "\!"))" );

duplicates = [];

For Each Row( If( col_min != Row(), duplicates |/= Row() ) );

dt << delete rows( duplicates );

robot · Oct 31, 2013 07:50 PM

Thanks MS! Very clever. I will still have to study exactly how it works.

ms · Oct 31, 2013 08:27 PM

I was too quick. The first posted version would fail for complex column names that require the :Name("e.g. ∆X") construct. I edited the code above and now it should work more generally.

Jeff_Perkinson · Nov 1, 2013 11:03 AM

It is clever to use Col Min() with an index column to locate duplicates.

I usually use Tables->Summary to get a data table free of duplicates. Just Group by all of the columns that would define a duplicate. This has the added advantage of providing a column (N Rows) that tells you how many duplicate rows there were in the original table.

-Jeff

robot · Nov 7, 2013 01:06 PM

Thanks Jeff.

Discussions

Insert list of column names into a formula

Re: Insert list of column names into a formula

Re: Insert list of column names into a formula

Re: Insert list of column names into a formula

Re: Insert list of column names into a formula

Re: Insert list of column names into a formula

Re: Insert list of column names into a formula

Recommended Articles