cancel
Showing results for 
Show  only  | Search instead for 
Did you mean: 
Check out the JMP® Marketplace featured Capability Explorer add-in
Choose Language Hide Translation Bar

Selecting Duplicate Rows in a Data Table

I'm sure this must have been tackled before but I can't find it on the forum so apologies if the answer is blindingly obvious.

I have a large data table that I know contains duplicate rows. What I want to do is to easily select these duplicates and either delete or exclude from further analysis. Is there a script that can do this in v8 & v9 of JMP?
1 ACCEPTED SOLUTION

Accepted Solutions
ms
Super User (Alumni) ms
Super User (Alumni)

Re: Selecting Duplicate Rows in a Data Table

If entire rows are identical for duplicates then you can use the summary command from the tables command. Add all columns to the "Group" box and press ok.

This should give you a table without duplicates. The only differences is that the table contains a new column "N Rows" and that the sorting might have changed.

Under the red triangle at "Source" in the left panel you can choose to copy the script and paste it in a new script window if you want to repeat this with other tables (with the same set of columns).

To get rid of the N Rows column you can add the line

Current data table() << Delete Column(N Rows)

to the script.

View solution in original post

8 REPLIES 8
pmroz
Super User

Re: Selecting Duplicate Rows in a Data Table

There are several ways, but they depend on your data. How do you know that a row is a duplicate? I.e. if a row is a duplicate, are all its column values the same as another row?

Is there one column value (like a key or index) you could use to detect duplicates? Or do you need to look at the entire row?

Give us some more to go on...

Regards,
Peter
pmroz
Super User

Re: Selecting Duplicate Rows in a Data Table

A quick search revealed this thread:

 

https://community.jmp.com/t5/Discussions/find-and-drop-duplicate-rows/m-p/370#M370

Basically you can use the summary command to produce a table free of duplicates.

Regards,
Peter

Re: Selecting Duplicate Rows in a Data Table

Peter,
apologies for the lack of clarity in the post.
In the example I'm looking at I have 51 rows and 19 columns of both character and numeric data, I'm classing it a duplicate when all of the column values are the same as another row (numeric or character).

I know nothing about the JSL so I tried to just copy & paste into a new script and came up with multiple messages basically saying it doesn't work. Could someone help getting it into the right format?

Script from previous link

 

dt << CurrentDataTable;
TotalRows = NRows(dt);
eqFlag = 1;
deleteFlag = 0;
 
For(i=1, i<TotalRows,i++,
For(k =1, k <=9,k++,
If(Column(dt,k)sqbrack i sqbrack !=Column(dt,k)[TotalRows]
ms
Super User (Alumni) ms
Super User (Alumni)

Re: Selecting Duplicate Rows in a Data Table

If entire rows are identical for duplicates then you can use the summary command from the tables command. Add all columns to the "Group" box and press ok.

This should give you a table without duplicates. The only differences is that the table contains a new column "N Rows" and that the sorting might have changed.

Under the red triangle at "Source" in the left panel you can choose to copy the script and paste it in a new script window if you want to repeat this with other tables (with the same set of columns).

To get rid of the N Rows column you can add the line

Current data table() << Delete Column(N Rows)

to the script.
ms
Super User (Alumni) ms
Super User (Alumni)

Re: Selecting Duplicate Rows in a Data Table

Here is a general script that does what I described above. Applies to any table as column number and names need not to be specified.

dt = Current Data Table();
dt2 = Eval(
     Parse(
          Eval(
               Expr(
                    "dt << Summary(group(" || Substitute(
                              Substitute( Char( dt << get column names() ), "}", "" ),
                         "{", ""
                    ) || "))"
               )
          )
     )
);
dt2 << delete column( Eval( N Col( dt ) + 1 ) );

The script populates Group(column1, column2,...) by deleting the curly brackets from a list of columns. Is anybody aware of a simpler, less convoluted way to use a list variable of columns as an argument in Summary() and similar functions?

 

If duplicates are defined by a subset of columns rather than all columns, one way would be to replace "get column names" with  "get selected columns" in the above script. However, the columns must then first be manually selected in the data table, and the summary table updated with the original table to restore remaining columns.

 

laural
Level I

Re: Selecting Duplicate Rows in a Data Table

This is the method that I use to find complete duplicates. One more thing you may want to do is to unlock the columns in the summary table.

Here is the script to do that.

dt = current data table();

colList=dt<
show(collist);

for (i=1,i<=nitems(collist),i++, 
collist<< lock(0);

);
annaivashko
Level II

Re: Selecting Duplicate Rows in a Data Table

Thank you! I found this very quick and helpful to deal with duplicates
txnelson
Super User

Re: Selecting Duplicate Rows in a Data Table

Take a look at "select duplicate rows" in the Scripting Index

Names Default To Here( 1 );
dt = Open( "$SAMPLE_DATA/Big Class.jmp" );
dt << Select duplicate rows( Match( :age, :height ) );
Jim