Solved: Re: Selecting Duplicate Rows in a Data Table

Report Inappropriate Content · Nov 15, 2010 05:42 AM

I'm sure this must have been tackled before but I can't find it on the forum so apologies if the answer is blindingly obvious.

I have a large data table that I know contains duplicate rows. What I want to do is to easily select these duplicates and either delete or exclude from further analysis. Is there a script that can do this in v8 & v9 of JMP?

ms · Nov 15, 2010 11:24 AM

If entire rows are identical for duplicates then you can use the summary command from the tables command. Add all columns to the "Group" box and press ok.

This should give you a table without duplicates. The only differences is that the table contains a new column "N Rows" and that the sorting might have changed.

Under the red triangle at "Source" in the left panel you can choose to copy the script and paste it in a new script window if you want to repeat this with other tables (with the same set of columns).

To get rid of the N Rows column you can add the line

Current data table() << Delete Column(N Rows)

to the script.

View solution in original post

pmroz · Nov 15, 2010 10:11 AM

There are several ways, but they depend on your data. How do you know that a row is a duplicate? I.e. if a row is a duplicate, are all its column values the same as another row?

Is there one column value (like a key or index) you could use to detect duplicates? Or do you need to look at the entire row?

Give us some more to go on...

Regards,
Peter

pmroz · Jan 4, 2017 9:37 AM

A quick search revealed this thread:

https://community.jmp.com/t5/Discussions/find-and-drop-duplicate-rows/m-p/370#M370

Basically you can use the summary command to produce a table free of duplicates.

Regards,
Peter

Report Inappropriate Content · Jan 4, 2017 9:30 AM

Peter,
apologies for the lack of clarity in the post.
In the example I'm looking at I have 51 rows and 19 columns of both character and numeric data, I'm classing it a duplicate when all of the column values are the same as another row (numeric or character).

I know nothing about the JSL so I tried to just copy & paste into a new script and came up with multiple messages basically saying it doesn't work. Could someone help getting it into the right format?

Script from previous link

dt << CurrentDataTable;
TotalRows = NRows(dt);
eqFlag = 1;
deleteFlag = 0;
 
For(i=1, i<TotalRows,i++,
For(k =1, k <=9,k++,
If(Column(dt,k)sqbrack i sqbrack !=Column(dt,k)[TotalRows]

ms · Nov 15, 2010 11:24 AM

If entire rows are identical for duplicates then you can use the summary command from the tables command. Add all columns to the "Group" box and press ok.

This should give you a table without duplicates. The only differences is that the table contains a new column "N Rows" and that the sorting might have changed.

Under the red triangle at "Source" in the left panel you can choose to copy the script and paste it in a new script window if you want to repeat this with other tables (with the same set of columns).

To get rid of the N Rows column you can add the line

Current data table() << Delete Column(N Rows)

to the script.

ms · Jan 4, 2017 9:31 AM

Here is a general script that does what I described above. Applies to any table as column number and names need not to be specified.

dt = Current Data Table();
dt2 = Eval(
     Parse(
          Eval(
               Expr(
                    "dt << Summary(group(" || Substitute(
                              Substitute( Char( dt << get column names() ), "}", "" ),
                         "{", ""
                    ) || "))"
               )
          )
     )
);
dt2 << delete column( Eval( N Col( dt ) + 1 ) );

The script populates Group(column1, column2,...) by deleting the curly brackets from a list of columns. Is anybody aware of a simpler, less convoluted way to use a list variable of columns as an argument in Summary() and similar functions?

If duplicates are defined by a subset of columns rather than all columns, one way would be to replace "get column names" with "get selected columns" in the above script. However, the columns must then first be manually selected in the data table, and the summary table updated with the original table to restore remaining columns.

laural · Dec 2, 2010 10:30 AM

This is the method that I use to find complete duplicates. One more thing you may want to do is to unlock the columns in the summary table.

Here is the script to do that.

dt = current data table();

colList=dt<
show(collist);

for (i=1,i<=nitems(collist),i++, 
collist<< lock(0);

);

annaivashko · Dec 24, 2020 12:25 PM

Thank you! I found this very quick and helpful to deal with duplicates

txnelson · Dec 24, 2020 12:54 PM

Take a look at "select duplicate rows" in the Scripting Index

Names Default To Here( 1 );
dt = Open( "$SAMPLE_DATA/Big Class.jmp" );
dt << Select duplicate rows( Match( :age, :height ) );

Jim