- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Get Direct Link
- Report Inappropriate Content
Selecting Duplicate Rows in a Data Table
I have a large data table that I know contains duplicate rows. What I want to do is to easily select these duplicates and either delete or exclude from further analysis. Is there a script that can do this in v8 & v9 of JMP?
Accepted Solutions
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Get Direct Link
- Report Inappropriate Content
Re: Selecting Duplicate Rows in a Data Table
This should give you a table without duplicates. The only differences is that the table contains a new column "N Rows" and that the sorting might have changed.
Under the red triangle at "Source" in the left panel you can choose to copy the script and paste it in a new script window if you want to repeat this with other tables (with the same set of columns).
To get rid of the N Rows column you can add the line
Current data table() << Delete Column(N Rows)
to the script.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Get Direct Link
- Report Inappropriate Content
Re: Selecting Duplicate Rows in a Data Table
Is there one column value (like a key or index) you could use to detect duplicates? Or do you need to look at the entire row?
Give us some more to go on...
Regards,
Peter
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Get Direct Link
- Report Inappropriate Content
Re: Selecting Duplicate Rows in a Data Table
A quick search revealed this thread:
https://community.jmp.com/t5/Discussions/find-and-drop-duplicate-rows/m-p/370#M370
Basically you can use the summary command to produce a table free of duplicates.
Regards,
Peter
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Get Direct Link
- Report Inappropriate Content
Re: Selecting Duplicate Rows in a Data Table
Peter,
apologies for the lack of clarity in the post.
In the example I'm looking at I have 51 rows and 19 columns of both character and numeric data, I'm classing it a duplicate when all of the column values are the same as another row (numeric or character).
I know nothing about the JSL so I tried to just copy & paste into a new script and came up with multiple messages basically saying it doesn't work. Could someone help getting it into the right format?
Script from previous link
dt << CurrentDataTable;
TotalRows = NRows(dt);
eqFlag = 1;
deleteFlag = 0;
For(i=1, i<TotalRows,i++,
For(k =1, k <=9,k++,
If(Column(dt,k)sqbrack i sqbrack !=Column(dt,k)[TotalRows]
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Get Direct Link
- Report Inappropriate Content
Re: Selecting Duplicate Rows in a Data Table
This should give you a table without duplicates. The only differences is that the table contains a new column "N Rows" and that the sorting might have changed.
Under the red triangle at "Source" in the left panel you can choose to copy the script and paste it in a new script window if you want to repeat this with other tables (with the same set of columns).
To get rid of the N Rows column you can add the line
Current data table() << Delete Column(N Rows)
to the script.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Get Direct Link
- Report Inappropriate Content
Re: Selecting Duplicate Rows in a Data Table
Here is a general script that does what I described above. Applies to any table as column number and names need not to be specified.
dt = Current Data Table();
dt2 = Eval(
Parse(
Eval(
Expr(
"dt << Summary(group(" || Substitute(
Substitute( Char( dt << get column names() ), "}", "" ),
"{", ""
) || "))"
)
)
)
);
dt2 << delete column( Eval( N Col( dt ) + 1 ) );
The script populates Group(column1, column2,...) by deleting the curly brackets from a list of columns. Is anybody aware of a simpler, less convoluted way to use a list variable of columns as an argument in Summary() and similar functions?
If duplicates are defined by a subset of columns rather than all columns, one way would be to replace "get column names" with "get selected columns" in the above script. However, the columns must then first be manually selected in the data table, and the summary table updated with the original table to restore remaining columns.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Get Direct Link
- Report Inappropriate Content
Re: Selecting Duplicate Rows in a Data Table
This is the method that I use to find complete duplicates. One more thing you may want to do is to unlock the columns in the summary table.
Here is the script to do that.
dt = current data table();
colList=dt<
show(collist);
for (i=1,i<=nitems(collist),i++,
collist<< lock(0);
);
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Get Direct Link
- Report Inappropriate Content
Re: Selecting Duplicate Rows in a Data Table
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Get Direct Link
- Report Inappropriate Content
Re: Selecting Duplicate Rows in a Data Table
Take a look at "select duplicate rows" in the Scripting Index
Names Default To Here( 1 );
dt = Open( "$SAMPLE_DATA/Big Class.jmp" );
dt << Select duplicate rows( Match( :age, :height ) );