cancel
Showing results for 
Show  only  | Search instead for 
Did you mean: 
Browse apps to extend the software in the new JMP Marketplace
Choose Language Hide Translation Bar

Eliminating Duplicate Rows (keeping first duplicate)

I know this has been discussed before, but I'm looking for suggestions for my particular situation. I have a very large data table (~50 columns x 50,000+ rows). I need to check for "duplicate rows", where duplicate means the rows match in three columns (eg. ColA, ColB, and ColC). When duplicates exist, I need to delete all except the first of the matching rows.

Ideally, I'd like to do this with a script as I frequently need to re-pull and re-analyze the updated table. I suspect I can use a summary table to help with this (at least it will identify & select the duplicate rows). However, from there I'm not sure how to automate moving through each set of "matched" rows and delete all but the first.
2 ACCEPTED SOLUTIONS

Accepted Solutions
chungwei
Staff (Retired)

Re: Eliminating Duplicate Rows (keeping first duplicate)

I would use summary on the the 3 matching columns. Then join the summary table to the original table, match on the same 3 columns, with the drop duplicates option checked, and select only the relevant columns you want for the output table.

View solution in original post

chungwei
Staff (Retired)

Re: Eliminating Duplicate Rows (keeping first duplicate)

JMP 14 has a new command "Select Duplicate Rows" under the Rows menu,  so you can do that directly without having to do a join.

dt << select duplicate rows(match(:a, :b, :c));
dt << delete rows;

View solution in original post

7 REPLIES 7
chungwei
Staff (Retired)

Re: Eliminating Duplicate Rows (keeping first duplicate)

I would use summary on the the 3 matching columns. Then join the summary table to the original table, match on the same 3 columns, with the drop duplicates option checked, and select only the relevant columns you want for the output table.

Eliminating Duplicate Rows (keeping first duplicate)

Thanks that worked great!  Here's my code for the script version:

 

 

// JMP script to Eliminate duplicate rows "matching Parent, Wafer, & Raw Number"
dt3 = Current Data Table();
dt2 = dt3 << Summary(
    Group( :Parent, :Meas Wafer Id, :Raw Number )
);
dt = dt3 << Join(
    With( dt2 ),
    Update,
    By Matching Columns(
        :Parent = :Parent,
        :Meas Wafer Id = :Meas Wafer Id,
        :Raw Number = :Raw Number
    ),
    Drop multiples( 1, 0 ),
    Name( "Include non-matches" )(0, 0),
    Preserve main table order( 1 ),
);
   
Close(dt2, no save);
Close(dt3, no save);

 

chungwei
Staff (Retired)

Re: Eliminating Duplicate Rows (keeping first duplicate)

JMP 14 has a new command "Select Duplicate Rows" under the Rows menu,  so you can do that directly without having to do a join.

dt << select duplicate rows(match(:a, :b, :c));
dt << delete rows;
ms
Super User (Alumni) ms
Super User (Alumni)

Re: Eliminating Duplicate Rows (keeping first duplicate)

That's great!

 

Meanwhile (in JMP 13), this should be equivalent

// Keep first instance of duplicates only
dt << select where(Col Min(Row(), :a, :b, :c) < Row()) << delete rows;

 

Chily
Level III

Re: Eliminating Duplicate Rows (keeping first duplicate)

May I know if I want to keep the last (not first) of duplicates?  How to change the script? Thanks.

txnelson
Super User

Re: Eliminating Duplicate Rows (keeping first duplicate)

Here is a variation on @ms script that should do what you want

dt << select where(Col Max(Row(), :a,:b, :c) > Row()) << delete rows;
Jim
Chily
Level III

Re: Eliminating Duplicate Rows (keeping first duplicate)

Great! it works. Thank you, that's what I need.

 

Regards,Chily