BookmarkSubscribeRSS Feed

Eliminating Duplicate Rows (keeping first duplicate)

I know this has been discussed before, but I'm looking for suggestions for my particular situation. I have a very large data table (~50 columns x 50,000+ rows). I need to check for "duplicate rows", where duplicate means the rows match in three columns (eg. ColA, ColB, and ColC). When duplicates exist, I need to delete all except the first of the matching rows.

Ideally, I'd like to do this with a script as I frequently need to re-pull and re-analyze the updated table. I suspect I can use a summary table to help with this (at least it will identify & select the duplicate rows). However, from there I'm not sure how to automate moving through each set of "matched" rows and delete all but the first.
2 ACCEPTED SOLUTIONS

Accepted Solutions
chungwei

Staff

Joined:

Jun 23, 2011

Solution
I would use summary on the the 3 matching columns. Then join the summary table to the original table, match on the same 3 columns, with the drop duplicates option checked, and select only the relevant columns you want for the output table.
chungwei

Staff

Joined:

Jun 23, 2011

Solution

JMP 14 has a new command "Select Duplicate Rows" under the Rows menu,  so you can do that directly without having to do a join.

dt << select duplicate rows(match(:a, :b, :c));
dt << delete rows;
4 REPLIES
chungwei

Staff

Joined:

Jun 23, 2011

Solution
I would use summary on the the 3 matching columns. Then join the summary table to the original table, match on the same 3 columns, with the drop duplicates option checked, and select only the relevant columns you want for the output table.

Thanks that worked great!  Here's my code for the script version:

 

 

// JMP script to Eliminate duplicate rows "matching Parent, Wafer, & Raw Number"
dt3 = Current Data Table();
dt2 = dt3 << Summary(
    Group( :Parent, :Meas Wafer Id, :Raw Number )
);
dt = dt3 << Join(
    With( dt2 ),
    Update,
    By Matching Columns(
        :Parent = :Parent,
        :Meas Wafer Id = :Meas Wafer Id,
        :Raw Number = :Raw Number
    ),
    Drop multiples( 1, 0 ),
    Name( "Include non-matches" )(0, 0),
    Preserve main table order( 1 ),
);
   
Close(dt2, no save);
Close(dt3, no save);

 

chungwei

Staff

Joined:

Jun 23, 2011

Solution

JMP 14 has a new command "Select Duplicate Rows" under the Rows menu,  so you can do that directly without having to do a join.

dt << select duplicate rows(match(:a, :b, :c));
dt << delete rows;
ms

Super User

Joined:

Jun 23, 2011

That's great!

 

Meanwhile (in JMP 13), this should be equivalent

// Keep first instance of duplicates only
dt << select where(Col Min(Row(), :a, :b, :c) < Row()) << delete rows;