BookmarkSubscribeRSS Feed

Eliminating Duplicate Rows (keeping first duplicate)

I know this has been discussed before, but I'm looking for suggestions for my particular situation. I have a very large data table (~50 columns x 50,000+ rows). I need to check for "duplicate rows", where duplicate means the rows match in three columns (eg. ColA, ColB, and ColC). When duplicates exist, I need to delete all except the first of the matching rows.

Ideally, I'd like to do this with a script as I frequently need to re-pull and re-analyze the updated table. I suspect I can use a summary table to help with this (at least it will identify & select the duplicate rows). However, from there I'm not sure how to automate moving through each set of "matched" rows and delete all but the first.
2 ACCEPTED SOLUTIONS

Accepted Solutions
chungwei

Staff

Joined:

Jun 23, 2011

Solution

Re: Eliminating Duplicate Rows (keeping first duplicate)

I would use summary on the the 3 matching columns. Then join the summary table to the original table, match on the same 3 columns, with the drop duplicates option checked, and select only the relevant columns you want for the output table.
chungwei

Staff

Joined:

Jun 23, 2011

Solution

Re: Eliminating Duplicate Rows (keeping first duplicate)

JMP 14 has a new command "Select Duplicate Rows" under the Rows menu,  so you can do that directly without having to do a join.

dt << select duplicate rows(match(:a, :b, :c));
dt << delete rows;
4 REPLIES
chungwei

Staff

Joined:

Jun 23, 2011

Solution

Re: Eliminating Duplicate Rows (keeping first duplicate)

I would use summary on the the 3 matching columns. Then join the summary table to the original table, match on the same 3 columns, with the drop duplicates option checked, and select only the relevant columns you want for the output table.

Eliminating Duplicate Rows (keeping first duplicate)

Thanks that worked great!  Here's my code for the script version:

 

 

// JMP script to Eliminate duplicate rows "matching Parent, Wafer, & Raw Number"
dt3 = Current Data Table();
dt2 = dt3 << Summary(
    Group( :Parent, :Meas Wafer Id, :Raw Number )
);
dt = dt3 << Join(
    With( dt2 ),
    Update,
    By Matching Columns(
        :Parent = :Parent,
        :Meas Wafer Id = :Meas Wafer Id,
        :Raw Number = :Raw Number
    ),
    Drop multiples( 1, 0 ),
    Name( "Include non-matches" )(0, 0),
    Preserve main table order( 1 ),
);
   
Close(dt2, no save);
Close(dt3, no save);

 

chungwei

Staff

Joined:

Jun 23, 2011

Solution

Re: Eliminating Duplicate Rows (keeping first duplicate)

JMP 14 has a new command "Select Duplicate Rows" under the Rows menu,  so you can do that directly without having to do a join.

dt << select duplicate rows(match(:a, :b, :c));
dt << delete rows;
ms

Super User

Joined:

Jun 23, 2011

Re: Eliminating Duplicate Rows (keeping first duplicate)

That's great!

 

Meanwhile (in JMP 13), this should be equivalent

// Keep first instance of duplicates only
dt << select where(Col Min(Row(), :a, :b, :c) < Row()) << delete rows;