Subscribe Bookmark RSS Feed

Eliminating Duplicate Rows (keeping first duplicate)

I know this has been discussed before, but I'm looking for suggestions for my particular situation. I have a very large data table (~50 columns x 50,000+ rows). I need to check for "duplicate rows", where duplicate means the rows match in three columns (eg. ColA, ColB, and ColC). When duplicates exist, I need to delete all except the first of the matching rows.

Ideally, I'd like to do this with a script as I frequently need to re-pull and re-analyze the updated table. I suspect I can use a summary table to help with this (at least it will identify & select the duplicate rows). However, from there I'm not sure how to automate moving through each set of "matched" rows and delete all but the first.
2 REPLIES
chungwei

Staff

Joined:

Jun 23, 2011

I would use summary on the the 3 matching columns. Then join the summary table to the original table, match on the same 3 columns, with the drop duplicates option checked, and select only the relevant columns you want for the output table.

Thanks that worked great!  Here's my code for the script version:

// JMP script to Eliminate duplicate rows "matching Parent, Wafer, & Raw Number"

dt3 = Current Data Table();

dt2 = dt3 << Summary(

    Group( :Parent, :Meas Wafer Id, :Raw Number )

);

dt = dt3 << Join(

    With( dt2 ),

    Update,

    By Matching Columns(

        :Parent = :Parent,

        :Meas Wafer Id = :Meas Wafer Id,

        :Raw Number = :Raw Number

    ),

    Drop multiples( 1, 0 ),

    Name( "Include non-matches" )(0, 0),

    Preserve main table order( 1 ),

);

   

Close(dt2, no save);

Close(dt3, no save);