I know this has been discussed before, but I'm looking for suggestions for my particular situation. I have a very large data table (~50 columns x 50,000+ rows). I need to check for "duplicate rows", where duplicate means the rows match in three columns (e.g., ColA, ColB, and ColC). When duplicates exist, I need to delete all except the first of the matching rows.

Ideally, I'd like to do this with a script, as I frequently need to re-pull and re-analyze the updated table. I suspect I can use a summary table to help with this (at least it will identify and select the duplicate rows). However, from there I'm not sure how to automate moving through each set of "matched" rows and deleting all but the first.

chungwei · Jun 13, 2011 10:56 AM

I would use Summary on the 3 matching columns. Then join the summary table to the original table, matching on the same 3 columns with the "drop duplicates" option checked, and select only the relevant columns you want for the output table.

Jul 7, 2011 02:45 PM

Thanks, that worked great! Here's my code for the script version:

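// JMP script to eliminate duplicate rows matching Parent, Meas Wafer Id, and Raw Number
dt3 = Current Data Table();

// One summary row per unique combination of the three key columns
dt2 = dt3 << Summary(
    Group( :Parent, :Meas Wafer Id, :Raw Number )
);

// Join back to the original on the same keys, dropping multiples from the main table
dt = dt3 << Join(
    With( dt2 ),
    Update,
    By Matching Columns(
        :Parent = :Parent,
        :Meas Wafer Id = :Meas Wafer Id,
        :Raw Number = :Raw Number
    ),
    Drop multiples( 1, 0 ),
    Name( "Include non-matches" )(0, 0),
    Preserve main table order( 1 )
);

// Close the summary and original tables without saving; dt holds the result
Close( dt2, No Save );
Close( dt3, No Save );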

chungwei · Feb 13, 2018 1:20 PM

JMP 14 has a new command "Select Duplicate Rows" under the Rows menu, so you can do that directly without having to do a join.

dt << select duplicate rows(match(:a, :b, :c));
dt << delete rows;

ms · Feb 13, 2018 04:42 PM

That's great!

Meanwhile, in JMP 13, this should be equivalent:

// Keep first instance of duplicates only
dt << select where(Col Min(Row(), :a, :b, :c) < Row()) << delete rows;
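To see why this expression keeps the first row of each group, here is a minimal sketch on a made-up table. The table name, the helper column "First Row", and all values are hypothetical and exist only for illustration. Col Min( Row(), :a, :b, :c ) gives the smallest row number within each a/b/c group, so any row whose own number is larger than that value is a later duplicate.

```jsl
// Hypothetical demo table; names and values are made up for illustration
dt = New Table( "Dedup Demo",
	Add Rows( 5 ),
	New Column( "a", Character, "Nominal", Set Values( {"x", "x", "y", "x", "y"} ) ),
	New Column( "b", Numeric, "Continuous", Set Values( [1, 1, 2, 1, 2] ) ),
	New Column( "c", Numeric, "Continuous", Set Values( [10, 10, 20, 10, 30] ) )
);

// Helper column (hypothetical name): lowest row number in each a/b/c group
dt << New Column( "First Row", Numeric, "Ordinal",
	Formula( Col Min( Row(), :a, :b, :c ) )
);
dt << Run Formulas; // make sure the formula is evaluated before it is used

// Rows whose own number exceeds the group minimum are later duplicates.
// In this demo, rows 2 and 4 repeat the a/b/c values of row 1.
dupRows = dt << Get Rows Where( :First Row < Row() );
Show( dupRows );
```

The helper column is not needed for the one-liner above; it just makes the per-group minimum visible in the table while you check the logic.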

Chily · Jan 15, 2024 10:41 PM

What if I want to keep the last (not the first) of the duplicates? How should I change the script? Thanks.

txnelson · Jan 15, 2024 11:22 PM

Here is a variation on @ms's script that should do what you want:

dt << select where(Col Max(Row(), :a, :b, :c) > Row()) << delete rows;

Jim
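As a quick check of the keep-last variation, here is a minimal sketch on a made-up table (the table name and values are hypothetical). It collects the row numbers first with Get Rows Where instead of selecting them, so you can inspect what would be removed before deleting anything.

```jsl
// Hypothetical demo table; names and values are made up for illustration
dt = New Table( "Keep Last Demo",
	Add Rows( 5 ),
	New Column( "a", Character, "Nominal", Set Values( {"x", "x", "y", "x", "y"} ) ),
	New Column( "b", Numeric, "Continuous", Set Values( [1, 1, 2, 1, 2] ) ),
	New Column( "c", Numeric, "Continuous", Set Values( [10, 10, 20, 10, 30] ) )
);

// Col Max( Row(), ... ) is the highest row number in each a/b/c group,
// so rows below that maximum are the earlier copies.
// In this demo, rows 1 and 2 share their values with row 4.
earlierRows = dt << Get Rows Where( Col Max( Row(), :a, :b, :c ) > Row() );
Show( earlierRows );

// Skip the delete when nothing matched
If( N Rows( earlierRows ) > 0,
	dt << Delete Rows( earlierRows )
);
Show( N Rows( dt ) ); // the three rows that were last in their group remain
```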

Chily · Jan 21, 2024 10:01 AM

Great! It works. Thank you, that's what I need.

Regards, Chily

Eliminating Duplicate Rows (keeping first duplicate)
