Solved: Re: Removing Duplicate Rows from Current Table

NilD · May 7, 2018 7:56 AM

Hi,

I have 256k rows in a parent table right now. While doing some analysis, I found out I have few duplicate rows. This table constantly gets new data ammended through script. And there is chance source data may have duplicate again.

So I am trying to create a clean up script, that i can run before executing some of my analysis. I searched other discussions here and found a solution (https://community.jmp.com/t5/Discussions/Eliminating-Duplicate-Rows-keeping-first-duplicate/td-p/342...), which i customized for my application

dt3 = Current Data Table();
dt2 = dt3 << Summary(
group(:CL1, :CL2, :CL3, :CL4)
);
dt3 << join(
with (dt2),
update,
by matching columns (
:CL1 = :CL1,
:CL2 = :CL2,
:CL3 = :CL3,
:CL4 = :CL5
),
drop multiples(1,0),
name("include non-matches")(0,0),
preserve main table order (1),
);

This solution did what i was looking for. However, its not able to update the parent table. It creates a new table with the right rows, but my parent table is unchanged. How can I use this code on my parent table to delete rows from the parent table, which in this case is "Currnet Data Table()."

Also, (i already know, but wanted to confirm), I have 25 columns in my data set. Should i run above script to match all of 25 columns to be accurate or as long as i have time column, i can run less number of columns (like 4 per above example)? I didnt want to slow script as i will have massive amount of data in the JMP file.

Thanks in advance!

gzmorgan0 · May 6, 2018 04:48 PM

NiiD,

If time is your main identifier, one optionis to create a new column for Dif(:Datetime,1)or get rows where Dif(:Datetime,1) ==0 it should be a duplicate. If however, there can be records with the same datetime, but different tools or other identifiers, then sort by those identifiers and Datetime and then get rows where or select where identifier == Lag(identifier,1) & Dif(:Datetime,1) ==0. Then delete those rows.

View solution in original post

gzmorgan0 · May 3, 2018 03:57 PM

NiiD,

You are missing key pieces of information about your data: is there one or several columns that should represent a unique record. For example:

a multi-step manufacturing line, like semiconductors typically would use columns Lot#, Wafer#, Operation#, Equip ID & Datetime
a repeated measures drug study: Subject ID and Datetime
etc.

Then if these columns should represent unique records (keys) then other columns should not matter and you need to decide if you want to keep the first or the last record. In other words, does your table order provide information about the duplicates? For example, databases are often ordered by most recent to older.

The attched script creates a table of 256k unique rows and 25 columns. Randomly selects 200 rows. The first 100 are duplicated once the second 100 are duplicated twice, and the table order is shuffled. So now there are 256,300 rows.

It was created with 2 string columns and 2 integer colums so 4 columns that represent the unique keys. And this script keeping the highest ordered row in the table. The intermediary tables are set as invisible for testing performance without rendering time. However, to understand what the code is doing, you might want to convert them to be visible.

I've run this 6 times and each took 2.0x seconds on my PC. The time will vary depending upon the number of dupes.

The script finds the rows to delete by working on only the duplicated rows which should be much smaller than 256k.

Final note. Instead of deleting those rows from your main table you might want to exclude them or maintain a column of Keep or Save so you can reset the rowstates to exclude them.

Good Luck!

NilD · May 7, 2018 7:58 AM

Thanks! i will try to use your code to see if i can adapt it.

I do have time column, which i can use sort to get identical rows next to eachother. All other column are actual process data, so there arent any duplicates (other than process identifiers like product name, product type). But the time column can be used to sort and have identical rows next to eachother. then question becomes is selecting one of the identical row.

I found JMP 14 has new feature, but that is not supported in JMP 13 (my version). So perhaps if there is small easy code to select duplicate row (one of two entry), then i can simply delete them.

dt3 = Data Table( "ProcessDataTable" ) << Select Duplicate Rows();

Background on duplicates: my script pulls data from a CSV file that is generated weekly, which gets imported into JMP master table. The weekly data is past seven days from time it was generated. So if for one week the CSV was generated on monday at 12:00 PM and next week it was generated at monday 8:00 AM then there could be duplicate of three hours.

That is what i am looking to exclude once data comes into JMP. There is not ability to eliminate duplicate before it reaches JMP thus I am hoping for a solution in JMP.

gzmorgan0 · May 6, 2018 04:48 PM

NiiD,

If time is your main identifier, one optionis to create a new column for Dif(:Datetime,1)or get rows where Dif(:Datetime,1) ==0 it should be a duplicate. If however, there can be records with the same datetime, but different tools or other identifiers, then sort by those identifiers and Datetime and then get rows where or select where identifier == Lag(identifier,1) & Dif(:Datetime,1) ==0. Then delete those rows.

NilD · Jun 6, 2018 12:30 PM

Thanks GZMORGAN0,

I was able to use your methodology to eliminate any duplicates. Created new column with DIF formula. Then using script sorted time column, selected any zeros (diff = 0 meaning duplicate row) and delete the duplicates.