Subscribe Bookmark RSS Feed

How to filter out duplicate IDs leaving only the last duplicate?

nkelleh

Community Trekker

Joined:

Dec 19, 2012

Hi All,

I'm hoping there is a JSL solution where I can just use a '<<Select Where' message to the datatable. I have a datatable of data in which each row of represents a device tested.

There are 2 numeric columns that specify a unique identifier for each row, i.e. an 'ID' column and a 'wafer number' column. There is also another 2 columns that specify date and time respectfully. The problem is that sometimes, devices are retested within the same dataset, which leads to duplicate values for the ID and wafer columns. I'd like to be able to select all unique IDs and in the case of duplicate IDs ensure that only the last value (either by row number or timestamp) is selected.

Below is a sample datatable showing what I'm looking for. The two parts that were retested (highlighted here in red, corresponding retests are in green) are excluded and everything else is selected.

Any help with this would be greatly appreciated.

Thank you.

9240_pastedImage_2.png

1 ACCEPTED SOLUTION

Accepted Solutions
Solution

I would sort the table by ID, WaferNum, date and time (in that order), so that rows with the same ID and WaferNum follow each other in the table, with the older one(s) above the most recent.

I would then create a new column called "Duplicate", with this formula (you should be able to copy it directly and paste into JMP):

If( :ID == :ID[Row() + 1, Empty()] & :WaferNum == :WaferNum[Row() + 1, Empty()],

1,

0

)

It will flag rows that are followed by another row with the same ID and WaferNum (i.e. replicatesthat are not the most recent) with a "1".

Use SelectWhere Duplicate == 0.

7 REPLIES
ron_horne

Super User

Joined:

Jun 23, 2011

one option could be to create one column for date and time (concatenating) . this will allow you to use the table summary by ID and WaferNum and requesting the Max value of date and time. this will produce a new clean table for you.

Ron

nkelleh

Community Trekker

Joined:

Dec 19, 2012

Thanks for this. I did see an example of this all right, but I was hoping there is a way to accomplish this by specifying a condition within the << Select Where() message to the datatable. I get the feeling it should be possible, but not sure how you would go about it.

Also, another consideration is that within a test batch, the date and time stamp don't change, which means that you'll still have a duplicates where the only difference is in the row number when limited to the columns I specified. Cheers in any case.

ron_horne

Super User

Joined:

Jun 23, 2011

take a look at these discussions:

Re: Syntax: How do I create an index column by ID

JMP formula : Rank by group

once you will establish the ranking you can filter.

ron

Solution

I would sort the table by ID, WaferNum, date and time (in that order), so that rows with the same ID and WaferNum follow each other in the table, with the older one(s) above the most recent.

I would then create a new column called "Duplicate", with this formula (you should be able to copy it directly and paste into JMP):

If( :ID == :ID[Row() + 1, Empty()] & :WaferNum == :WaferNum[Row() + 1, Empty()],

1,

0

)

It will flag rows that are followed by another row with the same ID and WaferNum (i.e. replicatesthat are not the most recent) with a "1".

Use SelectWhere Duplicate == 0.

Jeff_Perkinson

Community Manager

Joined:

Jun 23, 2011

Like so many things with JMP there are multiple ways to accomplish this.

The first thing to note is that the Col Maximum() function will take optional ByVariable arguments.

9243_JMPScreenSnapz001.png

So, I think you can use this to construct your condition for selecting rows:


(:Date ==Col Maximum(:Date,:ID,:WaferNum ))&(:Time ==


Col Maximum(:Time,:ID,:WaferNum,:Date ));


-Jeff

-Jeff
nkelleh

Community Trekker

Joined:

Dec 19, 2012

Hi Everyone,

Thanks very much for all your help.

Jeff, your approach would have worked great for me except that on further examination of the data, turns out there were duplicates with the same Date and Time Values (apologies, this was not shown in example).

In the end, I went along with jvillaumie's suggestion. Below is my implementation.


mydt = Current Data Table();



//Sort columns by ID, wafer num, date in that order


mydt << Sort( Replace Table, By(:ID), Order(Ascending));


mydt << Sort( Replace Table, By(:WaferNum), Order(Ascending));


mydt << Sort( Replace Table, By(:Date), Order(Ascending));



//Select duplicates, i.e. if the next row has the same values for ID and waferNum as the current row, i.e. everything but the last duplicate


mydt << Select Where(


      :ID == :ID[Row() + 1] & :WaferNum == :WaferNum[Row() + 1]


);



//Delete Selection


mydt <<DeleteRows;



jvillaumie

Community Trekker

Joined:

Jun 9, 2014

Hi nkelleh,

when I said "I would sort the table by ID, WaferNum, date and time (in that order)", I meant

Current Data Table() << Sort(By(:ID, :WaferNum, :Date, :Time), Order(Ascending));

I have a very strong suspicion that running several Sort (single column as an argument for each) one after the other, like you did, will not give the same results as running a single Sort with several columns as arguments.

Hope this helps