Solved: Re: How to filter out duplicate IDs leaving only the last duplicate?

nkelleh · Oct 18, 2016 4:55 PM

Hi All,

I'm hoping there is a JSL solution where I can just use a '<<Select Where' message to the datatable. I have a datatable of data in which each row of represents a device tested.

There are 2 numeric columns that specify a unique identifier for each row, i.e. an 'ID' column and a 'wafer number' column. There is also another 2 columns that specify date and time respectfully. The problem is that sometimes, devices are retested within the same dataset, which leads to duplicate values for the ID and wafer columns. I'd like to be able to select all unique IDs and in the case of duplicate IDs ensure that only the last value (either by row number or timestamp) is selected.

Below is a sample datatable showing what I'm looking for. The two parts that were retested (highlighted here in red, corresponding retests are in green) are excluded and everything else is selected.

Any help with this would be greatly appreciated.

Thank you.

jvillaumie · Jun 15, 2018 6:28 AM

I would sort the table by ID, WaferNum, date and time (in that order), so that rows with the same ID and WaferNum follow each other in the table, with the older one(s) above the most recent.

I would then create a new column called "Duplicate", with this formula (you should be able to copy it directly and paste into JMP):

If( :ID == :ID[Row() + 1, Empty()] & :WaferNum == :WaferNum[Row() + 1, Empty()],
1,
0
)

It will flag rows that are followed by another row with the same ID and WaferNum (i.e. replicatesthat are not the most recent) with a "1".

Use SelectWhere Duplicate == 0.

View solution in original post

ron_horne · Jul 16, 2015 06:43 AM

one option could be to create one column for date and time (concatenating) . this will allow you to use the table summary by ID and WaferNum and requesting the Max value of date and time. this will produce a new clean table for you.

Ron

nkelleh · Jul 16, 2015 09:02 AM

Thanks for this. I did see an example of this all right, but I was hoping there is a way to accomplish this by specifying a condition within the << Select Where() message to the datatable. I get the feeling it should be possible, but not sure how you would go about it.

Also, another consideration is that within a test batch, the date and time stamp don't change, which means that you'll still have a duplicates where the only difference is in the row number when limited to the columns I specified. Cheers in any case.

ron_horne · Oct 28, 2016 6:19 AM

take a look at these discussions:

Re: Syntax: How do I create an index column by ID

JMP formula : Rank by group

once you will establish the ranking you can filter.

ron

jvillaumie · Jun 15, 2018 6:28 AM

I would sort the table by ID, WaferNum, date and time (in that order), so that rows with the same ID and WaferNum follow each other in the table, with the older one(s) above the most recent.

I would then create a new column called "Duplicate", with this formula (you should be able to copy it directly and paste into JMP):

If( :ID == :ID[Row() + 1, Empty()] & :WaferNum == :WaferNum[Row() + 1, Empty()],
1,
0
)

It will flag rows that are followed by another row with the same ID and WaferNum (i.e. replicatesthat are not the most recent) with a "1".

Use SelectWhere Duplicate == 0.

Jeff_Perkinson · Oct 18, 2016 4:56 PM

Like so many things with JMP there are multiple ways to accomplish this.

The first thing to note is that the Col Maximum() function will take optional ByVariable arguments.

So, I think you can use this to construct your condition for selecting rows:


(:Date ==Col Maximum(:Date,:ID,:WaferNum ))&(:Time ==

Col Maximum(:Time,:ID,:WaferNum,:Date ));

-Jeff

nkelleh · Jul 21, 2015 09:49 AM

Hi Everyone,

Thanks very much for all your help.

Jeff, your approach would have worked great for me except that on further examination of the data, turns out there were duplicates with the same Date and Time Values (apologies, this was not shown in example).

In the end, I went along with jvillaumie's suggestion. Below is my implementation.


mydt = Current Data Table();


//Sort columns by ID, wafer num, date in that order

mydt << Sort( Replace Table, By(:ID), Order(Ascending));

mydt << Sort( Replace Table, By(:WaferNum), Order(Ascending));

mydt << Sort( Replace Table, By(:Date), Order(Ascending));


//Select duplicates, i.e. if the next row has the same values for ID and waferNum as the current row, i.e. everything but the last duplicate

mydt << Select Where(

      :ID == :ID[Row() + 1]  & :WaferNum == :WaferNum[Row() + 1]

);


//Delete Selection

mydt <<DeleteRows;

jvillaumie · Jun 15, 2018 6:28 AM

Hi nkelleh,

when I said "I would sort the table by ID, WaferNum, date and time (in that order)", I meant

Current Data Table() << Sort(By(:ID, :WaferNum, :Date, :Time), Order(Ascending));

I have a very strong suspicion that running several Sort (single column as an argument for each) one after the other, like you did, will not give the same results as running a single Sort with several columns as arguments.

Hope this helps