Solved: Re: Selecting and removing duplicates row with condition

OtisZeca · Aug 8, 2019 02:54 AM

Hi guys,

I've searched the forum but didn't find anything close to what I want to achieve, or maybe I missed it.

Anyway, I need help with the table above. As you can see, I have duplicates, highlighted by color. The serial number is the same across many rows because each part is inspected in different sections (shown as 1A, 1B, 2A, 2B, 3A, 3B, 4A, 4B). If a part fails inspection it is usually re-inspected, which shows in the re-test column (0 for no retest, 1 for the first retest, etc.). Within the same serial number I can only tell the rows apart by the retest number or the date, so I would like to keep the latest re-test result and delete the earlier duplicates. For the rows in the blue or red boxes above, that means keeping re-test 1 and removing re-test 0.

This is a bit of a headache for me, so I would appreciate the community's advice on a script that finds the latest re-test within each serial number and section, keeps it, and removes the earlier duplicates. I hope you have some input on this. Thank you.

txnelson · Aug 8, 2019 04:40 AM

There is a function called Select Duplicate Rows() that can be used for this; however, it works by finding rows that follow a unique row. Therefore, the table needs to be sorted in reverse order before using the function. Below is a sample script to do this. Since I did not have a sample data table to run it on, the script has not been tested, but it should be very close to what you need.

Documentation on the function can be found in the Scripting Index:

Help > Scripting Index

Names Default To Here( 1 );
dt = Current Data Table();

// Add a new column to preserve the current order
dt << New Column( "rowNum", formula( Row() ) );
// Remove the formula to convert values to static values
:rowNum << delete property( "formula" );

// Sort the data table in reverse order, since one wants to
// keep the most recent value
dt << sort( by( rowNum ), order( descending ), Replace Table( 1 ) );

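// Find the duplicate rows within each serial number and section
// (matching on :SerialNo and :Section is an assumption based on the
// question; adjust the column names to suit your table)
dt << Select Duplicate Rows( Match( :SerialNo, :Section ) );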

// Delete Duplicates
dt << delete rows;

// Reorder the table
dt << sort( by( rowNum ), order( ascending ), Replace Table( 1 ) );

// Delete rowNum column
dt << delete columns( "rowNum" );

Jim
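
If the table's row order does not already reflect the inspection order, one variation is to sort on the retest count itself before selecting duplicates. The following is an untested sketch under assumptions: the sample table, its values, and the column names :SerialNo, :Section and :Retest are hypothetical stand-ins for the real data.

```jsl
Names Default To Here( 1 );

// Hypothetical sample data; names and values are assumptions for illustration
dt = New Table( "Inspection example",
	New Column( "SerialNo", Character, "Nominal", Set Values( {"S001", "S001", "S001", "S002"} ) ),
	New Column( "Section", Character, "Nominal", Set Values( {"1A", "1A", "1B", "1A"} ) ),
	New Column( "Retest", Numeric, "Ordinal", Set Values( [0, 1, 0, 0] ) )
);

// Put the highest retest first within each SerialNo/Section group
dt << Sort(
	By( :SerialNo, :Section, :Retest ),
	Order( Ascending, Ascending, Descending ),
	Replace Table
);

// Select every row after the first one in each group, i.e. the older retests
dt << Select Duplicate Rows( Match( :SerialNo, :Section ) );

// Delete only when something is actually selected
If( N Rows( dt << Get Selected Rows ) > 0,
	dt << Delete Rows
);
```

The intent is that only the retest 0 row for S001 in section 1A is removed, because it is the only row with a later retest in the same group. Unlike the script above, this leaves the table sorted by serial number and section rather than in its original order.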

OtisZeca · Aug 8, 2019 09:02 PM

I made a few amendments to suit my table. Thanks!

u757707 · Oct 25, 2023 12:55 AM

If you want to remove duplicate rows that match on all columns except one (or a few), this is my working solution.
The advantage is that there is no need to specify every column name.

// Delete duplicate rows where only column ID differs.
// dtProcess is assumed to be a reference to the data table being cleaned.
column_list = dtProcess << Get Column Names( string );
Remove From( column_list, As List( Loc( column_list, "ID" ) ) ); // remove "ID"
dtProcess << Select Duplicate Rows( Match( As List( column_list ) ) );
doublerows = dtProcess << Get Selected Rows;
num_selected = N Rows( doublerows );
If( num_selected > 0,
	dtProcess << Delete Rows;
	Print( "Removed " || Char( num_selected ) || " rows from process table." ),
	Print( "No duplicate rows in process table, excluding column ID." )
);

ms · Aug 8, 2019 06:43 AM

If you have an earlier version of JMP, <<Select Duplicate Rows may not be available. In that case, try the approach below. The code should select and delete the duplicate rows (based on matching SerialNo and Section) whose date is earlier than the latest date in their group. Sorting does not matter.

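dt = Current Data Table();
// Select every row whose date is earlier than the latest date
// for its SerialNo/Section combination, then delete those rows
dt << Select Where( Col Max( :Date, :SerialNo, :Section ) > :Date );
dt << Delete Rows;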
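
Since the question mentions that the retest count also distinguishes the rows, the same idea could be keyed on that column instead of the date. This is a minimal, untested sketch that assumes a numeric :Retest column; the sample table and its values are made up for illustration.

```jsl
Names Default To Here( 1 );

// Hypothetical sample data; names and values are assumptions for illustration
dt = New Table( "Retest example",
	New Column( "SerialNo", Character, "Nominal", Set Values( {"S001", "S001", "S001", "S002"} ) ),
	New Column( "Section", Character, "Nominal", Set Values( {"1A", "1A", "1B", "1A"} ) ),
	New Column( "Retest", Numeric, "Ordinal", Set Values( [0, 1, 0, 0] ) )
);

// Select rows whose Retest is below the highest Retest
// found for the same SerialNo/Section combination
dt << Select Where( :Retest < Col Max( :Retest, :SerialNo, :Section ) );

// Delete only when something is actually selected
If( N Rows( dt << Get Selected Rows ) > 0,
	dt << Delete Rows
);
```

Using :Retest avoids depending on the date column, which may help if several retests share the same date. As with the date version, the row order does not matter.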

OtisZeca · Aug 8, 2019 09:04 PM

Great help! Thanks!

I was building something like this, but somehow I got the combination of columns wrong in the selection.

Thank you!