Yngeinstn
Level IV

Delete Duplicate Data from Table using Time Stamp

I apologize; I know this question has been asked quite a few times on this forum, but I haven't been able to find a solution to my problem.

I have a large data set (~24,000 rows of test data). Our process is to run a 5-step validation on 5 die, and once we get the results we begin testing, which ultimately means I am going to have repeat data on those 5 sites. This 5-step validation verifies the contact points, which sometimes results in tearing down our test equipment and setting it back up. I am looking to exclude the duplicate data from the first test (touchdown). Our tests are timestamped, so it is easy to distinguish between old and new. If anyone has a great solution to my problem, I am all ears. Attached is the data table from which I am trying to remove the duplicates.

 

Thank You,

Yng


10 REPLIES
txnelson
Super User

Re: Delete Duplicate Data from Table using Time Stamp

I am having trouble identifying, from your attached data table, the rows that contain the data for Test 1 that you want deleted. Could you please provide the row numbers for one or two cases that represent the target data?

Jim
Yngeinstn
Level IV

Re: Delete Duplicate Data from Table using Time Stamp

Sorry about that. I deleted the unnecessary columns and reattached the table.

test_output is the data that I need. RowCol is the identifier of the devices I am testing.

 

I attached the summary table. The RowCol values {r2c3sr1sc1, r2c8sr1sc1, r4c5sr1sc1, r7c3sr1sc1, r7c8sr1sc1} have the duplicate data. I need to delete the initial tests of those devices from the data set; they will have the earliest time stamp. These 5 devices are used in every set-up, which means I need to remove them from each test_output.

txnelson
Super User

Re: Delete Duplicate Data from Table using Time Stamp

I think this will do the trick. It simply finds all of the rows for a given RowCol that have a timestamp (:ts) less than the maximum timestamp for that RowCol, and then deletes them. It takes a couple of minutes to process, but it seems to work.

Names Default To Here( 1 );
dt = Current Data Table();

// remove formula leaving static data
try(dt:RowCol << delete formula);

Summarize( dt, byGroup = by( "RowCol" ) );

For( i = 1, i <= N Items( byGroup ), i++,
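	// For this RowCol group, find every row whose :ts is earlier than the
	// group's latest :ts; Col Max( dt:ts, :RowCol ) evaluates, row by row,
	// to the maximum :ts among rows sharing the current row's RowCol value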
	theRows = dt << get rows where( dt:RowCol == byGroup[i] & :ts < Col Max( dt:ts, :RowCol ) );
	If( N Rows( theRows ) > 1,
		dt << delete rows( theRows )
	);
);
Jim
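
As a side note, if you want to review the candidate rows before deleting them, the same condition can be expressed as a single Select Where, since Col Max() is evaluated per RowCol group. This is only a sketch built on the reply above, not a tested replacement for it:

Names Default To Here( 1 );
dt = Current Data Table();

// Highlight, rather than delete, every row whose :ts is earlier than the
// latest :ts for its RowCol group, so the stale duplicates can be inspected
dt << Select Where( :ts < Col Max( :ts, :RowCol ) );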
Yngeinstn
Level IV

Re: Delete Duplicate Data from Table using Time Stamp

Thank you so much! I am absolutely amazed at the skill you gentlefolk have with regard to scripting. Here I was thinking that it was going to take some elaborate script to accomplish this, and you managed to get it in 10 lines.

 

You and Mr. gzmorgan0 have been extremely helpful.

 

Side note: my company has a P.O. in the cycle that gets us JMP 14, which makes me super excited.

 

Thank you again, Mr. txnelson.

 

Yng (not so Yng anymore)

Yngeinstn
Level IV

Re: Delete Duplicate Data from Table using Time Stamp

I am curious! I am an Industrial Engineer, and coding is very difficult for me. If() and For() have thrown me for a loop (forgive the pun) since the beginning. How were you able to determine 9 as the number?

Thanks Again
txnelson
Super User

Re: Delete Duplicate Data from Table using Time Stamp

Oops... that should be a 1, not a 9.
Jim
Yngeinstn
Level IV

Re: Delete Duplicate Data from Table using Time Stamp

txnelson,

 

One more question regarding removing duplicate data. I forgot to mention that I do a blanket data pull that includes all wafers so that I can do an overall distribution and one-way t-test analysis.

 

The data table that I attached was just the test data for one wafer. Where in the script would I put the wafer_number? Would I include it in the Summarize() part?

 

I have attached the new data table for reference.

 

I had to delete the other one because it was pointed out to me that it included my SQL script that pulls from our servers.

 

txnelson
Super User

Re: Delete Duplicate Data from Table using Time Stamp

I rethought the issue and came up with a totally different methodology. This code will work across wafers.

Names Default To Here( 1 );
dt = Current Data Table();

// Summarize the data to find the maximum :ts for each wafer/rowcol
dtSum = dt << Summary(
	Group( :wafer_number, :RowCol ),
	Max( :ts ),
	Freq( "None" ),
	Weight( "None" ),
	statistics column name format( "column" ),
	link to original table( 0 )
);

// Join the data together matching only where the max :ts for each
// wafer/rowcol
dtFinal = dt << Join(
	With( dtSum ),
	Merge Same Name Columns,
	By Matching Columns( :wafer_number = :wafer_number, :RowCol = :RowCol, :ts = :ts ),
	Drop multiples( 0, 0 ),
	Include Nonmatches( 1, 0 ),
	Preserve main table order( 1 )
);
// Drop the N Rows column that the Summary table carried into the join,
// then close the summary table
dtFinal << delete columns( "N Rows" );
Close( dtSum, nosave );

// Delete all rows that do not match the maximum value of :ts within
// each wafer_number/RowCol group; a Match Flag of 3 marks rows that were
// found in both tables, i.e. rows whose :ts equals the group maximum
dtFinal << select where( Not( :Match Flag == 3 ) );
dtFinal << delete rows;

// Get rid of the Match Flag column which is an artifact of the Join
dtFinal << delete columns("Match Flag");
Jim
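
As a shorter alternative to the Summary/Join approach, the single-statement idea sketched after the first solution also extends to multiple wafers, because Col Max() accepts more than one grouping column. Again, this is only a sketch assuming the combined table has :wafer_number, :RowCol, and :ts columns, not a tested replacement for the accepted script:

Names Default To Here( 1 );
dt = Current Data Table();

// Delete every row whose :ts is earlier than the latest :ts within its
// wafer_number/RowCol group, keeping only the most recent test per device
oldRows = dt << Get Rows Where( :ts < Col Max( :ts, :wafer_number, :RowCol ) );
If( N Rows( oldRows ) > 0,
	dt << Delete Rows( oldRows )
);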
Yngeinstn
Level IV

Re: Delete Duplicate Data from Table using Time Stamp

Thank you, txnelson,

 

I am going to try this out this morning and will incorporate it into my server-pull and table-manipulation script.

 

Much appreciated again.