topic Is it possible to use the Missing Data Pattern to speed up data preparation? in Discussions

Is it possible to use the Missing Data Pattern to speed up data preparation?

jpol — Sat, 10 Jun 2023 23:24:27 GMT

Lack of consistent data quality hinders data analysis and deters many from embarking on a Machine Learning journey.

In the semiconductor industry it is expected that processes are continuously improved. Unfortunately, one of the side effects of such endeavours is a change in data structure. One reason for this could be the renaming of the process step.

I have simulated such a case below where 20 lots have been processed through 4 process steps having step name changes.

For root cause analysis it is necessary to compare the probe results from equipment associated with a specific step. In the case below we can see that the data have become stratified.

I know that I could continue to use the "Combine Columns" feature in JMP (Cols\Utilities\Combine columns...) but in reality I have several hundreds of columns which need to be checked and combined.

When using the Tables\Missing Data Pattern feature in JMP I can readily see the data structures and potential columns to be combined.

I was wondering if the information from the Missing Data Pattern could be used to script the combine columns requirement?

There would be several checks required before automating the process such as:
Are columns representing the same Step in the process?
Are the contents of both steps similar?
Is the Missing Data pattern and inverse for paired columns?

Broken data looks like:

Missing Data Pattern:

Re constructed data:

- jpol

Re: Is it possible to use the Missing Data Pattern to speed up data preparation?

txnelson — Tue, 29 Dec 2020 12:45:37 GMT

I think you can use the missing data pattern to identify the columns that have data with no intersection for either the non missing patterns and the missing patterns. This script uses the output table from the Missing Pattern platform, using a slightly modified Missing Data Pattern data table from the JMP Sample data tables. It finds all of the inversely matched patterns and then displays them in a display window

names default to here(1);
dt=New Table( "example",
	Add Rows( 4 ),
	New Column( "Count",
		Numeric,
		"Continuous",
		Format( "Fixed Dec", 10, 0 ),
		Preselect Role( Freq ),
		Set Values( [1, 3, 1, 2] )
	),
	New Column( "Number of columns missing",
		Numeric,
		"Continuous",
		Format( "Fixed Dec", 10, 0 ),
		Set Values( [1, 2, 1, 2] )
	),
	New Column( "Patterns",
		Character,
		"Ordinal",
		List Check( {"0010", "0011", "0100", "0101"} ),
		Set Values( {"0010", "0011", "0100", "0101"} )
	),
	New Column( "Trial 1", Numeric( 1 ), "Nominal", Set Values( [0, 0, 0, 0] ) ),
	New Column( "Trial 2", Numeric( 1 ), "Nominal", Set Values( [0, 0, 1, 1] ) ),
	New Column( "Trial 3", Numeric( 1 ), "Nominal", Set Values( [1, 1, 0, 0] ) ),
	New Column( "Trial 4", Numeric( 1 ), "Nominal", Set Values( [0, 1, 0, 1] ) )
);
cols1List = {};
cols2List = {};
For( i = 4, i <= N Cols( dt ), i++,
	comp = Associative Array( dt << get rows where( As Column( dt, i ) == 1 ) );
	compMissing = Associative Array( dt << get rows where( As Column( dt, i ) == 0 ) );
	If( N Items( comp ) > 0,
		For( k = i + 1, k <= N Cols( dt ), k++,
			comp2 = Associative Array( dt << get rows where( As Column( dt, k ) == 1 ) );
			comp2Missing = Associative Array( dt << get rows where( As Column( dt, k ) == 0 ) );
			comp2 << intersect( comp );
			comp2Missing << intersect( compMissing );
			If( N Items( comp2 << get keys ) == 0 & N Items( comp2Missing << get keys ) == 0,
				Insert Into( cols1List, Column( dt, i ) << get name );
				Insert Into( cols2List, Column( dt, k ) << get name );
			);
		)
	);
);
nw = New Window( "Results",
	Outline Box( "Columns that can be combined",
		Table Box( String Col Box( "Column", cols1List ), String Col Box( "With", cols2List ) )
	)
);

Re: Is it possible to use the Missing Data Pattern to speed up data preparation?

jpol — Wed, 30 Dec 2020 05:45:20 GMT

Thanks Jim for the very fast response and solution.

For those wishing to see your script in action I am including the data building script from the example I provided and your JSL to report the columns which could be combined.

names default to here(1);
// Construct the initial data table
New Table( "Missing data initial",
	Add Rows( 20 ),
	New Column( "LOT_ID",
		Character,
		"Nominal",
		Set Values(
			{"AA_234_xx.1", "AA_234_xx.2", "AA_234_xx.3", "AA_234_xx.4",
			"AA_234_xx.5", "AA_234_xx.6", "AA_234_xx.7", "AA_234_xx.8",
			"AA_234_xx.9", "AA_234_xx.10", "AA_234_xx.11", "AA_234_xx.12",
			"AA_234_xx.13", "AA_234_xx.14", "AA_234_xx.15", "AA_234_xx.16",
			"AA_234_xx.17", "AA_234_xx.18", "AA_234_xx.19", "AA_234_xx.20"}
		),
		Set Display Width( 81 )
	),
	New Column( "Step1.v1",
		Character( 6 ),
		"Nominal",
		Set Values(
			{"SRD_01", "SRD_02", "SRD_01", "SRD_04", "SRD_02", "", "", "", "", "",
			"SRD_01", "SRD_02", "SRD_01", "SRD_04", "SRD_02", "", "", "", "", ""}
		),
		Set Display Width( 58 )
	),
	New Column( "Step1.v2",
		Character( 16 ),
		"Nominal",
		Set Values(
			{"", "", "", "", "", "SRD_01", "SRD_02", "SRD_01", "SRD_04", "SRD_02",
			"", "", "", "", "", "SRD_02", "SRD_02", "SRD_02", "SRD_02", "SRD_02"}
		),
		Set Display Width( 58 )
	),
	New Column( "Step2.v2",
		Character( 16 ),
		"Nominal",
		Set Values(
			{"COATER_02", "COATER_01", "", "", "COATER_02", "COATER_01", "COATER_02",
			"COATER_02", "COATER_02", "COATER_01", "COATER_02", "COATER_02",
			"COATER_01", "COATER_01", "COATER_01", "", "", "", "COATER_01",
			"COATER_01"}
		),
		Set Display Width( 71 )
	),
	New Column( "Step2.v3",
		Character( 16 ),
		"Nominal",
		Set Values(
			{"", "", "COATER_01", "COATER_01", "", "", "", "", "", "", "", "", "",
			"", "", "COATER_02", "COATER_02", "COATER_02", "", ""}
		),
		Set Display Width( 71 )
	),
	New Column( "Step3.v5",
		Character( 16 ),
		"Nominal",
		Set Values(
			{"DRIE_03", "DRIE_04", "DRIE_03", "", "", "", "", "", "", "DRIE_04",
			"DRIE_03", "DRIE_04", "DRIE_03", "DRIE_04", "", "", "", "", "", ""}
		),
		Set Display Width( 58 )
	),
	New Column( "Step3.v6",
		Character( 16 ),
		"Nominal",
		Set Values(
			{"", "", "", "DRIE_04", "DRIE_03", "DRIE_04", "DRIE_03", "DRIE_04",
			"DRIE_03", "", "", "", "", "", "DRIE_03", "DRIE_04", "DRIE_03",
			"DRIE_04", "DRIE_03", "DRIE_04"}
		),
		Set Display Width( 58 )
	),
	New Column( "Step4.v1",
		Character( 16 ),
		"Nominal",
		Set Values(
			{"PRIME_06", "", "PRIME_07", "", "PRIME_06", "", "PRIME_07", "",
			"PRIME_06", "", "PRIME_07", "", "PRIME_06", "", "PRIME_07", "",
			"PRIME_06", "", "PRIME_07", ""}
		),
		Set Display Width( 61 )
	),
	New Column( "Step4.v2",
		Character( 16 ),
		"Nominal",
		Set Values(
			{"", "PRIME_06", "", "PRIME_07", "", "PRIME_06", "", "PRIME_07", "",
			"PRIME_06", "", "PRIME_07", "", "PRIME_06", "", "PRIME_07", "",
			"PRIME_06", "", "PRIME_07"}
		),
		Set Display Width( 61 )
	)
);

// Check the Missings Data Patern
Data Table( "Missing data initial" ) <<
Missing Data Pattern(
	columns(
		:Step1.v1,
		:Step1.v2,
		:Step2.v2,
		:Step2.v3,
		:Step3.v5,
		:Step3.v6,
		:Step4.v1,
		:Step4.v2
	),
	Output Table( "Missing Data Pattern" )
);

// Check for columns which could be combined
//  *** JSL written by txnelson ***
dt=Data Table( "Missing Data Pattern");	
cols1List = {};
cols2List = {};
For( i = 4, i <= N Cols( dt ), i++,
	comp = Associative Array( dt << get rows where( As Column( dt, i ) == 1 ) );
	compMissing = Associative Array( dt << get rows where( As Column( dt, i ) == 0 ) );
	If( N Items( comp ) > 0,
		For( k = i + 1, k <= N Cols( dt ), k++,
			comp2 = Associative Array( dt << get rows where( As Column( dt, k ) == 1 ) );
			comp2Missing = Associative Array( dt << get rows where( As Column( dt, k ) == 0 ) );
			comp2 << intersect( comp );
			comp2Missing << intersect( compMissing );
			If( N Items( comp2 << get keys ) == 0 & N Items( comp2Missing << get keys ) == 0,
				Insert Into( cols1List, Column( dt, i ) << get name );
				Insert Into( cols2List, Column( dt, k ) << get name );
			);
		)
	);
);
nw = New Window( "Results",
	Outline Box( "Columns that can be combined",
		Table Box( String Col Box( "Column", cols1List ), String Col Box( "With", cols2List ) )
	)
);

- Philip

Re: Is it possible to use the Missing Data Pattern to speed up data preparation?

jthi — Wed, 30 Dec 2020 14:34:52 GMT

Went maybe a bit too heavy on scripting.. I didn't use Missing Data Patterns in this solution due to possibility of having more than 2 similar columns (not sure if it is the case here). Most likely these could also be detected with Missing Data Patterns but I made my own solution.

Result table:

New columns are created and old ones are left alone (could be easily deleted). Value in the new column name between brackets tells how many columns were combined to create the new column. Step5.v1 and Step5.v2 are not combined due to row 12 having same value twice. Should be easy to change by modifying function F_checkForInversePatterns in the script to allow columns which have matching data on same row.

This solution doesn't take into account how similar the data is inside columns (would require one additional check). Also column name similarity is simply checked if the Step name before period is same between columns.

Script attached.

Re: Is it possible to use the Missing Data Pattern to speed up data preparation?

jpol — Thu, 31 Dec 2020 12:34:03 GMT

Whilst evaluating txnelson's script on a much larger data table from manufacturing, it was found (Jim found) that if certain rows have no data in all targeted columns the script will not give an output listing. Jim has kindly provided a fix to the script. See below:

Names Default To Here( 1 );
dtRaw = Current Data Table();
colNamesList = dtRaw << get column names( string );
Remove From( colNamesList, 1, 1 );
dt = dtRaw << Missing Data Pattern(
	columns( Eval( colNamesList ) ),
	Output Table( "Missing Data Pattern" )
);

// Get rid of all missing and all present patterns
dt << lock data table( 0 );
dt << select where( Contains( :patterns, "0" ) & !Contains( :Patterns, "1" ) );
Try( dt << delete rows );
dt << select where( Contains( :patterns, "1" ) & !Contains( :Patterns, "0" ) );
Try( dt << delete rows );

// Check for columns which could be combined
//  *** JSL written by txnelson ***
cols1List = {};
cols2List = {};
For( i = 4, i <= N Cols( dt ), i++,
	comp = Associative Array( dt << get rows where( As Column( dt, i ) == 1 ) );
	compMissing = Associative Array( dt << get rows where( As Column( dt, i ) == 0 ) );
	If( N Items( comp ) > 0,
		For( k = i + 1, k <= N Cols( dt ), k++,
			comp2 = Associative Array( dt << get rows where( As Column( dt, k ) == 1 ) );
			comp2Missing = Associative Array( dt << get rows where( As Column( dt, k ) == 0 ) );
			comp2 << intersect( comp );
			comp2Missing << intersect( compMissing );
			If( N Items( comp2 << get keys ) == 0 & N Items( comp2Missing << get keys ) == 0,
				Insert Into( cols1List, Column( dt, i ) << get name );
				Insert Into( cols2List, Column( dt, k ) << get name );
			);
		)
	);
);
nw = New Window( "Results",
	Outline Box( "Columns that can be combined",
		Table Box( String Col Box( "Column", cols1List ), String Col Box( "With", cols2List ) )
	)
);

- Philip

Re: Is it possible to use the Missing Data Pattern to speed up data preparation?

jpol — Thu, 31 Dec 2020 12:44:56 GMT

Hi jthi,

I like your alternative solution also. Especially the way you leave the original columns untouched and create new columns to the data table. Another issue that you address is the possibility to combine from more than 2 columns should the need arise. I have had this need also on a few occasions.

I would be interested in further developing your script to include the other pre-combine checks that you mention.

Thanks,

Philip

Re: Is it possible to use the Missing Data Pattern to speed up data preparation?

jpol — Mon, 08 Feb 2021 06:30:08 GMT

Hi, for those of you interested in using the Missing Data Pattern to improve data quality I am giving a brief update here:

I had been using the script by @txnelson to build better data sets but then I realised that there is an instance where I would like to combine columns despite their respective missing data pattern (and complement) not matching. It is when both columns have missing data in the same rows.

Here is an example, building upon the original example :

I would like to combine columns associated with Step6 as seen above.

It then occurred to me that it would help to know the percent matching between columns prior to any data handling. This would give the possibility to use such information when deciding whether to combine or not.

Having communicated with @txnelson he kindly developed his script.

Attached is the final version.

I am also including below some examples using "real" data where this method has been used to improvement data quality:

Note: Cells coloured blue denote missing data

Case 1:

From

Case 2:

From

There is a vast reduction in the columns and much fewer "empty" cells

As you can imagine, this allows for improved, faster and more efficient root cause analyses.

Attached are the scripts to make the simulated missing data example and the final script which reports "Percent Combinable" and "Columns that can be combined" based on missing data patterns.

- Philip