Share your ideas for the JMP Scripting Unsession at Discovery Summit by September 17th. We hope to see you there!
Choose Language Hide Translation Bar
Highlighted

Data QA and Reformatting

Hi all, 

 

A common slow down procedure in my workflow is that a client is sending us a spreadsheet with results containing '< 0.01' or some other arbitrary number for samples that are below a detection limit. We have an internal R code that calculates imputation for these values, but we need to reformat the data set (example attached here). From the file we need to create two separate CSVs, one with non-detects replaced with zero (Detect file attached) and another one where all the values are zeros and the non-detects are replaced with the detection limit without the '<' symbol (Non-Detect attached file). 

Does anyone know how to automate this process to create these two CSVs from the data set? Would be thrilled if JMP has this functionality or if anyone could provide a general script. 

 

2 ACCEPTED SOLUTIONS

Accepted Solutions
Highlighted
txnelson
Super User

Re: Data QA and Reformatting

This is a fairly simple task for JMP.  Run this code (after editing the <Your Path> sections) and see if it gives you what you want.  Then you need to go to the JMP Documentation Library, and read the Scripting Guide, so you can understand what the code below is doing

Names Default To Here( 1 );

// Open the Excel spreadsheet that has the issues
dt = Open( "<your path>\Sample Client Data.xlsx" );

// Get all of the character columns
colNames = dt << get column names( string, character );

// Go across all of the character columns, and if "<0.0" is found in one
// or more cells, then change those cells to 0.  Then convert the column
// to a numeric column
For( i = 1, i <= N Items( colNames ), i++,
	dt << select where( Contains( As Column( colNames[i] ), "<0.0" ) );
	selRows = dt << get selected rows;
	If( N Rows( selRows ) > 0,
		Column( dt, colNames[i] )[selRows] = "0";
		Column( dt, colNames[i] ) << data type( numeric ) << modeling type( continuous );
	);
);

// Save the Detect version of the data
dt << save( "<your path>\Detect.csv" );

Close( dt, nosave );

// Reopen the Excel spreadsheet that has issues
dt = Open( "<your path>\Sample Client Data.xlsx" );

// Get all of the character columns
colNames = dt << get column names( string, character );

// Go across all of the character columns, and if "<0.0" is found in one
// or more cells, then loop through the cells getting rid of the "<"  
// Then convert the column to a numeric column
For( i = 1, i <= N Items( colNames ), i++,
	dt << select where( Contains( As Column( colNames[i] ), "<0.0" ) );
	selRows = dt << get selected rows;
	If( N Rows( selRows ) > 0,
		For( k = 1, k <= N Rows( selRows ), k++,
			theRow = selRows[k];
			Column( dt, colNames[i] )[theRow] = Substr( Column( dt, colNames[i] )[theRow], 2 );
		);
		Column( dt, colNames[i] ) << data type( numeric ) << modeling type( continuous );
	);
);

// Save the NonDetect version of the data
dt << save( "<your path>\NonDetect.csv" );

Close( dt, nosave );
Jim

View solution in original post

Highlighted
txnelson
Super User

Re: Data QA and Reformatting

Ah, I see the issue.....and I have corrected the issue below.....I believe.  But first, let me clarify something first.  When JMP reads in the different columns, it read through all of the data for a given column, and decides if the column is a numeric or a character column.  All values in a column need to be numeric values for the column to be considered numeric.  Therefore, when JMP finds a column with a non numeric cell, i.e <0.0005, it declares the column to be a character column.  My code takes advantage of that, in that it allow the code to find the columns where a "<" has been found.  It also allows for JSL to use character manipulation functions such as Substr() to get rid of the "<" and to leave a valid numeric in the cell.  But then, once all of the "<" are removed, the column can then be changed to a numeric column......<< data type( numeric ).  Here is my new code, converting all of the columns it finds as being numeric right after the data has been read in, and before the "<" columns have been processed, to 0.  I am assuming you want to change all values in a numeric column to 0, even if it was a negative value.  The code then goes on to correct the "<" columns, which changes all numeric cells to 0 except for the cells that had "<".

dt = Open( "C:\users\jim\documents\discussion group\Sample Client Data.xlsx" );

// Get all of the character columns
colNames = dt << get column names( string, character );
colNamesNum = dt << get column names( string, numeric );

// zero out all current numeric columns
For( i = 2, i <= N Items( colNamesNum ), i++,
	column( dt, colNamesNum[i] )[1::N Rows(dt)] = 0;
);

// Go across all of the character columns, and if "<0.0" is found in one
// or more cells, then loop through the cells getting rid of the "<"  
// Then convert the column to a numeric column
For( i = 1, i <= N Items( colNames ), i++,
	dt << select where( ! Contains( As Column( colNames[i] ), "<0.0" ) );
	
	selRows = dt << get selected rows;
	If( N Rows( selRows ) > 0,
		column( dt, colNames[i] )[selRows] = "0";
	);
	
	dt << invert row selection;
	selRows = dt << get selected rows;
	
	If( N Rows( selRows ) > 0,
		For( k = 1, k <= N Rows( selRows ), k++,
			theRow = selRows[k];
			Column( dt, colNames[i] )[theRow] = Substr( Column( dt, colNames[i] )[theRow], 2 );
		);
		Column( dt, colNames[i] ) << data type( numeric ) << modeling type( continuous );
	);
);

// Save the NonDetect version of the data
dt << save( "<your path>\NonDetect.csv" );

Close( dt, nosave );
Jim

View solution in original post

6 REPLIES 6
Highlighted
txnelson
Super User

Re: Data QA and Reformatting

This is a fairly simple task for JMP.  Run this code (after editing the <Your Path> sections) and see if it gives you what you want.  Then you need to go to the JMP Documentation Library, and read the Scripting Guide, so you can understand what the code below is doing

Names Default To Here( 1 );

// Open the Excel spreadsheet that has the issues
dt = Open( "<your path>\Sample Client Data.xlsx" );

// Get all of the character columns
colNames = dt << get column names( string, character );

// Go across all of the character columns, and if "<0.0" is found in one
// or more cells, then change those cells to 0.  Then convert the column
// to a numeric column
For( i = 1, i <= N Items( colNames ), i++,
	dt << select where( Contains( As Column( colNames[i] ), "<0.0" ) );
	selRows = dt << get selected rows;
	If( N Rows( selRows ) > 0,
		Column( dt, colNames[i] )[selRows] = "0";
		Column( dt, colNames[i] ) << data type( numeric ) << modeling type( continuous );
	);
);

// Save the Detect version of the data
dt << save( "<your path>\Detect.csv" );

Close( dt, nosave );

// Reopen the Excel spreadsheet that has issues
dt = Open( "<your path>\Sample Client Data.xlsx" );

// Get all of the character columns
colNames = dt << get column names( string, character );

// Go across all of the character columns, and if "<0.0" is found in one
// or more cells, then loop through the cells getting rid of the "<"  
// Then convert the column to a numeric column
For( i = 1, i <= N Items( colNames ), i++,
	dt << select where( Contains( As Column( colNames[i] ), "<0.0" ) );
	selRows = dt << get selected rows;
	If( N Rows( selRows ) > 0,
		For( k = 1, k <= N Rows( selRows ), k++,
			theRow = selRows[k];
			Column( dt, colNames[i] )[theRow] = Substr( Column( dt, colNames[i] )[theRow], 2 );
		);
		Column( dt, colNames[i] ) << data type( numeric ) << modeling type( continuous );
	);
);

// Save the NonDetect version of the data
dt << save( "<your path>\NonDetect.csv" );

Close( dt, nosave );
Jim

View solution in original post

Highlighted

Re: Data QA and Reformatting

Hi Jim,

I've looked this over and it's missing the second part of the task. We need a script for the 'Non Detect' to also change all the values into zeros.

 

I want to run the loop, select all the cells that have values > 0 (excluding the ones that have "<0.0" DL values). 

The last part of the code you sent works great, I just need to insert 'NonDetectPreprocess.csv' into it.  

 

I've tried using this custom script I made:

// Reopen the Excel spreadsheet that has issues
dt = Open( "D:\MD\Datto Workplace\Projects\Internal\JMP Scripts\Zero Imputation\Fake Client Data.csv" );

// Go across all of the character columns, and if concentration is greater than zero is found in one
// or more cells, then change those cells to 0.  Then convert the column
// to a numeric column
For( i = 1, i <= N Items( colNames ), i++,
	dt << select where( dt > 0 );
	selRows = dt << get selected rows;
	If( N Rows( selRows ) > 0,
		Column( dt, colNames[i] )[selRows] = "0";
		Column( dt, colNames[i] ) << data type( numeric ) << modeling type( continuous );
	);
);

//Save this sheet for next step

dt << save("D:\MD\Datto Workplace\Projects\Internal\JMP Scripts\Zero Imputation\NonDetectPreprocess.csv" )

Close( dt, nosave );

 

 

Highlighted
txnelson
Super User

Re: Data QA and Reformatting

Good try....but I think this is what you want

// Reopen the Excel spreadsheet that has issues
dt = Open( "<your path>\Sample Client Data.xlsx" );

// Get all of the character columns
colNames = dt << get column names( string, character );

// Go across all of the character columns, and if "<0.0" is found in one
// or more cells, then loop through the cells getting rid of the "<"  
// Then convert the column to a numeric column
For( i = 3 i <= N Items( colNames ), i++,
	dt << select where( ! Contains( As Column( colNames[i] ), "<0.0" ) );
	
	selRows = dt << get selected rows;
	If( N Rows( selRows ) > 0,
		column( dt, colNames[i] )[selRows] = "0";
	);
	
	dt << invert row selection;
	selRows = dt << get selected rows;
	
	If( N Rows( selRows ) > 0,
		For( k = 1, k <= N Rows( selRows ), k++,
			theRow = selRows[k];
			Column( dt, colNames[i] )[theRow] = Substr( Column( dt, colNames[i] )[theRow], 2 );
		);
		Column( dt, colNames[i] ) << data type( numeric ) << modeling type( continuous );
	);
);

// Save the NonDetect version of the data
dt << save( "<your path>\NonDetect.csv" );

Close( dt, nosave );
Jim
Highlighted

Re: Data QA and Reformatting

Hi Jim, 

I noticed you did an inverted row selection. Although this helps, the problem is that only the columns that contained the, "<0.0", are selected. How do I select all the columns that have numeric data in the data set and change those? 

Thanks for the help! 

Highlighted
txnelson
Super User

Re: Data QA and Reformatting

Ah, I see the issue.....and I have corrected the issue below.....I believe.  But first, let me clarify something first.  When JMP reads in the different columns, it read through all of the data for a given column, and decides if the column is a numeric or a character column.  All values in a column need to be numeric values for the column to be considered numeric.  Therefore, when JMP finds a column with a non numeric cell, i.e <0.0005, it declares the column to be a character column.  My code takes advantage of that, in that it allow the code to find the columns where a "<" has been found.  It also allows for JSL to use character manipulation functions such as Substr() to get rid of the "<" and to leave a valid numeric in the cell.  But then, once all of the "<" are removed, the column can then be changed to a numeric column......<< data type( numeric ).  Here is my new code, converting all of the columns it finds as being numeric right after the data has been read in, and before the "<" columns have been processed, to 0.  I am assuming you want to change all values in a numeric column to 0, even if it was a negative value.  The code then goes on to correct the "<" columns, which changes all numeric cells to 0 except for the cells that had "<".

dt = Open( "C:\users\jim\documents\discussion group\Sample Client Data.xlsx" );

// Get all of the character columns
colNames = dt << get column names( string, character );
colNamesNum = dt << get column names( string, numeric );

// zero out all current numeric columns
For( i = 2, i <= N Items( colNamesNum ), i++,
	column( dt, colNamesNum[i] )[1::N Rows(dt)] = 0;
);

// Go across all of the character columns, and if "<0.0" is found in one
// or more cells, then loop through the cells getting rid of the "<"  
// Then convert the column to a numeric column
For( i = 1, i <= N Items( colNames ), i++,
	dt << select where( ! Contains( As Column( colNames[i] ), "<0.0" ) );
	
	selRows = dt << get selected rows;
	If( N Rows( selRows ) > 0,
		column( dt, colNames[i] )[selRows] = "0";
	);
	
	dt << invert row selection;
	selRows = dt << get selected rows;
	
	If( N Rows( selRows ) > 0,
		For( k = 1, k <= N Rows( selRows ), k++,
			theRow = selRows[k];
			Column( dt, colNames[i] )[theRow] = Substr( Column( dt, colNames[i] )[theRow], 2 );
		);
		Column( dt, colNames[i] ) << data type( numeric ) << modeling type( continuous );
	);
);

// Save the NonDetect version of the data
dt << save( "<your path>\NonDetect.csv" );

Close( dt, nosave );
Jim

View solution in original post

Highlighted

Re: Data QA and Reformatting

Just following up with a photo if that is more clear. The yellow are the detection limits, "<0.0", that were replaced. Those columns got replaced with zeros, but the rest remained unchanged. 

MikeDereviankin_0-1598391713990.png

 

Article Labels