cancel
Showing results for 
Show  only  | Search instead for 
Did you mean: 
Try the Materials Informatics Toolkit, which is designed to easily handle SMILES data. This and other helpful add-ins are available in the JMP® Marketplace
Choose Language Hide Translation Bar
sam_t
Level III

columns data split and rename column headers

Hi,

 

I have Data_original.csv which have many columns of | or _ or none. In this example, I'm showing only 3 columns of |.

I tried the data preview to see if I can split but my column got messed up.

 

Data_original.csv

Fruit,ID,Colors,Weigth,CountryCode
Apple,123456,5_2|1_4,3|3|3|oz,1_2_5|3_2_1_123_2343_342|1_2
Apricot,5555666,9_0|55_298,5|3|2|1|lb,0_9_1|23
Grape,98746,12_1|23_99,1|2|1|7|6|4|2|1|8|g,9_1_3|23_45_4
Orange,25897,8_8|0_9,2|3|fl,1_0_4|99_999_888|0_8
Pear,113355,9999_77|78_123,4|6|l,-9999_76_123|9
Plum,8855446,32_-8888|00_12,5|9|lb,123_1_55|123_90
Strawberries,981254,981_123|34_123,8|3|g,89_12_45|67_1_23_8|90


Open( "C:\JSL\Data_original.csv", Import Settings( End Of Line( CRLF, CR, LF ), End Of Field( Comma, Other( "|" ), CSV( 0 ) ), Strip Quotes( 1 ), Use Apostrophe as Quotation Mark( 0 ), Scan Whole File( 1 ), Treat empty columns as numeric( 0 ), CompressNumericColumns( 0 ), CompressCharacterColumns( 0 ), CompressAllowListCheck( 0 ), Labels( 1 ), Column Names Start( 1 ), Data Starts( 2 ), Lines To Read( "All" ), Year Rule( "20xx" ) ) );

 

First, I want to open up Data_original.csv. Delete any columns of | split (if there any and keep the first value split only). I was thinking once I split, I should be able to delete any column header name with Column* and save it out to csv.

See the output from Data_original_EDIT1.csv.

 

Fruit,ID,Colors,Weigth,CountryCode
Apple,123456,5_2,3,1_2_5
Apricot,5555666,9_0,5,0_9_1
Grape,98746,12_1,1,9_1_3
Orange,25897,8_8,2,1_0_4
Pear,113355,9999_77,4,-9999_76_123
Plum,8855446,32_-8888,5,123_1_55
Strawberries,981254,981_123,8,89_12_45

Second, I want to open the previous save csv, Data_original_EDIT1.csv. Split by _ column (if any) and rename the column header as the same column header but concatenate a sequence 0..n.

See Data_original_EDIT2_Final.csv, the final output export to csv.

 

Fruit,ID,Colors0,Colors1,Weigth,CountryCode0,CountryCode1,CountryCode2
Apple,123456,5,2,3,1,2,5
Apricot,5555666,9,0,5,0,9,1
Grape,98746,12,1,1,9,1,3
Orange,25897,8,8,2,1,0,4
Pear,113355,9999,77,4,-9999,76,123
Plum,8855446,32,-8888,5,123,1,55
Strawberries,981254,981,123,8,89,12,45

Let me know how I start the JSL coding. Thanks.

1 ACCEPTED SOLUTION

Accepted Solutions
txnelson
Super User

Re: columns data split and rename column headers

Here is a better version of the code, and I believe that it will handle all | columns in your data table

Names Default To Here( 1 );
dt = Open(
	"C:\JSL\Data_original.csv",
	Import Settings(
		End Of Line( CRLF, CR, LF ),
		End Of Field( Comma, CSV( 0 ) ),
		Strip Quotes( 1 ),
		Use Apostrophe as Quotation Mark( 0 ),
		Use Regional Settings( 0 ),
		Scan Whole File( 1 ),
		Treat empty columns as numeric( 0 ),
		CompressNumericColumns( 0 ),
		CompressCharacterColumns( 0 ),
		CompressAllowListCheck( 0 ),
		Labels( 1 ),
		Column Names Start( 1 ),
		Data Starts( 2 ),
		Lines To Read( "All" ),
		Year Rule( "20xx" )
	)
);
stopVal = Min( N Rows( dt ), 100 );

// Loop across columns back to front to expand the columns as necessary
For( theColumn = N Cols( dt ), theColumn >= 3, theColumn--, 

// Process if a character column
	If( Column( dt, theColumn ) << get data type == "Character", 
// See if a "|" is in the data value
		For( i = 1, i <= stopVal + 1, i++,
			If( Contains( Column( dt, theColumn )[i], "|" ),
				Break()
			)
		);
		
		// Get rid of everything after the first "|"
		If( i <= stopVal,
			For( theRow = 1, theRow <= N Rows( dt ), theRow++,
				Column( dt, theColumn )[theRow] = Word( 1, As Column( dt, theColumn )[theRow], "|" )
			);
			
			// Determine the potential number of values
			// Loop through the column data for upto 100 rows to find max number of  values to be processed
			
			Count = 0;
			For( i = 1, i <= stopVal, i++,
				cnt = 0;
				While( Word( cnt + 1, Trim( Column( dt, theColumn )[i] ), "_" ) != "", cnt++ );
				Count = Max( Count, cnt );
			);
		
			// If Count is == 1, then just change the column to numeric
			If( Count == 1,
				Column( dt, theColumn ) << data type( numeric ) << modeling type( continuous ),
				origName = Column( dt, theColumn ) << get name;
				Column( dt, theColumn ) << set name( origName || "0" );
				Eval(
					Substitute(
							Expr(
								dt << add multiple columns( __orig__, Count - 1, after( __col__ ), numeric )
							),
						Expr( __orig__ ), origName,
						Expr( __col__ ), Parse( ":" || Char( Column( dt, thecolumn ) << get name ) )
					)
				);
				// Change name of new column if only 2 new column is created
				If( Count == 2,
					Column( dt, origName ) << Set name( origName || "1" )
				);
				For( i = 1, i <= N Rows( dt ), i++,
					For( k = 1, k <= Count - 1, k++,
						Column( dt, origName || Char( k ) )[i] = Num( Word( k + 1, Column( dt, theColumn )[i], "_" ) )
					)
				);
				// Correct original 0 column
				For( i = 1, i <= N Rows( dt ), i++,
					Column( dt, theColumn )[i] = Word( 1, Column( dt, theColumn )[i], "_" )
				);
				Column( dt, theColumn ) << data type( numeric ) << modeling type( continuous );
			);
		);
	)
);

 

Jim

View solution in original post

4 REPLIES 4
txnelson
Super User

Re: columns data split and rename column headers

I think the solution is easier than you think.  After looking at the data and what you want to do with it, I took the approach to read in the original data, cell by cell of what is in the original csv file.  I then simply examine the data and create and modify the columns as needed.  Here are my results from the script at the bottom of my entry

pipes.PNG

Names Default To Here( 1 );
dt = Open(
	"C:\JSL\Data_original.csv",
	columns(
		New Column( "Fruit", Character, "Nominal" ),
		New Column( "ID", Numeric, "Continuous", Format( "Best", 12 ) ),
		New Column( "Colors", Character, "Nominal" ),
		New Column( "Weigth", Character, "Nominal" ),
		New Column( "CountryCode", Character, "Nominal" )
	),
	Import Settings(
		End Of Line( CRLF, CR, LF ),
		End Of Field( Comma, CSV( 0 ) ),
		Strip Quotes( 1 ),
		Use Apostrophe as Quotation Mark( 0 ),
		Use Regional Settings( 0 ),
		Scan Whole File( 1 ),
		Treat empty columns as numeric( 0 ),
		CompressNumericColumns( 0 ),
		CompressCharacterColumns( 0 ),
		CompressAllowListCheck( 0 ),
		Labels( 1 ),
		Column Names Start( 1 ),
		Data Starts( 2 ),
		Lines To Read( "All" ),
		Year Rule( "20xx" )
	)
);

// Work on the Colors data
// Only keep data up to |
For Each Row( :Colors = Word( 1, :Colors, "|" ) );

// Determine the potential number of colors
// Loop through Colors data for upto 100 rows to find max number of 
// colors
dt:Colors << set name( "Colors0" );
stopVal = Min( N Rows( dt ), 100 );
colorsCount = 0;
For( i = 1, i <= stopVal, i++,
	cnt = 0;
	While( Word( cnt + 1, Trim( :Colors0[i] ), "_" ) != "", cnt++ );
	colorsCount = Max( colorsCount, cnt );
);
dt << add multiple columns( "Colors", colorsCount - 1, after( :Colors0 ), numeric );
// Change name of new column if only 2 new column is created
If( colorsCount == 2,
	dt:Colors << Set name( "Colors1" )
);
For(i=1,i<=nrows(dt),i++,
	for(k=1,k<=colorsCount-1,k++,
		column(dt,"Colors" || char(k))[i]=num(word(k+1,:Colors0[i],"_"));
	)
);
// Correct Colors0 column
For Each Row(:Colors0 = word(1,:colors0,"_"));
dt:colors0 << data type(numeric) << modeling type(continuous);

// Correct Weigth column
for each row( :Weigth = word(1,:Weigth,"|"));
dt:Weigth << set data type(numeric) << set modeling type(continuous);

// Work on the Country data
// Only keep data up to |
For Each Row( :CountryCode = Word( 1, :CountryCode, "|" ) );

// Determine the potential number of CountryCode
// Loop through CountryCode data for upto 100 rows to find max number of 
// CountryCode
dt:CountryCode << set name( "CountryCode0" );
stopVal = Min( N Rows( dt ), 100 );
CountryCodeCount = 0;
For( i = 1, i <= stopVal, i++,
	cnt = 0;
	While( Word( cnt + 1, Trim( :CountryCode0[i] ), "_" ) != "", cnt++ );
	CountryCodeCount = Max( CountryCodeCount, cnt );
);
dt << add multiple columns( "CountryCode", CountryCodeCount - 1, after( :CountryCode0 ), numeric );
// Change name of new column if only 2 new column is created
If( CountryCodeCount == 2,
	dt:CountryCode << Set name( "CountryCode1" )
);
For(i=1,i<=nrows(dt),i++,
	for(k=1,k<=CountryCodeCount-1,k++,
		column(dt,"CountryCode" || char(k))[i]=num(word(k+1,:CountryCode0[i],"_"));
	)
);
// Correct CountryCode0 column
For Each Row(:CountryCode0 = word(1,:CountryCode0,"_"));
dt:CountryCode0 << data type(numeric) << modeling type(continuous);

This example is hardcoded to the 3 different | columns, but it could easily be modified to just read through the column list, and given the column name, and the elimination of everything after the first | and the counting of the "_" the the script could handle as many columns as it finds.

Jim
txnelson
Super User

Re: columns data split and rename column headers

Here is a better version of the code, and I believe that it will handle all | columns in your data table

Names Default To Here( 1 );
dt = Open(
	"C:\JSL\Data_original.csv",
	Import Settings(
		End Of Line( CRLF, CR, LF ),
		End Of Field( Comma, CSV( 0 ) ),
		Strip Quotes( 1 ),
		Use Apostrophe as Quotation Mark( 0 ),
		Use Regional Settings( 0 ),
		Scan Whole File( 1 ),
		Treat empty columns as numeric( 0 ),
		CompressNumericColumns( 0 ),
		CompressCharacterColumns( 0 ),
		CompressAllowListCheck( 0 ),
		Labels( 1 ),
		Column Names Start( 1 ),
		Data Starts( 2 ),
		Lines To Read( "All" ),
		Year Rule( "20xx" )
	)
);
stopVal = Min( N Rows( dt ), 100 );

// Loop across columns back to front to expand the columns as necessary
For( theColumn = N Cols( dt ), theColumn >= 3, theColumn--, 

// Process if a character column
	If( Column( dt, theColumn ) << get data type == "Character", 
// See if a "|" is in the data value
		For( i = 1, i <= stopVal + 1, i++,
			If( Contains( Column( dt, theColumn )[i], "|" ),
				Break()
			)
		);
		
		// Get rid of everything after the first "|"
		If( i <= stopVal,
			For( theRow = 1, theRow <= N Rows( dt ), theRow++,
				Column( dt, theColumn )[theRow] = Word( 1, As Column( dt, theColumn )[theRow], "|" )
			);
			
			// Determine the potential number of values
			// Loop through the column data for upto 100 rows to find max number of  values to be processed
			
			Count = 0;
			For( i = 1, i <= stopVal, i++,
				cnt = 0;
				While( Word( cnt + 1, Trim( Column( dt, theColumn )[i] ), "_" ) != "", cnt++ );
				Count = Max( Count, cnt );
			);
		
			// If Count is == 1, then just change the column to numeric
			If( Count == 1,
				Column( dt, theColumn ) << data type( numeric ) << modeling type( continuous ),
				origName = Column( dt, theColumn ) << get name;
				Column( dt, theColumn ) << set name( origName || "0" );
				Eval(
					Substitute(
							Expr(
								dt << add multiple columns( __orig__, Count - 1, after( __col__ ), numeric )
							),
						Expr( __orig__ ), origName,
						Expr( __col__ ), Parse( ":" || Char( Column( dt, thecolumn ) << get name ) )
					)
				);
				// Change name of new column if only 2 new column is created
				If( Count == 2,
					Column( dt, origName ) << Set name( origName || "1" )
				);
				For( i = 1, i <= N Rows( dt ), i++,
					For( k = 1, k <= Count - 1, k++,
						Column( dt, origName || Char( k ) )[i] = Num( Word( k + 1, Column( dt, theColumn )[i], "_" ) )
					)
				);
				// Correct original 0 column
				For( i = 1, i <= N Rows( dt ), i++,
					Column( dt, theColumn )[i] = Word( 1, Column( dt, theColumn )[i], "_" )
				);
				Column( dt, theColumn ) << data type( numeric ) << modeling type( continuous );
			);
		);
	)
);

 

Jim
sam_t
Level III

Re: columns data split and rename column headers

Thank you Jim for the generic solution. I validate with my set of data with 10 | split and 20 _ split. Works perfectly!!

txnelson
Super User

Re: columns data split and rename column headers

@sam_t ,

I am glad it works for you.

The payment I want for providing the script, is for you to take the time to walk through the script line by line so you can understand it and learn from it.  I actually suggest that you start with my initial script, since it is more hardcoded.  

As with most scripts, this one will probably need to be modified in the future because of something new in the data, and you need to be able to make the changes.

Jim