cancel
Showing results for 
Show  only  | Search instead for 
Did you mean: 
Try the Materials Informatics Toolkit, which is designed to easily handle SMILES data. This and other helpful add-ins are available in the JMP® Marketplace
Choose Language Hide Translation Bar
Analyst
Level I

Cleaning up a large data set with disorganized formats in JMP 14.2.0

Hi, I have a large data set with a combination of useful information in columns and in rows. The data in rows has a row with the type of information being collected which then needs to be linked to a result in a row further down in the column. The amount of rows with this information is variable and in some cases is a free form input rather than a specified input. I've been using recoding, subsetting, and transposing to get the data into a manageable format but it's time consuming and difficult to data verify as I have >50K rows to work with. I have tried to generate a simplified example of what I have and what I would like below. Can anybody help?

 

What I have 

Assay Useful info 1Useful info 2Useful info 3Useful info 4Other infoValue for other info
1Control Lot numberabcdefgControl result10Useful info 5Method ID
1Control Lot numberabcdefgControl result10Useful info 6Method Version
1Control Lot numberabcdefgControl result10Result for useful info 5Method x
1Control Lot numberabcdefgControl result10Result for useful info 61
2Control Lot numberabcdefgControl result11Useful info 5Method ID
2Control Lot numberabcdefgControl result11Useful info 6Method Version
2Control Lot numberabcdefgControl result11Useful info 7Sample result
2Control Lot numberabcdefgControl result11Useful info 8Equipment Number
2Control Lot numberabcdefgControl result11Result for useful info 5Method x
2Control Lot numberabcdefgControl result11Result for useful info 61
2Control Lot numberabcdefgControl result11Result for useful info 712
2Control Lot numberabcdefgControl result11Result for useful info 8123456
3Control Lot numberabcdefgControl result10Useful info 5Method ID
3Control Lot numberabcdefgControl result10Useful info 6Method Version
3Control Lot numberabcdefgControl result10Useful info 7Sample result
3Control Lot numberabcdefgControl result10Useful info 8Equipment Type
3Control Lot numberabcdefgControl result10Useful info 9Equipment Number
3Control Lot numberabcdefgControl result10Result for useful info 5Method x
3Control Lot numberabcdefgControl result10Result for useful info 61
3Control Lot numberabcdefgControl result10Result for useful info 712
3Control Lot numberabcdefgControl result10Result for useful info 8Reader
3Control Lot numberabcdefgControl result10Result for useful info 9123456

 

What I would like:

Assay Control Lot numberControl resultMethod IDMethod VersionSample resultEquipment NumberEquipment Type
1abcdefg10Method x1   
2abcdefg11Method x112123456 
3abcdefg10Method x112123456Reader

 

Or this could also work:

Assay ParameterResult
1Control result10
1Method IDMethod x
1Method Version1
1Sample result 
1Equipment Number 
1Equipment Type 
1Equipment Number 
2Method IDMethod x
2Method Version1
2Sample result12
2Equipment Number123456
2Equipment Type 
2Equipment Number 
3Method IDMethod x
3Method Version1
3Sample result12
3Equipment Number123456
3Equipment TypeReader
3Equipment Number123456

 

1 ACCEPTED SOLUTION

Accepted Solutions
jthi
Super User

Re: Cleaning up a large data set with disorganized formats in JMP 14.2.0

Are you familiar with SQL? This way you could avoid the issue with Other Info having different indices for different keys. You could build a query for JMP table to get it into this format:

jthi_0-1654895472437.png

Then create summary and split tables and join them together

summary:

jthi_1-1654895493838.png

split:

jthi_2-1654895505697.png

split updated to summary:

jthi_3-1654895517840.png

Names Default To Here(1);
dt = Current Data Table();

dt_query = Query(
	Table(dt, "INFO"),
	"\[
	select t1.assay, t1."Useful info 2" as "Control Lot number", 
	t1."Useful info 4" as "Control result", t1."Value for other info", t2."Value for other info"
	from INFO t1
    join INFO t2 on "Result for " || t1."Other info" = t2."Other info" and t1.assay = t2.assay
    ]\"
);

dt_summary = dt_query << Summary(
	Group(:Assay, :Control Lot number, :Control result),
	Freq("None"),
	Weight("None"),
	Link to original data table(0)
);

dt_split = dt_query << Split(
	Split By(:Value for other info),
	Split(:Value for other info 2),
	Group(:Assay),
	Remaining Columns(Drop All),
	Sort by Column Property
);

dt_summary << Update(With(dt_split), Match Columns(:Assay = :Assay));
//Close(dt_summary, No save);

Edit:

I had to recode original data columns Other info "Useful info X" values to "useful info X" (lowercase) to make query work. It could also be done in SQL query

-Jarmo

View solution in original post

3 REPLIES 3
jthi
Super User

Re: Cleaning up a large data set with disorganized formats in JMP 14.2.0

I would do this in couple of parts

  1. Getting Assay, Control Lot Number and Control Result can be done fairly easily with summary
  2. Getting rest is a bit more difficult as the Other info indices change (if they were always same this should would be most likely fairly simple Recode and Split)
    1. Depending on grouping (assay only? assay +  control lot number?) I would loop over the data table and add values as I go using associative arrays

Code is very messy (partially due to not able to use For Each and me being lazy):

Names Default To Here(1);

dt = New Table("Untitled",
	Add Rows(22),
	New Column("Assay", Numeric, "Continuous", Format("Best", 12), Set Values([1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3])),
	New Column("Useful info 1",
		Character(18),
		"Nominal",
		Set Values(
			{"Control Lot number", "Control Lot number", "Control Lot number", "Control Lot number", "Control Lot number", "Control Lot number",
			"Control Lot number", "Control Lot number", "Control Lot number", "Control Lot number", "Control Lot number", "Control Lot number",
			"Control Lot number", "Control Lot number", "Control Lot number", "Control Lot number", "Control Lot number", "Control Lot number",
			"Control Lot number", "Control Lot number", "Control Lot number", "Control Lot number"}
		)
	),
	New Column("Useful info 2",
		Character(16),
		"Nominal",
		Set Values(
			{"abcdefg", "abcdefg", "abcdefg", "abcdefg", "abcdefg", "abcdefg", "abcdefg", "abcdefg", "abcdefg", "abcdefg", "abcdefg", "abcdefg",
			"abcdefg", "abcdefg", "abcdefg", "abcdefg", "abcdefg", "abcdefg", "abcdefg", "abcdefg", "abcdefg", "abcdefg"}
		)
	),
	New Column("Useful info 3",
		Character(16),
		"Nominal",
		Set Values(
			{"Control result", "Control result", "Control result", "Control result", "Control result", "Control result", "Control result",
			"Control result", "Control result", "Control result", "Control result", "Control result", "Control result", "Control result",
			"Control result", "Control result", "Control result", "Control result", "Control result", "Control result", "Control result",
			"Control result"}
		)
	),
	New Column("Useful info 4",
		Numeric,
		"Continuous",
		Format("Best", 12),
		Set Values([10, 10, 10, 10, 11, 11, 11, 11, 11, 11, 11, 11, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10])
	),
	New Column("Other info",
		Character(24),
		"Nominal",
		Set Values(
			{"Useful info 5", "Useful info 6", "Result for useful info 5", "Result for useful info 6", "Useful info 5", "Useful info 6",
			"Useful info 7", "Useful info 8", "Result for useful info 5", "Result for useful info 6", "Result for useful info 7",
			"Result for useful info 8", "Useful info 5", "Useful info 6", "Useful info 7", "Useful info 8", "Useful info 9",
			"Result for useful info 5", "Result for useful info 6", "Result for useful info 7", "Result for useful info 8",
			"Result for useful info 9"}
		)
	),
	New Column("Value for other info",
		Character(16),
		"Nominal",
		Set Values(
			{"Method ID", "Method Version", "Method x", "1", "Method ID", "Method Version", "Sample result", "Equipment Number", "Method x", "1",
			"12", "123456", "Method ID", "Method Version", "Sample result", "Equipment Type", "Equipment Number", "Method x", "1", "12", "Reader",
			"123456"}
		)
	)
);

dt_result = dt << Summary(
	Group(:Assay, :Useful info 2, :Useful info 4),
	Freq("None"),
	Weight("None"),
	Link to original data table(0)
);

a_m = dt[0, "Assay"];
aa = Associative Array(a_m);
// will use slower for loop due to usage of JMP14 instead of For Each
aa_keys = aa << get keys;
For(i = 1, i <= N Items(aa_keys), i++,
	// initialize associative array to aa
	aa[aa_keys[i]] = Associative Array();
	//Get rows for current assay
	cur_rows = Loc(a_m, aa_keys[i]);
	
	//
	aa_temp = Associative Array(dt[cur_rows, "Other info"], dt[cur_rows, "Value for other info"]);
	aa_temp_keys = aa_temp << get keys;
	For(k = i, k <= N Items(aa_temp_keys), k++,
		// if starts with Useful info -> this will be new key to aa[aa_keys]
		If(Starts With(aa_temp_keys[k], "Useful info"),
			aa[aa_keys[i]][aa_temp[aa_temp_keys[k]]] = aa_temp["Result for " || Lowercase(aa_temp_keys[k])]; // get value
		);
	);
);

//then loop over the AA to set values to dt_result. create new columns as you go
aa_keys = aa << get keys;
For(i = 1, i <= N Items(aa_keys), i++,
	aa_new_cols = aa[aa_keys[i]];
	cur_cols = Associative Array(dt_result <<  get column names("String"));
	aa_new_cols << Remove(cur_cols);
	new_cols = aa_new_cols << get keys;
	For(k = 1, k <= N Items(new_cols), k++,
		dt_result << New Column(new_cols[k], Character, Nominal);
	);
	dt_result[i, aa[aa_keys[i]] << get keys] = aa[aa_keys[i]] << get values;
);

jthi_0-1654878820399.png

 

-Jarmo
Analyst
Level I

Re: Cleaning up a large data set with disorganized formats in JMP 14.2.0

 Thank you! I haven't used coding much though so that solution is probably too complex for me to put into practice. I'm hoping to be able to sort it out in pivot tables before I get it to JMP but that is also proving problematic.

jthi
Super User

Re: Cleaning up a large data set with disorganized formats in JMP 14.2.0

Are you familiar with SQL? This way you could avoid the issue with Other Info having different indices for different keys. You could build a query for JMP table to get it into this format:

jthi_0-1654895472437.png

Then create summary and split tables and join them together

summary:

jthi_1-1654895493838.png

split:

jthi_2-1654895505697.png

split updated to summary:

jthi_3-1654895517840.png

Names Default To Here(1);
dt = Current Data Table();

dt_query = Query(
	Table(dt, "INFO"),
	"\[
	select t1.assay, t1."Useful info 2" as "Control Lot number", 
	t1."Useful info 4" as "Control result", t1."Value for other info", t2."Value for other info"
	from INFO t1
    join INFO t2 on "Result for " || t1."Other info" = t2."Other info" and t1.assay = t2.assay
    ]\"
);

dt_summary = dt_query << Summary(
	Group(:Assay, :Control Lot number, :Control result),
	Freq("None"),
	Weight("None"),
	Link to original data table(0)
);

dt_split = dt_query << Split(
	Split By(:Value for other info),
	Split(:Value for other info 2),
	Group(:Assay),
	Remaining Columns(Drop All),
	Sort by Column Property
);

dt_summary << Update(With(dt_split), Match Columns(:Assay = :Assay));
//Close(dt_summary, No save);

Edit:

I had to recode original data columns Other info "Useful info X" values to "useful info X" (lowercase) to make query work. It could also be done in SQL query

-Jarmo