Solved: Cleaning up a large data set with disorganized formats in JMP 14.2.0

Analyst · Jun 9, 2022 03:49 PM

Hi, I have a large data set with a combination of useful information in columns and in rows. The data in rows has a row with the type of information being collected which then needs to be linked to a result in a row further down in the column. The amount of rows with this information is variable and in some cases is a free form input rather than a specified input. I've been using recoding, subsetting, and transposing to get the data into a manageable format but it's time consuming and difficult to data verify as I have >50K rows to work with. I have tried to generate a simplified example of what I have and what I would like below. Can anybody help?

What I have

Assay	Useful info 1	Useful info 2	Useful info 3	Useful info 4	Other info	Value for other info
1	Control Lot number	abcdefg	Control result	10	Useful info 5	Method ID
1	Control Lot number	abcdefg	Control result	10	Useful info 6	Method Version
1	Control Lot number	abcdefg	Control result	10	Result for useful info 5	Method x
1	Control Lot number	abcdefg	Control result	10	Result for useful info 6	1
2	Control Lot number	abcdefg	Control result	11	Useful info 5	Method ID
2	Control Lot number	abcdefg	Control result	11	Useful info 6	Method Version
2	Control Lot number	abcdefg	Control result	11	Useful info 7	Sample result
2	Control Lot number	abcdefg	Control result	11	Useful info 8	Equipment Number
2	Control Lot number	abcdefg	Control result	11	Result for useful info 5	Method x
2	Control Lot number	abcdefg	Control result	11	Result for useful info 6	1
2	Control Lot number	abcdefg	Control result	11	Result for useful info 7	12
2	Control Lot number	abcdefg	Control result	11	Result for useful info 8	123456
3	Control Lot number	abcdefg	Control result	10	Useful info 5	Method ID
3	Control Lot number	abcdefg	Control result	10	Useful info 6	Method Version
3	Control Lot number	abcdefg	Control result	10	Useful info 7	Sample result
3	Control Lot number	abcdefg	Control result	10	Useful info 8	Equipment Type
3	Control Lot number	abcdefg	Control result	10	Useful info 9	Equipment Number
3	Control Lot number	abcdefg	Control result	10	Result for useful info 5	Method x
3	Control Lot number	abcdefg	Control result	10	Result for useful info 6	1
3	Control Lot number	abcdefg	Control result	10	Result for useful info 7	12
3	Control Lot number	abcdefg	Control result	10	Result for useful info 8	Reader
3	Control Lot number	abcdefg	Control result	10	Result for useful info 9	123456

What I would like:

Assay	Control Lot number	Control result	Method ID	Method Version	Sample result	Equipment Number	Equipment Type
1	abcdefg	10	Method x	1
2	abcdefg	11	Method x	1	12	123456
3	abcdefg	10	Method x	1	12	123456	Reader

Or this could also work:

Assay	Parameter	Result
1	Control result	10
1	Method ID	Method x
1	Method Version	1
1	Sample result
1	Equipment Number
1	Equipment Type
1	Equipment Number
2	Method ID	Method x
2	Method Version	1
2	Sample result	12
2	Equipment Number	123456
2	Equipment Type
2	Equipment Number
3	Method ID	Method x
3	Method Version	1
3	Sample result	12
3	Equipment Number	123456
3	Equipment Type	Reader
3	Equipment Number	123456

jthi · Jun 10, 2022 2:17 PM

Are you familiar with SQL? This way you could avoid the issue with Other Info having different indices for different keys. You could build a query for JMP table to get it into this format:

Then create summary and split tables and join them together

summary:

split:

split updated to summary:

Names Default To Here(1);
dt = Current Data Table();

dt_query = Query(
	Table(dt, "INFO"),
	"\[
	select t1.assay, t1."Useful info 2" as "Control Lot number", 
	t1."Useful info 4" as "Control result", t1."Value for other info", t2."Value for other info"
	from INFO t1
    join INFO t2 on "Result for " || t1."Other info" = t2."Other info" and t1.assay = t2.assay
    ]\"
);

dt_summary = dt_query << Summary(
	Group(:Assay, :Control Lot number, :Control result),
	Freq("None"),
	Weight("None"),
	Link to original data table(0)
);

dt_split = dt_query << Split(
	Split By(:Value for other info),
	Split(:Value for other info 2),
	Group(:Assay),
	Remaining Columns(Drop All),
	Sort by Column Property
);

dt_summary << Update(With(dt_split), Match Columns(:Assay = :Assay));
//Close(dt_summary, No save);

Edit:

I had to recode original data columns Other info "Useful info X" values to "useful info X" (lowercase) to make query work. It could also be done in SQL query

-Jarmo

View solution in original post

jthi · Jun 10, 2022 12:34 PM

I would do this in couple of parts

Getting Assay, Control Lot Number and Control Result can be done fairly easily with summary
Getting rest is a bit more difficult as the Other info indices change (if they were always same this should would be most likely fairly simple Recode and Split)
1. Depending on grouping (assay only? assay + control lot number?) I would loop over the data table and add values as I go using associative arrays

Code is very messy (partially due to not able to use For Each and me being lazy):

Names Default To Here(1);

dt = New Table("Untitled",
	Add Rows(22),
	New Column("Assay", Numeric, "Continuous", Format("Best", 12), Set Values([1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3])),
	New Column("Useful info 1",
		Character(18),
		"Nominal",
		Set Values(
			{"Control Lot number", "Control Lot number", "Control Lot number", "Control Lot number", "Control Lot number", "Control Lot number",
			"Control Lot number", "Control Lot number", "Control Lot number", "Control Lot number", "Control Lot number", "Control Lot number",
			"Control Lot number", "Control Lot number", "Control Lot number", "Control Lot number", "Control Lot number", "Control Lot number",
			"Control Lot number", "Control Lot number", "Control Lot number", "Control Lot number"}
		)
	),
	New Column("Useful info 2",
		Character(16),
		"Nominal",
		Set Values(
			{"abcdefg", "abcdefg", "abcdefg", "abcdefg", "abcdefg", "abcdefg", "abcdefg", "abcdefg", "abcdefg", "abcdefg", "abcdefg", "abcdefg",
			"abcdefg", "abcdefg", "abcdefg", "abcdefg", "abcdefg", "abcdefg", "abcdefg", "abcdefg", "abcdefg", "abcdefg"}
		)
	),
	New Column("Useful info 3",
		Character(16),
		"Nominal",
		Set Values(
			{"Control result", "Control result", "Control result", "Control result", "Control result", "Control result", "Control result",
			"Control result", "Control result", "Control result", "Control result", "Control result", "Control result", "Control result",
			"Control result", "Control result", "Control result", "Control result", "Control result", "Control result", "Control result",
			"Control result"}
		)
	),
	New Column("Useful info 4",
		Numeric,
		"Continuous",
		Format("Best", 12),
		Set Values([10, 10, 10, 10, 11, 11, 11, 11, 11, 11, 11, 11, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10])
	),
	New Column("Other info",
		Character(24),
		"Nominal",
		Set Values(
			{"Useful info 5", "Useful info 6", "Result for useful info 5", "Result for useful info 6", "Useful info 5", "Useful info 6",
			"Useful info 7", "Useful info 8", "Result for useful info 5", "Result for useful info 6", "Result for useful info 7",
			"Result for useful info 8", "Useful info 5", "Useful info 6", "Useful info 7", "Useful info 8", "Useful info 9",
			"Result for useful info 5", "Result for useful info 6", "Result for useful info 7", "Result for useful info 8",
			"Result for useful info 9"}
		)
	),
	New Column("Value for other info",
		Character(16),
		"Nominal",
		Set Values(
			{"Method ID", "Method Version", "Method x", "1", "Method ID", "Method Version", "Sample result", "Equipment Number", "Method x", "1",
			"12", "123456", "Method ID", "Method Version", "Sample result", "Equipment Type", "Equipment Number", "Method x", "1", "12", "Reader",
			"123456"}
		)
	)
);

dt_result = dt << Summary(
	Group(:Assay, :Useful info 2, :Useful info 4),
	Freq("None"),
	Weight("None"),
	Link to original data table(0)
);

a_m = dt[0, "Assay"];
aa = Associative Array(a_m);
// will use slower for loop due to usage of JMP14 instead of For Each
aa_keys = aa << get keys;
For(i = 1, i <= N Items(aa_keys), i++,
	// initialize associative array to aa
	aa[aa_keys[i]] = Associative Array();
	//Get rows for current assay
	cur_rows = Loc(a_m, aa_keys[i]);
	
	//
	aa_temp = Associative Array(dt[cur_rows, "Other info"], dt[cur_rows, "Value for other info"]);
	aa_temp_keys = aa_temp << get keys;
	For(k = i, k <= N Items(aa_temp_keys), k++,
		// if starts with Useful info -> this will be new key to aa[aa_keys]
		If(Starts With(aa_temp_keys[k], "Useful info"),
			aa[aa_keys[i]][aa_temp[aa_temp_keys[k]]] = aa_temp["Result for " || Lowercase(aa_temp_keys[k])]; // get value
		);
	);
);

//then loop over the AA to set values to dt_result. create new columns as you go
aa_keys = aa << get keys;
For(i = 1, i <= N Items(aa_keys), i++,
	aa_new_cols = aa[aa_keys[i]];
	cur_cols = Associative Array(dt_result <<  get column names("String"));
	aa_new_cols << Remove(cur_cols);
	new_cols = aa_new_cols << get keys;
	For(k = 1, k <= N Items(new_cols), k++,
		dt_result << New Column(new_cols[k], Character, Nominal);
	);
	dt_result[i, aa[aa_keys[i]] << get keys] = aa[aa_keys[i]] << get values;
);

-Jarmo

Analyst · Jun 10, 2022 04:44 PM

Thank you! I haven't used coding much though so that solution is probably too complex for me to put into practice. I'm hoping to be able to sort it out in pivot tables before I get it to JMP but that is also proving problematic.

jthi · Jun 10, 2022 2:17 PM

Are you familiar with SQL? This way you could avoid the issue with Other Info having different indices for different keys. You could build a query for JMP table to get it into this format:

Then create summary and split tables and join them together

summary:

split:

split updated to summary:

Names Default To Here(1);
dt = Current Data Table();

dt_query = Query(
	Table(dt, "INFO"),
	"\[
	select t1.assay, t1."Useful info 2" as "Control Lot number", 
	t1."Useful info 4" as "Control result", t1."Value for other info", t2."Value for other info"
	from INFO t1
    join INFO t2 on "Result for " || t1."Other info" = t2."Other info" and t1.assay = t2.assay
    ]\"
);

dt_summary = dt_query << Summary(
	Group(:Assay, :Control Lot number, :Control result),
	Freq("None"),
	Weight("None"),
	Link to original data table(0)
);

dt_split = dt_query << Split(
	Split By(:Value for other info),
	Split(:Value for other info 2),
	Group(:Assay),
	Remaining Columns(Drop All),
	Sort by Column Property
);

dt_summary << Update(With(dt_split), Match Columns(:Assay = :Assay));
//Close(dt_summary, No save);

Edit:

I had to recode original data columns Other info "Useful info X" values to "useful info X" (lowercase) to make query work. It could also be done in SQL query

-Jarmo

Cleaning up a large data set with disorganized formats in JMP 14.2.0

Re: Cleaning up a large data set with disorganized formats in JMP 14.2.0

Re: Cleaning up a large data set with disorganized formats in JMP 14.2.0

Re: Cleaning up a large data set with disorganized formats in JMP 14.2.0

Re: Cleaning up a large data set with disorganized formats in JMP 14.2.0