Solved: Re: How to select & exclude by UniqueVals combo of two columns?

aliegner1 · Jun 10, 2023 4:31 PM

So I'm stumped here. I'm trying to output a screened dt of just data that is statistically significant, based on two columns: Chart_ID & Recipe_ID. I can get it to work w/ a single column but can't figure out two columns.

I'm plotting the raw data using Oneway (y=mean, x=chamber, BY:Chart_ID, Recipe_ID) and running the All-pairs Tukey tTest.
Then taking the combined pVal data table.
then selecting all rows where pVal >0.05 and excluding
then creating a summarize list of the UniqueVals by Chart_ID (this is all Chart_ID's with significant pVals.)
then selecting all rows in source dt w/ matching UniqueVals
then inverting selection and excluding
--> this leaves the source dt with only data that is tTest significant by Chart_ID.

I can't figure out how to add in the Recipe_ID into the summarize UniqueVals.

Is there a better way to end up w/ a source dt that has excluded all the Charts/Recipes that are not significant?

//dtSPC --> my source data, raw SPC data. Need to exclude rows from Charts, Recipes with high pVals

/*plotting a basic Fit YbyX just to get pValues. Formatting and style done on final plot*/
ow1 = Oneway(
	Y( :MEAN ),
	X( :CHAMBER ),
	By(:CHART_ID, :RECIPE),
	
	//All pairs Tukey HSD test
	All Pairs(
		1,
		Confidence Quantile( 0 ),
		Difference Matrix( 0 ),
		LSD Threshold Matrix( 0 ),
		Connecting Letters Report( 0 ),
		Ordered Differences Report( 1 )
	),
);
/***********************************/

/***********************************/
// Create the combined data table, using xPath (return a list of the OutlineBoxes from the report that have the title Ordered Differences Report.  Then, use the first of those to make the combined data table.)
outlines = (Report( ow1[1] ) << Top Parent) << XPath( "//OutlineBox[text()='Ordered Differences Report']" );
dt1 = outlines[1][Table Box( 1 )] << Make Combined Data Table;

dt1 = current data table();
dt1 << set name( "pValue PairWise by Channel Tukey");

dt1:CHART_ID << Data type( Numeric ) << Set Modeling Type( Nominal );
/***********************************/

/**********************************/
//Screen and show only the CHART_ID's w/ a Tukey pair pVal<0.05

//exclude all high pVal rows, leaving only low pVal Channels. Does not work in reverse.
dt1 << select where(:name("p-Value") > 0.05);
dt1 << exclude;
dt1 << clear select;

//create a list of remaining Unique CHART_ID's
summarize(dt1, uniqueVals=by(:CHART_ID));
//Convert to numeric
For( i = 1, i <= N Items(uniqueVals ), i++,
	uniqueVals[i] = Num(uniqueVals[i] )
);

//extra, just to look at the data
//dtUnique = New Table( "uniqueVals less 0.05");
//dtUnique << New Column( "CHART_ID", Values(uniqueVals));

//select on SPC source data just the channels w/ low pVals
//need to choose SPC data table
dtSPC << clear select;
dtSPC << select where( contains( uniqueVals, :CHART_ID));

//invert and exclude, leaving only low pVal CHART_ID's
dtSPC << invert row selection << exclude;
dtSPC << clear select;

jthi · Jun 16, 2021 12:05 PM

My previous Associative Array() solution wont' work correctly if there are duplicated values in the columns (it could be quite easily fixed thou), so I'm using the for-loop version.

Does this do what you want?

Names Default To Here(1);

dtPopCorn = Open( "$SAMPLE_DATA/Popcorn.jmp" );

/*plotting a basic Fit YbyX just to get pValues. Formatting and style done on final plot*/
ow1 = Oneway(
	Y( :yield ),
	X( :batch),
	By(:trial, :popcorn),
	
	//All pairs Tukey HSD test
	All Pairs(
		1,
		Confidence Quantile( 0 ),
		Difference Matrix( 0 ),
		LSD Threshold Matrix( 0 ),
		Connecting Letters Report( 0 ),
		Ordered Differences Report( 1 )
	),
);
/***********************************/

/***********************************/
// Create the combined data table, using xPath (return a list of the OutlineBoxes from the report that have the title Ordered Differences Report.  Then, use the first of those to make the combined data table.)
outlines = (Report( ow1[1] ) << Top Parent) << XPath( "//OutlineBox[text()='Ordered Differences Report']" );
dt1 = outlines[1][Table Box( 1 )] << Make Combined Data Table;
dt1 << set name( "pValue PairWise by Channel Tukey");

/***********************************/

/**********************************/
pVals = Loc(dt1:"p-Value"n << get as matrix <= 0.05); 
comparisonStrings = {};
For(i = 1, i <= N Items(pVals), i++,
	Insert Into(comparisonStrings, char(dt1:trial[pVals[i]]) || dt1:popcorn[pVals[i]]);
);
show(comparisonStrings);

dtPopCorn << clear select;
dtPopCorn << select where(contains(comparisonStrings, char(:trial)||:popcorn));

Only combination with p-value below 0.05 is trial = 2 and popcorm = gourmet.

-Jarmo

View solution in original post

Kevin_Anderson · Jun 15, 2021 05:34 PM

Hi, aliegner1!

Maybe I don't understand, but you might try selecting significant p-values, subsetting a new table, then summarizing by Chart_ID and Recipe_ID?

Again, maybe I don't understand, but what you're doing smacks of "p hacking". At the very least, I recommend you read the ASA Statement on p-Values and then look up False Discovery Rate in the Statistics Index under JMP Help.

Good luck!

aliegner1 · Jun 15, 2021 7:47 PM

Thank you for the info. I read the ASA statement and it makes sense mostly. However, my situation (and I assume many other ppl) are dealing with thousands of charts and I need a way to quickly screen and trim down the list to look at the most important. Here's a screen shot as an example. I've got 5000 charts to sift through and need a good method to identify the most important ones to spend time looking at. I gotta draw the line somewhere.

I also looked up FDR, but I don't think this applies, as I'm only using a single X characteristic. There's no FDR's happening with just a single X.

Regarding the select and summarize:

That's the thing. I can't figure out how to get the summarize to then translate over to letting me select the source data.

Previously, I was selecting a single column (Chart_ID) then converting back to numeric because Summarize outputs character. From there, I was able to select the source dt Chart_ID Col by the list of UniqueVals (which was the Chart_ID's).

How to make this happen with two different columns that are numeric and character?

jthi · Jun 16, 2021 04:15 AM

You could convert the numeric column to character and then concat the two columns to get list of comparison strings. Then use that list to select wanted rows (or use associative arrays). Don't have example dataset so here is some sort of an example:

Names Default To Here(1);

dt = New Table("Untitled 22",
	Add Rows(5),
	Compress File When Saved(1),
	New Column("CHART_ID",
		Character,
		"Nominal",
		Set Values({"123", "1", "12", "0", "135"})
	),
	New Column("RECIPE",
		Character,
		"Nominal",
		Set Values({"A", "A", "B", "B", "C"})
	),
	New Column("pval",
		Numeric,
		"Continuous",
		Format("Best", 12),
		Set Values([0.1, 0.001, 0.051, 0.002, 0.1])
	)
);

pVals = Loc(dt:pval << get as matrix > 0.05); //get rows over 0.05
comparisonStrings = {};
For(i = 1, i <= N Items(pVals), i++,
	Insert Into(comparisonStrings, char(dt:CHART_ID[pVals[i]]) || dt:RECIPE[pVals[i]]);
);
wait(1);
dt << Select Where(Contains(comparisonStrings, char(:CHART_ID)||:RECIPE));
wait(1);
dt << clear select;
wait(1);
aa = Associative Array(Eval List(dt:CHART_ID[pVals]), EvalList(dt:RECIPE[pVals]));
dt << Select Where(Contains(aa, :CHART_ID) & aa[:CHART_ID] == :RECIPE);

Use Scripting Index to get better understanding about what is going on(and for debugging if needed).

-Jarmo

aliegner1 · Jun 16, 2021 10:13 AM

@jthi

Thank you for the tips, but I'm a bit confused about your script. It looks like its just doing select a few times but doesn't do anything to circle back to a source dt.

I can select the pVal's from the combined data table just fine. The trouble I'm having is trying to then summarize by two columns in the Combined Data Table, then circle back to the main source data and select using the summarize.

I found a sample set that looks like we can use to talk about this problem.

I can't figure out how to summarize back to the source dt using the two BY columns

By(:trial, :oil amt),

dtPopCorn = Open( "$SAMPLE_DATA/Popcorn.jmp" );

/*plotting a basic Fit YbyX just to get pValues. Formatting and style done on final plot*/
ow1 = Oneway(
	Y( :yield ),
	X( :popcorn ),
	By(:trial, :oil amt),
	
	//All pairs Tukey HSD test
	All Pairs(
		1,
		Confidence Quantile( 0 ),
		Difference Matrix( 0 ),
		LSD Threshold Matrix( 0 ),
		Connecting Letters Report( 0 ),
		Ordered Differences Report( 1 )
	),
);
/***********************************/

/***********************************/
// Create the combined data table, using xPath (return a list of the OutlineBoxes from the report that have the title Ordered Differences Report.  Then, use the first of those to make the combined data table.)
outlines = (Report( ow1[1] ) << Top Parent) << XPath( "//OutlineBox[text()='Ordered Differences Report']" );
dt1 = outlines[1][Table Box( 1 )] << Make Combined Data Table;

dt1 = current data table();
dt1 << set name( "pValue PairWise by Channel Tukey");

dt1:trial << Data type( Numeric ) << Set Modeling Type( Nominal );
/***********************************/

/**********************************/
//Screen and show only the trial's w/ a Tukey pair pVal<0.05

//exclude all high pVal rows, leaving only low pVal Channels. Does not work in reverse.
dt1 << select where(:name("p-Value") > 0.5); //I know...just setting 0.5 so that one row stands out.
dt1 << exclude;
dt1 << clear select;

//create a list of remaining Unique trial's
summarize(dt1, uniqueVals=by(:trial));
//Convert to numeric
For( i = 1, i <= N Items(uniqueVals ), i++,
	uniqueVals[i] = Num(uniqueVals[i] )
);

//extra, just to look at the data
//dtUnique = New Table( "uniqueVals less 0.5");
//dtUnique << New Column( "trial", Values(uniqueVals));

//select on SPC source data just the channels w/ low pVals
//need to choose SPC data table
dtPopCorn << clear select;
dtPopCorn << select where( contains( uniqueVals, :trial));

//invert and exclude, leaving only low pVal trial's
dtPopCorn << invert row selection << exclude;
dtPopCorn << clear select;

jthi · Jun 16, 2021 12:05 PM

My previous Associative Array() solution wont' work correctly if there are duplicated values in the columns (it could be quite easily fixed thou), so I'm using the for-loop version.

Does this do what you want?

Names Default To Here(1);

dtPopCorn = Open( "$SAMPLE_DATA/Popcorn.jmp" );

/*plotting a basic Fit YbyX just to get pValues. Formatting and style done on final plot*/
ow1 = Oneway(
	Y( :yield ),
	X( :batch),
	By(:trial, :popcorn),
	
	//All pairs Tukey HSD test
	All Pairs(
		1,
		Confidence Quantile( 0 ),
		Difference Matrix( 0 ),
		LSD Threshold Matrix( 0 ),
		Connecting Letters Report( 0 ),
		Ordered Differences Report( 1 )
	),
);
/***********************************/

/***********************************/
// Create the combined data table, using xPath (return a list of the OutlineBoxes from the report that have the title Ordered Differences Report.  Then, use the first of those to make the combined data table.)
outlines = (Report( ow1[1] ) << Top Parent) << XPath( "//OutlineBox[text()='Ordered Differences Report']" );
dt1 = outlines[1][Table Box( 1 )] << Make Combined Data Table;
dt1 << set name( "pValue PairWise by Channel Tukey");

/***********************************/

/**********************************/
pVals = Loc(dt1:"p-Value"n << get as matrix <= 0.05); 
comparisonStrings = {};
For(i = 1, i <= N Items(pVals), i++,
	Insert Into(comparisonStrings, char(dt1:trial[pVals[i]]) || dt1:popcorn[pVals[i]]);
);
show(comparisonStrings);

dtPopCorn << clear select;
dtPopCorn << select where(contains(comparisonStrings, char(:trial)||:popcorn));

Only combination with p-value below 0.05 is trial = 2 and popcorm = gourmet.

-Jarmo

aliegner1 · Jun 16, 2021 06:41 PM

that worked! such a simple WIN. send a concatted value to a string.