Re: Graph Builder: Unclear statistics when using multiple response columns

gnu23 · May 13, 2024 04:24 AM

Hi all,

I have a problem regarding statistics/bar graphs using multiple response columns. As an example, I will use the Big Class Families data table, and I'm using JMP 18.
I want to plot the fraction of persons who have visited specific countries. For simplicity, I restricted the age to 14. I created a Graph Builder plot with the following script:

Graph Builder(
	Size( 1097, 475 ),
	Legend Position( "Right" ),
	Fit to Window( "Off" ),
	Summary Statistic( "Median" ),
	Order Statistic( "Median" ),
	Variables( X( :countries visited ), Group X( :age ) ),
	Elements(
		Bar(
			X,
			Legend( 5 ),
			Summary Statistic( "N" ),
			Label( "Label by Percent of Total Values" )
		),
		Caption Box( X, Legend( 6 ), Summary Statistic( "Median" ) )
	),
	Local Data Filter(
		Add Filter(
			columns( :age ),
			Where( :age == 14 ),
			Display( :age, N Items( 6 ) )
		)
	)
);

The percentage labels on the bar plot show values of 9.1% and 18% for counts of 1 and 2, respectively. I would expect (correct me if I'm wrong) values of 1/12 ~ 8.3% and 2/12 ~ 17% if the percentage is calculated with the number of persons of the respective age (12) as the base value. For some reason, JMP seems to use the number of responses as the base value for the percentage calculation (which is 11 in this case, so 1/11 ~ 9.1% and 2/11 ~ 18%).
Is it intended that the number of responses, rather than the number of rows, is used as the base value for such calculations with Multiple Response columns? What is the best practice for plotting the values the way I intend (labels showing the percentage relative to the number of rows)?
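To make the two base values concrete, here is a minimal sketch in Python (not JSL, and not a description of how JMP works internally). The toy data is hypothetical: it is not the actual Big Class Families rows, only 12 people with 11 responses in total, so that counts of 1 and 2 reproduce the percentages quoted above. The function names `percent_by_response` and `percent_by_row` are made up for this illustration.

```python
from collections import Counter

# Hypothetical toy data: 12 people, each with a (possibly empty) list of
# countries visited. These are NOT the actual Big Class Families rows.
rows = [
    ["Italy", "France"],
    ["Italy"],
    ["Italy", "Spain"],
    ["Italy"],
    ["Italy", "Spain"],
    ["Spain"],
    ["France", "Brazil"],
    [], [], [], [], [],
]

def percent_by_response(rows):
    """Share of each country among all responses (base = number of responses)."""
    counts = Counter(c for row in rows for c in row)
    total = sum(counts.values())
    return {c: 100 * n / total for c, n in counts.items()}

def percent_by_row(rows):
    """Share of rows that mention each country (base = number of rows)."""
    # set(row) ensures a country is counted at most once per person.
    counts = Counter(c for row in rows for c in set(row))
    return {c: 100 * n / len(rows) for c, n in counts.items()}

by_response = percent_by_response(rows)
by_row = percent_by_row(rows)

for country in sorted(by_row):
    print(f"{country:7s} by response: {by_response[country]:5.1f}%   "
          f"by row: {by_row[country]:5.1f}%")
```

With this data, a count of 1 gives about 9.1% by response and about 8.3% by row, which matches the discrepancy described above.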

MRB3855 · May 13, 2024 2:14 AM

Hi @gnu23: Dividing by 12 (the number of persons), as you suggest, would give a sum of 100*(5/12 + 6/12) = 91.67%, which is less than 100%. Dividing by the number of responses (11) gives a total of 100*(5/11 + 6/11) = 100%, as it should. You could select "Include missing categories" (5/16) from the Graph Builder red triangle menu to get a more complete view of the data. Or, if you want to use the number of persons as your "n", you may have to change countries visited to a single nominal response and enter "No response" (or something similar) for non-responders, or use the "Include missing categories" option as above.

So... it depends on what you are trying to show on your plot. Are you focused on visits to a country (as shown) or on the travel history of individuals?

gnu23 · May 14, 2024 08:54 AM

Hi @MRB3855,
Thank you for the fast response. I am aware that dividing by the number of rows would result in percentages that do not sum to 100%, which in my opinion is quite logical when you are dealing with columns that have "multiple responses". Using a single nominal response would be an option in this case, but it also reduces the functionality.
What I want to show in the plot is a kind of "popularity" of a country or, in other words: if I pick a random person, what is the probability that they have been to country x?

A more extreme example of these strange statistics would be a dataset of, e.g., 12 persons, where only one has been to two countries and none of the rest have been to any country. A plot like the one above would show 50% for both countries (1/2 responses) instead of 1/12 for both countries (which would make sense in my opinion). Using "Include missing" won't fix this, as it adds the missing values as extra responses but still counts responses instead of rows, yielding a (in my opinion) wrong percentage value as long as there is at least one row with more than a single response.
I think the discussion boils down to the question of whether statistics on multiple response columns should generally use the number of responses (the current standard in JMP) or the number of rows as 'n'. In my opinion the second option would make more sense for many use cases, and an option to switch between these two values for 'n' would extend the (already very nice) functionality of multiple response columns.
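The extreme case can be checked with a few lines of Python. This is only a sketch of the arithmetic described in this post, using hypothetical data. The `include_missing` flag models the assumption stated above (each empty row is added as one extra response); it is not a claim about how JMP implements "Include missing categories".

```python
from collections import Counter

# Hypothetical data: 12 people, one visited two countries, the rest none.
rows = [["Italy", "France"]] + [[] for _ in range(11)]

def shares(rows, include_missing=False):
    """Percent per category with the number of responses as the base.

    With include_missing=True, each empty row contributes one "(missing)"
    response. This is an assumption taken from the thread, not JMP's
    actual implementation.
    """
    responses = []
    for row in rows:
        if row:
            responses.extend(row)
        elif include_missing:
            responses.append("(missing)")
    counts = Counter(responses)
    return {k: 100 * n / len(responses) for k, n in counts.items()}

per_response = shares(rows)                        # base = 2 responses
with_missing = shares(rows, include_missing=True)  # base = 2 + 11 = 13
per_row_italy = 100 * sum("Italy" in row for row in rows) / len(rows)

print(per_response)
print(round(with_missing["Italy"], 2), round(per_row_italy, 2))
```

Under these assumptions, Italy gets 50% by response, 1/13 (about 7.7%) once the missing rows are counted as responses, and 1/12 (about 8.3%) by row. The "missing" variant comes close to the per-row value here but is still not the same quantity.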

hogi · May 14, 2024 9:04 AM

I agree. Use cases where ONE row counts as 1 seem quite obvious, even for entries with multiple responses.

"Which percentage of the students visited Italy?"

is more likely to be asked than:

"Taking all cases where any of the students visited any country, how often was it Italy?"

I have to admit that I was trapped by the "100%" argument, but after a lunch time discussion about the topic, I will accept that ratios like "Which percentage of the students visited ...?" don't need to sum up to 100%. How to calculate the values?

The strange behavior of counting responses gets even stranger when multiple histograms are combined in one graph:

Graph Builder: multiple response + multiple histograms

Let's assume a school evaluates the sports activities via JMP. If a student has enough time to travel, she can cheat and boost the scores for her sports activity by orders of magnitude! Watch LOUISE travel the world, and the effect on the statistics of "Basketball":

dt = Open( "$SAMPLE_DATA/Big Class Families.jmp" );

NewWindow("compare",H List box(
dt << Graph Builder(
	Show Control Panel( 0 ),
	Variables(
		X( :sports ),
		X( :countries visited )
	),
	Elements( Position( 1, 1 ), Bar( X, Legend( 3 ) ) ),
	Elements( Position( 2, 1 ), Bar( X, Legend( 8 ) ) ),
	SendToReport(
		Dispatch(
			{},
			"",
			ScaleBox,
			{Min( 0 ), Max( 50 ), Inc( 10 ), Minor Ticks( 1 )}
		)
	)
);

dt << Graph Builder(
	Show Control Panel( 0 ),
	Variables( X( :sex ) ),
	Elements( Bar( X, Legend( 3 ) ) ),
	SendToReport(
		Dispatch(
			{},
			"",
			ScaleBox,
			{Min( 0 ), Max( 50 ), Inc( 10 ), Minor Ticks( 1 )}
		)
	)
)));

dt << Minimize Window();

New Window( "",
		<<Type( "Modal Dialog" ),
		Text Box("Now Let's send LOUISE around the world ...")
		);



dt:countries visited[2] = char(as list(1::100)[1])

MRB3855 · May 14, 2024 2:46 PM

Hi @hogi @gnu23 Great conversation! A couple comments particularly struck me.

@hogi said:

(1)

"Which percentage of the students visited Italy?"

is more likely to be asked than:

"Taking all cases where any of the students visited any country, how often was it Italy?"

Really? I dunno… if I’m the minister of tourism in country A, which am I more interested in?

I think it depends on what you are interested in, who your audience is, etc.

And:

(2)

“I will accept that ratios like "Which percentage of the students visited ...?" don't need to sum up to 100%.”

Really? OK, I’ll play; help me with that. I.e., what exactly doesn’t need to add up to 100%? Certainly some individuals didn’t visit any country (or didn’t respond). So the sum of proportions of all countries won’t sum to 100%…you’ll need to add the case for the proportion of persons who visited no country (or didn’t respond). Then all the proportions sum to 1. Is that kind of thing what you mean?

hogi · May 14, 2024 05:36 PM

@MRB3855 wrote:
@hogi said:
"Which percentage of the students visited Italy?"
...
Really? I dunno…

... I just thought about the teacher of the class asking:

"who visited Italy this year? wow, everybody! And France? 30% of the class ..."

Let's assume there are 10 students in the class.

all of them visited Italy: 100%.

3 of them visited France: 30%

5 of them visited Brazil: 50%

in total (just looking at the 3 numbers): "180%"

I agree, this is definitely not the way one calculates with percentages.
My brain almost exploded, but one might agree: the 100%, 30% and 50% are "meaningful".

There are only 2 catches:

- they don't add up to 100%

- they cannot be calculated via "Label by Percent of Total Values" *) and similar approaches in JMP.

*) might be great to have something like: "Label by Percent of all rows" (like in all plots without multiple response)

So, I would support @gnu23's suggestion to provide a second/alternative option of calculating by rows instead of "responses" ...
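The class example can be reproduced with a short Python sketch. The data is hypothetical and only mirrors the numbers above (10 students: all visited Italy, 3 visited France, 5 visited Brazil); the variable names are made up for this illustration.

```python
# Hypothetical class of 10 students, matching the numbers in the example.
students = [{"Italy"} for _ in range(10)]
for s in students[:3]:
    s.add("France")
for s in students[:5]:
    s.add("Brazil")

countries = ["Italy", "France", "Brazil"]
n_rows = len(students)                       # 10 students
n_responses = sum(len(s) for s in students)  # 10 + 3 + 5 = 18 responses

# Base = number of rows: "which percentage of the students visited X?"
pct_of_rows = {
    c: 100 * sum(c in s for s in students) / n_rows for c in countries
}

# Base = number of responses: "of all visits, how often was it X?"
pct_of_responses = {
    c: 100 * sum(c in s for s in students) / n_responses for c in countries
}

print(pct_of_rows, sum(pct_of_rows.values()))
print(pct_of_responses, sum(pct_of_responses.values()))
```

The per-row values are 100%, 30% and 50% (180% in total), while the per-response values add up to 100% but answer a different question.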

MRB3855 · May 15, 2024 1:13 AM

Hi @hogi No need to add up to 100% in your (legitimate) example of 100%, 30%, and 50% because countries visited are not mutually exclusive (a student could visit more than one country). And if that kind of thing is what you are after…it may be a candidate for the “JMP Wish List”?

hogi · May 15, 2024 04:15 AM

@MRB3855 wrote:
a student could visit more than one country

Yes, I think this is the basic idea of "multiple response".

MRB3855 · May 15, 2024 04:49 AM

Indeed @hogi: And to complete the thought, and thoroughly beat this to death... a student, however, can't visit more than one country at a time (e.g., a student can't be in Italy and Germany at the same time). So, defined that way, those countries are mutually exclusive. Combined with a category for students who did not visit any country, they are also collectively exhaustive. Then those will sum to 100% (via counting responses, as in the current method).

As you were...

hogi · May 15, 2024 05:42 AM

If other users are interested in the "alternative" (Ntotal = Nrows) statistics, here is the wish:

Graph Builder: Option for "statistics by row" - not by response

The most important lesson is that Graph Builder changes its whole way of calculating statistics once a column with modeling type "Multiple Response" is used ANYWHERE in the plot (demonstrated in a video attached to the original post).