Discussions

profjmb · Jun 11, 2023 4:23 AM

I have data from several (22?) samples, which vary considerably in sample size. I understand that parameters are estimated better with larger samples, but that is not the issue I want to address here. It is that I would like to both graph distributions, and estimate parameters, as if the sample sizes didn't differ across the groups/samples. This means that large samples will need to be de-weighted and small samples up-weighted.

I tried to do this in "Fit Y by X," as follows: I computed a new variable, WEIGHT, as the sample size of the Group divided by the total sample size. So let's say that I have four groups:

Group 1: N1=100

Group 2: N2=500

Group 3: N3=200

Group 4: N4=400

Also Groups 1 and 2 are Type 1 and Groups 3 and 4 are Type 2

WEIGHT will be:

Group 1: 100/1200

Group 2: 500/1200

Group 3: 200/1200

Group 4: 400/1200

If I do "Fit Y by X" where Y is the DV and X is "Type" and weight by WEIGHT, then use "Compare Densities" to get a plot of overlapping densities, it looks right. However, if I continue and get the standard deviations for the DV, these are much larger than the standard deviations for any of the groups. This makes me think that I am not understanding what I'm doing.

Help?

profjmb · Mar 26, 2022 02:18 PM

I made a mistake in my original post, which I would delete if I knew how. The correct version is below:

I have data from several (22?) samples, which vary considerably in sample size. I understand that parameters are estimated better with larger samples, but that is not the issue I want to address here. It is that I would like to both graph distributions, and estimate parameters, as if the sample sizes didn't differ across the groups/samples. This means that large samples will need to be de-weighted and small samples up-weighted.

I tried to do this in "Fit Y by X," as follows: I computed a new variable, WEIGHT, as the total sample size divided by the sample size of the Group. So let's say that I have four groups:

Group 1: N1=100

Group 2: N2=500

Group 3: N3=200

Group 4: N4=400

Also Groups 1 and 2 are Type 1 and Groups 3 and 4 are Type 2

WEIGHT will be:

Group 1: 1200/100

Group 2: 1200/500

Group 3: 1200/200

Group 4: 1200/400

If I do "Fit Y by X" where Y is the DV and X is "Type" and weight by WEIGHT, then use "Compare Densities" to get a plot of overlapping densities, it looks right. However, if I continue and get the standard deviations for the DV, these are much larger than the standard deviations for any of the groups. This makes me think that I am not understanding what I'm doing.

Help?

Georg · Mar 28, 2022 08:27 AM

I think this post may help you:

Solved: Weighted Standard Deviation - JMP User Community

The following script may also help to show what can happen.

In your approach, the calculation of the mean works whichever role you give the weight column (Weight or Freq).

I defined the weight differently from you: I wanted the weights to sum to 1200 over all rows (yours sum to 4*1200).

This should not matter, and with the column in the Freq role it does not, but with the column in the Weight role it does. See the script.
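The two weight definitions discussed here can be checked with a little arithmetic. The following Python sketch (not JSL, and not part of the original script) uses the group sizes from the question. The variable names are illustrative, and the second definition is read off the `weight[Group]` formula in the script below: total N divided by the number of groups, divided by the group N.

```python
# Group sizes from the question.
group_sizes = {"Group 1": 100, "Group 2": 500, "Group 3": 200, "Group 4": 400}

total = sum(group_sizes.values())   # 1200 rows in all
n_groups = len(group_sizes)         # 4 groups

# Definition in the corrected post: total N / group N.
w_post = {g: total / n for g, n in group_sizes.items()}

# Definition in the script: total N / number of groups / group N.
w_script = {g: total / n_groups / n for g, n in group_sizes.items()}

# Sum of the weights over all rows (each row in a group carries that group's weight).
sum_post = sum(w_post[g] * n for g, n in group_sizes.items())
sum_script = sum(w_script[g] * n for g, n in group_sizes.items())

for g, n in group_sizes.items():
    print(g, "n =", n, "post weight =", w_post[g], "script weight =", w_script[g])
print("sum of weights (post):", sum_post)
print("sum of weights (script):", sum_script)
```

With either definition every group carries the same total weight, so the groups contribute equally to a weighted mean. The definitions differ only by the constant factor 4, which is the number of groups.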

Unfortunately, I cannot explain exactly why the small dataset gets a very large stddev in comparison to the total average. It is perhaps due to the squaring and the square root ...
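One possible explanation, offered as an assumption rather than something confirmed in this thread: suppose the Weight role computes the variance as the weighted sum of squared deviations divided by the number of rows minus one, while the Freq role divides by the sum of the frequencies minus one. A constant weight w within a group would then multiply the Weight-role standard deviation by sqrt(w). The Python sketch below (not JSL) checks that arithmetic for a simulated group of 100 rows with weight 1200/100 = 12. The function `weighted_sd` and the simulated data are illustrative only.

```python
import math
import random

def weighted_sd(values, weights, denominator):
    """Square root of sum(w * (y - weighted mean)^2) / denominator."""
    total_w = sum(weights)
    mean_w = sum(w * y for w, y in zip(weights, values)) / total_w
    ss = sum(w * (y - mean_w) ** 2 for w, y in zip(weights, values))
    return math.sqrt(ss / denominator)

random.seed(1)
n = 100                                  # rows in the small group
y = [random.gauss(0.0, 1.0) for _ in range(n)]
w = [12.0] * n                           # weight 1200 / 100 on every row

# Ordinary sample standard deviation (all weights equal to 1).
plain_sd = weighted_sd(y, [1.0] * n, n - 1)

# ASSUMED Freq-role behaviour: denominator is sum of frequencies minus 1.
freq_sd = weighted_sd(y, w, sum(w) - 1)

# ASSUMED Weight-role behaviour: denominator is number of rows minus 1.
weight_sd = weighted_sd(y, w, n - 1)

print("plain sd: ", plain_sd)
print("freq sd:  ", freq_sd)
print("weight sd:", weight_sd)
print("weight sd / plain sd:", weight_sd / plain_sd)
```

Under these assumptions the Freq-style value stays close to the ordinary standard deviation, while the Weight-style value is larger by a factor of sqrt(12), about 3.46. That would match the observation that the weighted standard deviations are much larger than those of any single group.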

I personally would not use your approach, because it is not clear what is happening. I would compute the group summaries first, then average over the groups with each group weighted 1, and then combine that result into your graph.
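The summarize-then-average idea can be sketched as follows. This is a minimal Python illustration with hypothetical toy data, not the JMP workflow itself: each group is reduced to its own mean, and the group means within a Type are then averaged with equal weight, so group size no longer matters.

```python
from statistics import mean

# Hypothetical toy data: unequal group sizes, two groups per Type.
groups = {
    "Group 1": {"type": "Type 1", "dv": [1.0, 2.0, 3.0]},
    "Group 2": {"type": "Type 1", "dv": [4.0, 4.0, 5.0, 5.0, 6.0, 6.0]},
    "Group 3": {"type": "Type 2", "dv": [0.0, 2.0]},
    "Group 4": {"type": "Type 2", "dv": [3.0, 3.0, 3.0, 3.0]},
}

# Step 1: summarize each group on its own.
group_means = {g: mean(info["dv"]) for g, info in groups.items()}

# Step 2: average the group means within each Type, each group weighted 1.
type_names = sorted({info["type"] for info in groups.values()})
equal_weight_means = {
    t: mean(group_means[g] for g, info in groups.items() if info["type"] == t)
    for t in type_names
}

# For comparison: pooled means, where larger groups dominate.
pooled_means = {
    t: mean(v for info in groups.values() if info["type"] == t for v in info["dv"])
    for t in type_names
}

for t in type_names:
    print(t, "equal-weight mean =", equal_weight_means[t], "pooled mean =", pooled_means[t])
```

The same two-step pattern applies to any other per-group statistic, such as a standard deviation, and it avoids relying on how a weight column is interpreted by a given platform.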

Names Default To Here( 1 );
// about the role of weight and frequency for calculation of mean and stddev
//
// web("https://www.jmp.com/support/help/en/16.1/?os=win&source=application&utm_source=helpmenu&utm_medium=application#page/jmp/summary-statistics.shtml");
//
nelem_lst = {100, 500, 200, 400};
table_lst = {};
For Each( {value, index}, nelem_lst,
	Eval(
		Eval Expr(
			table_lst[index] = New Table( "Table " || Char( index ),
				add rows( nelem_lst[index] ),
				New Column( "Group", "Character", set each value( "Group " || Char( index ) ) ),
				New Column( "Type", "Character", set each value( If( index <= 2, "Type 1", "Type 2" ) ) ),
				New Column( "DV", "Continuous", formula( Random Normal( Expr( Mod( index, 2 ) ), Expr( Mod( index, 2 ) + 1 ) ) ) )
			)
		)
	);
	Wait( 0.1 );
	table_lst[index]:DV << delete formula;
);
Wait( 0 );
dt = table_lst[1] << concatenate( Table Name( "All" ), table_lst[2 :: 4] );
For Each( {value}, table_lst, Close( value, "NoSave" ) );

Summarize( dt, group_lst = by( :group ) );
ngroups = N Items( group_lst );

dt << New Column( "ColMean[Group]", formula( Col Mean( :DV, :group ) ) );
dt << New Column( "ColStd[Group]", formula( Col Std Dev( :DV, :group ) ) );
Eval( Eval Expr( dt << New Column( "weight[Group]", formula( Col Number( :DV ) / Expr( ngroups ) / Col Number( :DV, :group ) ) ) ) );

nw = New Window( "oneway comparison",
	H List Box(
		Panel Box( "w/o weight",
			dt << Oneway( Y( :DV ), X( :Group ),  Means and Std Dev( 1 ), Mean Error Bars( 1 ), Std Dev Lines( 1 ) );
		),
		Panel Box( "weight in role frequency",
			dt << Oneway(
				Y( :DV ),
				X( :Group ),
				Freq( :"weight[Group]"n ),
				Means and Std Dev( 1 ),
				Mean Error Bars( 1 ),
				Std Dev Lines( 1 )
			);

		),
		Panel Box( "weight in role weight",
			dt << Oneway(
				Y( :DV ),
				X( :Group ),
				Weight( :"weight[Group]"n ),
				Means and Std Dev( 1 ),
				Mean Error Bars( 1 ),
				Std Dev Lines( 1 )
			);

		)
	)
);

nw = New Window( "Tabulate comparison",
	H List Box(
		Panel Box( "w/o weight",
			dt << Tabulate(
				Change Item Label( Grouping Columns( :Type( "All" ), "All" ) ),
				Show Control Panel( 0 ),
				Add Table(
					Column Table( Statistics( N ) ),
					Column Table( Analysis Columns( :"weight[Group]"n ), Statistics( Sum ) ),
					Column Table( Analysis Columns( :DV ), Statistics( Mean ) ),
					Column Table( Statistics( Std Dev ), Analysis Columns( :DV ) ),
					Row Table( Grouping Columns( :Type, :Group ), Add Aggregate Statistics( :Type, :Group ) )
				)
			)
		),
		Panel Box( "weight in role frequency",
			dt << Tabulate(
				Change Item Label( Grouping Columns( :Type( "All" ), "All" ) ),
				Freq( :"weight[Group]"n ),
				Show Control Panel( 0 ),
				Add Table(
					Column Table( Statistics( N ) ),
					Column Table( Analysis Columns( :"weight[Group]"n ), Statistics( Sum ) ),
					Column Table( Analysis Columns( :DV ), Statistics( Mean ) ),
					Column Table( Statistics( Std Dev ), Analysis Columns( :DV ) ),
					Row Table( Grouping Columns( :Type, :Group ), Add Aggregate Statistics( :Type, :Group ) )
				)
			)
		),
		Panel Box( "weight in role weight",
			dt << Tabulate(
				Change Item Label( Grouping Columns( :Type( "All" ), "All" ) ),
				weight( :"weight[Group]"n ),
				Show Control Panel( 0 ),
				Add Table(
					Column Table( Statistics( N ) ),
					Column Table( Analysis Columns( :"weight[Group]"n ), Statistics( Sum ) ),
					Column Table( Analysis Columns( :DV ), Statistics( Mean ) ),
					Column Table( Statistics( Std Dev ), Analysis Columns( :DV ) ),
					Row Table( Grouping Columns( :Type, :Group ), Add Aggregate Statistics( :Type, :Group ) )
				)
			)
		)
	)
);

dt << Summary( Group( :Group, :Type, :"ColMean[Group]"n, :"ColStd[Group]"n, :"weight[Group]"n ), Freq( "None" ), Weight( "None" ) );

Georg

Mark_Bailey · Mar 28, 2022 08:37 AM

I don't think parameter estimation requires equal sample sizes or normalization.


"Weight" in "Fit Y by X" (and perhaps generally in JMP platforms)
