Validate / Test for overfit model by re-ordering Y variables?

ih

Community Trekker

Joined:

Sep 30, 2016

A co-worker described a test in another software package that randomly re-arranges the y values in an analysis and then re-fits the model to confirm that the shuffled data gives a poor fit.  Is anyone familiar with this technique, or has anyone attempted to automate it in JMP?

 

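Comparing the two fits in the script below (the original y against the shuffled y') demonstrates the idea, although it sounds as though the full technique creates and tests many shuffled y' columns rather than just one.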

 

dt = New Table( "Untitled 2",
	Add Rows( 10 ),
	New Column( "x",
		Numeric,
		"Continuous",
		Format( "Best", 12 ),
		Set Selected,
		Set Values( [1, 2, 3, 4, 5, 6, 7, 8, 9, 10] )
	),
	New Column( "y",
		Numeric,
		"Continuous",
		Format( "Best", 12 ),
		Formula( :x * 2 + Random Normal( 0, 0.1 ) )
	),
	New Column( "y'",
		Numeric,
		"Continuous",
		Format( "Best", 12 ),
		Formula( Col Stored Value( :y, Col Shuffle() ) )
	)
);

dt << Fit Model(
	Y( :y, :y' ),
	Effects( :x ),
	Personality( "Standard Least Squares" ),
	Emphasis( "Effect Leverage" ),
	Run(
		:y << {Lack of Fit( 0 ), Plot Actual by Predicted( 0 ),
		Plot Residual by Predicted( 0 ), Plot Effect Leverage( 0 )},
		:y' << {Lack of Fit( 0 ), Plot Actual by Predicted( 0 ),
		Plot Residual by Predicted( 0 ), Plot Effect Leverage( 0 )}
	)
);
2 REPLIES
markbailey

Staff

Joined:

Jun 23, 2011

This resampling method, a permutation test, is usually used to determine the significance of an effect, that is, of changing the levels of the independent variable. It is often used for inference in place of the t test or ANOVA, which make assumptions about the data and the model. The resampling approach generates an empirical distribution of the statistic instead of assuming a particular model. You compare your sample statistic to that empirical distribution to obtain a p-value.
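To make the comparison against the empirical distribution concrete, here is a minimal sketch in plain Python (outside JMP, and not how any JMP platform is implemented). The function names `r_squared` and `permutation_p_value` are hypothetical, and the data simply mimic the x and y columns from the original post. The y values are shuffled many times, RSquare is recomputed for each shuffle, and the p-value is the share of shuffles that fit at least as well as the real data.

```python
import random

def r_squared(x, y):
    """RSquare of a simple least-squares line of y on x."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return (sxy * sxy) / (sxx * syy)

def permutation_p_value(x, y, n_perm=1000, seed=1):
    """Return (observed RSquare, permutation p-value). Hypothetical helper."""
    rng = random.Random(seed)
    observed = r_squared(x, y)
    shuffled = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # break any real x-y relationship
        if r_squared(x, shuffled) >= observed:
            hits += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    return observed, (hits + 1) / (n_perm + 1)

# Data resembling the example table: y = 2x plus a little noise.
data_rng = random.Random(1)
x = list(range(1, 11))
y = [2 * xi + data_rng.gauss(0, 0.1) for xi in x]

observed, p = permutation_p_value(x, y)
print(f"RSquare = {observed:.4f}, permutation p-value = {p:.4f}")
```

If the model were only fitting noise, the shuffled fits would often match or beat the observed RSquare and the p-value would be large.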

JMP Pro can bootstrap any result, such as a parameter estimate, in order to determine its significance.
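For comparison, the bootstrap idea can be sketched the same way. This is a plain Python illustration under stated assumptions, not JMP Pro's implementation: it resamples rows with replacement, refits the slope each time, and reports a percentile interval. The names `slope` and `bootstrap_slope_interval` are hypothetical, and the data again mimic the example table.

```python
import random

def slope(x, y):
    """Least-squares slope of y on x."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy / sxx

def bootstrap_slope_interval(x, y, n_boot=2000, seed=1):
    """95% percentile interval for the slope from row resampling. Hypothetical helper."""
    rng = random.Random(seed)
    n = len(x)
    slopes = []
    while len(slopes) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]  # rows drawn with replacement
        xs = [x[i] for i in idx]
        if len(set(xs)) < 2:
            continue  # slope is undefined when every resampled x is identical
        ys = [y[i] for i in idx]
        slopes.append(slope(xs, ys))
    slopes.sort()
    lower = slopes[int(0.025 * n_boot)]
    upper = slopes[int(0.975 * n_boot) - 1]
    return lower, upper

# Data resembling the example table: y = 2x plus a little noise.
data_rng = random.Random(1)
x = list(range(1, 11))
y = [2 * xi + data_rng.gauss(0, 0.1) for xi in x]

lower, upper = bootstrap_slope_interval(x, y)
print(f"95% bootstrap interval for the slope: ({lower:.3f}, {upper:.3f})")
```

An interval that excludes zero suggests the slope is significant. Unlike the shuffle, which simulates the case of no relationship, the bootstrap keeps each x paired with its own y and measures how stable the estimate is.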

You don't need the second dependent variable (random normal deviate).

Learn it once, use it forever!
ih

Community Trekker

Joined:

Sep 30, 2016

Mark,

 

If I understand your post, bootstrapping would be an alternative approach that (hopefully) arrives at the same conclusion.  Honestly, though, I have not had good luck bootstrapping parameter estimates for anything more than a basic model; perhaps I am doing something wrong :-).  For example, attempting to bootstrap parameter estimates for a second-order model results in multiple columns for each possible coefficient.

 

I think I found a way to automate this using the Simulate feature.  Instead of using the simulation function generated by a platform, just swap the original y with the re-arranged y' column.  I am having trouble applying the technique to partition methods, though; I suspect the same trees are used for each simulation.

 

Attached is a script showing what this looks like for a few different methods, but here is the basic idea:

Random Reset(1);

dt = New Table( "Untitled 2",
	Add Rows( 4 ),
	New Column( "x",
		Numeric,
		"Continuous",
		Format( "Best", 12 ),
		Set Selected,
		Set Values( [1, 2, 3, 4] )
	),
	New Column( "y",
		Numeric,
		"Continuous",
		Format( "Best", 12 ),
		Formula( :x * 2 + Random Normal( 0, 0.1 ) )
	),
	New Column( "y'",
		Numeric,
		"Continuous",
		Format( "Best", 12 ),
		Formula( Col Stored Value( :y, Col Shuffle() ) )
	)
);

// ------ Linear Model ------
linmdl = dt << Fit Model(
	Y( :y ),
	Effects( :x ),
	Personality( "Standard Least Squares" ),
	Emphasis( "Effect Leverage" ),
	Run(
		// Only :y is fit here; :y' is swapped in later by Simulate.
		:y << {Lack of Fit( 0 ), Plot Actual by Predicted( 0 ),
		Plot Residual by Predicted( 0 ), Plot Effect Leverage( 0 )}
	)
);

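// Simulate from the Summary of Fit table: on each of the 100 runs, :y is
// switched out for a fresh evaluation of the :y' formula (a new shuffle)
// and the model is refit.
linrpt = linmdl << Report;
lindt = linrpt["Summary of Fit"][1][2] << Simulate(
	100,
	Out( :y ),
	In( :y' )
);

// Show the distribution of RSquare across the simulated (shuffled) fits.
lindt[2] << Distribution( Continuous Distribution( Column( :RSquare ) ) );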