shampton82
Level VII

This might be a big ask, but can someone help with a script to try and select a normal distribution when it is pretty close to the best fitted distribution?

So here's what I'm hoping for:

When you click Fit All in the Distribution platform, you get a lot of fits:

[Screenshot: list of fitted distributions produced by Fit All]

However, if the AICc of the best fit is within 5 of the AICc of the Normal distribution, you might as well use the Normal fit.  So, is there a way to script something that would go through a bunch of columns that have already had the best fit run and then change the selected fit (assuming it is non-Normal) to Normal if the Normal AICc is within 5 of the best fit's AICc?  Bonus points for an input box to enter the AICc delta you are willing to live with.  Double bonus for removing Student's t, Cauchy, and ExGaussian from the selection options, since you can't calculate process capability on these distributions (and that will be the next step to run after this cleanup script has run).

 

I've tried and can't get it to work, any help would be greatly appreciated!!

 

Steve

15 REPLIES 15
shampton82
Level VII

Re: This might be a big ask, but can someone help with a script to try and select a normal distribution when it is pretty close to the best fitted distribution?

Okay, I got there!  I'm sure it's not very elegant, but I wanted to throw it out there in case it helps anyone else.

The first step is selecting the columns on your data table that you want to fit distributions for.

 

//rev 8-24-24

names default to here(1);
dt=current data table();

colnames=dt<<Get selected Columns(continuous, "string");

nw=new window("What AICc is comparable?", Show Menu( 0 ), Show Toolbars( 0 ),<<modal,<<size(300,100),<<return results,
			vlistbox(	
				hlistbox(
					Text Box("Put in what difference between Normal and the best fit you consider the same"),
					neb1=Number Edit Box(5);
					
				),
				 Button Box( "OK",var1 = neb1 << Get;)
			)	
				
	
);




rpt=new window("test",<<WindowView( "Invisible" ),
		obj=dt<<distribution(column(eval(colnames)),Fit All
			
		);
	
	
	
);

Wait( 0 );
dt1=rpt["Distributions", "Compare Distributions", Table Box( 1 )] <<
Make Combined Data Table(invisible);
rpt << Close Window;

//get rid of distribution types that can't have a Process Capability Analysis
// Delete selected rows
dt1 << Select Where(
	:Distribution == "Cauchy" | :Distribution == "ExGaussian" | :Distribution ==
	"Student's t"
) << Delete Rows;


//create a column that will identify the order of the fitted distributions
// New column: Column 10
dt1 << New Column( "Column 10",
	Numeric,
	"Continuous",
	Format( "Best", 12 )
);

// Change column formula: Column 10
dt1:Column 10 << Set Formula( Col Cumulative Sum( 1, :Y ) );

//Select the best fit as well as the Normal fit for each Y, then delete all other rows
// Delete selected rows
dt1 << Select where(
	:Distribution == "Normal" | :Column 10 == 1
) << Invert Row Selection << Delete Rows;

//Create columns to determine and select the Normal fit (if it is the best fit or within the delta criterion we input at the start)
dt1 << New Column( "Column 11",
		Numeric,
		"Continuous",
		Format( "Best", 12 ),
		Formula( If( :Column 10 == 1, Empty(), :AICc - Lag( :AICc, -1 ) ) )
	);
dt1 << New Column( "Column 12",
		Numeric,
		"Continuous",
		Format( "Best", 12 ),
		Formula( If( :Distribution == "Normal" & :Column 10 == 1, 1 ) )
	);
eval(eval expr(dt1 << New Column( "Column 13",
		Numeric,
		"Continuous",
		Format( "Best", 12 ),
		Formula(
			If( :Column 10 == 1 & :Distribution != "Normal",
				If( Abs( Lag( :Column 11, -1 ) ) > expr(var1),
					1
				)
			)
		)
	)));
dt1<<	New Column( "Column 14 2",
		Numeric,
		"Continuous",
		Format( "Best", 12 ),
		Formula(
			If( Is Missing( Col Maximum( :Column 13, :Y ) ) & :Column 10 == 2,
				1
			)
		),
		Set Selected
	);
dt1 << New Column( "Column 14",
		Numeric,
		"Continuous",
		Format( "Best", 12 ),
		Formula( Sum( :Column 12, :Column 13,:Column 14 2 ) ),
		Set Selected
	);


wait(0);

// Delete column formula: Column 10
dt1:Column 10 << Delete Formula;


// Delete column formula: Column 11
dt1:Column 11 << Delete Formula;

// Delete column formula: Column 12
dt1:Column 12 << Delete Formula;

// Delete column formula: Column 13
dt1:Column 13 << Delete Formula;

// Delete column formula: Column 14
dt1:Column 14 << Delete Formula;

//Delete non-Normal fits that are within our criteria
// Delete selected rows
dt1 << Select Where( :Column 14 == 1 ) <<
Invert Row Selection << Delete Rows;

//Put all the Y's and distributions into lists
col={};
dist={};
for each row(dt1,
		insertinto(col,:Y);
		insertinto(dist,:Distribution);
);

close(dt1, nosave);


//bring back up the distribution platform but only with the Y's that we could fit a distribution to
rpt=new window("Best Distribution",
		obj=dt<<distribution(column(eval(col)),Process Capability( 0 ),
			
		);
	
	
	
);

//Apply the distributions

for(i=1, i<=n items(col), i++,

	//whatbox = column(colnames[i])<<get name;
	//test=(Report(obj) << XPath( "//OutlineBox[text() = '"||col[i]||"']"))<< get title();
	//if(eval(test[1])==eval(col[i]),
		if(dist[i]=="Normal",obj[i]<< Fit Normal);
		if(dist[i]=="Exponential",obj[i]<< Fit Exponential);
		if(dist[i]=="Gamma",obj[i]<< Fit Gamma);
		if(dist[i]=="Johnson Su",obj[i]<< Fit Johnson);
		if(dist[i]=="Lognormal",obj[i]<< Fit Lognormal);
		if(dist[i]=="Normal 2 Mixture",obj[i]<<Fit Normal 2 Mixture);
		if(dist[i]=="Normal 3 Mixture",obj[i]<<Fit Normal 3 Mixture);
		if(dist[i]=="SHASH",obj[i]<< Fit Shash);
		if(dist[i]=="Weibull",obj[i]<< Fit Weibull);
		if(dist[i]=="ZI SHASH",obj[i]<< Fit ZI SHASH);
		if(dist[i]=="Beta",obj[i]<< Fit Beta;);
		//);
	);	

Thanks for the inspiration @jthi and @txnelson (I used a bunch of your other posts to help me get here)

txnelson
Super User

Nice job.

I went through your script and have made a few changes.

  1. I changed the syntax used to create Column 10 so that it matches the syntax used for Column 11 through Column 14.  I also suggest that you change the names you are using.  Names such as Column 10 are the default names JMP assigns when it creates a column, so there is a good chance that a data table you run your script on already has a column with one of those names.  My suggestion is to simply change Column 10 etc. to _Column 10_ etc. to avoid the issue.
  2. I changed your New Column code from specifying a Formula to using Set Each Value.  It provides the same functionality without creating a formula, which your code then has to delete.
  3. I eliminated the For Each Row loop that you use to create the lists col and dist.  Each list can be built in a single statement, which removes the need to loop through the data table.
  4. I also suggest that you add a check at the beginning of your code to determine whether any columns are actually selected and, if not, display a message telling the user to select columns and run the script again.

See my changes below

//rev 8-24-24

names default to here(1);
dt=current data table();

colnames=dt<<Get selected Columns(continuous, "string");

nw=new window("What AICc is comparable?", Show Menu( 0 ), Show Toolbars( 0 ),<<modal,<<size(300,100),<<return results,
			vlistbox(	
				hlistbox(
					Text Box("Put in what difference between Normal and the best fit you consider the same"),
					neb1=Number Edit Box(5);
					
				),
				 Button Box( "OK",var1 = neb1 << Get;)
			)	
				
	
);




rpt=new window("test",<<WindowView( "Invisible" ),
		obj=dt<<distribution(column(eval(colnames)),Fit All
			
		);
	
	
	
);

Wait( 0 );
dt1=rpt["Distributions", "Compare Distributions", Table Box( 1 )] <<
Make Combined Data Table(invisible);
rpt << Close Window;

//get rid of distribution types that can't have a Process Capability Analysis
// Delete selected rows
dt1 << Select Where(
	:Distribution == "Cauchy" | :Distribution == "ExGaussian" | :Distribution ==
	"Student's t"
) << Delete Rows;


//create a column that will identify the order of the fitted distributions
// New column: Column 10
dt1 << New Column( "Column 10",
	Numeric,
	"Continuous",
	Format( "Best", 12 )
/*);

// Change column formula: Column 10
dt1:Column 10 << Set Formula*/
	,
	Set Each Value( Col Cumulative Sum( 1, :Y ) );
);

//Select the best fit as well as the Normal fit for each Y, then delete all other rows
// Delete selected rows
dt1 << Select where(
	:Distribution == "Normal" | :Column 10 == 1
) << Invert Row Selection << Delete Rows;

//Create columns to determine and select the Normal fit (if it is the best fit or within the delta criterion we input at the start)
dt1 << New Column( "Column 11",
		Numeric,
		"Continuous",
		Format( "Best", 12 ),
		Set Each Value( //Formula( 
			If( :Column 10 == 1, Empty(), :AICc - Lag( :AICc, -1 ) ) )
	);
dt1 << New Column( "Column 12",
		Numeric,
		"Continuous",
		Format( "Best", 12 ),
		Set Each Value( //Formula( 
			If( :Distribution == "Normal" & :Column 10 == 1, 1 ) )
	);
eval(eval expr(dt1 << New Column( "Column 13",
		Numeric,
		"Continuous",
		Format( "Best", 12 ),
		Set Each Value( //Formula(
			If( :Column 10 == 1 & :Distribution != "Normal",
				If( Abs( Lag( :Column 11, -1 ) ) > expr(var1),
					1
				)
			)
		)
	)));
dt1<<	New Column( "Column 14 2",
		Numeric,
		"Continuous",
		Format( "Best", 12 ),
		Set Each Value( //Formula(
			If( Is Missing( Col Maximum( :Column 13, :Y ) ) & :Column 10 > 1, // Set Each Value is static, so the Normal row keeps its original rank, which need not be 2
				1
			)
		),
		Set Selected
	);
dt1 << New Column( "Column 14",
		Numeric,
		"Continuous",
		Format( "Best", 12 ),
		Set Each Value( //Formula( 
			Sum( :Column 12, :Column 13,:Column 14 2 ) ),
		Set Selected
	);


wait(0);
/*

// Delete column formula: Column 10
dt1:Column 10 << Delete Formula;

// Delete column formula: Column 11
dt1:Column 11 << Delete Formula;

// Delete column formula: Column 12
dt1:Column 12 << Delete Formula;

// Delete column formula: Column 13
dt1:Column 13 << Delete Formula;

// Delete column formula: Column 14
dt1:Column 14 << Delete Formula;*/

//Delete non-Normal fits that are within our criteria
// Delete selected rows
/*dt1 << Select Where( :Column 14 == 1 ) <<
Invert Row Selection << Delete Rows;*/
dt1 << delete rows( dt1 << get rows where( :Column 14 != 1) );

//Put all the Y's and distributions into lists
/*col={};
dist={};
for each row(dt1,
		insertinto(col,:Y);
		insertinto(dist,:Distribution);
);*/
col = dt1:Y << get values;
dist = dt1:Distribution << get values;

close(dt1, nosave);


//bring back up the distribution platform but only with the Y's that we could fit a distribution to
rpt=new window("Best Distribution",
		obj=dt<<distribution(column(eval(col)),Process Capability( 0 ),
			
		);
	
	
	
);

//Apply the distributions

for(i=1, i<=n items(col), i++,

	//whatbox = column(colnames[i])<<get name;
	//test=(Report(obj) << XPath( "//OutlineBox[text() = '"||col[i]||"']"))<< get title();
	//if(eval(test[1])==eval(col[i]),
		if(dist[i]=="Normal",obj[i]<< Fit Normal);
		if(dist[i]=="Exponential",obj[i]<< Fit Exponential);
		if(dist[i]=="Gamma",obj[i]<< Fit Gamma);
		if(dist[i]=="Johnson Su",obj[i]<< Fit Johnson);
		if(dist[i]=="Lognormal",obj[i]<< Fit Lognormal);
		if(dist[i]=="Normal 2 Mixture",obj[i]<<Fit Normal 2 Mixture);
		if(dist[i]=="Normal 3 Mixture",obj[i]<<Fit Normal 3 Mixture);
		if(dist[i]=="SHASH",obj[i]<< Fit Shash);
		if(dist[i]=="Weibull",obj[i]<< Fit Weibull);
		if(dist[i]=="ZI SHASH",obj[i]<< Fit ZI SHASH);
		if(dist[i]=="Beta",obj[i]<< Fit Beta;);
		//);
	);	
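
Suggestion 4 above is not implemented in the script. A minimal sketch of such a check, assuming it runs right after colnames is built. The window title and message text are placeholders, and the sketch does not handle the case where no data table is open.

```jsl
Names Default To Here( 1 );
dt = Current Data Table();
colnames = dt << Get Selected Columns( Continuous, "string" );

// Stop early with a message if no continuous columns are selected.
If( N Items( colnames ) == 0,
	New Window( "No columns selected", // placeholder title
		<<Modal,
		Text Box( "Select one or more continuous columns in the data table, then run the script again." ),
		Button Box( "OK" )
	);
	Throw( "No continuous columns selected" ); // ends the script here
);
```

The modal window blocks until the user clicks OK, and Throw() then halts the script so the Distribution launch is never reached with an empty column list.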

 

 

Jim
shampton82
Level VII

Thanks @txnelson !

Those changes were great and helped me learn even more.  Greatly appreciate your time and help!

 

Steve

jthi
Super User

 

Here is my approach based on your script

Names Default To Here(1);

/*
dt = Open("$SAMPLE_DATA/Cities.jmp");
*/

NON_VALID_DIST = {"Cauchy", "ExGaussian", "Student's t"};

If(N Table() == 0,
	Throw("No tables open");
);

dt = Current Data Table();

colnames = dt << Get selected Columns(continuous, "string");

If(N Items(colnames) == 0,
	Throw("No continuous columns selected");
);


nw = New Window("What AICc is comparable?", Show Menu(0), Show Toolbars(0), <<modal, <<return result,
	H List Box(
		Panel Box("Options",
			Lineup Box(N Col(2),
				Text Box("Put in what difference between Normal and the best fit you consider the same", << Set Wrap(200)),
				neb = Number Edit Box(5)
			)
		),
		Panel Box("Actions",
			Lineup Box(N Col(1),
				Button Box("OK"),
				Button Box("Cancel")		
			)
		)
	)
);

If(nw["Button"] != 1,
	Throw("Cancelled");
);

Caption("Please wait, calculating fits");

same_limit = nw["neb"];

dist = dt << distribution(Column(Eval(colnames)), Fit All, Invisible);

dt_fits = Report(dist)[Outline Box("Compare Distributions"), Table Box(1)] << Make Combined Data Table(invisible);
dist << Close Window;

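// Get Rows Where only returns row numbers (it does not select them), so pass them to Delete Rows
invalid_rows = dt_fits << Get Rows Where(Contains(NON_VALID_DIST, :Distribution));
If(N Items(invalid_rows) > 0,
	dt_fits << Delete Rows(invalid_rows);
);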

// Leave only the lowest AIC and normal distribution
rows_to_delete = dt_fits << Get Rows Where(!(:Distribution == "Normal" | Col Rank(:AICc, :Y) == 1));
dt_fits << Delete Rows(rows_to_delete);

//"Fix" Johnsons to Johnson
For Each Row(dt_fits,
	If(Starts With(:Distribution, "Johnson"),
		:Distribution = "Johnson";
	);
);


// Could also store AIC for normal and other distribution (if available) to aa_y if needed
aa_y = Associative Array(Column(dt_fits, "Y"));
For Each({ycol}, aa_y << get keys,
	yrows = Loc(dt_fits[0, "Y"], ycol);
	If(N Items(yrows) > 1, // Normal isn't the only distribution
		If(dt_fits[yrows[2], "AICc"] - same_limit <= dt_fits[yrows[1], "AICc"],
			aa_y[ycol] = "Normal";
		,
			aa_y[ycol] = dt_fits[yrows[1], "Distribution"];
		);
	,
		aa_y[ycol] = "Normal";
	);
);

Close(dt_fits, No save);

dist = dt << Distribution(
	Column(Eval(aa_y << get keys)),
	Process Capability(0),
	Invisible
);


For Each({{col, best_dist}, idx}, aa_y,
	Eval(Substitute(
		Expr(dist[idx] << _fit_),
		Expr(_fit_), Parse("Fit " || best_dist); // I hate using Evil Parse, but in this case it might work as long as you "fix" Johnson fits
	));
);


dist << Show Window(1);

Caption(Remove);

Write();

I would also consider running the first Distribution (the one that finds the best fits) one column at a time. That way you could update the caption after each column, since calculating the best fit can take a long time (you could also consider showing a progress bar).
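
A rough, untested sketch of that idea, with some assumptions: dists is a hypothetical list that holds one Distribution object per column, and each report would still need its Compare Distributions table extracted as in the script above. For Each() requires JMP 16 or later.

```jsl
Names Default To Here( 1 );
dt = Current Data Table();
colnames = dt << Get Selected Columns( Continuous, "string" );

dists = {}; // hypothetical: one Distribution object per fitted column
For Each( {colname, idx}, colnames,
	// Report progress before each potentially slow Fit All
	Caption( "Fitting " || colname || " (" || Char( idx ) || " of " || Char( N Items( colnames ) ) || ")" );
	Wait( 0 ); // give the caption a chance to repaint
	one_col = Eval List( {colname} ); // one-item list, same launch pattern as the full script
	Insert Into( dists, dt << Distribution( Column( Eval( one_col ) ), Fit All, Invisible ) );
);
Caption( Remove );
```

Launching one platform per column costs a little overhead, but it gives a natural place to update the caption between fits.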

 

-Jarmo
shampton82
Level VII

Thanks @jthi, lots of good stuff to learn here as well!  Appreciate all the help!

 

Steve

jthi
Super User

Also, you could replace your multiple If statements with either a single if-elseif style If() or with Match(). I haven't tested these, but they should give you the idea.

 

With multiple separate If statements like you have, every condition is always checked even though only one can match. A single If() with several condition/result pairs (if-elseif) stops at the first match:

 

For(i = 1, i <= N Items(col), i++,
	If(
		dist[i] == "Normal", obj[i]<< Fit Normal,
		dist[i] == "Exponential", obj[i]<< Fit Exponential,
		dist[i] == "Gamma", obj[i]<< Fit Gamma,
		dist[i] == "Johnson Su", obj[i]<< Fit Johnson,
		dist[i] == "Lognormal", obj[i]<< Fit Lognormal,
		dist[i] == "Normal 2 Mixture", obj[i]<<Fit Normal 2 Mixture,
		dist[i] == "Normal 3 Mixture", obj[i]<<Fit Normal 3 Mixture,
		dist[i] == "SHASH", obj[i]<< Fit Shash,
		dist[i] == "Weibull", obj[i]<< Fit Weibull,
		dist[i] == "ZI SHASH", obj[i]<< Fit ZI SHASH,
		dist[i] == "Beta", obj[i]<< Fit Beta
	);
);

 

 

Match() is generally easier to read and, if I remember correctly, it is also slightly faster:

 

For(i = 1, i <= N Items(col), i++,
	Match(dist[i],
		"Normal", obj[i]<< Fit Normal,
		"Exponential", obj[i]<< Fit Exponential,
		"Gamma", obj[i]<< Fit Gamma,
		"Johnson Su", obj[i]<< Fit Johnson,
		"Lognormal", obj[i]<< Fit Lognormal,
		"Normal 2 Mixture", obj[i]<<Fit Normal 2 Mixture,
		"Normal 3 Mixture", obj[i]<<Fit Normal 3 Mixture,
		"SHASH", obj[i]<< Fit Shash,
		"Weibull", obj[i]<< Fit Weibull,
		"ZI SHASH", obj[i]<< Fit ZI SHASH,
		"Beta", obj[i]<< Fit Beta
	);
);

 

 

-Jarmo