cancel
Showing results for 
Show  only  | Search instead for 
Did you mean: 
Try the Materials Informatics Toolkit, which is designed to easily handle SMILES data. This and other helpful add-ins are available in the JMP® Marketplace
Choose Language Hide Translation Bar
eangello
Level II

How to Specify a Correlation Matrix in Profiler for Simulation?

Hello,

 

I am writing a script to create multivariate normal data with a specified number of rows, mean/standard deviation vectors, and correlation structure from a set of sample data. I would like to take advantage of the Simulator in the Profiler platform to generate the data table through the following steps:

  1. Calculate sample mean and standard deviation for each variable.
  2. Generate correlation matrix from sample data.
  3. Simulate multivariate data using the summary statistics from (1) and (2).

Currently, my script generates the simulated data table with the appropriate means and standard deviations for each variable, but I cannot manage to maintain the correlation structure between variables correctly. I believe the cause of my issue is how I specify the [NxN correlations] with X Correlations using a variable instead of explicitly specifying the correlation matrix (e.g. [1 0.3 0, 0.3 1 0, 0 0 1]), but my troubleshooting skills have left me with no solution. Current script and expected vs. actual output provided below, any suggestions on how to achieve these steps are welcomed:

 

JSL:

names default to here(1);
Try( Close( Data Table(dt) ) );
If( Is Empty( dt ),
	Try( dt = Open(), Throw( "No data found" ) )
);
ncol = NCol(dt);

//Dialog Window.
Dlg = New Window("New Window",
	<< Modal,
	BorderBox(left(3), top(2),
		V List Box(
			TextBox("Simulate multivariate data"),
			HListBox(
				VListBox(
					PanelBox("Select variables",
						colListData = ColListBox(all,nLines,(min(ncol,10)))
					),
				),
				PanelBox("Cast Selected Variables",
					LineupBox(NCol(2), Spacing(3),
					ButtonBox("Variables", colListY << Append(colListData << Get Selected)),
					colListY = ColListBox(nLines(6), "numeric"))
				),
				//specifications - not in use.
				PanelBox("Enter Specifications",
					Spacer Box(Size(10, 1)),
						HListBox(
							VListBox(
								Spacer Box(Size(5, 10)),
							   lsl1 = Number Edit Box(, 3),
							   Spacer Box(Size(5, 10)),
							   lsl2 = Number Edit Box(, 3),
							   Spacer Box(Size(5, 10)),
							   lsl3 = Number Edit Box(, 3),
							   Spacer Box(Size(5, 10)),
							   lsl4 = Number Edit Box(, 3)
							),		
						SpacerBox(Size(10,10)),
						VListBox(
						   	Spacer Box(Size(5, 10)),
						   usl1 = Number Edit Box(, 3),
						   Spacer Box(Size(5, 10)),
						   usl2 = Number Edit Box(, 3),
						   Spacer Box(Size(5, 10)),
						   usl3 = Number Edit Box(, 3),
						   Spacer Box(Size(5, 10)),
						   usl4 = Number Edit Box(, 3)
						)
						),
				),
				PanelBox("Action", 
				LineupBox(NCol(1),
				// OK button
				Button Box("OK",
					lsl_mat = J(4,1, .);
					lsl_mat[1] = Eval(lsl1 << Get);
					lsl_mat[2] = Eval(lsl2 << Get);
					lsl_mat[3] = Eval(lsl3 << Get);
					lsl_mat[4] = Eval(lsl4 << Get);
					usl_mat = J(4,1, .);
					usl_mat[1] = Eval(usl1 << Get);
					usl_mat[2] = Eval(usl2 << Get);
					usl_mat[3] = Eval(usl3 << Get);
					usl_mat[4] = Eval(usl4 << Get);			
					// Store summary statistics
					tpListY = colListY << Get Items;
					tpMean = J(4, 1, 0);
					tpSD = J(4, 1, 0);
				
					for (i=1, i <= nitems(tpListY), i++,
							one_col = tpListY[i];
							Meani = Col Mean(Column(dt, one_col));
							SDi = Col Std Dev(Column(dt, one_col));
						
						tpMean[i] = Meani;
						tpSD[i] = SDi;
					);
					Dlg << CloseWindow
				),
				
				// Cancel button
				Button Box("Cancel", Dlg << CloseWindow),
				Spacer Box(Size(10,10)),
				// Remove button
				Button Box("Remove", collistY << RemoveSelected)
				)
				),	
			)
		)
	)
);

//correlation matrix
corr = multivariate(
	Y(Eval List(tpListY)),
	estimation method("Row-wise"),
	matrix format("Square"),
	scatterplot matrix(0)
);

corr.run = corr << run;
corr.rep = corr.run << report;
corr.mat = corr.rep["Correlations"][matrix box(1)] << get(1);

//only used to launch profiler
prof.x = J(2,4 ,1);

prof.tbl = as table(prof.x, << column names(tpListY), << invisible);

prof.tbl << new column("Y",
	Formula( :TP1 + :TP2 + :TP3 + :TP4)
);

prof.tbl = current data table();

random reset(1234);

//specify multivariate data from summary statistics
sim.design = Profiler(
	Y(:Y),
	Invisible
);
sim.design << Simulator(
	1,
	Factors(
		TP1 << multivariate( tpMean[1], tpSD[1]),
		TP2 << multivariate( tpMean[2], tpSD[2]),
		TP3 << multivariate( tpMean[3], tpSD[3]),
		TP4 << multivariate( tpMean[4], tpSD[4]),
	),
	X Correlations(
		1,
		tpListY,
		corr.mat
	),
	N Runs(10000)
);

//Make data table
report( sim.design )["Simulate to Table"][button box( 1 )] << click( 1 );
sim.dt = data table( 1 );
sim.dt << set name( "Simulated multivariate data" );
sim.dt << delete columns({"Y","Obj"});

Expected vs. actual output:

expectedvsactualexpectedvsactual

10 REPLIES 10
SDF1
Super User

Re: How to Specify a Correlation Matrix in Profiler for Simulation?

Hi @eangello ,

 

  This is an interesting question, and I'd like to try and help if I can, but I'm not able to reproduce your issue. When I try to run your script, it asks me to open a data table from the sample data directory. One issue here is that I don't know what data table you're working with and won't get the same issue. Can you provide the data table that you're working with?

 

  If I do select a data table from the sample data -- e.g. the 2D Gaussian process example.jmp file, the next dialogue window asks to cast columns into roles, but then when I click OK, I get the error that there is an unresolved name Name in excess or evaluation of 'Button', Button(-1) /*###*/. Then, another error comes up from line 124 in your code where it doesn't like the sim.design << Simulator()  send command.

 

  Apparently, you have a data table where this works. Can you share it? Or give more details on whichever table you are using to generate the output that you showed in your post?

 

Thanks!,

DS

eangello
Level II

Re: How to Specify a Correlation Matrix in Profiler for Simulation?

Hi @SDF1 ,

 

I didn't due my due diligence before posting - here is a sample of 10 rows from the sample data (4 X's; TP1-TP4). The current script is constrained to 4 variables, but I hope to be able to scale this to N variables in the future. This sample data table will recreate my problem. I believe the unresolved name error is from the dialog box (maybe cancel/remove buttons), but I have not found root cause. 

 

Thanks,

EA

SDF1
Super User

Re: How to Specify a Correlation Matrix in Profiler for Simulation?

Hi @eangello ,

 

  Thanks for the data table. I can recreate your issue as you mention. One thing I am noticing is that when you run the simulator, it doesn't create the histogram next to the profiler, not like the tiretread example in the help index on X Correlations.

 

  I think there are two things going on:

1. calling tpListY

2. calling corr.mat

 

  I am not sure why, but I think you need to explicitly "spell"out the list for the X Correltions in your code and this includes the correlation matrix., so instead of calling tpListY, you need to replace that with {TP1, TP2, TP3, TP4}. Similarly, you need to put the actual 4x4 (or NxN) correlation matrix in there between the square brackets [].

 

  I thought that doing something like Parse(tpListY[1]) would work, but I can't get it to work, even with an Eval() wrapped around it. I can only get it to work automatically if I put in the JSL code as {TP1, TP2, TP3, TP4}. Note that when you put in the NxN correlation matrix, that it actually uses the lower left triangle of the matrix to fill in the table.

 

   I also thought that calling the elements of the corr.mat might work, as in corr.mat[1], etc., but that doesn't work either. If I try that, I get an invalid matrix token error, and not really sure why, because the elements of corr.mat are numbers.

 

The below code (I modified your function to change it up a bit) works to provide the expected output, but it doesn't help with your desire to expand from a 4x4 system to an NxN system.

names default to here(1);
Try( Close( Data Table(dt) ) );
If( Is Empty( dt ),
	Try( dt = Open(), Throw( "No data found" ) )
);
ncol = NCol(dt);

//Dialog Window.
Dlg = New Window("New Window",
	<< Modal,
	BorderBox(left(3), top(2),
		V List Box(
			TextBox("Simulate multivariate data"),
			HListBox(
				VListBox(
					PanelBox("Select variables",
						colListData = ColListBox(all,nLines,(min(ncol,10)))
					),
				),
				PanelBox("Cast Selected Variables",
					LineupBox(NCol(2), Spacing(3),
					ButtonBox("Variables", colListY << Append(colListData << Get Selected)),
					colListY = ColListBox(nLines(6), "numeric"))
				),
				//specifications - not in use.
				PanelBox("Enter Specifications",
					Spacer Box(Size(10, 1)),
						HListBox(
							VListBox(
								Spacer Box(Size(5, 10)),
							   lsl1 = Number Edit Box(, 3),
							   Spacer Box(Size(5, 10)),
							   lsl2 = Number Edit Box(, 3),
							   Spacer Box(Size(5, 10)),
							   lsl3 = Number Edit Box(, 3),
							   Spacer Box(Size(5, 10)),
							   lsl4 = Number Edit Box(, 3)
							),		
						SpacerBox(Size(10,10)),
						VListBox(
						   	Spacer Box(Size(5, 10)),
						   usl1 = Number Edit Box(, 3),
						   Spacer Box(Size(5, 10)),
						   usl2 = Number Edit Box(, 3),
						   Spacer Box(Size(5, 10)),
						   usl3 = Number Edit Box(, 3),
						   Spacer Box(Size(5, 10)),
						   usl4 = Number Edit Box(, 3)
						)
						),
				),
				PanelBox("Action", 
				LineupBox(NCol(1),
				// OK button
				Button Box("OK",
					lsl_mat = J(4,1, .);
					lsl_mat[1] = Eval(lsl1 << Get);
					lsl_mat[2] = Eval(lsl2 << Get);
					lsl_mat[3] = Eval(lsl3 << Get);
					lsl_mat[4] = Eval(lsl4 << Get);
					usl_mat = J(4,1, .);
					usl_mat[1] = Eval(usl1 << Get);
					usl_mat[2] = Eval(usl2 << Get);
					usl_mat[3] = Eval(usl3 << Get);
					usl_mat[4] = Eval(usl4 << Get);			
					// Store summary statistics
					tpListY = colListY << Get Items;
					tpMean = J(4, 1, 0);
					tpSD = J(4, 1, 0);
				
					for (i=1, i <= nitems(tpListY), i++,
							one_col = tpListY[i];
							Meani = Col Mean(Column(dt, one_col));
							SDi = Col Std Dev(Column(dt, one_col));
						
						tpMean[i] = Meani;
						tpSD[i] = SDi;
					);
					Dlg << CloseWindow
				),
				
				// Cancel button
				Button Box("Cancel", Dlg << CloseWindow),
				Spacer Box(Size(10,10)),
				// Remove button
				Button Box("Remove", collistY << RemoveSelected)
				)
				),	
			)
		)
	)
);

//correlation matrix
corr = multivariate(
	Y(Eval List(tpListY)),
	estimation method("Row-wise"),
	matrix format("Square"),
	scatterplot matrix(0)
);

corr.run = corr << run;
corr.rep = corr.run << report;
corr.mat = corr.rep["Correlations"][matrix box(1)] << get(1);

//only used to launch profiler
prof.x = J(200,4 ,1);

prof.tbl = as table(prof.x, << column names(tpListY), << invisible);

prof.tbl << new column("Y",
	Formula( :TP1+:TP2^2-:TP3+:TP4*Random Normal(0,2))
);

prof.tbl = current data table();

random reset(1234);

//specify multivariate data from summary statistics
sim.design = Profiler( Y( :Y ), Invisible );
sim.design << Simulator(
	1,
	Factors(
		TP1 << multivariate( tpMean[1], tpSD[1] ), TP2 << multivariate( tpMean[2], tpSD[2] ),
		TP3 << multivariate( tpMean[3], tpSD[3] ), TP4 << multivariate( tpMean[4], tpSD[4] ),
	),
	Responses( Y << No Noise ),
	Automatic Histogram Update( 1 ),
	X Correlations( 1, {TP1, TP2, TP3, TP4}, 
	[1 0.6401843997 0.670478399654806 0.6405126152, 0.6401843997 1	0.9443055859	0.8815992957, 0.6704783997	0.9443055859	1	0.9355872238, 0.6405126152	0.8815992957	0.9355872238	1] 
	),
	N Runs( 10000 ),
	Simulate
);

//Make data table
report( sim.design )["Simulate to Table"][button box( 1 )] << click( 1 );
sim.dt = data table( 1 );
sim.dt << set name( "Simulated multivariate data" );
sim.dt << delete columns({"Y","Obj"});

  There must be a way to do it, since you can call the elements of tpMean and tpSD for filling in the multivariate noise for the factors. I think someone with more experience in JSL might be needed to comment on that.

 

  Lastly, I think you are right about one of the button boxes, but I couldn't pinpoint it exactly either -- was more focused on trying to solve the tpListY and corr.mat problems!

 

I'm really interested in what you find out.

 

Good luck,

DS

eangello
Level II

Re: How to Specify a Correlation Matrix in Profiler for Simulation?

Hi @SDF1 ,

 

Thank you for the help troubleshooting, much appreciated. I played around with similar functions like Eval, Eval Expr( Expr(x) )and Evalinsert and found a solution by forcing the column and matrix variable(s) to evaluate before sending to Profiler (which X Correlations seems to accept). I applied the same solution above to make my Profiler flexible to different column names. Now my correlation structure is maintained in the simulated data.

 

new.design = evalinsert(
"\[sim.design << Simulator(
	1,
	Factors(
		Name( "^col1^" ) << multivariate( tpMean[1], tpSD[1]),
		Name( "^col2^" ) << multivariate( tpMean[2], tpSD[2]),
		Name( "^col3^" ) << multivariate( tpMean[3], tpSD[3]),
		Name( "^col4^" ) << multivariate( tpMean[4], tpSD[4]),
	),
	X Correlations(
		1,
		^tpListY^,
		^corr.mat^
	),
	N Runs(10000)) ]\");
eval(parse(new.design));

Thanks again,

EA

SDF1
Super User

Re: How to Specify a Correlation Matrix in Profiler for Simulation?

Hi @eangello ,

 

  Great solution, and I wish I would have remembered to do it that way. I had to do something similar when writing a script to automate some model tuning. The EvalInsert() and escape characters for quotation marks and ^^ for inserting values was very helpful. I just don't use it that often and forgot about that particular way of doing things.

 

  Out of curiosity, why do you need to generate simulated data like that with the specified correlation structure. I'm quite interested, as it has given me some ideas on how to incorporate that into some things I'm working on.

 

Thanks!,

DS

eangello
Level II

Re: How to Specify a Correlation Matrix in Profiler for Simulation?

Hi @SDF1 ,

 

I frequently simulate multivariate datasets for modeling purposes (model tuning among them), typically based on some historical parameters. This script was just a simple solution to serve this mock data. I have no specific use case right now, but I intend to scale this script to n-dimensions and potentially use it as a functional module I can incorporate into more extensive scripts that use large-sample techniques. Given the constraints with sending mean/sd vectors and correlation matrix to the Profiler platform I will likely resort to explicitly writing the linear operations in order to scale the simulation to n-dimensions.

 

Regards,

eangello  

SDF1
Super User

Re: How to Specify a Correlation Matrix in Profiler for Simulation?

Hi @eangello ,

 

  That's exactly what went through my mind when I realized what you were wanting to do with the correlation structure. I would like to borrow on your idea and modify it to some of the specific issues I'm facing. I've posted in another thread about generalizing to N inputs. I think I'm pretty close, but not quite there. I'm having difficulties scaling it up when calling the X Correlations() and setting the mean and std dev values for the factors.

 

  Attached is a mock-up data table and script that I'm working, all based off your original script. Maybe you can find the mistakes I'm making? I'd love to see this working for an arbitrary number of columns, it would be very helpful for historical data, as you point out.

 

Thanks!,

DS

SDF1
Super User

Re: How to Specify a Correlation Matrix in Profiler for Simulation?

Hi @eangello ,

 

  I think I've been able to work up a script that is scalable to N columns. I give @ms a HUGE amount of credit for help in generalizing the correlation portion of the Profiler. That thread is here.

 

There are two scripts, a standalone version like the one you implemented, but without the extra variables that you had in your original post, as well as a script that can run from a data table itself. The standalone is below and the one to run from the data table is in the Gendt.jmp data table. It works with both your data table with 4 columns and the other data table I made with 11 columns. I've tried to also generalize the standalone script for data tables with other column types.

 

 

Names Default To Here( 1 );

// Expression to clear all current settings
clearRoles = Expr( ColListY << RemoveAll );

// Expression to store the current settings in global variables
recallRoles = Expr(
	::ycolRecall = ColListY << GetItems
);

//get data table
Try( Close( Data Table( dt ) ) );
If( Is Empty( dt ),
	Try( dt = Open(), Throw( "No data found" ) )
);

//Find the number of columns in the data table
ncol = N Col( dt );
ColList = dt << Get Column Names( Numeric, Continuous );

//Dialog Window.
Dlg = New Window( "New Window",
	<<Return Result,
	BorderBox( Left( 3 ), top( 2 ),
		V List Box(
			Text Box( "Simulate multivariate data" ),
			HListBox(
				V List Box( Panel Box( "Select variables", colListData = Col List Box( all, nLines, (Min( ncol, 10 )) ) ), ),
				PanelBox( "Cast Selected Columns into Roles",
					LineupBox( N Col( 2 ), Spacing( 3 ),
						Button Box( "Variables", colListY << Append( colListData << Get Selected ) ),
						colListY = Col List Box( nLines( 6 ), "numeric" )
					)
				),
				PanelBox( "Action",
					LineupBox( N Col( 1 ), 
				// OK button
						Button Box( "OK",
							recallRoles;		
							tpListY = colListY << Get Items;
							SimData;
							Dlg << Close Window;
						), 
						//Cancel Button
						Button Box( "Cancel", Dlg << Close Window ),
						Text Box( " " ), 
						//Remove Button
						Button Box( "Remove", tpListY << Remove Selected ), 
						//Recall Button
						Button Box( "Recall",
							clearRoles;
							Try( tpListY << Append( ::ycolRecall ) );
						)
					)
				), 

			)
		)
	)
);


SimData = Expr(

//Summary Statistics
	MMean = J( N Items( tpListY ), 1, 0 );
	MSD = J( N Items( tpListY ), 1, 0 );
							
	For( i = 1, i <= N Items( tpListY ), i++,
		Meani = Col Mean( Column( dt, tpListY[i] ) );
		SDi = Col Std Dev( Column( dt, tpListY[i] ) );
						
		MMean[i] = Meani;
		MSD[i] = SDi;
	);
			
//correlation matrix
	corr = multivariate( Y( Eval( tpListY ) ), estimation method( "Row-wise" ), matrix format( "Square" ), scatterplot matrix( 0 ) );

	corr_run = corr << run;
	corr_rep = corr_run << report;
	corr_mat = corr_rep["Correlations"][Matrix Box( 1 )] << get( 1 );

//only used to launch profiler
	Xs = N Items( tpListY );
	prof_x = J( Xs, Xs, 1 );

	prof_tbl = As Table( prof_x, <<column names( tpListY ), <<invisible );
	prof_tbl = Current Data Table();

prof_tbl_cols = prof_tbl << Get Column Names;

	FSum = Eval Expr( Sum( Expr(  prof_tbl_cols ) ) );
	prof_tbl << New Column( "SimPred", Formula( Name Expr( FSum ) ) );


	FS_expr = Expr( Factors() );
	For( i = 1, i <= N Items( tpListY ), i++,
		Insert Into( FS_expr, Eval Expr( Expr( tpListY[i] ) << Multivariate( Expr( MMean[i] ), Expr( MSD[i] ) ) ) )
	);

// Make the complete Simulator expression
	sim = Eval Expr(
		sim_design << Simulator( 1, Expr( Name Expr( FS_expr ) ), X Correlations( 1, Expr( tpListY ), Expr( corr_mat ) ), N Runs( 1e4 ) )
	);

// Name Expr( sim ) // Run this line to show the expression in the log
	// Launch Profiler and run the Simulator expression
	sim_design = Profiler( Y( :SimPred ) );
	sim;

//Make data table
	Report(sim_design)["Simulate to Table"][Button Box( 1 )] << Click( 1 );
	sim_dt = Data Table( 1 );
	sim_dt << Set Name( "Simulated MV Data" );
	sim_dt << Delete Columns( {:SimPred, :Obj} );

//Close down other tables
	open_dts = Get Data Table List();

	For( i = 1, i <= N Items( open_dts ), i++,
		dt_name = open_dts[i] << Get Name;
		If( Starts With( dt_name, "Sim" ) != 1,
			Close( open_dts[i], NoSave )
		);
	);
);

 

  Also, I think the Button Box error you receive has to do with the <<Modal call for the dialogue window. I replaced that with << Return Result, and I don't get the error anymore. I also included a Recall button, but you don't have to have it.

 

  I hope this helps, this was really cool and I learned a lot from it. I know I'll definitely be able to use this for my work that I do!

 

Thanks!,

DS

eangello
Level II

Re: How to Specify a Correlation Matrix in Profiler for Simulation?

Hi @SDF1 ,

 

I never would have thought of that approach, nicely done. As for the dialog window I think that you're right, ill have to take a closer look. It turned out to be less of a headache to avoid the Profiler platform all together, while it was a nice idea it did not pan out for me. Here is the same result but with less fuss (similar to the steps I take to implement in R) and only mean vector and covariance matrix needed as inputs. 

 

//tpListY is variables.

cov = multivariate( Y(eval list(tpListY)), estimation method("Row-wise"), matrix format("Lower Triangular"), scatterplot matrix(1) ); cov << covariance matrix(1); cov.run = cov << run; cov.rep = cov.run << report; cov.mat = cov.rep["Covariance Matrix"][matrix box(1)] << get(1); random reset(1234); //generate X = (X1, . . . , Xn) where X ∼ MNn(µ, Σ) using the Cholesky decomposition of Σ //to find (nxn) matrix, C, such that C*C` = Σ. //let Z = (Z1, . . . , Zn)` where the Zi’s are IID N(µ, 1) for i = 1, . . . , n. //mean.M is mean vector. sim.n = 1e4; dummy.mat = j(1,sim.n,1); mean.mat = mean.M*dummy.mat; n.means = n row(mean.M); sim.C = cholesky(cov.mat); sim.Z = j(n.means, sim.n, random normal()); sim.X = sim.C*sim.Z; sim.X2 = sim.X + mean.mat; sim.Xt = transpose(sim.X2); as table(sim.Xt, << column names(tpListY));

 

I prefer using less code to get the same job done so this was more appealing rather than the Profiler route. Hopefully this makes it easier to implement elsewhere.

 

Thanks again,

eangello