Solved: Save Parameter Estimates of Linear Regression with by Variable

clausa · Jul 3, 2015 05:29 PM

I am trying to do a linear regression of 2 columns as grouped by a 3rd column and am stuck. I want a column with slope for that row, and another with intercept for that row.

Short background:

I have 30k different formulations that have each been tested with 13 different levels of an acid. I want to understand the slope of the pH as a function of acid concentration. For my range (and purpose) I can do this as a linear regression.

My data table is set up such as this, but with 30k Formula Combos and 13 acid levels for a total of 390k rows.

Formula	Acid Level	pH
ComboA	5	6.7
ComboB	5	6.75
ComboC	5	6.5
ComboA	10	6.4
ComboB	10	6.45

I know that I can plot them in Fit Y by X, group the data, and then add linear fits but that 1) takes forever with 30k groups and 2) is a pain to pulls the slopes out of that report

I current have split my table by "Acid Level", splitting "pH", and keeping "Formula". I know have something like this, and want a way to fill in the right 2 columns.

Formula	ph 5	ph 10	ph 15	ph 20
ComboA	data	data	data	data
ComboB	data	data	data	data
ComboC	data	data	data	data

First preference is to do with a formula. If I need to do a script, I can do that as well.

Thanks

pmroz · Jul 6, 2015 09:56 AM

Got most of the relevant code from Ian Cox's posting here. This will do the analysis and then put all of the slopes and intercepts into a table.

dt = New Table( "Sample Data", Add Rows( 6 ),

New Column( "Formula", Character, Nominal,

Set Values( {"ComboA", "ComboC", "ComboB", "ComboC", "ComboA", "ComboB"} )

),

New Column( "Acid Level", Numeric, Continuous, Format( "Best", 12 ),

Set Values( [5, 15, 5, 5, 10, 10] )

),

New Column( "pH", Numeric, Continuous, Format( "Best", 12 ),

Set Values( [6.7, 4.8, 6.75, 6.5, 6.4, 6.45] )

)

);

// Run Fit Y by X

bv = dt << Bivariate(

Y( :pH ),

X( :Acid Level ),

Fit Line( {Line Color( {213, 72, 87} )} ),

By( :Formula )

);

// A reference to the report layer: bivREp is also a list

bivRep = bv << Report;

// 'Make Combined Data Table' avoids having to message to each item in 'bivRep'

dt2 = bivRep [1] [TableBox(3)] << Make Combined Data Table;

View solution in original post

KarenC · Jul 6, 2015 10:39 AM

Peter's script will serve you well. However, for reference if you were to do this by JMP rather than by JSL (it is often a toss up time wise depending on if it is a one time analysis or not) then if you fit your y-by-x you can "broadcast" your fit line, go get a cup of coffee if needed, then when JMP is done running right click on one of the parameter estimate tables and select "make combined data table" to get a table with the estimates. Which is how to execute Peter's script without the script.

View solution in original post

pmroz · Jul 6, 2015 09:56 AM

Got most of the relevant code from Ian Cox's posting here. This will do the analysis and then put all of the slopes and intercepts into a table.

dt = New Table( "Sample Data", Add Rows( 6 ),

New Column( "Formula", Character, Nominal,

Set Values( {"ComboA", "ComboC", "ComboB", "ComboC", "ComboA", "ComboB"} )

),

New Column( "Acid Level", Numeric, Continuous, Format( "Best", 12 ),

Set Values( [5, 15, 5, 5, 10, 10] )

),

New Column( "pH", Numeric, Continuous, Format( "Best", 12 ),

Set Values( [6.7, 4.8, 6.75, 6.5, 6.4, 6.45] )

)

);

// Run Fit Y by X

bv = dt << Bivariate(

Y( :pH ),

X( :Acid Level ),

Fit Line( {Line Color( {213, 72, 87} )} ),

By( :Formula )

);

// A reference to the report layer: bivREp is also a list

bivRep = bv << Report;

// 'Make Combined Data Table' avoids having to message to each item in 'bivRep'

dt2 = bivRep [1] [TableBox(3)] << Make Combined Data Table;

KarenC · Jul 6, 2015 10:39 AM

Peter's script will serve you well. However, for reference if you were to do this by JMP rather than by JSL (it is often a toss up time wise depending on if it is a one time analysis or not) then if you fit your y-by-x you can "broadcast" your fit line, go get a cup of coffee if needed, then when JMP is done running right click on one of the parameter estimate tables and select "make combined data table" to get a table with the estimates. Which is how to execute Peter's script without the script.

ian_jmp · Jul 6, 2015 11:22 AM

Coincidentally, I was just about to point out that, if you know you need linear regression and only want the parameters, then JSL is much faster. The code below took about 100 seconds on my laptop.

NamesDefaultToHere(1);

// Make some grouped data

gs = 10; // Group size

ng = 30000; // Number of groups

groupID = SortAscending(Repeat((1::ng)`, gs));

xVals = J(NRow(groupID), 1, RandomNormal());

yVals = J(NRow(groupID), 1, RandomNormal());

// Put the data in a table

dt1 = NewTable("Grouped Regression Data",

NewColumn("Group", Numeric, Nominal, Values(groupID)),

NewColumn("x", Numeric, Continuous, Values(xVals)),

NewColumn("y", Numeric, Continuous, Values(yVals)),

);

// Uncomment this block to use Fit Y By X to do the fitting

/*

// Use Fit Y By X

fxy = dt1 << Bivariate(Y( :y ), X( x ), Fit Line( {Line Color( {208, 64, 86} )} ), By(:Group));

// Get the pfitted parameters into a table

// (To do this interactively, right-click on any Parameter Estimates table in the report and select 'Make Combined Data Table')

fxyRep = fxy << Report;

dt2 = fxyRep[1][TableBox(3)] << makeCombinedDataTable;

// Split dt2

dt3 = dt2 << Split(Split By( :Term ), Split( :Estimate ), Group( :Group ), Remaining Columns( Drop All ));

dt3 << setName("Parameter Estimates from Fit Y By X");

Close(dt2, noSave);

*/

/// Do the linear regression in JSL, bypassing the platform . . .

// (1) Get the data from the table

xVals = Column(dt1, "x") << getAsMatrix;

yVals = Column(dt1, "y") << getAsMatrix;

groupID = Column(dt1, "Group") << getAsMatrix;

// (2) Add the unit vector to the design matrix

xVals = J(NRow(xVals), 1, 1)||xVals;

// (3) Do the linear regression for each group and store the results

beta = J(ng, 2, .);

for (g=1, g <= ng, g++,

thisGroup = Loc(groupID == g);

beta[g, 0] = Transpose(Inv(xVals[thisGroup, 0]`*xVals[thisGroup, 0])*xVals[thisGroup, 0]`*yVals[thisGroup]);

);

// (4) Make the table of results

dt4 = AsTable(beta);

dt4 << setName("Parameter Estimates from JSL");

Column(dt4, "Col1") << setName("Intercept");

Column(dt4, "Col2") << setName("x");

KinKame · May 10, 2016 07:24 AM

Ian

thank you that is a nice approach with matrix.

is there a way to get the Rsquare as well?

ian_jmp · May 10, 2016 08:05 AM

Once you have the fitted coefficients for each group (as above), you can get the predicted values, and hence residuals and the residual sum of squares (for each group). The total sum of squares (for each group) is easy too, so then you have everything needed to get R squared (for each group).

Bear in mind, though, that as you add more manipulations to the JSL it will take longer (though in this case still quicker than using the platform). Someone also needs to write and validate the code, of course!

KinKame · Jun 26, 2016 10:40 PM

will check it considering your last comment. Thank you Ian

KinKame · Sep 1, 2016 02:23 AM

coming back to this topic.

I understand all the manipulation and can aplply to my data but still very unclear about the following part:

beta = J(ng, 2, .);

for (g=1, g <= ng, g++,

thisGroup = Loc(groupID == g);

beta[g, 0] = Transpose(Inv(xVals[thisGroup, 0]`*xVals[thisGroup, 0])*xVals[thisGroup, 0]`*yVals[thisGroup]);

);

not able to figure out what is happenning except tht a beta matrix is created and all data replaced in a new table...

could you please help me understand this point?

KinKame · Sep 5, 2016 05:22 AM

I found following way to calculate expression.

SS_TOT = Summation( i = 1, N Row( dt1 ), (Column( dt1, k ) - a_mean) ^ 2 / N Row( dt1 ) );

but calculation is made for all column "k".

Any hint on how I could include some group separation?

SS_TOT[gp#1] / SS_TOT[gp#2] ...

kasias · Feb 19, 2025 2:46 PM

Running JSL is very useful for large datasets and large # of #groupIDs. Is there a way to iterate over non-numeric groupID.

And of course needs a column to store the non-numeric GroupID also.

Save Parameter Estimates of Linear Regression with by Variable

Re: Linear Regression of Data Set by a Column

Re: Linear Regression of Data Set by a Column

Re: Linear Regression of Data Set by a Column

Re: Linear Regression of Data Set by a Column

Re: Linear Regression of Data Set by a Column

Re: Linear Regression of Data Set by a Column

Re: Linear Regression of Data Set by a Column

Re: Linear Regression of Data Set by a Column

Re: Linear Regression of Data Set by a Column

Re: Linear Regression of Data Set by a Column

Re: Linear Regression of Data Set by a Column