## Linear Regression of Data Set by a Column

Community Trekker

Joined:

Jan 16, 2014

I am trying to do a linear regression of 2 columns as grouped by a 3rd column and am stuck. I want a column with slope for that row, and another with intercept for that row.

Short background:

I have 30k different formulations that have each been tested with 13 different levels of an acid. I want to understand the slope of the pH as a function of acid concentration. For my range (and purpose) I can do this as a linear regression.

My data table is set up such as this, but with 30k Formula Combos and 13 acid levels for a total of 390k rows.

FormulaAcid LevelpH

ComboA

56.7
ComboB56.75
ComboC56.5
ComboA106.4
ComboB106.45

I know that I can plot them in Fit Y by X, group the data, and then add linear fits but that 1) takes forever with 30k groups and 2) is a pain to pulls the slopes out of that report

I current have split my table by "Acid Level", splitting "pH", and keeping "Formula". I know have something like this, and want a way to fill in the right 2 columns.

Formulaph 5ph 10ph 15ph 20SlopeIntercept

First preference is to do with a formula. If I need to do a script, I can do that as well.

Thanks

8 REPLIES

Super User

Joined:

Jun 23, 2011

Got most of the relevant code from Ian Cox's posting here.  This will do the analysis and then put all of the slopes and intercepts into a table.

dt = New Table( "Sample Data", Add Rows( 6 ),

New Column( "Formula", Character, Nominal,

Set Values( {"ComboA", "ComboC", "ComboB", "ComboC", "ComboA", "ComboB"} )

),

New Column( "Acid Level", Numeric, Continuous, Format( "Best", 12 ),

Set Values( [5, 15, 5, 5, 10, 10] )

),

New Column( "pH", Numeric, Continuous, Format( "Best", 12 ),

Set Values( [6.7, 4.8, 6.75, 6.5, 6.4, 6.45] )

)

);

// Run Fit Y by X

bv = dt << Bivariate(

Y( :pH ),

X( :Acid Level ),

Fit Line( {Line Color( {213, 72, 87} )} ),

By( :Formula )

);

// A reference to the report layer: bivREp is also a list

bivRep = bv << Report;

// 'Make Combined Data Table' avoids having to message to each item in 'bivRep'

dt2 = bivRep [1] [TableBox(3)] << Make Combined Data Table;

Super User

Joined:

Feb 10, 2013

Peter's script will serve you well.  However, for reference if you were to do this by JMP rather than by JSL (it is often a toss up time wise depending on if it is a one time analysis or not) then if you fit your y-by-x you can "broadcast" your fit line, go get a cup of coffee if needed, then when JMP is done running right click on one of the parameter estimate tables and select "make combined data table" to get a table with the estimates.  Which is how to execute Peter's script without the script.

Staff

Joined:

Jun 23, 2011

Coincidentally, I was just about to point out that, if you know you need linear regression and only want the parameters, then JSL is much faster. The code below took about 100 seconds on my laptop.

NamesDefaultToHere(1);

// Make some grouped data

gs = 10;  // Group size

ng = 30000;  // Number of groups

groupID = SortAscending(Repeat((1::ng)`, gs));

xVals = J(NRow(groupID), 1, RandomNormal());

yVals = J(NRow(groupID), 1, RandomNormal());

// Put the data in a table

dt1 = NewTable("Grouped Regression Data",

NewColumn("Group", Numeric, Nominal, Values(groupID)),

NewColumn("x", Numeric, Continuous, Values(xVals)),

NewColumn("y", Numeric, Continuous, Values(yVals)),

);

// Uncomment this block to use Fit Y By X to do the fitting

/*

// Use Fit Y By X

fxy = dt1 << Bivariate(Y( :y ), X( x ), Fit Line( {Line Color( {208, 64, 86} )} ), By(:Group));

// Get the pfitted parameters into a table

// (To do this interactively, right-click on any Parameter Estimates table in the report and select 'Make Combined Data Table')

fxyRep = fxy << Report;

dt2 = fxyRep[1][TableBox(3)] << makeCombinedDataTable;

// Split dt2

dt3 = dt2 << Split(Split By( :Term ), Split( :Estimate ), Group( :Group ), Remaining Columns( Drop All ));

dt3 << setName("Parameter Estimates from Fit Y By X");

Close(dt2, noSave);

*/

/// Do the linear regression in JSL, bypassing the platform . . .

// (1) Get the data from the table

xVals = Column(dt1, "x") << getAsMatrix;

yVals = Column(dt1, "y") << getAsMatrix;

groupID = Column(dt1, "Group") << getAsMatrix;

// (2) Add the unit vector to the design matrix

xVals = J(NRow(xVals), 1, 1)||xVals;

// (3) Do the linear regression for each group and store the results

beta = J(ng, 2, .);

for (g=1, g <= ng, g++,

thisGroup = Loc(groupID == g);

beta[g, 0] = Transpose(Inv(xVals[thisGroup, 0]`*xVals[thisGroup, 0])*xVals[thisGroup, 0]`*yVals[thisGroup]);

);

// (4) Make the table of results

dt4 = AsTable(beta);

dt4 << setName("Parameter Estimates from JSL");

Column(dt4, "Col1") << setName("Intercept");

Column(dt4, "Col2") << setName("x");

Community Trekker

Joined:

Nov 30, 2015

Ian

thank you that is a nice approach with matrix.

is there a way to get the Rsquare as well?

Staff

Joined:

Jun 23, 2011

Once you have the fitted coefficients for each group (as above), you can get the predicted values, and hence residuals and the residual sum of squares (for each group). The total sum of squares (for each group) is easy too, so then you have everything needed to get R squared (for each group).

Bear in mind, though, that as you add more manipulations to the JSL it will take longer (though in this case still quicker than using the platform). Someone also needs to write and validate the code, of course!

Community Trekker

Joined:

Nov 30, 2015

will check it considering your last comment. Thank you Ian

Community Trekker

Joined:

Nov 30, 2015

coming back to this topic.

I understand all the manipulation and can aplply to my data but still very unclear about the following part:

beta = J(ng, 2, .);

for (g=1, g <= ng, g++,

thisGroup = Loc(groupID == g);

beta[g, 0] = Transpose(Inv(xVals[thisGroup, 0]`*xVals[thisGroup, 0])*xVals[thisGroup, 0]`*yVals[thisGroup]);

);

not able to figure out what is happenning except tht a beta matrix is created and all data replaced in a new table...

Community Trekker

Joined:

Nov 30, 2015

I found following way to calculate expression.

SS_TOT = Summation( i = 1, N Row( dt1 ), (Column( dt1, k ) - a_mean) ^ 2 / N Row( dt1 ) );

but calculation is made for all column "k".

Any hint on how I could include some group separation?

SS_TOT[gp#1] / SS_TOT[gp#2] ...