- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Get Direct Link
- Report Inappropriate Content
Save Parameter Estimates of Linear Regression with by Variable
I am trying to do a linear regression of 2 columns as grouped by a 3rd column and am stuck. I want a column with slope for that row, and another with intercept for that row.
Short background:
I have 30k different formulations that have each been tested with 13 different levels of an acid. I want to understand the slope of the pH as a function of acid concentration. For my range (and purpose) I can do this as a linear regression.
My data table is set up such as this, but with 30k Formula Combos and 13 acid levels for a total of 390k rows.
Formula | Acid Level | pH |
---|---|---|
ComboA |
5 | 6.7 |
ComboB | 5 | 6.75 |
ComboC | 5 | 6.5 |
ComboA | 10 | 6.4 |
ComboB | 10 | 6.45 |
I know that I can plot them in Fit Y by X, group the data, and then add linear fits but that 1) takes forever with 30k groups and 2) is a pain to pulls the slopes out of that report
I current have split my table by "Acid Level", splitting "pH", and keeping "Formula". I know have something like this, and want a way to fill in the right 2 columns.
Formula | ph 5 | ph 10 | ph 15 | ph 20 | Slope | Intercept |
---|---|---|---|---|---|---|
ComboA | data | data | data | data | ||
ComboB | data | data | data | data | ||
ComboC | data | data | data | data |
First preference is to do with a formula. If I need to do a script, I can do that as well.
Thanks
Accepted Solutions
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Get Direct Link
- Report Inappropriate Content
Re: Linear Regression of Data Set by a Column
Got most of the relevant code from Ian Cox's posting here. This will do the analysis and then put all of the slopes and intercepts into a table.
dt = New Table( "Sample Data", Add Rows( 6 ),
New Column( "Formula", Character, Nominal,
Set Values( {"ComboA", "ComboC", "ComboB", "ComboC", "ComboA", "ComboB"} )
),
New Column( "Acid Level", Numeric, Continuous, Format( "Best", 12 ),
Set Values( [5, 15, 5, 5, 10, 10] )
),
New Column( "pH", Numeric, Continuous, Format( "Best", 12 ),
Set Values( [6.7, 4.8, 6.75, 6.5, 6.4, 6.45] )
)
);
// Run Fit Y by X
bv = dt << Bivariate(
Y( :pH ),
X( :Acid Level ),
Fit Line( {Line Color( {213, 72, 87} )} ),
By( :Formula )
);
// A reference to the report layer: bivREp is also a list
bivRep = bv << Report;
// 'Make Combined Data Table' avoids having to message to each item in 'bivRep'
dt2 = bivRep [1] [TableBox(3)] << Make Combined Data Table;
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Get Direct Link
- Report Inappropriate Content
Re: Linear Regression of Data Set by a Column
Peter's script will serve you well. However, for reference if you were to do this by JMP rather than by JSL (it is often a toss up time wise depending on if it is a one time analysis or not) then if you fit your y-by-x you can "broadcast" your fit line, go get a cup of coffee if needed, then when JMP is done running right click on one of the parameter estimate tables and select "make combined data table" to get a table with the estimates. Which is how to execute Peter's script without the script.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Get Direct Link
- Report Inappropriate Content
Re: Linear Regression of Data Set by a Column
Got most of the relevant code from Ian Cox's posting here. This will do the analysis and then put all of the slopes and intercepts into a table.
dt = New Table( "Sample Data", Add Rows( 6 ),
New Column( "Formula", Character, Nominal,
Set Values( {"ComboA", "ComboC", "ComboB", "ComboC", "ComboA", "ComboB"} )
),
New Column( "Acid Level", Numeric, Continuous, Format( "Best", 12 ),
Set Values( [5, 15, 5, 5, 10, 10] )
),
New Column( "pH", Numeric, Continuous, Format( "Best", 12 ),
Set Values( [6.7, 4.8, 6.75, 6.5, 6.4, 6.45] )
)
);
// Run Fit Y by X
bv = dt << Bivariate(
Y( :pH ),
X( :Acid Level ),
Fit Line( {Line Color( {213, 72, 87} )} ),
By( :Formula )
);
// A reference to the report layer: bivREp is also a list
bivRep = bv << Report;
// 'Make Combined Data Table' avoids having to message to each item in 'bivRep'
dt2 = bivRep [1] [TableBox(3)] << Make Combined Data Table;
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Get Direct Link
- Report Inappropriate Content
Re: Linear Regression of Data Set by a Column
Peter's script will serve you well. However, for reference if you were to do this by JMP rather than by JSL (it is often a toss up time wise depending on if it is a one time analysis or not) then if you fit your y-by-x you can "broadcast" your fit line, go get a cup of coffee if needed, then when JMP is done running right click on one of the parameter estimate tables and select "make combined data table" to get a table with the estimates. Which is how to execute Peter's script without the script.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Get Direct Link
- Report Inappropriate Content
Re: Linear Regression of Data Set by a Column
Coincidentally, I was just about to point out that, if you know you need linear regression and only want the parameters, then JSL is much faster. The code below took about 100 seconds on my laptop.
NamesDefaultToHere(1);
// Make some grouped data
gs = 10; // Group size
ng = 30000; // Number of groups
groupID = SortAscending(Repeat((1::ng)`, gs));
xVals = J(NRow(groupID), 1, RandomNormal());
yVals = J(NRow(groupID), 1, RandomNormal());
// Put the data in a table
dt1 = NewTable("Grouped Regression Data",
NewColumn("Group", Numeric, Nominal, Values(groupID)),
NewColumn("x", Numeric, Continuous, Values(xVals)),
NewColumn("y", Numeric, Continuous, Values(yVals)),
);
// Uncomment this block to use Fit Y By X to do the fitting
/*
// Use Fit Y By X
fxy = dt1 << Bivariate(Y( :y ), X( x ), Fit Line( {Line Color( {208, 64, 86} )} ), By(:Group));
// Get the pfitted parameters into a table
// (To do this interactively, right-click on any Parameter Estimates table in the report and select 'Make Combined Data Table')
fxyRep = fxy << Report;
dt2 = fxyRep[1][TableBox(3)] << makeCombinedDataTable;
// Split dt2
dt3 = dt2 << Split(Split By( :Term ), Split( :Estimate ), Group( :Group ), Remaining Columns( Drop All ));
dt3 << setName("Parameter Estimates from Fit Y By X");
Close(dt2, noSave);
*/
/// Do the linear regression in JSL, bypassing the platform . . .
// (1) Get the data from the table
xVals = Column(dt1, "x") << getAsMatrix;
yVals = Column(dt1, "y") << getAsMatrix;
groupID = Column(dt1, "Group") << getAsMatrix;
// (2) Add the unit vector to the design matrix
xVals = J(NRow(xVals), 1, 1)||xVals;
// (3) Do the linear regression for each group and store the results
beta = J(ng, 2, .);
for (g=1, g <= ng, g++,
thisGroup = Loc(groupID == g);
beta[g, 0] = Transpose(Inv(xVals[thisGroup, 0]`*xVals[thisGroup, 0])*xVals[thisGroup, 0]`*yVals[thisGroup]);
);
// (4) Make the table of results
dt4 = AsTable(beta);
dt4 << setName("Parameter Estimates from JSL");
Column(dt4, "Col1") << setName("Intercept");
Column(dt4, "Col2") << setName("x");
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Get Direct Link
- Report Inappropriate Content
Re: Linear Regression of Data Set by a Column
Ian
thank you that is a nice approach with matrix.
is there a way to get the Rsquare as well?
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Get Direct Link
- Report Inappropriate Content
Re: Linear Regression of Data Set by a Column
Once you have the fitted coefficients for each group (as above), you can get the predicted values, and hence residuals and the residual sum of squares (for each group). The total sum of squares (for each group) is easy too, so then you have everything needed to get R squared (for each group).
Bear in mind, though, that as you add more manipulations to the JSL it will take longer (though in this case still quicker than using the platform). Someone also needs to write and validate the code, of course!
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Get Direct Link
- Report Inappropriate Content
Re: Linear Regression of Data Set by a Column
will check it considering your last comment. Thank you Ian
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Get Direct Link
- Report Inappropriate Content
Re: Linear Regression of Data Set by a Column
coming back to this topic.
I understand all the manipulation and can aplply to my data but still very unclear about the following part:
beta = J(ng, 2, .);
for (g=1, g <= ng, g++,
thisGroup = Loc(groupID == g);
beta[g, 0] = Transpose(Inv(xVals[thisGroup, 0]`*xVals[thisGroup, 0])*xVals[thisGroup, 0]`*yVals[thisGroup]);
);
not able to figure out what is happenning except tht a beta matrix is created and all data replaced in a new table...
could you please help me understand this point?
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Get Direct Link
- Report Inappropriate Content
Re: Linear Regression of Data Set by a Column
I found following way to calculate expression.
SS_TOT = Summation( i = 1, N Row( dt1 ), (Column( dt1, k ) - a_mean
but calculation is made for all column "k".
Any hint on how I could include some group separation?
SS_TOT[gp#1] / SS_TOT[gp#2] ...
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Get Direct Link
- Report Inappropriate Content
Re: Linear Regression of Data Set by a Column
Running JSL is very useful for large datasets and large # of #groupIDs. Is there a way to iterate over non-numeric groupID.
And of course needs a column to store the non-numeric GroupID also.