Auto-suggest helps you quickly narrow down your search results by suggesting possible matches as you type.

- Subscribe to RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Printer Friendly Page

Highlighted

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Get Direct Link
- Email to a Friend
- Report Inappropriate Content

Created:
Jul 3, 2015 2:29 PM
| Last Modified: Sep 13, 2017 9:12 AM
(18670 views)

I am trying to do a linear regression of 2 columns as grouped by a 3rd column and am stuck. I want a column with slope for that row, and another with intercept for that row.

Short background:

I have 30k different formulations that have each been tested with 13 different levels of an acid. I want to understand the slope of the pH as a function of acid concentration. For my range (and purpose) I can do this as a linear regression.

My data table is set up such as this, but with 30k Formula Combos and 13 acid levels for a total of 390k rows.

Formula |
Acid Level |
pH |
---|---|---|

ComboA |
5 | 6.7 |

ComboB | 5 | 6.75 |

ComboC | 5 | 6.5 |

ComboA | 10 | 6.4 |

ComboB | 10 | 6.45 |

I know that I can plot them in Fit Y by X, group the data, and then add linear fits but that 1) takes forever with 30k groups and 2) is a pain to pulls the slopes out of that report

I current have split my table by "Acid Level", splitting "pH", and keeping "Formula". I know have something like this, and want a way to fill in the right 2 columns.

Formula |
ph 5 | ph 10 | ph 15 | ph 20 | Slope | Intercept |
---|---|---|---|---|---|---|

ComboA | data |
data |
data |
data |
||

ComboB | data |
data |
data |
data |
||

ComboC | data |
data |
data |
data |

First preference is to do with a formula. If I need to do a script, I can do that as well.

Thanks

2 ACCEPTED SOLUTIONS

Accepted Solutions

Highlighted

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Get Direct Link
- Email to a Friend
- Report Inappropriate Content

Got most of the relevant code from Ian Cox's posting here. This will do the analysis and then put all of the slopes and intercepts into a table.

dt = New Table**(** "Sample Data", Add Rows**(** **6** **)**,

New Column**(** "Formula", Character, Nominal,

Set Values**(** **{**"ComboA", "ComboC", "ComboB", "ComboC", "ComboA", "ComboB"**}** **)**

**)**,

New Column**(** "Acid Level", Numeric, Continuous, Format**(** "Best", **12** **)**,

Set Values**(** **[****5**, **15**, **5**, **5**, **10**, **10]** **)**

**)**,

New Column**(** "pH", Numeric, Continuous, Format**(** "Best", **12** **)**,

Set Values**(** **[****6.7**, **4.8**, **6.75**, **6.5**, **6.4**, **6.45]** **)**

**)**

**)**;

// Run Fit Y by X

bv = dt << **Bivariate(**

Y**(** :pH **)**,

X**(** :Acid Level **)**,

Fit Line**(** **{**Line Color**(** **{****213**, **72**, **87}** **)}** **)**,

By**(** :Formula **)**

**)**;

// A reference to the report layer: bivREp is also a list

bivRep = bv << **Report**;

// 'Make Combined Data Table' avoids having to message to each item in 'bivRep'

dt2 = bivRep **[****1]** **[**TableBox**(3)]** << **Make Combined Data Table**;

Highlighted
Peter's script will serve you well. However, for reference if you were to do this by JMP rather than by JSL (it is often a toss up time wise depending on if it is a one time analysis or not) then if you fit your y-by-x you can "broadcast" your fit line, go get a cup of coffee if needed, then when JMP is done running right click on one of the parameter estimate tables and select "make combined data table" to get a table with the estimates. Which is how to execute Peter's script without the script.

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Get Direct Link
- Email to a Friend
- Report Inappropriate Content

8 REPLIES 8

Highlighted

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Get Direct Link
- Email to a Friend
- Report Inappropriate Content

Got most of the relevant code from Ian Cox's posting here. This will do the analysis and then put all of the slopes and intercepts into a table.

dt = New Table**(** "Sample Data", Add Rows**(** **6** **)**,

New Column**(** "Formula", Character, Nominal,

Set Values**(** **{**"ComboA", "ComboC", "ComboB", "ComboC", "ComboA", "ComboB"**}** **)**

**)**,

New Column**(** "Acid Level", Numeric, Continuous, Format**(** "Best", **12** **)**,

Set Values**(** **[****5**, **15**, **5**, **5**, **10**, **10]** **)**

**)**,

New Column**(** "pH", Numeric, Continuous, Format**(** "Best", **12** **)**,

Set Values**(** **[****6.7**, **4.8**, **6.75**, **6.5**, **6.4**, **6.45]** **)**

**)**

**)**;

// Run Fit Y by X

bv = dt << **Bivariate(**

Y**(** :pH **)**,

X**(** :Acid Level **)**,

Fit Line**(** **{**Line Color**(** **{****213**, **72**, **87}** **)}** **)**,

By**(** :Formula **)**

**)**;

// A reference to the report layer: bivREp is also a list

bivRep = bv << **Report**;

// 'Make Combined Data Table' avoids having to message to each item in 'bivRep'

dt2 = bivRep **[****1]** **[**TableBox**(3)]** << **Make Combined Data Table**;

Highlighted
Peter's script will serve you well. However, for reference if you were to do this by JMP rather than by JSL (it is often a toss up time wise depending on if it is a one time analysis or not) then if you fit your y-by-x you can "broadcast" your fit line, go get a cup of coffee if needed, then when JMP is done running right click on one of the parameter estimate tables and select "make combined data table" to get a table with the estimates. Which is how to execute Peter's script without the script.

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Get Direct Link
- Email to a Friend
- Report Inappropriate Content

Highlighted
##

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Get Direct Link
- Email to a Friend
- Report Inappropriate Content

Re: Linear Regression of Data Set by a Column

Coincidentally, I was just about to point out that, if you know you need linear regression and only want the parameters, then JSL is much faster. The code below took about 100 seconds on my laptop.

NamesDefaultToHere**(****1****)**;

// Make some grouped data

gs = **10**; // Group size

ng = **30000**; // Number of groups

groupID = SortAscending**(**Repeat**((****1**::ng**)**`, gs**))**;

xVals = J**(**NRow**(**groupID**)**, **1**, RandomNormal**())**;

yVals = J**(**NRow**(**groupID**)**, **1**, RandomNormal**())**;

// Put the data in a table

dt1 = NewTable**(**"Grouped Regression Data",

NewColumn**(**"Group", Numeric, Nominal, Values**(**groupID**))**,

NewColumn**(**"x", Numeric, Continuous, Values**(**xVals**))**,

NewColumn**(**"y", Numeric, Continuous, Values**(**yVals**))**,

**)**;

// Uncomment this block to use Fit Y By X to do the fitting

/*

// Use Fit Y By X

fxy = dt1 << Bivariate(Y( :y ), X( x ), Fit Line( {Line Color( {208, 64, 86} )} ), By(:Group));

// Get the pfitted parameters into a table

// (To do this interactively, right-click on any Parameter Estimates table in the report and select 'Make Combined Data Table')

fxyRep = fxy << Report;

dt2 = fxyRep[1][TableBox(3)] << makeCombinedDataTable;

// Split dt2

dt3 = dt2 << Split(Split By( :Term ), Split( :Estimate ), Group( :Group ), Remaining Columns( Drop All ));

dt3 << setName("Parameter Estimates from Fit Y By X");

Close(dt2, noSave);

*/

/// Do the linear regression in JSL, bypassing the platform . . .

// (1) Get the data from the table

xVals = Column**(**dt1, "x"**)** << **getAsMatrix**;

yVals = Column**(**dt1, "y"**)** << **getAsMatrix**;

groupID = Column**(**dt1, "Group"**)** << **getAsMatrix**;

// (2) Add the unit vector to the design matrix

xVals = J**(**NRow**(**xVals**)**, **1**, **1****)**||xVals;

// (3) Do the linear regression for each group and store the results

beta = J**(**ng, **2**, **.****)**;

for **(**g=**1**, g <= ng, g++,

thisGroup = Loc**(**groupID == g**)**;

beta**[**g, **0****]** = Transpose**(**Inv**(**xVals**[**thisGroup, **0****]**`*xVals**[**thisGroup, **0****])***xVals**[**thisGroup, **0****]**`*yVals**[**thisGroup**])**;

**)**;

// (4) Make the table of results

dt4 = AsTable**(**beta**)**;

dt4 << **setName****(**"Parameter Estimates from JSL"**)**;

Column**(**dt4, "Col1"**)** << **setName****(**"Intercept"**)**;

Column**(**dt4, "Col2"**)** << **setName****(**"x"**)**;

Highlighted
##

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Get Direct Link
- Email to a Friend
- Report Inappropriate Content

Re: Linear Regression of Data Set by a Column

Ian

thank you that is a nice approach with matrix.

is there a way to get the Rsquare as well?

Highlighted
##

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Get Direct Link
- Email to a Friend
- Report Inappropriate Content

Re: Linear Regression of Data Set by a Column

Once you have the fitted coefficients for each group (as above), you can get the predicted values, and hence residuals and the residual sum of squares (for each group). The total sum of squares (for each group) is easy too, so then you have everything needed to get R squared (for each group).

Bear in mind, though, that as you add more manipulations to the JSL it will take longer (though in this case still quicker than using the platform). Someone also needs to write and validate the code, of course!

Highlighted
##

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Get Direct Link
- Email to a Friend
- Report Inappropriate Content

Re: Linear Regression of Data Set by a Column

will check it considering your last comment. Thank you Ian

Highlighted
##

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Get Direct Link
- Email to a Friend
- Report Inappropriate Content

Re: Linear Regression of Data Set by a Column

coming back to this topic.

I understand all the manipulation and can aplply to my data but still very unclear about the following part:

beta = J(ng, 2, .);

for (g=1, g <= ng, g++,

thisGroup = Loc(groupID == g);

beta[g, 0] = Transpose(Inv(xVals[thisGroup, 0]`*xVals[thisGroup, 0])*xVals[thisGroup, 0]`*yVals[thisGroup]);

);

not able to figure out what is happenning except tht a beta matrix is created and all data replaced in a new table...

could you please help me understand this point?

Highlighted
##

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Get Direct Link
- Email to a Friend
- Report Inappropriate Content

Re: Linear Regression of Data Set by a Column

I found following way to calculate expression.

SS_TOT = Summation( i = 1, N Row( dt1 ), (Column( dt1, k )* - a_mean ) ^ 2 / N Row( dt1 ) );*

but calculation is made for all column "k".

Any hint on how I could include some group separation?

SS_TOT[gp#1] / SS_TOT[gp#2] ...

Article Labels

There are no labels assigned to this post.