Mar 24, 2014 1:28 PM
I have a problem with tens of effects (independent variables) and one categorical dependent variable. The probability of the dependent variable taking the value "1" is very small, about 10 in a million. Consequently, I have tens of millions of data points from which I would like to fit a logistic regression model that gives me the coefficients for the independent variables, so that I can compute the probability of the dependent variable.
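
To put the rarity in concrete terms, the short Python sketch below computes how many "1" rows to expect at a rate of 10 per million. The row counts are illustrative assumptions (the post only says tens of millions), and `expected_events` is a hypothetical helper name.

```python
# Expected number of "1" outcomes at a rate of 10 per million rows.
# The row counts below are illustrative assumptions, not figures from the post.
EVENTS_PER_MILLION = 10


def expected_events(n_rows, events_per_million=EVENTS_PER_MILLION):
    """Return the expected count of rows where the response equals 1."""
    return n_rows * events_per_million / 1_000_000


for n_rows in (10_000_000, 20_000_000, 50_000_000):
    print(f"{n_rows:>11,} rows -> about {expected_events(n_rows):,.0f} events")
```

Under these assumed sizes, the fit is informed by only a few hundred events even though the table has tens of millions of rows.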

Before I run Logistic Regression and just take whatever coefficients it spits out, I would like to see how well JMP handles a problem of such dimensionality. Is anyone aware of a test data set that I could use to test JMP on a problem of similar size?

Thank you for the help.

Mar 24, 2014 1:57 PM
Try Kaggle.com, though that probably violates their TOS.

Mar 24, 2014 1:58 PM
You might want to download the data from the link mentioned in the paper at: http://www.sascommunity.org/wiki/Expert_Panel_Solution_MWSUG_2013-Tabachneck

The link is for a dataset that MWSUG provided last year as part of an expert panel I was on. The dataset has around 1 million records and about 20 independent variables.
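
If a downloaded dataset turns out to be too small, or its event is not rare enough, one option is to simulate a table from a logistic model with known coefficients, so the estimates from any fitting tool can be compared with the truth. The following is a minimal Python sketch under stated assumptions: the predictors are independent standard normals, the coefficients and function names are made up for illustration, and the intercept uses the rare-event approximation that the marginal rate is about exp(intercept + sum(b^2)/2).

```python
import csv
import math
import random


def rare_intercept(target_rate, coefs):
    """Approximate intercept giving `target_rate` for N(0, 1) predictors.

    Uses p ~ exp(eta) for rare events and E[exp(b * X)] = exp(b**2 / 2).
    """
    return math.log(target_rate) - sum(b * b for b in coefs) / 2.0


def simulate_rare_logistic(n_rows, coefs, intercept, seed=1):
    """Yield rows [x_1, ..., x_k, y] drawn from a logistic model."""
    rng = random.Random(seed)
    for _ in range(n_rows):
        x = [rng.gauss(0.0, 1.0) for _ in coefs]
        eta = intercept + sum(b * xi for b, xi in zip(coefs, x))
        p = 1.0 / (1.0 + math.exp(-eta))
        y = 1 if rng.random() < p else 0
        yield x + [y]


def write_csv(path, rows, n_predictors):
    """Write simulated rows to a CSV file with columns x1..xk and y."""
    header = [f"x{i + 1}" for i in range(n_predictors)] + ["y"]
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(rows)


if __name__ == "__main__":
    # Small demonstration with a less extreme rate so a few events show up.
    coefs = [0.5, -0.3, 0.2]  # hypothetical "true" coefficients
    b0 = rare_intercept(1e-3, coefs)
    sample = simulate_rare_logistic(100_000, coefs, b0)
    events = sum(row[-1] for row in sample)
    print(f"intercept={b0:.3f}, events in 100,000 rows: {events}")
```

For a full-size test, the same generator can be passed to `write_csv` with a larger `n_rows` and a target rate of `1e-5`, and the resulting file imported into the fitting software. Pure Python is slow at tens of millions of rows, so this is meant as a starting point rather than a tuned generator.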

Mar 24, 2014 4:10 PM
Alternatively, use an existing JMP sample data set and expand it to the desired size.

For the table built by the script below (>16 million rows, 13 independent variables), JMP 11 returns the parameter estimates in less than 30 s.

dt = Open( "$SAMPLE_DATA/Body Fat.jmp" );

// Add nominal binomial column with rare 1's
dt << New Column( "Over 70", numeric, nominal, values( (Column( "Age(years)" ) << get values) > 70 ) );

// Delete redundant columns
dt << delete columns(
    Eval( dt << get column group( "Prediction Formulas" ) ||
    {Column( "Validation" ), Column( "Age(years)" )} )
);
cols = dt << get column names();

// Make dataset bigger!
While( N Row( dt ) < 1e7, dt << concatenate( dt, append to first table ) );
Wait( 0 );

// Fit nominal logistic
Fit Model(
    Y( :Over 70 ),
    Effects( Eval( cols[1 :: N Items( cols ) - 1] ) ),
    Personality( Nominal Logistic ),
    Run( Likelihood Ratio Tests( 1 ), Wald Tests( 0 ) )
);
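
The While loop in the script appends the table to itself on each pass, so the row count doubles until it reaches at least 1e7. The Python sketch below reproduces that arithmetic. The starting size of 252 rows is an assumption about Body Fat.jmp that the post does not state, and `rows_after_doubling` is a hypothetical helper name.

```python
def rows_after_doubling(start_rows, threshold):
    """Mimic the loop: double the row count until it is at least `threshold`."""
    rows, doublings = start_rows, 0
    while rows < threshold:
        rows *= 2
        doublings += 1
    return rows, doublings


# 252 starting rows is an assumption about Body Fat.jmp, not stated in the post.
rows, doublings = rows_after_doubling(252, 10_000_000)
print(rows, doublings)  # 16515072 16
```

With that assumed starting size, the final count is consistent with the ">16 million rows" quoted above: the loop overshoots the 1e7 threshold because the last doubling starts from about 8.3 million rows.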