Is there any sample data set for testing Logistic Regression with a large number (10-20) of independent variables?

ranjan_mitre_or

Community Trekker

Joined:

Oct 16, 2013

I have a problem with tens of effects (independent variables) and one categorical dependent variable. The probability of the dependent variable taking the value "1" is very small, about 10 in a million. Consequently, I have tens of millions of data points from which I would like to estimate a Logistic Regression model that gives me the coefficients of the independent variables, so that I can compute the probability of the dependent variable.
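
To put rough numbers on this: in a rare-event problem, the count of "1" outcomes, rather than the total row count, is what mostly limits how well the coefficients can be estimated. A back-of-the-envelope sketch, assuming a hypothetical N = 2 x 10^7 rows (the post only says "tens of millions") and the stated rate p:

```latex
\[
p = \frac{10}{10^{6}} = 10^{-5}, \qquad
\mathbb{E}\bigl[\#\{y = 1\}\bigr] = N\,p = (2 \times 10^{7})(10^{-5}) = 200
\]
```

Under that assumption, the table would hold only a couple of hundred events to spread over 10 to 20 effects.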

Before I run Logistic Regression and just take whatever coefficients it spits out, I would like to see how well JMP handles a problem of such dimensionality. Is anyone aware of a test data set that I could use to test JMP on a problem of similar size?

Thank you for the help.

3 REPLIES
reeza

Community Trekker

Joined:

Jun 23, 2011

Try Kaggle.com, though that probably violates their TOS.

You might want to download the data in the link mentioned in the paper at: http://www.sascommunity.org/wiki/Expert_Panel_Solution_MWSUG_2013-Tabachneck

The link is for a dataset that MWSUG provided last year as part of an expert panel I was on. The dataset has around 1 million records and about 20 independent variables.

ms

Super User

Joined:

Jun 23, 2011

Alternatively, use an existing JMP sample data set and expand it to the desired size.

For the table built by the script below (more than 16 million rows, 13 independent variables), JMP 11 returns the parameter estimates in less than 30 s.
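
The row count follows from the doubling loop in the script: each pass of the While loop concatenates the table onto itself, so the number of rows doubles until it reaches at least 1e7. Assuming Body Fat.jmp starts with 252 rows (the size of the classic body fat data; check N Row( dt ) on your own installation), the arithmetic would be:

```latex
\[
252 \cdot 2^{15} = 8\,257\,536 < 10^{7}, \qquad
252 \cdot 2^{16} = 16\,515\,072 \ge 10^{7}
\]
```

Under that assumption the loop stops after 16 doublings, which matches the "more than 16 million rows" figure.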

dt = Open( "$SAMPLE_DATA/Body Fat.jmp" );

// Add nominal binomial column with rare 1's
dt << New Column( "Over 70", numeric, nominal, values( (Column( "Age(years)" ) << get values) > 70 ) );

// Delete redundant columns
dt << delete columns(
  Eval( dt << get column group( "Prediction Formulas" ) ||
   {Column( "Validation" ),Column( "Age(years)" )})
);
cols = dt << get column names();

// Make dataset bigger!
While( N Row( dt ) < 1e7, dt << concatenate( dt, append to first table ) );
Wait( 0 );

// Fit nominal logistic
Fit Model(
  Y( :Over 70 ),
  Effects( Eval( cols[1 :: N Items( cols ) - 1] ) ),
  Personality( Nominal Logistic ),
  Run( Likelihood Ratio Tests( 1 ), Wald Tests( 0 ) )
);
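
A possible variation, if the goal is also to check the estimates and not just the run time: simulate the table from coefficients you choose, so the fitted values can be compared with known ones. The following is only a rough sketch. The names n, b0, b1, b2, X1, X2 and Y and all numeric values are made up for illustration, and it has not been timed at this size.

```jsl
// Hypothetical sketch: simulate a rare-event table with known coefficients.
// All names and values here are illustrative assumptions, not from the thread.
n  = 1e7;    // number of rows; adjust toward the real problem size
b0 = -11.5;  // intercept: Exp( -11.5 ) is roughly 1e-5, about 10 in a million
b1 = 0.5;    // assumed true coefficient for X1
b2 = -0.3;   // assumed true coefficient for X2

// Two independent standard normal predictors
dt = New Table( "Simulated Rare Events",
	Add Rows( n ),
	New Column( "X1", Numeric, Continuous, Set Each Value( Random Normal() ) ),
	New Column( "X2", Numeric, Continuous, Set Each Value( Random Normal() ) )
);

// Binary response: 1 with probability 1 / (1 + Exp( -(linear predictor) ))
dt << New Column( "Y", Numeric, Nominal,
	Set Each Value(
		Random Uniform() < 1 / (1 + Exp( -(b0 + b1 * :X1 + b2 * :X2) ))
	)
);

// Fit the same kind of model as in the script above
Fit Model(
	Y( :Y ),
	Effects( :X1, :X2 ),
	Personality( Nominal Logistic ),
	Run( Likelihood Ratio Tests( 1 ), Wald Tests( 0 ) )
);
```

When comparing the estimates with b0, b1 and b2, check which response level the report is modeling. If it models the log odds of the first level (0 here) rather than of 1, the fitted coefficients would be expected to come back with the opposite signs.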