
## Is there any sample data set for testing Logistic Regression with a large number (10-20) of independent variables?

My problem has tens of effects (independent variables) and one categorical dependent variable. The probability that the dependent variable takes the value "1" is very small, about 10 in a million. Consequently, I have tens of millions of data points, from which I would like to fit a Logistic Regression model whose coefficients for the independent variables let me compute the probability of the dependent variable being "1".
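For a sense of scale, here is a rough calculation. It assumes the event rate is exactly 10 in a million and takes $N = 10^7$ rows as an illustrative round number, not a figure from my actual data:

$$
p = \frac{10}{10^{6}} = 10^{-5}, \qquad \mathbb{E}[\text{number of 1's}] = N\,p = 10^{7} \times 10^{-5} = 100 .
$$

So under these assumptions, every 10 million rows contribute only about 100 observations with a value of "1", which is why so many rows are needed.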

Before I run Logistic Regression and just take whatever coefficients it spits out, I would like to see how well JMP handles a problem of such dimensionality. Is anyone aware of a test data set that I could use to test JMP on a problem of similar size?

Thank you for the help.

3 REPLIES

## Re: Is there any sample data set for testing Logistic Regression with a large number (10-20) of independent variables?

Try Kaggle.com, though using their data for this probably violates their TOS.


## Re: Is there any sample data set for testing Logistic Regression with a large number (10-20) of independent variables?

The link is for a dataset that MWSUG provided last year as part of an expert panel I was on. The dataset has about 1 million records and roughly 20 independent variables.


## Re: Is there any sample data set for testing Logistic Regression with a large number (10-20) of independent variables?

Alternatively, use an existing JMP sample data set and expand it to the desired size.

For the table built by the script below (more than 16 million rows, 13 independent variables), JMP 11 returns the parameter estimates in less than 30 s.

    // Open the Body Fat sample table
    dt = Open( "$SAMPLE_DATA/Body Fat.jmp" );

    // Add nominal binomial column with rare 1's
    dt << New Column( "Over 70", numeric, nominal, values( (Column( "Age(years)" ) << get values) > 70 ) );

    // Delete redundant columns
    dt << delete columns(
        Eval( dt << get column group( "Prediction Formulas" ) ||
        {Column( "Validation" ), Column( "Age(years)" )} )
    );

    // Remaining columns; the last one is the new "Over 70" response
    cols = dt << get column names();

    // Make dataset bigger! Each pass appends the table to itself, doubling the row count
    While( N Row( dt ) < 1e7, dt << concatenate( dt, append to first table ) );
    Wait( 0 );

    // Fit nominal logistic, using every column except the last as an effect
    Fit Model(
        Y( :Over 70 ),
        Effects( Eval( cols[1 :: N Items( cols ) - 1] ) ),
        Personality( Nominal Logistic ),
        Run( Likelihood Ratio Tests( 1 ), Wald Tests( 0 ) )
    );
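As a rough check on the table size, the doubling loop above can be worked through by hand. This assumes the Body Fat sample table starts with $n_0 = 252$ rows, which is not stated in the script. Each pass doubles the row count, and the loop stops at the first $k$ for which the count reaches $10^7$:

$$
n_k = n_0 \cdot 2^{k}, \qquad k = \left\lceil \log_2 \frac{10^{7}}{252} \right\rceil = \lceil 15.28 \rceil = 16, \qquad n_{16} = 252 \times 65\,536 = 16\,515\,072 .
$$

Under that assumption the final table has about 16.5 million rows, consistent with the "more than 16 million" figure. Because the rows are exact copies, the share of 1's in "Over 70" stays the same as in the original table.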
