Solved: Long time for big data

Kamil · Jun 8, 2023 5:56 PM

Hi,

I'm using JMP on a fairly large dataset: 1000 variables and 500 subjects. Running a simple Fit Model takes a long time, around a couple of minutes. I have a good laptop with 64 GB of RAM, a 4 GB GPU, and so on. While JMP is running, no resource even gets close to 30% usage, yet it takes forever for JMP to do anything. I even gave JMP Realtime priority.

Any suggestions?

Best

Kamil

Chris_Kirchberg · Nov 10, 2022 3:12 PM

Hi Kamil,

The bottleneck is the Fit Model>Standard Least Squares implementation. Even with Fit Separately, it is trying to put 1000 model results into a single report, and that is just the start of it. My guess is that it does this serially (one model at a time), so only one processor thread is being used to generate the report. But I am only guessing.

I am not sure whether you can tell JMP how to use computer resources (except to turn multithreading off on a per-platform basis in some of the platform dialogs). Many of the platforms are already multithreaded (I think Response Screening might be one of them). I am not sure about Fit Model>Standard Least Squares. I am also guessing that report generation is single-threaded, but I would need a developer to tell me how this works behind the scenes. Some functions cannot be multithreaded, so full CPU utilization might not be possible. Memory is similar. You can consume all of your memory just by clustering large data sets (1000s by 1000s in a two-way hierarchical cluster), and that memory mainly stores the information needed to generate the graphical output.

I hope this gives some idea of what JMP does in the background. Where possible, it tries to use whatever resources are available for a given function, capability, or platform.

By the way, we are continuing to enhance JMP Pro to accommodate wide data sets, such as yours. Fit Model>Mixed Models is one example where we have a red triangle option to dispose reports and only show tables. This makes it fast to fit a mixed model on 1000s of responses, but does not give a graphically based report.

Chris Kirchberg, M.S.²
Data Scientist, Life Sciences - Global Technical Enablement
JMP Statistical Discovery, LLC. - Denver, CO
Tel: +1-919-531-9927 ▪ Mobile: +1-303-378-7419 ▪ E-mail: chris.kirchberg@jmp.com
www.jmp.com


pauldeen · Nov 7, 2022 11:30 AM

If you are doing this in a script, make sure you open data tables in private mode to save on memory addressing time.

Close all other applications so that you have as much RAM available as you can and eliminate swapping.

peng_liu · Nov 8, 2022 11:39 AM

When you said 30% usage, is that CPU? I guess that you are not talking about memory size. I think that you are talking about a specific modeling type whose performance is slow for the size (or complexity) of your data. For example, least squares should not have issues like this, but other methods may.

P_Bartell · Nov 9, 2022 7:29 AM

To add a bit to @peng_liu's thoughts, it might help us give guidance if you tell us a bit more about your data, like the nature of the 'variables': are they all x's, and/or how many y's are there? Please explain the difference between a 'variable' and a 'subject'. Data type can also play a role in processing time in some platforms. And 'simple fit model' doesn't tell us which personality you actually chose; some take longer than others. Each platform dialog requires a model. Can you tell us what your model structure is? Lastly, can you tell us more about your modeling objectives? With '...1000 variables...', if your modeling objective is variable identification, probably the last platform I'd look at is Fit Model and any of the personalities therein.

Mark_Bailey · Nov 9, 2022 10:08 AM

Also, is it numeric data or strings? If strings, how many variables contain strings, and what is the nature of a string?

Kamil · Nov 10, 2022 12:03 PM

Hi P_Bartell and all,

Thank you for your response. So this is metabolomic data. 1000 continuous variables as columns measured in 500 subjects (rows). The fit model I was trying to run is just a t-test for the genotype (just 2 categorical groups) effect with covariates. So, all my metabolites in Y and my genotype, BMI, sex, and age as model effects. Fit separately.
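As a rough illustration of the model described above (each metabolite regressed on genotype, BMI, sex, and age, fit separately), here is a minimal Python/NumPy sketch. It is not how JMP implements Fit Model; the data are synthetic, and the function name `fit_all_responses` and all variable names are made up for the example. It only suggests that the least-squares arithmetic for 1000 responses sharing one design matrix is small, which is consistent with the guess elsewhere in this thread that the time goes into building the report rather than into the fitting itself.

```python
import numpy as np

def fit_all_responses(X, Y, effect_col):
    """Ordinary least squares of every column of Y on the shared design matrix X.

    Returns the coefficient and t statistic for column `effect_col` of X
    (one value per response) and the residual degrees of freedom.
    """
    n = X.shape[0]
    # A single least-squares solve handles all responses at once.
    beta, _, rank, _ = np.linalg.lstsq(X, Y, rcond=None)
    resid = Y - X @ beta
    dof = n - rank
    sigma2 = (resid ** 2).sum(axis=0) / dof        # residual variance per response
    xtx_inv = np.linalg.inv(X.T @ X)               # assumes X has full column rank
    se = np.sqrt(sigma2 * xtx_inv[effect_col, effect_col])
    return beta[effect_col], beta[effect_col] / se, dof

# Synthetic stand-in for the data in this thread (assumed shapes only).
rng = np.random.default_rng(0)
n_subjects, n_metabolites = 500, 1000
genotype = rng.integers(0, 2, n_subjects)          # two groups coded 0/1
bmi = rng.normal(25, 4, n_subjects)
sex = rng.integers(0, 2, n_subjects)
age = rng.normal(50, 10, n_subjects)
X = np.column_stack([np.ones(n_subjects), genotype, bmi, sex, age])
Y = rng.normal(size=(n_subjects, n_metabolites))   # one column per metabolite

coef, t_stat, dof = fit_all_responses(X, Y, effect_col=1)
print(coef.shape, t_stat.shape, dof)
```

Each entry of `t_stat` is the t statistic for the genotype effect on one metabolite, adjusted for the covariates.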

I'm trying to understand why my computer is not used to its full capacity during those operations. My CPU is at 30%, and my RAM is at maybe 10%. What is the bottleneck here? Can I set up JMP to use more of my computer?

Thanks

Kamil

Chris_Kirchberg · Nov 10, 2022 01:20 PM

Hi @Kamil ,

What version of JMP are you using? Have you tried the Response Screening Personality in Fit Model? If you have JMP Pro, then this personality is present. If only JMP, Response Screening can be found under Analyze>Screening>Response Screening (this method only allows for one model effect to be tested at a time). Response Screening is specifically designed to do this and it is very fast.


Kamil · Nov 10, 2022 03:22 PM

Hi Chris,

I have JMP 17 Pro. Thanks for the tip. This makes things faster. Still, is there a setting that enables JMP to use more of my computer's capacity? I'm trying to understand what the bottleneck is.

Best

Kamil
