bi0
Level I

Partial Least Squares

Hello all,

 

I am trying to run a multivariate partial least squares (PLS) analysis on my dataset in JMP. The dataset consists of 10 rows and 59,740 columns (FT-IR data, where many thousands of "discrete" measurements together form a continuous curve). I get the following error when I try to run the PLS with 59,738 columns as the X factors and two columns as the Y responses.

 

This dataset is structured in the same way as the Baltic sample dataset but with many more measurements.

 

[Image: Untitled.png, screenshot of the error message]

 

Does anyone know how to resolve this? It works if I select only a small number of the X factors rather than all 59,738 of them. Have I exceeded the amount of data JMP can handle?

 

Thank you

7 REPLIES
ian_jmp
Staff

Re: Partial Least Squares

Probably best to email support@jmp.com referencing this thread.

 

From a methodological point of view, have you plotted the data? Do you expect all regions of wavelength/frequency to discriminate between the responses?

bi0
Level I

Re: Partial Least Squares

Hey Ian, thanks for the reply. I will email JMP support. I have plotted the data, and not all regions discriminate between the responses. I can get the PLS to run by manually eliminating the non-responsive X factors (which reduces the number of columns to consider from 59,738 to about 8,000). However, this is not ideal: I would much rather find a systematic, computed method for eliminating non-responsive X factors than do it by hand, which is prone to overfitting and may end up modelling spectral noise.

Somehow, I would like to use the scaled and centred model coefficients to decide on useful wavenumber ranges (criterion: e.g. coefficient >= 5% of max?). I don't know how to do this, so any ideas would be greatly appreciated.
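
To make the "5% of max" idea concrete, here is a minimal sketch in Python rather than JSL. It assumes the scaled and centred coefficients have already been exported from the PLS report as one list per Y response, and that the comparison is made on absolute values. The function name `select_wavenumbers`, the export layout, and the toy numbers are all hypothetical, not JMP output.

```python
def select_wavenumbers(wavenumbers, coefs_by_response, fraction=0.05):
    """Keep wavenumbers whose largest |coefficient| across responses
    is at least `fraction` of the overall largest |coefficient|.

    coefs_by_response: one list of coefficients per Y response, each the
    same length as `wavenumbers` (assumed export format, not a JMP API).
    """
    # Largest absolute coefficient at each wavenumber across all Y responses
    peak = [max(abs(c) for c in column) for column in zip(*coefs_by_response)]
    cutoff = fraction * max(peak)
    return [w for w, p in zip(wavenumbers, peak) if p >= cutoff]


# Toy example: 5 wavenumbers, 2 responses (made-up values)
wavenumbers = [4000, 3000, 2000, 1650, 1000]
coefs = [
    [0.001, 0.20, -0.004, 0.90, 0.03],   # response 1
    [0.002, -0.10, 0.003, -1.00, 0.06],  # response 2
]
kept = select_wavenumbers(wavenumbers, coefs)
print(kept)
```

Taking the maximum over both responses keeps a wavenumber if it matters for either Y. Requiring it to matter for both would be a stricter variant of the same rule.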
bi0
Level I

Re: Partial Least Squares

I realise now that I can do this by running the PLS and then making a model using VIP. However, I get some strange results when I do this on my X-reduced dataset. It decides that the number of factors that minimises PRESS is 0, so I cannot extract any meaningful information. This is contrary to the initial PLS, which settles on seven factors and has reasonably strong predictive power for X (97.7%) and Y (99.9%). It seems that removing unimportant X factors makes the model worse, which does not make sense to me. Perhaps I am misunderstanding this result?
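
For reference, the VIP (variable importance in projection) score is commonly defined as VIP_j = sqrt(p * sum_a(SS_a * (w_ja / ||w_a||)^2) / sum_a(SS_a)), where p is the number of X variables, w_a is the X weight vector of factor a, and SS_a is the Y variation explained by factor a. Below is a minimal Python sketch of that textbook definition. Whether JMP computes VIP in exactly this form is an assumption, the 0.8 cutoff is a commonly quoted rule of thumb rather than something stated in this thread, and the inputs are made-up numbers.

```python
import math


def vip_scores(weights, ss_y):
    """weights: one X-weight vector per factor (each of length p).
    ss_y: Y sum of squares explained by each factor.
    Returns one VIP score per X variable."""
    p = len(weights[0])
    total = sum(ss_y)
    scores = []
    for j in range(p):
        acc = 0.0
        for w, ss in zip(weights, ss_y):
            norm_sq = sum(v * v for v in w)  # squared length of this weight vector
            acc += ss * (w[j] * w[j]) / norm_sq
        scores.append(math.sqrt(p * acc / total))
    return scores


# Made-up example: 4 X variables, 2 factors
weights = [[0.9, 0.4, 0.1, 0.1], [0.1, 0.8, 0.5, 0.0]]
ss_y = [8.0, 2.0]
vip = vip_scores(weights, ss_y)
keep = [j for j, v in enumerate(vip) if v > 0.8]  # 0.8 cutoff is an assumption
print([round(v, 3) for v in vip], keep)
```

By construction the squared VIP scores average to 1, which is why cutoffs near 1 are used. Note that the scores are undefined when the model has zero factors, because the denominator is then zero.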

Re: Partial Least Squares

I agree with Ian on contacting support, but there are a couple of other options you can try along the way.

 

It looks as if you are using JMP Pro.  Have you tried Generalized Regression?  You can limit the number of important intensities using a penalized regression approach like Elastic Net. 

Also a good way to look at spectral data is to use Functional Data Explorer using Rows as Functions.  Using your outputs of concentrations of imines and amines as "Z Supplementary" will allow you to use the Functional DOE capability (JMP Pro 15) to get a Generalized Regression model of your data.  Make sure to put your sample ID column in to fit the individual curves and use a P-Spline to fit the model.  This might take a little bit with 59,740 columns, but I believe it is worth a shot.

 

HTH

Bill

P_Bartell
Level VI

Re: Partial Least Squares

Another question for you: are the 'zero' responses really numerically zero, or just placeholders for a 'missing value'? Zeros look odd to me in the context of the other values in each column. JMP needs to know whether they are truly zero (which appears to be how your analysis is treating them so far) or missing and, if they are missing, how you would like to handle the 'missingness'.

bi0
Level I

Re: Partial Least Squares

Thanks for highlighting this. They are true numerical zeros, not missing data points.
P_Bartell
Level VI

Re: Partial Least Squares

OK...the zeros are actual numeric data. Based on the error message, I'm wondering whether somewhere in the long list of over 50,000 predictor variables you have one or more columns that consist entirely of missing values. If I'm interpreting the error message correctly, replacing missing values with the column mean is the chosen method for handling them. Have you run any of the missing value exploratory platforms to see whether this is in fact the root cause of the error? If a column is entirely missing, its mean can't be calculated. If missing values aren't the issue, then I'm at a loss and, along with @ian_jmp and @bill_worley, think reaching out to JMP Technical Support might be the best recourse.

 

I (during my tenure as a JMP systems engineer) once had a customer that was encountering a similar 'error' type message and the root cause was missing values scattered throughout the data table that rendered execution of the analysis platform she was trying to use impossible.
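
If it is easier to check outside JMP, the two cases described above (entirely missing columns and scattered missing values) can be screened with a few lines of Python. This is a minimal sketch that assumes the table has been exported and loaded as a dictionary of columns. The helper names `is_missing` and `missing_report` and the toy values are hypothetical.

```python
import math


def is_missing(value):
    """Treat None and NaN as missing; a numeric 0 is real data."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def missing_report(columns):
    """columns: dict mapping column name -> list of values.
    Returns (names of entirely missing columns,
             dict of partially missing column name -> count of missing values)."""
    all_missing, partial = [], {}
    for name, values in columns.items():
        n = sum(is_missing(v) for v in values)
        if n == len(values):
            all_missing.append(name)
        elif n > 0:
            partial[name] = n
    return all_missing, partial


# Made-up table: three rows, four spectral columns
table = {
    "4000": [0.0, 0.12, 0.10],           # true zeros are not missing
    "3999": [None, None, None],          # entirely missing
    "3998": [0.11, float("nan"), 0.09],  # one scattered missing value
    "3997": [0.10, 0.13, 0.08],
}
empty_cols, partial_cols = missing_report(table)
print(empty_cols, partial_cols)
```

Columns in the first list have no mean to impute from, so they would need to be dropped or filled before the analysis. Columns in the second list are candidates for mean imputation.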
