
Data analysis with many more columns than rows

Hello everybody,
I would like to ask for your help because I have a data set with:

  • in columns, process parameters of a galvanized steel production line
  • in rows, coils that have been processed on this production line

The objective is to correlate the quantity of defects appearing on a coil with one or several process parameters.
Unfortunately, I have a data set with many more columns (407) than rows (63).
Do you have any advice on how to proceed?

4 REPLIES
Victor_G
Super User

Re: Data analysis with many more columns than rows (Accepted Solution)

Hello @Steph_Georges,

There are a lot of options, depending on whether you are using JMP or JMP Pro.
Before going directly to analysis, I would visualize the correlations among the process parameters, and between the process parameters and the response (defects), as this can indicate correlated variables and influential inputs for the analysis of defects (the "Multivariate" platform, in Analyze -> Multivariate Methods).
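Outside JMP, the same correlation screen can be sketched in a few lines of pandas. This is only a minimal sketch: the file name "coils.csv" and the column name "defects" are hypothetical placeholders, not from the thread.

import numpy as np
import pandas as pd

# Hypothetical file: one row per coil, the 407 parameter columns plus a
# "defects" response column (both names are placeholders).
df = pd.read_csv("coils.csv")

corr = df.corr(numeric_only=True)  # pairwise Pearson correlations

# Parameters most correlated with the defect response, strongest first
with_response = corr["defects"].drop("defects").abs().sort_values(ascending=False)
print(with_response.head(20))

# Pairs of columns that are strongly correlated with each other
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)  # upper triangle only
pairs = corr.where(mask).stack()                      # (col_i, col_j) -> r
print(pairs[pairs.abs() > 0.9].sort_values(key=abs, ascending=False))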
Then, some options to consider:

- JMP:

  • A first approach could be a Principal Component Analysis (PCA), then modeling the defect response with the principal components. This approach will cost you some explainability about which factors impact the defects, though, since the original factors are combined into the PCA variables.
  • Another option would be Partial Least Squares (PLS), a very efficient approach when you deal with many correlated variables and very few observations. It is used, for example, in spectral analysis, where you have few observations but a lot of input data (transmittance/intensity at each wavelength). A sketch of both approaches follows this list.
  • It may also be interesting to try Predictor Screening, to get an overview of which factors are the most important.
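For readers working outside JMP, here is a minimal scikit-learn sketch of both ideas (PCA-then-regression and PLS) on placeholder data. X, y, and the component counts are assumptions for illustration, not values from the thread; with so few rows, cross-validated scores are the only honest way to compare the two.

import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(63, 407))  # stand-in for the 63 x 407 parameter matrix
y = rng.normal(size=63)         # stand-in for the defect response

# PCA regression: compress the parameters first, then regress on the components
pcr = make_pipeline(StandardScaler(), PCA(n_components=10), LinearRegression())

# PLS: extracts components that maximize covariance with the response directly
pls = make_pipeline(StandardScaler(), PLSRegression(n_components=5))

for name, model in [("PCR", pcr), ("PLS", pls)]:
    scores = cross_val_score(model, X, y, cv=5, scoring="r2")
    print(f"{name}: mean cross-validated R^2 = {scores.mean():.2f}")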

 

- JMP Pro:

  • Instead of Predictor Screening, you can try "Random Forest", where you will get much more information about the model and the ability to save the prediction formula. Since a random forest uses feature bagging for each tree, even with multicollinearity and more parameters than observations, each parameter has the same probability of being selected in a tree, so you will get a good view of which factors impact the defects.
  • You can also look at Generalized Regression models with penalized estimation methods like the Lasso, Ridge, or Elastic Net. A rough sketch of both options follows this list.
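As a rough scikit-learn analogue of these two platforms (not the JMP Pro implementations themselves), on the same placeholder data as above:

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import ElasticNetCV

rng = np.random.default_rng(0)
X = rng.normal(size=(63, 407))  # placeholder parameter matrix, as before
y = rng.normal(size=63)         # placeholder defect response

# Random forest: feature bagging gives every parameter a chance to enter a
# tree, so correlated parameters do not completely mask one another
rf = RandomForestRegressor(n_estimators=500, max_features="sqrt", random_state=0)
rf.fit(X, y)
top10 = np.argsort(rf.feature_importances_)[::-1][:10]
print("Top 10 parameter indices by importance:", top10)

# Elastic Net: the L1 part zeroes out parameters, the L2 part shares weight
# among correlated ones (real data should be standardized first)
enet = ElasticNetCV(l1_ratio=[0.2, 0.5, 0.8, 1.0], cv=5, max_iter=5000).fit(X, y)
print("Parameters kept by the Elastic Net:", int(np.sum(enet.coef_ != 0)))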


These are the models/techniques I would try first, before moving to more complex methods.
I'll let the experts in this community complete my answer.
I hope this helps,

Victor GUILLER
Scientific Expertise Engineer
L'Oréal - Data & Analytics

Re: Data analysis with many more columns than rows

Thanks a lot, Victor, for your extremely quick and complete answer.

As I still have JMP Pro, I will be able to test all the hints you mentioned.

Victor_G
Super User

Re: Data analysis with many more columns than rows

Perfect, then you have a lot of options to try!

It may be interesting to try the simplest models before diving into the JMP Pro models, and to compare the different models not only in terms of performance/significance, but also against your domain expertise: does it make sense that this process parameter is so important for the defect response (for example)?

Combining the different insights from the different models with your domain expertise should lead you in the right direction.

Have a nice day, Stéphane,

Victor GUILLER
Scientific Expertise Engineer
L'Oréal - Data & Analytics
statman
Super User

Re: Data analysis with many more columns than rows

Are you just data mining?

 

I have the following thoughts:

1. I would start with a thoughtful consideration of the columns and develop rational hypotheses as to why they would or would not affect the number of defects or the defect rate. Use these hypotheses as a guide for iterating your model.

2. I would be cautious about using defects as the response. Defects may be the result of different failure mechanisms and therefore have different causal structures, and this may get lost in an aggregate defect density. Are there better measures? The more continuous the measures, the more effective and efficient the study.

3. If data mining, I agree with Victor on assessing the multicollinearity of the column factors. Scatter plots are excellent for this.

4. Regression would be a place to start looking at relationships between factors and defects. Again, the nature of the response variable may constrain what you can use. Typically I would start with linear effects and take an additive approach to model building (adding terms like interactions and non-linear effects over iterations); procedures like stepwise and PLS may be useful, but always be able to justify the relationships with subject-matter rationalization (this is very difficult with PCA). A sketch of the additive approach follows this list.

5. You have measurement error throughout the data set, and likely even in the classification of defects. There will be little chance to estimate it from your data set.

6. Context is important. As Victor mentions, you may have lagged-variable effects and hidden variables. Just remember that you are data mining to make your next iteration (directed sampling or experimentation) more effective, NOT to draw conclusions.
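A minimal sketch of that additive approach, on the same kind of placeholder data as the earlier sketches: greedily add the single column that most improves the cross-validated fit, and stop when nothing helps. Only the main-effects step is shown; interactions and curvature terms would enter the candidate pool in later iterations.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(63, 407))  # placeholder parameter matrix
y = rng.normal(size=63)         # placeholder defect response

selected, best = [], -np.inf
while True:
    remaining = [j for j in range(X.shape[1]) if j not in selected]
    if not remaining:
        break
    # score each candidate column when added to the current model
    scores = {j: cross_val_score(LinearRegression(), X[:, selected + [j]],
                                 y, cv=5, scoring="r2").mean()
              for j in remaining}
    j_best = max(scores, key=scores.get)
    if scores[j_best] <= best:
        break  # no candidate improves the cross-validated fit: stop
    selected.append(j_best)
    best = scores[j_best]

print("Selected columns:", selected, "| CV R^2:", round(best, 3))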
"All models are wrong, some are useful" G.E.P. Box