Hi @frankderuyck,
Great to know!
I use Neural Networks only as a last modeling option; there are many simpler classical statistical models and ML algorithms that perform equally well (and sometimes better). These other options are also faster to compute, are easier to interpret (in terms of response profiles and relative factor importance), and don't require as much knowledge, testing, and validation to fine-tune them (unlike NNs, where you have to specify the architecture: the number of layers, the type of activation functions, boosting or not, ... or use the Neural Network Tuning AddIn).
On tabular data, if the only model able to perform well is a Neural Network, chances are you have other problems, in terms of data quality, representativeness, missing values/outliers, repeatability/reproducibility, etc.
Tree-based models provide a good performance benchmark and are often a safe and reliable option; see the article "Why do tree-based models still outperform deep learning on tabular data?" (2022): https://arxiv.org/abs/2207.08815
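If you want to reproduce this kind of benchmark outside of JMP, here is a minimal sketch in Python/scikit-learn. The breast cancer dataset is only a stand-in for your own tabular features and response, and the hyperparameter values are illustrative, not a recommendation:

```python
# Minimal sketch (Python/scikit-learn, outside of JMP): fit a tree-based
# baseline to benchmark against before reaching for a Neural Network.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)  # stand-in for your own data

baseline = RandomForestClassifier(n_estimators=200, random_state=0)
scores = cross_val_score(baseline, X, y, cv=5)  # 5-fold cross-validation
print(f"Tree-based baseline accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```

Any model that can't clearly beat this kind of baseline is usually not worth its extra complexity.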
In your screenshot, it seems the optimal SVM model found with the Tuning design is quite interesting:
- It uses 10 support vectors, so fewer than half of the points are used to build the "boundaries"/vectors of the SVM model, which may be a good sign of low complexity and good generalization properties,
- The curvature value (gamma) is quite low, so the response profile and boundaries may not be overly complex or highly curved (so again, low complexity and good generalization properties),
- The penalty (cost value) seems quite high, but with a small dataset this is often the case in practice, as any misclassified/mispredicted point greatly increases the misclassification rate/RMSE: 1 misclassified point out of 25 is a 4% misclassification rate, whereas on a larger dataset one misclassified point would represent less than 1%. In some situations, it can be interesting to choose a model with a slightly larger error (RMSE/misclassification rate) but far fewer support vectors: you degrade the predictive performance on the training set in favor of a less complex model that may generalize better on new data (see the sketch after this list).
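To make that error-versus-complexity trade-off concrete, here is a minimal sketch that reports both the cross-validated accuracy and the number of support vectors for a few cost/curvature settings. Again, the dataset and the grid values (C of 1 and 100, gamma of 0.001 and 0.1) are assumptions for illustration only:

```python
# Minimal sketch (Python/scikit-learn): inspect how the cost (C) and
# curvature (gamma) settings trade predictive error against model
# complexity, measured by the number of support vectors.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)  # stand-in for your own data

for C in (1.0, 100.0):
    for gamma in (0.001, 0.1):
        model = make_pipeline(StandardScaler(), SVC(C=C, gamma=gamma))
        acc = cross_val_score(model, X, y, cv=5).mean()
        n_sv = model.fit(X, y).named_steps["svc"].support_vectors_.shape[0]
        print(f"C={C:>6}, gamma={gamma:<6} -> accuracy={acc:.3f}, "
              f"support vectors={n_sv}")
```

A model with slightly lower accuracy but far fewer support vectors can be the better pick for new data.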
You can use the table generated by the Tuning design, with its performance metrics, to better assess the relative importance and contribution of the two hyperparameters to the performance metrics (misclassification rate for classification, or RASE and R² for regression).
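Outside of JMP, a grid search produces an equivalent tuning table; a minimal sketch of building and pivoting it (with an illustrative grid, on the same stand-in dataset as above):

```python
# Minimal sketch (Python/scikit-learn + pandas): build a tuning table like
# the one the Tuning design generates, then pivot it to read off how each
# hyperparameter drives the performance metric.
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)  # stand-in for your own data

grid = GridSearchCV(
    SVC(),
    param_grid={"C": [0.1, 1, 10, 100], "gamma": [1e-4, 1e-3, 1e-2, 1e-1]},
    cv=5,
)
grid.fit(X, y)

table = pd.DataFrame(grid.cv_results_)  # one row per (C, gamma) combination
# Pivot so rows vary C and columns vary gamma: large swings along a row or
# column reveal which hyperparameter contributes most to the metric.
print(table.pivot(index="param_C", columns="param_gamma",
                  values="mean_test_score").round(3))
```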
As always, it is recommended to validate your model with new validation points, to make sure your model doesn't overfit.
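As a quick overfitting check in the same sketched setup, you can score the tuned model on points it has never seen. Note that in a DOE context the ideal validation points are new experimental runs, not just a random split; the split below is only an illustration, with assumed hyperparameter values:

```python
# Minimal sketch (Python/scikit-learn): compare training and validation
# accuracy; a large gap between the two is a sign of overfitting.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)  # stand-in for your own data
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=0)

model = SVC(C=100, gamma=0.001).fit(X_train, y_train)
print(f"Training accuracy:   {model.score(X_train, y_train):.3f}")
print(f"Validation accuracy: {model.score(X_valid, y_valid):.3f}")
```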
Victor GUILLER
L'Oréal Data & Analytics
"It is not unusual for a well-designed experiment to analyze itself" (Box, Hunter and Hunter)