<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: SVEM ADD IN FOR SUPPORT VECTOR MACHINE in Discussions</title>
    <link>https://community.jmp.com/t5/Discussions/SVEM-ADD-IN-FOR-SUPPORT-VECTOR-MACHINE/m-p/767310#M94750</link>
    <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.jmp.com/t5/user/viewprofilepage/user-id/283"&gt;@frankderuyck&lt;/a&gt;,&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;These are two different considerations:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;SVEM is a validation strategy that could be applied to any model, from linear regression to neural networks.&lt;/LI&gt;
&lt;LI&gt;Random Forest is a Machine Learning model.&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Both share a way to prevent overfitting when modeling:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;Random Forest uses bootstrap samples and the "&lt;A href="https://www.jmp.com/support/help/en/17.2/#page/jmp/per-tree-summaries.shtml" target="_self"&gt;Out-of-Bag&lt;/A&gt;" sample as "internal" model validation. The "Out-of-Bag" sample is the part of the data not used in training: training samples are bootstrap samples drawn with replacement from the original dataset. Theoretically, you can show that when sampling with replacement from your original data, around 1/3 of the data won't be sampled:&amp;nbsp;&lt;A href="https://stats.stackexchange.com/questions/173520/random-forests-out-of-bag-sample-size" target="_blank" rel="noopener"&gt;machine learning - Random Forests out-of-bag sample size - Cross Validated (stackexchange.com)&lt;/A&gt;&amp;nbsp;&lt;BR /&gt;This sample constitutes the "Out-of-Bag" sample not used in the training of the Random Forest, and enables you to assess/validate the relevance and precision of the model.&lt;/LI&gt;
&lt;LI&gt;SVEM uses anticorrelated weights for training and validation, meaning that an experiment with a high weight for training has a low weight for validation. By fitting a model with this validation setup, then changing the weights and refitting the model with new training and validation weights a large number of times, you obtain slightly different models that you can ensemble/combine in order to reduce variance (and prevent overfitting).&amp;nbsp;&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Now, my &lt;STRONG&gt;personal&lt;/STRONG&gt; opinion and choice:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;What I frequently see is people using SVEM for their DoE with highly complex models, without further consideration or prior analysis/modeling. For example, you often see SVEM applied with Neural Networks to relatively simple designs and results.&amp;nbsp;That sounds like "&lt;EM&gt;using a bazooka to kill a fly&lt;/EM&gt;" to me: I tend to choose the simplest options/models before increasing complexity (if necessary): &lt;EM&gt;Ockham's razor&lt;/EM&gt;. You also have to consider how many levels you have for your factors, and other considerations (like dimensionality), as Machine Learning models are interpolating models, very efficient at finding a pattern between points. So if you only have 2-3 levels for your factors, that might not be enough to really benefit from Machine Learning.&lt;/LI&gt;
&lt;LI&gt;If you have enough levels and traditional linear modeling doesn't seem appropriate because of non-linear relationships, then you might be interested in trying Machine Learning models.&lt;/LI&gt;
&lt;LI&gt;Since your DoE gives you a small but high-quality training dataset, you have to choose an ML model that is simple (no, or few and easy, hyperparameters to fine-tune), robust (less sensitive to hyperparameter settings), and less prone to overfitting, since you can't split your DoE data into training and validation sets. This is where Random Forests (with their internal "Out-of-Bag" validation) and SVM (where you can evaluate the risk of overfitting from the hyperparameter choices and the number of support vectors) are interesting.&amp;nbsp;&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;Also, using a robust ML algorithm instead of a complex one with a SVEM strategy has other benefits, such as lower computational time and easier interpretability.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Hope this answer helps and makes sense to you,&lt;/P&gt;</description>
    <pubDate>Thu, 20 Jun 2024 12:39:30 GMT</pubDate>
    <dc:creator>Victor_G</dc:creator>
    <dc:date>2024-06-20T12:39:30Z</dc:date>
    <item>
      <title>SVEM ADD IN FOR SUPPORT VECTOR MACHINE</title>
      <link>https://community.jmp.com/t5/Discussions/SVEM-ADD-IN-FOR-SUPPORT-VECTOR-MACHINE/m-p/762835#M94296</link>
      <description>&lt;P&gt;I have a DOE results I would like to model with Support Vector Machine (there are strong non linear effects that can't be estimated with a polynomial model) How do I install the SVEM add in?&lt;/P&gt;</description>
      <pubDate>Fri, 07 Jun 2024 07:29:34 GMT</pubDate>
      <guid>https://community.jmp.com/t5/Discussions/SVEM-ADD-IN-FOR-SUPPORT-VECTOR-MACHINE/m-p/762835#M94296</guid>
      <dc:creator>frankderuyck</dc:creator>
      <dc:date>2024-06-07T07:29:34Z</dc:date>
    </item>
    <item>
      <title>Re: SVEM ADD IN FOR SUPPORT VECTOR MACHINE</title>
      <link>https://community.jmp.com/t5/Discussions/SVEM-ADD-IN-FOR-SUPPORT-VECTOR-MACHINE/m-p/762840#M94298</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.jmp.com/t5/user/viewprofilepage/user-id/283"&gt;@frankderuyck&lt;/a&gt;,&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Some questions before answering yours:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;STRONG&gt;&lt;EM&gt;Why would you like to use SVEM for SVM?&lt;/EM&gt;&lt;/STRONG&gt; Support Vector Machines are a robust ML algorithm that can be used for small datasets like DoEs. More info about the algorithm's robustness on&amp;nbsp;&lt;A href="https://www.linkedin.com/posts/victorguiller_doe-machinelearning-statsjoke-activity-7117395916130537472-pJoB/" target="_blank" rel="noopener"&gt;LinkedIn.&lt;/A&gt;&lt;BR /&gt;There are also many ways to assess overfitting for SVM without using a validation set, by checking the number of support vectors, and the values of penalty (cost) and curvature (gamma) if you have fine-tuned your SVM hyperparameters with a &lt;A href="https://www.jmp.com/support/help/en/17.2/#page/jmp/model-launch-control-panel-2.shtml#" target="_self"&gt;Tuning design&lt;/A&gt;... I am thinking about writing a blog post about assessing and preventing overfitting for robust algorithms (SVM, Random Forest) in the absence of validation (in DoE datasets, for example).&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;&lt;EM&gt;Did you also consider other algorithms?&lt;/EM&gt;&lt;/STRONG&gt;&amp;nbsp;SVM is a robust algorithm for the analysis of small datasets, but Random Forest can also provide good results and may be a good benchmark to evaluate and compare performance. More info on Random Forests on&amp;nbsp;&lt;A href="https://www.linkedin.com/posts/victorguiller_doe-machinelearning-randomforests-activity-7127557799810625536-FH7K/?trk=public_profile_like_view" target="_blank" rel="noopener"&gt;LinkedIn.&lt;/A&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;&lt;EM&gt;How many levels do you have for your factors?&lt;/EM&gt;&lt;/STRONG&gt; ML algorithms can only be useful if you have enough levels for each of your factors to enable smooth interpolation. More than 3 levels may be a good starting point.&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Concerning the use of SVEM, you can access it in different ways:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;In the platform "Make Validation Column", you have the possibility to create an Autovalidation table:&amp;nbsp;&lt;A href="https://www.jmp.com/support/help/en/17.2/#page/jmp/launch-the-make-validation-column-platform.shtml" target="_blank" rel="noopener"&gt;Launch the Make Validation Column Platform (jmp.com)&lt;/A&gt;&amp;nbsp;This will duplicate the rows and create an anticorrelated weights column (between training and validation) that you can use as "Frequency" in the platform launch. You can then simulate prediction results by varying the weights between training and validation (right-click a panel with validation results, click Simulate, and specify the weights column as the column to switch in and out).&lt;/LI&gt;
&lt;LI&gt;You can also download the SVEM add-in, which does the same thing as the autovalidation table:&amp;nbsp;&lt;A href="https://community.jmp.com/t5/Discovery-Summit-Europe-2021/Re-Thinking-the-Design-and-Analysis-of-Experiments-2021-EU-30MP/ta-p/349240#U349240" target="_blank" rel="noopener"&gt;Re-Thinking the Design and Analysis of Experiments? (2021-EU-30MP-776) - JMP User Community&lt;/A&gt;&lt;BR /&gt;You can then reproduce the same steps as before and simulate results to assess the model's robustness.&lt;/LI&gt;
&lt;/UL&gt;
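For intuition, the anticorrelated weighting behind the autovalidation table can be sketched in a few lines of plain Python. This is an illustrative sketch, not JMP's implementation; the -ln(U) / -ln(1 - U) pairing is the fractionally weighted bootstrap commonly described in the SVEM literature, and the function name is my own:

```python
import math
import random

def svem_weights(n_rows, n_refits, seed=1):
    """Generate one pair of anticorrelated (training, validation) weights per row,
    for each refit: w_train = -ln(U), w_valid = -ln(1 - U), with U uniform in (0, 1).
    A row weighted heavily for training is thus weighted lightly for validation."""
    rng = random.Random(seed)
    refits = []
    for _ in range(n_refits):
        pairs = []
        for _ in range(n_rows):
            u = rng.random()
            if u == 0.0:  # guard: random() can return exactly 0, and ln(0) is undefined
                u = 0.5
            pairs.append((-math.log(u), -math.log(1.0 - u)))
        refits.append(pairs)
    return refits

# 12-run design, 200 refits: each refit fits one model with the training weights
# (used as "Frequency"), scores it with the validation weights, and the final
# SVEM prediction averages/ensembles the refitted models.
weights = svem_weights(n_rows=12, n_refits=200)
print(len(weights), len(weights[0]))  # prints "200 12"
```

Each weight has mean 1 (Exponential(1)), so on average every run still contributes fully to training; only the emphasis shifts between refits.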
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;I would however like to emphasize that cross-validation (K-folds or SVEM) is not strictly a validation method, but more of a "debugging" method to assess the robustness of an algorithm's performance across various training and validation sets. The main difficulty after using a cross-validation technique is making sense of the results and deciding how to proceed next:&amp;nbsp;&lt;BR /&gt;Average the individual models trained on the various folds? Create an ensemble with these models? Keep the best model trained on a single fold (I wouldn't recommend this option)? Train the model on the whole dataset once robustness has been checked with cross-validation? ... There is no definitive answer to this question; it depends on the dataset, the modeling complexity, the performance improvement, the user's practices, ...&lt;BR /&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;More info in these episodes from Cassie Kozyrkov on the topic of cross-validation:&amp;nbsp;&lt;A href="https://youtu.be/zqD0lQy_w40?si=41jfBs83Q0TIlusr" target="_blank" rel="noopener"&gt;https://youtu.be/zqD0lQy_w40?si=41jfBs83Q0TIlusr&lt;/A&gt;&lt;/P&gt;
&lt;P&gt;&lt;A href="https://youtu.be/by59hsYO0Lg?si=XeZ33j2e4jKQPaRA" target="_blank" rel="noopener"&gt;https://youtu.be/by59hsYO0Lg?si=XeZ33j2e4jKQPaRA&lt;/A&gt;&amp;nbsp;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Hope this answer will help you,&lt;/P&gt;</description>
      <pubDate>Fri, 07 Jun 2024 08:31:09 GMT</pubDate>
      <guid>https://community.jmp.com/t5/Discussions/SVEM-ADD-IN-FOR-SUPPORT-VECTOR-MACHINE/m-p/762840#M94298</guid>
      <dc:creator>Victor_G</dc:creator>
      <dc:date>2024-06-07T08:31:09Z</dc:date>
    </item>
    <item>
      <title>Re: SVEM ADD IN FOR SUPPORT VECTOR MACHINE</title>
      <link>https://community.jmp.com/t5/Discussions/SVEM-ADD-IN-FOR-SUPPORT-VECTOR-MACHINE/m-p/762853#M94306</link>
      <description>&lt;P&gt;Hi Victor, this tuning design works great,&amp;nbsp; I want to model a guality response Y as a function of three Y FPCA's from NIR spectra (FDA). I tried neural network but even with 3 TanH nodes profiler shows to my opinion too high and irregular ncurvature. SVE after tuning does a nice job,&amp;nbsp;I support vector machine learning!&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="frankderuyck_0-1717763077556.png" style="width: 308px;"&gt;&lt;img src="https://community.jmp.com/t5/image/serverpage/image-id/64985i468C50E6311E72D7/image-dimensions/308x229?v=v2" width="308" height="229" role="button" title="frankderuyck_0-1717763077556.png" alt="frankderuyck_0-1717763077556.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="frankderuyck_1-1717763142689.png" style="width: 400px;"&gt;&lt;img src="https://community.jmp.com/t5/image/serverpage/image-id/64986i35A7155ECCB63BC7/image-size/medium?v=v2&amp;amp;px=400" role="button" title="frankderuyck_1-1717763142689.png" alt="frankderuyck_1-1717763142689.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Fri, 07 Jun 2024 12:26:18 GMT</pubDate>
      <guid>https://community.jmp.com/t5/Discussions/SVEM-ADD-IN-FOR-SUPPORT-VECTOR-MACHINE/m-p/762853#M94306</guid>
      <dc:creator>frankderuyck</dc:creator>
      <dc:date>2024-06-07T12:26:18Z</dc:date>
    </item>
    <item>
      <title>Re: SVEM ADD IN FOR SUPPORT VECTOR MACHINE</title>
      <link>https://community.jmp.com/t5/Discussions/SVEM-ADD-IN-FOR-SUPPORT-VECTOR-MACHINE/m-p/762856#M94308</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.jmp.com/t5/user/viewprofilepage/user-id/283"&gt;@frankderuyck&lt;/a&gt;,&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Great to know! :)&lt;BR /&gt;I use Neural Networks as the last possible modeling option; there are many simpler classical models and ML algorithms that perform equally well (and sometimes better). These other options are also faster to compute, don't require extensive knowledge, tests, and validation to fine-tune (unlike NNs, where you have to specify the architecture: the number of layers, the types of activation functions, boosting or not, ... or use the &lt;A href="https://community.jmp.com/t5/JMP-Add-Ins/Neural-Network-Tuning/ta-p/662666" target="_blank"&gt;Neural Network Tuning AddIn&lt;/A&gt;), and are easier to interpret (in terms of response profiles and relative factor importance). &lt;BR /&gt;On tabular data, if the only model able to perform well is a Neural Network, there is a good chance that you have other problems, in terms of data quality, representativeness, missing values/outliers, repeatability/reproducibility, etc.&lt;/P&gt;
&lt;P&gt;Tree-based models provide a good performance benchmark and are often a safe and reliable option; see the article "&lt;STRONG&gt;Why do tree-based models still outperform deep learning on tabular data?&lt;/STRONG&gt;" (2022): &lt;A href="https://arxiv.org/abs/2207.08815" target="_blank"&gt;https://arxiv.org/abs/2207.08815&lt;/A&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;In your screenshot, the optimal SVM model found with the Tuning design seems quite interesting:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;It uses 10 support vectors, so fewer than half of the points are used to build the "boundaries"/vectors of the SVM model, which may be a good sign of low complexity and good generalization properties,&lt;/LI&gt;
&lt;LI&gt;The curvature value (gamma) is quite low, so the response profile and boundaries may not be overly complex or highly curved (again, low complexity and good generalization properties),&lt;/LI&gt;
&lt;LI&gt;The penalty (cost) value seems quite high, but with a small dataset this is often the case in practice, as any misclassified/mispredicted point greatly increases the misclassification rate/RMSE: 1 misclassified point out of 25 = 4% misclassification rate, whereas with a larger dataset one misclassified point would represent less than 1%. In some situations it can be interesting to choose a model with a slightly larger error (RMSE/misclassification rate) but far fewer support vectors: you degrade predictive performance on the training set in favor of a less complex model that may generalize better to new data.&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;You can use the table generated by the Tuning design, with its performance metrics, to better assess the relative importance and contribution of the two hyperparameters to those metrics (misclassification rate for classification, or RASE and R² for regression).&amp;nbsp;&lt;/P&gt;
&lt;P&gt;As always, it is recommended to validate your model with new validation points, to make sure your model doesn't overfit.&lt;/P&gt;</description>
      <pubDate>Fri, 07 Jun 2024 13:07:27 GMT</pubDate>
      <guid>https://community.jmp.com/t5/Discussions/SVEM-ADD-IN-FOR-SUPPORT-VECTOR-MACHINE/m-p/762856#M94308</guid>
      <dc:creator>Victor_G</dc:creator>
      <dc:date>2024-06-07T13:07:27Z</dc:date>
    </item>
    <item>
      <title>Re: SVEM ADD IN FOR SUPPORT VECTOR MACHINE</title>
      <link>https://community.jmp.com/t5/Discussions/SVEM-ADD-IN-FOR-SUPPORT-VECTOR-MACHINE/m-p/762867#M94312</link>
      <description>&lt;P&gt;Guess tree based models like random forest &amp;amp; XGboost require a large data set unless you can use SVEM for small sets, correct?&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Fri, 07 Jun 2024 14:21:34 GMT</pubDate>
      <guid>https://community.jmp.com/t5/Discussions/SVEM-ADD-IN-FOR-SUPPORT-VECTOR-MACHINE/m-p/762867#M94312</guid>
      <dc:creator>frankderuyck</dc:creator>
      <dc:date>2024-06-07T14:21:34Z</dc:date>
    </item>
    <item>
      <title>Re: SVEM ADD IN FOR SUPPORT VECTOR MACHINE</title>
      <link>https://community.jmp.com/t5/Discussions/SVEM-ADD-IN-FOR-SUPPORT-VECTOR-MACHINE/m-p/762873#M94314</link>
      <description>&lt;P&gt;I don't fully agree, there are more nuances between tree-based models:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;EM&gt;Boosted Tree methods&lt;/EM&gt;, like XGBoost and others, use trees trained sequentially to progressively refine and correct predictions. They are more influenced by the training set characteristics (size, data quality, distribution of data points, representativeness, ...), as they iteratively try to make the best predictions on the dataset they are trained on (each tree "correcting" the largest prediction errors of the previous tree).&lt;BR /&gt;It's typically described as a low-bias, high-variance algorithm: predictions can be very precise, but are very sensitive to the training dataset used.&lt;BR /&gt;&lt;BR /&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;EM&gt;Random Forests&lt;/EM&gt; use trees trained in parallel on slightly different datasets (bootstrap sets). The good predictive performance of this algorithm comes from averaging the individual trees' predictions. You can think of it as "&lt;EM&gt;&lt;STRONG&gt;The Wisdom of the Crowd&lt;/STRONG&gt;&lt;/EM&gt;": if you ask one person the weight of a whale, the answer might be very inaccurate, but if you collect the answers from hundreds or thousands of people and average the results, the mean may be very close to the actual value.&amp;nbsp;&lt;BR /&gt;It's typically described as a medium/low-bias, low-variance algorithm: predictions may be precise (depending on the averaging of the individual results and the complexity/number of individual trees), and they are quite robust to the training dataset thanks to the bootstrapping process and the individual training of the trees (on different features).&lt;/LI&gt;
&lt;/UL&gt;
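The "Wisdom of the Crowd" effect in the Random Forest bullet is essentially variance reduction by averaging, and a toy simulation makes it concrete (plain Python, with a hypothetical "true" whale weight; real forest trees are correlated, so the reduction is smaller than for independent guesses):

```python
import random
import statistics

def crowd_estimate(true_value, n_guessers, spread, seed=1):
    """Average many noisy individual guesses ("trees") of a single true value."""
    rng = random.Random(seed)
    guesses = [rng.gauss(true_value, spread) for _ in range(n_guessers)]
    return statistics.mean(guesses), statistics.stdev(guesses)

# Hypothetical numbers: the "true" whale weight and the spread of individual guesses
whale_tonnes = 140.0
mean_guess, guess_sd = crowd_estimate(whale_tonnes, n_guessers=5000, spread=40.0)
print(round(mean_guess, 1))  # the averaged crowd lands close to 140
print(round(guess_sd, 1))    # even though a single guess is typically off by about 40
```

For independent guesses the standard error of the mean shrinks like spread divided by the square root of the crowd size, which is why the average is so much more reliable than any one guesser.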
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Here is a short visual comparing the different tree-based models (I don't like that the individual trees in the Random Forest have the same structure, but it's a simplified schema):&lt;/P&gt;
&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="DecisionTree-BoostedTree-RandomForest.png" style="width: 400px;"&gt;&lt;img src="https://community.jmp.com/t5/image/serverpage/image-id/64989iA0713E8567DDBA29/image-size/medium?v=v2&amp;amp;px=400" role="button" title="DecisionTree-BoostedTree-RandomForest.png" alt="DecisionTree-BoostedTree-RandomForest.png" /&gt;&lt;/span&gt;&lt;/P&gt;
&lt;P&gt;To answer your question, I would try Random Forests over Boosted Tree methods on a small dataset like a DoE, to have more confidence in the results, use an algorithm less prone to overfitting, and get better generalization properties.&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Fri, 07 Jun 2024 15:25:11 GMT</pubDate>
      <guid>https://community.jmp.com/t5/Discussions/SVEM-ADD-IN-FOR-SUPPORT-VECTOR-MACHINE/m-p/762873#M94314</guid>
      <dc:creator>Victor_G</dc:creator>
      <dc:date>2024-06-07T15:25:11Z</dc:date>
    </item>
    <item>
      <title>Re: SVEM ADD IN FOR SUPPORT VECTOR MACHINE</title>
      <link>https://community.jmp.com/t5/Discussions/SVEM-ADD-IN-FOR-SUPPORT-VECTOR-MACHINE/m-p/762879#M94317</link>
      <description>&lt;P&gt;By Random forest I get poor R² = 0,63 even with 2000 trees&lt;/P&gt;</description>
      <pubDate>Fri, 07 Jun 2024 15:40:47 GMT</pubDate>
      <guid>https://community.jmp.com/t5/Discussions/SVEM-ADD-IN-FOR-SUPPORT-VECTOR-MACHINE/m-p/762879#M94317</guid>
      <dc:creator>frankderuyck</dc:creator>
      <dc:date>2024-06-07T15:40:47Z</dc:date>
    </item>
    <item>
      <title>Re: SVEM ADD IN FOR SUPPORT VECTOR MACHINE</title>
      <link>https://community.jmp.com/t5/Discussions/SVEM-ADD-IN-FOR-SUPPORT-VECTOR-MACHINE/m-p/762880#M94318</link>
      <description>&lt;P&gt;With boosted tree R² = 0,85&lt;/P&gt;</description>
      <pubDate>Fri, 07 Jun 2024 15:43:05 GMT</pubDate>
      <guid>https://community.jmp.com/t5/Discussions/SVEM-ADD-IN-FOR-SUPPORT-VECTOR-MACHINE/m-p/762880#M94318</guid>
      <dc:creator>frankderuyck</dc:creator>
      <dc:date>2024-06-07T15:43:05Z</dc:date>
    </item>
    <item>
      <title>Re: SVEM ADD IN FOR SUPPORT VECTOR MACHINE</title>
      <link>https://community.jmp.com/t5/Discussions/SVEM-ADD-IN-FOR-SUPPORT-VECTOR-MACHINE/m-p/762881#M94319</link>
      <description>&lt;P&gt;Yes, boosted tree may be more precise than Random Forests "by design", due to their iterative prediction error correction with sequential trees training.&amp;nbsp;But the complex splits done during the sequential trees learning to improve the performance may only be valid on the training set, and may not reflect what could happen on new data points.&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Random Forest is a robust algorithm with regard to its hyperparameters, so increasing the number of trees above a certain value may not change the performance evaluation; see my post on&amp;nbsp;&lt;A href="https://www.linkedin.com/posts/victorguiller_design-of-experiments-machine-learning-activity-7122469274148868096-Qhif/" target="_blank"&gt;LinkedIn&lt;/A&gt;&amp;nbsp;and/or the article "&lt;EM&gt;Tunability: Importance of Hyperparameters of Machine Learning Algorithms&lt;/EM&gt;":&amp;nbsp;&lt;A href="https://arxiv.org/pdf/1802.09596" target="_blank"&gt;1802.09596 (arxiv.org)&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Fri, 07 Jun 2024 15:52:56 GMT</pubDate>
      <guid>https://community.jmp.com/t5/Discussions/SVEM-ADD-IN-FOR-SUPPORT-VECTOR-MACHINE/m-p/762881#M94319</guid>
      <dc:creator>Victor_G</dc:creator>
      <dc:date>2024-06-07T15:52:56Z</dc:date>
    </item>
    <item>
      <title>Re: SVEM ADD IN FOR SUPPORT VECTOR MACHINE</title>
      <link>https://community.jmp.com/t5/Discussions/SVEM-ADD-IN-FOR-SUPPORT-VECTOR-MACHINE/m-p/767297#M94748</link>
      <description>&lt;P&gt;Hi Victor, with some delay... why would you choose Random Forest over SVEM for modeling DOE data?&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Thu, 20 Jun 2024 11:16:59 GMT</pubDate>
      <guid>https://community.jmp.com/t5/Discussions/SVEM-ADD-IN-FOR-SUPPORT-VECTOR-MACHINE/m-p/767297#M94748</guid>
      <dc:creator>frankderuyck</dc:creator>
      <dc:date>2024-06-20T11:16:59Z</dc:date>
    </item>
    <item>
      <title>Re: SVEM ADD IN FOR SUPPORT VECTOR MACHINE</title>
      <link>https://community.jmp.com/t5/Discussions/SVEM-ADD-IN-FOR-SUPPORT-VECTOR-MACHINE/m-p/767310#M94750</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.jmp.com/t5/user/viewprofilepage/user-id/283"&gt;@frankderuyck&lt;/a&gt;,&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;These are two different considerations:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;SVEM is a validation strategy that could be applied to any model, from linear regression to neural networks.&lt;/LI&gt;
&lt;LI&gt;Random Forest is a Machine Learning model.&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Both share a way to prevent overfitting when modeling:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;Random Forest uses bootstrap samples and the "&lt;A href="https://www.jmp.com/support/help/en/17.2/#page/jmp/per-tree-summaries.shtml" target="_self"&gt;Out-of-Bag&lt;/A&gt;" sample as "internal" model validation. The "Out-of-Bag" sample is the part of the data not used in training: training samples are bootstrap samples drawn with replacement from the original dataset. Theoretically, you can show that when sampling with replacement from your original data, around 1/3 of the data won't be sampled:&amp;nbsp;&lt;A href="https://stats.stackexchange.com/questions/173520/random-forests-out-of-bag-sample-size" target="_blank" rel="noopener"&gt;machine learning - Random Forests out-of-bag sample size - Cross Validated (stackexchange.com)&lt;/A&gt;&amp;nbsp;&lt;BR /&gt;This sample constitutes the "Out-of-Bag" sample not used in the training of the Random Forest, and enables you to assess/validate the relevance and precision of the model.&lt;/LI&gt;
&lt;LI&gt;SVEM uses anticorrelated weights for training and validation, meaning that an experiment with a high weight for training has a low weight for validation. By fitting a model with this validation setup, then changing the weights and refitting the model with new training and validation weights a large number of times, you obtain slightly different models that you can ensemble/combine in order to reduce variance (and prevent overfitting).&amp;nbsp;&lt;/LI&gt;
&lt;/UL&gt;
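The "around 1/3" figure above is easy to check with a quick simulation (a minimal plain-Python sketch, not JMP/JSL; the function name is my own): the chance that a given row is never drawn in a bootstrap sample of size n is (1 - 1/n)^n, which tends to 1/e, roughly 0.37, as n grows.

```python
import math
import random

def oob_fraction(n, trials=2000, seed=1):
    """Estimate the fraction of rows left out of a bootstrap sample of size n."""
    rng = random.Random(seed)
    left_out = 0
    for _ in range(trials):
        # bootstrap sample: draw n row indices with replacement
        drawn = {rng.randrange(n) for _ in range(n)}
        # rows never drawn are the out-of-bag rows for this sample
        left_out += n - len(drawn)
    return left_out / (n * trials)

# Theory: P(a given row is never drawn) = (1 - 1/n)^n, tending to 1/e as n grows
print(oob_fraction(25))   # a DoE-sized table: roughly 0.36
print(math.exp(-1))       # limit value, roughly 0.368
```

Even at DoE sizes the out-of-bag set is a substantial slice of the data, which is what makes it usable as an internal validation sample.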
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Now, my &lt;STRONG&gt;personal&lt;/STRONG&gt; opinion and choice:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;What I frequently see is people using SVEM for their DoE with highly complex models, without further consideration or prior analysis/modeling. For example, you often see SVEM applied with Neural Networks to relatively simple designs and results.&amp;nbsp;That sounds like "&lt;EM&gt;using a bazooka to kill a fly&lt;/EM&gt;" to me: I tend to choose the simplest options/models before increasing complexity (if necessary): &lt;EM&gt;Ockham's razor&lt;/EM&gt;. You also have to consider how many levels you have for your factors, and other considerations (like dimensionality), as Machine Learning models are interpolating models, very efficient at finding a pattern between points. So if you only have 2-3 levels for your factors, that might not be enough to really benefit from Machine Learning.&lt;/LI&gt;
&lt;LI&gt;If you have enough levels and traditional linear modeling doesn't seem appropriate because of non-linear relationships, then you might be interested in trying Machine Learning models.&lt;/LI&gt;
&lt;LI&gt;Since your DoE gives you a small but high-quality training dataset, you have to choose an ML model that is simple (no, or few and easy, hyperparameters to fine-tune), robust (less sensitive to hyperparameter settings), and less prone to overfitting, since you can't split your DoE data into training and validation sets. This is where Random Forests (with their internal "Out-of-Bag" validation) and SVM (where you can evaluate the risk of overfitting from the hyperparameter choices and the number of support vectors) are interesting.&amp;nbsp;&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;Also, using a robust ML algorithm instead of a complex one with a SVEM strategy has other benefits, such as lower computational time and easier interpretability.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Hope this answer helps and makes sense to you,&lt;/P&gt;</description>
      <pubDate>Thu, 20 Jun 2024 12:39:30 GMT</pubDate>
      <guid>https://community.jmp.com/t5/Discussions/SVEM-ADD-IN-FOR-SUPPORT-VECTOR-MACHINE/m-p/767310#M94750</guid>
      <dc:creator>Victor_G</dc:creator>
      <dc:date>2024-06-20T12:39:30Z</dc:date>
    </item>
    <item>
      <title>Re: SVEM ADD IN FOR SUPPORT VECTOR MACHINE</title>
      <link>https://community.jmp.com/t5/Discussions/SVEM-ADD-IN-FOR-SUPPORT-VECTOR-MACHINE/m-p/919861#M107885</link>
      <description>&lt;P&gt;Hi Victor, I tried to use this for mixture results, got a nice model but optimization with the profiler did not work because the component su did not add up to 1, was almost 3.. The same&amp;nbsp; happens in other ML models, PLS and even Gen Reg&lt;/P&gt;</description>
      <pubDate>Tue, 16 Dec 2025 14:50:12 GMT</pubDate>
      <guid>https://community.jmp.com/t5/Discussions/SVEM-ADD-IN-FOR-SUPPORT-VECTOR-MACHINE/m-p/919861#M107885</guid>
      <dc:creator>frankderuyck</dc:creator>
      <dc:date>2025-12-16T14:50:12Z</dc:date>
    </item>
    <item>
      <title>Re: SVEM ADD IN FOR SUPPORT VECTOR MACHINE</title>
      <link>https://community.jmp.com/t5/Discussions/SVEM-ADD-IN-FOR-SUPPORT-VECTOR-MACHINE/m-p/919885#M107889</link>
      <description>&lt;P&gt;Hi Franck,&lt;/P&gt;
&lt;P&gt;Have you checked that your mixture factors have the right column properties:&amp;nbsp;&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;"Mixture" with low and high individual factors values&lt;/LI&gt;
&lt;LI&gt;"Design Role" with role set as Mixture&lt;/LI&gt;
&lt;LI&gt;"Factor Changes" set at Easy (or other type if you have a split-plot/restricted randomization scenario)&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;Here is an example on the Donev Mixture Data with factor CuSO4:&lt;/P&gt;
&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="Victor_G_0-1765901003253.png" style="width: 400px;"&gt;&lt;img src="https://community.jmp.com/t5/image/serverpage/image-id/89045i3BD5859A1354D312/image-size/medium?v=v2&amp;amp;px=400" role="button" title="Victor_G_0-1765901003253.png" alt="Victor_G_0-1765901003253.png" /&gt;&lt;/span&gt;&lt;/P&gt;
&lt;P&gt;When launching the Profiler from Gaussian Process (or any other ML model), the mixture constraint is respected and enforced (mixture factors always sum up to 1):&lt;/P&gt;
&lt;P&gt;&lt;a class="video-embed-link" href="https://community.jmp.com/t5/video/gallerypage/video-id/6386561719112"&gt;(view in My Videos)&lt;/a&gt;&lt;/P&gt;
&lt;P&gt;Hope this solves your problem,&lt;/P&gt;</description>
      <pubDate>Tue, 16 Dec 2025 16:06:16 GMT</pubDate>
      <guid>https://community.jmp.com/t5/Discussions/SVEM-ADD-IN-FOR-SUPPORT-VECTOR-MACHINE/m-p/919885#M107889</guid>
      <dc:creator>Victor_G</dc:creator>
      <dc:date>2025-12-16T16:06:16Z</dc:date>
    </item>
    <item>
      <title>Re: SVEM ADD IN FOR SUPPORT VECTOR MACHINE</title>
      <link>https://community.jmp.com/t5/Discussions/SVEM-ADD-IN-FOR-SUPPORT-VECTOR-MACHINE/m-p/920276#M107928</link>
      <description>&lt;P&gt;OK Victor, thanks!&lt;/P&gt;</description>
      <pubDate>Thu, 18 Dec 2025 06:37:38 GMT</pubDate>
      <guid>https://community.jmp.com/t5/Discussions/SVEM-ADD-IN-FOR-SUPPORT-VECTOR-MACHINE/m-p/920276#M107928</guid>
      <dc:creator>frankderuyck</dc:creator>
      <dc:date>2025-12-18T06:37:38Z</dc:date>
    </item>
  </channel>
</rss>

