
Balancing data

Lu — Wed, 27 Aug 2025 23:00:19 GMT

Hello,

I want to balance my data before analysing it, since my response has a rate of only 27%. I first created a stratified validation column (based on my target column) to define Training, Validation and Test sets. My question is: should I balance the data in both the training and validation sets but not in the test set, or only in the training set? I would like to use the SMOTE technique, and then the cross-validation option of the Model Screening platform on the concatenated tables (training + validation + test), with a Validation column for test set performance analysis. Is this the right way to balance my data before cross-validation? Other suggestions are always welcome.

Regards,

Re: Balancing data

Victor_G — Thu, 28 Aug 2025 08:30:44 GMT

Hi @Lu,

I have mixed feelings about SMOTE and other balancing techniques for classification:

SMOTE completely "destroys" the calibration of your classifier (as it changes the proportion of the two classes during the learning phase), which means that the probabilities calculated by your classifier no longer correspond to the actual measured probabilities in your dataset. This can be particularly sensitive for certain applications (healthcare, finance/banking, ...), as sometimes you may need the calculated probabilities instead of the final majority class.
Moreover, in the original paper, and as confirmed in later studies, SMOTE only provides small benefits when the classifier is a weak learner, like Decision Tree and Naïve Bayes. When used with state-of-the-art/"strong" models (like Random Forest, XGBoost, ...), SMOTE doesn't provide benefits, and may even be detrimental to the learning. Another study: The Harms of Class Imbalance Corrections for Machine Learning Based Prediction Models: A Simulation Study - Carriero - 2025 - Statistics in Medicine - Wiley Online Library
In most studies, the conclusion regarding these techniques is the same: "proper, best prediction is achieved by using a strong classifier and balancing the data is not required nor beneficial" or "Correcting for class imbalance is not always necessary and may even be harmful to clinical prediction models which aim to produce reliable risk estimates on an individual basis.".
SMOTE may be beneficial in specific contexts, like using weak learners or when you can't adjust decision thresholds, but its relative benefit may also change depending on the performance metrics chosen. Outside of these contexts, I would not recommend using it.
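
To make the calibration point concrete, here is a minimal Python sketch (outside of JMP) of the standard prior-shift correction. It assumes that balancing changed only the class proportions and not the shape of each class, which is exact for random over/undersampling and only approximate for SMOTE. The function name and the 50/50 balanced rate are illustrative assumptions.

```python
def correct_for_resampling(p_resampled, original_rate, resampled_rate):
    """Map a probability from a model trained on rebalanced data
    back to the class rate of the original data.

    Assumption: rebalancing changed only the class proportions.
    """
    # Reweight each class by how much its share was inflated or deflated.
    pos = p_resampled * original_rate / resampled_rate
    neg = (1.0 - p_resampled) * (1.0 - original_rate) / (1.0 - resampled_rate)
    return pos / (pos + neg)


original_rate = 0.27   # minority share in the real data (from the question)
resampled_rate = 0.50  # assumed share after balancing to 50/50

for p in (0.30, 0.50, 0.80):
    corrected = correct_for_resampling(p, original_rate, resampled_rate)
    print(f"balanced-model p = {p:.2f} -> corrected p = {corrected:.3f}")
```

Under these assumptions, a score of 0.50 from the balanced model corresponds to roughly 0.27 on the original data, so the raw scores overstate the risk of the minority class.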

27% for a minority class is not a huge imbalance, and classical methods like adjusting the Decision Thresholds or specifying a Profit Matrix should help you adapt the performance of your classifier to the situation. Adding new features/variables, pre-processing the data (feature engineering) to help the model, adding new data or selecting the most representative and diverse observations, and trying several models for comparison and benchmarking can also help improve performance.
It's important to understand that imbalanced data is a problem only because the minority class in the training set is not representative enough and/or the features are not good enough indicators for the label. The ideal scenario is to solve these two problems, then the model can perform perfectly well despite the imbalance.
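
As an illustration of the threshold and profit-matrix adjustment mentioned above, here is a small Python sketch (not JMP code) that derives the probability cutoff from a profit matrix. The function name and the profit values are hypothetical; it assumes the predicted probabilities are reasonably calibrated.

```python
def profit_threshold(profit_tp, profit_fp, profit_fn, profit_tn):
    """Probability cutoff above which predicting the positive class
    has the higher expected profit."""
    # Predict positive when p*TP + (1-p)*FP >= p*FN + (1-p)*TN, solved for p.
    gain_when_positive = profit_tp - profit_fn  # benefit of catching a real positive
    gain_when_negative = profit_tn - profit_fp  # benefit of not flagging a real negative
    return gain_when_negative / (gain_when_negative + gain_when_positive)


# Hypothetical profits: a caught positive is worth 3, a false alarm costs 1.
threshold = profit_threshold(profit_tp=3.0, profit_fp=-1.0, profit_fn=0.0, profit_tn=0.0)

probabilities = [0.10, 0.22, 0.30, 0.55, 0.81]  # made-up model outputs
predictions = [int(p >= threshold) for p in probabilities]
print(threshold, predictions)
```

With these made-up profits the cutoff drops from the default 0.5 to 0.25, so more rows are flagged as positive without touching the training data.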

This being said, adding a SMOTE (or any pre-processing) operation to the workflow (if needed) should be done with care, to avoid data leakage and overly optimistic performance estimates. SMOTE or any other balancing technique should be applied in the training and validation sets (to train, compare and select the best performing model) but obviously not in the test set, as the model needs to perform and be evaluated in "real-life" conditions (with the original imbalance).
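
A minimal Python sketch of the "split first, balance afterwards" order described above. Plain random oversampling stands in for SMOTE here to keep the example dependency-free, and only the training rows are balanced; the same call could be applied to the validation rows. All function names and the toy data are illustrative assumptions.

```python
import random


def stratified_split(labels, test_fraction, rng):
    """Return (train_idx, test_idx) with the class ratio preserved in both parts."""
    train_idx, test_idx = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        n_test = round(len(idx) * test_fraction)
        test_idx += idx[:n_test]
        train_idx += idx[n_test:]
    return train_idx, test_idx


def oversample_minority(indices, labels, rng):
    """Duplicate minority rows at random until both classes have equal counts."""
    by_class = {}
    for i in indices:
        by_class.setdefault(labels[i], []).append(i)
    target = max(len(rows) for rows in by_class.values())
    balanced = []
    for rows in by_class.values():
        extra = [rng.choice(rows) for _ in range(target - len(rows))]
        balanced += rows + extra
    return balanced


def positive_rate(indices, labels):
    """Share of rows labelled 1 among the given row indices."""
    return sum(labels[i] for i in indices) / len(indices)


rng = random.Random(0)                # fixed seed so the split is reproducible
labels = [1] * 27 + [0] * 73          # toy data with the 27% rate from the question

# 1) Split first, 2) balance only the rows used for fitting.
train_idx, test_idx = stratified_split(labels, test_fraction=0.2, rng=rng)
train_balanced = oversample_minority(train_idx, labels, rng)

print("train rate after balancing:", positive_rate(train_balanced, labels))
print("test rate (untouched):     ", positive_rate(test_idx, labels))
```

Because the synthetic or duplicated rows are created only from training rows, no information from the test set can leak into model fitting, and the test set keeps its original class rate.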

Hope this answer will help you,

Re: Balancing data

Lu — Thu, 28 Aug 2025 10:55:48 GMT

Thanks for your clear advice Victor.

For the time being, I have checked the performance of my models using the "Imbalanced Classification" add-in and calculated the Precision-Recall and ROC AUC for Neural Network, Bootstrap Forest and Boosted Tree models with SMOTE, with Tomek balancing, and without weighting, but balancing didn't improve the AUCs at all. So, balancing my data probably would not improve the performance anyway.

Thanks for your support,

Regards

Re: Balancing data

Lu — Tue, 30 Sep 2025 11:24:31 GMT

Hey Russ,

I have a Validation column that distinguishes between training, validation, and test rows. Should this column be specified in the model screening platform when applying nested cross-validation? I have noticed that the performance statistics summary differs depending on whether or not this column is included.

Could you confirm what the recommended approach would be in this case?

I would like to compare and report the performance of ML models.

Is there a fast way to get more detail than the current summary report provides, such as F1, MCC, sensitivity, specificity, FP, FN, ...?
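
For reference, all of these metrics follow from the four confusion-matrix counts. A minimal Python sketch, outside of JMP, with made-up counts and a hypothetical function name:

```python
import math


def _ratio(numerator, denominator):
    """Safe division: return 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def classification_metrics(tp, fp, fn, tn):
    """Common binary-classification metrics from confusion-matrix counts."""
    sensitivity = _ratio(tp, tp + fn)          # recall / true positive rate
    specificity = _ratio(tn, tn + fp)          # true negative rate
    precision = _ratio(tp, tp + fp)
    f1 = _ratio(2 * tp, 2 * tp + fp + fn)
    mcc = _ratio(
        tp * tn - fp * fn,
        math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)),
    )
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "F1": f1,
        "MCC": mcc,
    }


# Made-up counts for 100 test rows, 27 of them positive.
metrics = classification_metrics(tp=20, fp=10, fn=7, tn=63)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```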

regards,

Re: Balancing data

Victor_G — Tue, 30 Sep 2025 12:37:46 GMT

Hi @Lu,

If you have selected a cross-validation option when launching the Model Screening platform, adding a validation column won't change anything, as the selected cross-validation option will take over the validation column: Launch the Model Screening Platform (validation part).

The different results you obtained are more likely due to the random seed, if you didn't fix it. With a fixed random seed, a selected cross-validation option gives the same results whether or not a validation column is added.

Regarding the (broad) topic of model evaluation, comparison and selection, several other topics may help you find a relevant validation strategy:

Hope this answer will help you,