Lu
Level IV

Balancing data

Hello,

 

I want to balance my data before analysing it, since my response has a rate of only 27%. I first created a stratified (based on my target column) validation column to create Training, Validation and Test sets. My question is: should I balance my data in both the Training and Validation sets (and not in the Test set), or only in the Training set? I would like to use the SMOTE technique, and finally the cross-validation technique of the Model Screening platform on the concatenated data (Training + Validation + Test) with a Validation column, using the Test set for performance analysis. Is this the right way to balance my data before cross-validation? Other suggestions are always welcome.

 

Regards,

 

Lu 

1 ACCEPTED SOLUTION
Victor_G
Super User

Re: Balancing data

Hi @Lu,

 

I have mixed feelings about SMOTE and other balancing techniques for classification:

 

  • SMOTE completely "destroys" the calibration of your classifier (it completely changes the proportion of the two classes during the learning phase), which means that the probabilities calculated by your classifier no longer correspond to the actual measured probabilities in your dataset. This can be particularly sensitive for certain applications (healthcare, finance/banking, ...), where you may need the calculated probabilities rather than just the final majority class.
  • Moreover, as shown in the original paper and confirmed in later studies, SMOTE only provides small benefits when the classifier is a weak learner, like a Decision Tree or Naïve Bayes. When used with state-of-the-art/"strong" models (like Random Forest, XGBoost, ...), SMOTE doesn't provide benefits and may even be detrimental to the learning. Another study here: The Harms of Class Imbalance Corrections for Machine Learning Based Prediction Models: A Simulation...
    In most studies, the conclusions regarding these techniques are the same: "proper, best prediction is achieved by using a strong classifier and balancing the data is not required nor beneficial" and "Correcting for class imbalance is not always necessary and may even be harmful to clinical prediction models which aim to produce reliable risk estimates on an individual basis.".
  • SMOTE may be beneficial in specific contexts, like when using weak learners or when you can't adjust decision thresholds, but its relative benefit may also change depending on the performance metrics chosen. Outside those cases, I would not recommend using it.

 

27% for a minority class is not a huge imbalance, and classical methods like adjusting the Decision Thresholds or specifying a Profit Matrix should help you adjust the performance of your classifier to the situation. Adding new features/variables, pre-processing the data (feature engineering) to help the model, adding new data or selecting the most representative and diverse observations, and trying several models for comparison and benchmarking can also help increase your performance.
It's important to understand that imbalanced data is a problem only because the minority class in the training set is not representative enough and/or the features are not good enough indicators for the label. The ideal scenario is to solve these two problems; then the model can perform perfectly well despite the imbalance.
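The threshold-adjustment alternative mentioned above can be sketched as follows (Python/scikit-learn illustration on synthetic data; the 0.3 cutoff is an arbitrary example — in practice you would choose it from validation-set metrics or a Profit Matrix, as JMP does):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic data with a ~27% minority class, split with stratification.
X, y = make_classification(n_samples=4000, weights=[0.73, 0.27], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)[:, 1]

# Instead of resampling, move the decision threshold: lowering it from
# the default 0.5 trades precision for recall on the minority class,
# without retraining or distorting the predicted probabilities.
for threshold in (0.5, 0.3):
    pred = (proba >= threshold).astype(int)
    recall = pred[y_te == 1].mean()
    print(f"threshold={threshold}: minority-class recall={recall:.2f}")
```

Because only the cutoff changes, the underlying probability estimates stay calibrated to the real class proportions.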

 

That being said, adding SMOTE (or any pre-processing operation) to the workflow, if needed, should be done with care, to avoid data leakage and falsely over-optimistic performance expectations. SMOTE or any other balancing technique should be applied to the training and validation sets (to train, compare and select the best-performing model) but obviously not to the test set, as the model needs to perform and be evaluated in "real-life" conditions (with the imbalance).
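The leakage-safe ordering described above can be sketched like this (Python illustration on synthetic data; simple random oversampling stands in for SMOTE, which in Python would come from the separate imbalanced-learn package):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic data with a ~27% minority class.
X, y = make_classification(n_samples=4000, weights=[0.73, 0.27], random_state=0)

# 1) Split FIRST, stratified on the target, before any balancing,
#    so no synthetic/duplicated rows can leak into the test set.
X_trval, X_test, y_trval, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# 2) Balance only the training(+validation) portion.
rng = np.random.default_rng(0)
minority = np.where(y_trval == 1)[0]
extra = rng.choice(minority, size=(y_trval == 0).sum() - len(minority))
X_bal = np.vstack([X_trval, X_trval[extra]])
y_bal = np.concatenate([y_trval, y_trval[extra]])

# 3) The test set keeps its original imbalance for honest evaluation.
print(f"balanced train prevalence:  {y_bal.mean():.2f}")   # 0.50
print(f"untouched test prevalence:  {y_test.mean():.2f}")  # ~0.27
```

Balancing before splitting would put near-copies of the same minority rows on both sides of the split, which is exactly the leakage Victor warns about.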

 

Hope this answer helps you,

Victor GUILLER

"It is not unusual for a well-designed experiment to analyze itself" (Box, Hunter and Hunter)

2 REPLIES 2
Lu
Level IV

Re: Balancing data

Thanks for your clear advice Victor.

For the time being, I have checked the performance of my models using the "Imbalanced Classification" Add-In and calculated the Precision-Recall & ROC AUC for Neural Network, Bootstrap Forest and Boosted Tree models with SMOTE and with Tomek-links balancing, and without weighting, but it didn't improve the AUCs at all. So, balancing my data probably would not improve the performance anyway.

 

Thanks for your support,

 

Regards

 

Lu
