Hi @Lu,
I have mixed feelings about SMOTE and other balancing techniques for classification:
- SMOTE completely "destroys" the calibration of your classifier (as it completely changes the proportion of the two classes during the learning phase), which means the probabilities calculated by your classifier no longer correspond to the actual probabilities measured in your dataset. This can be particularly sensitive for certain applications (healthcare, finance/banking, ...), as you may sometimes need the calculated probabilities rather than just the final majority class.
- Moreover, in the original paper, and as confirmed in later studies, SMOTE only provides small benefits when the classifier is a weak learner, like a Decision Tree or Naïve Bayes. When used with state-of-the-art/"strong" models (like Random Forest, XGBoost, ...), SMOTE doesn't provide benefits and may even be detrimental to the learning. Another study here: The Harms of Class Imbalance Corrections for Machine Learning Based Prediction Models: A Simulation...
In most studies, the conclusions regarding these techniques are the same: "proper, best prediction is achieved by using a strong classifier and balancing the data is not required nor beneficial" or "Correcting for class imbalance is not always necessary and may even be harmful to clinical prediction models which aim to produce reliable risk estimates on an individual basis.".
- SMOTE may be beneficial in specific contexts, like using weak learners or when you can't adjust decision thresholds, but its relative benefit may also change depending on the performance metrics chosen. Outside those cases, I would not recommend using it.
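To make the calibration point above concrete, here is a minimal NumPy sketch (with hypothetical numbers) of the standard prior-correction formula: if a model was trained on a resampled 50/50 set but the true minority rate is 27%, its raw probabilities are inflated and need to be mapped back to the original prior before they mean anything.

```python
import numpy as np

def prior_correct(p_balanced, train_prior, true_prior):
    """Map probabilities from a model trained on a resampled (e.g. 50/50)
    set back to the original class prior (standard prior-correction formula)."""
    # Rescale the positive and negative terms by how much each class's
    # prior was distorted by the resampling, then renormalize.
    num = p_balanced * (true_prior / train_prior)
    den = num + (1.0 - p_balanced) * ((1.0 - true_prior) / (1.0 - train_prior))
    return num / den

# A "neutral" 0.5 prediction from the balanced model really corresponds
# to the 27% base rate once the true prior is restored.
p = prior_correct(np.array([0.5, 0.9, 0.1]), train_prior=0.5, true_prior=0.27)
```

This is why, if you do resample, the raw scores should not be read as risk estimates without such a correction (or a dedicated recalibration step).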
27% for a minority class is not a huge imbalance, and classical methods like adjusting the Decision Threshold or specifying a Profit Matrix should help you tune the performance of your classifier to the situation. Adding new features/variables, pre-processing the data (feature engineering) to help the model, adding new data or selecting the most representative and diverse observations, and trying several models for comparison and benchmarking can also help increase your performance.
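As a sketch of the threshold/profit-matrix idea (function name and numbers are mine, purely illustrative): instead of resampling, keep the model as-is and pick the decision threshold that minimizes total misclassification cost on held-out data.

```python
import numpy as np

def best_threshold(y_true, p_pos, cost_fp, cost_fn):
    """Scan candidate thresholds and return the one minimizing total
    misclassification cost (a simple use of a cost/profit matrix)."""
    thresholds = np.linspace(0.01, 0.99, 99)
    costs = []
    for t in thresholds:
        y_pred = (p_pos >= t).astype(int)
        fp = np.sum((y_pred == 1) & (y_true == 0))  # false positives
        fn = np.sum((y_pred == 0) & (y_true == 1))  # false negatives
        costs.append(fp * cost_fp + fn * cost_fn)
    return thresholds[int(np.argmin(costs))]

# Toy held-out predictions: costly false negatives push the threshold
# down (predict the minority class more readily); costly false
# positives push it up.
y_true = np.array([0, 1, 0, 1])
p_hat = np.array([0.3, 0.4, 0.6, 0.7])
t_fn_costly = best_threshold(y_true, p_hat, cost_fp=1, cost_fn=10)
t_fp_costly = best_threshold(y_true, p_hat, cost_fp=10, cost_fn=1)
```

Note that with well-calibrated probabilities the cost-optimal threshold is simply cost_fp / (cost_fp + cost_fn), which is exactly why preserving calibration (the first bullet above) matters.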
It's important to understand that imbalanced data is a problem only because the minority class in the training set is not representative enough and/or the features are not good enough indicators for the label. The ideal scenario is to solve these two problems, then the model can perform perfectly well despite the imbalance.
This being said, adding a SMOTE (or any pre-processing) operation to the workflow (if needed) should be done with care, to avoid data leakage and falsely over-optimistic performance expectations. SMOTE or any other balancing technique should be applied to the training and validation sets (to train, compare and select the best performing model) but obviously not to the test set, as the model needs to perform and be evaluated in "real-life" conditions (with the original imbalance).
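A minimal sketch of that leakage-safe ordering, using plain random duplication as a stand-in for SMOTE's interpolation (data and sizes are hypothetical): split first, then balance only the training portion, so the test set keeps the real-life class ratio.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical imbalanced dataset: 73/27 split, mirroring the question.
X = rng.normal(size=(100, 3))
y = np.array([0] * 73 + [1] * 27)

# 1) Split FIRST, so the test set keeps the real-life imbalance.
idx = rng.permutation(len(y))
train_idx, test_idx = idx[:80], idx[80:]
X_train, y_train = X[train_idx], y[train_idx]
X_test, y_test = X[test_idx], y[test_idx]

# 2) Only THEN oversample the minority class, on the training data alone.
#    (Plain duplication here as a stand-in for SMOTE's interpolation.)
minority = np.where(y_train == 1)[0]
extra = rng.choice(minority, size=np.sum(y_train == 0) - len(minority), replace=True)
X_bal = np.vstack([X_train, X_train[extra]])
y_bal = np.concatenate([y_train, y_train[extra]])

# The balanced training set is now 50/50, while the test set is untouched.
```

Doing the oversampling before the split (or inside the full dataset before cross-validation) would let synthetic copies of test-set neighbours leak into training, which is where the over-optimistic scores come from.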
Hope this answer helps you,
Victor GUILLER
"It is not unusual for a well-designed experiment to analyze itself" (Box, Hunter and Hunter)