Dealing with imbalanced data (02/02)

Researchers in this space have turned into explorers, trying out every method, strategy and theory one can think of: from undersampling (reducing the number of instances in the majority class) to oversampling (increasing the number of instances in the minority class) with techniques like SMOTE ¹, with or without class weighting ², and even boosting or bagging ³, ⁴.

More nuanced approaches, such as combining both methods and using ensemble resampling ⁵, have also been explored. A very promising approach to addressing imbalanced datasets has been conformal prediction ⁶.

Van den Goorbergh et al. (2022) analyzed the impact of various techniques for dealing with imbalanced data on the performance of logistic regression. The results show that neither undersampling, random oversampling nor SMOTE resulted in higher areas under the Receiver Operating Characteristic (ROC) curve when compared with models developed without correction for class imbalance. Although imbalance correction improved the balance between sensitivity and specificity, similar results were obtained by simply shifting the probability threshold instead. The caveat here is that the corrected models were consistently miscalibrated (they overestimated the probability of belonging to the minority class), which is a huge concern.
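
The threshold point can be illustrated with a minimal sketch. The labels and predicted probabilities below are made up for illustration (they do not come from the paper), and `sensitivity_specificity` is a hypothetical helper. It shows how sensitivity and specificity move when only the decision threshold changes, with no resampling and no retraining.

```python
def sensitivity_specificity(y_true, y_prob, threshold):
    """Return (sensitivity, specificity) for a given probability threshold."""
    tp = fn = tn = fp = 0
    for y, p in zip(y_true, y_prob):
        predicted = 1 if p >= threshold else 0
        if y == 1 and predicted == 1:
            tp += 1
        elif y == 1:
            fn += 1
        elif predicted == 0:
            tn += 1
        else:
            fp += 1
    return tp / (tp + fn), tn / (tn + fp)


# Toy data (assumed): 8 majority-class and 2 minority-class cases.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_prob = [0.02, 0.05, 0.08, 0.10, 0.12, 0.15, 0.20, 0.35, 0.30, 0.60]

for threshold in (0.5, 0.25):
    sens, spec = sensitivity_specificity(y_true, y_prob, threshold)
    print(f"threshold={threshold}: sensitivity={sens:.3f}, specificity={spec:.3f}")
```

With these made-up numbers, lowering the threshold from 0.5 to 0.25 raises sensitivity from 0.5 to 1.0 and lowers specificity from 1.0 to 0.875, while the predicted probabilities themselves stay untouched.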

Although practitioners have long been using SMOTE, it might not be the best way to deal with imbalanced datasets. The original SMOTE paper is now more than 20 years old (in this case, it matters!) and was evaluated on a limited number of datasets using specific classifiers, so its conclusions are difficult to generalize. In this post the matter is clearly explained. In short, SMOTE-generated points are not a good representation of the underlying data distribution (for a couple of reasons, such as imposing a linear structure on the generated data), and that distribution itself changes over time.
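
To see where the linear structure comes from, here is a simplified sketch of the interpolation step at the heart of SMOTE. It is not the full algorithm (which picks the neighbour at random among the k nearest minority-class neighbours); the function name and the two points are assumptions made for illustration.

```python
import random


def interpolate_minority_point(x, neighbor, rng):
    """Create one synthetic point on the segment between x and neighbor."""
    u = rng.random()  # uniform in [0, 1)
    return [xi + u * (ni - xi) for xi, ni in zip(x, neighbor)]


rng = random.Random(0)  # fixed seed so the sketch is reproducible
x = [0.0, 0.0]          # a minority-class sample (made up)
neighbor = [2.0, 4.0]   # one of its minority-class neighbours (made up)

synthetic = [interpolate_minority_point(x, neighbor, rng) for _ in range(5)]
for point in synthetic:
    # Each synthetic point has second coordinate = 2 * first coordinate,
    # so all of them sit on the straight line joining x and neighbor.
    print(point)
```

However many points are generated this way, they never leave the segment between two existing minority samples, which is why the synthetic data can look more regular than the real distribution.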

Side note: 10 years ago (I repeat, 10!!), Facebook’s applied research team published a paper dealing with imbalanced data ⁷ without even using SMOTE. A very interesting paper from Amazon and Cornell University researchers also concluded that balancing imbalanced datasets does not improve performance for strong classifiers.

One could argue that treating imbalanced datasets with many of the popular techniques means intentionally biasing your data (to an extent) to get interesting results instead of simply accurate results (at the data level). Research applications suggest that Logistic Regression and SVM are less vulnerable to imbalance than decision trees (which tend to overfit heavily).

This is usually attributed to the simpler decision boundary of Logistic Regression and to the margin maximization of SVM, which helps in handling minority class examples close to the decision boundary without being overly influenced by the majority class. Decision trees, however, are very greedy and can often create overly complex models that fit too closely to the majority class, ignoring or misclassifying the minority class.

Specifically for Logistic Regression applications, although imbalanced training data skews all the predicted probabilities and therefore compromises your predictions, the distortion is largely confined to the estimate of the model’s intercept, which can be treated by applying an intercept correction. Provided that you know or can estimate the true proportion of 0’s and 1’s, and know these proportions in the training set, you can apply a rare events correction to the intercept. In addition, it’s crucial to choose a valid threshold, for example with the Receiver Operating Characteristic (ROC) curve, but one should not depend solely on ROC.
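
As a minimal sketch of such an intercept correction, the snippet below uses one commonly cited form, the "prior correction" associated with King and Zeng's work on rare events logistic regression: subtract ln[((1 - τ) / τ) · (ȳ / (1 - ȳ))] from the fitted intercept, where τ is the true proportion of 1's and ȳ the proportion in the training set. The function names and the numbers are assumptions for illustration, and the correction is only meaningful when the training proportions differ from the true ones (for example after resampling).

```python
import math


def prior_corrected_intercept(intercept, train_pos_rate, true_pos_rate):
    """Shift a fitted logistic intercept from the training prevalence to the true one."""
    offset = math.log(
        ((1 - true_pos_rate) / true_pos_rate)
        * (train_pos_rate / (1 - train_pos_rate))
    )
    return intercept - offset


def sigmoid(z):
    """Map a log-odds value to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


# Assumed numbers: model fitted on a 50/50 resampled set, true prevalence 1%.
fitted_intercept = 0.0
corrected = prior_corrected_intercept(
    fitted_intercept, train_pos_rate=0.5, true_pos_rate=0.01
)

print(sigmoid(fitted_intercept))  # baseline probability before correction
print(sigmoid(corrected))         # baseline probability after correction
```

In this toy case the baseline probability moves from 0.5 (what the balanced training set implies) back to roughly 0.01 (the assumed true prevalence), while the slope coefficients are left alone. When the training and true proportions are equal, the offset is zero and nothing changes.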

Classifiers are best suited to deterministic problems where there is a clear-cut separation in outcomes, but for most real-world examples, where there is complexity and inherent variability, probability-based models such as logistic regression are more precise and insightful. Not to mention how strong they can be with proper calibration. The most important matter is still to understand the problem at hand first.
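
A quick way to check calibration is to compare predicted probabilities with what actually happened. The sketch below uses two simple, standard measures: the Brier score (mean squared error of the probabilities) and the gap between the mean predicted probability and the observed event rate. The data and the function names are made up for illustration.

```python
def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def mean_calibration_gap(y_true, y_prob):
    """Mean predicted probability minus observed event rate (0 is ideal)."""
    return sum(y_prob) / len(y_prob) - sum(y_true) / len(y_true)


# Toy data (assumed): the event occurs in 2 out of 10 cases.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
honest = [0.2] * 10    # probabilities that match the observed rate
inflated = [0.5] * 10  # probabilities that overestimate the minority class

for name, probs in (("honest", honest), ("inflated", inflated)):
    print(name, brier_score(y_true, probs), mean_calibration_gap(y_true, probs))
```

The inflated probabilities get a worse (higher) Brier score and a positive calibration gap, which is the pattern one would expect from a model that overestimates the minority class.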

One may go beyond the ‘classification’ way of thinking and come up with other ways of approaching the problem. Asking whether the imbalanced dataset SHOULD be imbalanced, or WILL STAY imbalanced over time, is a good place to start. Reframing what was thought to be a ‘classification problem’ as Anomaly Detection could be a valid alternative.
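
As a toy illustration of the anomaly detection framing, the sketch below learns a profile from "normal" observations only and flags values that sit far from it, instead of training a two-class classifier. The z-score rule, the cutoff of 3 and the data are all assumptions chosen for simplicity; real applications would typically use a dedicated anomaly detection method.

```python
import statistics


def fit_normal_profile(values):
    """Summarise the 'normal' class by its mean and sample standard deviation."""
    return statistics.mean(values), statistics.stdev(values)


def is_anomaly(value, profile, z_cutoff=3.0):
    """Flag a value whose z-score against the normal profile exceeds the cutoff."""
    mean, std = profile
    return abs(value - mean) / std > z_cutoff


# Made-up measurements from the majority ("normal") class only.
normal_values = [9.8, 10.1, 10.0, 9.9, 10.3, 10.2, 9.7, 10.0]
profile = fit_normal_profile(normal_values)

print(is_anomaly(10.1, profile))  # a typical value
print(is_anomaly(25.0, profile))  # a value far from the normal profile
```

Note that no minority-class examples were needed to fit the profile, which is what makes this framing attractive when the rare class is tiny or keeps changing.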

In general there are five cases, and which one applies depends on what the goal is:

• If your goal is precise forecasting and you believe your dataset accurately represents the situation, there's no need for adjustments.

• When predictions matter but your sample is skewed because some data points are missing: if the data is missing at random, you’re fine. However, if the loss is systematic and the nature of the bias is unknown, acquiring new data is necessary. If the missing data is selectively absent based on a single characteristic (for instance, if you categorize your results into Groups A and B and lose half of Group B's data), you can employ bootstrapping to compensate.

• If your interest lies in identifying and analyzing rare occurrences rather than achieving broad prediction accuracy, you may consider artificially increasing the frequency of these cases in your dataset through bootstrapping, or by discarding data unrelated to these cases if you have sufficient data. Be aware, though, that such approaches will introduce bias into your dataset and could lead to incorrect conclusions or estimates.

• Use a different approach to solving the problem than classification. Make sure you have a calibrated model and evaluate it with the right metrics (‘Discontinuous’ metrics are discussed in the next post).

• If the problem at hand is clear-cut (which is rare in real-life scenarios), adjusting the threshold, while being very cautious, can be advantageous. Cost-sensitive learning (making misclassification of the minority class more ‘expensive’ for the algorithm) along with weight adjustment during training might help as well.
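
The cost-sensitive idea in the last case can be sketched as a class-weighted log loss. The weighting rule below, n_samples / (n_classes * class_count), mirrors what scikit-learn documents for `class_weight="balanced"`; the helper names and the data are assumptions for illustration, not a full training loop.

```python
import math
from collections import Counter


def balanced_class_weights(y):
    """Weights n_samples / (n_classes * count), so rarer classes weigh more."""
    counts = Counter(y)
    return {label: len(y) / (len(counts) * count) for label, count in counts.items()}


def weighted_log_loss(y_true, y_prob, class_weights):
    """Log loss where each sample's error is scaled by its class weight."""
    total = 0.0
    weight_sum = 0.0
    for y, p in zip(y_true, y_prob):
        w = class_weights[y]
        total += -w * (y * math.log(p) + (1 - y) * math.log(1 - p))
        weight_sum += w
    return total / weight_sum


# Toy data (assumed): 8 majority-class and 2 minority-class cases.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
ignores_minority = [0.1] * 10  # a model that always predicts "probably 0"

weights = balanced_class_weights(y_true)
print(weights)  # {0: 0.625, 1: 2.5}

plain = weighted_log_loss(y_true, ignores_minority, {0: 1.0, 1: 1.0})
weighted = weighted_log_loss(y_true, ignores_minority, weights)
print(plain, weighted)
```

A model that ignores the minority class is penalised more under the weighted loss than under the plain one, which is the sense in which minority errors become more ‘expensive’. Keep in mind that training with such weights distorts the predicted probabilities, so calibration should be checked afterwards.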

Additional tip: check out this video on how to use Scikit-learn for imbalanced datasets (it comes with a free example of how SMOTE does not work). Scikit-learn’s progress is truly admirable.

OR: check out ⁸, ⁹

1. Lee PH. Resampling Methods Improve the Predictive Power of Modeling in Class-Imbalanced Datasets. International Journal of Environmental Research and Public Health. 2014

2. Anand, A., Pugalenthi, G., Fogel, G.B. et al. An approach for classification of highly imbalanced data using weighting and undersampling

3. M. Hao, Y. Wang, S.H. Bryant, An efficient algorithm coupled with synthetic minority over-sampling technique to classify imbalanced PubChem BioAssay data, Anal. Chim. Acta 806 (2014) 117–127.

4. H. Parvin, B. Minaei-Bidgoli, H.J. Alinejad-Rokny, A new imbalanced learning and dictions tree method for Breast cancer diagnosis, J. Bionanosci. 7 (2013) 673–678.

5. H. Wang, Q. Xu, L. Zhou, Large unbalanced credit scoring using lasso-logistic regression ensemble, PLoS One 10 (2015) e0117844

6. Norinder, U., & Boyer, S. (2017). Binary classification of imbalanced datasets using conformal prediction. Journal of Molecular Graphics and Modelling, 72, 256-265.

7. He, X., Pan, J., Jin, O., Xu, T., Liu, B., Xu, T., ... & Candela, J. Q. (2014, August). Practical lessons from predicting clicks on ads at facebook. In Proceedings of the eighth international workshop on data mining for online advertising (pp. 1-9).

8. https://stats.stackexchange.com/questions/6067/does-an-unbalanced-sample-matter-when-doing-logistic-regression

9. https://stats.meta.stackexchange.com/questions/6349/profusion-of-threads-on-imbalanced-data-can-we-merge-deem-canonical-any