Learning from unbalanced catastrophic out-of-pocket health expenditure dataset: Blending SMOTE-boosting with ensemble models

Çinaroğlu S.

Journal of Experimental & Theoretical Artificial Intelligence, vol.1, no.1, pp.1-18, 2022 (SCI-Expanded)

Abstract

This study attests to the benefits of synthetic data generation with the Synthetic Minority Oversampling Technique (SMOTE) and incorporates this procedure into SMOTEBoosting, applying learning algorithms to model an unbalanced catastrophic out-of-pocket (OOP) health expenditure dataset. Nationally representative household budget survey data were gathered from the Turkish Statistical Institute for the year 2012. A total of 9987 households responded to the survey. The original dataset was highly unbalanced: only 0.14 percent of households faced catastrophic health expenses. SMOTE was used to perform balanced oversampling, and 10 artificial datasets with sizes ranging from 10% to 100% of the majority group of the original training data were generated. To predict catastrophic OOP health expenditures, SMOTEBoosting was combined with learning algorithms such as C5.0, random forest (RF), naïve Bayes, and support vector machine. The results confirm the outstanding prediction performance of the blended strategy of SMOTEBoosting with RF (area under the curve > 0.85). A variable importance plot and a decision tree show that being at least 65 years of age is the most important predictor of catastrophic cases. The findings of this study highlight that multistrategy ensemble learning techniques are useful for modelling highly unbalanced datasets.
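The SMOTE oversampling step described in the abstract can be sketched as follows. This is a minimal illustration in Python (NumPy and scikit-learn) of the core SMOTE idea only: each synthetic minority sample is an interpolation between a minority point and one of its k nearest minority neighbours. The helper `smote_oversample` and the toy data are hypothetical, not the paper's code or data.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_oversample(X_min, n_synthetic, k=5, seed=None):
    """Generate synthetic minority samples by interpolating between each
    minority point and a randomly chosen one of its k nearest minority
    neighbours (the core SMOTE idea; hypothetical helper)."""
    rng = np.random.default_rng(seed)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, idx = nn.kneighbors(X_min)  # idx[:, 0] is each point itself
    synthetic = np.empty((n_synthetic, X_min.shape[1]))
    for i in range(n_synthetic):
        j = rng.integers(len(X_min))         # pick a minority sample
        nb = X_min[rng.choice(idx[j, 1:])]   # pick one of its neighbours
        gap = rng.random()                   # interpolation factor in [0, 1)
        synthetic[i] = X_min[j] + gap * (nb - X_min[j])
    return synthetic

# Toy unbalanced data: 100 majority vs. 5 minority samples.
rng = np.random.default_rng(0)
X_maj = rng.normal(0.0, 1.0, size=(100, 2))
X_min = rng.normal(3.0, 0.5, size=(5, 2))

# Oversample the minority class up to 50% of the majority class size,
# analogous to one of the paper's 10%-100% artificial datasets.
X_new = smote_oversample(X_min, n_synthetic=45, k=3, seed=1)
print(X_new.shape)  # (45, 2)
```

Because each synthetic point is a convex combination of two minority points, all generated samples lie inside the minority region rather than duplicating existing records, which is what distinguishes SMOTE from plain random oversampling.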