Modelling Unbalanced Catastrophic Health Expenditure Data by Using Machine‐Learning Methods

Çinaroğlu S.

Intelligent Systems in Accounting, Finance and Management, vol.1, no.1, pp.1-14, 2020 (ESCI)


This study compares the performance of logistic regression and random forest classifiers under a balanced oversampling procedure for predicting which households will face catastrophic out‐of‐pocket (OOP) health expenditure. Data were derived from the nationally representative household budget survey conducted by the Turkish Statistical Institute for the year 2012. A total of 9,987 households returned valid surveys. The data set was highly imbalanced: the percentage of households facing catastrophic OOP health expenditure was 0.14. Balanced oversampling was performed, and 30 artificial data sets were generated with sizes of 5% and 98% of the original data size. Classifiers trained on the balanced oversampled data gave accurate predictions, and random forest performed best in identifying households facing catastrophic OOP health expenditure (area under the receiver operating characteristic curve, AUC = 0.8765; classification accuracy, CA = 0.7936; sensitivity = 0.7765; specificity = 0.8552; F1 = 0.7797).
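
As a rough illustration of the two ideas the abstract relies on, the sketch below shows one simple way to draw a class-balanced resample and to compute the threshold-based metrics it reports (classification accuracy, sensitivity, specificity, F1). It is a minimal sketch under stated assumptions, not the study's procedure: the function names `balanced_oversample` and `binary_metrics`, the toy label vector, and the choice of plain random resampling with replacement are all hypothetical, and the abstract does not say which oversampling algorithm or software was used.

```python
import random
from collections import Counter


def balanced_oversample(labels, target_size, seed=0):
    """Return row indices of a class-balanced resample drawn with replacement.

    Assumption: plain random resampling. Each class contributes
    target_size // n_classes indices, so the rare class is oversampled
    and the common class is undersampled as needed.
    """
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    per_class = target_size // len(by_class)
    sample = []
    for label in sorted(by_class):
        # random.choices samples with replacement
        sample.extend(rng.choices(by_class[label], k=per_class))
    rng.shuffle(sample)
    return sample


def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, sensitivity, specificity and F1 for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    denom = precision + sensitivity
    f1 = 2 * precision * sensitivity / denom if denom else 0.0
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f1": f1,
    }


# Hypothetical toy labels: 20 positive households out of 1,000 (not the survey data).
toy_labels = [1] * 20 + [0] * 980
resampled = balanced_oversample(toy_labels, target_size=500, seed=1)
class_counts = Counter(toy_labels[i] for i in resampled)
```

In this toy setup the resample holds equal numbers of both classes, so a classifier trained on the selected rows is no longer dominated by the majority class. AUC is left out because it requires predicted scores rather than hard class labels.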