Handling Missing Values in Random Forests: An Application to Demographic Survey Data



Içen D., Abbasoğlu Özgören A., Boz Semerci A.

6th International Conference on Advances in Statistics (ICAS’20), 16 - 18 October 2020, pp.16-22

  • Publication Type: Conference Paper / Full Text
  • Page Numbers: pp.16-22
  • Hacettepe University Affiliated: Yes

Abstract

The purpose of this study is to examine how missing values should be handled when a random forest algorithm is used to classify data from the most recent Turkey Demographic and Health Survey (2018 TDHS). The main idea of ensemble learning methods is to combine multiple models, each solving the same problem, to obtain more accurate and reliable predictions or decisions than a single model would give [1]. As one of the ensemble methods, Random Forests (RFs) were introduced by Leo Breiman in 2001 and have been increasingly used in the field of data science since then [2]. Two important advantages of the random forest method are that it handles a large number of input variables and that it is fast [3]. Missing values are an inevitable problem that data scientists face in almost all areas of science. We first focus on the 2018 TDHS data, which contain some missing values. We then apply different imputation methods to these missing values [4]. Finally, the best imputation method for the 2018 TDHS data is determined for the classification problem using random forests.
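
The comparison described in the abstract can be illustrated with a minimal sketch. It is not the authors' code: it assumes scikit-learn and NumPy, uses synthetic data in place of the 2018 TDHS data, and picks mean, median, and k-nearest-neighbour imputation only as examples, since the abstract does not name the imputation methods that were compared. The helper `compare_imputers` and every parameter value are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import KNNImputer, SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline


def compare_imputers(X, y, imputers, seed=0):
    """Return the mean 5-fold CV accuracy of a random forest for each imputer."""
    scores = {}
    for name, imputer in imputers.items():
        # The imputer sits inside the pipeline, so it is re-fit on the
        # training part of every fold and never sees the held-out rows.
        model = make_pipeline(
            imputer,
            RandomForestClassifier(n_estimators=100, random_state=seed),
        )
        fold_scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
        scores[name] = fold_scores.mean()
    return scores


# Synthetic stand-in data (assumption: NOT the 2018 TDHS data).
rng = np.random.default_rng(0)
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=5, random_state=0
)
X_missing = X.copy()
# Blank out roughly 15% of the entries completely at random.
X_missing[rng.random(X.shape) < 0.15] = np.nan

# Candidate imputation methods (illustrative choices only).
imputers = {
    "mean": SimpleImputer(strategy="mean"),
    "median": SimpleImputer(strategy="median"),
    "knn": KNNImputer(n_neighbors=5),
}

results = compare_imputers(X_missing, y, imputers)
# Print the methods from highest to lowest cross-validated accuracy.
for name, acc in sorted(results.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {acc:.3f}")
```

Placing the imputer inside the pipeline is a deliberate choice in this sketch: fitting it on the full data before cross-validation would leak information from the test folds into the training folds. With real survey data, categorical variables and non-random missingness would need additional handling that this sketch does not attempt.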