6th International Conference on Advances in Statistics (ICAS’20), 16 - 18 October 2020, pp.16-22
The purpose of this study is to examine how missing values should be handled when a random forest algorithm is used to
classify the most recent Turkey Demographic and Health Survey (2018 TDHS) data. The main idea of ensemble learning
methods is to combine several models, each solving the same problem, to obtain more accurate and reliable predictions or
decisions than a single model would give [1]. One such ensemble method, Random Forests (RFs), was developed by Leo
Breiman in 2001 and has been increasingly used in the field of data science since then [2]. Some important advantages of
the random forest method are that it handles a large number of input variables and that it is fast [3]. Missing values
are a problem that data scientists inevitably face in almost all areas of science. We first focus on the 2018 TDHS
data, which contain some missing values. We then apply different imputation methods to the missing values in these
data [4]. Finally, the best imputation method for the 2018 TDHS data is determined in a classification problem using
random forests.
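
The abstract does not name the imputation methods that were compared. As a minimal sketch only, assuming simple mean, median, and mode imputation as candidate methods, the step of filling missing values in one variable might look like the following. The `impute` helper and the toy `age` column are hypothetical and are not taken from the 2018 TDHS data.

```python
from statistics import mean, median, mode


def impute(values, strategy):
    """Replace None entries with a statistic computed from the observed entries."""
    observed = [v for v in values if v is not None]
    # Assumed candidate strategies; the study's actual methods are not stated here.
    fill = {"mean": mean, "median": median, "mode": mode}[strategy](observed)
    return [fill if v is None else v for v in values]


# Hypothetical toy column (not TDHS data); None marks a missing value.
age = [23, None, 31, 31, None, 46]

# One imputed copy of the column per candidate strategy.
candidates = {s: impute(age, s) for s in ("mean", "median", "mode")}
```

Under this sketch, each imputed copy of the data would then be passed to the same random forest classifier, and the imputation methods would be compared on a common measure such as held-out classification accuracy.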