Comparison of Fuzzy C-Means and K-Means Clustering Performance: An Application on Household Budget Survey Data


Çinaroğlu S.

in: Intelligent and Fuzzy Techniques: Smart and Innovative Solutions, Kahraman C., Cevik Onar S., Oztaysi B., Sari I., Cebi S., Tolga A., Editor, Springer, London/Berlin, Basel, pp. 54-62, 2020

  • Publication Type: Book Chapter / Chapter Research Book
  • Publication Date: 2020
  • Publisher: Springer, London/Berlin 
  • City: Basel
  • Page Numbers: pp.54-62
  • Editors: Kahraman C., Cevik Onar S., Oztaysi B., Sari I., Cebi S., Tolga A.
  • Hacettepe University Affiliated: Yes

Abstract

National Household Budget Survey (HBS) data include sociodemographic and financial indicators that inform government public policy actions. Finding the optimal grouping of households in a sufficiently large dataset is a challenging task for policymakers. Soft classification techniques such as Fuzzy C-means (FCM) can reveal hidden patterns in the variable set. This study aims to compare the classification performance of FCM and k-means (KM) in grouping households by sociodemographic and out-of-pocket (OOP) health expenditure variables. Health expenditure variables have heavily skewed distributions, and the shape of a variable's distribution has a measurable effect on classifiers. Incorporating Bayesian data generation procedures into the variable transformation process can increase the ability to deal with skewness and improve model performance. However, little is known about how the Bayesian data generation approach performs when embedded in unsupervised learning applied to health expenditures. This study applied this strategy to Turkish HBS data for 2015 while comparing FCM and KM classification performance. Normality test results show that the distributions of the logarithmic (KS = 0.006; p > 0.05) and Box-Cox transformed (KS = 0.006; p > 0.05) health expenditure variables, which were generated using lognormal distributions from a Bayesian viewpoint, are close to normal. Moreover, KM clustering (Sil = 0.48) performs better than FCM (Sil = 0.4198) in classifying households. The optimal number of household groups is 20. Further studies will compare the cluster-seeking performance of other unsupervised learning algorithms while incorporating arbitrary health expenditure variables into the study model.
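
The comparison described in the abstract can be illustrated with a minimal sketch. It is not the chapter's implementation. It assumes that "Sil" refers to the mean silhouette coefficient, uses synthetic lognormal data in place of the HBS variables, and leaves out the Bayesian data generation step entirely. The variable names (`spend`, `size`), the two-group setup, and the fuzzifier `m = 2` are illustrative assumptions. The sketch log-transforms a skewed expenditure variable, clusters the data with plain NumPy versions of KM and FCM, and scores both hard partitions with the silhouette coefficient.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Lloyd's algorithm; returns hard labels and cluster centers."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Keep the old center if a cluster happens to be empty.
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centers[j] for j in range(k)])
        if np.allclose(new, centers):
            break
        centers = new
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    return d.argmin(axis=1), centers

def fuzzy_c_means(X, c, m=2.0, n_iter=200, tol=1e-6, seed=0):
    """Standard FCM updates; returns the membership matrix U and centers."""
    rng = np.random.default_rng(seed)
    U = rng.random((len(X), c))
    U /= U.sum(axis=1, keepdims=True)          # memberships sum to 1 per row
    for _ in range(n_iter):
        W = U ** m
        centers = (W.T @ X) / W.sum(axis=0)[:, None]
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        inv = np.fmax(d, 1e-12) ** (-2.0 / (m - 1.0))
        U_new = inv / inv.sum(axis=1, keepdims=True)
        converged = np.abs(U_new - U).max() < tol
        U = U_new
        if converged:
            break
    return U, centers

def silhouette(X, labels):
    """Mean silhouette coefficient for a hard partition (needs >= 2 clusters)."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    uniq = np.unique(labels)
    scores = np.zeros(len(X))
    for i in range(len(X)):
        same = labels == labels[i]
        if same.sum() == 1:
            continue                            # singleton clusters score 0
        a = D[i, same].sum() / (same.sum() - 1)
        b = min(D[i, labels == l].mean() for l in uniq if l != labels[i])
        scores[i] = (b - a) / max(a, b)
    return scores.mean()

# Synthetic stand-in for the survey data (assumption, not HBS data):
# a skewed expenditure variable and a household-size variable, two groups.
rng = np.random.default_rng(42)
n = 150
spend = np.concatenate([rng.lognormal(3.0, 0.5, n), rng.lognormal(6.0, 0.5, n)])
size = np.concatenate([rng.normal(2.0, 0.5, n), rng.normal(5.0, 0.5, n)])

# The log transform reduces the skewness of the expenditure variable.
X = np.column_stack([np.log(spend), size])
X = (X - X.mean(axis=0)) / X.std(axis=0)

km_labels, _ = kmeans(X, k=2)
U, _ = fuzzy_c_means(X, c=2)
fcm_labels = U.argmax(axis=1)                   # harden fuzzy memberships

sil_km = silhouette(X, km_labels)
sil_fcm = silhouette(X, fcm_labels)
print(f"KM silhouette:  {sil_km:.3f}")
print(f"FCM silhouette: {sil_fcm:.3f}")
```

Hardening the FCM memberships with `argmax` is one common way to make the two methods comparable on a single index. Whether the chapter does the same is not stated in the abstract.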