Comparison of Fuzzy C-Means and K-Means Clustering Performance: An Application on Household Budget Survey Data

Çinaroğlu, SONGÜL

Intelligent and Fuzzy Techniques: Smart and Innovative Solutions, Kahraman C., Cevik Onar S., Oztaysi B., Sari I., Cebi S., Tolga A. (Eds.), Springer, London/Berlin, Basel, pp. 54-62, 2020

Publication Type: Book Chapter / Research Book
Publication Date: 2020
Publisher: Springer, London/Berlin
City of Publication: Basel
Pages: pp. 54-62
Editors: Kahraman C., Cevik Onar S., Oztaysi B., Sari I., Cebi S., Tolga A. (Eds.)
Hacettepe University Affiliated: Yes

Abstract

National Household Budget Survey (HBS) data include sociodemographic and financial indicators that are key elements of government public policy actions. Finding the optimal grouping of households in a sufficiently large dataset is a challenging task for policymakers. Soft classification techniques such as fuzzy c-means (FCM) provide a deeper understanding of hidden patterns in the variable set. This study compares the classification performance of FCM and k-means (KM) for grouping households in terms of sociodemographic and out-of-pocket (OOP) health expenditure variables. Health expenditure variables have heavily skewed distributions, and the shape of a variable's distribution has a measurable effect on classifiers. Incorporating Bayesian data generation procedures into the variable transformation process can improve the handling of skewness and, in turn, model performance. However, little is known about how this embedded strategy, which combines Bayesian data generation with unsupervised learning, performs when applied to health expenditures. This study applied the strategy to Turkish HBS data for 2015 while comparing FCM and KM classification performance. Normality test results indicate that the logarithmic (KS = 0.006; p > 0.05) and Box-Cox transformed (KS = 0.006; p > 0.05) health expenditure variables, which were generated using lognormal distributions from a Bayesian viewpoint, are close to normally distributed. Moreover, KM clustering (Sil = 0.48) yields better results than FCM (Sil = 0.4198) for classifying households. The optimal number of household groups is 20. Further studies will compare the cluster-seeking performance of other unsupervised learning algorithms while incorporating arbitrary health expenditure variables into the study model.
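
The comparison described in the abstract can be illustrated with a minimal sketch. It is not the chapter's code or data. It assumes that "Sil" denotes the mean silhouette coefficient, uses synthetic one-dimensional lognormal draws in place of the HBS expenditure variable, applies only the logarithmic transform, and fixes two clusters rather than the 20 groups reported. All function names, parameters, and the seed are hypothetical.

```python
import math
import random

def skewness(xs):
    """Biased sample skewness: third central moment over variance^1.5."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m3 = sum((x - mean) ** 3 for x in xs) / n
    return m3 / m2 ** 1.5

def quantile_init(xs, k):
    """Deterministic starting centers taken at evenly spaced quantiles."""
    srt = sorted(xs)
    return [srt[(2 * j + 1) * len(srt) // (2 * k)] for j in range(k)]

def nearest(x, centers):
    """Index of the center closest to x."""
    return min(range(len(centers)), key=lambda j: abs(x - centers[j]))

def kmeans_1d(xs, k, iters=50):
    """Lloyd's algorithm on scalars; returns one hard label per point."""
    centers = quantile_init(xs, k)
    for _ in range(iters):
        labels = [nearest(x, centers) for x in xs]
        for j in range(k):
            members = [x for x, lab in zip(xs, labels) if lab == j]
            if members:  # keep the old center if a cluster empties
                centers[j] = sum(members) / len(members)
    return [nearest(x, centers) for x in xs]

def fcm_memberships(xs, centers, m):
    """u[i][j] = d_ij^(-2/(m-1)) / sum_k d_ik^(-2/(m-1))."""
    rows = []
    for x in xs:
        inv = [(abs(x - v) + 1e-12) ** (-2.0 / (m - 1.0)) for v in centers]
        total = sum(inv)
        rows.append([w / total for w in inv])
    return rows

def fcm_1d(xs, c, m=2.0, iters=50):
    """Fuzzy c-means on scalars; returns the soft membership matrix."""
    centers = quantile_init(xs, c)
    for _ in range(iters):
        u = fcm_memberships(xs, centers, m)
        centers = [
            sum(u[i][j] ** m * xs[i] for i in range(len(xs)))
            / sum(u[i][j] ** m for i in range(len(xs)))
            for j in range(c)
        ]
    return fcm_memberships(xs, centers, m)

def silhouette(xs, labels):
    """Mean silhouette coefficient using absolute difference as distance."""
    clusters = sorted(set(labels))
    total = 0.0
    for i, xi in enumerate(xs):
        dist = {c: 0.0 for c in clusters}
        count = {c: 0 for c in clusters}
        for j, xj in enumerate(xs):
            if i != j:
                dist[labels[j]] += abs(xi - xj)
                count[labels[j]] += 1
        own = labels[i]
        if count[own] == 0:
            continue  # a singleton cluster contributes a score of 0
        a = dist[own] / count[own]
        b = min(dist[c] / count[c] for c in clusters if c != own and count[c])
        total += (b - a) / max(a, b)
    return total / len(xs)

# Synthetic stand-in for a skewed expenditure variable (assumed parameters).
random.seed(0)
raw = [random.lognormvariate(3.0, 0.5) for _ in range(100)]
raw += [random.lognormvariate(6.0, 0.5) for _ in range(100)]
logged = [math.log(x) for x in raw]  # log transform to reduce skewness

km_labels = kmeans_1d(logged, k=2)
fcm_u = fcm_1d(logged, c=2)
# Harden the soft FCM memberships so both methods can be scored alike.
fcm_labels = [row.index(max(row)) for row in fcm_u]

sil_km = silhouette(logged, km_labels)
sil_fcm = silhouette(logged, fcm_labels)
print("skewness raw / log:", skewness(raw), skewness(logged))
print("silhouette KM / FCM:", sil_km, sil_fcm)
```

Hardening the FCM memberships by taking the largest one per household is only one way to place the two methods on the same silhouette scale; the chapter may have scored FCM differently.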