© 2023 The Author(s)

Assessing the classification performance of ordinal classifiers is a challenging problem under imbalanced data compositions. Because the choice of evaluation metric critically affects the choice of classifier, employing the most reliable metric is crucial. Although Cohen's kappa is widely used for performance assessment, other agreement measures perform better under certain configurations of ordinal confusion matrices. This research implements weighted agreement measures as evaluation metrics for ordinal classifiers. The applicability of agreement measures and mainstream performance metrics to various fields of practice under challenging data compositions is assessed, and the sensitivity of the metrics in detecting subtle distinctions between ordinal classifiers is analyzed. Five kappa-like agreement measures with six weighting schemes are employed as evaluation metrics, and their reliability and usefulness are compared with mainstream and recently proposed metrics, including the F1 score, the Matthews correlation coefficient, and informational agreement. The performance of 37 metrics is analyzed in two extensive numerical studies covering synthetic confusion matrices and real datasets. Promising metrics under practical circumstances are identified, and recommendations are made about the best metric for evaluating ordinal classifiers under different conditions. Overall, the weighted Scott's pi measure is found to be useful, sensitive to small differences in classification performance, and reliable under general conditions.
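To make the family of metrics concrete, the following is a minimal sketch of linearly weighted, chance-corrected agreement computed from an ordinal confusion matrix. It covers two of the measures named in the abstract, weighted Cohen's kappa and weighted Scott's pi; the function name, the choice of linear weights, and the restriction to these two measures are illustrative assumptions, not the paper's full implementation (which covers five measures and six weighting schemes).

```python
import numpy as np

def weighted_agreement(cm, kind="cohen"):
    """Linearly weighted chance-corrected agreement from a square confusion matrix.

    kind="cohen" -> weighted Cohen's kappa: expected disagreement is computed
                    from the product of the row and column marginals.
    kind="scott" -> weighted Scott's pi: expected disagreement is computed
                    from the averaged row/column marginals.
    Both equal 1 for perfect agreement and 0 for chance-level agreement.
    """
    cm = np.asarray(cm, dtype=float)
    k = cm.shape[0]
    p = cm / cm.sum()                     # observed joint proportions
    i, j = np.indices((k, k))
    w = np.abs(i - j) / (k - 1)           # linear disagreement weights
    row, col = p.sum(axis=1), p.sum(axis=0)
    if kind == "cohen":
        e = np.outer(row, col)            # chance agreement, Cohen-style
    else:
        m = (row + col) / 2               # averaged marginals, Scott-style
        e = np.outer(m, m)
    # 1 - (observed weighted disagreement / expected weighted disagreement)
    return 1 - (w * p).sum() / (w * e).sum()
```

On a purely diagonal confusion matrix both measures return 1, and on a symmetric matrix the two coincide because the row and column marginals are identical; they diverge exactly when the classifier's marginal distribution differs from the true one, which is where the choice of measure matters.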