Term evaluation metrics in imbalanced text categorization

Naderalvojoud, ALAETTİN; SEZER, EBRU

doi:10.1017/s1351324919000317



Naderalvojoud B., SEZER E.

NATURAL LANGUAGE ENGINEERING, vol.26, no.1, pp.31-47, 2020 (SCI-Expanded)

Publication Type: Article / Full Article
Volume: 26 Issue: 1
Publication Date: 2020
DOI Number: 10.1017/s1351324919000317
Journal Name: NATURAL LANGUAGE ENGINEERING
Indexes Covering the Journal: Science Citation Index Expanded (SCI-EXPANDED), Arts and Humanities Citation Index (AHCI), Social Sciences Citation Index (SSCI), Scopus, Applied Science & Technology Source, Compendex, Computer & Applied Sciences, INSPEC, Linguistics & Language Behavior Abstracts, Psycinfo, DIALNET
Pages: pp.31-47
Hacettepe University Affiliated: Yes

Abstract

This paper proposes four novel term evaluation metrics for representing documents in text categorization where the class distribution is imbalanced. These metrics are derived by revising four common term evaluation metrics: chi-square, information gain, odds ratio, and relevance frequency. While the common metrics require a balanced class distribution, the proposed metrics evaluate document terms under an imbalanced distribution. They calculate the degree of relatedness of terms to the minor and major classes by taking the imbalanced class distribution into account. Using these metrics in document representation yields a better distinction between documents of the minor and major classes and improves the performance of machine learning algorithms. The proposed metrics are assessed on three popular benchmarks (two subsets of Reuters-21578, plus WebKB) using four classification algorithms: support vector machines, naive Bayes, decision trees, and centroid-based classifiers. Our empirical results indicate that the proposed metrics outperform the common metrics in imbalanced text categorization.
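
For orientation, the four common metrics named in the abstract can be sketched from a 2x2 table of document counts. This is a minimal illustration of widely used textbook forms only; it does not reproduce the revised metrics proposed in the paper, whose formulas are not given in this record. The function name `term_metrics`, the count names `a`, `b`, `c`, `d`, and the 0.5 smoothing in the odds ratio are assumptions made for this sketch.

```python
import math


def term_metrics(a, b, c, d):
    """Textbook term scores from a 2x2 document-count table (illustrative).

    a: positive-class documents containing the term
    b: positive-class documents without the term
    c: negative-class documents containing the term
    d: negative-class documents without the term
    """
    n = a + b + c + d

    # Chi-square statistic for the 2x2 table.
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    chi2 = n * (a * d - b * c) ** 2 / denom if denom else 0.0

    # Information gain, computed as the mutual information between
    # term presence/absence and class membership.
    ig = 0.0
    for joint, term_total, class_total in (
        (a, a + c, a + b),  # term present, positive class
        (c, a + c, c + d),  # term present, negative class
        (b, b + d, a + b),  # term absent, positive class
        (d, b + d, c + d),  # term absent, negative class
    ):
        if joint:
            ig += (joint / n) * math.log2(joint * n / (term_total * class_total))

    # Log odds ratio; the 0.5 smoothing is an assumption of this sketch
    # to avoid division by zero on empty cells.
    odds_ratio = math.log2(((a + 0.5) * (d + 0.5)) / ((b + 0.5) * (c + 0.5)))

    # Relevance frequency in its commonly cited form.
    rf = math.log2(2 + a / max(1, c))

    return {"chi2": chi2, "ig": ig, "odds_ratio": odds_ratio, "rf": rf}


# Hypothetical counts: a minor class of 10 documents against 90 others.
scores = term_metrics(a=8, b=2, c=9, d=81)
```

In the hypothetical call, the positive class holds only 10 of 100 documents, which is the kind of skewed setting the abstract describes; the sketch simply shows how the unrevised scores are obtained from such counts.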