Imbalanced Text Categorization Based on Positive and Negative Term Weighting Approach


Naderalvojoud B., Sezer E. A., UÇAN A.

18th International Conference on Text, Speech and Dialogue (TSD), Pilsen, Czech Republic, 14 - 17 September 2015, vol.9302, pp.325-333

  • Publication Type: Conference Paper / Full Text
  • Volume: 9302
  • Doi Number: 10.1007/978-3-319-24033-6_37
  • City: Pilsen
  • Country: Czech Republic
  • Page Numbers: pp.325-333

Abstract

Although term weighting is typically used to improve the performance of text classification, it may not provide consistent results when the data distribution is imbalanced. This paper presents a probability-based term weighting approach that addresses different aspects of the class imbalance problem in text classification. We propose two term evaluation functions, called PNF and PNF 2, which can produce more influential weights on imbalanced data sets. These functions determine the significance of a term in association with a particular category. This is a crucial point because, on the one hand, a frequent term is more important than a rare term in a particular category according to the feature selection approach, and, on the other hand, a rare term is no less important than a frequent term under the idf assumption of the traditional term weighting approach. Incorporating these two approaches at the same time is the main idea that makes the proposed functions superior to other weighting methods. Results from experiments carried out on two popular benchmarks (Reuters-21578 and WebKB) demonstrate that the probability-based term weighting approach yields more consistent results than the other methods on imbalanced data sets.
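The two signals contrasted in the abstract can be illustrated with a small sketch. It is not the paper's definition of PNF or PNF 2, which the abstract does not give. The helper names (`idf`, `class_probs`, `contrast_weight`), the toy corpus, and the normalized-difference formula are all assumptions made for illustration. The sketch computes the classic idf of a term alongside document-level estimates of P(term | category) and P(term | other categories) on an imbalanced toy set.

```python
import math

def idf(term, docs):
    """Classic inverse document frequency: log(N / df)."""
    df = sum(1 for words, _ in docs if term in words)
    return math.log(len(docs) / df) if df else 0.0

def class_probs(term, docs, category):
    """Estimate P(term | category) and P(term | not category) from document counts."""
    pos = [words for words, label in docs if label == category]
    neg = [words for words, label in docs if label != category]
    p_pos = sum(term in w for w in pos) / len(pos) if pos else 0.0
    p_neg = sum(term in w for w in neg) / len(neg) if neg else 0.0
    return p_pos, p_neg

def contrast_weight(term, docs, category):
    """Hypothetical normalized contrast in [-1, 1]; an assumption, NOT the paper's PNF."""
    p_pos, p_neg = class_probs(term, docs, category)
    total = p_pos + p_neg
    return (p_pos - p_neg) / total if total else 0.0

# Toy imbalanced corpus (invented): 2 "grain" documents versus 8 "other" documents.
docs = [
    ({"wheat", "export"}, "grain"),
    ({"wheat", "price"}, "grain"),
    ({"stock", "price"}, "other"),
    ({"oil", "price"}, "other"),
    ({"bank", "price"}, "other"),
    ({"gold", "price"}, "other"),
    ({"stock", "bank"}, "other"),
    ({"oil", "export"}, "other"),
    ({"gold", "bank"}, "other"),
    ({"stock", "oil"}, "other"),
]

for term in ("wheat", "price"):
    print(term, round(idf(term, docs), 3), contrast_weight(term, docs, "grain"))
```

In this toy set, "wheat" is rare across the whole collection, so idf rewards it, and it is also frequent inside the minority category, so the class-conditional contrast rewards it as well. "price" is equally likely in both groups and gets a neutral contrast. Because the probabilities are estimated per class, the minority category's small size does not by itself shrink the weight.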