Neural Text Normalization for Turkish Social Media


Goker S., CAN BUĞLALILAR B.

3rd International Conference on Computer Science and Engineering (UBMK), Sarajevo, Bosnia and Herzegovina, 20-23 September 2018, pp. 161-166

  • Publication Type: Conference Paper / Full-Text Paper
  • Volume Number:
  • DOI Number: 10.1109/ubmk.2018.8566406
  • City of Publication: Sarajevo
  • Country of Publication: Bosnia and Herzegovina
  • Page Numbers: pp. 161-166
  • Affiliated with Hacettepe University: Yes

Abstract

Social media has become a rich data source for natural language processing (NLP) tasks owing to its worldwide use; however, its informal nature makes the data hard to process. Text normalization is the task of transforming noisy text into its canonical form. It generally serves as a preprocessing step for other NLP tasks applied to noisy text. In this study, we apply two approaches to Turkish text normalization: a Contextual Normalization approach that uses distributed representations of words, and a Sequence-to-Sequence Normalization approach that uses neural encoder-decoder models. Because the existing approaches for Turkish and for other languages are mostly rule-based, new rules must be added to the normalization model to detect new error patterns that arise as language use in social media changes. In contrast, the proposed approaches can normalize error patterns that change over time: the normalization model is updated by training on a new dataset. The proposed methods therefore address the dependency on language change in social media without requiring new rules to be defined.