Synset expansion on translation graph for automatic wordnet construction


ERCAN G. , Haziyev F.

INFORMATION PROCESSING & MANAGEMENT, cilt.56, ss.130-150, 2019 (SCI İndekslerine Giren Dergi) identifier identifier

  • Cilt numarası: 56 Konu: 1
  • Basım Tarihi: 2019
  • Doi Numarası: 10.1016/j.ipm.2018.10.002
  • Dergi Adı: INFORMATION PROCESSING & MANAGEMENT
  • Sayfa Sayıları: ss.130-150

Özet

Research on clustering algorithms in synonymy graphs of a single language yields promising results, however, this idea is not yet explored in a multilingual setting. Nevertheless, moving the problem to a multilingual translation graph enables the use of more clues and techniques not possible in a monolingual synonymy graph. This article explores the potential of sense induction methods in a massively multilingual translation graph. For this purpose, the performance of graph clustering methods in synset detection are investigated. In the context of translation graphs, the use of existing Wordnets in different languages is an important clue for synset detection which cannot be utilized in a monolingual setting. Casting the problem into an unsupervised synset expansion task rather than a clustering or community detection task improves the results substantially. Furthermore, instead of a greedy unsupervised expansion algorithm guided by heuristics, we devise a supervised learning algorithm able to learn synset expansion patterns from the words in existing Wordnets to achieve superior results. As the training data is formed of already existing Wordnets, as opposed to previous work, manual labeling is not required. To evaluate our methods, Wordnets for Slovenian, Persian, German and Russian are built from scratch and compared to their manually built Wordnets or labeled test-sets. Results reveal a clear improvement over 2 state-of-the-art algorithms targeting massively multilingual Wordnets and competitive results with Wordnet construction methods targeting a single language. The system is able to produce Wordnets from scratch with a Wordnet base concept coverage ranging from 20% to 88% for 51 languages and expands existing Wordnets up to 30%.