Incorporating word embeddings in unsupervised morphological segmentation

Ustun, Ahmet; CAN BUĞLALILAR, BURCU

doi:10.1017/s1351324920000406

Ustun A., CAN BUĞLALILAR B.

NATURAL LANGUAGE ENGINEERING, vol.27, no.5, pp.609-629, 2021 (SCI-Expanded)

Publication Type: Article / Full Article
Volume: 27 Issue: 5
Publication Date: 2021
DOI: 10.1017/s1351324920000406
Journal Name: NATURAL LANGUAGE ENGINEERING
Journal Indexes: Science Citation Index Expanded (SCI-EXPANDED), Arts and Humanities Citation Index (AHCI), Social Sciences Citation Index (SSCI), Scopus, Applied Science & Technology Source, Compendex, Computer & Applied Sciences, INSPEC, Linguistics & Language Behavior Abstracts, Psycinfo, DIALNET
Page Numbers: pp.609-629
Keywords: Morphological segmentation, Unsupervised learning, Bayesian learning, Low-resource language
Hacettepe University Affiliated: Yes

Abstract

We investigate the use of semantic information for morphological segmentation, since words that are derived from each other remain semantically related. We use mathematical models such as maximum likelihood estimation (MLE) and maximum a posteriori (MAP) estimation, incorporating semantic information obtained from dense word vector representations. Our approach does not require any annotated data, which makes it fully unsupervised, and requires only a small amount of raw data together with pretrained word embeddings for training. The results show that using dense vector representations helps morphological segmentation, especially for low-resource languages. We present results for Turkish, English, and German. Our semantic MLE model outperforms other unsupervised models for Turkish. Our proposed models could also be used for any other low-resource language with concatenative morphology.
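
The intuition that a derived word stays semantically close to its stem can be illustrated with a minimal sketch. This is not the MLE or MAP model from the article: it only scores each candidate stem by the cosine similarity between its vector and the vector of the full word. The vectors in `TOY_EMBEDDINGS` are invented for illustration, and the names `cosine` and `best_split` are hypothetical. In practice the vectors would come from pretrained word embeddings.

```python
import math

# Toy 3-dimensional vectors, invented for illustration only.
# Real use would load pretrained word embeddings instead.
TOY_EMBEDDINGS = {
    "walk": [0.9, 0.1, 0.0],
    "walking": [0.8, 0.2, 0.1],
    "wal": [0.0, 0.1, 0.9],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def best_split(word, embeddings):
    """Return (stem, suffix, similarity) for the split point whose stem
    vector is most similar to the vector of the whole word.

    Falls back to (word, "", 0.0) when no candidate stem has a vector
    with positive similarity.
    """
    best = (word, "", 0.0)
    if word not in embeddings:
        return best
    for i in range(1, len(word)):
        stem, suffix = word[:i], word[i:]
        if stem in embeddings:
            sim = cosine(embeddings[stem], embeddings[word])
            if sim > best[2]:
                best = (stem, suffix, sim)
    return best


stem, suffix, _ = best_split("walking", TOY_EMBEDDINGS)
print(stem, suffix)  # walk ing
```

With these toy vectors, the spurious stem "wal" is rejected because its vector points in a different direction from that of "walking", while "walk" is accepted. A full model would combine such a semantic score with probabilities over stems and suffixes rather than rely on similarity alone.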