Comparison of Textual Data Augmentation Methods on SST-2 Dataset


Çataltaş M., Baykan N. A., Çiçekli İ.

2nd International Congress of Electrical and Computer Engineering, ICECENG 2023, Bandirma, Türkiye, 22-25 November 2023, pp. 189-201

  • Publication Type: Conference Paper / Full-Text Paper
  • DOI: 10.1007/978-3-031-52760-9_14
  • City of Publication: Bandirma
  • Country of Publication: Türkiye
  • Pages: pp. 189-201
  • Keywords: Data augmentation, Natural language processing, Text generation
  • Affiliated with Hacettepe University: Yes

Abstract

Since the arrival of advanced deep learning models, more successful techniques have been proposed, significantly improving performance on nearly all natural language processing tasks. Although these models achieve the best results, they need large datasets to do so. Collecting data in large amounts is challenging, however, and is not feasible for every task. Data augmentation can therefore help meet the need for large datasets by generating synthetic samples from the original ones. This study aims to guide those working in this field by comparing training on a large dataset as a whole with training on smaller subsets of it augmented at different ratios. To this end, it compares three textual data augmentation techniques and examines their efficacy in light of their augmentation mechanisms. In empirical evaluations on the Stanford Sentiment Treebank (SST-2) dataset, the sampling-based method LAMBADA performed best in low-data regimes and also outperformed the other methods as the augmentation ratio increased, offering significant improvements in model robustness and accuracy. These findings offer researchers insight into augmentation strategies that may improve generalization in future work.