An external plagiarism detection system based on part-of-speech (POS) tag n-grams and word embedding


Yalcin K., ÇİÇEKLİ İ., ERCAN G.

EXPERT SYSTEMS WITH APPLICATIONS, vol. 197, 2022 (SCI-Expanded)

  • Publication Type: Article / Full Article
  • Volume: 197
  • Publication Date: 2022
  • DOI: 10.1016/j.eswa.2022.116677
  • Journal Name: EXPERT SYSTEMS WITH APPLICATIONS
  • Indexes Covering the Journal: Science Citation Index Expanded (SCI-EXPANDED), Scopus, Academic Search Premier, PASCAL, Aerospace Database, Applied Science & Technology Source, Communication Abstracts, Compendex, Computer & Applied Sciences, INSPEC, Metadex, Public Affairs Index, Civil Engineering Abstracts
  • Keywords: Plagiarism detection, part-of-speech (POS) tagging, N-grams, Semantic similarity, Word embedding, TEXT
  • Hacettepe University Affiliated: Yes

Abstract

The aim of this paper is to present an automatic plagiarism detection system that identifies plagiarized passages in documents. The system uses both syntactic and semantic similarities to identify these passages. The novelty of the proposed method lies in its use of part-of-speech tag n-grams (POSNG), which can reveal syntactic similarities between source and suspicious sentences. Each source document is indexed by its part-of-speech (POS) tag n-grams in a search engine so that sentences that are possible plagiarism candidates can be retrieved rapidly. Although the system obtains very good results using POS tag n-grams alone, its performance is further improved by adding semantic similarities. The semantic relatedness between words is measured with the word embedding technique Word2Vec, and the longest common subsequence approach is used to measure the semantic similarity between source and suspicious sentences. There are several types of plagiarism, such as verbatim, paraphrasing, source code, and cross-lingual. High obfuscation paraphrasing is one of these types, and its detection is one of the most difficult plagiarism detection tasks. Our proposed method, which is based on POS tag n-grams, improves detection performance on the high obfuscation paraphrasing type; this is the main contribution of the paper. For this study, we use the large PAN-PC-11 dataset, which was created for the evaluation of automatic plagiarism detection algorithms. Our experiments are conducted on the four obfuscation types in PAN-PC-11: none, low, high, and simulated. We defined various threshold and parameter settings in order to assess the diversity of our results. We compared the performance of our method with the plagiarism detectors in the 3rd International Competition on Plagiarism Detection (PAN11). According to the experimental results, the proposed method achieved the best performance in terms of the plagdet measure on the high and low obfuscation paraphrasing types and produced competitive results on the other types.
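
The two ideas named in the abstract, POS tag n-grams and a longest common subsequence that counts semantically related words as matches, can be illustrated with a minimal sketch. It is not the authors' implementation. The function names, the Penn Treebank style tags, the two-dimensional toy vectors standing in for Word2Vec embeddings, and the 0.8 cosine threshold are all assumptions made for illustration.

```python
from math import sqrt

def pos_ngrams(tags, n=3):
    """Return the POS tag n-grams (as tuples) of one already-tagged sentence."""
    return [tuple(tags[i:i + n]) for i in range(len(tags) - n + 1)]

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def soft_lcs_length(src, sus, vectors, threshold=0.8):
    """LCS length where two words match if they are identical or if the
    cosine similarity of their vectors reaches the (assumed) threshold."""
    def match(a, b):
        if a == b:
            return True
        if a in vectors and b in vectors:
            return cosine(vectors[a], vectors[b]) >= threshold
        return False

    # Standard dynamic-programming table for the longest common subsequence.
    table = [[0] * (len(sus) + 1) for _ in range(len(src) + 1)]
    for i, a in enumerate(src, 1):
        for j, b in enumerate(sus, 1):
            if match(a, b):
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[len(src)][len(sus)]

# Toy stand-ins for Word2Vec embeddings (hypothetical values).
toy_vectors = {
    "car": [1.0, 0.0], "automobile": [0.9, 0.1],
    "quickly": [0.0, 1.0], "fast": [0.1, 0.9],
}
source = ["the", "car", "moves", "quickly"]
suspicious = ["the", "automobile", "moves", "fast"]

print(pos_ngrams(["DT", "NN", "VBZ", "RB"], n=3))
print(soft_lcs_length(source, suspicious, toy_vectors))  # 4: all words align
print(soft_lcs_length(source, suspicious, {}))           # 2: exact matches only
```

In this toy example both sentences share the same POS tag sequence, so their tag n-grams coincide even though half of the words differ. This is the kind of syntactic signal the abstract attributes to POSNG. The embedding-aware subsequence then recovers the word-level correspondence that exact matching misses.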