31st IEEE Conference on Signal Processing and Communications Applications, SIU 2023, İstanbul, Turkey, 5-8 July 2023
This study addresses the alignment of video-text and image-text datasets. First, similarities are computed between the texts of the two datasets. A retrieval setup based on visual similarities is then applied to the subset selected by these text similarities. A BERT-based embedding method is applied to both the raw and the pure texts. As visual features, object-based and CLIP-based methods are used to represent video frames. According to the results, alignment with CLIP features performs best on the subset created by filtering with raw text.
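
The two-stage procedure described above (text-based filtering followed by visual retrieval) can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state the similarity measure or how the subset is formed, so cosine similarity, the per-video `top_k` cutoff, and the function names `cosine` and `align` are all assumptions made for illustration. The input vectors stand in for precomputed BERT text embeddings and CLIP (or object-based) visual features.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (assumed measure)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def align(video_text, image_text, video_visual, image_visual, top_k=2):
    """Return, for each video, the index of the image it is aligned with.

    Stage 1: keep the top_k images whose text embedding is most similar
             to the video's text embedding (hypothetical subset rule).
    Stage 2: within that subset, pick the image whose visual feature is
             most similar to the video's visual feature.
    """
    matches = []
    for v_txt, v_vis in zip(video_text, video_visual):
        # Rank all images by text similarity, highest first.
        ranked = sorted(
            range(len(image_text)),
            key=lambda i: cosine(v_txt, image_text[i]),
            reverse=True,
        )
        candidates = ranked[:top_k]
        # Visual retrieval restricted to the text-filtered candidates.
        best = max(candidates, key=lambda i: cosine(v_vis, image_visual[i]))
        matches.append(best)
    return matches
```

In this sketch an image that is visually close to a video but textually unrelated is never returned, because it is removed in the first stage before visual similarities are compared.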