Layout-based computation of web page similarity ranks

BOZKIR A. S., Sezer E. A.

INTERNATIONAL JOURNAL OF HUMAN-COMPUTER STUDIES, vol.110, pp.95-114, 2018 (SCI-Expanded) identifier identifier


In this paper, we propose a ranking approach which considers visual similarities among web pages by using structure and vision-based features. Throughout the study, we aim to understand and represent the web page visual structure as in the way people do by focusing on the layout similarity through the wireframe design. The conducted study is composed of two parts. In the first part, structural similarities are analyzed with the proposed concept of "layout components" along with visual inspection of DOM trees. In this way, five types of structural layout components are proposed and revealed. Moreover, whitespaces are also utilized since they are important visual cues in the visual perception of web pages. In the second part, a computer-vision based method named histogram of oriented gradients (HOG) is employed to reveal local visual cues in terms of edge orientations. Following the feature extraction phases, extracted feature histograms are mapped on spatial information preserving multilevel and multi-resolution bag of features representation method named spatial pyramid matching. In this way, three goals were achieved: (1) the visual layout of web pages were mapped and compared in a multi resolution schema; (2) the intermediate process of visual segmentation was removed; and (3) efficient and easily comparable web page layout signatures were generated. We also conducted a questionnaire study covering 312 subjects. This helped us to create a benchmark dataset involving similarity scores collected from individuals. So far, there exists no web page layout similarity ranking oriented corpus in the literature. Our suggested approach achieved a remarkable ranking performance at top-5 and top-10 retrieval results. According to the findings of the comparative study, our approach outperforms some structure and vision-based studies in the literature. With this achievement, web pages could be employed as a query item to find other, similar web pages by taking into consideration that they are web pages, instead of images or anything else. (C) 2017 Elsevier Ltd. All rights reserved.