Improving the performance of Hadoop Hive by sharing scan and computation tasks

Dokeroglu, Tansel; Ozal, Serkan; Bayir, Murat; Cinar, Muhammet; COŞAR, AHMET

doi:10.1186/s13677-014-0012-6

Dokeroglu T., Ozal S., Bayir M. A., Cinar M. S., COŞAR A.

JOURNAL OF CLOUD COMPUTING-ADVANCES SYSTEMS AND APPLICATIONS, vol.3, 2014 (ESCI)

Publication Type: Article / Full Article
Volume: 3
Publication Date: 2014
DOI: 10.1186/s13677-014-0012-6
Journal Name: JOURNAL OF CLOUD COMPUTING-ADVANCES SYSTEMS AND APPLICATIONS
Journal Indexes: Emerging Sources Citation Index (ESCI), Scopus
Hacettepe University Affiliated: Yes

Abstract

MapReduce is a popular programming model for executing time-consuming analytical queries as a batch of tasks on large-scale data clusters. In environments where multiple queries with similar selection predicates, common tables, and join tasks arrive simultaneously, many opportunities can arise for sharing scan and/or join computation tasks. Executing common tasks only once can remarkably reduce the total execution time of a batch of queries. In this study, we propose a Multiple Query Optimization framework, SharedHive, to improve the overall performance of Hadoop Hive, an open-source SQL-based data warehouse using MapReduce. SharedHive transforms a set of correlated HiveQL queries into a new set of insert queries that produce all of the required outputs within a shorter execution time. Experiments show that SharedHive achieves significant reductions in the total execution times of TPC-H queries.
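
The scan-sharing idea in the abstract can be illustrated with a small sketch. It is not the SharedHive algorithm itself; it only assumes that queries reading the same input table can be rewritten into one Hive multi-table insert statement (`FROM ... INSERT OVERWRITE TABLE ... SELECT ...`), so that the table is scanned once. The `SimpleQuery` class, the `merge_shared_scans` function, and the output table names (`out_q1`, `out_q2`, `out_q3`) are hypothetical and exist only for this example; the input tables and columns follow the TPC-H schema.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class SimpleQuery:
    """A single-table query: SELECT columns FROM table WHERE predicate."""
    table: str      # input table to scan
    columns: str    # projection list
    predicate: str  # selection predicate
    output: str     # hypothetical output table that receives the result


def merge_shared_scans(queries):
    """Group queries by input table and emit one multi-insert statement per table."""
    by_table = defaultdict(list)
    for q in queries:
        by_table[q.table].append(q)

    statements = []
    for table, group in by_table.items():
        # One FROM clause means one scan of the table for the whole group.
        lines = [f"FROM {table}"]
        for q in group:
            lines.append(
                f"INSERT OVERWRITE TABLE {q.output} "
                f"SELECT {q.columns} WHERE {q.predicate}"
            )
        statements.append("\n".join(lines) + ";")
    return statements


# Assumed example batch: two queries share a scan of lineitem, one reads orders.
batch = [
    SimpleQuery("lineitem", "l_orderkey, l_quantity", "l_quantity > 40", "out_q1"),
    SimpleQuery("lineitem", "l_orderkey, l_discount", "l_discount < 0.05", "out_q2"),
    SimpleQuery("orders", "o_orderkey", "o_totalprice > 1000", "out_q3"),
]
merged = merge_shared_scans(batch)
for stmt in merged:
    print(stmt)
```

In this sketch the three input queries become two statements, and `lineitem` is read once instead of twice. Sharing join computation, which the abstract also mentions, would need a richer query representation than the single-table form assumed here.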