Improving the performance of Hadoop Hive by sharing scan and computation tasks


Dokeroglu T., Ozal S., Bayir M. A., Cinar M. S., COŞAR A.

JOURNAL OF CLOUD COMPUTING-ADVANCES SYSTEMS AND APPLICATIONS, vol.3, 2014 (ESCI)

Abstract

MapReduce is a popular programming model for executing time-consuming analytical queries as a batch of tasks on large-scale data clusters. In environments where multiple queries with similar selection predicates, common tables, and join tasks arrive simultaneously, many opportunities arise for sharing scan and/or join computation tasks. Executing common tasks only once can substantially reduce the total execution time of a batch of queries. In this study, we propose a Multiple Query Optimization framework, SharedHive, to improve the overall performance of Hadoop Hive, an open-source SQL-based data warehouse built on MapReduce. SharedHive transforms a set of correlated HiveQL queries into a new set of insert queries that produce all of the required outputs within a shorter execution time. Experiments show that SharedHive achieves significant reductions in the total execution times of TPC-H queries.
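The idea of merging correlated queries into insert queries can be illustrated with a minimal Python sketch. It is not SharedHive's actual algorithm, which the abstract does not detail. It assumes the simplest case, single-table queries that differ only in their selected columns and predicates, and rewrites those that scan the same table into one Hive multi-insert statement, so the table is read once. The `SimpleQuery` class, the `merge_shared_scans` function, and the output table names are hypothetical, introduced only for illustration. The column names follow the TPC-H schema.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class SimpleQuery:
    """A single-table query: SELECT columns FROM table WHERE predicate."""
    output_table: str  # hypothetical destination table for this query's result
    columns: str
    table: str
    predicate: str


def merge_shared_scans(queries):
    """Group queries by scanned table and emit one multi-insert per table.

    Assumption: queries on the same table can share a single scan. Joins,
    aggregations, and cost-based decisions are deliberately ignored here.
    """
    by_table = defaultdict(list)
    for q in queries:
        by_table[q.table].append(q)

    statements = []
    for table, group in by_table.items():
        # One FROM clause (one scan) feeds several INSERT ... SELECT branches.
        lines = [f"FROM {table}"]
        for q in group:
            lines.append(
                f"INSERT OVERWRITE TABLE {q.output_table} "
                f"SELECT {q.columns} WHERE {q.predicate}"
            )
        statements.append("\n".join(lines) + ";")
    return statements


queries = [
    SimpleQuery("out_q1", "l_orderkey", "lineitem", "l_quantity > 30"),
    SimpleQuery("out_q2", "l_partkey", "lineitem", "l_discount < 0.05"),
    SimpleQuery("out_q3", "o_orderkey", "orders", "o_totalprice > 1000"),
]

merged = merge_shared_scans(queries)
for stmt in merged:
    print(stmt)
    print()
```

Here three input queries become two statements: the two `lineitem` queries share one scan, while the `orders` query is left on its own. A real optimizer would also have to decide when merging pays off, which this sketch does not attempt.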