HadoopDB: Efficient Processing of Data Warehousing Queries in a Split Execution Environment

Hadapt and HadoopDB have generated buzz for a while now: HadoopDB is one of the first systems to combine ideas from two different approaches to large-scale data storage and analysis, namely shared-nothing parallel databases and MapReduce.

This early paper that introduced HadoopDB crisply summarizes some reasons why parallel database solutions haven’t scaled to hundreds of machines. The reasons include –
1. As the number of nodes in a system increases, failures become more common.
2. Parallel databases usually assume a homogeneous array of machines, an assumption that becomes impractical as the number of machines rises.
3. They have not been tested at larger scales, as until recently applications haven’t demanded more than tens of nodes for performance.

Based on this reasoning the paper outlines some key characteristics that systems meant to handle analytical workloads have to demonstrate. These are –
1. Performance – the number of compute units required to perform an analysis task should be as low as possible.
2. Fault tolerance – in the context of analytical workloads, a distributed query should not have to be restarted from scratch when one of the nodes processing it fails.
3. The ability to run in a heterogeneous environment – the system should be sensitive to the actual computational capabilities of each node when distributing a query for execution. Without this ability, the time taken to execute a distributed query is limited by the slowest node.
4. A flexible query interface – the system should work with commonly used BI tools, such as those for ad-hoc querying and dashboarding, and should be extensible through user-defined functions.

HadoopDB Design & Architecture
The basic idea behind HadoopDB is to build the system out of single-node databases, with Hadoop coordinating query execution across the nodes. Hadoop acts as the coordinator by taking control of scheduling and job tracking, while as much of the query’s heavy lifting as possible is pushed down into the databases themselves to maximize performance.

Building Blocks
The whole system is engineered as an extension of Hadoop, with custom components added to a) connect to and interact with a single-node database, b) read metadata about the different database nodes, c) repartition and bulk load data, and d) plan queries, translate them into MR jobs and execute them.

The Database Connector – This component is implemented as a custom InputFormat. An InputFormat encapsulates the mechanism of reading data from a data source and translating it into a key-value format that the Mappers can consume.
Each MapReduce job supplies a SQL query and connection parameters to the connector. Using these, the connector executes the query against a single-node database instance and translates the result set into key-value pairs.
The initial version of the connector only supported accessing a group of co-partitioned tables in a single database schema. Later, the ability to access tables across multiple schemas in a single Map job was added, followed by the ability to consume inputs from both database tables and HDFS files. All query execution beyond the Map phase is carried out within Hadoop.
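To make the mechanism concrete, here is a minimal sketch of the RecordReader that such a connector’s InputFormat could hand back: it runs a SQL query over JDBC and exposes the rows as key-value pairs. This is not HadoopDB’s actual code; the class name, the configuration keys and the tab-separated row encoding are assumptions made for illustration.

```java
import java.io.IOException;
import java.sql.*;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;

// Hypothetical sketch of a database connector's RecordReader: it executes a SQL
// query over JDBC and hands each row to the Mapper as a (row number, row text) pair.
public class DatabaseRecordReader extends RecordReader<LongWritable, Text> {
  private Connection conn;
  private ResultSet results;
  private long rowCount = 0;
  private final LongWritable key = new LongWritable();
  private final Text value = new Text();

  @Override
  public void initialize(InputSplit split, TaskAttemptContext context) throws IOException {
    Configuration conf = context.getConfiguration();
    try {
      // The query and connection parameters are supplied per job through the configuration.
      conn = DriverManager.getConnection(
          conf.get("hadoopdb.connector.url"),
          conf.get("hadoopdb.connector.user"),
          conf.get("hadoopdb.connector.password"));
      results = conn.createStatement().executeQuery(conf.get("hadoopdb.connector.query"));
    } catch (SQLException e) {
      throw new IOException(e);
    }
  }

  @Override
  public boolean nextKeyValue() throws IOException {
    try {
      if (!results.next()) {
        return false;
      }
      // Key: running row number. Value: the row rendered as a tab-separated string.
      key.set(rowCount++);
      StringBuilder row = new StringBuilder();
      int cols = results.getMetaData().getColumnCount();
      for (int i = 1; i <= cols; i++) {
        if (i > 1) row.append('\t');
        row.append(results.getString(i));
      }
      value.set(row.toString());
      return true;
    } catch (SQLException e) {
      throw new IOException(e);
    }
  }

  @Override public LongWritable getCurrentKey() { return key; }
  @Override public Text getCurrentValue() { return value; }
  @Override public float getProgress() { return 0.0f; } // result size is unknown up front

  @Override
  public void close() throws IOException {
    try {
      if (conn != null) conn.close();
    } catch (SQLException e) {
      throw new IOException(e);
    }
  }
}
```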

The Catalog – This plays a role very similar to the NameNode’s within HDFS. It contains information about the different single-node database instances, data access statistics, partitioning properties etc. It is implemented as an XML file stored on HDFS and is accessed by the JobTracker and TaskTrackers.
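To give a flavor of what the catalog holds, a per-node entry might look roughly like the following. The element and attribute names here are invented for illustration; only the kind of information recorded (connection details and partitioning properties per node) comes from the description above.

```xml
<!-- Hypothetical catalog entry; the element names are illustrative only. -->
<node id="node-17">
  <driver>org.postgresql.Driver</driver>
  <url>jdbc:postgresql://node-17:5432/warehouse</url>
  <relations>
    <relation name="lineitem" partitionKey="l_orderkey" chunks="20"/>
    <relation name="orders" partitionKey="o_orderkey" chunks="20"/>
  </relations>
</node>
```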

The Data Loader – It is responsible for bringing data into the system: it repartitions incoming data and distributes it amongst the single-node databases in the cluster according to the partitioning policy.
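As a rough illustration of what a partitioning policy can look like in practice, the sketch below hash-partitions records on a key so that each record deterministically lands on one node’s database. This is an assumed, simplified policy, not HadoopDB’s actual loader logic.

```java
// Hypothetical hash-partitioning step of a data loader: each record is routed
// to one of N single-node databases based on the hash of its partitioning key.
public final class HashPartitioner {
  private final int numNodes;

  public HashPartitioner(int numNodes) {
    this.numNodes = numNodes;
  }

  /** Returns the index of the node whose local database should store this record. */
  public int nodeFor(String partitionKey) {
    // Mask off the sign bit so the modulo result is always non-negative.
    return (partitionKey.hashCode() & Integer.MAX_VALUE) % numNodes;
  }

  public static void main(String[] args) {
    HashPartitioner partitioner = new HashPartitioner(10);
    // Rows with the same key always land on the same node, which is what allows
    // co-partitioned tables to be joined locally within a single database.
    System.out.println(partitioner.nodeFor("order-42")); // some node in [0, 9]
    System.out.println(partitioner.nodeFor("order-42")); // the same node again
  }
}
```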

The Query Interface – It is the front end that allows users to interact with the system using SQL. Its primary responsibility is to translate a query into a series of MR jobs, execute them and return the results. In HadoopDB, much of this layer is built by adapting Hive’s query interface layer.
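For a feel of how a SQL query might be split between the databases and Hadoop, consider the simple aggregation below. Both the query and the split described in the comments are illustrative examples, not workloads or plans taken from the paper.

```sql
-- Hypothetical query submitted through the SQL front end.
-- A pushdown-friendly plan would let each single-node database perform the scan,
-- filter and partial aggregation over its own partition (the Map side), and leave
-- the final merge of the per-node partial sums to Hadoop's Reduce phase.
SELECT l_returnflag, SUM(l_quantity) AS total_qty
FROM lineitem
WHERE l_shipdate <= DATE '1998-09-01'
GROUP BY l_returnflag;
```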

The rest of the paper discusses the techniques used to optimize and execute warehouse queries that are split across Hadoop and the single-node database instances.
