HAIL – Only Aggressive Elephants are Fast Elephants

Typically we store data based on any one of the different physical layouts (such as row, column, vertical, PAX etc). And this choice determines its suitability for a certain kind of workload while making it less optimal for other kinds of workloads. Can we store data under different layouts at the same time? Especially within a HDFS environment where each block is replicated a few times. This is the big idea that HAIL (Hadoop Aggressive Indexing Library) pursues.

At a very high level it looks like to understand the working of HAIL we will have to look at the three distinct workflows the system is organized around namely –
1) The data/file upload pipeline
2) The indexing pipeline
3) The query pipeline

Every unit of information makes its journey through these three pipelines.

The Upload Pipeline

Here is where we begin. Typically one uploads the file to be analyzed into HDFS first and then executes a set of MR jobs on the Hadoop cluster. With HAIL you use a Hail client to upload the file. The client parses the contents of the file based on newlines and splits it up into blocks such that no row spans across blocks. Additionally the user can also specify a schema while uploading the file (much like in PIG). HAIL then converts all data blocks to a binary PAX representation. The good thing about this pipeline is that the whole transformation happens as the file is being written on HDFS. The blocks are not re-read from HDFS which will cause a lot of extra I/O. This significantly improves the write performance.

The client then contacts the Name node to get a list of data nodes. It then sends chunked PAX blocks to the first data node. When the data node receives the packet it immediately forwards the same packet to the next data node. It does this without flushing the contents of the packets and the checksum to the disk. This is the same with every data node that receives the packet. Contents are not flushed to the disk immediately.

On receiving a whole block worth of contents the each data node sorts the contents and creates indexes based on the specification of sort order and indexes by the user. Each data node sorts the data in a different order. All the index metadata, the sorted data and checksums etc form what is known as a HAIL block.

Within HAIL its vital that the MR jobs run such that they are able to leverage the indexing thats happened on the data nodes. So the tasks have to be scheduled on the data node that has the most suitable index. In order to enable this the sort/index metadata has to be stored at the name node level. An instance of HAILBlockReplicaInfo contains detailed information about the types of available indexes for a replica, i.e. indexing key, index type, size, start offsets etc.

The Indexing Pipeline

The basic purpose of the indices is to get to the relevant blocks by scanning the index first. After experimenting with a few different types of indexes they seem to have concluded on using a sparse clustered B+ tree based index. The column that needs to be indexed is first sorted in memory and then the index tree is written to the disk on to a single directory. Note that the index is not a multi-level index. The paper gives some back of the envelope calculations for this choice.

The Query Pipeline

Much of the MR job continues to be written just as before but with some interfaces changed. Firstly the InputFormat implementation used is HailInputFormat. The other nicety of this framework is the typical task of filtering the records which is carried out within the map function can be delegated to HAIL. You can annotate the map function with @HailQuery annotation where you may declaratively specify the projected attributes and the filtering condition.

@HailQuery(filter="@3 between(1999-01-01,2000-01-01)", projection={@1})
void map(Text key, HailRecord v) { ... }

The HailRecordReader which collaborates with HailInputFormat is the component that applies the predicate to filter out the qualifying records. Lastly the value passed to the map function is a HailRecord object.

Summing it up
HAIL tries to support per-replica indexes in an efficient way and without significant changes in the standard execution pipeline. It tries to achieve much of this by providing alternate implementations of the InputFormat and RecordReader interfaces along with a custom splitting policy. HAIL improves upload and query times without impacting the failover properties of Hadoop and minimal change to the map reduce programming interface.

Link to the original paper