MinuteSort with Flat Datacenter Storage

Its been a couple of days since Microsoft has been in the news as the one to beat the previous data sorting record held by Hadoop by sorting 1,401Gb of data in a minute. All news articles came along with a mention of two new terms namely, Flat Datacenter Storage and Full Bisection Bandwidth Networks. The enquiry to understand what these terms meant led me to this paper which makes an attempt at describing it. Flat Datacenter Storage (FDS) happens to be a high-performance distributed blob storage system. Just the kind of system I would love to cover for this blog!

Data is stored on dedicated storage nodes, called tractservers (comparable to HFDS name nodes). The tractserver is a network front-end to a single disk; machines with multiple disks have one tractserver running per disk.
User code does not run on tractservers; applications can only retrieve data from or write data to tractservers over the network. In FDS, there is no such thing as a local file.

Contrary to the idea of moving compute to the storage nodes this system works by always sending data over the network. It manages the cost of data transport by
a) Giving each storage node network bandwidth that matches its storage bandwidth
b) Interconnecting storage nodes and compute nodes using a full bisection bandwidth network

This combination produces an uncongested path from remote disks to CPUs, giving the system an aggregate I/O bandwidth essentially equivalent to a system such as MapReduce that uses local storage. FDS also supports data replication for failure recovery.

In FDS, data is logically stored in blobs. A blob is a byte sequence named with a 128-bit GUID. Blobs can be any length, limited in size only by the system’s storage capacity. Reads from and writes to a blob are done in units called tracts. Each tract within a blob is numbered sequentially starting from 0. Tracts in FDS are about 8MB. The FDS API defines simple CRUD operations to interact with a Blob. All calls in the API are non-blocking. Consequently the API also takes in callback function that is invoked after the operation completes.

By spreading a blob’s tracts over many tractservers and issuing many requests in parallel, many tractservers can begin reading data off disk and transferring it back to a processing node in parallel. Deep read-aheads enable a tract to be read off disk into the tractserver’s cache while the previous one is being transferred over the network.

Does it have a SPOF?

A single central metadata server that should be consulted to learn about where the data is placed is a common design pattern in distributed storage systems. Writers contact the metadata server to find out where to write a new block; the metadata server picks a data server, durably stores that decision and returns it to the writer. Readers contact the metadata server to find out which servers store the blocks to be read. This approach turns the metadata server into a SPOF as it is always on the critical path for all reads and writes.

This system too has a metadata server, except that its role during normal operations is simple and limited: collect a list of the system’s active tractservers and distribute information about them to clients. This list known as the tract locator table (TLT), is first retrieved from the metadata server when a client starts. The metadata server stores only metadata about the hardware configuration, not about files.

When it wants to read or write a tract, it first computes a tract locator. The simplest tract locator is the sum of the 128-bit blob GUID to be read and the 64-bit tract number to be read, modulo the number of entries in the TLT. Indexing the tract locator into the TLT yields the tractserver to which that tract read or write should be issued. The TLT changes only in response to cluster reconfiguration and not individual CRUD operations. It can thus be cached by clients for a long time.

Since the tractservers remember their position in the table, the metadata server stores no durable state; in case of a metadata server failure, the TLT is reconstructed by contacting each tractserver. The TLT is never modified due to reads and writes.
Also the TLT contains random permutations of the list of tractservers. This increases the chances of sequential reads and writes by independent clients utilizing all tractservers uniformly. The TLTs independent permutations prevent clients from organizing into synchronized convoys.

Section 2.2.1 very splendidly describes the optimizations and trade-offs in the design of the metadata server. A must read!

Per-blob metadata, such as blob length and permissions, are stored in a special tract (“tract -1”) of each blob. Clients find a blob’s metadata using the same method for finding data, using the TLT. Thus per blob metadata management is as distributed as blob
data storage.

The rest of the paper describes the execution of the sort algorithm in great detail.