Windows Azure Storage: A Highly Available Cloud Storage Service with Strong Consistency

Windows Azure Storage (WAS) is a key component of the Windows Azure cloud platform that offers an effectively infinite disk in the cloud. It has been in production since November 2008 and is used heavily within Microsoft in addition to being available as a public cloud service. It currently holds about 70 PB of raw storage in production, with a few hundred more petabytes due to be provisioned in the near future.

The Architecture
WAS is engineered as a service that runs atop another service called the Windows Azure Fabric Controller. The Fabric Controller is a lower-level service that provides cross-cutting functionality such as node management, network configuration, health monitoring, and starting/stopping of service instances.
In some ways WAS takes on responsibilities very similar to those of a distributed file system's name/metadata node, such as data placement across disks, replication and load balancing. At a high level it is logically organized into two components: Storage Stamps and the Location Service. It supports three different storage abstractions, namely Blobs, Tables and Queues.

Storage Stamps
Each Storage Stamp is a cluster of multiple racks, typically about 10-20 racks with about 18 storage nodes per rack. Currently each storage stamp holds about 2 PB of raw data, but stamps are soon expected to scale to holding up to 30 PB each. The optimum utilization level for this infrastructure appears to be around 70%. When a stamp reaches this limit, the Location Service migrates accounts to other stamps using inter-stamp replication.

A Partitioned Namespace for Objects
The objects in this storage system are all part of a single global namespace. This is achieved by using DNS and composing the identifier from a customer account name, a partition name and an object name. Every object is accessible via a URI of the form http(s)://AccountName.&lt;service&gt;.core.windows.net/PartitionName/ObjectName, where &lt;service&gt; is blob, table or queue.
The Account Name is the name chosen by the customer; the corresponding DNS entry is mapped to the primary storage cluster in the data center where that account's data is stored. The Partition Name locates the data within the cluster, and finally the Object Name identifies the individual stored object.
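To make the addressing scheme concrete, here is a tiny Python sketch of how the three-part name could be assembled into a URI. The account, service, partition and object names used here are made up for illustration.

```python
# Illustrative sketch only: how the three-part name maps onto a WAS URI.
# "blob" is one of the three services (blob, table, queue); the account,
# partition and object names below are invented for this example.

def object_uri(account: str, service: str, partition: str, obj: str,
               secure: bool = True) -> str:
    """Build the URI under which an object is addressed.

    AccountName picks the DNS entry (and hence the primary stamp),
    PartitionName locates the data within the stamp, and ObjectName
    identifies the individual object inside that partition.
    """
    scheme = "https" if secure else "http"
    return f"{scheme}://{account}.{service}.core.windows.net/{partition}/{obj}"


print(object_uri("contoso", "blob", "photos", "cat.jpg"))
# https://contoso.blob.core.windows.net/photos/cat.jpg
```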

The Location Service
The Location Service (LS) is responsible for managing the storage stamps. It maps different accounts to different stamps, and it also ensures that data is replicated and load balanced across stamps.
WAS storage locations are spread across North America, Europe and Asia. The system is built in such a way that each location has one data center and each data center holds multiple storage stamps. New regions, new locations within a region, or new storage stamps within a location can be added at any time.
When an application creates a new account for storing data, it is allowed to specify a location affinity. The LS allocates a storage stamp based on this preference and updates DNS to route traffic to that particular storage stamp's virtual IP (VIP). A hypothetical sketch of this flow is shown below.
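The following is a purely hypothetical Python sketch of what account creation in the LS might look like. The stamp names, locations, VIPs and the least-utilized-stamp policy are all assumptions made for illustration, not details taken from the paper.

```python
# Hypothetical sketch of account creation in the Location Service.
# The stamp table, the selection policy and the "dns" dict are stand-ins
# for the real stamp inventory and the real DNS update.

STAMPS = {
    "stamp-us-north-1": {"location": "US North", "vip": "10.0.1.1", "utilization": 0.55},
    "stamp-us-north-2": {"location": "US North", "vip": "10.0.1.2", "utilization": 0.80},
    "stamp-eu-west-1":  {"location": "EU West",  "vip": "10.0.2.1", "utilization": 0.40},
}

dns = {}   # hostname -> VIP of the account's primary stamp


def create_account(account: str, service: str, location_affinity: str) -> str:
    # Assumed policy: prefer the least-utilized stamp in the requested location.
    candidates = [(name, s) for name, s in STAMPS.items()
                  if s["location"] == location_affinity]
    name, stamp = min(candidates, key=lambda item: item[1]["utilization"])
    # "Update DNS" so traffic for this account is routed to the stamp's VIP.
    dns[f"{account}.{service}.core.windows.net"] = stamp["vip"]
    return name


print(create_account("contoso", "blob", "US North"))
print(dns)   # {'contoso.blob.core.windows.net': '10.0.1.1'}
```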

Anatomy of a Storage Stamp
Stream Layer: This is the lowermost layer of a stamp and is responsible for storing the data on disk. It also ensures durability of the stored data by replicating it across storage nodes within the stamp. The unit of storage is known as a stream. This layer does not concern itself with the semantics of the objects being stored; the data stored here is accessible from the layer above it.
Partition Layer: This is the layer that understands the semantic differences between the storage abstractions. It manages a namespace for them and provides transaction ordering, strong consistency for objects, and caching of object data.
Front End Layer: This layer consists of stateless servers that accept the request for an object and route it to the appropriate partition server that can serve the request. Before doing so, each request is authenticated and authorized based on the account details. The front ends can also stream large objects directly from the stream layer and cache frequently requested data. A toy sketch of how the three layers fit together follows.
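The Python sketch below shows roughly how a request could flow through the three layers. All class names, interfaces and data structures here are invented for illustration; the real layers expose much richer (and very different) interfaces.

```python
# A highly simplified, hypothetical sketch of the three-layer read/write path.
# Nothing here is the actual WAS API; it only illustrates the division of labor.

class StreamLayer:
    """Stores raw bits durably; knows nothing about object semantics."""
    def __init__(self):
        self._streams = {}          # stream name -> list of appended chunks

    def append(self, stream, data):
        self._streams.setdefault(stream, []).append(data)

    def read(self, stream):
        return b"".join(self._streams.get(stream, []))


class PartitionServer:
    """Understands objects, maps them onto streams, enforces ordering."""
    def __init__(self, stream_layer):
        self._streams = stream_layer
        self._index = {}            # (partition, object) -> stream name

    def put(self, partition, obj, data):
        stream = f"{partition}/{obj}"
        self._index[(partition, obj)] = stream
        self._streams.append(stream, data)

    def get(self, partition, obj):
        return self._streams.read(self._index[(partition, obj)])


class FrontEnd:
    """Stateless: checks the request, then routes it to the partition
    server responsible for that PartitionName."""
    def __init__(self, partition_map):
        self._partition_map = partition_map   # partition -> PartitionServer

    def handle_get(self, account, partition, obj):
        # (authentication/authorization based on `account` omitted)
        return self._partition_map[partition].get(partition, obj)


stream_layer = StreamLayer()
ps = PartitionServer(stream_layer)
fe = FrontEnd({"photos": ps})
ps.put("photos", "cat.jpg", b"\x89JPG...")
print(fe.handle_get("contoso", "photos", "cat.jpg"))
```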

Replication
Contents in a stamp are replicated both within the stamp's storage nodes and across stamps. Intra-stamp replication is synchronous and is performed by the stream layer.
Inter-stamp replication is asynchronous and is performed by the partition layer. Inter-stamp replication is focused on replicating objects and the transactions applied to those objects, whereas intra-stamp replication is focused on replicating the blocks of disk storage that make up those objects. Intra-stamp replication safeguards the data against loss due to machine failure, while inter-stamp replication provides geo-redundancy and enables disaster recovery.
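Here is a toy Python sketch contrasting the two modes: a write is acknowledged only after every in-stamp replica has the data, while geo-replication is queued and shipped off the critical path. The class, replica count and queue are assumptions for illustration only.

```python
# Toy sketch of synchronous intra-stamp vs asynchronous inter-stamp replication.
# All names are invented; the real protocols are far more involved.

from collections import deque


class Stamp:
    def __init__(self, name, replicas=3):
        self.name = name
        self.replicas = [[] for _ in range(replicas)]   # blocks held per node
        self.geo_queue = deque()                        # pending inter-stamp work

    def write(self, block):
        # Intra-stamp replication: synchronous. The write is acknowledged to
        # the client only after every replica inside the stamp has it.
        for replica in self.replicas:
            replica.append(block)
        # Inter-stamp replication: asynchronous. Record the work and ship it
        # to the secondary stamp later, off the critical path.
        self.geo_queue.append(block)
        return "ack"

    def drain_geo(self, secondary):
        # In the real system this runs continuously in the background.
        while self.geo_queue:
            secondary.write(self.geo_queue.popleft())


primary, secondary = Stamp("us-north"), Stamp("us-south")
print(primary.write(b"block-1"))     # ack only after 3 in-stamp copies exist
primary.drain_geo(secondary)         # geo copy catches up asynchronously
```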

The Stream Layer and its relationship to TidyFS
It appears that this layer is essentially a distributed file system, and one that draws on TidyFS, another piece of work from Microsoft Research. Reading about TidyFS at this point may help in understanding the nomenclature and semantics of the elements of the Stream Layer. A stream is the unit of storage, and only append operations are permitted on it, a feature that is now commonplace in distributed file systems.

Some Interesting Lessons Learnt
Scaling computation separately from storage
What this means is that the VMs that run an application are not the same machines that store the data owned by that application. This is intentional, so that demands on compute and storage can be met by scaling each independently. Instead of co-locating compute and storage, the system is built to let computation access storage over a high-bandwidth network. To keep access fast, they are also moving to a different data center architecture that flattens the network topology.

Throttling/Isolation
With a large number of accounts, it quickly becomes difficult to track every account's usage profile and throttle based on that profile information. So, to determine whether an account is well behaved, the system uses a Sample-Hold algorithm to track the top N busiest accounts and partitions. When a server gets overloaded, this information is consulted to throttle the traffic. A rough sketch of the Sample-Hold idea follows.
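Below is a hedged Python sketch of the Sample-Hold idea (in the style of Estan and Varghese's sample-and-hold counting). The sampling probability, table size and account names are illustrative assumptions, not values taken from WAS.

```python
# Sketch of Sample-Hold: track only the heaviest accounts/partitions without
# keeping per-account state for everything. Parameters below are made up.

import random


class SampleHold:
    def __init__(self, sample_prob=0.01, max_entries=1000):
        self.p = sample_prob
        self.max_entries = max_entries
        self.counts = {}                    # account -> estimated request count

    def record(self, account):
        if account in self.counts:
            # Once an account is being tracked, count every request for it.
            self.counts[account] += 1
        elif len(self.counts) < self.max_entries and random.random() < self.p:
            # Otherwise start tracking it only with small probability, so
            # only the busiest accounts are likely to end up in the table.
            self.counts[account] = 1

    def top(self, n):
        return sorted(self.counts.items(), key=lambda kv: kv[1], reverse=True)[:n]


tracker = SampleHold()
for _ in range(100_000):
    account = "hot-account" if random.random() < 0.3 else f"acct-{random.randint(0, 5000)}"
    tracker.record(account)
print(tracker.top(3))   # "hot-account" almost certainly dominates the table
```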

Append Only System
This greatly simplifies the replication protocol and failure handling. In this model, once data is committed it is never overwritten, so consistency across replicas can be enforced simply by comparing commit lengths. A toy illustration follows.
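The following Python sketch illustrates why this helps: since data is never overwritten, replicas can be reconciled by truncating to the smallest commit length. The classes and the seal() helper are invented for illustration; the real sealing protocol in the stream layer is considerably more involved.

```python
# Toy illustration of append-only replicas reconciled by commit length.

class Replica:
    def __init__(self):
        self.data = bytearray()

    def append(self, chunk: bytes):
        # Appends only; committed bytes are never overwritten.
        self.data.extend(chunk)

    @property
    def commit_length(self):
        return len(self.data)


def seal(replicas):
    """After a failure, agree on the committed prefix: truncate every
    replica to the smallest commit length, then never append again."""
    agreed = min(r.commit_length for r in replicas)
    for r in replicas:
        del r.data[agreed:]
    return agreed


r1, r2, r3 = Replica(), Replica(), Replica()
for r in (r1, r2, r3):
    r.append(b"committed-record|")
r1.append(b"partial-record")          # e.g. a write that reached only one replica
print(seal([r1, r2, r3]))             # all replicas now agree on the same prefix
```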

Multiple Data Abstractions from a Single Stack
This system supports three different data abstractions, all built on the same underlying storage stack. All of the abstractions use the same intra- and inter-stamp replication and load balancing mechanisms. Because the workload profiles of the three abstractions vary a lot, this design lets all of these different workloads run on the same set of storage nodes and thus improves utilization.

CAP Theorem
Although the CAP theorem states that Consistency, Availability and Partition tolerance cannot all be provided simultaneously by a distributed system, WAS is a system that provides high availability with strong consistency guarantees, and within a stamp all three properties are effectively realized. This is achieved through layering and designing around a specific fault model. It is quite interesting to learn how strong consistency is provided despite network partitions. In essence, within a storage stamp the system is engineered to behave more like a monolithic system than a distributed one.

Previewed from http://sigops.org/sosp/sosp11/current/2011-Cascais/printable/11-calder.pdf