Semantics of Caching with SPOCA: A Stateless, Proportional, Optimally-Consistent Addressing Algorithm

This paper describes the essential parts of Yahoo’s video delivery system. The Yahoo! Video Platform has a library of over 20 million video assets. From this library, end users make about 30,000,000 requests per day for over 800,000 unique videos, which creates a low ratio of total requests to unique requests. Also, because videos are large, a typical front-end server can hold only 500 unique videos in memory and 100,000 unique videos on disk. The low ratio of total/unique requests combined with the large size of video files make it difficult to achieve a high percentage of cache hits.

Their initial content serving platform had a relatively straightforward architecture that used a hardware load balancer issuing requests to a set of front end servers. The front end servers cache the actual video files which in turn are fetched from a storage farm.

The entire repository of videos can be classified into different content pools based on the type of content they serve. The three main pools include Download pool, Flash Media based pool and Windows Media pool. The overarching goal of the platform was to merge these pools and adapt content delivery to suit varying traffic profiles of different videos.

A simple approach to apportion video files to different servers is to maintain a catalog that contains the associations between a video object and front end servers. But this solution would be based on the assumption that the servers are always available. Also the cost of reindexing is quite high whenever a new server is added or an existing one becomes unavailable. Thus the alternative is to use a stateless addressing approach to locate content.

In this scheme a central Request Router is responsible for locating a file given the filename and a set of available servers. The mapping is determined on the fly by the request router.

Some of the important requirements for this content delivery system are –

1) Since the front ends servers are not all homogeneous in their capacity to serve traffic the distribution scheme must be sensitive to the configuration of a server while allocating a set of files to be served by it.
2) Also the number of files that have to be reassigned to other servers (when servers join/leave the pool) should be minimized so that cache disruption can be minimized to the extent possible.
3) Although content may be distributed proportionally among the servers, this does not mean the requests will be proportionally distributed. So there should be a means to detect very popular content/overloaded servers so that the load can be distributed away from these servers. In essence the distribution system in addition to distributing files proportionally also needs to balance the load on its components.

The Solution
The highlight of the solution is the stateless content addressing/routing algorithm based on which content is cached on the front end servers. The Stateless, Proportional, Optimally-Consistent Addressing Algorithm (SPOCA) takes in a content identifier and a list of available servers and deterministically returns a single front end server where the content is cached without having to remember anything it did before.

Each object is assigned a unique identifier of the form “Content-ID.domain”. The content id part is derived deterministically from the content itself such as the hash of the filename and the domain part represents the pool that this content belongs to. The typical URL for a content piece looks like “http://Content-ID.domain”.

When a request for a video content arrives it is first intercepted by a DNS server called Zebra which directs the request to a specific cluster that will serve the content. It determines the serving cluster based on geo-locality and popularity of the content.

The Request Router
The request router implementing the SPOCA algorithm is the one that determines which front end server will serve the video. This algorithm uses a hash function to map contents to a server. Contents are mapped to servers in a sparse hash space where space has been allocated to each server based on its capacity. Not all of the space is allocated to a server so if a key hashes to a space that has no server assigned then its hashed repeatedly until a space with a server is found.
The consistent hashing algorithm is implemented based on the standard linear congruential pseudo random number generator. The Content-ID for each file is used as the seed for the pseudo random number generator.

When a server becomes unavailable
When a server fails and becomes unavailable the section of the hash space that was assigned to it is withdrawn and made unavailable. A subsequent request that hashes to this space will find it unavailable and therefore rehash until it finds a new server. This server then serves the content by fetching it from the storage farm and caching it in its memory along the way. This design allows only those contents that have been assigned to the failed servers to be reassigned onto other live servers.

When a new server is added
New servers are mapped to the unassigned portion of the hash space. The newly added server over a period becomes the goto server for some content that already resides on another server. In due course the content that had been cached on some existing server is removed from the there due to a natural timeout.

The algorithm described so far minimizes the number of servers caching the same content so that the aggregate of servers can be utilized to cache more content objects. But as contents become more popular they will end up choking the same server. In order to avoid this, the requests for popular contents must be routed to more than one server by extending the routing algorithm to also perform some load balancing logic.

Load Balancing
The request router handles popular content based on the same approach it uses to handle unavailable servers. Once a popular piece of content has been served the hashed address of the server that served the request is cached for a small window of time a.k.a popularity window (150 seconds in this case). If another request for the same content piece arrives during this window then the routing starts by hashing the hashed address value that was stored from the previous request instead of the content identifier. This ensures that the request is routed to a different server each time a new request for the same content is made within a window. Once the time window is over the stored hash for each file is discarded without any regard to how recently it was derived.

Some interesting consequences of this design

a) The same content gets mapped to different servers within a popularity window.
b) For a given object the sequence of servers to which it is assigned remains the same across windows.
c) Although the solution introduces some state for load balancing still the state is soft. It never gets to a point where it has to store mappings for all the 20 million files in the repository or even for all the files that are served in a day. It only stores those files that have been viewed in the 150 sec window and can therefore be fully persisted in memory.
d) Its important to understand the trade off that comes with the configurability of the popularity time window. The effect of the window is akin to a situation where a set of servers are progressively becoming unavailable for a certain amount of time. So longer the window the more the number of unavailable servers hence more unbalanced the load. On the other hand if the window is too small then the same object is duplicated many times.

Previewing from http://www.usenix.org/event/atc11/tech/final_files/Chawla.pdf

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s