Tuesday, September 16, 2008

Consistent Hash based Distributed Data Storage


In an ultra large enterprise application, such as an online e-commerce or online gaming site, the site is dealing with millions of users and thousands of transactions every second. To handle this kind of traffic the number of servers, routers, databases, and storage hardware makes hardware or network failure a norm instead of an exception. Despite of the constant hardware failure in your system, your customer will not tolerate the slightest down time; the more successful your system is the more important it becomes to your client, and less happy they are when they experience an outage.

Solution -

To solve this challenge we need a highly available, decentralized, and high performance data storage service shielding the application from the harsh reality and complexity. No exception will be thrown when hardware or network failure occurs and the application code can always safely assume that the data storage service is available to write and read at any given point of time. This data storage service also needs to be evolutionarily scalable since down time is not acceptable, thus adding new node and storage capacity should not require shutting down the entire system, and it should only have limited impact on the service and it's client. A bonus side effect of this solution is that the distributed data storage can also act as a distributed cache system to reduce the hit to the persistent data storage such as a relational database.

Design -

- Consistent Hash

In a large online application, the type of data that require this kind of high availability are usually data that can be identified by a primary key and stored as binary content, for example user session(session id), shopping cart(cart id), preferences(user id), and etc,. Due to this nature, a Consistent Hash based distributed storage solution was proposed. Consistent Hash algorithm was initially introduced in Consistent Hashing and Random Trees: Distributed Caching Protocols for Relieving Hot Spots on the World Wide Web by David Karger in 1997. The key benefit of Consistent Hashing is that hashing any given key will always return the same value even when new slots are added or removed. The principle of this design can be illustrated using the following diagram. Imagine a virtual clock that represents an integer value from -231 to 231-1, and each server (A, B and C) in the storage cluster has a hash value assigned, hence for each given key (K1 and K2) can only land somewhere between these server nodes on the clock. The algorithm will search the clock clock-wise and pick the first server it encounters as the storage server for the given key, and because the hashing algorithm is consistent therefore any future put or get operation is guaranteed to be performed on the same node. Moreover the consistent hash algorithm also minimiz the impact for adding and removing node to its neighboring nodes instead of the entire cluster.

Challenge 1: Non-Uniformed Distribution

The first problem we need to solve is that the server hash value are most likely not uniformly distributed on the clock, as the result the server utilization will be skewed which is hardly an ideal situation. To solve this problem we are planning to borrow the idea discussed in Werner's paper on Amazon Dynamo by creating virtual nodes for the servers, and when you have enough virtual nodes created on the clock a close to uniformed distribution can be achieved.

Challenge 2: Availability and Replication

To provide high availability, the stored data need to be replicated to multiple servers. Based on the algorithm we are employing, in the case of a server failure any data stored on this specific server will automatically become the subjacent server's responsibility as shown in the following diagram.

Therefore our replication strategy is quite simple that every node will replicate the data it stores to it's immediate subjacent neighboring server. You can also include more than two servers in the replication group for even higher availability, although in our project I believe paired availability server group will provide desired availability without introducing too much complexity.
Open Source Alternative -

Open source Consistent Hash based distributed cache Memcached is built based on the similar design but without the high availability replication capability. It expects the application code being able to recover and restore the cache when a node becomes unavailable, which is usually an acceptable alternative for replication based availability at the cost of performance penalty during outage and increased code complexity. I usually recommend Memcached over custom in-house distributed cache system, since its proven, free, and a lot less work; what more can you ask :-)

Further Improvement -

Although its currently not in the plan but down the road there might be a need to implement Vector Clock based object versioning and merging capability to support simultaneous writes on multiple nodes, which is crucial for maintaining data consistency during partial system failure. Currently we are simply planning to employ "last-write-win" strategy to resolve the conflict.

Related Readings -

Consistent Hashing and Random Trees: Distributed Caching Protocols for Relieving Hot Spots on the World Wide Web

Amazon Dynamo

Tim's Blog on Consistent Hash


Talip Ozturk said...

Good stuff Nick,
Hazelcast (http://www.hazelcast.com) is built with similar concepts. Designing and coding with failure in mind and being ready for failure makes world look quite different.


Nick Zhu said...

Hi Talip,

Thanks for the link, very interesting project. I have added to my bookmarks, will definitely keep an eye on it.