The open-source software Memcached is used to reduce the read load of many of the world’s most popular dynamic websites, including YouTube, Reddit, Twitter, Tumblr, Wikipedia, and Facebook. With nearly two billion globally distributed users communicating in real time, Facebook’s Memcache scaling is a particularly interesting case. Rajesh Nishtala, a member of the team that developed Facebook’s Memcache infrastructure, has explained their process in a video that is worth consideration by anyone looking to scale an enterprise website with a heavy read load.
In addition to a wide user base, read-heavy workload, real-time communication, and global distribution, Facebook’s infrastructure requirements include supporting access and updates to viral videos and other popular content, ongoing aggregation of newsfeed content from multiple sources, and processing millions of requests per second. Their design must also adapt to a product that is constantly evolving and adding new features.
Prior to its use of Memcache servers, Facebook deployed only a few databases and servers, sharding data across multiple databases by user ID. As the user base grew, however, the high fanout and increasingly complex data fetching required to render each page demanded more read capacity, and Memcache was implemented to avoid latency issues and guard against database crashes.
In his video, Rajesh describes scaling in four steps: a single Memcache server; a single front-end cluster consisting of multiple Memcache servers housed with multiple web servers that read from and write updates to the databases; multiple front-end clusters; and multiple globally dispersed front-end clusters.
With the addition of Memcache servers, Facebook began to store its data in a look-aside cache. In this set-up, a web server would first look for data in Memcache; on a miss, it would fetch the information from the database and then fill the cache with it. When a new write updated information in the database, any corresponding information in the cache was deleted rather than updated, with the web application specifying which keys to invalidate and the cache refilling only on demand. Because deletes are idempotent, this method was preferred to updating the cache directly: any delete lost or delayed in the system could simply be replayed.
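The look-aside pattern can be sketched in a few lines. This is a minimal illustration, not Facebook's actual client API; the class and its in-memory dictionaries are stand-ins for a Memcache server and a MySQL shard.

```python
class LookAsideStore:
    """Demand-filled look-aside cache with delete-on-write invalidation."""

    def __init__(self):
        self.cache = {}      # stands in for a Memcache server
        self.database = {}   # stands in for a database shard

    def read(self, key):
        # 1. Try the cache first.
        if key in self.cache:
            return self.cache[key]
        # 2. On a miss, fall back to the database...
        value = self.database.get(key)
        # 3. ...and fill the cache on demand.
        if value is not None:
            self.cache[key] = value
        return value

    def write(self, key, value):
        # Writes go to the database; the cached copy is deleted, not
        # updated, because deletes are idempotent and can be safely
        # replayed if lost or delayed.
        self.database[key] = value
        self.cache.pop(key, None)
```

Note that `write` never puts the new value into the cache; the cache is refilled only when the next `read` misses.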
One problem with look-aside caching was the possibility of inconsistencies between Memcache and the database, known as stale sets. For example, suppose a web server, Server A, misses in the cache and reads a value from the database. Before Server A fills the cache, another user updates the same value through Server B, which writes the database and deletes the corresponding cache entry. Server A then fills the cache with the value it read earlier, and that stale value lingers until the next write. This problem was solved by attaching a 64-bit lease ID to each miss, which the Memcache server invalidated upon a delete. If a set was attempted with an invalidated lease ID, the set was disallowed.
Viral or highly accessed content caused another problem in this set-up. Each update deleted the content from the cache, sending the many users attempting to read it flooding to the database. To avoid database crashes, a small extension was added to the leases so that the Memcache server could arbitrate reads to the database. The slightly stale content or short wait that resulted from arbitration was highly preferable to the loss of access that a crash would cause for many users.
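One way to picture this arbitration, under the assumption that the server grants at most one database-read lease per key per interval (the interval length and structure here are made up for illustration):

```python
import time

LEASE_INTERVAL = 10.0  # illustrative: seconds between lease grants per key

class ArbitratedCache:
    """On a miss for a hot key, only one client per interval is told to
    query the database; the rest are told to wait and retry the cache."""

    def __init__(self):
        self.data = {}
        self.last_lease = {}   # key -> time the last lease was granted

    def get(self, key, now=None):
        now = time.monotonic() if now is None else now
        if key in self.data:
            return self.data[key], False
        last = self.last_lease.get(key)
        if last is None or now - last >= LEASE_INTERVAL:
            self.last_lease[key] = now
            return None, True    # this client may read the database
        return None, False       # herd held back: wait briefly, retry
```

The one lease holder refills the cache; everyone else either waits a moment or, in some configurations, is served a slightly stale value instead.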
To support a read-heavy workload, wide fanout, and failure handling, more Memcache servers were added, forming a single cluster of multiple servers backed by database storage. In this set-up, consistent hashing on the key was used to distribute items across the Memcache servers. However, this led to an all-to-all communication problem: each web server received many small responses from many Memcache servers at once, overwhelming the web servers’ incoming links and resulting in packet drops.
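Consistent hashing can be sketched as a ring of virtual nodes; the server names and virtual-node count below are illustrative, not Facebook's configuration.

```python
import bisect
import hashlib

class HashRing:
    """Consistent hash ring: each server owns many virtual points, and a
    key maps to the first point clockwise from the key's own hash."""

    def __init__(self, servers, vnodes=100):
        self._ring = sorted(
            (self._hash(f"{server}#{i}"), server)
            for server in servers
            for i in range(vnodes)
        )
        self._points = [h for h, _ in self._ring]

    @staticmethod
    def _hash(key):
        return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")

    def server_for(self, key):
        # Walk clockwise to the first virtual node at or after the key,
        # wrapping around to the start of the ring if necessary.
        i = bisect.bisect(self._points, self._hash(key)) % len(self._ring)
        return self._ring[i][1]
```

The appeal over plain modulo hashing is that removing or adding a server remaps only the keys that belonged to that server's points; every other key stays where it was.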
Limiting the number of outstanding requests helped to resolve this issue. A sliding window was employed that shrank during congested periods, forcing additional round trips, and grew for faster delivery once the number of outstanding read requests fell.
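In the spirit of TCP congestion control, such a window might look like the following sketch; the sizes and growth policy here are assumptions, not the values Facebook used.

```python
class SlidingWindow:
    """Caps the number of in-flight requests; grows on success, shrinks
    sharply on a drop."""

    def __init__(self, size=4, max_size=64):
        self.size = size            # how many requests may be in flight
        self.max_size = max_size
        self.in_flight = 0

    def try_send(self):
        if self.in_flight >= self.size:
            return False            # window closed: forces an extra round trip
        self.in_flight += 1
        return True

    def on_success(self):
        self.in_flight -= 1
        self.size = min(self.size + 1, self.max_size)  # open gradually

    def on_drop(self):
        self.in_flight -= 1
        self.size = max(self.size // 2, 1)             # congestion: halve
```

A too-small window serializes requests and adds latency; a too-large one recreates the incast problem, so the window adapts between the two.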
As the user base grew, however, a new solution was needed to keep this method from slowing performance. Multiple clusters were deployed to provide additional read capacity, with hundreds of servers accommodating hundreds of millions of requests per second. While this type of horizontal scaling would have been impossible within a single cluster due to the limits of all-to-all communication, deploying additional clusters provided a safeguard against incast congestion.
The first problem to be solved as new clusters were added was how to maintain cache consistency. Facebook’s solution was a system called McSqueal. Within each storage cluster, writes to the MySQL database were recorded in a commit log. McSqueal tailed this log, extracted from each committed transaction the items that needed to be invalidated, and broadcast the invalidations to all front-end clusters.
Although this produced a high number of packets, Facebook was able to minimize bandwidth and maximize packet density by routing invalidations from the database instances through a layer of Memcache routers. Each router batched invalidation requests and sent them to the various front-end clusters, where they were broadcast to the Memcache servers within the cluster. Through this method, they achieved an 18x reduction in packets compared with broadcasting invalidations directly.
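The shape of this pipeline can be sketched as follows. Everything here is illustrative: the class names, the log format, and the batching are stand-ins for the real McSqueal tailer and router layer.

```python
class FrontEndCluster:
    """Records the invalidations it receives (stands in for a cluster
    fanning deletes out to its Memcache servers)."""

    def __init__(self, name):
        self.name = name
        self.invalidated = []

    def broadcast_delete(self, keys):
        self.invalidated.extend(keys)

class RouterLayer:
    """Batches invalidations and fans one packet out per cluster,
    rather than one packet per key per server."""

    def __init__(self, clusters):
        self.clusters = clusters

    def send(self, keys):
        for cluster in self.clusters:
            cluster.broadcast_delete(keys)

def tail_commit_log(commit_log, router):
    # Each entry pairs a committed SQL statement with the cache keys it
    # affects; only the embedded keys matter to the invalidation path.
    batch = []
    for _statement, keys in commit_log:
        batch.extend(keys)
    if batch:
        router.send(batch)
```

Batching at the router is what raises packet density: many per-key invalidations travel the wide-area links as a few dense packets.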
Geographically Distributed Databases
The last step in scaling Memcache was to ensure data consistency as replica databases were deployed around the world to maintain access in the event of natural disasters, ensure high performance for users located far from the master database, and support up to a billion requests per second. The challenge with global distribution lay in handling writes by users located far from the master database. Such a write required a long round trip to the master database but only a much shorter, faster trip to delete the information from a nearby Memcache. Should another user located nearby attempt to read the information in that interval, the empty Memcache would direct her to the closest replica database, which may not yet have received the updated value from the previous write. The stale value would then get stuck in the Memcache, leading to a permanent inconsistency.
To prevent this problem, Facebook implemented a system in which a remote marker was set in the Memcache any time a database write was requested. After the marker was placed, the information was written to the master database; only then was it deleted from the Memcache and updated in the replica database. Once the process was complete, the remote marker was deleted. Should a user attempt to read the information while the marker was set, she would be directed to the master database to receive the updated information.
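The marker protocol can be sketched as below, modeling a single remote region; the class, its method names, and the synchronous `replicate` step are simplifications for illustration.

```python
class RemoteRegion:
    """A remote region's cache and replica, plus a stand-in for the
    far-away master database. Markers flag keys whose replica copy may
    be behind the master."""

    def __init__(self):
        self.cache = {}
        self.markers = set()
        self.replica_db = {}
        self.master_db = {}          # stands in for the remote master

    def write(self, key, value):
        self.markers.add(key)        # 1. set a remote marker locally
        self.master_db[key] = value  # 2. long round trip to the master
        self.cache.pop(key, None)    # 3. delete the local cached copy

    def replicate(self, key):
        # Replication eventually copies the master's value over and
        # clears the marker.
        self.replica_db[key] = self.master_db[key]
        self.markers.discard(key)

    def read(self, key):
        if key in self.cache:
            return self.cache[key]
        if key in self.markers:
            # Marker set: the replica may be stale, so pay the long
            # round trip to the master instead.
            return self.master_db.get(key)
        value = self.replica_db.get(key)
        if value is not None:
            self.cache[key] = value
        return value
```

The marker trades a little extra read latency, for a few keys, for correctness: a miss on a marked key goes to the master rather than refilling the cache from a lagging replica.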
Through problem-solving to ensure data consistency and performance while avoiding crashes, Facebook was able to scale Memcache to support its significant infrastructure requirements and design demands.