MSDN Blog Postings

via RSS Feed

Designed for failure

Posted on May 3rd, 2008

Alex Mallet posted on the Live Mesh blog about how the Live Mesh cloud services run and how we think about the services, deployment, monitoring, and connectivity. Check it out.

Our general philosophy when building our cloud services was to adhere to the tenets of Recovery-Oriented Computing (ROC): programs will crash, hardware will fail, and they will do so regularly, so your system should be prepared to deal with these failures.
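To make the ROC idea a bit more concrete, here is a minimal sketch in Python of a supervisor that treats a crash as a normal event and simply restarts the work. This is an illustration only, not how Live Mesh is implemented: `supervise`, `make_flaky_task`, the restart limit, and the linear backoff are all hypothetical names and choices made for this example.

```python
import time
from typing import Callable


def supervise(task: Callable[[], str], max_restarts: int = 3, backoff_s: float = 0.0) -> str:
    """Run `task`, restarting it after a crash, up to `max_restarts` restarts."""
    restarts = 0
    while True:
        try:
            return task()
        except Exception as exc:  # a crash is expected, so handle it as routine
            if restarts >= max_restarts:
                # Too many failures in a row: surface the problem to a human.
                raise RuntimeError(f"gave up after {restarts} restarts") from exc
            restarts += 1
            time.sleep(backoff_s * restarts)  # linear backoff before restarting


def make_flaky_task(failures_before_success: int) -> Callable[[], str]:
    """Hypothetical worker that crashes a fixed number of times, then succeeds."""
    state = {"calls": 0}

    def task() -> str:
        state["calls"] += 1
        if state["calls"] <= failures_before_success:
            raise ConnectionError("simulated crash")
        return f"ok after {state['calls']} attempts"

    return task


if __name__ == "__main__":
    # The worker crashes twice; the supervisor restarts it each time.
    print(supervise(make_flaky_task(2)))
```

The point of the sketch is the shape of the loop: recovery is automatic for the common case, and only a persistent failure escalates to an operator.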

One gem in this post was a linked paper by James Hamilton, “On Designing and Deploying Internet-Scale Services,” which collects the best practices from the Windows Live/MSN organizations. It is definitely worth a read. The abstract is:

The system-to-administrator ratio is commonly used as a rough metric to understand administrative costs in high-scale services. With smaller, less automated services this ratio can be as low as 2:1, whereas on industry leading, highly automated services, we’ve seen ratios as high as 2,500:1. Within Microsoft services, Autopilot is often cited as the magic behind the success of the Windows Live Search team in achieving high system-to-administrator ratios. While auto-administration is important, the most important factor is actually the service itself. Is the service efficient to automate? Is it what we refer to more generally as operations-friendly? Services that are operations friendly require little human intervention, and both detect and recover from all but the most obscure failures without administrative intervention. This paper summarizes the best practices accumulated over many years in scaling some of the largest services at MSN and Windows Live.

