March 25th, 2011 by David Link, Co-Founder and CEO
Recently I attended Cloud Connect and participated in a panel discussion on Performance Management in the Cloud. During the conference I had the pleasure of hearing Adrian Cockcroft (Director of Cloud Architecture at Netflix) provide a comprehensive look at what it took for Netflix to build, test, monitor, and scale massive on-demand applications.
Adrian’s deep experience and astounding success in moving 100% of Netflix’s vast compute needs to the cloud were fascinating on their own. But his transparent, common-sense approach, along with his disarming views on DevOps and the future of applications built for the cloud, also led to an interesting nugget for those of us interested in performance management in the cloud.
Adrian’s experience has led him to think about performance management in a different light. He argued resoundingly that most of the tools in the marketplace today just don’t capture the right information and are miserable at scale. For instance, Adrian said that utilization is a virtually useless performance statistic, and that a different set of performance metrics is now relevant (beyond the usual suspects of CPU, memory, and I/O).
One performance statistic that is very valuable to Netflix is stolen time. Essentially, the steal time counter measures the amount of time that your VM was ready to run but could not, because other VMs were competing for the CPU. This statistic is particularly relevant when you are using a cloud service provider: you don’t control the hypervisor (as an enterprise customer would in a traditional datacenter), so you don’t know how heavily the hypervisor is oversubscribed, and you have no insight into, or control over, the resources that other VMs are taking away from your workload.
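To make this concrete, here is a minimal sketch (mine, not Netflix’s) of how a Linux guest can read its own steal time: /proc/stat exposes cumulative jiffy counters per CPU state, and the eighth field on the aggregate "cpu" line is steal. Comparing two samples gives steal as a percentage of all CPU time.

```python
import time

def parse_cpu_line(line):
    """Parse the aggregate 'cpu ...' line from /proc/stat into named counters.

    The fields are cumulative jiffies: user, nice, system, idle, iowait,
    irq, softirq, steal. 'steal' counts time the VM was runnable but the
    hypervisor was running some other guest.
    """
    names = ["user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal"]
    values = [int(x) for x in line.split()[1:]]
    return dict(zip(names, values))

def steal_percent(before, after):
    """Steal time as a percentage of all CPU time between two samples."""
    deltas = {k: after[k] - before[k] for k in before}
    total = sum(deltas.values())
    return 100.0 * deltas.get("steal", 0) / total if total else 0.0

def sample_steal(interval=5.0):
    """Read /proc/stat twice, `interval` seconds apart (Linux only)."""
    with open("/proc/stat") as f:
        before = parse_cpu_line(f.readline())
    time.sleep(interval)
    with open("/proc/stat") as f:
        after = parse_cpu_line(f.readline())
    return steal_percent(before, after)
```

On an instance suffering from a noisy neighbor, `sample_steal()` will report a persistently high percentage; on dedicated hardware it should be near zero.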
One of the things you notice when you boot a relatively recent Linux distribution is a new CPU percentage shown when you run ‘top’: the mysterious ‘st’. If your Amazon AMI doesn’t show this statistic in ‘top’ and isn’t reporting ‘stolen cpu ticks’ in ‘vmstat -s’, then you need to upgrade your ‘procps’ tools to a later version.
So how does Netflix handle this problem when using Amazon’s cloud? Adrian admits that they tracked this statistic so closely that when an instance crossed a stolen-time threshold, the standard operating procedure at Netflix was to kill the VM and start it up on a different hypervisor. What Netflix learned over time was that once a VM was performing poorly because another VM was crashing the party (usually a poorly written or compute-intensive application hogging the machine), it never really got better, and the best approach was simply to get off that machine.
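The policy Adrian described can be sketched as a simple remediation rule. Everything below is illustrative: the threshold value is not Netflix’s actual number, and `launch_replacement` / `terminate` are hypothetical stand-ins for whatever provisioning API you use (e.g. boto against EC2); the post does not describe Netflix’s real tooling.

```python
def launch_replacement(instance):
    """Hypothetical stand-in: request a fresh instance from the provider.

    A new instance will very likely land on a different physical host,
    which is the whole point of the replace-don't-wait strategy.
    """
    return {"id": instance["id"] + "-replacement"}

def terminate(instance):
    """Hypothetical stand-in: tear down the instance with the noisy neighbor."""
    pass

STEAL_THRESHOLD = 20.0  # percent; an illustrative cutoff, not Netflix's

def remediate(instance, steal_pct):
    """If steal time is past the threshold, replace the VM instead of
    waiting: in Netflix's experience the condition rarely improves."""
    if steal_pct < STEAL_THRESHOLD:
        return instance  # healthy enough; keep running
    replacement = launch_replacement(instance)
    terminate(instance)
    return replacement
```

In practice this check would run periodically against the steal-time metric each instance reports, with some smoothing so a single noisy sample doesn’t trigger a replacement.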
A clever approach, and a nice solution to gain insight and some control for cloud resources and performance measurements that are not exposed by the service providers.
As for ScienceLogic, in the next few months we will be releasing our version of stolen-time monitoring in Xen (and equivalent) hypervisor performance-measurement PowerPacks for the major hypervisors.
As I was writing this post, Netflix experienced a major outage, which leads me to wonder: what is the next performance, fault, or configuration metric that Adrian will want to start watching to prevent a future outage? Netflix built its applications to be fault-resistant and leveraged cloud infrastructure to, hopefully, failure-proof the application layer. Kudos to Netflix for reaching out to their customers and providing a credit for the time lost, which one article estimated could amount to as much as $4M given their customer base (that is, if every customer cashed in on the outage). Is this outage bad news for the cloud computing industry? The situation makes you wonder: was the downtime a Netflix problem? An Amazon problem? Did Netflix not have enough control or visibility into their cloud environment? It would certainly be interesting to learn more.