November 14th, 2012 by Jeremy Sherwood, Cloud Strategist
“It’s about the Service, dummy.” I have to keep reminding myself. I have 30 years of operations experience. We manage technology. We practice our arcane craft by watching the gauges and warning lights that tell us when the technology is running “hot.” Except that this is all wrong. It’s not about the technology used to deliver IT Services…it’s about the Service itself. We need to change our way of thinking about IT from the equipment and communications lines and protocols and all the nasty bits to the value being delivered to the business. If we can do this we become a valued partner to the business.
Service Operations needs to know that a Service being provided to their customers is delivering the promised Warranty (as defined by ITIL: a promise or guarantee that a Product or Service will meet its agreed requirements). But often we’re looking at this upside-down, watching the technology and not the Service. The Service Desk needs to know the status of Services because they impact the business. Service Operations needs to be aware of Services that are performing poorly or are in danger of performing poorly. Poorly performing Services impact the business in very clear and measurable ways. Awareness of Services in trouble enables Service Operations to focus monitoring and restoration efforts on the most business-impacting issues first, prioritizing technological break-fix tasks based on their relationship to Services.
The challenge is that every Service is different. True Service Management collects specific key performance indicators that tell how well the Service is running and indicate when it is at risk. Three measures follow from those indicators. Service Health is a measure of how well the Service is performing. Service Risk is how likely it is to stop performing at acceptable levels. Service Availability is tied to Service Level Agreements and acts as a threshold on Service Health: when performance is so poor that the Service becomes unacceptable, it is considered unavailable. So, for example, if Service Operations knows that the client RPC latency for Microsoft Exchange should not exceed 25ms and it is currently at 72ms, it is a relative certainty that users are impacted and service is degraded, and if latency worsens the Service will be “down” or unavailable. In this state Service Health is poor and Service Risk is very high, even though the Service is still Available. Service Operations should be focusing all of their attention on preventing the imminent outage.
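To make the three measures concrete, here is a minimal Python sketch of the Exchange example. It is only an illustration under stated assumptions, not how any particular product computes these values: the `LatencyKpi` and `assess` names are made up for this post, the 100ms unavailability threshold is a hypothetical figure (only the 25ms target and the 72ms reading come from the example above), and risk is approximated crudely as the share of the margin between target and threshold that has already been used up. A real risk measure would likely also consider trend and history.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LatencyKpi:
    """A lower-is-better KPI, such as client RPC latency in milliseconds."""
    target_ms: float       # level the Service should not exceed (25 ms in the example)
    unavailable_ms: float  # hypothetical level at which the Service counts as unavailable

    def __post_init__(self):
        if self.unavailable_ms <= self.target_ms:
            raise ValueError("unavailable_ms must be greater than target_ms")


def assess(kpi: LatencyKpi, current_ms: float) -> dict:
    """Return health, availability, and risk for a single KPI reading."""
    # Health: 1.0 at or below target, falling linearly to 0.0 at the
    # unavailability threshold (an assumed scoring scheme).
    span = kpi.unavailable_ms - kpi.target_ms
    overshoot = max(0.0, current_ms - kpi.target_ms)
    health = max(0.0, 1.0 - overshoot / span)
    # Availability is a threshold on health: at or past the threshold
    # the Service is considered unavailable.
    available = current_ms < kpi.unavailable_ms
    # Risk (crude proxy): fraction of the margin to the threshold already consumed.
    risk = 1.0 - health
    return {"health": round(health, 2), "available": available, "risk": round(risk, 2)}


exchange_rpc = LatencyKpi(target_ms=25.0, unavailable_ms=100.0)
result = assess(exchange_rpc, current_ms=72.0)
print(result)  # {'health': 0.37, 'available': True, 'risk': 0.63}
```

With these assumed numbers the Service is still Available, but Health is well under half and most of the margin to an outage is gone, which is the situation where Service Operations should be acting before the threshold is crossed.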
We need to focus our measure of IT Services around what they deliver to the user. Technology-centric approaches to service management, measuring memory and CPU and ping response, do not cut it these days. Monitoring Service Health, Availability, and Risk enables Service Operations to prioritize work around the most impactful Service problems and ensure business continuity. ScienceLogic built this into our IT Service Management – flexible and extensible Service Management for modern IT Service architectures.