ITIL Service Outage Analysis (SOA) in 7 Steps
The IT Infrastructure Library (ITIL) refers to Service or Systems Outage Analysis (SOA) as a method to improve availability. Unfortunately, ITIL does not indicate how one actually performs SOA! This article explains the benefits of SOA and gives you a 7-step guide to performing it.
The objective of SOA is to reduce the frequency and duration of outages while improving Mean Time To Repair (MTTR). The result of SOA is clear exposure of the risk of future outages, as well as recommendations for improvement.
SOA is a powerful technique that requires no major investment in tools or training. The process is straightforward. Working with Problem Management and Customers, you examine past outages and identify related Configuration Items (products, people or process). Then you review the impact of the organization and infrastructure on availability.
To get started, collect outage data and assemble a team of people familiar with the outages. Then guide them through the following 7 steps.
- Group related outages together; create groupings by vendor, product, family, application, customer, etc. Categorize each outage as "significant" or "less significant." Focus only on those labeled "significant"; monitor the "less significant" for future outages.
- For each "significant" outage, review the root cause of the unavailability, for example faulty hardware or software. This is probably already known, since the outage has been resolved.
- Using a Pareto analysis (80/20 rule), rank the related outages and their causes. You will see that the majority of the outages result from a select few causes. Focus on the "80%" of the outages caused by the "20%" of the causes.
- For each grouping of similar outages, examine the reasons for the duration of the unavailability. For example, the outage may have occurred because of faulty hardware or software; but the duration of the unavailability might have been extended by lack of tools, training, spares, etc.
- Remember to consider the three "P's" - People, Product and Process, and review:
- Existing procedures and support policies that were invoked or used during this outage.
- The actions (or inactions) of staff members, customers and anyone else involved in the outage or restoration.
- Try to determine if anything might have lessened the duration of the outage, or avoided it altogether. The examination should locate a trend, or at least something in common with similar outages. This is what you are looking for - the "smoking gun." An example might be the lack of a tool, process or similarly related item.
- Quantify the avoidable outage time. That is, if one hour of downtime resulted from trying to locate the proper tool, then the avoidable outage time is 1 hour x the number of outages so affected. Identifying the most preventable downtime is your goal.
- Prepare a Request for Change (RFC) to address the most significant generator of preventable downtime!
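The ranking and quantifying steps above can be sketched in a few lines of Python. This is only an illustrative sketch, not a prescribed ITIL method: the `Outage` record, its field names, and every value in `OUTAGES` are hypothetical, and it assumes the Pareto ranking is done by outage count per root cause.

```python
from collections import Counter, defaultdict
from typing import NamedTuple, Optional


class Outage(NamedTuple):
    """Hypothetical outage record; field names are assumptions for this sketch."""
    group: str                   # grouping, e.g. by application
    cause: str                   # root cause of the unavailability
    delay_factor: Optional[str]  # what extended the outage, if anything
    avoidable_hours: float       # downtime attributed to that factor


# Made-up sample data, for illustration only.
OUTAGES = [
    Outage("mail", "faulty hardware", "no spare on site", 1.0),
    Outage("mail", "faulty hardware", "no spare on site", 1.0),
    Outage("mail", "faulty hardware", "no spare on site", 1.0),
    Outage("erp", "faulty hardware", None, 0.0),
    Outage("erp", "faulty hardware", "no spare on site", 1.0),
    Outage("erp", "software defect", "missing runbook", 0.5),
    Outage("web", "software defect", "missing runbook", 0.5),
    Outage("web", "software defect", None, 0.0),
    Outage("web", "operator error", None, 0.0),
    Outage("mail", "power loss", None, 0.0),
]


def pareto_causes(outages, share=0.8):
    """Rank causes by outage count; return the ranking and the few causes
    that together account for at least `share` of all outages."""
    ranked = Counter(o.cause for o in outages).most_common()
    target = share * len(outages)
    vital, covered = [], 0
    for cause, count in ranked:
        if covered >= target:
            break
        vital.append(cause)
        covered += count
    return ranked, vital


def avoidable_hours_by_factor(outages):
    """Sum avoidable downtime per delaying factor, largest first."""
    totals = defaultdict(float)
    for o in outages:
        if o.delay_factor is not None:
            totals[o.delay_factor] += o.avoidable_hours
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    ranked, vital = pareto_causes(OUTAGES)
    print("Causes ranked by outage count:", ranked)
    print("Causes covering 80% of outages:", vital)
    print("Avoidable hours by factor:", avoidable_hours_by_factor(OUTAGES))
```

With this sample data, the first entry returned by `avoidable_hours_by_factor` is the factor to target in the RFC: four outages each lost one hour waiting for a spare, so 1 hour x 4 outages gives 4 avoidable hours.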
The SOA ends with a report that summarizes the number of outages analyzed and the report timeframe, lists the avoidable outage time, and offers suggestions for shortening or avoiding the outages.
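As a rough illustration of what such a report might contain, here is a minimal Python sketch that assembles a plain-text summary. The function name `build_soa_report`, its parameters, the layout, and all sample values are assumptions made for this example; ITIL does not prescribe a report format.

```python
from datetime import date


def build_soa_report(outage_dates, avoidable_hours, suggestions):
    """Assemble a plain-text SOA summary (hypothetical layout).

    outage_dates    -- list of datetime.date, one per analyzed outage
    avoidable_hours -- dict mapping a delaying factor to avoidable hours
    suggestions     -- list of improvement suggestions (strings)
    """
    start, end = min(outage_dates), max(outage_dates)
    lines = [
        "Service Outage Analysis report",
        f"Outages analyzed: {len(outage_dates)}",
        f"Timeframe: {start.isoformat()} to {end.isoformat()}",
        f"Avoidable outage time: {sum(avoidable_hours.values()):.1f} hours",
    ]
    # List the largest generator of avoidable downtime first.
    for factor, hours in sorted(
        avoidable_hours.items(), key=lambda item: item[1], reverse=True
    ):
        lines.append(f"  {factor}: {hours:.1f} hours")
    lines.append("Suggestions:")
    lines.extend(f"  - {text}" for text in suggestions)
    return "\n".join(lines)


if __name__ == "__main__":
    # Made-up values, for illustration only.
    print(
        build_soa_report(
            [date(2024, 1, 5), date(2024, 3, 20), date(2024, 2, 11)],
            {"no spare on site": 4.0, "missing runbook": 1.0},
            ["Keep one spare unit on site", "Write a restart runbook"],
        )
    )
```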
When you are done, you will have a documented business case justifying a Change! Most importantly, SOA gives you a clear roadmap that shows how to remove a significant source of downtime from your infrastructure.