ITIL AVAILABILITY MANAGEMENT

by Hank Marquis
UPDATED FEB. 11, 2006: ADDED LINKS TO SOA, CFIA, FTA, AND CRAMM
UPDATED MAR. 8, 2006: ADDED LINK TO TOP

Availability Management is an ITIL process that few are familiar with, yet at virtually no added cost, IT can realize improvements in availability in as little as a few weeks!

Here’s the secret -- to “stay up”, you have to know why you “went down”. Put another way, to improve availability, you have to measure, identify and address un-availability.

The following 6-step plan examines infrastructure (products) and organization (people and process) for un-availability, identifies the #1 issue, and develops a solution to increase availability:

1. Start with the “Incident Lifecycle”. Examine the time spent on Incident detection, diagnosis, repair, recovery and restoration. Document where un-availability comes from using metrics for Recoverability (Mean Time To Repair), Reliability (Mean Time Between Failures) and, for external providers, Serviceability (agreed uptime - downtime). This baseline is what you will measure improvement against.
2. Perform a Service Outage Analysis (SOA). Working with Problem Management and Customers, examine past outages and identify the responsible IT assets (products, people or process). Create a Pareto chart; graph paper will do nicely. A majority of un-availability usually results from a minority of assets. [See ‘Service Outage Analysis in 7 Steps’ DITY Vol. 1 #7 for more on SOA]
3. Create Component Failure Impact Assessment (CFIA) tables. Driven by the SOA, CFIA shows the scope of impact and locates Single Points of Failure and other flaws. All it takes is a spreadsheet, or paper and pen. CFIA works for all IT assets – people and processes (organization) as well as products (infrastructure). [See ‘3 Steps to Success with CFIA’ DITY Vol. 1 #4 for more on CFIA]
4. Develop Fault Tree Analysis (FTA) diagrams. Both CFIA and FTA clarify potential flaws. FTA uses a logical model that shows how a failure can snowball into a major outage. Like CFIA, FTA may also be done with paper and pen. [See ‘Fault Tree Analysis Made Easy’ DITY Vol. 1 #5 for more on FTA]
5. Prioritize using CRAMM. CRAMM classifies the risk faced by an asset (threats) due to vulnerabilities in infrastructure and organization, identified in this case by CFIA and FTA. A paper- or spreadsheet-based solution works just fine. [See ‘10 Steps to Do It Yourself CRAMM’ DITY Vol. 2 #8 for more on CRAMM]
6. Establish a Technical Observation Post (TOP). A TOP is a team of Subject Matter Experts, Customers and suppliers assembled to brainstorm on the issues identified as responsible for un-availability and classified as most severe -- the #1 issue. The result is a Request for Change (RFC) to improve availability. [See ‘7 Steps to the TOP’ DITY Vol. 2 #10 for more on the TOP]
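
For readers who would rather use a script than graph paper, the SOA and CFIA steps above can be roughly sketched as follows. This is only an illustration under stated assumptions: the outage records, asset names, CFIA table and both function names are hypothetical, and a spreadsheet does the same job.

```python
from collections import defaultdict

# Hypothetical outage records from an SOA: (responsible asset, downtime in minutes).
OUTAGES = [
    ("mail-server", 240),
    ("core-switch", 90),
    ("mail-server", 180),
    ("backup-process", 30),
    ("core-switch", 45),
    ("dns", 15),
]

# Hypothetical CFIA table: service -> {component: has_alternative}.
# has_alternative=False means the service fails when that component fails.
CFIA = {
    "email": {"mail-server": False, "core-switch": False, "dns": True},
    "intranet": {"web-server": True, "core-switch": False, "dns": True},
}


def pareto(outages):
    """Return (asset, downtime, cumulative share) rows, largest contributor first."""
    totals = defaultdict(int)
    for asset, minutes in outages:
        totals[asset] += minutes
    grand_total = sum(totals.values())
    rows, running = [], 0
    for asset, minutes in sorted(totals.items(), key=lambda kv: kv[1], reverse=True):
        running += minutes
        rows.append((asset, minutes, running / grand_total))
    return rows


def single_points_of_failure(cfia):
    """Map each component with no alternative to the services it would take down."""
    impact = defaultdict(list)
    for service, components in cfia.items():
        for component, has_alternative in components.items():
            if not has_alternative:
                impact[component].append(service)
    return dict(impact)


if __name__ == "__main__":
    for asset, minutes, share in pareto(OUTAGES):
        print(f"{asset:15} {minutes:4d} min  cumulative {share:.0%}")
    for component, services in single_points_of_failure(CFIA).items():
        print(f"{component}: single point of failure for {', '.join(services)}")
```

In this made-up data set, one asset out of four accounts for most of the downtime, which is the pattern the Pareto chart is meant to expose. The CFIA pass then shows which components would take down more than one service.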

These six steps in the order presented deliver an understanding of un-availability in a matter of days -- at low to no additional cost. Benefits of the Changes proposed by the TOP should appear in “Incident Lifecycle” metrics within days or weeks.
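
To make the “Incident Lifecycle” baseline concrete, here is a minimal sketch of how the metrics might be computed from an incident log. The incident timestamps, the 24x7 agreed service time and the function names are assumptions made for illustration. MTBF is simplified here to total uptime divided by the number of incidents, and MTTR to average downtime per incident.

```python
from datetime import datetime

# Hypothetical incidents for one service: (failure detected, service restored).
INCIDENTS = [
    (datetime(2006, 1, 3, 9, 0), datetime(2006, 1, 3, 11, 0)),
    (datetime(2006, 1, 12, 14, 0), datetime(2006, 1, 12, 15, 0)),
    (datetime(2006, 1, 25, 8, 0), datetime(2006, 1, 25, 11, 0)),
]

# Assumption: the service is agreed to be available 24x7 for a 31-day month.
AGREED_SERVICE_HOURS = 31 * 24


def hours(delta):
    """Convert a timedelta to hours."""
    return delta.total_seconds() / 3600


def baseline(incidents, agreed_hours):
    """Compute simplified MTTR, MTBF and availability for one reporting period."""
    downtime = sum(hours(restored - failed) for failed, restored in incidents)
    uptime = agreed_hours - downtime
    return {
        "mttr_hours": downtime / len(incidents),  # average downtime per incident
        "mtbf_hours": uptime / len(incidents),    # average uptime per incident
        "availability": uptime / agreed_hours,    # share of agreed time delivered
    }


if __name__ == "__main__":
    for name, value in baseline(INCIDENTS, AGREED_SERVICE_HOURS).items():
        print(f"{name}: {value:.4f}")
```

Running the same calculation before and after the Change proposed by the TOP gives a like-for-like comparison against the baseline.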
