Availability Management on a Budget

Which familiar ITIL process has virtually no added costs, yet can realize visible improvements in availability in as little as a few weeks?

You guessed it: Availability Management, from the Service Design phase!

Here’s the secret: to "stay up," you have to know why you "went down." Put another way, to improve availability, you have to measure, identify and address un-availability.

Following is a 6-step plan to examine infrastructure (products) and organization (people and process) for un-availability, identify the prime issue(s), and develop a solution to increase availability:

  1. Start with the “Incident Lifecycle.” Examine the time spent on Incident detection, diagnosis, repair, recovery and restoration. Document where un-availability comes from using metrics for Recoverability (Mean Time To Repair) and Reliability (Mean Time Between Failures), plus Serviceability (agreed uptime minus downtime) for external providers. These figures are the baseline against which you will document improvement.


  2. Perform a Service Outage Analysis (SOA). Working with Problem Management and Customers, examine past outages and identify responsible IT assets (products, people or process). Create a Pareto chart; graph paper will do nicely. A majority of un-availability usually results from a minority of assets. See Service Outage Analysis in 7 Steps for more on SOA.


  3. Create Component Failure Impact Assessment (CFIA) tables. Driven by the SOA, CFIA shows the scope of impact and locates Single Points of Failure and other flaws. All it takes is a spreadsheet, or paper and pen. CFIA works for all IT assets – people and processes (organization) as well as products (infrastructure). See 3 Steps to Success with CFIA DITY for more on CFIA.


  4. Develop Fault Tree Analysis (FTA) diagrams. Both CFIA and FTA clarify potential flaws. FTA uses a logical model that shows how a failure can snowball into a major outage. Like CFIA, FTA may also be done with paper and pen. See Fault Tree Analysis Made Easy DITY for more on FTA.


  5. Prioritize using CRAMM. CRAMM classifies the risk faced by an asset (threats) due to vulnerabilities in infrastructure and organization, identified in this case by CFIA and FTA. A paper or spreadsheet-based solution works just fine. See 10 Steps to Do It Yourself CRAMM DITY for more on CRAMM.


  6. Establish a Technical Observation Post (TOP). A TOP is a team of Subject Matter Experts, Customers and suppliers assembled to brainstorm on the issues identified as responsible for un-availability and classified as most severe – the #1 issue. The result is a Request for Change (RFC) to improve availability. See 7 Steps to the TOP DITY for more on the TOP.
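
If you would rather script steps 1 and 2 than use paper or a spreadsheet, here is a minimal Python sketch. It is only an illustration: the incident records, asset names and the 30-day window are invented, and it assumes one common convention, MTTR = total downtime / number of failures and MTBF = total uptime / number of failures. Adjust the definitions to whatever your agreements actually specify.

```python
from collections import defaultdict

# Hypothetical incident log (illustrative data, not from any real system):
# (asset, hour the outage started, hours until service was restored)
INCIDENTS = [
    ("mail-server", 100.0, 4.0),
    ("core-switch", 250.0, 1.0),
    ("mail-server", 400.0, 6.0),
    ("backup-job", 600.0, 1.0),
]
PERIOD_HOURS = 720.0  # assumed 30-day measurement window


def baseline(incidents, period_hours):
    """Return (MTTR, MTBF, availability %) for the measurement window."""
    downtime = sum(hours for _, _, hours in incidents)
    failures = len(incidents)
    mttr = downtime / failures                      # average time to restore
    mtbf = (period_hours - downtime) / failures     # average uptime per failure
    availability = 100.0 * (period_hours - downtime) / period_hours
    return mttr, mtbf, availability


def pareto(incidents):
    """Rank assets by total downtime, with cumulative % of all downtime."""
    per_asset = defaultdict(float)
    for asset, _, hours in incidents:
        per_asset[asset] += hours
    total = sum(per_asset.values())
    rows, running = [], 0.0
    ranked = sorted(per_asset.items(), key=lambda kv: kv[1], reverse=True)
    for asset, hours in ranked:
        running += hours
        rows.append((asset, hours, round(100.0 * running / total, 1)))
    return rows


if __name__ == "__main__":
    mttr, mtbf, availability = baseline(INCIDENTS, PERIOD_HOURS)
    print(f"MTTR {mttr:.1f} h, MTBF {mtbf:.1f} h, availability {availability:.2f}%")
    for asset, hours, cumulative in pareto(INCIDENTS):
        print(f"{asset:12s} {hours:5.1f} h  cumulative {cumulative:5.1f}%")
```

The first line of output is your step 1 baseline; the ranked list is the Pareto chart in table form. With this sample data a single asset accounts for most of the downtime, which is the "minority of assets" pattern the SOA looks for, and it is the natural candidate to carry into the CFIA, FTA and CRAMM steps.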


SUMMARY

These six steps in the order presented deliver an understanding of un-availability in a matter of days – at low to no additional cost. Benefits of the Changes proposed by the TOP step could begin to appear in your “Incident Lifecycle” metrics within a few weeks – or even days.
