Availability Management on a Budget
Which familiar ITIL process has virtually no added costs, yet can realize visible improvements in availability in as little as a few weeks?
You guessed it, the Service Design Phase’s Availability Management!
Here’s the secret -- to "stay up," you have to know why you "went down." Put another way, to improve availability, you have to measure, identify and address un-availability.
Following is a 6-step plan to examine infrastructure (products) and organization (people and process) for un-availability, identify the prime issue(s), and develop a solution to increase availability:
- Start with the “Incident Lifecycle.” Examine the time spent on Incident detection, diagnosis, repair, recovery and restoration. Document where un-availability comes from using metrics for Recoverability (Mean Time To Repair), Reliability (Mean Time Between Failures) and Serviceability (agreed uptime minus downtime) for external providers. These figures become the baseline against which you document improvement.
- Perform a Service Outage Analysis (SOA). Working with Problem Management and Customers, examine past outages and identify responsible IT assets (products, people or process). Create a Pareto chart; graph paper will do nicely. A majority of un-availability usually results from a minority of assets. See Service Outage Analysis in 7 Steps for more on SOA.
- Create Component Failure Impact Assessment (CFIA) tables. Driven by the SOA, CFIA shows the scope of impact and locates Single Points of Failure and other flaws. All it takes is a spreadsheet, or paper and pen. CFIA works for all IT assets – people and processes (organization) as well as products (infrastructure). See 3 Steps to Success with CFIA DITY for more on CFIA.
- Develop Fault Tree Analysis (FTA) diagrams. Both CFIA and FTA clarify potential flaws. FTA uses a logical model that shows how a failure can snowball into a major outage. Like CFIA, FTA may also be done with paper and pen. See Fault Tree Analysis Made Easy DITY for more on FTA.
- Prioritize using CRAMM. CRAMM classifies the risk faced by an asset (threats) due to vulnerabilities in infrastructure and organization, identified in this case by CFIA and FTA. A paper or spreadsheet-based solution works just fine. See 10 Steps to Do It Yourself CRAMM DITY for more on CRAMM.
- Establish a Technical Observation Post (TOP). A TOP is a team of Subject Matter Experts, Customers and suppliers assembled to brainstorm on the issues identified as responsible for un-availability and classified as most severe – the #1 issue. The result is a Request for Change (RFC) to improve availability. See 7 Steps to the TOP DITY for more on the TOP.
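If graph paper is not at hand, the Pareto ranking from the Service Outage Analysis step can be sketched in a few lines of Python. This is only an illustration: the function name `pareto`, the asset names and the downtime figures are all made up for the example, and your own outage log would replace them.

```python
from collections import defaultdict


def pareto(outages):
    """Rank assets by total downtime, largest first.

    outages: iterable of (asset_name, downtime_minutes) pairs.
    Returns a list of (asset, total_minutes, cumulative_share) tuples,
    where cumulative_share runs from the top asset down to 1.0.
    """
    totals = defaultdict(float)
    for asset, minutes in outages:
        totals[asset] += minutes

    grand_total = sum(totals.values())
    if grand_total == 0:
        return []  # nothing to rank

    rows = []
    running = 0.0
    ranked = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
    for asset, minutes in ranked:
        running += minutes
        rows.append((asset, minutes, running / grand_total))
    return rows


# Hypothetical outage log (asset, minutes down); not real data.
OUTAGES = [
    ("mail server", 240),
    ("core switch", 30),
    ("change process", 120),
    ("mail server", 180),
    ("backup job", 30),
]

if __name__ == "__main__":
    for asset, minutes, cumulative in pareto(OUTAGES):
        print(f"{asset:<15} {minutes:>6.0f} min  {cumulative:6.1%}")
```

The assets at the top of the list, where the cumulative share climbs fastest, are the natural candidates for CFIA, FTA and CRAMM in the steps that follow.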
These six steps, taken in the order presented, deliver an understanding of un-availability in a matter of days – at low to no additional cost. Benefits of the Changes proposed by the TOP step could begin to appear in your “Incident Lifecycle” metrics within a few weeks – or even days.
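To see whether the metrics actually move, it helps to compute the baseline the same way each time. The sketch below assumes the common textbook definitions (availability as agreed service time minus downtime, divided by agreed service time; MTTR as average downtime per incident; MTBF as average uptime per incident). The article does not spell out formulas, so treat these, the function name `baseline` and the sample figures as assumptions.

```python
def baseline(agreed_hours, downtimes):
    """Compute simple Incident Lifecycle baseline metrics.

    agreed_hours: agreed service time for the period, in hours.
    downtimes:    list of downtime durations in hours, one per incident.
    """
    incidents = len(downtimes)
    down = sum(downtimes)
    up = agreed_hours - down
    return {
        # Share of agreed service time the service was actually up.
        "availability": up / agreed_hours,
        # Mean Time To Repair: average downtime per incident.
        "mttr_hours": down / incidents if incidents else 0.0,
        # Mean Time Between Failures: average uptime per incident.
        "mtbf_hours": up / incidents if incidents else float("inf"),
    }


if __name__ == "__main__":
    # Hypothetical month: 720 agreed hours, three incidents.
    before = baseline(720, [2, 4, 6])
    for name, value in before.items():
        print(f"{name}: {value:.4f}")
```

Running the same calculation on the period after the TOP's Change goes in gives a like-for-like comparison against the baseline.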