UPDATED FEB. 11, 2006: ADDED LINKS TO SOA, CFIA, FTA, AND CRAMM
UPDATED MAR. 8, 2006: ADDED LINK TO TOP
Availability Management is an ITIL
process that few are familiar with. Yet at virtually no added cost, IT
can realize improvements in availability in as little as a few weeks!
Here’s the secret -- to “stay up”, you have to know why you “went down”.
Put another way, to improve availability, you have to measure, identify and
Following is a 6-step plan to examine
infrastructure (products) and organization (people and process) for
un-availability; identify the #1 issue; and develop a solution to increase
- Start with the “Incident Lifecycle”.
Examine the time spent on Incident detection, diagnosis, repair,
recovery and restoration. Document where un-availability comes from using
metrics for Recoverability (Mean Time To Repair), Reliability (Mean
Time Between Failures) and Serviceability (agreed uptime - downtime)
for external providers. This is a baseline to document improvement.
- Perform a Service Outage Analysis (SOA).
Working with Problem Management and Customers, examine past outages and
identify responsible IT assets (products, people or process). Create a
Pareto chart; graph paper will do nicely. A majority of un-availability
usually results from a minority of assets.
[See ‘Service Outage Analysis in 7 Steps’
DITY Vol. 1 #7 for more on SOA]
- Create Component Failure Impact
Assessment (CFIA) tables. Driven by the SOA, CFIA shows the
scope of impact and locates Single Points of Failure and other flaws. All
it takes is a spreadsheet, or paper and pen. CFIA works for all IT assets
– people and processes (organization) as well as products
(infrastructure).
[See ‘3 Steps to Success with CFIA’ DITY Vol. 1 #4 for more on CFIA]
- Develop Fault Tree Analysis (FTA)
diagrams. Both CFIA and FTA clarify potential flaws. FTA uses a
logical model that shows how a failure can snowball into a major outage.
Like CFIA, FTA may also be done with paper and pen.
[See ‘Fault Tree Analysis Made Easy’
DITY Vol. 1 #5 for more on FTA]
- Prioritize using CRAMM. CRAMM
classifies the risk faced by an asset (threats) due to vulnerabilities in
infrastructure and organization, identified in this case by CFIA and FTA.
A paper- or spreadsheet-based solution works just fine.
Steps to Do It Yourself CRAMM’ DITY Vol. 2 #8 for more on
- Establish a Technical Observation
Post (TOP). A TOP is a team of Subject Matter Experts, Customers and
suppliers assembled to brainstorm on the issues identified as
responsible for un-availability and classified as most severe -- the #1
issue. The result is a Request for Change (RFC) to improve availability.
Steps to the TOP’ DITY Vol. 2 #10 for more on
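For readers who would rather script the Pareto chart from the SOA step than draw it on graph paper, here is a minimal Python sketch. The outage records, the asset names and the `pareto_table` helper are all hypothetical, made up for illustration; substitute figures from your own Incident records.

```python
from collections import Counter

def pareto_table(outages):
    """Rank assets by total downtime and add a cumulative percentage.

    outages: iterable of (asset_name, downtime_minutes) pairs.
    Returns a list of (asset, minutes, cumulative_percent) rows,
    largest contributor first.
    """
    totals = Counter()
    for asset, minutes in outages:
        totals[asset] += minutes

    grand_total = sum(totals.values())
    if grand_total == 0:
        return []

    rows, running = [], 0
    for asset, minutes in totals.most_common():
        running += minutes
        rows.append((asset, minutes, round(100 * running / grand_total, 1)))
    return rows

# Hypothetical outage log: (responsible asset, minutes of un-availability)
sample_outages = [
    ("mail server", 240),
    ("core switch", 90),
    ("mail server", 180),
    ("help desk handoff", 60),
    ("backup process", 30),
]

for asset, minutes, cumulative in pareto_table(sample_outages):
    bar = "#" * (minutes // 10)   # one '#' per 10 minutes of downtime
    print(f"{asset:<18} {minutes:>4} min {cumulative:>5}%  {bar}")
```

In this made-up data a single asset accounts for 70% of the downtime, which is the kind of result that shows where to focus first.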
These six steps, taken in the order presented,
deliver an understanding of un-availability in a matter of days -- at low to
no additional cost. Benefits of the Changes proposed by the TOP should
appear in “Incident Lifecycle” metrics within days or weeks.
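To see whether those benefits really show up, the baseline from step 1 has to be computed the same way before and after the Change. The Python sketch below is one simple way to do that, not a prescribed ITIL calculation. It assumes a 24x7 agreed service time, treats MTTR as average downtime per Incident and MTBF as uptime divided by the number of Incidents, and uses a hypothetical `lifecycle_metrics` helper with invented dates.

```python
from datetime import datetime

def lifecycle_metrics(incidents, period_start, period_end):
    """Summarize one reporting period.

    incidents: list of (went_down, restored) datetime pairs.
    Returns MTTR and MTBF in hours and availability as a percentage.
    Assumes the agreed service time is the whole period (24x7).
    """
    agreed_hours = (period_end - period_start).total_seconds() / 3600
    downtime_hours = sum(
        (restored - went_down).total_seconds() for went_down, restored in incidents
    ) / 3600
    uptime_hours = agreed_hours - downtime_hours
    count = len(incidents)

    return {
        "mttr_hours": downtime_hours / count if count else 0.0,
        "mtbf_hours": uptime_hours / count if count else uptime_hours,
        "availability_pct": 100 * uptime_hours / agreed_hours,
    }

# Hypothetical 30-day baseline with two outages (2 hours and 4 hours)
baseline = lifecycle_metrics(
    [
        (datetime(2006, 1, 5, 9, 0), datetime(2006, 1, 5, 11, 0)),
        (datetime(2006, 1, 20, 14, 0), datetime(2006, 1, 20, 18, 0)),
    ],
    period_start=datetime(2006, 1, 1),
    period_end=datetime(2006, 1, 31),
)
print(baseline)
```

Run the same function on the period after the RFC is implemented and compare the two results. A shorter MTTR or a longer MTBF is the improvement the baseline was created to document.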