Expanding the Expanded Incident Lifecycle
The key to improving the quality of IT service is understanding and using one of ITIL’s simplest concepts - the Expanded Incident Lifecycle.
If you have attended an ITIL Foundation course, you undoubtedly remember the slide depicting the Expanded Incident Lifecycle (Figure 1, below). That is the graphical timeline that starts with an Incident on the left, progresses through the various stages of diagnosis, repair, restoration and closure, and then continues to the next Incident.

The labels dispersed along the Incident timeline are not just handy monikers that the Service Desk uses to report the changing state of an Incident. They represent critical intersections of ITIL processes and activities and provide a roadmap to shorten the time to recover from an Incident and lengthen the time of error-free operation.

Mean Times to . . .

Before we start, let’s review a few key ITIL measurements, the “Mean Times to . . .”

MTTR (Mean Time to Repair) - This is the average elapsed time between detecting an Incident and repairing the failed component; e.g., diagnosing and replacing a failed disk. Upon the completion of this activity, there is a functioning disk, but data has not been restored, and the users are still unable to access or use the service.

Essentially, this measures the technical response to diagnose and repair the failed component. The shorter this time, the better, because shorter repair times mean less downtime for the user.

MTRS (Mean Time to Restore Service) - This is the average elapsed time between detecting an Incident and fully restoring the service to the user; e.g., restoring data to the disk, recovering and restarting interfaces to other applications, informing the users that the service is available, and initiating user access (you may not want all of your users to log in simultaneously upon repair of the service!).

This is a measure of the quality of your operational processes, as well as system design to facilitate recovery after failure. Again, shortening these times should be your goal.

MTBF (Mean Time Between Failures) - This is the average elapsed time between restoration of service following an Incident and detection of the next Incident. In this case, a big number representing a long time between failures is good because it indicates a reliable service.

MTBSI (Mean Time Between System Incidents) - This is the average elapsed time between Incidents, including the downtime represented by the MTTR and MTRS measurements. By understanding the proportion of repair and restoration time versus failure-free time for a particular service, you can begin to prioritize service and system improvements. For example, you may decide to commit resources to improving a critical business service that experiences few, but lengthy, failures, and give a lower priority to repairing a less business-critical service that experiences frequent failures but requires few resources and little time to restore.
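
To make the relationships among these four measurements concrete, here is a minimal sketch in Python that computes all four from a log of Incident timestamps. The records and field names are invented for illustration, not drawn from any ITIL specification; note the relationship it demonstrates, that MTBSI is the error-free time (MTBF) plus the preceding downtime.

    from datetime import datetime
    from statistics import mean

    # Invented sample data; each record captures one pass through the timeline.
    incidents = [
        {"detected": datetime(2024, 1, 3, 9, 0),
         "repaired": datetime(2024, 1, 3, 9, 45),
         "restored": datetime(2024, 1, 3, 10, 30)},
        {"detected": datetime(2024, 1, 10, 14, 0),
         "repaired": datetime(2024, 1, 10, 14, 20),
         "restored": datetime(2024, 1, 10, 15, 0)},
    ]

    def hours(delta):
        return delta.total_seconds() / 3600

    # MTTR: detection to repair of the failed component
    mttr = mean(hours(i["repaired"] - i["detected"]) for i in incidents)

    # MTRS: detection to full restoration of the service to the users
    mtrs = mean(hours(i["restored"] - i["detected"]) for i in incidents)

    consecutive = list(zip(incidents, incidents[1:]))

    # MTBF: restoration of one Incident to detection of the next (error-free time)
    mtbf = mean(hours(nxt["detected"] - cur["restored"]) for cur, nxt in consecutive)

    # MTBSI: detection to detection, i.e. error-free time plus the downtime
    mtbsi = mean(hours(nxt["detected"] - cur["detected"]) for cur, nxt in consecutive)

    print(f"MTTR {mttr:.2f}h, MTRS {mtrs:.2f}h, MTBF {mtbf:.2f}h, MTBSI {mtbsi:.2f}h")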

Figure 1: The Expanded Incident Lifecycle

Expanding the Expanded Incident Lifecycle

Now that we’ve looked at what the Expanded Incident Lifecycle diagram tells us, let’s take a look at which ITIL processes support it, and how you can use it to pinpoint areas to automate or improve.

Occurrence – By definition, an Incident is an unplanned disruption to an agreed service. ITIL offers a number of proactive ways to protect a service before an Incident ever occurs.

Detection – Incident resolution starts when a user or an automated system detects an error with a Configuration Item. Detection generally occurs some time after the occurrence of the event. The goal is to shorten the time between Occurrence and Detection as much as possible. This activity ties directly into the Diagnosis stage that follows.
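
As a simple illustration of automated detection, the sketch below polls a service and raises an Incident the moment a check fails. The URL, polling interval and open_incident() helper are hypothetical stand-ins for your own monitoring and Service Desk tooling; the point is that the polling interval bounds the gap between Occurrence and Detection.

    import time
    import urllib.request
    from datetime import datetime

    SERVICE_URL = "http://example.internal/health"  # hypothetical health endpoint
    POLL_SECONDS = 30  # the Occurrence-to-Detection gap is at most this long

    def open_incident(detected_at, reason):
        # Stand-in for logging an Incident with your Service Desk tool.
        print(f"Incident detected at {detected_at:%Y-%m-%d %H:%M}: {reason}")

    while True:
        try:
            urllib.request.urlopen(SERVICE_URL, timeout=5)
        except OSError as exc:  # HTTP errors, timeouts, refused connections
            open_incident(datetime.now(), str(exc))
        time.sleep(POLL_SECONDS)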

Diagnosis – During this stage, staff members try to identify the characteristics of the Incident and match it to previous Incidents, Problems and Known Errors. If Incident Management cannot match the Incident, the Problem Management process should start.
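
As a toy example of that matching step, assume a Known Error store keyed by symptom keywords. Both the records and the matching rule below are invented for this sketch:

    # Hypothetical Known Error records: symptom keywords -> documented workaround.
    KNOWN_ERRORS = {
        frozenset({"disk", "full"}): "KE-0042: purge old logs, then extend the volume",
        frozenset({"login", "timeout"}): "KE-0107: restart the authentication service",
    }

    def diagnose(description):
        """Return a workaround if the Incident matches a Known Error, else None."""
        words = set(description.lower().split())
        for symptoms, workaround in KNOWN_ERRORS.items():
            if symptoms <= words:  # every symptom keyword is present
                return workaround
        return None  # no match: this is where Problem Management should start

    print(diagnose("users report the disk is full on the file server"))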

Repair – Sometimes a repair might raise a Request for Change (RFC) to change one or more Configuration Items (CI). After the CI is repaired, it may still be unavailable to the user and require recovery.

Recovery – This is the process of restoring the failed CI to the last recoverable state. This includes any required testing, final adjustment, configuration, etc.
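
The same sequence can be scripted. The commands below are hypothetical stand-ins for your own backup and verification tooling; the point is simply that recovery is not complete until testing passes.

    import subprocess

    # Hypothetical commands; substitute your actual backup and test tools.
    RESTORE = ["restore-tool", "--latest-snapshot", "--target", "db01"]
    VERIFY = ["smoke-test", "--target", "db01"]

    def recover_ci():
        """Bring the CI back to its last recoverable state, then test it."""
        subprocess.run(RESTORE, check=True)  # restore the last good state
        result = subprocess.run(VERIFY)      # required testing before handover
        return result.returncode == 0        # ready for service restoration?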

Recovery also has a proactive side, which results in designing services and systems that are faster and easier to recover.

Restoration – Service restoration makes the recovered service available to the user, so that the user can resume work.

On the proactive side, restoration capabilities can be “designed into” the service.
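
For instance, one capability that can be designed in is staged re-admission of users, so that the recovered service is not overwhelmed the moment it comes back (recall the caution above about simultaneous logins). The groups, pacing and enable_access() helper below are invented:

    import time

    # Hypothetical user groups, restored in waves; adjust to your own service.
    USER_GROUPS = ["critical-operations", "back-office", "all-remaining-users"]
    RAMP_SECONDS = 300  # assumed 5-minute pause between waves

    def enable_access(group):
        # Stand-in for re-enabling logins in your access-management tool.
        print(f"access restored for {group}")

    for group in USER_GROUPS:
        enable_access(group)
        time.sleep(RAMP_SECONDS)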

Closure – Closure occurs some time after restoration. It should give the user ample time to “shake out” the repaired service to ensure that it is really working, but it should not be so far in the future that users and staff have difficulty reconstructing the parameters of the actual failure.
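
That timing judgment can even be encoded as a simple policy check. The two windows below are invented numbers, not ITIL guidance:

    from datetime import datetime, timedelta

    SHAKE_OUT = timedelta(days=2)      # assumed time for users to exercise the service
    MEMORY_LIMIT = timedelta(days=14)  # assumed point after which details fade

    def may_close(restored_at, now=None):
        """Allow Closure after the shake-out window but before memories fade."""
        age = (now or datetime.now()) - restored_at
        return SHAKE_OUT <= age <= MEMORY_LIMIT

    print(may_close(datetime(2024, 1, 3, 10, 30), now=datetime(2024, 1, 8)))  # True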

The Final Step – Closing the Loop

You will not see this “final” step in the Expanded Incident Lifecycle diagram because it is not really a step, but an action “implied” by the four “Mean Time” measurements.

The action that ties together all of the steps along the timeline is to use MTTR, MTRS, MTBF and MTBSI to measure and analyze the effectiveness and efficiency of all of the activities and processes that contribute to Incident restoration.
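
One simple way to turn these measurements into the prioritization described earlier is to compare each service’s share of downtime per failure cycle. The per-service figures below are invented:

    # Invented per-service figures, in hours.
    services = {
        "order-entry": {"mtrs": 8.0, "mtbf": 300.0},  # few but lengthy failures
        "reporting": {"mtrs": 0.25, "mtbf": 24.0},    # frequent but quick failures
    }

    for name, m in services.items():
        mtbsi = m["mtbf"] + m["mtrs"]      # one full failure cycle
        availability = m["mtbf"] / mtbsi   # share of the cycle spent error-free
        print(f"{name}: availability {availability:.1%}")

Despite failing far more often, the reporting service here is down a smaller share of the time, which is exactly the kind of insight that drives the resource decisions described above.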

Were appropriate resources available to assist with the Incident resolution? Were appropriate interfaces in place so that resources could be applied in a timely manner?

And, finally, did you learn something that can help the process work better the next time – or prevent the Incident from occurring?
