Topic: Does Every Incident Equal a Problem?

Question:

What is a good rule of thumb for establishing criteria that differentiate Incident Management from Problem Management?

That is, should every Incident with business impact (from a 15-minute outage to the inconvenience of intermittent errors) be given the due diligence of a root cause analysis?

When does an Incident not warrant this time and investigation?

Is there a resource or white paper to reference to understand and help formulate some guidelines for my organization?

Thank you.

Regards,
Madeline

Response:

Hi Madeline,

Yes, you are right; it would be quite expensive and time-consuming to initiate a root cause analysis for every Incident. In fact, with a single minor Incident, there may not even be enough information for Problem Management to draw upon!

A previous ‘Ask the Expert’ response, Incident and Problem Management, addressed the necessity of distinguishing between Incidents and Problems. It also attempted to dispel the concept that an Incident ‘becomes’ a Problem by noting that Problems cause Incidents, but do not result from Incidents.

That said, I would like to direct you to the Problem Management section of the ITIL V3 Service Operation publication, which contains guidance related to the questions you ask.

First is the concept of Prioritization. Incident Management prioritizes Incidents in terms of their agreed Impact and Urgency to the business customer. Problem Management takes Prioritization one step further. Where Incident Management deals with a single Incident and its agreed priority, Problem Management deals with the competing priorities of all the Problems that may be in the queue at the same time. Therefore, Problem Management adds Frequency and Severity to the priority mix. Severity considers recovery vs. replacement, required skills, cost, length of time to fix, and scope of the Problem.

For example, a 15-minute outage in the financial systems of a publicly traded company during its month-end close could be highly critical, whereas a 15-minute outage in email services mid-month might be merely inconvenient. Likewise, ‘intermittent’ errors that affect the field locations of a highly distributed organization might, as their cost accumulates over time, prove expensive enough to establish the need for remediation.
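
The prioritization idea above can be sketched in a few lines of Python. This is only an illustration: ITIL does not prescribe a scoring formula, and the 1-to-3 scales, the weights, the frequency cap, and the names ProblemCandidate, priority_score, and rank are all assumptions made for the example.

```python
from dataclasses import dataclass

FREQUENCY_CAP = 10  # hypothetical cap so one noisy Problem cannot dominate


@dataclass
class ProblemCandidate:
    name: str
    impact: int     # 1 (low) to 3 (high), assumed scale
    urgency: int    # 1 (low) to 3 (high), assumed scale
    frequency: int  # related Incidents seen in the review period
    severity: int   # 1 (low) to 3 (high), assumed scale


def priority_score(p: ProblemCandidate) -> int:
    """Combine the four factors; the weights are illustrative only."""
    return p.impact * p.urgency + min(p.frequency, FREQUENCY_CAP) + 2 * p.severity


def rank(candidates):
    """Return candidates ordered from highest to lowest score."""
    return sorted(candidates, key=priority_score, reverse=True)


queue = [
    ProblemCandidate("email outage, mid-month", 2, 1, 1, 1),
    ProblemCandidate("finance outage, month-end close", 3, 3, 1, 2),
    ProblemCandidate("intermittent field-site errors", 1, 1, 8, 2),
]
for p in rank(queue):
    print(priority_score(p), p.name)
```

With these made-up numbers, the low-impact but frequent field-site errors rank just behind the month-end outage and well ahead of the one-off email outage, which is the effect of adding Frequency and Severity to the mix.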

I also suggest that you take a look at the Event Management process in the Service Operation publication. While Incident Management deals with Incidents that impact agreed service levels with the business customers and users, Event Management addresses the low-level granular ‘events’ that occur in IT services and Configuration Items. For example, a single disk failure in a disk array would be an Event that requires remediation. Because it occurred in a properly configured array, it would not result in a service interruption to the user. Hence, there was an Event, but no Incident. If disk failures keep recurring, however, you may decide to classify the pattern as a Problem that needs resolution.
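
As a rough sketch of the Event, Incident, and Problem distinction just described: an occurrence with no service interruption is labelled an Event, one with an interruption is labelled an Incident, and a Configuration Item that keeps failing is flagged as a Problem candidate. The threshold of three occurrences and the names Occurrence, label, and problem_candidates are hypothetical; your organization would set its own criteria.

```python
from collections import Counter
from typing import List, NamedTuple


class Occurrence(NamedTuple):
    ci: str                    # Configuration Item that failed
    service_interrupted: bool  # True if users lost agreed service


def label(o: Occurrence) -> str:
    """Event vs. Incident, following the distinction drawn above."""
    return "incident" if o.service_interrupted else "event"


def problem_candidates(occurrences: List[Occurrence], threshold: int = 3) -> List[str]:
    """Flag CIs whose occurrences reach the (assumed) repeat threshold."""
    counts = Counter(o.ci for o in occurrences)
    return sorted(ci for ci, n in counts.items() if n >= threshold)


history = [
    Occurrence("disk-array-01", False),  # redundant array absorbed the failure
    Occurrence("disk-array-01", False),
    Occurrence("mail-server-01", True),  # users lost email
    Occurrence("disk-array-01", False),
]
print([label(o) for o in history])
print(problem_candidates(history))
```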

An organization we are familiar with treated the initial occurrence of a failure as a Problem. Within this context it developed the remediation (workaround and break/fix) and created a “known error” so that the next occurrence could be linked to a known error record with its workaround and remediation procedure. That was an effective approach for the nature of the service it provided. If you think about it, a similar approach is doable in a “regular IT shop”, where it should reduce Mean Time to Recover (MTTR) and Mean Time to Restore Service (MTRS) and make better use of Incident staff resources.
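
The known-error approach described above might look something like the following sketch. It assumes failures can be matched by a simple text signature and uses an in-memory dictionary in place of a real known error database; KnownError, record_failure, and the sample signature are hypothetical names, not part of any ITIL tool.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class KnownError:
    signature: str                    # how the failure is recognized
    workaround: Optional[str] = None  # filled in once remediation exists
    incident_ids: List[str] = field(default_factory=list)


def record_failure(db: Dict[str, KnownError], signature: str, incident_id: str) -> str:
    """First occurrence opens a record; later ones link to it."""
    known = db.get(signature)
    if known is None:
        db[signature] = KnownError(signature, incident_ids=[incident_id])
        return "new problem record opened"
    known.incident_ids.append(incident_id)
    if known.workaround:
        return "apply workaround: " + known.workaround
    return "linked to open problem; no workaround yet"


db: Dict[str, KnownError] = {}
print(record_failure(db, "disk-ctrl-timeout", "INC-1"))
db["disk-ctrl-timeout"].workaround = "restart controller service"
print(record_failure(db, "disk-ctrl-timeout", "INC-2"))
```

The second call returns the stored workaround immediately, which is where the reduction in restoration time would come from.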

I hope this answers your question.