One of the first choices to make when adopting best practices is to define what will be Problems. This simple-sounding decision is much more complicated than it would seem, and is often the cause of confusion. It does not have to be.
Anyone who has taken IT Infrastructure Library® (ITIL®) Foundation certification is familiar with the terms Incident and Problem.
Still, many practitioners implementing ITIL get confused about the relationship between Incidents and Problems. One of the most common questions I encounter is “when does an Incident become a Problem?” When I say “never”, people get more confused.
An Incident is an “unplanned interruption or potential interruption to service.” A Problem is the “underlying root-cause of one or more Incidents or potential Incidents.” Thus, Incidents never become Problems.
The two conditions are distinct, and represent separate situations and activities. Often the next question is “OK, what is a Problem, and where do Problems come from?”
There are three broad answers to this question. The first is to declare a Problem whenever an Incident has no workaround; the second is to declare a Problem if the Incident is a “Major Incident”; and the third is to declare a Problem after resolving one or more Incidents.
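The three answers above can be sketched as a small decision function. This is only an illustration, not anything ITIL prescribes: the Incident class, its field names, and the needs_post_closure_review flag are all hypothetical, and each organization would define its own criteria. Note that the function never changes the Incident; it only reports whether a separate Problem record is warranted.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Incident:
    """Hypothetical Incident record; field names are illustrative only."""
    summary: str
    has_workaround: bool = False
    is_major: bool = False
    is_closed: bool = False
    needs_post_closure_review: bool = False  # assumed local policy flag


def problem_triggers(incident: Incident) -> List[str]:
    """Return the reasons (possibly none) for raising a Problem record."""
    reasons = []
    # Answer 1: an open Incident with no workaround.
    if not incident.is_closed and not incident.has_workaround:
        reasons.append("no workaround")
    # Answer 2: the Incident has been declared a Major Incident.
    if incident.is_major:
        reasons.append("major incident")
    # Answer 3: review after the Incident has been resolved and closed.
    if incident.is_closed and incident.needs_post_closure_review:
        reasons.append("post-closure review")
    return reasons
```

An empty list means the Incident stands alone with no Problem raised, which is a perfectly normal outcome.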
When a Problem comes as a result of Incidents, and not other causes, it is important to delineate where an Incident “ends” and a Problem “begins.”
In what follows, I describe what a Problem is, how to determine whether you have one, how Problems relate to the conditions they are often confused with, and finally when to close them.
Incidents, Problems, Known Errors, and Workarounds
Before I get going with Problems, I need to clear up some commonly confused terms. These terms define conditions along a life cycle: Incidents (and Major Incidents), Problems, and Known Errors. Understanding how these conditions occur and relate to each other makes understanding and defining Problems much easier.
Each of these conditions is distinct and these conditions are not always sequential. For example, you can have a Known Error without any associated Incidents or Problems -- Microsoft, Cisco, and other vendors send out Known Errors daily or weekly. You can have Incidents without Problems, and vice versa as well.
Where Problems Come From
It is a common mistake to confuse Incidents and Problems. According to the ITIL, Problem identification is the task of detecting a Problem. The ITIL offers several examples. You know you have a Problem when:
Problems come from the above activities. As always, ITIL is very flexible, and leaves many key decisions to the practitioner. A couple of examples help make their relationships clearer. I took these examples from real-world ITIL implementations, and they are by no means all-inclusive or definitive, but rather serve to illustrate the underlying logic.
These examples show three scenarios:
Example #1 -- Problem with Open Incident
Imagine that an IT service slows down, breaks, or otherwise does not perform as required. Specifically, a user cannot print a document she needs right away, so she calls the Service Desk. An Incident condition now exists.
The Service Desk tries to locate a workaround to restore her service, but there is no workaround and no mention of this issue in the knowledge base. Since there is no Workaround, past Problem, or Known Error, a Problem condition now exists.
The agent raises a Problem record and Problem Management takes over. The Incident remains open since the user is not able to function, even in a degraded state (there is no workaround).
Problem Management coordinates the activities of technical groups to identify the root-cause. In so doing, they discover a workaround and communicate it to the Service Desk as quickly as possible to get the user operational again, even if degraded. At this point, the Service Desk closes the Incident, since she (the user) can now continue to work.
The Problem remains open, and Problem Management continues until they discover the root-cause. With a workaround and the root-cause established, a Known Error condition now exists. Problem Management verifies all the information in the diagnosis, and updates the knowledge base with the Known Error information.
Once the Known Error is documented, the root-cause and the actions required to resolve the Problem are known, so the Problem can close. The Known Error remains open until the root-cause is resolved.
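To make the sequence in Example #1 concrete, here is a minimal sketch that treats the Incident, the Problem, and the Known Error as three separate records, each with its own open or closed status. The Record class, the step names, and the final "change implemented" step are assumptions made for illustration; they do not come from ITIL or from any particular service desk tool.

```python
from dataclasses import dataclass


@dataclass
class Record:
    """Hypothetical record type; kind is 'Incident', 'Problem', or 'Known Error'."""
    kind: str
    is_open: bool = True

    def close(self):
        self.is_open = False


def example_one_timeline():
    """Replay Example #1 and capture which records are open after each step."""
    incident = Record("Incident")   # the user cannot print
    problem = Record("Problem")     # raised because no workaround exists
    known_error = None              # does not exist yet
    timeline = []

    def snapshot(step):
        # None means the Known Error record has not been created yet.
        ke_open = known_error.is_open if known_error else None
        timeline.append((step, incident.is_open, problem.is_open, ke_open))

    snapshot("problem raised")

    # Problem Management finds a workaround: the Incident closes,
    # but the Problem stays open because the root-cause is unknown.
    incident.close()
    snapshot("workaround found")

    # Root-cause found and documented: a Known Error now exists,
    # so the Problem can close while the Known Error stays open.
    known_error = Record("Known Error")
    problem.close()
    snapshot("known error documented")

    # Assumed final step: a change removes the root-cause.
    known_error.close()
    snapshot("change implemented")
    return timeline
```

At no step does one record turn into another. The Incident closes on its own, the Problem closes on its own, and the Known Error outlives both.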
Example #2 -- Problem From Major Incident
An application server has a severely fragmented hard drive, resulting in increasing response times. As the system slows, its customers and users begin to experience delays, and perhaps timeouts.
Many customers and users are affected. They call the Service Desk, and agents locate and pass on workarounds, but more and more users call. Agents realize that more than enough customers have issues to declare a Major Incident. An agent raises a Problem record and Problem Management takes over.
Incidents continue to open and close as the Service Desk responds to the Major Incident condition. The Problem remains open until Problem Management determines the root-cause, and creates a Known Error. Since a workaround already exists, the Problem can now close.
The Known Error remains open until a change occurs that resolves the underlying fault responsible for the Major Incident, Problem, and the Known Error. Should management decide that it is not cost-effective to implement the change, then the Known Error will remain open indefinitely.
Example #3 -- Problem After Closing Incident
A user cannot print a document she needs right away, so she calls the Service Desk. An Incident condition now exists. The Service Desk locates a workaround to restore her service, and closes the Incident.
Upon review of the Incident at or after closing, the Service Desk agent realizes that this Incident pertained to printing invoices, a Vital Business Function (VBF) defined in the Service Level Agreement (SLA). Since at this company every Incident affecting a VBF requires analysis and reporting by Problem Management, the agent raises a Problem record and Problem Management investigates.
The Problem remains open until Problem Management determines the root-cause and creates a Known Error, and management evaluates the Known Error and decides whether to implement the Change.
As you can see, there are a number of decisions you must make regarding Problems. For example, what is a Major Incident? Which Incidents with workarounds also get Problems? Which do not? Which Incidents require analysis (e.g., a Problem) after closure?
Hopefully it is also clear that Service Level Management and business alignment, through Vital Business Function identification, also come into play.
Understanding what a Problem is, where Problems come from, and how they relate to Incidents, Major Incidents, Known Errors, and Workarounds goes a long way toward helping you establish the criteria for identifying Problems.
Now you know why Incidents do not “become” Problems, and whenever someone asks you “when does an Incident become a Problem?” you can answer “never!”
Entire Contents © 2006 itSM Solutions LLC. All Rights Reserved.