The Problem with Problems

Vol. 2.35

SEPTEMBER 6, 2006

DITY Weekly Reader
The workable, practical guide to Do IT Yourself

One of the first choices to make when adopting best practices is to define what will count as a Problem. This simple-sounding decision is much more complicated than it would seem, and is often the cause of confusion. It does not have to be.

By Hank Marquis

Anyone who has taken the IT Infrastructure Library® (ITIL®) Foundation certification is familiar with the terms Incident and Problem.

Still, many practitioners implementing ITIL get confused about the relationship between Incidents and Problems. One of the most common questions I encounter is “when does an Incident become a Problem?” When I say “never”, people get more confused.

An Incident is an “unplanned interruption or potential interruption to service.” A Problem is the "underlying root-cause of one or more Incidents or potential Incidents." Thus, Incidents never become Problems.

The two conditions are distinct, and represent separate situations and activities. Often the next question is “ok, what is a Problem and where do they come from?”

There are three broad answers to this question. First is to declare a Problem whenever an Incident has no workaround; the second is to declare a Problem if the Incident is a “Major Incident”; and the third is to declare a Problem after resolving one or more Incidents.

When a Problem comes as a result of Incidents, and not other causes, it is important to delineate where an Incident “ends” and a Problem “begins.”

Below, I describe what a Problem is, how to determine whether you have one, how Problems relate to other commonly associated conditions, and finally when to close them.

Incidents, Problems, Known Errors, and Workarounds

Before I get going with Problems, we need to clear up some commonly confused terms. These terms define conditions along a life cycle – Incidents (and Major Incidents), Problems, and Known Errors. Understanding how these conditions occur and relate to each other helps make understanding and defining Problems much easier.

An Incident condition is “any event that is not part of the standard operation of a service and that causes, or may cause, an interruption to, or a reduction in, the quality of that service.” While an Incident is usually a disruption to an IT service, there are three broad categories of Incidents: faults, service requests, and application queries.
A Major Incident is any Incident condition with severe negative consequences. Major Incidents require a solution be found as soon as possible. As with many areas of the ITIL, you have to decide what a Major Incident is. For example, you might define a Major Incident as “any issue that causes a disruption of service to three or more internal or external customers.”
A Problem condition exists when there is an “undiagnosed underlying root-cause of one or more Incidents or potential Incidents.”
A Known Error condition exists after discovering the root cause of a Problem. Best practice calls for a Workaround as part of the Known Error; however, not all Known Errors have Workarounds. A Known Error condition persists until resolved by a change. Not all Known Errors get resolved; if a fix is not cost effective, a Known Error can stay open indefinitely.
A Workaround is a way of preventing or resolving an Incident or Problem. Workarounds can be used to temporarily resolve an issue, or steer the user toward another solution.

Each of these conditions is distinct and these conditions are not always sequential. For example, you can have a Known Error without any associated Incidents or Problems -- Microsoft, Cisco, and other vendors send out Known Errors daily or weekly. You can have Incidents without Problems, and vice versa as well.
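One way to picture this independence is as separate record types that reference each other without ever turning into one another. The following is only a minimal sketch; the class names, field names, and sample values are my own illustrative assumptions and do not come from ITIL or any particular service desk tool.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Incident:
    summary: str
    is_open: bool = True
    # A link to a Problem record, if one exists. The Incident is never
    # converted into the Problem; it only points at it.
    problem_id: Optional[int] = None


@dataclass
class Problem:
    problem_id: int
    summary: str
    is_open: bool = True


@dataclass
class KnownError:
    root_cause: str
    workaround: Optional[str] = None  # not all Known Errors have a Workaround
    problem_id: Optional[int] = None  # a vendor notice has no local Problem
    is_open: bool = True              # stays open until a change resolves it


# A Known Error with no associated Incident or Problem (hypothetical vendor notice).
vendor_notice = KnownError(root_cause="driver defect reported by vendor")

# An Incident with no Problem behind it (hypothetical service request).
isolated_incident = Incident(summary="password reset request")
```

Because each record has its own open or closed status, closing an Incident says nothing about the state of a related Problem or Known Error.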

Where Problems Come From

It is a common mistake to confuse Incidents and Problems. According to the ITIL, Problem identification is the task of detecting a Problem. The ITIL offers several examples. You know you have a Problem when:

Matching an Incident against existing Problems and Known Errors is not successful
Study of Incidents shows many recurrent Incidents
Technical analysis of the infrastructure shows something that could lead to Incidents
A Major Incident occurs and requires an immediate solution
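The four triggers above can be sketched as a simple check that returns every reason a Problem record should be raised. This is a hedged illustration, not an ITIL-defined algorithm: the function name, the parameters, and the recurrence threshold of three are all assumptions you would replace with your own criteria.

```python
def problem_triggers(match_failed, similar_incident_count,
                     infrastructure_risk_found, is_major_incident,
                     recurrence_threshold=3):
    """Return the reasons (possibly none) for raising a Problem record.

    recurrence_threshold is a hypothetical local policy value.
    """
    reasons = []
    if match_failed:
        reasons.append("no matching Problem or Known Error")
    if similar_incident_count >= recurrence_threshold:
        reasons.append("recurrent Incidents")
    if infrastructure_risk_found:
        reasons.append("technical analysis found a potential fault")
    if is_major_incident:
        reasons.append("Major Incident")
    return reasons


# An Incident that matched a Known Error, is not recurring, and is not major
# produces no reasons, so no Problem is raised.
no_problem = problem_triggers(False, 1, False, False)

# Proactive analysis alone is enough to raise a Problem, with no Incident at all.
proactive = problem_triggers(False, 0, True, False)
```

Note that none of the checks changes the Incident itself. The function only decides whether a separate Problem record is warranted.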

Problems come from the above activities. As always, ITIL is very flexible, and leaves many key decisions to the practitioner. A few examples help make the relationships among these conditions clearer. I took these examples from real-world ITIL implementations; they are by no means all-inclusive or definitive, but they serve to illustrate the underlying logic.

These examples show three scenarios:

Incidents remain open while the related Problem is also open
Incidents open and close while the related Problem is also open
Incidents close, and then the Problem opens

Example #1 -- Problem with Open Incident

Imagine that an IT service slows down, breaks, or otherwise does not perform as required. Specifically, a user cannot print a document she needs right away, so she calls the Service Desk. An Incident condition now exists.

The Service Desk tries to locate a workaround to restore her service, but there is no workaround and no mention of this issue in the knowledge base. Since there are no Workarounds, past Problems, or Known Errors to draw on, a Problem condition now exists.

The agent raises a Problem record and Problem Management takes over. The Incident remains open since the user is not able to function, even in a degraded state (there is no workaround).

Problem Management coordinates the activities of technical groups to identify the root-cause. In so doing, they discover a workaround and communicate it to the Service Desk as quickly as possible to get the user operational again, even if degraded. At this point, the Service Desk closes the Incident, since the user can now continue to work.

The Problem remains open, and Problem Management continues on until they discover the root-cause. With a workaround and the root-cause established, a Known Error condition now exists. Problem Management verifies all the information in the diagnosis, and updates the knowledge base with the Known Error information.

Once the Known Error is documented, and the root-cause and the actions required to resolve the Problem are known, the Problem can close. The Known Error remains open until a change resolves the root-cause.
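The sequence in Example #1 can be replayed as a short event log to show which records are open at each step. This is only a sketch of the scenario as described; the event list and the `replay` helper are illustrative names of my own.

```python
# Each event is (record, new_status), in the order Example #1 describes.
EXAMPLE_1_EVENTS = [
    ("incident", "open"),      # user cannot print and calls the Service Desk
    ("problem", "open"),       # no workaround or Known Error is found
    ("incident", "closed"),    # Problem Management supplies a workaround
    ("known_error", "open"),   # root cause and workaround are established
    ("problem", "closed"),     # Known Error is documented
]


def replay(events):
    """Return a snapshot of every record's status after each event."""
    state = {}
    history = []
    for record, status in events:
        state[record] = status
        history.append(dict(state))
    return history


history = replay(EXAMPLE_1_EVENTS)
for step, snapshot in enumerate(history, start=1):
    print(step, snapshot)
```

After the second event the Incident and the Problem are both open at the same time, and after the last event only the Known Error remains open. At no step does one record replace another.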

Example #2 -- Problem From Major Incident

An application server has a severely fragmented hard drive, resulting in increasing response times. As the system slows, its customers and users begin to experience delays, and perhaps timeouts.

Many customers and users are affected. They call the Service Desk, and agents locate and pass on workarounds, but more and more users call. Agents realize that enough customers have issues to declare a Major Incident. An agent raises a Problem record and Problem Management takes over.

Incidents continue to open and close as the Service Desk responds to the Major Incident condition. The Problem remains open until Problem Management determines the root-cause, and creates a Known Error. Since a workaround already exists, the Problem can now close.

The Known Error remains open until a change occurs that resolves the underlying fault responsible for the Major Incident, Problem, and the Known Error. Should management decide that it is not cost effective to implement the change, then the Known Error will remain open indefinitely.
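To illustrate how Incidents keep opening and closing while a single Problem is raised, here is a rough sketch that borrows the sample definition of a Major Incident given earlier (three or more affected customers). The function name, the caller names, and the assumption that every call is closed with a workaround are mine, for illustration only.

```python
def handle_calls(callers, major_incident_threshold=3):
    """Process calls in order; return (closed_incidents, problem_raised_at).

    problem_raised_at is the 1-based call number at which the Major Incident
    threshold was first reached, or None if it never was.
    """
    affected_customers = set()
    closed_incidents = 0
    problem_raised_at = None
    for call_number, caller in enumerate(callers, start=1):
        affected_customers.add(caller)
        closed_incidents += 1  # workaround passed on, Incident closed
        if (problem_raised_at is None
                and len(affected_customers) >= major_incident_threshold):
            problem_raised_at = call_number  # raise one Problem record
    return closed_incidents, problem_raised_at


# "ann" calls twice, so the third distinct customer arrives on the fourth call.
closed, raised_at = handle_calls(["ann", "bob", "ann", "cy", "dee"])
print(closed, raised_at)  # 5 4
```

Five Incidents are closed, yet only one Problem is raised, and Incidents continue to arrive after it opens.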

Example #3 -- Problem After Closing Incident

A user cannot print a document she needs right away, so she calls the Service Desk. An Incident condition now exists. The Service Desk locates a workaround to restore her service, and closes the Incident.

Upon review of the Incident at or after closing, the Service Desk agent realizes that this Incident pertained to printing invoices, a Vital Business Function (VBF) defined in the Service Level Agreement (SLA). Since at this company all Incidents affecting a VBF require analysis and reporting by Problem Management, the agent raises a Problem record and Problem Management investigates.

The Problem remains open until Problem Management determines the root-cause and creates a Known Error, and management evaluates the Known Error and decides whether to implement the Change.
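The post-closure review in Example #3 amounts to checking a closed Incident against the list of Vital Business Functions. The sketch below assumes a hypothetical VBF list and record layout; in practice these would come from your SLA and your service desk tool.

```python
from typing import Optional

# Hypothetical list of Vital Business Functions taken from the SLA.
VITAL_BUSINESS_FUNCTIONS = {"invoice printing"}


def review_closed_incident(affected_function: str) -> Optional[dict]:
    """Return a new Problem record if the closed Incident touched a VBF."""
    if affected_function in VITAL_BUSINESS_FUNCTIONS:
        return {
            "type": "Problem",
            "status": "open",
            "reason": f"closed Incident affected VBF: {affected_function}",
        }
    return None  # no Problem needed; the Incident simply stays closed


vbf_problem = review_closed_incident("invoice printing")
routine = review_closed_incident("draft printing")
```

The Incident is already closed when the Problem opens, which is the clearest case of the two records having separate life cycles.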

Summary

As you can see, there are a number of decisions you must make regarding Problems. For example, what is a Major Incident? Which Incidents with workarounds also get Problems? Which do not? Which Incidents require analysis (that is, a Problem) after closure?

Hopefully it is also clear that Service Level Management and business alignment through Vital Business Function identification come into play as well.

Understanding what a Problem is, where Problems come from, and how they relate to Incidents, Major Incidents, Known Errors, and Workarounds goes a long way toward helping you establish the criteria for identifying Problems.

Now you know why Incidents do not “become” Problems, and whenever someone asks you “when does an Incident become a Problem?” you can answer “never!”

For more about Vital Business Functions please see:

Vital Business Function Truths