Incident and Problem Management
In my last post I presented a real-life example of incident management. I described an event that interrupted service: my car stopped running. I also described the technique that I used to restore service. I ended that post with a question: if I owned that car for one year and the incident that I described happened every two weeks, how many incidents did I experience?
The answer is twenty-six.
Each interruption to service is represented by an incident. If at sixteen I had used a system to track incident tickets, I would have opened a new incident ticket every two weeks, each time the car stopped running.
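The arithmetic behind that answer can be sketched in a few lines of Python. This is only an illustration: the function name, the start date, and the assumption that the outage recurred exactly every 14 days over a 365-day year are all hypothetical and not part of the original story.

```python
from datetime import date, timedelta

def incident_dates(start, days_owned=365, interval_days=14):
    """Return one date per outage, assuming an outage every `interval_days` days."""
    dates = []
    elapsed = interval_days
    while elapsed <= days_owned:
        dates.append(start + timedelta(days=elapsed))
        elapsed += interval_days
    return dates

# Hypothetical purchase date; any start date gives the same count.
tickets = incident_dates(date(2000, 1, 1))
print(len(tickets))  # one ticket per outage over the year
```

Each date in the list stands for one incident ticket that would have been opened and then closed once the workaround restored service.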
As I described the situation in the last post, the first time the incident occurred I devised a temporary fix, or what ITIL calls a workaround. The intent of a workaround is to restore service. When a workaround is applied, service is restored, and if the customer is happy with the resolution, the incident ticket is typically closed. In other words, the interruption to service is over.
In the case of my car, once I applied the workaround, service was restored, the customer was happy, and the incident was over.
Aspects of Problem Management
Thus far in this series of posts I have said almost nothing about the problem aspects of the scenario. ITIL defines a problem as the unknown, underlying cause of one or more incidents. Problem management is the process that seeks to prevent incidents from happening and to minimize the impact of incidents that aren’t avoidable.
There are two flavors of problem management: proactive and reactive. Proactive problem management can be thought of as things that are done to prevent incidents from occurring. In other words, proactive problem management seeks to correct potential root causes before they result in incidents. On the other hand, reactive problem management seeks to correct the root causes of incidents that have already occurred.
In the scenario I described in the last post, I indicated that there were some proactive things that we did before I ever drove the car. My father and I replaced the brakes on the car and had a new set of tires installed. These were preventative measures and are an example of proactive problem management. Without these preventative measures, we would have accepted significant risk, and that risk would likely have been realized. For example, without effective brakes, I might not have been able to stop the car. That would have resulted in an “incident”.
The scenario also included aspects of reactive problem management. One of the activities of problem management is to identify workarounds. In the scenario, I arrived at the workaround of clearing the fuel filter. That workaround restored service, and it was the culmination of a series of troubleshooting activities that I described in last week’s post.
I said nothing in the scenario about correcting the problem. Problem management, in addition to identifying workarounds, also identifies root causes. Once both the workaround and the root cause are understood and documented, a known error exists. Incident management uses known errors to restore service faster. In the case of my car story, the workaround was clearing the fuel filter. Over the one year that I owned the car, I became very good at applying that workaround.
Workarounds, however, do not correct the root cause. They are temporary fixes. Another aspect of problem management is to identify and, where it makes sense, correct root causes. In my situation I identified the root cause as rust and dirt that over time had built up in the gas tank. The car had sat in a field for a while. As the car was driven, the sediment would work its way into the fuel line and ultimately be stopped from entering the engine by the fuel filter.
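To make the known error idea concrete, here is a minimal sketch in Python of how a documented workaround and root cause might be recorded and then reused to close a recurring incident. The class names, fields, and the resolve function are hypothetical and invented for illustration; they are not taken from ITIL or from any particular ticketing tool.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class KnownError:
    """A problem whose workaround and root cause are both documented."""
    symptom: str
    root_cause: str
    workaround: str

@dataclass
class Incident:
    """A single interruption to service."""
    symptom: str
    resolution: Optional[str] = None
    closed: bool = False

def resolve(incident: Incident, known_errors: List[KnownError]) -> bool:
    """Close the incident with a documented workaround, if one matches."""
    for known_error in known_errors:
        if known_error.symptom == incident.symptom:
            incident.resolution = known_error.workaround
            incident.closed = True
            return True
    return False  # no known error matched; troubleshooting is still needed

# The car story expressed as a known error record.
fuel_problem = KnownError(
    symptom="car stops running",
    root_cause="rust and dirt from the gas tank clogging the fuel filter",
    workaround="clear the fuel filter",
)

outage = Incident(symptom="car stops running")
matched = resolve(outage, [fuel_problem])
print(matched, outage.resolution)
```

Note that closing the incident leaves the known error record untouched. The root cause is still there, which is why the same incident keeps coming back.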
Changing the fuel filter would not correct the root cause. In fact, the fuel filter was working as intended. There are several options to correct the root cause, and I used to tell my father that the “correction” was to buy me a new car! Needless to say, that didn’t happen. Other potential corrections could have included replacing the fuel tank, draining and coating the fuel tank, or selling the car to someone else.
In my case, I continued to deal with the periodic outage for the one year that I owned the car, and then I sold the car (with full disclosure of course) to someone else. Effectively, and from my perspective, this corrected the root cause by removing it.
Problem management and incident management are tightly linked processes; however, they are different and require different skillsets and a different focus. People sometimes confuse cause and effect. Problem and incident management processes exist because causes and effects are often treated differently, and many factors help determine the appropriate treatment of a cause or an effect.
As I’ve mentioned in this series of posts, I lived with this problem for the one year that I owned that car. During that time I kept applying a workaround in order to keep my car working. In the next installment of this series I will discuss aspects of workarounds and why it’s not always a great idea to continuously invoke a workaround.