ITSM

Problem management in ITSM - how to eliminate recurring incidents

Incident vs Problem, the 4-phase problem management process, RCA methods (5 Whys, fishbone diagram, timeline), trend analysis. How to roll it out in ManageEngine ServiceDesk Plus (SDP) and reduce recurring incidents.

Jakub Roszkiewicz · May 2026 · 12 min read

A share of IT incidents are repeats of the same underlying problem. A technician hits the same database error, the same misconfigured router, the same gap in the onboarding procedure - sometimes several times a week. It is not the technician's fault: it is the absence of problem management. In ITSM, problem management is the process that finds the root cause, eliminates it and reduces the risk of the incident recurring. In this article I break it all down: the difference between an incident and a problem, the 4-phase process, three proven RCA methods, how it looks in ManageEngine ServiceDesk Plus (SDP), and how to calculate ROI for your own company.

4 phases of the problem management process per ITIL
3 methods of RCA: 5 Whys, Fishbone, Timeline
Root cause - goal: eliminate the cause, not just the symptom

Incident vs Problem - why they are not the same, and why everyone confuses them

Let us start with the definitions, because everything turns on this.

An incident (handled by Incident Management in ITIL) is an unplanned interruption to an IT service or a degradation of its quality. Incident handling is reactive: someone reports that the network is down, system login is broken, or the printer needs paper. The goal of incident management is to restore normal service as quickly as possible, regardless of what caused the interruption. MTTR (Mean Time To Recovery) for an incident is typically measured in minutes, at most hours.

A problem (handled by Problem Management in ITIL) is the root cause of one or more incidents. Problem management is proactive and investigates in depth: "Why does the network fail every 3 weeks?" or "Why does login break every time we roll out M365?" The goal of problem management is to eliminate the cause permanently, so that the incident does not recur. A problem can stay "in resolution" for months.

Analogy: incident management is putting out the fire; problem management is removing the cause of fires (faulty wiring). An ITSM technician does both - but too many only do the firefighting and then wonder why the fires keep returning.

Consequences of neglecting problem management:

The 4-phase ITIL problem management process - how it works in practice

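ITIL 4 describes problem management as a practice covering problem identification, problem control and error control. This article organizes that work into four main phases. Here is what it looks like in a real scenario.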

1. Detect & Analyze

You identify patterns of recurring incidents. Sources:

  • Tickets in the same category/symptom every week
  • Many incidents for the same IT component (router, server, application)
  • Trend: rising average MTTR for that category

Tool: Incident dashboard in SDP, filter by category, sort by frequency.
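The third signal above, a rising average MTTR, can also be checked outside the dashboard. The following is a minimal sketch, assuming incidents have been exported as (category, ISO week, minutes to resolve) rows; the data, the field layout and the function names are hypothetical and do not come from SDP.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical export: (category, ISO week, minutes to resolve)
incidents = [
    ("Network", "2026-W14", 30),
    ("Network", "2026-W15", 45),
    ("Network", "2026-W15", 55),
    ("Network", "2026-W16", 80),
    ("Printing", "2026-W14", 20),
    ("Printing", "2026-W15", 15),
    ("Printing", "2026-W16", 18),
]

def weekly_mttr(rows):
    """Return {category: [(week, average minutes), ...]} sorted by week."""
    buckets = defaultdict(lambda: defaultdict(list))
    for category, week, minutes in rows:
        buckets[category][week].append(minutes)
    return {
        category: [(week, mean(values)) for week, values in sorted(weeks.items())]
        for category, weeks in buckets.items()
    }

def rising_categories(rows):
    """Categories whose weekly average MTTR increases in every consecutive week."""
    result = []
    for category, series in weekly_mttr(rows).items():
        averages = [avg for _, avg in series]
        if len(averages) >= 2 and all(a < b for a, b in zip(averages, averages[1:])):
            result.append(category)
    return sorted(result)

print(rising_categories(incidents))  # ['Network']
```

A strictly rising series is a deliberately simple criterion; with noisy real data a longer window or a moving average would likely be more useful.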

2. Root Cause Analysis (RCA)

Deep investigation - why this is happening. Use one of three methods:

  • 5 Whys (asking "why" five times)
  • Fishbone Diagram (Ishikawa cause diagram)
  • Timeline Analysis (chronology of events)

Output: an RCA document with an unambiguous root cause.

3. Fix & Verify

Plan the change (Change Management), implement the fix, test. The resolution should be:

  • Approved by the Change Advisory Board (CAB)
  • Tested in a test environment (NOT in production!)
  • Documented in the problem record

Output: the root cause is removed, so the incident should no longer recur.

4. Monitor & Close

For 2-4 weeks monitor whether the problem returns:

  • Zero new incidents in this category?
  • The metric (for example, server ping response time) within normal range?
  • Users no longer hitting the error?

Output: close the problem record and store the knowledge in the Knowledge Base.
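The first closing criterion (zero new incidents during the monitoring window) is easy to automate. The sketch below is an illustration only: the function name, its parameters and the 28-day default are assumptions, and it does not cover the metric or user-feedback checks.

```python
from datetime import date, timedelta

def ready_to_close(fix_date, incident_dates, today, monitoring_days=28):
    """Return True once the monitoring window has passed with no repeat incidents.

    incident_dates: dates of incidents in the problem's category (hypothetical input).
    monitoring_days: 28 matches the upper end of the 2-4 week window.
    """
    if today < fix_date + timedelta(days=monitoring_days):
        return False  # still inside the monitoring window
    # Any incident dated after the fix counts as a recurrence.
    return not any(d > fix_date for d in incident_dates)

fix = date(2026, 4, 1)
print(ready_to_close(fix, [date(2026, 3, 20)], today=date(2026, 5, 1)))  # True
print(ready_to_close(fix, [date(2026, 4, 10)], today=date(2026, 5, 1)))  # False
```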

Three Root Cause Analysis methods - which one for you

RCA is the heart of problem management. Here are the three most popular methods, from simplest to most advanced.

Method 1: 5 Whys (asking "why" five times)

When: simple problems, 1-2 person teams. Example: the server is down and no employee receives email.

How: Start from the symptom and ask "why?" five times:

  1. Why do employees not get email? -> Mail server does not respond.
  2. Why does the mail server not respond? -> The disk is 99% full.
  3. Why is the disk 99% full? -> Logs have not been rotated for 6 months.
  4. Why were logs not rotated? -> The logrotate script did not run.
  5. Why did it not run? -> No admin configured it - the setup was done "quickly" 6 months ago.

Root cause: missing logrotate configuration procedure and low priority for maintenance tasks.

Time needed: 30 minutes to 1 hour.
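To keep the chain from getting lost after the meeting, it can be recorded in a structured form and rendered into the RCA document. This is a minimal sketch using the email example above; the `Why` class and `render_rca` function are hypothetical names, not part of any tool.

```python
from dataclasses import dataclass

@dataclass
class Why:
    question: str
    answer: str

# The chain from the example above, one entry per "why".
chain = [
    Why("Why do employees not get email?", "Mail server does not respond."),
    Why("Why does the mail server not respond?", "The disk is 99% full."),
    Why("Why is the disk 99% full?", "Logs have not been rotated for 6 months."),
    Why("Why were logs not rotated?", "The logrotate script did not run."),
    Why("Why did it not run?", "No admin configured it."),
]

def render_rca(symptom, steps, root_cause):
    """Render the 5 Whys chain as plain text for the problem record."""
    lines = [f"Symptom: {symptom}"]
    for number, step in enumerate(steps, start=1):
        lines.append(f"{number}. {step.question} -> {step.answer}")
    lines.append(f"Root cause: {root_cause}")
    return "\n".join(lines)

report = render_rca(
    "No employee receives email",
    chain,
    "Missing logrotate configuration procedure and low priority for maintenance tasks",
)
print(report)
```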

Method 2: Fishbone Diagram (Ishikawa)

When: more complex problems, multiple possible causes. Hardware + software + procedures interacting.

How: Draw a fishbone and categorize possible causes onto the "bones" - typically 5 categories: People, Process, Technology, Tools, Environment.

Example: Problem: networked printers shut off every hour.

Root cause (often a combination): old driver + overheating + no monitoring + Wi-Fi interference.

Time needed: 1-2 hours, requires a team.

Method 3: Timeline Analysis

When: very complex problems, many systems involved, hard to separate cause from effect.

How: Build a precise timeline of events - every log entry, every alert, every configuration change.

Example: Problem: SQL server stopped responding at 2:35 AM.

Root cause: the backup is scheduled during the nightly load peak (2:30 AM is right after the nightly ETL), which causes resource contention.

Time needed: 2-4 hours, requires access to logs, alerts, monitoring.
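The core of the method is merging events from several sources into one chronology and then looking at what happened just before the failure. Below is a minimal sketch for the SQL server example. All events, timestamps (apart from the 2:30 and 2:35 AM times mentioned above), source names and function names are hypothetical; in practice they would come from real logs and monitoring.

```python
import heapq
from datetime import datetime, timedelta

# Hypothetical events per source: (timestamp, source, message), each list sorted by time.
etl_log = [
    (datetime(2026, 5, 4, 2, 0), "etl", "nightly ETL started"),
    (datetime(2026, 5, 4, 2, 28), "etl", "nightly ETL finished"),
]
scheduler_log = [
    (datetime(2026, 5, 4, 2, 30), "scheduler", "backup job started"),
]
monitoring = [
    (datetime(2026, 5, 4, 2, 33), "monitoring", "disk I/O queue alert"),
    (datetime(2026, 5, 4, 2, 35), "monitoring", "SQL server stopped responding"),
]

def build_timeline(*sources):
    """Merge already-sorted event lists into one list ordered by timestamp."""
    return list(heapq.merge(*sources, key=lambda event: event[0]))

def window_before(timeline, failure_time, minutes=10):
    """Events in the given number of minutes up to and including the failure."""
    start = failure_time - timedelta(minutes=minutes)
    return [event for event in timeline if start <= event[0] <= failure_time]

timeline = build_timeline(etl_log, scheduler_log, monitoring)
suspects = window_before(timeline, datetime(2026, 5, 4, 2, 35))
for timestamp, source, message in suspects:
    print(timestamp.strftime("%H:%M"), source, message)
```

Seeing the backup start two minutes after the ETL finishes, and the alert three minutes later, is the kind of ordering that is hard to spot when each log is read in isolation.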

Trend analysis - how to find recurring problems in a sea of incidents

Problem management starts with the question: "Which incidents recur?" Here is how to find out.

Step 1: Define "recurrence" - the same symptom (for example, login error, connection timeout) in the same IT category (for example, Active Directory) at least 3 times in a month, with MTTR higher than average.

Step 2: Build a trend report - in ManageEngine SDP use Analytics or a manual report:

  1. Export the last 3 months of tickets
  2. Group by category / IT component
  3. Count volume per category, average MTTR
  4. Look for categories with high volume + high MTTR
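For the manual route, the grouping and the Step 1 recurrence rule can be sketched in a few lines. This is an illustration under assumptions: the ticket rows, the column layout and the `trend_report` function are hypothetical, and a real SDP export will have different fields.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical exported tickets: (month, category, symptom, minutes to resolve)
tickets = [
    ("2026-04", "Active Directory", "login error", 90),
    ("2026-04", "Active Directory", "login error", 120),
    ("2026-04", "Active Directory", "login error", 150),
    ("2026-04", "Printing", "paper jam", 15),
    ("2026-04", "Printing", "paper jam", 20),
    ("2026-04", "Network", "connection timeout", 40),
]

def trend_report(rows, min_count=3):
    """Group tickets and flag groups with >= min_count per month and above-average MTTR."""
    overall = mean([row[3] for row in rows])
    groups = defaultdict(list)
    for month, category, symptom, minutes in rows:
        groups[(month, category, symptom)].append(minutes)
    report = []
    for (month, category, symptom), values in groups.items():
        average = mean(values)
        report.append({
            "month": month,
            "category": category,
            "symptom": symptom,
            "volume": len(values),
            "avg_mttr": average,
            "recurring": len(values) >= min_count and average > overall,
        })
    # Highest volume first, then highest MTTR.
    return sorted(report, key=lambda r: (r["volume"], r["avg_mttr"]), reverse=True)

report = trend_report(tickets)
for row in report:
    print(row["category"], row["volume"], row["avg_mttr"], row["recurring"])
```

With this sample data only the Active Directory login errors meet both conditions (3 tickets in the month, average MTTR of 120 minutes against an overall average of 72.5).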

Step 3: Prioritize the top 5 problems - those that will deliver the highest ROI if solved.

A typical output of such an analysis: