Problem management ITSM - how to eliminate recurring incidents



ITSM

Jakub Roszkiewicz · May 2026 · 12 min read

A share of IT incidents are repeats of the same underlying problem. A technician hits the same database error, the same misconfigured router, the same gap in the onboarding procedure - sometimes several times a week. That is not the technician's fault: it is what happens without problem management. Problem management in ITSM is the process that finds the root cause, eliminates it and reduces the risk of the incident recurring. In this article I break it all down: the difference between an incident and a problem, the 4-phase process, three proven RCA methods, how it looks in ManageEngine ServiceDesk Plus, and how to calculate ROI for your own company.

4 phases of the problem management process (based on ITIL)
3 RCA methods: 5 Whys, Fishbone, Timeline
Goal: eliminate the root cause, not just the symptom

Incident vs Problem - why they are not the same, and why everyone confuses them

Let us start with the definitions, because everything turns on this.

An incident (Incident Management in ITIL) is an unplanned interruption to an IT service or a degradation of its quality. Incident management is reactive: someone reports that the network is down, system login is broken, or the printer has gone offline. The goal of incident management is to restore normal service as quickly as possible, regardless of what caused the outage. MTTR (Mean Time To Recovery) for an incident is typically minutes, at most hours.

A problem (Problem Management in ITIL) is the root cause of one or more incidents. Problem management is proactive, a deep investigation into questions such as "Why does the network fail every 3 weeks?" or "Why does login break every time we roll out M365?" The goal is to eliminate the cause permanently so that the incident stops recurring. A problem can stay "in resolution" for months.

Analogy: incident management is putting out the fire; problem management is removing the cause of the fires (faulty wiring). An ITSM technician does both - but too many only fight fires and then wonder why they keep returning.

Consequences of neglecting problem management:

Technicians waste time solving the same problem repeatedly
Recurring incidents take longer than they would if the cause were known and documented
CSAT drops - users see the same problem returning
Helpdesk burns disproportionate time on the same matters

The 4-phase ITIL problem management process - how it works in practice

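ITIL 4 itself describes problem management as a practice with three phases (problem identification, problem control and error control). For day-to-day work I split it into four phases. Here is what they look like in a real scenario.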

1. Detect & Analyze

You identify patterns of recurring incidents. Sources:

Tickets in the same category/symptom every week
Many incidents for the same IT component (router, server, application)
Trend: rising average MTTR for that category

Tool: Incident dashboard in SDP, filter by category, sort by frequency.

2. Root Cause Analysis (RCA)

Deep investigation - why this is happening. Use one of three methods:

5 Whys (asking "why" five times)
Fishbone Diagram (Ishikawa cause diagram)
Timeline Analysis (chronology of events)

Output: an RCA document with an unambiguous root cause.

3. Fix & Verify

Plan the change (Change Management), test the fix, then implement it. The resolution should be:

Approved by the Change Advisory Board (CAB)
Tested in a test environment (NOT in production!)
Documented in the problem record

Output: the root cause is removed, so the incident should stop recurring.

4. Monitor & Close

For 2-4 weeks monitor whether the problem returns:

Zero new incidents in this category?
The relevant metric (for example, server ping response time) within its normal range?
Users no longer hitting the error?

Output: close the problem record and store the knowledge in the Knowledge Base.
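The monitoring window above can be checked with a few lines of Python. This is only a sketch under assumptions: the function name, the sample dates and the 4-week default are illustrative, and in practice the incident dates would come from an export of the affected category.

```python
from datetime import date, timedelta

def recurrences_in_window(incident_dates, fix_date, weeks=4):
    """Return incidents logged after the fix and within the monitoring window."""
    window_end = fix_date + timedelta(weeks=weeks)
    return [d for d in incident_dates if fix_date < d <= window_end]

# Hypothetical data: creation dates of incidents in the affected category
fix_deployed = date(2026, 3, 2)
category_incidents = [date(2026, 2, 20), date(2026, 3, 10), date(2026, 4, 15)]

hits = recurrences_in_window(category_incidents, fix_deployed)
ready_to_close = not hits  # close the problem record only if nothing came back

print("Recurrences in window:", hits)
print("Ready to close:", ready_to_close)
```

In this sample one incident falls inside the window, so the problem record would stay open.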

Three Root Cause Analysis methods - which one for you

RCA is the heart of problem management. Here are the three most popular methods, from simplest to most advanced.

Method 1: 5 Whys (asking "why" five times)

When: simple problems, 1-2 person teams. Example: the server is down and no employee receives email.

How: Start from the symptom and ask "why?" five times:

Why do employees not get email? -> Mail server does not respond.
Why does the mail server not respond? -> The disk is 99% full.
Why is the disk 99% full? -> Logs have not been rotated for 6 months.
Why were logs not rotated? -> The logrotate script did not run.
Why did it not run? -> No admin configured it - the server was set up "quickly" 6 months ago.

Root cause: missing logrotate configuration procedure and low priority for maintenance tasks.

Time needed: 30 minutes to 1 hour.

Method 2: Fishbone Diagram (Ishikawa)

When: more complex problems, multiple possible causes. Hardware + software + procedures interacting.

How: Draw a fishbone and categorize possible causes onto the "bones" - typically 5 categories: People, Process, Technology, Tools, Environment.

Example: Problem: networked printers shut off every hour.

People: Admin did not check logs; no training in printer handling.
Process: No printer restart procedure; printer monitoring missing.
Technology: Printer driver is ancient (from 2018); wifi is weak in the printer room.
Tools: No tool to monitor printer status on the network dashboard.
Environment: Printer room is hot (38 C), printer is overheating.

Root cause (often turns out to be a combination): old driver + overheating + no monitoring + weak wifi.

Time needed: 1-2 hours, requires a team.

Method 3: Timeline Analysis

When: very complex problems, many systems involved, hard to separate cause from effect.

How: Build a precise timeline of events - every log entry, every alert, every configuration change.

Example: Problem: SQL server stopped responding at 2:32 AM.

2:30 - backup process started (scheduled job)
2:31 - disk began heating up (I/O spike)
2:32 - database timeout
2:33 - watchdog restarts SQL server
2:34 - server comes back, but in recovery mode (rebuilding transactions)
2:35 - SQL is responsive, but backup did not finish

Root cause: the backup is scheduled in a high-load window (2:30 AM is right after the nightly ETL), which causes resource contention.

Time needed: 2-4 hours, requires access to logs, alerts, monitoring.
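Building the timeline is mostly a matter of merging events from several sources and sorting them by time. The sketch below does that for the SQL example. The feed names and tuple layout are assumptions made for illustration, and it assumes all events fall within the same night, so only the clock time is compared.

```python
from datetime import datetime

# Hypothetical feeds: (time, source, message), taken from the example above
scheduler = [("02:30", "scheduler", "backup process started")]
monitoring = [
    ("02:31", "monitoring", "disk I/O spike"),
    ("02:33", "watchdog", "SQL server restarted"),
]
database = [
    ("02:35", "database", "SQL responsive, backup did not finish"),
    ("02:32", "database", "database timeout"),
    ("02:34", "database", "server back in recovery mode"),
]

def build_timeline(*feeds):
    """Merge all feeds into one list ordered by clock time."""
    events = [event for feed in feeds for event in feed]
    return sorted(events, key=lambda e: datetime.strptime(e[0], "%H:%M"))

timeline = build_timeline(database, monitoring, scheduler)
for time, source, message in timeline:
    print(f"{time}  [{source}]  {message}")
```

Reading the merged list from the top shows which event came first. Here it is the scheduled backup, which is where the investigation should start.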

Trend analysis - how to find recurring problems in a sea of incidents

Problem management starts with the question: "Which incidents recur?" Here is how to find out.

Step 1: Define "recurrence" - the same symptom (for example, login error, connection timeout) in the same IT category (for example, Active Directory) at least 3 times in a month, with MTTR higher than average.

Step 2: Build a trend report - in ManageEngine SDP use Analytics or a manual report:

Export the last 3 months of tickets
Group by category / IT component
Count volume per category, average MTTR
Look for categories with high volume + high MTTR
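If you prefer to run the report outside the tool, the grouping can be sketched in plain Python. Everything here is an assumption for illustration: the row layout, the sample tickets and the function name are hypothetical, and the rows are assumed to cover a single month so the "at least 3 times" rule from Step 1 applies directly.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical export rows: (category, symptom, resolution time in minutes)
tickets = [
    ("Active Directory", "login error", 50),
    ("Active Directory", "login error", 60),
    ("Active Directory", "login error", 70),
    ("VPN", "connection timeout", 30),
    ("VPN", "connection timeout", 40),
    ("Printing", "printer offline", 20),
]

def trend_report(rows, min_count=3):
    """Count tickets and average MTTR per (category, symptom) and flag recurrences."""
    overall_mttr = mean(minutes for _, _, minutes in rows)
    groups = defaultdict(list)
    for category, symptom, minutes in rows:
        groups[(category, symptom)].append(minutes)
    report = []
    for key, durations in groups.items():
        group_mttr = mean(durations)
        report.append({
            "group": key,
            "count": len(durations),
            "mttr": group_mttr,
            # Step 1 definition: at least min_count tickets and above-average MTTR
            "recurring": len(durations) >= min_count and group_mttr > overall_mttr,
        })
    return sorted(report, key=lambda r: r["count"], reverse=True)

report = trend_report(tickets)
for row in report:
    print(row)
```

With this sample only the Active Directory login error is flagged: it has three tickets and an average MTTR of 60 minutes against an overall average of 45.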

Step 3: Prioritize the top 5 problems - those that will deliver the highest ROI if solved.

A typical output of such an analysis:

Password reset for Active Directory -> 23 tickets/month, MTTR 12 minutes (the procedure is manual and can be automated)
VPN timeout for remote workers -> 18 tickets/month, MTTR 35 minutes
Xerox printer offline on the 4th floor -> 12 tickets/month, MTTR 25 minutes

Three main problems to solve through problem management.
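One simple way to rank these candidates is by technician time consumed per month (tickets x MTTR). The sketch below uses the three sample rows above. Handling time is only a proxy for ROI and is my assumption here, not a rule from ITIL or from the tool.

```python
# (problem, tickets per month, MTTR in minutes) from the sample output above
problems = [
    ("Password reset for Active Directory", 23, 12),
    ("VPN timeout for remote workers", 18, 35),
    ("Xerox printer offline on the 4th floor", 12, 25),
]

# Rank by total handling minutes per month, highest first
ranked = sorted(
    ((name, tickets * mttr) for name, tickets, mttr in problems),
    key=lambda item: item[1],
    reverse=True,
)

for name, minutes in ranked:
    print(f"{minutes:4d} min/month  {name}")
```

By this measure the VPN timeout costs the most time (630 minutes a month), ahead of the printer (300) and password resets (276). Password resets may still be worth doing first if automation is cheap, which is why the ranking is a starting point rather than a verdict.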

Problem management in ManageEngine ServiceDesk Plus - how to configure it

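ManageEngine ServiceDesk Plus includes a full Problem Management module. Check which edition you are licensed for: ManageEngine's edition comparison lists problem management under the Enterprise edition, not Professional. Here is a practical setup.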

1. Go to Admin -> Problem Management -> Process settings

Define problem states: New -> Assigned -> RCA In Progress -> RCA Complete -> Fix Scheduled -> Resolved -> Verified -> Closed
Set SLA for problems - for example, RCA within 5 days, Fix within 30 days
Define roles: Problem Manager, RCA Owner, Change Owner
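Before clicking through the configuration, it can help to write the workflow down as a transition table. The sketch below models the states listed above in Python. It is not how ServiceDesk Plus stores its workflow, and the reopen path from Resolved back to RCA In Progress is my own assumption for the case where the problem returns during verification.

```python
STATES = [
    "New", "Assigned", "RCA In Progress", "RCA Complete",
    "Fix Scheduled", "Resolved", "Verified", "Closed",
]

# Each state may move to the next one in the list
ALLOWED = {current: {nxt} for current, nxt in zip(STATES, STATES[1:])}

# Assumption: a resolved problem can be reopened if incidents come back
ALLOWED["Resolved"].add("RCA In Progress")

def can_move(current, target):
    """Return True if the workflow permits moving from current to target."""
    return target in ALLOWED.get(current, set())

print(can_move("New", "Assigned"))              # True
print(can_move("New", "Closed"))                # False
print(can_move("Resolved", "RCA In Progress"))  # True
```

Writing the table first makes it easier to spot missing paths, such as what happens when verification fails.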

2. Link incidents to problems

When you have a recurring incident, create a problem record:

Click "Create Problem" from the incident ticket
The problem record has its own workflow, change history, links to all incidents
Each new incident in that category -> the system suggests an existing problem record

3. Create an RCA Report template

Admin -> Problem Management -> Templates -> Problem Template

The template should include sections:

Issue Description: what happened
Impact: how many users, how long
Investigation: what we looked into
Root Cause: the root cause (and RCA method)
Solution: planned change (link to Change Request)
Prevention: how to keep it from coming back
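The same six sections can be mirrored in a small data structure, for example to check that an RCA is complete before the problem record moves on. This is a sketch only: the class, field names and helper are hypothetical and are not part of any ServiceDesk Plus API.

```python
from dataclasses import dataclass, fields

@dataclass
class RcaReport:
    issue_description: str = ""
    impact: str = ""
    investigation: str = ""
    root_cause: str = ""
    solution: str = ""
    prevention: str = ""

def missing_sections(report):
    """List the sections that are still empty."""
    return [f.name for f in fields(report) if not getattr(report, f.name).strip()]

draft = RcaReport(
    issue_description="Mail server stopped responding",
    root_cause="Log rotation was never configured; disk filled up",
)
print(missing_sections(draft))
```

For the draft above the check reports impact, investigation, solution and prevention as still missing.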

4. Configure automatic suggestions

Settings -> Incident Management -> Advanced -> Enable Problem Prediction/Suggestion

The system will automatically suggest a problem record when a new incident matches previous ones (category, component, error).
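To make the matching idea concrete, here is a minimal sketch of how such a suggestion could work. It does not describe the actual logic inside ServiceDesk Plus: the field names, the record IDs and the two-field threshold are all assumptions chosen for illustration.

```python
MATCH_FIELDS = ("category", "component", "error")

def suggest_problem(incident, open_problems, min_matches=2):
    """Return the open problem sharing the most fields with the incident, or None."""
    best, best_score = None, 0
    for problem in open_problems:
        score = sum(
            1 for field in MATCH_FIELDS
            if incident.get(field) and incident.get(field) == problem.get(field)
        )
        if score > best_score:
            best, best_score = problem, score
    return best if best_score >= min_matches else None

# Hypothetical open problem records
open_problems = [
    {"id": "PRB-1", "category": "Network", "component": "VPN gateway", "error": "connection timeout"},
    {"id": "PRB-2", "category": "Printing", "component": "4th floor printer", "error": "offline"},
]
new_incident = {"category": "Network", "component": "VPN gateway", "error": "connection timeout"}

suggestion = suggest_problem(new_incident, open_problems)
print(suggestion["id"] if suggestion else "no match")
```

Requiring at least two matching fields keeps a shared category alone from linking unrelated incidents to the same problem.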

5. Problem analytics

Reports -> Problem Analytics

Top 10 problems by frequency
Average RCA time
Problems without resolution (stuck process)
Repeat incident rate per problem (does the number of recurrences drop after resolution?)

Best practices - how to do problem management that actually delivers

1. Dedicate 10-15% of one person's time (Problem Manager) to problem management
Problem management does not work if "everyone does a bit". You need someone who analyzes trends every week, plans RCA and coordinates changes. This person should have access to all systems and know how to work with technicians.
2. Run RCA on the TOP 3 problems each month, not on all
80/20 rule - 20% of problems cause 80% of incidents. Instead of RCA on everything, focus on those that bring the highest ROI. RCA on password resets (which can be automated) has higher ROI than RCA on a very rare error.
3. Integrate problem management with change management
Each problem fix goes through a Change Request. The Change Advisory Board reviews the fix. After the change is implemented, problem verification runs for 2-4 weeks. Without this link, problem management becomes "academic".
4. Store every RCA in the Knowledge Base
RCA is knowledge. The next time a technician sees this problem, they should find a KB article. If there are already 3 articles for the same problem, merge them.
5. Measure success: % of recurrences drops after solving the problem
Good problem management means the problem actually goes away. The key KPI is the repeat incident rate - the share of recurrences for a given problem. After successful RCA and fix deployment it should clearly decline. Measure it before and after to evaluate the effect.
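The before/after comparison for the repeat incident rate can be sketched as follows. The definition used here (incidents linked to the problem divided by all incidents in the period) and the sample counts are assumptions for illustration. Use whatever definition your reports already apply, as long as it is the same before and after.

```python
def repeat_incident_rate(linked_incidents, total_incidents):
    """Share of all incidents in a period that are linked to one problem."""
    if total_incidents == 0:
        return 0.0
    return linked_incidents / total_incidents

# Hypothetical counts for one problem, one month before and one month after the fix
before = repeat_incident_rate(20, 250)
after = repeat_incident_rate(5, 250)
improvement_points = (before - after) * 100

print(f"Before: {before:.1%}, after: {after:.1%}")
print(f"Improvement: {improvement_points:.1f} percentage points")
```

Comparing equal-length periods keeps the result from being distorted by seasonal swings in ticket volume.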

Problem management ROI - how to calculate it for your company

There is no universal ROI number for problem management - it depends on the scale of costs that recurring incidents generate in a given organization. Below I show the calculation method on a model example. All numbers are assumptions illustrating the way to calculate - plug in your own data.

Note: The example below is not a documented implementation nor a promise of a result. It only shows how to build your own calculation. Starting point: measure the actual share of recurring incidents in your ticketing system.

Step 1 - describe the baseline. Let us assume a model company:

10 helpdesk technicians
Assume 250 incidents/month
Assume average MTTR of 45 minutes and labor cost of 120 PLN/h
Model cost of 1 incident: 45 min / 60 x 120 PLN ~ 90 PLN
Model annual cost of handling incidents: 250 x 12 x 90 PLN = 270,000 PLN

Step 2 - establish the recurrence share. This number must be measured in your own system (report by category and component). In our example, assume 35% of incidents are recurrences:

Recurrences: 250 x 35% ~ 88 tickets/month
Model cost of recurrences: 88 x 12 x 90 PLN ~ 95,000 PLN/year

Step 3 - estimate the effect. Assume that RCA on a few most frequent problems lowers the recurrence rate from 35% to 15%. Then the model savings:

Recurrence reduction: (35% - 15%) x 250 x 12 x 90 PLN ~ 54,000 PLN/year
Add shorter handling time for known problems and lower technician turnover - effects harder to value but real

Step 4 - compare with costs. On the cost side, include the time of the person acting as Problem Manager, any ITSM licence, and team RCA training. ROI is:

ROI = (Annual savings - Annual costs) / Annual costs x 100%

In year one, one-off costs apply (implementation, training), so ROI may be low or near zero. In later years, when mostly run costs remain, ROI grows. Most important: calculate it on real data from your company, not on the numbers in this example.
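The four steps can be put into a short script so you can swap in your own numbers. The inputs below repeat the model assumptions from this section. The annual cost figure of 36,000 PLN is purely hypothetical and added only so the formula produces a number, since the model does not specify costs.

```python
def annual_incident_cost(incidents_per_month, mttr_minutes, hourly_rate):
    """Yearly cost of handling incidents at a given MTTR and labor rate."""
    cost_per_incident = mttr_minutes / 60 * hourly_rate
    return incidents_per_month * 12 * cost_per_incident

def roi_percent(annual_savings, annual_costs):
    """ROI = (savings - costs) / costs x 100%."""
    return (annual_savings - annual_costs) / annual_costs * 100

# Model inputs from Steps 1-3: 250 incidents/month, 45 min MTTR, 120 PLN/h
total_cost = annual_incident_cost(250, 45, 120)

# Recurrence share assumed to drop from 35% to 15%
savings = (0.35 - 0.15) * total_cost

# HYPOTHETICAL: Problem Manager time, licence and training per year
assumed_costs = 36_000

roi = roi_percent(savings, assumed_costs)
print(f"Annual incident cost: {total_cost:,.0f} PLN")
print(f"Model savings:        {savings:,.0f} PLN")
print(f"ROI:                  {roi:.0f}%")
```

With the hypothetical cost figure the script reports an ROI of about 50%. Change any input and the result moves accordingly, which is the point: the number is only as good as the data you feed in.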

FAQ - problem management

What is the difference between an incident and a problem in ITIL?

An incident is an unplanned interruption to an IT service - reactive, aimed at restoring service as fast as possible (MTTR < 4h). A problem is the root cause that drives one or many incidents - proactive, aimed at eliminating the cause for good. Incident vs Problem: short-term fix (incident) vs long-term resolution (problem).

How many recurring incidents does problem management resolve?

The share of incidents that are recurrences of the same problem varies by organization - it must be measured in your own ticketing system, grouping incidents by category and component. Effective problem management and RCA cut the number of recurrences and shorten the handling time of known problems, but the scale of the effect depends on the starting point. The key is to compare the recurrence rate before and after implementation.

What are the RCA (Root Cause Analysis) methods?

Main methods: 5 Whys (asking "why" five times), Fishbone Diagram (Ishikawa diagram - causes categorized), Failure Mode and Effects Analysis (FMEA - scenario analysis), Timeline Analysis (chronology of events), and Trend Analysis (patterns in historical tickets). The choice depends on problem complexity - simple problems: 5 Whys; complex systems: Fishbone + Timeline.

How do you configure problem management in ManageEngine ServiceDesk Plus?

ManageEngine SDP includes a Problem Management module (ManageEngine's edition comparison lists it under the Enterprise edition, so check your licence). Configuration: 1) Define problem states (New -> Assigned -> RCA -> Resolved -> Closed), 2) Link incidents to problems, 3) Set RCA report templates, 4) Configure automatic suggestions of existing problem records based on category/component, 5) Analyze trends in the Problem Management Analytics dashboard.

What is the ROI of problem management?

Problem management ROI depends on the scale of costs that recurring incidents generate in a given company - there is no single universal number. To calculate it: estimate the annual cost of handling recurrences (number of recurring tickets x cost per ticket), estimate the reduction after RCA implementation, then compare with costs (Problem Manager time, licence, training). Formula: ROI = (savings - costs) / costs x 100%. Substitute your own data.

Jakub Roszkiewicz

CTO · Rotech Group · problem management ITSM and ManageEngine expert


