A share of IT incidents are repeats of the same underlying problem. A technician hits the same database error, the same misconfigured router, the same gap in the onboarding procedure - sometimes several times a week. It is not the technician's fault: it is the absence of problem management. Problem management in ITSM is the process that finds the root cause, eliminates it and reduces the risk of the incident recurring. In this article I break it all down: the difference between an incident and a problem, the 4-phase process, three proven RCA methods, how it looks in ManageEngine ServiceDesk Plus, and how to calculate ROI for your own company.
Incident vs Problem - why they are not the same, and why everyone confuses them
Let us start with the definitions, because everything turns on this.
An incident (Incident Management in ITIL) is an unplanned interruption to an IT service or degradation of its quality. Incident management is reactive: someone reports the network is down, system login is broken, the printer needs paper. The goal of incident management is to restore service to normal as quickly as possible, regardless of what caused it. MTTR (Mean Time To Recovery) for an incident is minutes, at most hours.
A problem (Problem Management in ITIL) is the root cause of one or more incidents. Problem management is proactive and means deep investigation: "Why does the network fail every 3 weeks?" or "Why does login break every time we roll out M365?" The goal of problem management is to eliminate the cause permanently, so the incident does not recur. A problem can be "in resolution" for months.
Consequences of neglecting problem management:
- Technicians waste time solving the same problem repeatedly
- Recurring incidents take longer than they would if the cause were known and documented
- CSAT drops - users see the same problem returning
- Helpdesk burns disproportionate time on the same matters
The 4-phase ITIL problem management process - how it works in practice
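ITIL 4 describes problem management as a practice with three phases: problem identification, problem control and error control. In day-to-day work it is convenient to split this into four phases. Here is what that looks like in a real scenario.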
1. Detect & Analyze
You identify patterns of recurring incidents. Sources:
- Tickets in the same category/symptom every week
- Many incidents for the same IT component (router, server, application)
- Trend: rising average MTTR for that category
Tool: Incident dashboard in SDP, filter by category, sort by frequency.
2. Root Cause Analysis (RCA)
Deep investigation - why this is happening. Use one of three methods:
- 5 Whys (asking "why" five times)
- Fishbone Diagram (Ishikawa cause diagram)
- Timeline Analysis (chronology of events)
Output: an RCA document with an unambiguous root cause.
3. Fix & Verify
Plan the change (Change Management), implement the fix, test. The resolution should be:
- Approved by the Change Advisory Board (CAB)
- Tested in a test environment (NOT in production!)
- Documented in the problem record
Output: the cause of the recurring incidents should be eliminated.
4. Monitor & Close
For 2-4 weeks monitor whether the problem returns:
- Zero new incidents in this category?
- The metric (for example, server ping response time) within normal range?
- Users no longer hitting the error?
Output: close the problem record and store the knowledge in the Knowledge Base.
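The closing check from phase 4 can be sketched in a few lines of Python. This is only a minimal sketch under stated assumptions: the `Incident` class, the `can_close_problem` function and the 28-day default window are hypothetical names and values chosen for illustration, not part of ITIL or of any ServiceDesk Plus API.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class Incident:
    # Hypothetical minimal ticket shape: category plus opening date.
    category: str
    opened_on: date


def can_close_problem(incidents, category, fix_date, today, window_days=28):
    """Return True when the monitoring window has passed with no recurrence.

    window_days=28 is an assumption (the upper end of the 2-4 week range).
    """
    window_end = fix_date + timedelta(days=window_days)
    if today < window_end:
        return False  # still inside the monitoring window, keep watching
    recurrences = [
        incident for incident in incidents
        if incident.category == category
        and fix_date < incident.opened_on <= window_end
    ]
    return not recurrences
```

The sketch only covers the "zero new incidents in this category" check; the metric and user-feedback checks still need to be confirmed separately.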
Three Root Cause Analysis methods - which one for you
RCA is the heart of problem management. Here are the three most popular methods, from simplest to most advanced.
Method 1: 5 Whys (asking "why" five times)
When: simple problems, 1-2 person teams. Example: the server is down and no employee receives email.
How: Start from the symptom and ask "why?" five times:
- Why do employees not get email? -> Mail server does not respond.
- Why does the mail server not respond? -> The disk is 99% full.
- Why is the disk 99% full? -> Logs have not been rotated for 6 months.
- Why were logs not rotated? -> The logrotate script did not run.
- Why did it not run? -> No admin configured it - the setup was done "quickly" 6 months ago.
Root cause: missing logrotate configuration procedure and low priority for maintenance tasks.
Time needed: 30 minutes to 1 hour.
Method 2: Fishbone Diagram (Ishikawa)
When: more complex problems, multiple possible causes. Hardware + software + procedures interacting.
How: Draw a fishbone and categorize possible causes onto the "bones" - typically 5 categories: People, Process, Technology, Tools, Environment.
Example: Problem: networked printers shut off every hour.
- People: Admin did not check logs; no training in printer handling.
- Process: No printer restart procedure; printer monitoring missing.
- Technology: Printer driver is ancient (from 2018); Wi-Fi is weak in the printer room.
- Tools: No tool to monitor printer status on the network dashboard.
- Environment: Printer room is hot (38 °C), printer is overheating.
Root cause (often turns out to be a combination): old driver + overheating + no monitoring + wifi interference.
Time needed: 1-2 hours, requires a team.
Method 3: Timeline Analysis
When: very complex problems, many systems involved, hard to separate cause from effect.
How: Build a precise timeline of events - every log entry, every alert, every configuration change.
Example: Problem: SQL server stopped responding at 2:32 AM.
- 2:30 - backup process started (scheduled job)
- 2:31 - disk began heating up (I/O spike)
- 2:32 - database timeout
- 2:33 - watchdog restarts SQL server
- 2:34 - server comes back, but in recovery mode (rebuilding transactions)
- 2:35 - SQL is responsive, but backup did not finish
Root cause: the backup is scheduled during a peak load window (2:30 is right after the nightly ETL), which causes resource contention.
Time needed: 2-4 hours, requires access to logs, alerts, monitoring.
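Merging events from several sources into one ordered timeline is easy to sketch in Python. The sketch below reuses the events from the SQL server example. The source names, the tuple layout and the `build_timeline` and `events_before` helpers are assumptions made for illustration, and it assumes all events fall on the same day.

```python
from datetime import datetime

# Hypothetical event sources; each event is (time, source, message).
scheduler_log = [("02:30", "scheduler", "backup process started")]
monitoring_log = [
    ("02:31", "monitoring", "disk I/O spike"),
    ("02:33", "watchdog", "SQL server restarted"),
]
database_log = [
    ("02:32", "database", "database timeout"),
    ("02:34", "database", "recovery mode"),
    ("02:35", "database", "responsive, backup unfinished"),
]


def build_timeline(*sources):
    """Merge all sources and sort the events by time of day."""
    events = [event for source in sources for event in source]
    return sorted(events, key=lambda event: datetime.strptime(event[0], "%H:%M"))


def events_before(timeline, marker):
    """Return the events preceding the first message that contains marker."""
    for index, (_, _, message) in enumerate(timeline):
        if marker in message:
            return timeline[:index]
    return []


timeline = build_timeline(database_log, monitoring_log, scheduler_log)
for time, source, message in events_before(timeline, "timeout"):
    print(time, source, message)
```

Listing what happened before the first failure narrows the candidate causes, but it does not prove causation. The backup job and the I/O spike still have to be confirmed as the cause by the analyst.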
Trend analysis - how to find recurring problems in a sea of incidents
Problem management starts with the question: "Which incidents recur?" Here is how to find out.
Step 1: Define "recurrence" - the same symptom (for example, login error, connection timeout) in the same IT category (for example, Active Directory) at least 3 times in a month, with MTTR higher than average.
Step 2: Build a trend report - in ManageEngine SDP use Analytics or a manual report:
- Export the last 3 months of tickets
- Group by category / IT component
- Count volume per category, average MTTR
- Look for categories with high volume + high MTTR
Step 3: Prioritize the top 5 problems - those that will deliver the highest ROI if solved.
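The three steps above can be sketched in Python on an exported ticket list. This is a minimal sketch, not an SDP report definition. The `category` and `mttr_minutes` keys are hypothetical field names. Recurrence is simplified to an average of at least 3 tickets per month over the export window with MTTR above the overall average. The ranking by volume times average MTTR is an assumed proxy for ROI.

```python
from collections import defaultdict
from statistics import mean


def trend_report(tickets, months=3, min_per_month=3):
    """Group tickets by category and flag recurring, slow-to-fix categories."""
    by_category = defaultdict(list)
    for ticket in tickets:
        by_category[ticket["category"]].append(ticket["mttr_minutes"])

    overall_mttr = mean(ticket["mttr_minutes"] for ticket in tickets)
    rows = []
    for category, mttrs in by_category.items():
        per_month = len(mttrs) / months
        avg_mttr = mean(mttrs)
        rows.append({
            "category": category,
            "tickets_per_month": per_month,
            "avg_mttr": avg_mttr,
            # Step 1: frequent enough AND slower than the overall average.
            "recurring": per_month >= min_per_month and avg_mttr > overall_mttr,
        })
    # Step 3: rank by handling minutes per month (assumed proxy for ROI).
    return sorted(
        rows,
        key=lambda row: row["tickets_per_month"] * row["avg_mttr"],
        reverse=True,
    )
```

Calling `trend_report(tickets)[:5]` gives the top 5 candidates. The thresholds are parameters so you can adjust them to your own definition of recurrence.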
A typical output of such an analysis:
- Password reset for Active Directory -> 23 tickets/month, MTTR 12 minutes (the procedure is manual and can be automated)
- VPN timeout for remote workers -> 18 tickets/month, MTTR 35 minutes
- Xerox printer offline on the 4th floor -> 12 tickets/month, MTTR 25 minutes
Three main problems to solve through problem management.
Problem management in ManageEngine ServiceDesk Plus - how to configure it
ManageEngine ServiceDesk Plus includes a full Problem Management module (available from the Professional edition). Here is a practical setup.
1. Go to Admin -> Problem Management -> Process settings
- Define problem states: New -> Assigned -> RCA In Progress -> RCA Complete -> Fix Scheduled -> Resolved -> Verified -> Closed
- Set SLA for problems - for example, RCA within 5 days, Fix within 30 days
- Define roles: Problem Manager, RCA Owner, Change Owner
2. Link incidents to problems
When you have a recurring incident, create a problem record:
- Click "Create Problem" from the incident ticket
- The problem record has its own workflow, change history, links to all incidents
- Each new incident in that category -> the system suggests an existing problem record
3. Create an RCA Report template
Admin -> Problem Management -> Templates -> Problem Template
The template should include sections:
- Issue Description: what happened
- Impact: how many users, how long
- Investigation: what we looked into
- Root Cause: the root cause (and RCA method)
- Solution: planned change (link to Change Request)
- Prevention: how to keep it from coming back
4. Configure automatic suggestions
Settings -> Incident Management -> Advanced -> Enable Problem Prediction/Suggestion
The system will automatically suggest a problem record when a new incident matches previous ones (category, component, error).
5. Problem analytics
Reports -> Problem Analytics
- Top 10 problems by frequency
- Average RCA time
- Problems without resolution (stuck process)
- Repeat incident rate per problem (does the number of recurrences drop after resolution?)
Best practices - how to do problem management that actually delivers
1. Dedicate 10-15% of one person's time (Problem Manager) to problem management
Problem management does not work if "everyone does a bit". You need someone who analyzes trends weekly, plans RCA and coordinates changes. This person should have access to all systems and know how to work with technicians.
2. Run RCA on the TOP 3 problems each month, not on all
80/20 rule - roughly 20% of problems cause 80% of incidents. Instead of running RCA on everything, focus on the problems that bring the highest ROI. RCA on password resets (which can be automated) has higher ROI than RCA on a very rare error.
3. Integrate problem management with change management
Each problem creates a Change Request. The Change Advisory Board reviews the fix. After the change is implemented, problem verification runs for 2-4 weeks. Without this link, problem management becomes "academic".
4. Store every RCA in the Knowledge Base
RCA is knowledge. Next time a technician sees this problem, they should find a KB article. And if there are already 3 articles for the same problem, they probably need to be merged.
5. Measure success: % of recurrences drops after solving the problem
Good problem management means the problem actually goes away. The key KPI is the repeat incident rate - the share of recurrences for a given problem. After successful RCA and fix deployment it should clearly decline. Measure it before and after to evaluate the effect.
Problem management ROI - how to calculate it for your company
There is no universal ROI number for problem management - it depends on the scale of costs that recurring incidents generate in a given organization. Below I show the calculation method on a model example. All numbers are assumptions illustrating the way to calculate - plug in your own data.
Note: The example below is not a documented implementation nor a promise of a result. It only shows how to build your own calculation. Starting point: measure the actual share of recurring incidents in your ticketing system.
Step 1 - describe the baseline. Let us assume a model company:
- 10 helpdesk technicians
- Assume 250 incidents/month
- Assume average MTTR of 45 minutes and labor cost of 120 PLN/h
- Model cost of 1 incident: 45 min / 60 x 120 PLN ~ 90 PLN
- Model annual cost of handling incidents: 250 x 12 x 90 PLN = 270,000 PLN
Step 2 - establish the recurrence share. This number must be measured in your own system (report by category and component). In our example, assume 35% of incidents are recurrences:
- Recurrences: 250 x 35% ~ 88 tickets/month
- Model cost of recurrences: 88 x 12 x 90 PLN ~ 95,000 PLN/year
Step 3 - estimate the effect. Assume that RCA on a few most frequent problems lowers the recurrence rate from 35% to 15%. Then the model savings:
- Recurrence reduction: (35% - 15%) x 250 x 12 x 90 PLN ~ 54,000 PLN/year
- Add shorter handling time for known problems and lower technician turnover - effects harder to value but real
Step 4 - compare with costs. On the cost side, include the time of the person acting as Problem Manager, any ITSM licence, and team RCA training. ROI is:
ROI = (Annual savings - Annual costs) / Annual costs x 100%
In year one, one-off costs apply (implementation, training), so ROI may be low or near zero. In later years, when mostly running costs remain, ROI grows. Most important: calculate it on real data from your company, not on the numbers in this example.
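The model calculation can be reproduced with a short Python function, which makes it easy to substitute your own data. This is a minimal sketch: the function and parameter names are hypothetical, and the 30,000 PLN annual cost in the example call is a pure placeholder, because the article gives no cost figure.

```python
def roi_model(incidents_per_month, mttr_minutes, hourly_cost,
              recurrence_before_pct, recurrence_after_pct, annual_costs):
    """Reproduce the model ROI calculation; every input is an assumption."""
    cost_per_incident = mttr_minutes / 60 * hourly_cost
    annual_incident_cost = incidents_per_month * 12 * cost_per_incident
    annual_recurrence_cost = annual_incident_cost * recurrence_before_pct / 100
    annual_savings = (
        annual_incident_cost * (recurrence_before_pct - recurrence_after_pct) / 100
    )
    roi_pct = (annual_savings - annual_costs) / annual_costs * 100
    return {
        "cost_per_incident": cost_per_incident,
        "annual_incident_cost": annual_incident_cost,
        "annual_recurrence_cost": annual_recurrence_cost,
        "annual_savings": annual_savings,
        "roi_pct": roi_pct,
    }


# Model company from the example; annual_costs is a hypothetical placeholder.
result = roi_model(
    incidents_per_month=250,
    mttr_minutes=45,
    hourly_cost=120,
    recurrence_before_pct=35,
    recurrence_after_pct=15,
    annual_costs=30_000,
)
print(result)
```

With the model inputs, the savings come to 54,000 PLN per year, matching the example. The recurrence cost comes out at 94,500 PLN rather than about 95,000 PLN, because the sketch does not round 87.5 recurring tickets up to 88.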
FAQ - problem management
What is the difference between an incident and a problem in ITIL?
An incident is an unplanned interruption to an IT service - reactive, aimed at restoring service as fast as possible (MTTR < 4h). A problem is the root cause that drives one or many incidents - proactive, aimed at eliminating the cause for good. Incident vs Problem: short-term fix (incident) vs long-term resolution (problem).
How many recurring incidents does problem management resolve?
The share of incidents that are recurrences of the same problem varies by organization - it must be measured in your own ticketing system, grouping incidents by category and component. Effective problem management and RCA cut the number of recurrences and shorten the handling time of known problems, but the scale of the effect depends on the starting point. The key is to compare the recurrence rate before and after implementation.
What are the RCA (Root Cause Analysis) methods?
Main methods: 5 Whys (asking "why" five times), Fishbone Diagram (Ishikawa diagram - causes categorized), Failure Mode and Effects Analysis (FMEA - scenario analysis), Timeline Analysis (chronology of events), and Trend Analysis (patterns in historical tickets). The choice depends on problem complexity - simple problems: 5 Whys; complex systems: Fishbone + Timeline.
How do you configure problem management in ManageEngine ServiceDesk Plus?
ManageEngine SDP includes a Problem Management module (available from the Professional edition). Configuration: 1) Define problem states (New -> Assigned -> RCA -> Resolved -> Closed), 2) Link incidents to problems, 3) Set RCA report templates, 4) Configure automatic notifications about problem recurrences based on category/component, 5) Analyze trends in the Problem Management Analytics dashboard.
What is the ROI of problem management?
Problem management ROI depends on the scale of costs that recurring incidents generate in a given company - there is no single universal number. To calculate it: estimate the annual cost of handling recurrences (number of recurring tickets x cost per ticket), estimate the reduction after RCA implementation, then compare with costs (Problem Manager time, licence, training). Formula: ROI = (savings - costs) / costs x 100%. Substitute your own data.
Related articles
- Incident management in manufacturing - severities and priorities
- Escalation management in ITSM - how to escalate smartly
- Knowledge Base in helpdesk - how to reduce recurring tickets
- AI in ITSM 2026 - how artificial intelligence is changing IT helpdesk
Want to roll out problem management in your company?
Rotech Group will audit the process, identify the most frequent problems, configure ManageEngine SDP and train your team in RCA. Together we set measurable recurrence reduction targets based on your data.