A production line that stops for two hours is a critical incident with the shortest acceptable response time. One employee's mailbox that does not work is a lower-priority incident. In manufacturing, incident management is more than a help desk: it is a process with direct operational and financial impact, because every hour of line downtime is a real loss. In this article I break down severity vs priority, the classification matrix, SLAs for manufacturing, and how to set a target MTTR (mean time to resolve).
Severity vs Priority - what is the difference
Severity is the business impact. How bad is it? Can people still work? Is the network down for 500 people or for just one?
Priority is the urgency of the fix. How quickly does it have to be resolved?
Example: the CEO's email is not working (severity: LOW, because only 1 person is affected; priority: CRITICAL, because it is the CEO). The network is down all day on Friday (severity: CRITICAL, because 100+ people are affected; priority: CRITICAL).
Severity x Priority matrix
| Severity \ Priority | P1 (Immediate) | P2 (Urgent) | P3 (Standard) | P4 (Low) |
|---|---|---|---|---|
| Critical (entire production) | P1-CRIT (1h MTTR) | P1-URG (2h) | P2 (4h) | P3 (8h) |
| High (department/team) | P1-URG (2h) | P2 (4h) | P3 (8h) | P4 (24h) |
| Medium (1 user) | P2 (4h) | P3 (8h) | P4 (24h) | P4 (48h) |
| Low (1 person, no business impact) | P3 (8h) | P4 (24h) | P4 (24h) | P4 (48h) |
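
The matrix above can be read as a lookup table. Below is a minimal Python sketch, assuming the columns are the urgency level and each cell is the resulting class with its MTTR target in hours. The names `MATRIX`, `URGENCY_COLUMNS`, and `classify` are illustrative and not part of any tool.

```python
# Rows: severity (business impact). Columns: urgency P1..P4.
# Each cell is (resulting class, MTTR target in hours), copied from the matrix.
MATRIX = {
    "critical": [("P1-CRIT", 1), ("P1-URG", 2), ("P2", 4), ("P3", 8)],
    "high":     [("P1-URG", 2), ("P2", 4), ("P3", 8), ("P4", 24)],
    "medium":   [("P2", 4), ("P3", 8), ("P4", 24), ("P4", 48)],
    "low":      [("P3", 8), ("P4", 24), ("P4", 24), ("P4", 48)],
}
URGENCY_COLUMNS = ("P1", "P2", "P3", "P4")


def classify(severity: str, urgency: str) -> tuple[str, int]:
    """Return (class, MTTR target in hours) for one matrix cell."""
    row = MATRIX[severity.lower()]                   # KeyError on unknown severity
    col = URGENCY_COLUMNS.index(urgency.upper())     # ValueError on unknown urgency
    return row[col]
```

For example, `classify("critical", "P1")` picks the top-left cell, and `classify("high", "P3")` picks the department-level incident with standard urgency.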
MTTR benchmark - how much time do you have?
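P1-Critical (production down): max 1 hour. In practice: IT on site within 15 minutes, about 20 minutes for diagnosis, and the fix within the remaining 25 minutes, so the three phases fit inside the hour. After resolution: RCA (root cause analysis) within 2 days.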
P2-Urgent (department down): max 4 hours. IT responds within 30 minutes, diagnosis takes 30 minutes, and the fix takes up to 2 hours. RCA within 1 week.
P3-Standard (1 person cannot work): max 8 hours. A temporary workaround (e.g. an application restart or a password reset) is acceptable if the permanent fix will be ready the next day.
P4-Low (something works but slowly, not critical): max 48 hours. This can wait until the next maintenance window.
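
These limits translate directly into a resolution deadline per ticket. Below is a minimal sketch, assuming the limits are counted in wall-clock hours from the moment the incident is opened (not business hours). `MTTR_LIMIT_HOURS`, `resolution_deadline`, and `is_breached` are hypothetical names used for illustration.

```python
from datetime import datetime, timedelta

# Maximum resolution time per priority, in hours, as listed in this section.
MTTR_LIMIT_HOURS = {"P1": 1, "P2": 4, "P3": 8, "P4": 48}


def resolution_deadline(opened_at: datetime, priority: str) -> datetime:
    """Latest moment at which the incident still meets its MTTR limit."""
    return opened_at + timedelta(hours=MTTR_LIMIT_HOURS[priority])


def is_breached(opened_at: datetime, priority: str, now: datetime) -> bool:
    """True once the limit has passed (assumes the incident is still open)."""
    return now > resolution_deadline(opened_at, priority)
```

A P1 opened at 10:30 would have a deadline of 11:30 on the same day. A plant that counts only shift hours would need a different clock, which this sketch does not model.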
RCA after a P1 incident - mandatory for manufacturing
After every P1, without exception, the team performs an RCA within 2 days. The RCA documents:
- Timeline (at 10:30 the network goes down, at 10:35 IT is called, at 11:00 the router is restarted, at 11:15 the service is restored)
- Root cause (a router firmware upgrade rolled out on Wednesday without testing introduced a bug)
- Resolution (rollback to the previous version, router restart)
- Long-term fix (e.g. testing every upgrade in a QA environment before production rollout)
- Prevention (procedure: every upgrade must be tested, approval by the change board)
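
The five RCA fields listed above can be kept as one structured record, which also makes the downtime easy to compute from the timeline. Below is a minimal sketch using the router example from the list. The class `RcaRecord`, its field names, and the calendar date are assumptions made for illustration, since the article gives only clock times.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class RcaRecord:
    timeline: list[tuple[datetime, str]]  # (timestamp, what happened)
    root_cause: str
    resolution: str
    long_term_fix: str
    prevention: str

    def minutes_to_restore(self) -> float:
        """Minutes between the earliest and the latest timeline entry."""
        times = [t for t, _ in self.timeline]
        return (max(times) - min(times)).total_seconds() / 60


DAY = (2024, 1, 12)  # placeholder date, not taken from the article

example = RcaRecord(
    timeline=[
        (datetime(*DAY, 10, 30), "network goes down"),
        (datetime(*DAY, 10, 35), "IT is called"),
        (datetime(*DAY, 11, 0), "router is restarted"),
        (datetime(*DAY, 11, 15), "service is restored"),
    ],
    root_cause="untested router firmware upgrade introduced a bug",
    resolution="rollback to the previous version, router restart",
    long_term_fix="test every upgrade in a QA environment before rollout",
    prevention="every upgrade must be tested and approved by the change board",
)
```

With this timeline, `example.minutes_to_restore()` covers the span from 10:30 to 11:15, which is inside the 1-hour P1 limit.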
Incident management in ManageEngine SDP
Setup:
- Admin -> Incident Management -> Priorities - define P1-P4 and SLAs
- Admin -> Impact/Urgency - define the severity matrix (Critical/High/Medium/Low)
- Configure escalation rules: if a P1 is still unresolved after 30 minutes, notify the IT manager and the VP Operations
- Configure notifications: every P1 sends an SMS, an email, and a Slack alert to all technicians
- Reports -> SLA compliance - track what % of P1s meet the 1h MTTR target
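
The notification and escalation rules above are configured in the tool's admin screens, not in code, but the logic is easy to state precisely. The following plain-Python sketch is not ManageEngine SDP syntax or API. It reads "after 30 min" as 30 minutes without resolution, and the names `ESCALATE_AFTER` and `p1_notification_targets` are hypothetical.

```python
from datetime import datetime, timedelta

ESCALATE_AFTER = timedelta(minutes=30)  # assumed reading of "after 30 min"


def p1_notification_targets(opened_at: datetime, now: datetime,
                            resolved: bool) -> list[str]:
    """Who should have been notified about a P1 by the time `now`."""
    targets = ["all technicians"]  # alerted as soon as the P1 is opened
    if not resolved and now - opened_at >= ESCALATE_AFTER:
        # Still open after 30 minutes: escalate to management.
        targets += ["IT manager", "VP Operations"]
    return targets
```

Writing the rule down this way helps when reviewing the configured escalation: each branch should correspond to exactly one rule in the tool.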
SLA compliance in manufacturing - what to track
KPI #1: % of P1 incidents resolved within the 1h MTTR target. Target: 90% or more. Below 80%, the process is not working.
KPI #2: MTTR trend per priority. Is MTTR rising or falling? A downward trend is good: the team is learning.
KPI #3: Repeat incident rate (% of recurrences). Every P1 should be followed by an RCA and a fix. If the same incident comes back, the RCA did not work.
KPI #4: Time to detect. Ideally a P1 is detected automatically by monitoring (the network goes down and an alert fires within 30 seconds). If a P1 first reaches IT through an employee's email, monitoring is missing or misconfigured.
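
The first three KPIs can be computed from a plain export of closed incidents. Below is a minimal sketch, assuming each incident carries its priority, minutes to detect, minutes to resolve, and a flag for recurrence. The `Incident` class and the function names are hypothetical, and "meeting the target" is taken to mean resolved in 60 minutes or less.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Optional


@dataclass
class Incident:
    priority: str              # "P1" .. "P4"
    minutes_to_detect: float   # outage start -> first alert or report
    minutes_to_resolve: float  # ticket opened -> service restored
    is_repeat: bool            # same root cause as an earlier incident


def p1_compliance_pct(incidents: list[Incident],
                      limit_minutes: float = 60) -> Optional[float]:
    """KPI #1: share of P1 incidents resolved within the limit, in percent."""
    p1 = [i for i in incidents if i.priority == "P1"]
    if not p1:
        return None  # no P1 incidents in the period, nothing to measure
    met = sum(1 for i in p1 if i.minutes_to_resolve <= limit_minutes)
    return 100 * met / len(p1)


def mttr_by_priority(incidents: list[Incident]) -> dict[str, float]:
    """KPI #2 input: mean minutes to resolve per priority for one period."""
    groups: dict[str, list[float]] = {}
    for i in incidents:
        groups.setdefault(i.priority, []).append(i.minutes_to_resolve)
    return {prio: mean(values) for prio, values in groups.items()}


def repeat_rate_pct(incidents: list[Incident]) -> Optional[float]:
    """KPI #3: share of incidents flagged as recurrences, in percent."""
    if not incidents:
        return None
    return 100 * sum(1 for i in incidents if i.is_repeat) / len(incidents)
```

Running `mttr_by_priority` once per month and comparing the results gives the trend for KPI #2. KPI #4 can be read from `minutes_to_detect` in the same way.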
Incident management for your manufacturing site?
Rotech Group will configure incident management in ManageEngine SDP, define SLAs for P1-P4 and train the team in RCA. We will help you set measurable compliance targets for your plant.