Root Cause Analysis (RCA) | Methods, Steps & Example

Root cause analysis (RCA) is a structured method for identifying the fundamental reason a problem occurred so that it can be permanently eliminated — not just patched. Rather than treating symptoms, RCA works backward from the observable failure to the deepest systemic cause that, if corrected, prevents recurrence. In manufacturing and process industries, that usually means tracing an equipment failure, quality defect, or safety event through physical causes, human factors, and latent system weaknesses until the true origin is exposed.

RCA is not a single tool — it is a disciplined investigative process that can draw on several analytical methods, from the simple 5 Whys to formal fault tree analysis. The output is always the same: a verified root cause, one or more corrective actions tied to it, and preventive actions that address similar failure modes before they surface.

What Is Root Cause Analysis?

Root cause analysis answers the question: "Why did this problem happen — and why will it happen again unless something changes?"

Three levels of cause are typically distinguished:

Level	Description	Example
Immediate cause	The direct trigger of the event	Conveyor motor thermal overload tripped
Contributing cause	Conditions that allowed the immediate cause to occur	Motor running at 110% rated current for six weeks
Root cause	The underlying systemic reason the contributing cause was present	No current monitoring in the PLC program; maintenance PM interval not reviewed after load increase

Stopping at the immediate cause produces a temporary fix. Stopping at the contributing cause improves things but does not close the systemic gap. Only addressing the root cause prevents the same failure mode from recurring in this or another asset.

Three levels of cause in manufacturing RCA — stopping at the immediate cause produces only a temporary fix; the root cause is the systemic condition that must change.

When to Use Root Cause Analysis

RCA is appropriate whenever a problem is:

Recurring — the same failure has happened two or more times
High-impact — significant unplanned downtime, safety incident, quality escape, or regulatory event
Poorly understood — the team disagrees on why it happened or the fix has already failed once
Costly — spare parts, overtime, or production loss justify the investigation effort

Not every fault warrants a full formal RCA. Minor, isolated events are better handled through normal work-order maintenance and basic troubleshooting. A useful threshold used by many reliability teams: if the combined cost (lost production + repair + labour) exceeds the cost of one hour of planned downtime, open an RCA.
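The threshold rule can be written as a small check. This is a minimal sketch, not a standard formula: the function name, parameter names, and all figures are hypothetical, and it reads "one hour of planned downtime" as the cost of that hour at a site-specific rate.

```python
def should_open_rca(lost_production: float, repair: float, labour: float,
                    planned_downtime_cost_per_hour: float) -> bool:
    """Return True when the combined event cost exceeds one hour of planned downtime."""
    combined_cost = lost_production + repair + labour
    return combined_cost > planned_downtime_cost_per_hour


# Hypothetical figures, all in the same currency.
minor_event = should_open_rca(300.0, 150.0, 100.0, planned_downtime_cost_per_hour=2000.0)
major_event = should_open_rca(4200.0, 900.0, 600.0, planned_downtime_cost_per_hour=2000.0)
```

With these made-up numbers the minor event stays in normal work-order handling and the major event opens an RCA.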

The RCA Process Step by Step

A root cause analysis follows the same broad sequence regardless of which analytical method you choose:

Step 1 — Define the Problem

Write a precise problem statement using objective, observable language. Include:

What happened (the failure mode)
When it happened (date, time, shift)
Where it happened (asset ID, line, cell)
How often it has occurred (first event or recurring)
What the impact was (downtime minutes, scrap quantity, safety consequence)

A weak problem statement ("Motor keeps tripping") produces a weak investigation. A strong statement ("Conveyor drive motor M-203 tripped on thermal overload three times in the last 30 days, each time after 4–6 hours of continuous runtime, causing an average of 47 minutes unplanned downtime per event") focuses the team immediately.
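One way to enforce the five elements is to capture the problem statement as a structured record and flag blanks. The sketch below is illustrative only: the class and field names are assumptions, and the text values are taken from the weak and strong statements above.

```python
from __future__ import annotations

from dataclasses import dataclass, fields


@dataclass
class ProblemStatement:
    what: str        # failure mode
    when: str        # date, time, shift
    where: str       # asset ID, line, cell
    how_often: str   # first event or recurring
    impact: str      # downtime minutes, scrap quantity, safety consequence

    def missing_fields(self) -> list[str]:
        """Names of the fields that were left blank."""
        return [f.name for f in fields(self) if not getattr(self, f.name).strip()]


weak = ProblemStatement(what="Motor keeps tripping", when="", where="",
                        how_often="", impact="")
strong = ProblemStatement(
    what="Conveyor drive motor M-203 tripped on thermal overload",
    when="Each time after 4-6 hours of continuous runtime",
    where="M-203",
    how_often="Three times in the last 30 days",
    impact="Average of 47 minutes unplanned downtime per event",
)
```

Calling `weak.missing_fields()` lists the four elements the weak statement omits, while the strong statement returns an empty list.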

Step 2 — Collect Data and Preserve Evidence

Gather all available evidence before it disappears:

Operator logs, shift reports, and maintenance work orders
PLC fault codes and diagnostic buffer snapshots
SCADA historian trends (current, temperature, speed, torque)
Alarm logs with timestamps
Physical inspection findings (bearing condition, belt tension, coupling alignment)
Recent maintenance history (what was last done, and when)

Evidence collection is time-critical. PLC diagnostic buffers are often circular and overwrite older entries. Historian data may be pruned. Collect everything within 24 hours of the event.

Step 3 — Identify Causal Factors

Map out the sequence of events and conditions that led to the failure. Ask "what conditions had to be true for this event to occur?" at each step. This typically produces a chain of three to eight factors rather than a single cause.

Step 4 — Apply an RCA Method

Select the method appropriate to the complexity of the problem (see the next section). Work through the method rigorously using the evidence collected, not assumptions.

Step 5 — Identify the Root Cause

The root cause is reached when further "why" questioning moves past the event itself and lands on a systemic gap that sits above the team's day-to-day work (a management system, a design standard, or a process that does not exist). It should be something that, if corrected, breaks the causal chain permanently.

Step 6 — Develop Corrective and Preventive Actions

For every root cause identified, specify:

Corrective action — fixes the condition that caused this specific event
Preventive action — changes the system to prevent the same failure mode on this and similar assets

Each action needs an owner, a due date, and a success metric.
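The owner, due date, and success metric can be tracked as a simple record. This is a sketch under stated assumptions: the `Action` class, its field names, the dates, and the metric wording are hypothetical, and a real CMMS will have its own schema.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Action:
    description: str
    kind: str            # "corrective" or "preventive"
    owner: str
    due: date
    success_metric: str
    verified: bool = False

    def is_overdue(self, today: date) -> bool:
        """Overdue means past the due date and not yet verified as effective."""
        return not self.verified and today > self.due


# Hypothetical due date and metric for a preventive action.
action = Action(
    description="Update MOC procedure to require electrical/instrumentation sign-off",
    kind="preventive",
    owner="Engineering Manager",
    due=date(2026, 6, 15),
    success_metric="Every drive speed or load change >10% carries an electrical sign-off",
)
overdue_before = action.is_overdue(date(2026, 6, 1))
overdue_after = action.is_overdue(date(2026, 7, 1))
```

Tying "overdue" to verification rather than to implementation reflects the point that an action is not closed until it is shown to work.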

Step 7 — Implement and Verify

Actions must be implemented and then verified as effective. Schedule a follow-up review (typically 30–90 days) to confirm the failure mode has not recurred and the metrics have improved.

Seven-step RCA process: PLC diagnostic buffers, SCADA historian trends, and alarm logs are the primary evidence sources for Steps 2 through 4.

Record the completed RCA in your computerized maintenance management system (CMMS) and share the learnings across similar assets. Many industrial failures are not unique: the same root cause often exists on every identical motor, pump, or conveyor in the facility.

The Main RCA Methods

5 Whys

The 5 Whys is the simplest and most widely used RCA technique. Ask "Why?" repeatedly until you reach a cause you can actually fix. Five iterations is a guideline, not a rule — some problems resolve in three; others require seven.

Strengths: Fast, requires no special tools, works well for straightforward causal chains.

Limitations: Can follow a single causal path and miss parallel contributing factors; quality depends heavily on the experience of the facilitator.

Fishbone Diagram (Ishikawa / Cause-and-Effect Diagram)

The fishbone diagram structures potential causes into categories, typically using the 6M framework:

Man (human factors, training, procedure)
Machine (equipment, tooling, fixtures)
Material (raw materials, components, consumables)
Method (process, work instructions, procedures)
Measurement (instrumentation, calibration, data quality)
Mother Nature / Environment (temperature, humidity, vibration)

Each category is explored for contributing factors, which are then evaluated against the evidence. The diagram produces a visual map of causal hypotheses before the team narrows to the verified root cause.

Strengths: Excellent for brainstorming; prevents the team from fixating on a single cause too early; keeps all hypotheses visible.

Limitations: Does not test causal relationships — everything on the diagram is a hypothesis until verified with data.
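Because every entry on the diagram is a hypothesis until verified, it can help to record a status next to each one. The following is a minimal sketch: the function names and status labels are assumptions, and the sample hypotheses come from the conveyor motor example in this article.

```python
CATEGORIES_6M = ["Man", "Machine", "Material", "Method", "Measurement", "Environment"]

# Each hypothesis carries a status: "unverified", "confirmed", or "rejected".
fishbone = {category: [] for category in CATEGORIES_6M}


def add_hypothesis(category: str, text: str, status: str = "unverified") -> None:
    """Attach a hypothesis to one of the 6M branches."""
    if category not in fishbone:
        raise KeyError(f"Unknown 6M category: {category}")
    fishbone[category].append({"text": text, "status": status})


def open_hypotheses() -> list:
    """Hypotheses that still need to be checked against evidence."""
    return [(category, h["text"])
            for category, items in fishbone.items()
            for h in items if h["status"] == "unverified"]


add_hypothesis("Machine", "Motor vents partially blocked by cardboard dust")
add_hypothesis("Environment", "Ambient temperature 38°C at motor location")
add_hypothesis("Method", "MOC procedure lacks electrical review", status="confirmed")
```

`open_hypotheses()` then returns only the Machine and Environment entries, which are the ones still waiting for evidence.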

Fault Tree Analysis (FTA)

Fault tree analysis starts from the top-level failure event (the "undesired event") and works downward through logical AND/OR gates to identify all combinations of lower-level events that could produce the failure. It is particularly valuable for safety-critical systems where multiple independent failures must coincide to produce a catastrophic outcome.

Strengths: Rigorous, handles complex parallel failure paths, quantifiable when failure rate data is available.

Limitations: Time-intensive to build correctly; requires solid knowledge of system logic and failure modes.
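When failure probabilities are available, the AND/OR gate logic can be quantified directly. The sketch below assumes the basic events are statistically independent. The tree itself and every probability in it are hypothetical, chosen only to show the arithmetic.

```python
from math import prod


def p_and(probabilities):
    """AND gate: all independent input events must occur."""
    return prod(probabilities)


def p_or(probabilities):
    """OR gate: at least one independent input event occurs."""
    return 1.0 - prod(1.0 - p for p in probabilities)


# Hypothetical basic-event probabilities over a fixed mission time.
p_sensor_fails = 0.01
p_backup_sensor_fails = 0.02
p_operator_misses_alarm = 0.1

# Top event: protection fails if BOTH sensors fail, OR the operator misses the alarm.
p_both_sensors = p_and([p_sensor_fails, p_backup_sensor_fails])
p_top = p_or([p_both_sensors, p_operator_misses_alarm])
```

Here the AND gate gives 0.0002 and the top event comes out at about 0.1002. The single OR-gated human factor dominates the result, which is the kind of insight a quantified tree is meant to surface.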

Pareto Analysis

Pareto analysis applies the 80/20 rule to failure data: identify which failure modes, equipment classes, or locations account for the majority of downtime or cost, then focus RCA resources there. It does not explain why a failure occurred but ensures the team works on the problems with the highest leverage.

Strengths: Data-driven prioritisation; prevents teams from spending RCA effort on low-impact events.

Limitations: Requires a good historical dataset; does not replace causal analysis — it only directs where to apply it.
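A Pareto ranking takes only a few lines once downtime is totalled per failure mode. This is a sketch with hypothetical failure modes and minutes. The 0.8 cutoff mirrors the 80/20 rule and is a convention, not a fixed requirement.

```python
def pareto(downtime_by_mode: dict, cutoff: float = 0.8) -> list:
    """Return the top failure modes that together reach `cutoff` of total downtime."""
    total = sum(downtime_by_mode.values())
    if total == 0:
        return []
    ranked = sorted(downtime_by_mode.items(), key=lambda item: item[1], reverse=True)
    selected, cumulative = [], 0.0
    for mode, minutes in ranked:
        selected.append(mode)
        cumulative += minutes
        if cumulative / total >= cutoff:
            break
    return selected


# Hypothetical downtime minutes per failure mode.
downtime = {
    "Thermal overload trip": 500,
    "Belt misalignment": 300,
    "Sensor fault": 100,
    "Jam at transfer": 60,
    "Other": 40,
}
focus = pareto(downtime)
```

With this data two of the five failure modes account for 80% of the downtime, so those two are where RCA effort would go first.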

FMEA

FMEA (Failure Mode and Effects Analysis) is a proactive method used before failures occur to anticipate what could go wrong, estimate severity and likelihood, and prioritise preventive action. It complements RCA: FMEA prevents known failure modes; RCA captures failure modes FMEA missed. Combined with reliability-centered maintenance, FMEA forms the foundation of a mature asset management strategy.
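FMEA prioritisation is commonly done with a risk priority number (RPN), the product of severity, occurrence, and detection scores on a 1 to 10 scale. The detection score is part of conventional FMEA practice rather than something this article specifies. The failure modes and scores below are hypothetical.

```python
def rpn(severity: int, occurrence: int, detection: int) -> int:
    """Risk priority number: severity x occurrence x detection, each scored 1-10."""
    for score in (severity, occurrence, detection):
        if not 1 <= score <= 10:
            raise ValueError("scores must be between 1 and 10")
    return severity * occurrence * detection


# Hypothetical (severity, occurrence, detection) scores per failure mode.
failure_modes = {
    "Motor thermal overload": (7, 6, 5),
    "Gearbox oil leak": (5, 3, 4),
    "Belt wear": (4, 5, 2),
}
ranked = sorted(failure_modes, key=lambda mode: rpn(*failure_modes[mode]), reverse=True)
```

The ranking places the thermal overload first (RPN 210), ahead of the oil leak (60) and belt wear (40).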

A Worked Manufacturing Example: Recurring Conveyor Motor Trip

This example walks through a realistic RCA using the 5 Whys method.

Problem Statement

Motor M-203 (conveyor drive, Assembly Line 4) tripped on thermal overload on 14 April, 29 April, and 12 May 2026. Each trip occurred between 4.5 and 5.5 hours into the shift. Average downtime per event: 47 minutes. Total production impact: approximately 140 minutes over 30 days.

Evidence Collected

SCADA historian: Motor current trending from 18 A at shift start to 26–27 A by hour 4 (rated FLA: 22 A)
Alarm log: Thermal overload trip alarm each time, no other alarms preceding
Maintenance work orders: Gearbox oil changed 18 March; no other recent work
Physical inspection: Drive belt tension within spec; motor vents partially blocked by accumulated cardboard dust; ambient temperature at motor location 38°C on all three days (summer peak)
Previous thermal overload setting: 24 A (set in 2022 when line ran at 60% speed)
Current line speed: Increased to 85% in January 2026 following production target change

5 Whys Analysis

Why did motor M-203 trip on thermal overload? Because the motor current exceeded the thermal overload relay trip threshold of 24 A.

Why did the motor current exceed 24 A? Because the conveyor load increased as product accumulated ahead of the downstream packing robot (which runs slower than the conveyor during summer peak — heat causes adhesive label delays).

Why was the motor running near its rated limit even at normal load? Because the line speed was increased from 60% to 85% in January 2026, increasing the steady-state current from 18 A to approximately 22–23 A, with no review of the motor sizing or protection settings.

Why were the protection settings not reviewed when the line speed changed? Because the management-of-change (MOC) procedure for production parameter changes did not require an electrical or instrumentation review — it only required a production supervisor sign-off.

Why did the MOC procedure not include an electrical review? Because the procedure was written in 2019 when motor sizing calculations were not part of the change management workflow, and it has not been updated since.

Root cause: The management-of-change procedure does not trigger an electrical/instrumentation review for production speed or load changes, allowing motor protection settings to remain mismatched with actual operating conditions.

Corrective Actions

Action	Owner	Due
Clean motor M-203 vents and recheck thermal class rating	Maintenance	Immediate
Recalculate and reset thermal overload to 23.5 A (matching 85% speed load profile with 10% margin)	Electrical Engineer	This week
Add PLC current monitoring block to M-203 with early-warning alarm at 21 A	Controls Engineer	Within 14 days
Audit all conveyor motors on Lines 1–6 for post-January speed change load vs. protection mismatch	Maintenance Manager	30 days

Preventive Actions

Action	Owner	Due
Update MOC procedure to require electrical/instrumentation sign-off for any drive speed or load change >10%	Engineering Manager	30 days
Add motor current trending to monthly reliability KPI dashboard	Reliability Engineer	30 days
Schedule FMEA review for all conveyor drive systems	Reliability Engineer	60 days

Corrective vs Preventive Actions

These terms are often confused:

Corrective action fixes the specific failure that occurred. In the example above, cleaning the motor vents and resetting the overload relay are corrective actions — they restore M-203 to a safe operating state.

Preventive action changes the system so the root cause cannot produce the same failure in future. Updating the MOC procedure is a preventive action — it closes the systemic gap that allowed the wrong protection setting to persist undetected for five months.

A complete RCA produces both. Corrective actions alone are insufficient for recurring failures because they do not address the underlying system condition that allowed the failure to occur.

How PLC and SCADA Data Accelerates RCA

Modern control systems contain a wealth of evidence that dramatically shortens RCA investigation time — if you know where to look and how to use it. For a deeper treatment of fault diagnostics, see the PLC troubleshooting guide.

PLC Diagnostic Buffer

Most modern PLCs maintain a circular diagnostic buffer that records fault codes with timestamps, fault class, and the rack/slot/channel affected. Siemens S7 systems store this in the CPU's diagnostic buffer (accessible via TIA Portal's online diagnostics view). Allen-Bradley ControlLogix controllers report major and minor faults on the corresponding tabs of the Controller Properties dialog in Studio 5000 Logix Designer. This data pinpoints the exact moment of failure and any immediately preceding faults, often eliminating hours of speculation.

RCA application: Cross-reference the fault timestamp in the PLC buffer against the SCADA historian to see exactly what process values were present at the moment of failure.
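The cross-reference amounts to looking up the last historian sample at or before the fault timestamp. The sketch below assumes the historian export is already a time-sorted list of (timestamp, value) pairs and that the PLC and historian clocks are synchronised. The sample values and fault time are hypothetical.

```python
from bisect import bisect_right
from datetime import datetime, timedelta


def value_at(samples, fault_time):
    """Return the last historian sample at or before fault_time, or None.

    `samples` is a list of (timestamp, value) tuples sorted by timestamp.
    """
    timestamps = [t for t, _ in samples]
    index = bisect_right(timestamps, fault_time)
    return samples[index - 1] if index > 0 else None


# Hypothetical hourly motor-current samples (A) and a hypothetical fault time.
start = datetime(2026, 5, 12, 6, 0, 0)
samples = [(start + timedelta(hours=h), amps)
           for h, amps in enumerate([18.0, 20.5, 23.0, 25.0, 26.5])]
fault_time = start + timedelta(hours=4, minutes=40)
last_sample = value_at(samples, fault_time)
```

If the two clocks drift apart, the lookup will silently return the wrong sample, so checking time synchronisation is worth doing before trusting the result.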

SCADA Historian Trends

A process historian recording 1-second (or faster) sample rates on current, temperature, speed, pressure, and torque gives the RCA team a precise picture of conditions in the minutes and hours before a failure. Trends often reveal:

A gradual drift (slow degradation over weeks) versus a sudden step change (external disturbance or upstream change)
Whether the failure is thermally driven (temperature climbing over a full shift) versus load-driven (spike at a specific production event)
Correlation with shift changeover, product changeover, or ambient temperature cycles

In the motor example above, the SCADA trend showing current rising from 18 A to 27 A over four hours was the key piece of evidence that distinguished a thermal accumulation problem from a sudden electrical fault.

SCADA historian trend for motor M-203: current climbs steadily from 18 A at shift start to 27 A before tripping the overload relay — thermal accumulation, not a sudden electrical fault.
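The drift-versus-step distinction can be approximated numerically. This is a rough heuristic sketch, not an established algorithm: a series counts as a step change when one sample-to-sample jump makes up at least half of the total change. The 0.5 fraction and the intermediate sample values are assumptions; only the 18 A and 27 A endpoints come from the M-203 example.

```python
def classify_trend(values, step_fraction=0.5):
    """Label a series as 'step change', 'gradual drift', or 'flat'.

    A series is a step change when a single sample-to-sample jump accounts
    for at least `step_fraction` of the total change; otherwise it is a drift.
    """
    total_change = values[-1] - values[0]
    if abs(total_change) < 1e-9:
        return "flat"
    largest_jump = max(abs(b - a) for a, b in zip(values, values[1:]))
    if largest_jump >= step_fraction * abs(total_change):
        return "step change"
    return "gradual drift"


# Hourly motor current in amps; intermediate values are hypothetical.
gradual = [18.0, 20.0, 22.5, 25.0, 27.0]
sudden = [18.0, 18.1, 18.0, 26.9, 27.0]
```

`classify_trend(gradual)` returns "gradual drift" and `classify_trend(sudden)` returns "step change". Real historian data is noisier and would usually need smoothing first.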

Alarm Logs

Alarm logs with millisecond timestamps reveal the sequence of events leading to a trip. A well-configured alarm system following alarm management best practices will show whether any early-warning alarms fired before the trip (and whether they were acknowledged), or whether the trip was the first alert the operator received. The absence of early-warning alarms is itself a finding — it means the protection philosophy did not give the operator time to intervene.
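Checking for early warnings is a filter on the alarm log. The sketch below assumes each alarm is a record with a time, tag, priority, and acknowledged flag. The tag names, priority labels, and timestamps are hypothetical and will differ between alarm systems.

```python
from datetime import datetime


def warnings_before_trip(alarms, trip_tag):
    """Return warning-priority alarms that fired before the first trip alarm.

    `alarms` is a list of dicts with 'time', 'tag', 'priority', 'acknowledged'.
    """
    ordered = sorted(alarms, key=lambda a: a["time"])
    trip = next((a for a in ordered if a["tag"] == trip_tag), None)
    if trip is None:
        return []
    return [a for a in ordered
            if a["time"] < trip["time"] and a["priority"] == "warning"]


# Hypothetical log: one unacknowledged warning 12 minutes before the trip.
log = [
    {"time": datetime(2026, 5, 12, 10, 30), "tag": "M203_CURRENT_HI",
     "priority": "warning", "acknowledged": False},
    {"time": datetime(2026, 5, 12, 10, 42), "tag": "M203_OL_TRIP",
     "priority": "trip", "acknowledged": True},
]
early = warnings_before_trip(log, "M203_OL_TRIP")
unacknowledged = [a["tag"] for a in early if not a["acknowledged"]]
none_early = warnings_before_trip([log[1]], "M203_OL_TRIP")
```

An empty result, as in `none_early`, is the case where the trip was the operator's first alert.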

Predictive Maintenance Data

Vibration, temperature, and current signature analysis data from a PLC predictive maintenance program provides the baseline against which failure data is compared. If you have 12 months of vibration trend data on a motor bearing, you can determine whether the failure was a sudden random event or a slow degradation that passed the threshold because no alert existed. This evidence is essential for distinguishing between random failures (which need redundancy or on-condition replacement) and systematic failures (which need a design or process change).

Structured Data Collection from the Control System

When opening an RCA, collect the following from the control system before beginning causal analysis:

PLC diagnostic buffer — all faults in the 24 hours before the event
Historian trend — relevant process variables for at least 8 hours before the event (or since the last shift start)
Alarm log — all alarms from 2 hours before the event to 30 minutes after
Setpoint and parameter history — confirm no setpoint or configuration change was made before the failure
Batch/recipe log — identify which product or recipe was running at the time

This data package takes 15–30 minutes to pull and often resolves 50% of the causal hypotheses before the physical investigation begins.
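The collection windows in the checklist can be computed from the event timestamp so that every export uses the same bounds. This is a minimal sketch: the function name, dictionary keys, and event time are hypothetical, and the 8-hour historian window is the stated minimum rather than a fixed length.

```python
from datetime import datetime, timedelta


def evidence_windows(event_time):
    """Return (start, end) collection windows for each control-system source."""
    return {
        "plc_diagnostic_buffer": (event_time - timedelta(hours=24), event_time),
        "historian_trend": (event_time - timedelta(hours=8), event_time),
        "alarm_log": (event_time - timedelta(hours=2),
                      event_time + timedelta(minutes=30)),
    }


# Hypothetical event time.
event = datetime(2026, 5, 12, 10, 42)
windows = evidence_windows(event)
```

For this event time the alarm-log window runs from 08:42 to 11:12 on the same day, and the diagnostic buffer window starts at 10:42 on the previous day.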

Common Pitfalls in Root Cause Analysis

Stopping at the Symptom

"The motor tripped" is not a root cause. "The overload relay was set too low for the actual load" is a contributing cause. "The MOC process does not require electrical review" is a root cause. Teams under time pressure tend to stop one or two levels too early.

Blame as a Root Cause

Human error is almost always a contributing factor but almost never a root cause. Asking "why did the operator fail to notice the rising current?" leads to "the HMI had no current trend display," which leads to "the PLC program had no current monitoring block," which leads to a fixable system gap. Blame terminates the investigation; systems thinking continues it.

Confusing Correlation with Causation

The fishbone diagram and brainstorming steps generate hypotheses. Every hypothesis must be verified against evidence before it becomes a confirmed cause. "It rained the day before each failure" may correlate with moisture ingress in a motor terminal box — or it may be coincidence. Test it.

Single-Cause Thinking

Complex industrial failures typically have multiple contributing causes that must align for the failure to occur. The 5 Whys method can mask this if the facilitator follows only one causal thread. Use the fishbone diagram or fault tree analysis when the failure pattern suggests multiple independent factors at play.

No Follow-Through on Actions

RCA loses all value if corrective and preventive actions are not implemented, verified, and tracked to closure. Actions should be logged in the CMMS with a due date, owner, and verification criterion — not kept in a meeting document that no one reviews.

Fishbone (Ishikawa) diagram for the M-203 thermal overload RCA — the 6M categories surface all causal hypotheses before the team narrows to the verified root cause.

Frequently Asked Questions

What is root cause analysis? Root cause analysis is a structured investigative process used to identify the fundamental reason a problem occurred. Unlike reactive troubleshooting, which restores operation as quickly as possible, RCA works backward from the failure event to the underlying systemic cause — the condition that, if left unchanged, will allow the same failure to recur. The result of an RCA is a verified root cause plus corrective and preventive actions that permanently eliminate it.

What are the 5 Whys? The 5 Whys is an RCA technique that asks "Why?" repeatedly — typically five times — until the causal chain reaches an actionable root cause. Developed at Toyota as part of the Toyota Production System, it is the most widely used RCA method in manufacturing because it requires no special tools and can be completed in 20–30 minutes by a small team with direct knowledge of the failure. The technique works best for straightforward single-path failure chains; for complex failures with multiple contributing factors, fishbone diagrams or fault tree analysis provide more rigour.

What is a fishbone diagram? A fishbone diagram (also called an Ishikawa diagram or cause-and-effect diagram) is a visual tool that maps potential causes of a problem into categories — typically Man, Machine, Material, Method, Measurement, and Environment. The problem statement is written at the "head" of the fish; each category branch is a "bone" populated with hypothesised causes. The diagram is used in the brainstorming phase of an RCA to ensure the team considers all possible cause categories before narrowing to a verified root cause.

How do you do a root cause analysis? The core steps are: (1) write a precise problem statement; (2) collect all available data and preserve evidence; (3) map causal factors using a 5 Whys or fishbone approach; (4) verify hypotheses against evidence; (5) identify the root cause — the deepest systemic condition in the causal chain; (6) develop corrective actions for the specific failure and preventive actions for the system gap; (7) implement, verify, and document. In industrial settings, PLC diagnostic buffers, SCADA historian trends, and alarm logs are the most valuable data sources for steps 2 and 4.