What Is FMEA? A Practical Guide for Engineers
FMEA is one of those tools that every engineer has heard of, most have used, and almost no team actually does well.
The concept is straightforward: systematically identify ways your design or process can fail, assess how bad each failure would be, and put controls in place to prevent or detect them before they reach the customer.
In practice, FMEAs are usually 40-tab spreadsheets maintained by whoever was most recently assigned the task, last updated two design revisions ago, and opened primarily when an auditor asks for them.
This guide covers what FMEA actually is, how to build one properly, and why the maintenance problem is harder than the analysis itself.
What FMEA Stands For
Failure Mode and Effects Analysis.
- Failure mode: the specific way a component, process step, or system can fail to perform its intended function
- Effect: what happens as a result of that failure — to the next assembly, the system, or the end user
- Analysis: the structured process of identifying, assessing, and prioritising these failures so action can be taken
FMEA was developed by the US military in the 1940s (MIL-P-1629) and adopted widely in aerospace and automotive manufacturing. Today it is a standard requirement in automotive (IATF 16949) and defence work, one of the most common techniques used to meet medical device risk management requirements (ISO 14971), and a best practice in product development broadly.
Types of FMEA
Design FMEA (DFMEA)
DFMEA analyses the product design itself. It asks: what failure modes exist in this component or assembly, and what are their effects on the system and the customer?
DFMEA is done during the design phase, ideally before detailed drawings are released. The focus is on design decisions: material selection, geometry, tolerances, interfaces between components.
Process FMEA (PFMEA)
PFMEA analyses the manufacturing or assembly process. It asks: what could go wrong during production, and what effect would that have on the part quality and the end product?
PFMEA is done when the manufacturing process is being defined, typically during process planning before production launch.
System FMEA
System-level FMEA looks at interactions between subsystems. It's used for complex products where a failure in one subsystem can cascade to others — common in automotive, aerospace, and medical devices.
Most engineering teams in product development will use DFMEA and PFMEA. This guide focuses on DFMEA.
The Structure of a DFMEA
A standard DFMEA has the following columns:
Item / Function
What component or assembly interface are you analysing? What is its intended function?
Example: Rear axle shaft — transmit torque from differential to wheel hub
Potential Failure Mode
How could this component fail to perform its function?
One component can have multiple failure modes. Be specific — "shaft fails" is not useful. "Shaft fractures at the spline root under peak torque" is.
Common failure mode categories: fracture, deformation, wear, corrosion, fatigue, leakage, binding, electrical short/open, contamination
Potential Effect of Failure
If this failure mode occurs, what happens? Consider the immediate effect (on the next assembly), the system effect (on the overall product), and the end effect (on the customer or user).
Example: Shaft fracture → loss of drive → vehicle cannot move → driver stranded
Severity (S)
How serious is the worst-case effect? Rated 1–10.
| Rating | Criteria |
|--------|----------|
| 9–10 | Safety critical: may injure operator/user without warning |
| 7–8 | Major loss of function, customer very dissatisfied |
| 5–6 | Reduced function, customer dissatisfied |
| 3–4 | Minor effect, customer slightly annoyed |
| 1–2 | No discernible effect |
Severity is a property of the effect, not the failure mode. If the effect is catastrophic, severity is high regardless of how likely the failure is.
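The severity bands in the table above can be encoded as a small lookup, which is useful when checking ratings exported from a spreadsheet. This is a minimal sketch: the function name `severity_band` and the shortened band labels are illustrative assumptions, not part of any FMEA standard.

```python
def severity_band(rating: int) -> str:
    """Map a 1-10 severity rating to the criteria band from the table above."""
    if not 1 <= rating <= 10:
        raise ValueError(f"severity must be between 1 and 10, got {rating}")
    if rating >= 9:
        return "safety critical"
    if rating >= 7:
        return "major loss of function"
    if rating >= 5:
        return "reduced function"
    if rating >= 3:
        return "minor effect"
    return "no discernible effect"


# The band depends only on the effect rating, never on how likely the failure is.
for rating in (10, 7, 4, 1):
    print(rating, severity_band(rating))
```

Rejecting out-of-range values up front catches typos such as a severity of 0 or 11 before they distort any later ranking.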
Potential Causes
What specific design or material conditions could cause this failure mode?
Be specific and actionable — not "material issue" but "insufficient fillet radius at spline root causes stress concentration exceeding fatigue limit under cyclic loading."
Each failure mode can have multiple causes. Analyse each cause separately.
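One way to picture the item, failure mode, and cause columns is as a nested record that flattens to one row per cause. The sketch below uses the rear axle shaft example from this guide; the class names, the `rows_per_cause` helper, and the second cause are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class FailureMode:
    description: str
    effects: list[str]  # ordered: immediate effect, system effect, end effect
    causes: list[str] = field(default_factory=list)


@dataclass
class Item:
    name: str
    function: str
    failure_modes: list[FailureMode] = field(default_factory=list)


def rows_per_cause(item: Item) -> list[tuple[str, str, str]]:
    """Flatten to one (item, failure mode, cause) row per cause."""
    return [
        (item.name, mode.description, cause)
        for mode in item.failure_modes
        for cause in mode.causes
    ]


shaft = Item(
    name="Rear axle shaft",
    function="Transmit torque from differential to wheel hub",
    failure_modes=[
        FailureMode(
            description="Shaft fractures at the spline root under peak torque",
            effects=["loss of drive", "vehicle cannot move", "driver stranded"],
            causes=[
                "Insufficient fillet radius at spline root",
                "Material hardness below specification",  # hypothetical second cause
            ],
        )
    ],
)

for row in rows_per_cause(shaft):
    print(row)
```

Keeping one row per cause means each cause can later carry its own occurrence rating, controls, and detection rating.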
Occurrence (O)
How likely is this cause to occur, given current design and process controls? Rated 1–10.
| Rating | Criteria |
|--------|----------|
| 9–10 | Almost certain: failure is expected |
| 7–8 | High: repeated failures likely |
| 5–6 | Moderate: occasional failures |
| 3–4 | Low: relatively few failures |
| 1–2 | Remote: unlikely under normal conditions |
Occurrence ratings are based on historical data, engineering judgement, and experience with similar designs.
Current Controls
What design features, analyses, or tests currently exist to either prevent the cause from occurring or detect the failure before it reaches the customer?
Separate prevention controls (reduce occurrence) from detection controls (reduce the chance a failure slips through).
Example prevention: increased fillet radius per FEA, material hardness specification
Example detection: fatigue test to 10× design life, dimensional inspection of fillet radius
Detection (D)
How likely are the current controls to detect the failure mode or its cause before it reaches the next stage or the customer? Rated 1–10.
| Rating | Criteria |
|--------|----------|
| 9–10 | Controls will almost certainly not detect it |
| 7–8 | Low likelihood of detection |
| 5–6 | Moderate likelihood |
| 3–4 | High likelihood |
| 1–2 | Controls will almost certainly detect it |
Counter-intuitive but important: low detection number = good detection. A 2 means your controls are very likely to catch the failure.
RPN (Risk Priority Number)
RPN = S × O × D
This gives a single number to prioritise action. Higher RPN = higher priority.
Important caveat: RPN is a relative ranking tool, not an absolute risk score. A failure with S=10, O=2, D=1 (RPN=20) is often more important to address than one with S=3, O=3, D=5 (RPN=45) — because the first has a catastrophic severity that warrants action regardless of RPN.
Always review high-severity items independently of their RPN.
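The two items from the caveat above can be worked through in a few lines. This is a sketch rather than a prescribed method: the `rpn` helper, the item labels, and the choice to sort severity 9–10 items ahead of everything else are assumptions made for illustration.

```python
def rpn(severity: int, occurrence: int, detection: int) -> int:
    """Risk Priority Number: S x O x D, each rated 1-10."""
    for name, value in (
        ("severity", severity),
        ("occurrence", occurrence),
        ("detection", detection),
    ):
        if not 1 <= value <= 10:
            raise ValueError(f"{name} must be between 1 and 10, got {value}")
    return severity * occurrence * detection


# (label, S, O, D) for the two items discussed in the caveat
items = [
    ("catastrophic but rare", 10, 2, 1),
    ("minor but poorly detected", 3, 3, 5),
]

# Ranking by RPN alone puts the minor item first (45 vs 20).
by_rpn = sorted(items, key=lambda item: rpn(*item[1:]), reverse=True)

# Ranking with a severity override: severity 9-10 items first, then by RPN.
SEVERITY_REVIEW_THRESHOLD = 9
by_priority = sorted(
    items,
    key=lambda item: (item[1] >= SEVERITY_REVIEW_THRESHOLD, rpn(*item[1:])),
    reverse=True,
)

for label, s, o, d in by_priority:
    print(f"{label}: S={s} O={o} D={d} RPN={rpn(s, o, d)}")
```

Sorting on a `(high severity, RPN)` pair keeps RPN as the tie-breaker while guaranteeing that safety-critical items are never buried below a larger product of small numbers.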
Recommended Actions
For high-RPN items (or any item with severity 9–10), what specific actions will reduce risk?
Actions should target severity (redesign to eliminate the effect), occurrence (design change to prevent the cause), or detection (add or improve a test).
After actions are implemented, reassess S, O, and D to calculate the revised RPN.
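A before-and-after comparison makes the effect of an action visible. The ratings below are hypothetical numbers chosen for the shaft-fracture example, not values from a real analysis, and the `Ratings` class is an illustrative name.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Ratings:
    severity: int
    occurrence: int
    detection: int

    @property
    def rpn(self) -> int:
        return self.severity * self.occurrence * self.detection


# Hypothetical ratings before any action is taken.
before = Ratings(severity=8, occurrence=5, detection=6)

# Hypothetical ratings after a larger fillet radius (lowers occurrence) and an
# added fatigue test (improves detection). Severity is unchanged because the
# effect of a fracture is the same if it still happens.
after = Ratings(severity=8, occurrence=2, detection=3)

reduction = 1 - after.rpn / before.rpn
print(f"RPN before: {before.rpn}, after: {after.rpn}, reduction: {reduction:.0%}")
```

Note that only a design change that removes or softens the effect itself would justify lowering the severity rating.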
How to Build an FMEA Step by Step
1. Define the scope. Which component, assembly, or process are you analysing? What are the functional boundaries?
2. Assemble the right people. A good FMEA is not a solo task. You need design engineering, manufacturing, quality, and ideally someone from field service who knows how products fail in use. The combined knowledge of this team is what makes the analysis valuable.
3. List all functions. For each item in scope, define what it must do. Be specific and measurable where possible.
4. Brainstorm failure modes. For each function, what are all the ways it could fail? Don't filter yet — capture everything.
5. Identify effects and rate severity. For each failure mode, trace the effect chain through to the end user. Rate severity.
6. Identify causes and rate occurrence. For each failure mode, what specific conditions cause it? Rate likelihood.
7. Document current controls and rate detection. What prevents or detects each cause? Rate how effective those controls are.
8. Calculate RPN and prioritise. Focus action on the highest RPN items, and always separately review any item with severity 9 or 10.
9. Define and implement actions. Assign owners, set due dates. Track completion.
10. Reassess after actions are implemented. Update the FMEA with revised ratings.
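Steps 9 and 10 are the ones most often dropped, so it can help to track actions explicitly. The following is a minimal sketch assuming a simple in-memory list; the `Action` class, the owner names, and the dates are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Action:
    description: str
    owner: str
    due: date
    completed: Optional[date] = None

    def is_overdue(self, today: date) -> bool:
        """An action is overdue if it is still open after its due date."""
        return self.completed is None and today > self.due


actions = [
    Action(
        "Increase fillet radius at spline root",
        owner="design",
        due=date(2024, 3, 1),
        completed=date(2024, 2, 20),
    ),
    Action("Add fatigue test to validation plan", owner="test", due=date(2024, 3, 15)),
]

today = date(2024, 4, 1)  # fixed date so the example is reproducible

# Step 9: open actions past their due date need chasing.
overdue = [a for a in actions if a.is_overdue(today)]

# Step 10: completed actions mean the affected rows need their ratings reassessed.
needs_reassessment = [a for a in actions if a.completed is not None]

for a in overdue:
    print(f"OVERDUE: {a.description} (owner: {a.owner}, due {a.due})")
for a in needs_reassessment:
    print(f"REASSESS: {a.description} (completed {a.completed})")
```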
The Maintenance Problem
Here is the honest reality of FMEA in most engineering teams:
The initial FMEA gets done — sometimes thoroughly, sometimes quickly — and then the design changes. A component gets redesigned. A tolerance gets tightened. A new failure mode emerges from testing.
The FMEA doesn't get updated.
By the time the product ships, the FMEA reflects a design that no longer exists. It's worse than useless in an audit because it gives a false sense of coverage.
The root cause isn't lack of discipline. It's that re-running an FMEA against a changed design is genuinely time-consuming when it's done in a spreadsheet. Engineers are under schedule pressure. Updating the FMEA falls to the bottom of the priority list.
The only sustainable solution is to make updating the FMEA faster and easier — which means either a structured process with dedicated review gates, or tooling that reduces the time required to generate and update the analysis.
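A review gate can be as simple as comparing the design revision each FMEA row was last reviewed against with the current revision of that item. The sketch below assumes revisions are tracked as integers per item; the `FmeaRow` class, the `stale_rows` function, and the revision numbers are hypothetical and do not describe any particular tool.

```python
from dataclasses import dataclass


@dataclass
class FmeaRow:
    item: str
    reviewed_at_revision: int  # design revision when this row was last reviewed


def stale_rows(rows: list[FmeaRow], current_revisions: dict[str, int]) -> list[FmeaRow]:
    """Return rows whose item has been revised since the row was last reviewed.

    Items missing from current_revisions are treated as unchanged.
    """
    return [
        row
        for row in rows
        if current_revisions.get(row.item, row.reviewed_at_revision)
        > row.reviewed_at_revision
    ]


rows = [
    FmeaRow("Rear axle shaft", reviewed_at_revision=3),
    FmeaRow("Wheel hub", reviewed_at_revision=5),
]
current_revisions = {"Rear axle shaft": 5, "Wheel hub": 5}

stale = stale_rows(rows, current_revisions)
for row in stale:
    behind = current_revisions[row.item] - row.reviewed_at_revision
    print(f"{row.item}: FMEA is {behind} revision(s) behind the design")
```

Run at each design release, a check like this turns "the FMEA is out of date" from something discovered in an audit into something flagged at the gate.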
ForgePilot has an FMEA generation tool that produces structured failure mode analysis from your design inputs — so the barrier to keeping it current is lower than starting from a blank spreadsheet each time.
Summary
FMEA is a structured way to find and prioritise failure risks before they become field problems. Done well, it catches the failure modes that slip through design review because they're non-obvious or require considering multiple parts of the system simultaneously.
The core of a useful FMEA is specificity. Generic failure modes, generic causes, and generic controls produce a document that satisfies an audit checkbox but catches nothing.
The most important thing is getting the right people in the room, being specific about what can go wrong and why, and actually updating the document when the design changes.