NASA C-MAPSS FD001

Digital Twin Framework for Industrial Asset Prognostics

Engine wear is hard to observe directly, and sensor readings are noisy. Operators still need a reliable estimate of how many operating cycles remain before failure, not just the direction of a trend line.

My approach turns noisy engine data into predictions that give both a plausible range for remaining life and a sense of how trustworthy that prediction is, so maintenance and safety teams can plan with risk in view, not a single guess.

Primary benchmark (FD001 test set)

  • RMSE: 12.51
  • NASA score: 274.41 (lower is better)

Sample RUL - Engine 31
The predicted interval is shown against ground truth. The bar is scaled using only this engine's low, high, and true values.
Predicted interval vs. true RUL (confidence 89.9%)

  • Low: 5.1
  • Median: 7.4
  • High: 9.2

Note: Units are in cycles.

Problem overview

When failure is costly, timing is everything

Why predict remaining useful life?

Aircraft engines degrade over time due to wear and operational stress. Predicting remaining useful life (RUL) is critical for planning maintenance and preventing unexpected failures.

Operators do not get clean lab readings: sensors drift, loads change, and fleets age differently. The question is not only how long a unit might run, but how wrong a prediction can be before a decision becomes unsafe or wasteful.

Close-up of a turbofan jet engine in an aircraft hangar, representing hardware where wear accumulates and RUL is estimated.
Degradation is physical: sensors and models must support safe maintenance timing. (Photo: Y M / Unsplash)

In real-world systems

Late predictions

If you think the engine has more life than it really does, maintenance and inspections can slip until failure becomes a real possibility. In safety-critical fleets, that is often the worst-case risk.

Early predictions

If you pull maintenance too soon, you pay for parts, labor, and downtime you might not have needed. That trade is acceptable when safety dominates, but painful when it happens fleet-wide.

“A single RUL number is easy to report and hard to defend when stakes are high, because it pretends the future is precise when the data is not.”

Common framing in prognostics and health management (PHM) practice
Person reviewing data and charts on a laptop, representing planning and decisions that should use ranges and uncertainty, not a single headline number.
Decision teams need spread and risk visible, not only a point forecast on a dashboard. (Photo: Pexels)

Beyond a single estimate

Traditional models often output one value per engine or time step. That is simple to log, but it does not reflect uncertainty, and without uncertainty planners cannot weigh late-failure risk against early-maintenance cost in a principled way.

Quantile intervals, confidence-style scores, and explanations turn the same sensors into a decision-facing view: not only when you think failure is near, but how tight that belief is, so teams can act with eyes open, not from a single guess.

How condition evolves

Three stages from stable operation to rising failure risk: a simple curve that anchors the interval forecasts and explanations.

Healthy

Stable sensors, full margin before service.

Degradation

Wear accumulates, and early warnings matter.

Failure risk

Late RUL errors are most costly - uncertainty helps.

Method

How the model works

This project uses a data-driven pipeline to predict remaining useful life (RUL) of aircraft engines from multivariate sensor data. The numbered cards below walk through the pipeline step by step.

01 · What goes into the model

Input data

NASA C-MAPSS FD001: multivariate sensor time series per engine from healthy operation to failure.

  • NASA C-MAPSS FD001 dataset
  • 21 sensor measurements per cycle

Key idea

Instead of treating rows as independent samples, the model learns from sequences of engine behavior over time.

02 · Steps performed

Data preprocessing

Remove uninformative sensors, build RUL targets, cap and normalize, then slice into fixed-length windows.

  • Removed constant / non-informative sensors
  • Generated remaining useful life (RUL) labels

Key idea

Stable training, a meaningful sequence representation, and a consistent scale across sensors.
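The labeling and windowing described above can be sketched in plain Python. The cap value (125 cycles) and the 30-cycle window are illustrative assumptions consistent with the description, and `rul_labels` / `sliding_windows` are hypothetical helper names, not the project's actual code.

```python
# Sketch of RUL labeling and windowing for run-to-failure data: each engine's
# series ends at failure, so RUL at cycle t is (last cycle - t), capped so that
# early-life wear looks flat. Cap and window length are illustrative.

RUL_CAP = 125      # assumed cap on RUL targets (cycles)
WINDOW = 30        # fixed-length input window, matching "last 30 cycles"

def rul_labels(n_cycles, cap=RUL_CAP):
    """RUL target for each cycle of one engine, capped at `cap`."""
    return [min(n_cycles - 1 - t, cap) for t in range(n_cycles)]

def sliding_windows(series, labels, window=WINDOW):
    """Slice one engine's series into (window, label) pairs.

    `series` is a list of per-cycle feature vectors; a window's label is
    the RUL at its last cycle.
    """
    return [
        (series[t - window + 1 : t + 1], labels[t])
        for t in range(window - 1, len(series))
    ]

# Toy engine: 40 cycles, 2 features per cycle
engine = [[float(t), float(t) * 0.5] for t in range(40)]
targets = rul_labels(len(engine))
pairs = sliding_windows(engine, targets)
print(len(pairs), targets[0], targets[-1])  # window count, first and last RUL
```

Each window carries 30 consecutive cycles, so the model always sees recent dynamics rather than a single snapshot.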

03 · Beyond raw readings

Feature engineering

Engineer trends, deltas, rolling stats, interactions, and smoothed signals so degradation shows up in dynamics.

  • Rate of change (rc) - short-term variation
  • Trend (tr) - longer-horizon direction (slope)

Key idea

Degradation shows up in how sensors evolve, not only in their levels. The model leans on dynamic patterns (change and trend) more than raw magnitudes alone.
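Two of the engineered signals named above, rate of change (rc) and trend (tr), can be sketched as follows. The 10-cycle trend window is an illustrative assumption, and the function names are hypothetical.

```python
# Minimal sketch of two dynamic features: rc (short-term difference) and
# tr (rolling least-squares slope). Window length is an assumption.

def rate_of_change(x):
    """First difference: x[t] - x[t-1], with 0.0 for the first cycle."""
    return [0.0] + [b - a for a, b in zip(x, x[1:])]

def rolling_trend(x, window=10):
    """Least-squares slope over the trailing `window` samples."""
    out = []
    for t in range(len(x)):
        seg = x[max(0, t - window + 1) : t + 1]
        n = len(seg)
        if n < 2:
            out.append(0.0)
            continue
        ts = list(range(n))
        mt, mx = sum(ts) / n, sum(seg) / n
        num = sum((a - mt) * (b - mx) for a, b in zip(ts, seg))
        den = sum((a - mt) ** 2 for a in ts)
        out.append(num / den)
    return out

signal = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5]     # steadily drifting sensor
print(rate_of_change(signal)[1])            # per-cycle step, ~0.1
print(round(rolling_trend(signal)[-1], 3))  # slope ~0.1 per cycle
```

A flat sensor yields near-zero rc and tr; a drifting one lights up both, which is exactly the dynamic pattern the model leans on.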

04 · LSTM for sequences

Model architecture

An LSTM reads each engineered window and learns temporal structure toward probabilistic RUL heads.

  • Long Short-Term Memory (LSTM) network
  • Input: last 30 cycles with multiple engineered features

Key idea

The LSTM summarizes how the engineered sequence evolves - not just a snapshot at one cycle.

05 · Intervals, not only a point estimate

Probabilistic prediction

Instead of one number, the model outputs lower, median, and upper RUL to encode uncertainty.

  • Traditional: a single RUL number
  • Here: lower bound (~10th percentile), median (~50th), upper bound (~90th)

Key idea

A range makes late-RUL risk visible: when the band is wide, decisions should be more cautious.

06 · From interval width

Confidence estimation

Narrow prediction intervals imply higher confidence; wide intervals flag uncertain regimes.

  • Confidence scales inversely with interval width (narrower → higher confidence)
  • Supports maintenance decisions under noisy, real-world telemetry

Key idea

Turns probabilistic outputs into a simple reliability signal for operators and analysts.
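One way to turn interval width into a confidence-style score is sketched below. The exact mapping the project uses is not specified in this section; this assumes a simple inverse relationship, with `scale` as an illustrative free parameter.

```python
# Hedged sketch: map a (low, median, high) interval to a score in (0, 1].
# Wider band relative to the median -> lower confidence.

def confidence_from_interval(low, median, high, scale=1.0):
    """Confidence-like score that shrinks as the (high - low) band widens.

    `scale` controls how fast confidence decays with relative width;
    it is an assumed parameter, not a value from the project.
    """
    width = max(high - low, 0.0)
    rel = width / max(abs(median), 1e-9)  # width relative to central estimate
    return 1.0 / (1.0 + scale * rel)

narrow = confidence_from_interval(low=70.0, median=75.0, high=80.0)
wide = confidence_from_interval(low=40.0, median=75.0, high=110.0)
print(round(narrow, 3), round(wide, 3))  # the narrow band scores higher
```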

07 · Quantile (pinball) loss

Model training

Quantile (pinball) loss trains low, median, and high outputs with asymmetric penalties.

  • Pinball loss trains low, median, and high heads with different asymmetry
  • Underestimating vs overestimating RUL can carry different costs - quantile loss encodes that

Key idea

The model learns calibrated bounds around the central estimate, not only the middle of the distribution.
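The pinball loss itself is standard and compact enough to show directly; batching and framework details are omitted in this sketch.

```python
# Pinball (quantile) loss for one prediction at quantile q.
# Under-prediction costs q per cycle of error; over-prediction costs (1 - q).

def pinball_loss(y_true, y_pred, q):
    diff = y_true - y_pred
    return max(q * diff, (q - 1.0) * diff)

y = 100.0  # true RUL (cycles)

# For the upper head (q = 0.9), under-predicting is penalized 9x harder than
# over-predicting, which pushes that head above the true value.
print(round(pinball_loss(y, 90.0, q=0.9), 3))   # under by 10 cycles
print(round(pinball_loss(y, 110.0, q=0.9), 3))  # over by 10 cycles
```

Training three heads at q = 0.1, 0.5, and 0.9 with this loss is what makes the low, median, and high outputs land at their respective quantiles.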

08 · Multiple views of quality

Evaluation metrics

RMSE, NASA score, coverage, hit rates, and weighted errors capture accuracy and risk together.

  • RMSE - overall point accuracy
  • NASA score - penalizes dangerously late predictions

Key idea

No single number tells the whole story: accuracy, risk, and reliability are tracked separately.
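The two headline metrics can be sketched as follows. The NASA scoring function shown is the standard PHM08 form, with error d = predicted minus true RUL; late predictions (d > 0) are penalized on a steeper exponential than early ones.

```python
import math

# PHM08-style asymmetric penalty plus RMSE for point accuracy.

def nasa_score(d):
    """Per-engine penalty for error d = predicted - true (in cycles)."""
    return math.exp(-d / 13.0) - 1.0 if d < 0 else math.exp(d / 10.0) - 1.0

def rmse(errors):
    return math.sqrt(sum(e * e for e in errors) / len(errors))

# The same 10-cycle miss is more expensive when it is late:
print(round(nasa_score(-10), 2))  # 10 cycles early
print(round(nasa_score(10), 2))   # 10 cycles late
```

Summing `nasa_score` over all test engines gives the fleet-level NASA score reported above; RMSE treats early and late errors symmetrically, which is why both are tracked.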

Key results

Evaluation is precomputed on the FD001 test engines (last window per engine) and shows a strong global fit with interpretable interval behavior.

Interpretability

Global SHAP importance is computed as mean absolute SHAP values aggregated across samples and time steps, evaluated on the last window per test engine (100 engines), using 96 background samples.

Global SHAP

Mean absolute SHAP values aggregated over time - top features emphasize trends and rolling characteristics rather than raw sensor magnitudes alone.
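The aggregation described above, mean absolute SHAP value per feature over samples and time steps, reduces to a simple reduction. The tiny nested array below is fabricated for illustration; real values would come from a SHAP explainer run on the trained model.

```python
# Global importance = mean |SHAP| per feature, averaged over every
# (sample, time step) pair.

def global_importance(shap_values):
    """shap_values: [sample][time_step][feature] -> mean |SHAP| per feature."""
    n_features = len(shap_values[0][0])
    totals = [0.0] * n_features
    count = 0
    for sample in shap_values:
        for step in sample:
            for f, v in enumerate(step):
                totals[f] += abs(v)
            count += 1
    return [t / count for t in totals]

# Toy input: 2 samples x 2 time steps x 3 features
toy = [
    [[0.1, -0.4, 0.0], [0.2, -0.2, 0.1]],
    [[-0.1, 0.6, 0.0], [0.0, -0.4, 0.1]],
]
print([round(v, 3) for v in global_importance(toy)])  # feature 1 dominates
```

Taking absolute values before averaging is what keeps positive and negative attributions from canceling, so a feature that swings the prediction in both directions still ranks as important.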

Literature comparison

This section presents a concise benchmark comparison against selected published methods. Reported values should be interpreted in the context of each study's evaluation protocol and source-reported settings.

Note: This is not a strictly controlled head-to-head comparison. Preprocessing choices and feature-engineering pipelines vary across studies and this project, so the table should be read as an indicative comparison on NASA score and RMSE, not an exact ranking.

Rank | Model / Source | NASA score | RMSE | Status | Advantage
1 | Attention-LSTM (PHM Society) | 200.00 | 12.33 | SOTA | Best NASA score among listed baselines.
2 | Quantile LSTM + SHAP (current project, FD001) | 274.41 | 12.51 | Proposed | Competitive RMSE with calibrated uncertainty intervals and SHAP interpretability.
3 | CAELSTM (Scientific Reports, Nature) | 282.38 | 14.44 | SOTA | Strong hybrid architecture with robust generalization.
4 | Stacked LSTM (JOETEX) | 311.20 | 15.22 | SOTA | Strong and widely used deep sequential baseline.