Aug. 29, 2017

Machine learning for predictive maintenance: where to start?

BigData Republic

Data Science & Data Engineering

Think about all the machines you use during a year, all of them, from a toaster every morning to an airplane every summer holiday. Now imagine that, from now on, one of them would fail every day. What impact would that have? The truth is that we are surrounded by machines that make our life easier, but we also get more and more dependent on them. Therefore, the quality of a machine is not only based on how useful and efficient it is, but also on how reliable it is. And together with reliability comes maintenance.

When the impact of a failure cannot be afforded, as with a malfunctioning airplane engine, the machine is subjected to preventive maintenance, which involves periodic inspection and repair, often scheduled based on time-in-service. The challenge of proper scheduling grows with the complexity of machines: in a system with many components working together and influencing each other’s lifetime, how can we find the right moment to perform maintenance so that components are not replaced prematurely but the whole system still functions reliably? Answering this question is the aim of predictive maintenance, where we seek to build models that quantify the risk of failure for a machine at any moment in time and use this information to improve the scheduling of maintenance.

The success of predictive maintenance models depends on three main components: having the right data available, framing the problem appropriately and evaluating the predictions properly.

In this post, we will elaborate on the first two points and give insights on how to choose the modelling technique that best fits the question you are trying to answer and the data you have at hand.

Data collection

To build a failure model, we require enough historical data to capture information about the events leading to failure. In addition, general “static” features of the system can also provide valuable information, such as mechanical properties, average usage and operating conditions. However, more data is not always better. When collecting data to support a failure model, it is important to make an inventory of the following:

What are the types of failure that can occur? Which ones will we try to predict?
What does the “failure process” look like? Is it a slow degradation process or an acute one?
Which parts of the machine/system could be related to each type of failure? What can be measured about each of them that reflects its state? How often and with what accuracy do these measurements need to be performed?

The life span of machines is usually in the order of years, which means that data has to be collected for an extended period of time in order to observe the system throughout its degradation process.

In an ideal scenario both data scientists and domain experts would be involved in the data collection plan to ensure that the data gathered is suitable for the model to be built. However, what mostly happens in real life is that the data has already been collected before the data scientist arrives and he/she must try to make the best of what is available.

Depending on the characteristics of the system and on the data available, a proper framing of the model to be built is essential: which question do we want the model to answer and is it possible with the data we have at hand?

Problem framing

When thinking about how to frame a predictive maintenance model, it is important to keep a couple of questions in mind:

What kind of output should the model give?
Is enough historical data available or just static data?
Is every recorded event labelled, i.e. which measurements correspond to good functioning and which ones correspond to failure? Or at least, is it known when each machine failed (if at all)?
When labelled events are available, what is the proportion of events of each failure type relative to events of normal operation?
How long in advance should the model be able to indicate that a failure will occur?
What are the performance targets that the model should be optimized for? High precision, high sensitivity/recall, high accuracy? What is the consequence of not predicting a failure or predicting a failure that will not happen?
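
The choice of performance target in the last question matters most when failures are rare. The sketch below is a minimal illustration in plain Python, using made-up numbers (100 observations, 5 of them failures) and a hypothetical "model" that never predicts a failure. The function names are our own and not part of any library.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives, with 1 = failure, 0 = no failure."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


def precision_recall_accuracy(y_true, y_pred):
    """Return (precision, recall, accuracy); undefined ratios are reported as 0.0."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    accuracy = (tp + tn) / len(y_true)
    return precision, recall, accuracy


# Illustrative data only: 5 failures among 100 observations.
y_true = [1] * 5 + [0] * 95
# A hypothetical model that always predicts "no failure".
always_ok = [0] * 100

print(precision_recall_accuracy(y_true, always_ok))
```

With these numbers the accuracy is 0.95 even though not a single failure is caught (recall is 0.0), which is why accuracy alone is rarely a suitable target for imbalanced failure data.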

With all this information at hand, we can now decide which modelling strategy best fits the available data and the desired output, or at least which one is the best candidate to start with. There are multiple modelling strategies for predictive maintenance, and we will describe four of them in relation to the question they aim to answer and the kind of data they require:

Regression models to predict remaining useful lifetime (RUL)
Classification models to predict failure within a given time window
Flagging anomalous behaviour
Survival models for the prediction of failure probability over time

STRATEGY 1: Regression models to predict remaining useful lifetime (RUL)

OUTPUT: How many days/cycles are left before the system fails?

DATA CHARACTERISTICS: Static and historical data are available, and every event is labelled. Several events of each type of failure are present in the dataset.

BASIC ASSUMPTIONS/REQUIREMENTS:

Based on static characteristics of the system and on how it behaves now, the remaining useful time can be predicted, which implies that both static and historical data are required and that the degradation process is smooth.
Just one type of “path to failure” is being modelled: if many types of failure are possible and the system’s behaviour preceding each one of them differs, one dedicated model should be made for each of them.
Labelled data is available and measurements were taken at different moments during the system’s lifetime.
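
To train such a regression model, each historical measurement first needs a target value. A minimal sketch of how that target could be derived is shown below, assuming every machine in the dataset was observed until it failed and that measurements are indexed by an integer cycle count. The record layout and the function name are illustrative assumptions, not a prescribed format.

```python
from collections import defaultdict


def remaining_useful_life(records):
    """Compute the RUL target for each (machine_id, cycle) record.

    Assumes each machine was run to failure, so its last recorded
    cycle is treated as the moment of failure.
    """
    last_cycle = defaultdict(int)
    for machine_id, cycle in records:
        last_cycle[machine_id] = max(last_cycle[machine_id], cycle)
    # RUL = cycles remaining until the machine's final recorded cycle.
    return [last_cycle[machine_id] - cycle for machine_id, cycle in records]


# Illustrative data: machine "A" fails at cycle 3, machine "B" at cycle 2.
records = [("A", 1), ("A", 2), ("A", 3), ("B", 1), ("B", 2)]
print(remaining_useful_life(records))  # [2, 1, 0, 1, 0]
```

Any regression algorithm can then be fitted on the measured features with this value as the target.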

STRATEGY 2: Classification models to predict failure within a given time window

Creating a model that predicts lifetimes very accurately can be very challenging. In practice, however, one usually does not need an accurate lifetime prediction far into the future. Often the maintenance team only needs to know whether the machine will fail ‘soon’. This leads to the next strategy:

QUESTION: Will a machine fail in the next N days/cycles?

DATA CHARACTERISTICS: Same as for strategy 1

BASIC ASSUMPTIONS/REQUIREMENTS: The assumptions of a classification model are very similar to those of regression models. They differ mainly in the following:

Since we are defining a failure in a time window instead of an exact time, the requirement of smoothness of the degradation process is relaxed.
Classification models can deal with multiple types of failure, as long as they are framed as a multi-class problem, e.g.: class = 0 corresponding to no failure in the next N days, class = 1 for failure type 1 in the next N days, class = 2 for failure type 2 in the next N days and so forth.
Labelled data is available and there are “enough” cases of each type of failure to train and evaluate the model.
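
The multi-class framing described above can be sketched as a simple labelling step. The example assumes that, for each historical observation, we already know the remaining number of cycles until failure and the type of failure that eventually occurred. The function name, the inputs and the window of 15 cycles are illustrative assumptions.

```python
def failure_window_labels(rul_values, failure_types, window):
    """Label each observation for a "failure within `window` cycles" classifier.

    0 means no failure in the next `window` cycles;
    k > 0 means a failure of type k in the next `window` cycles.
    """
    return [
        failure_type if rul <= window else 0
        for rul, failure_type in zip(rul_values, failure_types)
    ]


# Illustrative data: cycles remaining until failure, and the eventual failure type.
rul_values = [40, 25, 10, 0, 12, 3]
failure_types = [1, 1, 1, 1, 2, 2]

labels = failure_window_labels(rul_values, failure_types, window=15)
print(labels)  # [0, 0, 1, 1, 2, 2]
```

Counting how often each label occurs is a quick way to check whether there are “enough” cases of each failure type for the chosen window.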

In general, what regression and classification models are doing is modelling the relationship between features and the degradation path of the system. That means that if the model is applied to a system that will exhibit a different type of failure not present in the training data, the model will fail to predict it.

STRATEGY 3: Flagging anomalous behaviour

Both previous strategies require many examples of both normal behaviour (of which we often have plenty) and failures. However, how many planes will you let crash to collect data? For mission-critical systems, in which acute repairs are difficult, there are often only a few examples of failure, or none at all. In this case, a different strategy is necessary:

QUESTION: Is the behaviour shown normal?

DATA CHARACTERISTICS: Static and historical data are available, but either labels are unknown or too few failure events were observed or there are too many types of failure

BASIC ASSUMPTIONS/REQUIREMENTS: It is possible to define what normal behaviour is and the difference between current and “normal” behaviour is related to degradation leading to failure.
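
As a deliberately simple illustration of this idea, the sketch below defines “normal” for a single sensor as the mean and standard deviation of readings taken during known good operation, and flags readings that deviate by more than a chosen number of standard deviations. The threshold of 3 and the sensor values are assumptions for illustration; real systems usually need multivariate methods such as those in the scikit-learn documentation linked at the end of this post.

```python
from statistics import mean, stdev


def fit_baseline(normal_values):
    """Summarise normal behaviour as (mean, standard deviation)."""
    return mean(normal_values), stdev(normal_values)


def is_anomalous(value, baseline, threshold=3.0):
    """Flag a reading that lies more than `threshold` standard deviations from the mean."""
    mu, sigma = baseline
    return abs(value - mu) > threshold * sigma


# Illustrative readings recorded while the machine was known to work well.
normal_readings = [10.0, 10.2, 9.8, 10.1, 9.9]
baseline = fit_baseline(normal_readings)

print(is_anomalous(10.05, baseline))  # False
print(is_anomalous(12.0, baseline))   # True
```

Note that no failure labels are used anywhere: the model only knows what normal looks like.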

The generality of an anomaly detection model is both its biggest advantage and its biggest pitfall: the model should be able to flag every type of failure, despite not having any previous knowledge about them. Anomalous behaviour, however, does not necessarily lead to failure. And if it does, the model gives no information about the time span in which the failure will occur.

The evaluation of an anomaly detection model is also challenging due to the lack of labelled data. If at least some labelled data of failure events is available, it can and should be used for evaluating the algorithm. When no labelled data is available, the model is usually made available and domain experts provide feedback on the quality of its anomaly flagging ability.

STRATEGY 4: Survival models for the prediction of failure probability over time

The previous three approaches focus on prediction, giving you enough information to apply maintenance before failure. If, however, you are interested in the degradation process itself and the resulting failure probability, this last strategy suits you best.

QUESTION: Given a set of characteristics, how does the risk of failure change in time?

DATA CHARACTERISTICS: Static data available, information on the reported failure time of each machine or recorded date of when a given machine became unobservable for failure.

A survival model estimates the probability of failure for a given type of machine given static features and is also useful to analyse the impact of certain features on lifetime. It therefore provides estimates for a group of machines with similar characteristics. As a consequence, it does not take the current status of a specific machine under investigation into account.
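
One of the simplest survival models is the Kaplan-Meier estimator, which we use here purely as an illustration; it ignores static features and only needs, per machine, the time it was observed and whether that observation ended in failure or in the machine becoming unobservable (censoring). The sketch below is a plain Python version with made-up data. For real work, a library such as scikit-survival (linked below) is the more sensible choice.

```python
def kaplan_meier(durations, observed):
    """Estimate the survival curve as a list of (time, survival probability).

    durations: time each machine was observed.
    observed:  True if the machine failed at that time,
               False if it merely became unobservable (censored).
    """
    events = sorted(zip(durations, observed))
    at_risk = len(events)
    survival = 1.0
    curve = []
    i = 0
    while i < len(events):
        t = events[i][0]
        failures = 0
        removed = 0
        # Group all machines that leave the study at the same time t.
        while i < len(events) and events[i][0] == t:
            failures += 1 if events[i][1] else 0
            removed += 1
            i += 1
        if failures:
            survival *= 1 - failures / at_risk
            curve.append((t, survival))
        # Failed and censored machines are both no longer at risk.
        at_risk -= removed
    return curve


# Illustrative data: four machines, two failures and two censored observations.
durations = [2, 3, 4, 5]
observed = [True, False, True, False]
print(kaplan_meier(durations, observed))  # [(2, 0.75), (4, 0.375)]
```

The censored machines never trigger a drop in the curve, but they still count as “at risk” until they leave the study, which is how the method makes use of machines that have not failed yet.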

Bottom line:

What is the most suitable approach for a predictive maintenance model? As for all other data science problems, there is no free lunch! The advice here is to start by understanding which types of failure you are trying to model, which type of output you would like the model to give and which kind of data is available. Having put all this together with the advice given above, I hope you now know where to start!

Some useful links:

Survival analysis with scikit-survival (built on top of scikit-learn): https://github.com/sebp/scikit-survival
Imbalanced classes: https://svds.com/learning-imbalanced-classes/
Novelty and outlier detection in scikit-learn: http://scikit-learn.org/stable/modules/outlier_detection.html