Predictive Maintenance for Railway Power Rail

Data Processing | Remapping | Classification | PCA | Isolation Forest

Preamble

This project began as a personal project while I was at SMRT, which I then converted into a six-month capstone project for Singapore Management University's Master of Computer Science course. Having worked as a subject matter expert on third rail systems at SMRT for the past year, I noticed that the ingredients were in place for a transition from the current preventive maintenance regime to a condition-based regime. The Linear Variable Differential Transformer (LVDT) mounted on the current collection suspension system gave a reasonably accurate read on the power rail gauge relative to the train. With some data processing to make the data usable, statistical techniques to reduce the effects of inconsistencies in the data, and machine learning coupled with additional data sources to flag anomalous data points, the potential to classify sections of power rail by their degree of sag and the associated risk of their current condition seemed high.

The notebook below documents the EDA process I underwent while trying to find a way to generate value out of the data. Early sections sought to process the data into a usable state. This was followed by quick visualizations that identified problems with the data and their respective solutions. Finally, a system was designed to classify both power rail and power rail ramps into various risk categories, allowing maintenance teams to prioritize sections of power rail based on their condition as opposed to a mindless cyclically scheduled sweep of the network.

Introduction

Notebook Objective

This notebook documents the data engineering required to process 6 months (Jan 2021-Jun 2021) of SMRT LVDT datasets into usable data for the end goal of building machine learning models to:

(1) Classify cases of potentially degrading sections of 3R in the NSL into 4 categories depending on urgency - e.g. SI1 - SI4
(2) Predict the time taken for a section of 3R identified to be problematic to degrade into a specific state - e.g. 160mm 3R Vertical Gauge

Glossary of Terms
Term/Abbreviation: Definition/Meaning
3R: The third rail, a power rail running adjacent to the two running rails on which the train rolls
CCD Shoe: The Current Collector Device shoe, a sheet of carbon attached to the CCD assembly to contact the 3R
Chainage: A referencing system for identifying locations on the track by distance; each unit difference in chainage is 1m
Contact Range: The chainage range where the 3R and CCD shoe are in contact
Floating Range: The chainage range where the 3R and CCD shoe are not in contact and the CCD shoe is 'floating' in the air
LVDT: A linear variable differential transformer sensor that measures the height of the 3R and converts it to an electrical signal
MA: Moving Average
Ramp Contact Range: The chainage range where the CCD shoes are in contact with the ramp
ROC: Rate of Change

Note: Chainage is not necessarily in running order; for example, chainage can jump from 1000 to 2000 with no physical location corresponding to chainages between 1000 and 2000. Additionally, a chainage can refer to two locations at once if not tagged with the sector it belongs to (e.g. Khatib-Yio Chu Kang could have a chainage 500 in its range, and this chainage 500 could also exist in Admiralty-Woodlands).

Overview of Steps
  1. Extract data from all LVDT files
  2. Understand the characteristics of the LVDT data
  3. Isolate the LVDT data into the three categories: 'contact', 'ramp' and 'floating'
  4. Normalize each data set (by bound, left/right, date, train) on a specific value so that they can be compared
  5. Compare data sets across time to get rate of change of measurements over time

Combining the relevant CSV files into a Pickle

This is done so that we do not need to iterate through all the CSV files to concatenate the data in the future.

[Image: Excel Files.PNG]

The cell below can be skipped if the 'lvdtdata.pkl' file is already available for use.
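A minimal sketch of that concatenation cell, assuming the monthly exports sit in a single folder (the folder name and file pattern are placeholders):

```python
import glob
import pandas as pd

csv_paths = sorted(glob.glob("lvdt_data/*.csv"))    # hypothetical folder of LVDT CSV exports
frames = [pd.read_csv(path) for path in csv_paths]  # read each file once
df = pd.concat(frames, ignore_index=True)           # stack into a single DataFrame
df.to_pickle("lvdtdata.pkl")                        # reload later with pd.read_pickle("lvdtdata.pkl")
```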

Some quick improvements identified from browsing the data: (1) the dates are not in date format, (2) the decimal values beyond 3 d.p. are unnecessary.
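Both fixes are one-liners in pandas; the column names here are assumptions standing in for the actual headers:

```python
import pandas as pd

df["Date"] = pd.to_datetime(df["Date"], dayfirst=True)  # parse date strings into datetimes
df["3R_Left_Vert"] = df["3R_Left_Vert"].round(3)        # drop precision beyond 3 d.p.
```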

Find relationships within/between variables EMU Number, Bound, Date

This step is done to understand the general characteristics of the data. pandas_profiling is good for an initial general answer to some of these questions.
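A sketch of the profiling call; `minimal=True` is an optional speed-up that skips the expensive correlation computations on a dataset this size:

```python
from pandas_profiling import ProfileReport

profile = ProfileReport(df, title="LVDT EDA", minimal=True)
profile.to_file("lvdt_profile.html")  # browse the histograms and summaries in a browser
```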

Some quick insights identified: (1) 3R Vertical Gauge looks normally distributed. (2) Based on the Vert_Gauge histograms, >215 looks like the point where the LVDT hovers when not contacting the 3R or ramps. (3) The symmetry in the 180-210 range likely corresponds to the region where the LVDT is measuring the 3R ramp vertical gauge.

[Image: 3R Vertical Gauge Histogram.PNG]

Sensor performance check

Check if the sensors on each train (left and right) behave in the same way
Check if sensors on different trains behave in the same way

First off, the good news is that the distributions for the vertical gauge range corresponding with contact between the CCD shoe and the 3R are generally normally distributed. This is important because it is expected that should the population of maintenance staff all attempt to gauge the 3R to the same value (170mm, the nominal value for 3R vertical gauge), the deviation from the target should indeed follow such a distribution.

Note: There is a slight but noticeable bias towards higher values for the measurements in the 'contacting range' (where the CCD shoe is touching the 3R) - this bias is due either to the sensor or to the actual physical state of the 3R (i.e. staff are intentionally fixing the 3R gauge slightly above 170mm).

Statistically, having a normal distribution opens up the doors to many statistical analysis methods and tools for comparing and manipulating the distributions.

However, the vertical gauge in the 'floating range' (where the CCD shoe is not touching the 3R) seems to be made up of multiple distributions - possibly due to the different resting positions of each sensor and the fluctuations it experiences while the train rumbles along the track. As an example, EMU501's and EMU533's left sensors each seem to have one main rest spot around which the sensor readings fluctuate, hence the obvious single peak in both cases, while EMU501's right sensor has 3 peaks, potentially indicating 3 different resting spots of the CCD shoe.



Note: SMRT's convention of left and right is defined by the direction of increasing chainage, hence on both the NB and SB, left will refer to the same side and thus the same sensor

Hardware Hypotheses: As suspected, the different EMUs have slightly but noticeably different distribution shapes for the measured vertical gauge. This could be due to sensor or installation differences, or a change in the state of the EMU (e.g. wear or replacement of the CCD shoe at some point in time, different voltage regulation, different mechanical resistance of the CCD assembly, or route travelled, e.g. more NB than SB) that resulted in variation in the data collected. It could also be that the calibration of the sensors was done differently.
The ideal case would have been identical distribution shapes shared by all the Left Verts (since they measure the same 3R on both bounds) and another distribution shared by all the Right Verts.
Software Hypotheses: In terms of software, it could also be that the sensor data logging and processing system is artificially re-classifying data into certain values. This may have implications further down the analysis when vertical gauge comparisons across days and between EMUs are made (e.g. one EMU may have a higher mean than another, suggesting an offset is required for fair comparison).

Difference in readings between bounds

Let us test the bound difference hypothesis by further splitting the data by bound and zooming into the 3R gauge

Clearly the bounds have different distributions for the Left 3R and Right 3R (it seems most of the 3R on the NB is on the right and most of the 3R on the SB is on the left). It makes sense that the Left and Right cases have an inverse relationship, given that if there is 3R on the Left, no 3R is required on the Right.

Also noticeable are the different skews of the floating positions of the Left and Right LVDT sensors for each train, and the generally similar skews when the Left or Right sensors are on either bound.

The LVDT 3R reading contact range (150-190) generally has the shape of a Normal Distribution.

For the floating range (200-230), a clear double peak can sometimes be seen regardless of whether it's a majority or minority 3R-on-Left/Right situation.

For the ramp contact range, it can be understood that the range lies between the contact range and the floating range - merging into both distributions due to noise and the fact that, physically, the ramps are a transition from the normal 3R range to the floating range. Two things are certain: the ramp-to-conductor-rail length fraction is much smaller than 1, and the ramp has an almost constant gradient - meaning that we expect a uniform distribution for the measurements in the ramp contact range, since samples are taken at evenly spaced points of different heights along the ramp. It also means that the total area of the conductor-rail region of the histogram is larger than the area of the ramp (transition length) region - the accuracy of which is contingent on the tachometer of the EMU (the LVDT has a constant distance-based sampling rate and does not give more readings/m at the ramp).

Note: The sampling rate of the LVDT is approximately 10 readings/m, meaning that the along-track resolution of the 3R gauge is around 10cm.

Using rate of change of 3R gauge (across distance) to isolate ramps chainage

Here we attempt to use rate of change to find chainages that belong to ramps, under the assumption that the rate of change of 3R gauge across distance is greater at ramps than elsewhere (since ramps are slanted, curved pieces of 3R).
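A sketch of the ROC feature, assuming rows are in recording order and using the row-to-row difference as a proxy for the gradient (column names are assumptions):

```python
df = df.sort_values(["Bound", "Chainage1"]).reset_index(drop=True)
df["ROC"] = df["3R_Left_Vert"].diff().abs()  # absolute gauge change between consecutive readings
```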

It is apparent that there are ROC spikes spaced out across the chainages; these could possibly correspond with ramp chainages.

Visually, 2-3mm/m looks like a good cutoff between the ROC at conductor rail and the ROC at ramps. To check this assumption, we take the quantile that corresponds to the 3mm ROC, which turns out to be the 99.5th percentile - meaning that for every 100m of distance, roughly 0.5m is ramp. A HSR on the mainline is roughly 5m long, so we expect to see one ramp for every 1km of 3R. This is plausible considering we are only analyzing the 3R on one side of the train: when we consider both the left and right sides, the actual ramp frequency on a bound is 1 ramp/500m, giving a total of roughly 400 ramps in the entire network.
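The percentile check itself is a one-liner; the 3mm value and 0.995 quantile are the figures quoted above:

```python
quantile_at_3mm = (df["ROC"] <= 3.0).mean()  # fraction of readings with ROC <= 3mm, ~0.995 here
cutoff = df["ROC"].quantile(0.995)           # ROC value at the 99.5th percentile
ramp_candidates = df[df["ROC"] > cutoff]     # rows provisionally flagged as ramp readings
```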

As a worked example, at 4 ramps per km the cut-off point is 1.5799; this gives 800 ramps in 200km. However, remember that this is based on the assumption that ROC has a threshold that can differentiate 3R, ramps and float.

Additionally, it would be best to find out the actual number of ramps in the line and work backwards to determine which percentile of ROC corresponds to the ROC at ramps (based on the ramp:conductor-rail length ratio).

Note that the NB is primarily made up of floating LVDT readings, while the SB readings are majority 3R conductor rail.

We build on the previous assumption that the 99.5th percentile for the SB is a good cut-off point to determine which ranges of 3R belong to ramps. Below this value, we assume the LVDT is in the floating range or contact range.

Immediately, it is noticeable that most of the vertical gauge pairs with high ROC do correspond with the expected vertical gauge ranges of the ramps. However, there are also borderline cases where the range is within the expected vertical gauge of the contact range. When we plot the ROC against the vertical gauge ranges, the reason becomes clear: the noise values are large enough to create large ROC values that basically merge into the actual ROC values expected at ramps. This means that we need to process the data even further, or find a feature that can distinguish the ROC measurements at the ramp from the ROC measurements caused by noise.

Under the assumption that noise is random, there should not be a sustained range of records with high ROC due to noise. Perhaps a moving average will help check for the sustained rate-of-change characteristic of ramps.

Using moving average of rate of change to isolate ramp chainages

Finally, this definitely looks like it has successfully captured the 3R ramp chainages. The vertical gauge can be seen climbing from a value within the 3R contact range all the way into the floating range before finding the next ramp. If the moving average (MA) of the ROC is a good way to filter out the ramps, it should be possible to find a clear boundary between the MAROC (moving average rate of change) values of noise in the floating range and the MAROC of noise plus minor gauge movements in the contact range. A chart of the MAROC against the vertical gauge will be useful for determining whether we have succeeded - and also for determining the threshold to set the MAROC at.
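A sketch of the MAROC feature; the window size is an assumption to be tuned:

```python
# Isolated noise spikes average out over the window, while a ramp's sustained
# gradient keeps the centered rolling mean of the ROC high.
df["MAROC"] = df["ROC"].rolling(window=15, center=True).mean()
```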

From the 'MAROC v 3R Left Vert' charts and the 'ROC v 3R Left Vert' charts, it is clear that the 3R vertical gauge range from 200-210 belongs strictly to the ramp (no small ROC/MAROC values characteristic of noise or actual minor gauge fluctuations). One option is to conduct a search spreading outwards from values in this range until the ROC/MAROC hits a value that is too low, indicating the ramp range has terminated. Alternatively, we can try increasing the MAROC window to identify sections of 3R that have a long sustained MAROC and simply label the range as a ramp if it contains vertical gauge in the 200-210 range (note that this value may differ from dataset to dataset, but can be easily identified as the first vertical gauge value that has no ROC/MAROC values in the 0-1 range).

Unfortunately, it seems the second option of simply finding a window that can split the three clusters is not feasible, hence it will be necessary to consider the vertical gauge as well. One way could be to take the moving average of the vertical gauge and move it in tandem with the MAROC to get both the average vertical gauge of a given window (expected to be higher than the contact range average and lower than the floating range average) and its minimum ROC.

Note: It is not necessary to match the windows since we are just using these values to characterize the vertical gauge at that point
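One way to build the forward-looking average in pandas is to shift a trailing rolling mean backwards (window size again an assumption):

```python
# Each row receives the average of the 15 readings after it.
df["Vert_MA15"] = df["3R_Left_Vert"].rolling(window=15).mean().shift(-15)
```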

Again, this is definitely bringing us one step closer to truly isolating the measurements on the ramp from the floating and contact range measurements. By taking the moving average with a forward window, readings above the y=x line are measurements heading into a stretch of increasing vertical gauge. The trail of increasingly large vertical gauge measurements that departs from the contact range and enters the floating range from above the y=x line marks instances where the sensor is moving out of a ramp towards the floating range, while the trails that lead from the floating range back into the contact range (clockwise) mark instances of the sensor moving towards a region of lower vertical gauge. Interestingly, perhaps due to the controlled amount of data, it seems almost possible to follow the trails by eye.

Regardless, now that there are several features for us to identify the ramp gauge measurements, it is time to test out different methods of labeling the points and determining what is the best way to label them in the future.

Note that these vertical gauges use a forward window that averages the vertical gauge of the 15 rows after the current point.

This labeling seems to miss some ramp cases, possibly because the MA15 is outside the required range; it might be better to use the absolute difference between the MA15/MMED and the actual measurement to determine if a point belongs to a ramp.

Note that MA15 looks forward by 15 rows to get the average of the sensor readings in those rows
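A sketch of the difference-based flag; the 3mm threshold is a tunable assumption tied to the ramp geometry discussed later:

```python
# On a ramp the forward average departs steadily from the current reading,
# so the absolute difference stays large.
df["is_ramp"] = (df["Vert_MA15"] - df["3R_Left_Vert"]).abs() > 3.0
```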

This is much better; let us take a look at how the labeling looks on other plots.

The data labeling seems to be missing the start and end of the ramp readings. Maybe the MMED15 filter is removing some of the points due to the threshold being too high. Additionally, now that it is clear that the 3R_Left_Vert - Left Vert Gauge MA15 plot is a good way to visualize which points belong to ramps and which do not, we will use it to judge the success of the labeling.

Reducing the threshold does pick up more of the ramp labels, but also appears to mislabel some entries in the contact range. Let us try the MA filter with the same threshold.

The MA filter does a much better job than the MMED filter, possibly because the MMED, when given a skewed window of data, will output a value heavily in favor of the skew instead of assuming a balance like the MA (e.g. 1,1,1,2,2,3,3,6,8,9,10 gives a median of 3 when the mean is >4). Let us see if some noise can be removed by using a centered MA to approximate the measurements.
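The skew example from the text, verified numerically:

```python
import numpy as np

window = [1, 1, 1, 2, 2, 3, 3, 6, 8, 9, 10]
print(np.median(window))  # 3.0  - the median sides with the skewed bulk of the window
print(np.mean(window))    # ~4.18 - the mean is pulled up by the large trailing values
```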

Some mislabeled points were indeed corrected. One last modification which may improve the results is constructing the ROC values from an MA of the actual measurements (to reduce the fluctuations in ROC due to noise).

In total, on the NSL, there are 2 bounds, 27 stations and 381 ramps at 5.1m each, which approximates to around 1943m worth of ramps. Although the SB is the majority-3R side, a similar run was conducted on the NB Left and the length of ramps found was similar, implying that the total distance of ramps is approximately the same on the majority and minority sides.

This distance is roughly 390m each for the left and right sides; multiplying by 4 to account for the left and right sides on both bounds, we arrive at an approximate total of 1580m worth of ramps. This means that as we make the final tuning for identifying ramp measurements, we can afford to relax the thresholds a little to admit measurements on the borderline of being classified as ramp measurements.

Note: For sides with less 3R, the stretches of 3R are shorter, meaning that proportionally, the ratio of ramps to 3R is much higher.

[Image: HSR DRAWING.JPG]

Ramp Documentation: Each ramp is roughly 5m long with a slope of 1:50 (a gain of 1mm every 50mm)

Since each reading is approximately 10cm = 100mm apart, the gain between consecutive readings on the ramp is expected to be 2mm. With a forward window of size 3, at the reading right before entering the ramp range the window of gains will be [0,2,2] (inclusive of the reading right before the ramp); the median of the window will be 2 and the mean will be 4/3.

If our ROC is defined as two rows ahead in this exercise, the expected gain between non-ramp and ramp values is 4mm; hence we expect that at the ramps, the ROC will be approximately 4.
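The arithmetic, spelled out:

```python
slope = 1 / 50                      # 1mm rise per 50mm run
spacing_mm = 100                    # ~10 readings/m -> 100mm between readings
gain_per_row = slope * spacing_mm   # 2mm gained per row while on the ramp
expected_roc = gain_per_row * 2     # ROC defined 2 rows ahead -> 4mm, as stated above
```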

Summary of using average rate of change to isolate ramp measurements:

  1. Use centered MA to approximate point measurements (with a centered window of 3-5)
  2. Use the approximated point measurements to generate the ROC with a step of 2-3
  3. Generate the forward window MA for the approximated points (use a larger window of 10-15)
  4. Generate the MAROC for that point using a MA over the ROC (use a centered window of around 10-20)
  5. Label the points using the features and thresholds above

Additional Ideas: It may be possible to set a looser threshold for the filters experimented with above and then add an additional filter: a centered MA with a large window of around 15, keeping only points where this large centered MA is above a certain gauge measurement (e.g. 165mm). This may be able to remove the noise observed in the contact range cluster.

Implementing the ramp, floating and contact measurement labeling pipeline

First we need to finalize the filter thresholds and methods once and for all. We take one of the more ramp-biased settings and attempt the minimum centered MA measurement threshold filter.

Seems a little too aggressive, so tone it down a little.

Get the statistical equivalent of 165 and 225.

Close enough. With this, the thresholds can adjust dynamically according to the data (except for the 3mm difference, but that value is tied to the geometry of the ramp, not the data). We will apply this feature engineering to the entire data set.

Final Ramp Labeling Pipeline

  1. Use centered MA to approximate point measurements (with a centered window of 3-5)
  2. Use the approximated point measurements to generate the ROC with a step of 2-3
  3. Generate the forward window MA for the approximated points (use a larger window of 10-15)
  4. Generate the MAROC for that point using a MA over the ROC (use a centered window of around 10-20)
  5. Generate large centered MA around point to determine if point lies in definite contact or floating range (use centered window of around 15-25)
  6. Label the points using the features and thresholds above (a condensed sketch follows below)
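A condensed sketch of the pipeline under the window sizes suggested above; every threshold and column name here is an assumption to be tuned per dataset, not the production values:

```python
import pandas as pd

def label_ramps(df: pd.DataFrame, gauge_col: str = "3R_Left_Vert") -> pd.DataFrame:
    out = df.copy()
    # 1. Centered MA to approximate point measurements
    out["smooth"] = out[gauge_col].rolling(5, center=True).mean()
    # 2. ROC from the smoothed points, with a step of 2 rows
    out["roc"] = out["smooth"].diff(2)
    # 3. Forward-window MA of the smoothed points
    out["fwd_ma"] = out["smooth"].rolling(15).mean().shift(-15)
    # 4. MAROC: centered MA over the ROC
    out["maroc"] = out["roc"].rolling(15, center=True).mean()
    # 5. Large centered MA to flag definite contact/floating stretches
    out["wide_ma"] = out[gauge_col].rolling(21, center=True).mean()
    # 6. Label: default contact, floating above the hover point (~215mm),
    #    ramp where the gradient is sustained and the forward trend departs
    out["label"] = "contact"
    out.loc[out["wide_ma"] > 215, "label"] = "floating"
    is_ramp = (out["maroc"].abs() > 1.5) & ((out["fwd_ma"] - out["smooth"]).abs() > 3)
    out.loc[is_ramp, "label"] = "ramp"
    return out
```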

Any remaining corrections will be done while traversing the data down the chainage (e.g. a chain of ramp labels with a stray contact label in between means the contact label is wrong; an entire ramp chain below the ramp mean means the ramp labels are wrong and it is actually a contact range with bad gauge).

The next step will be to analyze each of these distributions in isolation and normalize them against a benchmark so that the different distributions can be better compared against one another.

Normalizing distributions of different LVDTs/batches of data

Preparation for cross comparisons of data across different sensors/batches of data

First, we apply the labeling for the entire dataset.

Label the entire dataset (and save as .pkl)

Note that due to the chainage jump, there are duplicate chainages (exactly 2 instances) at specific ranges. This will be fixed by breaking the lengths into a 'before jump' and an 'after jump' section before comparing vertical gauges at specific chainages and labels between different dates.
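A sketch of the split, assuming the rows are in recording order so the jump shows up as a large backwards step in 'Chainage1' (the jump size here is a placeholder):

```python
# Each big backwards jump starts a new segment; chainages are then unique
# within a (segment, bound) pair.
df["segment"] = df["Chainage1"].diff().lt(-1000).cumsum()
```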

Also note that there is trash at around Chainage1 == 0.

[Image: chainage jump point.jpg]

[Image: chainage jump sector.jpeg]

Plotting 'Chainage2' shows that there are chainage marking problems with the 'Chainage2' location marker, hence we stick to 'Chainage1'.

Checking the labeled data reveals that some chainages are covered more than others (chainage duplicates? or some locations measured more often than others?).

Everything seems to be in working order, and the labeling algorithm captures the different distributions well.

Get global averages to reverse the normalization of the individual subdfs

It is actually not critical that the values really represent the exact statistical parameters of the data since we are just using these values to generate a normal distribution for the contact range measurements and a uniform-ish distribution for the ramp range.

Based on the above, there are now approximate values we can use to reverse the normalization, so we may proceed to normalize each dataset collected by each operation.

The statistical assumption here is that:

  1. The samples come from the same population (ensured by conducting the normalization at the sensor level)
  2. Comparisons being made across different populations will only be compared against populations in their own category (e.g. normalized ramp data against normalized ramp data)

With the labeled data, we are able to fulfill the requirements above.
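A sketch of the per-batch normalization; the grouping keys and column names are assumptions. Each (train, sensor, date) batch is z-scored against its own contact-range statistics, then rescaled to the global contact-range mean and standard deviation so that all batches share one scale:

```python
group_keys = ["EMU", "Side", "Date"]                   # hypothetical batch identifiers
contact = df[df["label"] == "contact"]
stats = (contact.groupby(group_keys)["3R_Left_Vert"]
         .agg(batch_mean="mean", batch_std="std").reset_index())
df = df.merge(stats, on=group_keys, how="left")
g_mean = contact["3R_Left_Vert"].mean()                # global parameters used to
g_std = contact["3R_Left_Vert"].std()                  # reverse the normalization
df["vert_norm"] = (df["3R_Left_Vert"] - df["batch_mean"]) / df["batch_std"] * g_std + g_mean
```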

Ramp data is roughly uniform, as expected.

Left and right data follow the normal distribution beautifully (with a slightly longer right tail). With the processed data coming out as expected, we can proceed to the final step, which is comparing the data between dates.

Alternative method of standardizing data (translation of peak)

This assumes that the differences in distribution peaks are due to a fixed displacement of each sensor arising from differences in the set-ups - and the method corrects for this difference in displacement.

Standardizing the distributions about a fixed value is as simple as translating the entire dataset by (170 - mean).
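A sketch of the translation, again with assumed keys and column names:

```python
# Compute each sensor's contact-range mean, then shift its whole distribution
# so that mean lands on the 170mm nominal gauge.
contact = df[df["label"] == "contact"]
means = (contact.groupby(["EMU", "Side"])["3R_Left_Vert"]
         .mean().rename("sensor_mean").reset_index())
df = df.merge(means, on=["EMU", "Side"], how="left")
df["vert_translated"] = df["3R_Left_Vert"] + (170 - df["sensor_mean"])
```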

The resulting distribution is almost identical to the method that converts each distribution to a standard normal, and it retains the original standard deviation of all right-sensor readings (4.0915) and left-sensor readings (4.211). In the first method, the data from each sensor is standardized before being combined - essentially going through extra steps of centering the mean to 0 and the standard deviation to 1, then converting the standardized values back into equivalent actual values. From a procedural efficiency point of view, translation clearly surpasses standardizing each distribution and converting it back to its non-standardized form after recombining the distributions. From the perspective of the resultant distribution, the normal distribution parameters are, as expected, similar. A key point to note is that translation does not standardize the standard deviation of each distribution; the implications of this would be significant if the distributions had large, dissimilar standard deviations (potentially warping the bell-shaped normal distribution), but in this instance the standard deviations of the sensors are roughly similar (3.8-4.3), allowing the translation procedure to work well.