Comparison of Agent Deployment Strategies for Collaborative Prognosis

Maharshi Dhada, Institute for Manufacturing, Department of Engineering, University of Cambridge, Cambridge, U.K. CB3 0FS. Email: mhd37@cam.ac.uk
Marco Pérez Hernández, Institute for Manufacturing, Department of Engineering, University of Cambridge, Cambridge, U.K. CB3 0FS
Adrià Salvador Palau, Department of Physics and Technology, University of Bergen, Allégaten 55, 5007 Bergen, Norway
Ajith Kumar Parlikad, Institute for Manufacturing, Department of Engineering, University of Cambridge, Cambridge, U.K. CB3 0FS

Abstract: Collaborative prognosis is a technique that enables the industrial assets to learn from similar other assets in a fleet, and improve their data-driven prognosis models. When collaborative prognosis is implemented in a computationally distributed framework, each asset is monitored by its corresponding Digital Twin agent. Distributed collaborative prognosis is particularly beneficial for high value assets where the communication and the processing costs are negligible compared to the maintenance costs. This paper analyses the effects of Digital Twin deployment strategies on the effectiveness of predictive maintenance activities relying on distributed collaborative prognosis. Distributed and heterarchical multi-agent system architectures are analysed for large fleets of assets, with varying failure rates and noise levels in the failure data. The results show that no single architecture or deployment strategy can be deemed best across all failure rates and noise levels. The conclusion derived in this paper provides guidance to the asset owners to choose the most suitable combination for a given application.

Index Terms: Collaborative Learning, Digital Twins, Prognosis, Operations Research

I. INTRODUCTION

Predictive maintenance is characterised by real-time maintenance scheduling when the corresponding asset failure hazards rise over a certain threshold.
It minimises unwanted preventive maintenance for healthy assets, and also expensive corrective maintenance by avoiding asset failures. As a result, predictive maintenance policies have proven to be more cost-effective for service providers than conventional fixed maintenance plans. Predictive maintenance policies primarily rely on data-driven prognosis, where statistical models are used to learn from historical failure data and predict the oncoming asset failures [1].

Nevertheless, industrial processes are non-ergodic systems, in the sense that the average behaviour of independent assets in the fleet is not representative of any single asset. As a consequence of this non-ergodicity, an asset's failure behaviour can be best described by a model trained using its own failure data. But an individual asset often does not experience sufficient failures for its statistical model to learn the failure behaviour. In these circumstances, collaborative prognosis is a technique that enables the assets with insufficient failures to identify other similar assets in the fleet and learn from their data. State-of-the-art collaborative prognosis involves clustering similar assets, followed by exchanging the failure data within these clusters [1], [2].

Footnote: This research was supported by the Next Generation Converged Digital Infrastructure project (EP/R004935/1) funded by the Engineering and Physical Sciences Research Council and BT.

The literature presents multiple instances where researchers have proposed collaborative learning for industrial asset operations. [3] showed that in a system comprising multiple assets and historical failures, the prediction for a given asset is improved by identifying similar historical behaviours from a library of past failure data, and evaluating the best fit for the current failure's degradation curve.
[4] used a genetic algorithm to identify clusters of the most similar historical failure trajectories, which in turn improved the prediction accuracy of the models corresponding to each of the identified clusters. Example implementations were shown for fatigue crack growth, drilling bit degradation, and turnout system degradation. [5] relied on collaborative learning to tackle the lack of sensing resources for the overall cohort of units, for the cases of both medical patients and industrial assets. Markov models and selective sensing were used to address the problem of incomplete data for individual units. The reader may refer to [1] for further insight into collaborative prognosis.

Collaborative prognosis can be computationally distributed across the asset fleet using industrial multi-agent system (MAS) frameworks, where each asset is controlled by its corresponding Digital Twin agent. The distributed nature makes collaborative prognosis more adaptable, scalable, resilient, flexible, and lean compared to a centralised application [6], [7]. Operational costs of MAS architectures with varying extents of computational distribution were compared by [8] to identify the suitable architecture for a given asset value. [8] concluded that distributed collaborative prognosis is particularly beneficial for a fleet of high value assets. This is because the resulting increase in communication and computation costs, due to data exchange and model training, is more than compensated by the reduction in corrective and preventive maintenance activities.

However, apart from the underlying MAS architecture, asset failure characteristics also affect the overall costs of the operations. For example, the severity of an asset failure determines the cost of its preventive or corrective maintenance. In most applications, an asset has multiple failure modes, each with its own corresponding downtime [9].
This is shown in [10], where the maintenance cost is evaluated as a function of time to repair. Failure frequency is another characteristic that determines the total maintenance cost. [11], [12] highlight the use of the frequency of failure events and downtime as key criteria for maintenance policy decisions.

Similar to the failure characteristics, several strategies for deploying the assets' Digital Twins also exist. Cloud, Fog and Edge computing infrastructures are the main strategies for deploying distributed computing systems [13]. The Cloud strategy provides almost unlimited computing and storage resources, away from the source where data is generated [14], whereas the Fog and Edge computing strategies push the computations to the end points of the network, which in our case are the assets. The distinction between Fog and Edge is not always clear; however, both strategies aim to bring computation closer to the data generation, with benefits such as reducing communication latency [15].

To the best of the authors' knowledge, the literature does not present analyses of the implications of the asset failure characteristics and the Digital Twin deployment strategies on predictive maintenance activities relying on collaborative prognosis. This paper analyses the implications of asset failure characteristics and Digital Twin deployment strategies on a predictive maintenance plan relying on distributed collaborative prognosis. Simulated asset fleets were used to study the aforementioned effects, which allowed analysing the effect of increasing asset failure frequencies. The logic used for the simulations enables industries to gauge the implications of the agent deployment strategies suitable for their systems.

Section II describes the simulation parameters used for the experiments, in line with the standard predictive maintenance policies and collaborative prognosis technique.
Descriptions of the experiments conducted and the results obtained are presented in Sections III and IV respectively. The results are discussed in Section V. Finally, Section VI summarises the key conclusions and the potential future research directions.

II. SIMULATION DESCRIPTION

The open source MAS simulation software NetLogo was used to model the temporal evolution of a fleet of 500 high value assets, according to the parameters described in this section. The prognosis and clustering algorithms were run at the corresponding agents using the Python extension for NetLogo. The dynamic simulation model used for the experiments is detailed in this section.

A. Simulation Setup

It must be noted that the simulations were performed for the cases where distributed collaborative prognosis has been deemed beneficial, i.e. for fleets comprising high value assets. The MAS architectures analysed in the experiments were the heterarchical and distributed MAS architectures, presented in Fig. 1 and Fig. 2. Characteristically, the presence of a central agent overseeing the asset fleet network differentiates the heterarchical architecture from the distributed one. The following agent types comprise the distributed and heterarchical MAS architectures:
• Digital Twins, which are the components that run the predictive and maintenance planning algorithms for their corresponding assets.
• Social Platform, which, wherever present, is responsible for enabling communications across the system.
Further information about these architectures can be found in [8].

An asset's health was represented using a synthetic health index, assumed to be evaluated using various operational parameters as in [16], [17]. White noise was added to this health index to represent noisy data, based on which the Digital Twins would learn the corresponding failure behaviours/function parameters.
Asset health indicators degraded according to inverse exponential functions, described as:

$$HI_i(t_{l_i}) = a_i \left(1 - e^{-b_i (t_{f_i} - t_{l_i})}\right) + \varepsilon_{0,\sigma}, \tag{1}$$

where $t_{l_i}$ is the local time of the $i$th asset, i.e. the time since the last corrective repair or installation. $(a_i, b_i, t_{f_i})$ are the parameters that determine the shape of the Health Indicator function. $b_i$ is the curvature parameter, and $a_i$ determines the expected value of $HI_i$ at $t_{l_i} = 0$. $t_{f_i}$ is the average (or expected) time of failure. $\varepsilon_{0,\sigma}$ is a white noise term with standard deviation $\sigma$ and zero mean, conforming to a Gaussian distribution.

It must be noted that the true health of an asset was a monotonically decreasing function, represented only by the first term in (1). The noise term was added only to the data analysed by the Digital Twin, to represent the noisy sensors. Assets failed when the true $HI_i \leq 0$.

Asset failures corresponded to one of three possible failure modes, representing mild, moderate, and high severity respectively. Whenever an asset failed, a weighted random choice with probabilities 0.6, 0.3, and 0.1 for mild, moderate, and high severity failures respectively was used to determine the failure mode. The failure modes were also associated with their own corrective maintenance downtimes, being 2, 4, and 6 time-steps respectively. This brought the simulated fleet closer to reality, where an asset experiences diverse failures during its operations, each one with its own individual time to repair (TTR) [10]. The preventive maintenance activities, on the other hand, were associated with a downtime of only one time-step, as detailed in Section II-B.

The goal for a given Digital Twin was to estimate the function parameters $(a_i, b_i, t_{f_i})$, and predict the impending failure of its corresponding asset. Collaborative prognosis was used for estimating the model parameters across the fleet, and a predictive maintenance policy was used to prevent the asset failures, details of which are presented later in Section II-B.
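As a rough illustration of the degradation model in (1) and of the failure mode sampling described above, the following Python sketch may be useful. It is not the NetLogo/Python code used for the experiments. The function names (`true_health`, `observed_health`, `sample_failure_mode`) and the parameter values in the usage section are assumptions made purely for illustration.

```python
import math
import random

# Failure modes from Section II-A: (name, probability, corrective downtime in time-steps).
FAILURE_MODES = [("mild", 0.6, 2), ("moderate", 0.3, 4), ("severe", 0.1, 6)]


def true_health(t_local, a, b, t_fail):
    """Noise-free first term of (1): a * (1 - exp(-b * (t_fail - t_local)))."""
    return a * (1.0 - math.exp(-b * (t_fail - t_local)))


def observed_health(t_local, a, b, t_fail, sigma, rng):
    """Health index as seen by the Digital Twin: true health plus Gaussian noise."""
    return true_health(t_local, a, b, t_fail) + rng.gauss(0.0, sigma)


def sample_failure_mode(rng):
    """Weighted random choice of the failure mode; returns (name, downtime)."""
    names = [mode[0] for mode in FAILURE_MODES]
    weights = [mode[1] for mode in FAILURE_MODES]
    name = rng.choices(names, weights=weights, k=1)[0]
    downtime = next(mode[2] for mode in FAILURE_MODES if mode[0] == name)
    return name, downtime


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so that a run is repeatable
    a, b, t_fail, sigma = 1.0, 0.05, 50, 0.1  # hypothetical parameter values
    for t in (0, 25, 49, 50):
        print(t, true_health(t, a, b, t_fail), observed_health(t, a, b, t_fail, sigma, rng))
    print(sample_failure_mode(rng))
```

In this sketch the asset fails when `true_health` reaches zero, which happens exactly at `t_local == t_fail`, while the Digital Twin only ever sees `observed_health`.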
Apart from the failures in the assets, the Digital Twins were also simulated to fail. Digital Twin failures, however, were relatively straightforward, as these were software components that could fail suddenly without any noticeable degradation. Digital Twin failures followed a Bernoulli distribution at each time-step, with a failure probability of 0.1. The implication of a Digital Twin failure was its inability to communicate with other agents as well as to make any decisions.

B. Collaborative Prognosis and Maintenance Policy

Collaborative prognosis was implemented by clustering the assets, and subsequently sharing health index data within the clusters. The k-means clustering algorithm was used for clustering the assets; it was run by the Social Platform in the heterarchical architecture and by the Digital Twins in the distributed architecture. The health indices observed over the most recent 200 time-steps of the assets were used to cluster the assets. More information about the distributed k-means clustering implementation can be found in [8], [18]. The function parameters were subsequently estimated based on the cluster's data corpus using a non-linear least squares fitting algorithm.

The assets were preventively repaired when their time since installation or last repair surpassed the predicted time of failure multiplied by a factor $\eta$:

$$t_{l_i} > \eta \, t^{e}_{f_i}, \quad \eta < 1,$$

and corrective maintenance was conducted upon asset failure. Asset failures were associated with downtimes according to their severity. For the experiments discussed here, $t^{e}_{f_i}$ was the estimated time of failure of asset $i$ in the fleet, and $\eta$ was set to a fixed value of 0.7. The preventive maintenances were associated with a downtime of one time-step.

C. Deployment strategies

Two Digital Twin deployment strategies were analysed. These strategies can be thought of as cloud or edge deployment, and are illustrated in Figures 1 and 2.
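Before the two strategies are described individually, the maintenance rule of Section II-B and the Bernoulli failure of a Digital Twin can be sketched in Python as below. This is a minimal sketch, not the code used for the experiments. The names `twin_is_down` and `maintenance_action` are hypothetical, and the choice that a failed Digital Twin cannot trigger a preventive repair is our reading of the statement that a failed twin cannot make decisions.

```python
import random

ETA = 0.7              # factor eta from Section II-B
PM_DOWNTIME = 1        # preventive maintenance downtime, in time-steps
P_TWIN_FAILURE = 0.1   # per-time-step Digital Twin failure probability


def twin_is_down(rng, p_fail=P_TWIN_FAILURE):
    """Bernoulli draw: True if the Digital Twin fails in this time-step."""
    return rng.random() < p_fail


def maintenance_action(t_local, t_fail_estimated, true_health_index,
                       twin_down=False, eta=ETA):
    """Return 'corrective', 'preventive' or 'none' for one asset at one time-step."""
    if true_health_index <= 0:
        # The asset has already failed, so only a corrective repair is possible.
        return "corrective"
    if not twin_down and t_local > eta * t_fail_estimated:
        # Assumption: only a working Digital Twin can schedule a preventive repair.
        return "preventive"
    return "none"


if __name__ == "__main__":
    rng = random.Random(0)
    print(maintenance_action(36, 50, 0.4, twin_down=twin_is_down(rng)))
```

With an estimated time of failure of 50 time-steps and $\eta = 0.7$, the preventive threshold lies at 35 time-steps of local time, so an overestimated time of failure delays the preventive repair and can let the asset run to a corrective one.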
Their behaviours and corresponding deployments are explained as follows:
• Cloud Deployment: This strategy is depicted in Fig. 1, where the Digital Twins of the corresponding architectures are deployed away from the assets they monitor. In this strategy, a failure in the asset would not affect the capabilities of the corresponding Digital Twin. While simulating the cloud deployment strategy, the agent failure and the asset failure were simulated independently.
• Edge Deployment: In this strategy it was assumed that the assets possess sufficient computing infrastructure for the agents to be deployed directly on board. Such assets could have built-in computing infrastructure or have it retrofitted [19]. This strategy is depicted in Fig. 2, where the Digital Twins are deployed along with the assets that they monitor. Any asset failure in the case of edge deployment would propagate to the corresponding Digital Twin, causing it to cease its operation and hence the prognostics system capabilities for that asset. For the simulations, it was assumed that any asset failure would cause its corresponding Digital Twin to fail. Since the Digital Twin had failed, the operators would not know what caused the failure, and would resort to the conservative estimate of a severe failure. Therefore, only severe asset failures were simulated for this strategy.

Similar assets in Figures 1 and 2 are shown with the same symbols, and are connected to one another. However, when the assets have failed (marked with a cross), the figures show that in the case of cloud deployment the agents are still working and the network connection is not severed. The opposite is the case for edge deployment, where asset failure implies agent failure as well, because the Digital Twin agent is deployed on the asset.

III. EXPERIMENTS

The parameter values for the various experiment cases are described in this section. The simulated asset fleet was divided into four asset clusters.
The clusters were each characterised by a corresponding set of $(a_i, b_i, t_{f_i})$ parameters. The true cluster of any given asset was not known at the start of the simulation. Parameter $t_{f_i}$ represents the reference time to failure for the simulated fleet ($t_{ref}$). Assets with a higher reference time to failure would experience fewer failures (lower failure frequency), and vice versa. $t_{f_i}$ was varied across the experiment cases, with values in the ranges of 50, 100, 150, and 200.

Apart from the $(a_i, b_i, t_{f_i})$ parameters, the asset health degradation was also governed by the Gaussian noise term $\varepsilon_{0,\sigma}$ as shown in (1). Since the first term in (1) was normalised to the $a_i$ value, the Gaussian noise was characterised by mean 0 and a standard deviation $\sigma \in (0, 1)$ that was varied across the experiment cases over the values 0.05, 0.1, 0.15, and 0.2.

Figure 3 shows examples of the degradation functions describing a simulated asset fleet across the range of noise variations, and for the two extreme asset failure frequencies (i.e. $t_{f_i} = 50$ and $200$). The functions are shown for the case where the four asset clusters are represented by the different ranges of $t_{f_i}$ mentioned in the sub-captions. The noisy input used by the Digital Twins to estimate the true parameters is coloured red, whereas the dashed lines are the true health degradations. Digital Twins in the simulations discussed in this paper use the non-linear least squares fitting algorithm for estimating the $(a_i, b_i, t_{f_i})$ parameters. The noisy function is illustrated for a single cluster and across the range of noise values. Moreover, the shaded green region in Figure 3 shows the threshold for preventive maintenance, given the true parameter values. This time is shown for the function plot in black.

The simulations were run until either the total number of failures exceeded 5000 or the total simulation time was greater than $10\,t_{f_i}$ for the given simulation case.
The simulation time was chosen as the expected number of failures per asset multiplied by its reference time to failure. Table I summarises the asset downtimes and the weights associated with each of the failure modes, along with the other governing parameter values discussed so far. Using these values, heterarchical and distributed architectures were simulated for the two agent deployment strategies. Each experiment case was simulated ten times with different random seeds.

Fig. 1: Cloud Deployment Strategy. (a) Heterarchical architecture; (b) Distributed architecture.

Fig. 2: Edge Deployment Strategy. (a) Heterarchical architecture; (b) Distributed architecture.

TABLE I: Simulation parameters

Parameter                                              Value
Number of assets                                       500
Probability of agent failure                           0.1
Runs per architecture                                  10
Noise standard deviations ($\varepsilon_{0,\sigma}$)   0.05, 0.10, 0.15, 0.20
Reference times to failure ($t_{f_i}$)                 50, 100, 150, 200
Stopping condition: total failures, or                 5000
                    simulation time                    $t_{f_i} \times 10$
Corrective maintenance downtimes (and probabilities):
  Severe                                               6 time-steps (0.1)
  Moderate                                             4 time-steps (0.3)
  Mild                                                 2 time-steps (0.6)

IV. RESULTS

Results of the experimental cases are presented in Figures 4 and 5 in the form of box plots. The box plots represent the total corrective and preventive maintenance activities performed for each experiment case. Since these are high value assets, the determining factor of the overall operations cost is the maintenance activities. Each box plot is generated based on ten replications of the simulation case. Common to both Figures 4 and 5, the x-axes indicate the agent deployment strategy, where Dist() and Het() indicate distributed or heterarchical architectures. The letters 'e' or 'c' in the parentheses denote the edge or cloud Digital Twin deployment strategy. The reader must note that the 'edge' and 'cloud' terminology here merely indicates whether the Digital Twin is deployed on the asset (in which case the Digital Twin fails when the asset fails) or not.
The y-axes indicate the corresponding maintenance activity counts. Furthermore, the reference times to failure increase from the top to the bottom subplots, and the data noise increases from left to right. The reader must be aware that the axes in Figure 4 are not evenly scaled; this is to improve the readability, as the corrective maintenance counts vary greatly across the cases.

Fig. 3: Asset degradation function plots for high and low failure frequency assets. (a) $t_{f_i}$ in the range of 50; (b) $t_{f_i}$ in the range of 200. The noise levels are shown at the top of the corresponding plots. True health indices are shown in black and grey, whereas the noisy data are shown in red.

V. DISCUSSION

The box plots in Figures 4 and 5 show that the maintenance activities are substantially higher when the noise ($\sigma$) is higher. Furthermore, the following conclusions can be derived:

1) Based on overall trends observed in the corrective and preventive maintenance activities
• Going from left to right in Figures 4 and 5, it is observed that the corrective maintenances (CMs) increase but no significant increase is observed in the preventive maintenances (PMs). This is because the white noise causes early or late predictions of asset failures with equal chances. As a result, the number of PMs stays nearly constant throughout the noise variations. However, early predictions have no effect on the CMs because the failures can be prevented with early PMs, whereas late predictions cause an increase in CMs. The effectiveness of the PMs reduces with increasing noise, and they are rendered almost ineffective in the extreme case seen in the top right plot in Figure 4.
• The above effect is mitigated with decreasing failure frequency (going from top to bottom) because the assets spend more time in the high health region, above the PM threshold (refer to the green shaded region in Figure 3).
This enables the cumulative data across the asset clusters to be closer to the actual function, so that better estimates of the function parameters can be derived by the corresponding Digital Twins.

2) Based on the trends in corrective maintenance activities (the following conclusions are observed from Figure 4 only)
• With less noise, the cloud Digital Twin deployment strategy performs better than the edge Digital Twin deployment. This is because only severe failures are observed in the case of edge Digital Twin deployment, and the assets have an overall lower uptime compared to their cloud counterparts. As a result, the Digital Twins in the cloud strategy are able to collect more data from the operating time of the assets. However, when the noise is high, the parameter estimation is challenging. This problem is amplified when the failure frequency is high and the assets are often wrongly clustered, leading to incorrect estimation of the parameters (the case of $t_{f_i} = 50$, $\sigma = 0.2$ and, to some extent, $t_{f_i} = 100$, $\sigma = 0.2$). The results show that the connected architectures perform better in these cases.
• For higher failure frequencies (upper row), the distributed architecture performs better than the heterarchical, the reason for which is attributed to system resilience. A failed central agent in the heterarchical case cannot update asset clusters, leading to suboptimal preventive maintenance activities. This disadvantage becomes clearer as the noise increases and the PMs become increasingly ineffective.
• However, the above mentioned conclusion does not hold for lower failure frequencies because the time lost upon agent failure is negligible with respect to the average time between failures of the assets. In these cases, the heterarchical architecture has been shown to perform better.

Fig. 4: Corrective maintenance counts across the experiment cases. The noise levels are mentioned at the top, and the $t_{f_i}$ ranges are mentioned on the left.
In summary, it is concluded that the choice of architecture and Digital Twin deployment strategy is governed by the failure frequencies and the noise levels observed in the operations data.

VI. CONCLUSIONS AND FUTURE WORK

A. Conclusions

This study analysed the effect of Digital Twin deployment strategies and failure characteristics on the operational cost of collaborative prognosis based predictive maintenance. The results help industries determine suitable Digital Twin deployment strategies for their asset failure profiles and agent architectures. Key conclusions from the discussions are reiterated in Figure 6.

B. Future Work

Future research directions stem from the limitations of this study, including:
• Granularity of failures: Degradations and failures of entire assets were simulated in the experiments discussed here. This granularity could be refined to simulate particular cases where the assets experience partial failures, or keep operating at a sub-optimal performance.
• Cost analyses: The current analysis only compares the maintenance activities. Depending on the type of assets, the maintenance costs can be used as another comparison metric for the evaluations. The cost can also include the total uptime or production output of the assets.
• Analysing a fleet of real assets: Simulated assets were used for the current analyses, enabling only comparative conclusions across the architectures and the agent deployment strategies. Future work can extend this comparison to fleets comprising real assets.

Fig. 5: Preventive maintenance counts across the experiment cases. The noise levels are mentioned at the top, and the $t_{f_i}$ ranges are mentioned on the left.

ACKNOWLEDGMENT

The authors thank Dr Manuel Herrera at the Institute for Manufacturing for his feedback and help.

Fig. 6: Summary of conclusions

REFERENCES

[1] A. Salvador Palau, “Distributed collaborative prognostics,” Ph.D. dissertation, University of Cambridge, 2020.
[2] M. Dhada, A. K. Jain, M. Herrera, M. P. Hernandez, and A. K. Parlikad, “Secure and communications-efficient collaborative prognosis,” IET Collaborative Intelligent Manufacturing, 2020.
[3] T. Wang, J. Yu, D. Siegel, and J. Lee, “A similarity-based prognostics approach for remaining useful life estimation of engineered systems,” in 2008 International Conference on Prognostics and Health Management, Denver, USA, 2008.
[4] O. F. Eker, F. Camci, and I. K. Jennions, “A Similarity-based Prognostics Approach for Remaining Useful Life Prediction,” in Second European Conference of the Prognostics and Health Management Society, Nantes, France. PHM Society, May 2014.
[5] Y. Lin, S. Liu, and S. Huang, “Selective sensing of a heterogeneous population of units with dynamic health conditions,” IISE Transactions, vol. 50, no. 12, pp. 1076–1088, Dec. 2018.
[6] A. S. Palau, M. H. Dhada, K. Bakliwal, and A. K. Parlikad, “An industrial multi agent system for real-time distributed collaborative prognostics,” Engineering Applications of Artificial Intelligence, vol. 85, pp. 590–606, 2019.
[7] M. Herrera, M. Pérez-Hernández, A. Kumar Parlikad, and J. Izquierdo, “Multi-agent systems and complex networks: Review and applications in systems engineering,” Processes, vol. 8, no. 3, p. 312, 2020.
[8] A. S. Palau, M. Dhada, and A. Parlikad, “Multi-Agent System architectures for collaborative prognostics,” Journal of Intelligent Manufacturing, 2019.
[9] J. A. Nachlas, Reliability engineering: probabilistic models and maintenance methods. CRC Press, 2017.
[10] K. Upasani, M. Bakshi, V. Pandhare, and B. K. Lad, “Distributed maintenance planning in manufacturing industries,” Computers and Industrial Engineering, vol. 108, pp. 1–14, 2017.
[11] A. Labib, “A decision analysis model for maintenance policy selection using a CMMS,” Journal of Quality in Maintenance Engineering, vol. 10(3), 2014.
[12] A. Rastegari and M. Mobin, “Maintenance decision making, supported by computerized maintenance management system,” in 2016 Annual Reliability and Maintainability Symposium (RAMS), 2016, pp. 1–8.
[13] P. Escamilla-Ambrosio, A. Rodríguez-Mota, E. Aguirre-Anaya, R. Acosta-Bermejo, and M. Salinas-Rosales, “Distributing computing in the internet of things: cloud, fog and edge computing overview,” in NEO 2016. Springer, 2018, pp. 87–115.
[14] D. C. Marinescu, Cloud computing: theory and practice. Morgan Kaufmann, 2017.
[15] I. Sittón-Candanedo, R. S. Alonso, J. M. Corchado, S. Rodríguez-González, and R. Casado-Vara, “A review of edge computing reference architectures and a new global edge proposal,” Future Generation Computer Systems, vol. 99, no. 2019, pp. 278–294, 2019.
[16] T. Wang, J. Yu, D. Siegel, and J. Lee, “A similarity-based prognostics approach for remaining useful life estimation of engineered systems,” in Prognostics and Health Management, 2008. PHM 2008. International Conference on. IEEE, 2008, pp. 1–6.
[17] J. Yan, M. Koc, and J. Lee, “A prognostic algorithm for machine performance assessment and its application,” Production Planning & Control, vol. 15, no. 8, pp. 796–801, 2004.
[18] J. Qin, W. Fu, H. Gao, and W. X. Zheng, “Distributed k-means algorithm and fuzzy c-means algorithm for sensor networks based on multiagent consensus theory,” IEEE Transactions on Cybernetics, vol. 47, no. 3, pp. 772–783, 2016.
[19] D. H. Arjoni, F. S. Madani, G. Ikeda, G. D. M. Carvalho, L. B. Cobianchi, L. F. Ferreira, and E. Villani, “Manufacture equipment retrofit to allow usage in the industry 4.0,” Proceedings - 2017 2nd International Conference on Cybernetics, Robotics and Control, CRC 2017, vol. 2018-Janua, pp. 155–161, 2018.