Reliability-oriented Telecommunication Network Routing Using Multi-agent Q-Learning

Longyan Tan
Institute for Manufacturing, Department of Engineering
University of Cambridge
Cambridge, United Kingdom
lt592@cam.ac.uk

Aitichya Chandra
Institute for Manufacturing, Department of Engineering
University of Cambridge
Cambridge, United Kingdom
ac2772@cam.ac.uk

Luning Li
Institute for Manufacturing, Department of Engineering
University of Cambridge
Cambridge, United Kingdom
ll669@cam.ac.uk
*Corresponding author

Ajith Kumar Parlikad
Institute for Manufacturing, Department of Engineering
University of Cambridge
Cambridge, United Kingdom
aknp2@cam.ac.uk

Abstract—The uninterrupted operation of telecommunication networks is critical to modern society, yet network components are prone to failures that can significantly disrupt services. Traditional routing protocols often respond reactively to such failures. This paper proposes a proactive, reliability-aware routing strategy that enhances network resilience by integrating predictive models of component reliability and availability with conventional metrics such as congestion and delay. Due to the high dimensionality and complexity of the resulting optimization problem, we employ a decentralized multi-agent reinforcement learning (MARL) Q-routing algorithm. Each node acts as an autonomous agent that learns an optimal routing policy to minimize delivery time in a dynamically degrading environment. Experiments on a simulated network show that our approach significantly outperforms a weighted shortest path baseline in terms of delivery time, delivery rate, and computational efficiency.

Index Terms—Telecommunication Network, Predictive Routing, Reliability, Multi-agent Reinforcement Learning

I. INTRODUCTION

The uninterrupted operation of telecommunication networks is paramount in the modern digital era, underpinning critical societal functions, economic activities, and daily life [1].
However, network components, such as routers, switches, cables, and wireless cellular links, are susceptible to failures arising from a multitude of factors including hardware degradation [2], software bugs [3], misconfigurations [4], and environmental events [5]. These failures can lead to significant service disruptions, performance degradation, and economic losses [6], [7]. Consequently, ensuring high levels of network reliability and availability has become a central focus for network operators and researchers.

The vast majority of existing research in network routing prioritizes performance metrics by assuming a perfectly functioning network [8]. However, in real-world deployments, particularly those operating over extended periods, this assumption is frequently violated, as the network topology may change due to component failure or performance degradation [9]. Consequently, the reliability of the network is not static; instead, it degrades over time and with use. Traditional network routing protocols, while adept at finding paths based on metrics like shortest distance or quality of service (QoS) parameters (e.g., delay, bandwidth), often react to failures only after they occur. These reactive approaches can lead to unacceptable delays in service restoration and significant packet loss compared to proactive approaches [10].

In contrast, this paper introduces a proactive, reliability-oriented routing framework that accounts for both the time-dependent and workload-induced degradation of network components. Our key innovation is the integration of reliability and availability models into the routing decision process using a multi-agent reinforcement learning (MARL) approach. Specifically, we adopt a Q-routing algorithm in which each network node acts as an independent agent that learns to optimize routing decisions based on local experience and network conditions.
This work contributes:
• A novel integration of time- and usage-based reliability models into a telecommunication routing framework;
• A decentralized MARL-based algorithm for routing in dynamically degrading networks;
• A comparative case study showing significant improvements in latency, delivery rate, and computational efficiency over a weighted shortest path benchmark.

The remainder of the paper is structured as follows. Section II reviews related work. Section III presents the problem formulation and degradation models. Section IV introduces the reinforcement learning approach. Section V details the simulation setup and performance results. Section VI concludes and discusses future research directions.

II. RELATED WORK

A. Telecommunication network reliability and degradation

Reliability, defined as the ability of a system to perform its required functions under stated conditions for a specified period [11], is a critical attribute, especially in increasingly complex environments. The reliability of a communication network is a critical concern, directly impacting its operational lifespan and maintenance costs [12].

Network performance is significantly impacted by a combination of environmental and operational stresses that cause uneven degradation across its devices. Nodes deployed in challenging physical locations may experience accelerated aging due to high ambient temperatures, humidity, or power grid instability [15]. Simultaneously, operational demands, such as continuous high traffic loads or intensive processing tasks, place greater stress on specific devices based on their role in the network’s topology [16]. This degradation is not limited to hardware, and it also manifests as software aging, where issues like memory leaks can gradually erode performance [17].
Over time, these combined pressures mean that certain nodes inevitably degrade faster than others, creating performance bottlenecks that can throttle data flow, increase latency, and ultimately compromise the entire network’s efficiency and reliability.

In addition, the availability of wireless links is often highly dynamic and susceptible to a complex interplay of factors [18]. For example, unforeseen bursts of radio-frequency interference from nearby industrial equipment can abruptly degrade signal-to-noise ratios [19]. On a larger scale, natural disasters such as floods, earthquakes, or severe storms can cause widespread and prolonged disruptions, not only by physically damaging infrastructure like cell towers and antennas, but also by altering the signal propagation environment itself [20]. These factors can induce rapid and significant fluctuations in link quality, which in turn cascade into severe performance degradation for the entire network.

B. Telecommunication network routing paradigm

Routing is a fundamental function in multi-hop networks, responsible for the discovery and operation of paths that facilitate end-to-end communication between nodes. Routing protocols can be categorized into two classes: distance-vector protocols, such as the Ad-hoc On-demand Distance Vector (AODV) protocol [21], and link-state protocols, such as the Optimized Link State Routing (OLSR) protocol [22]. The primary goal of these protocols is to establish and maintain routes based on specific metrics.

Many traditional routing algorithms prioritize finding the shortest path in terms of hop count [23]. While simple and effective in some contexts, this approach is often suboptimal in wireless ad-hoc networks. A shorter hop count may favor long, weak, and consequently unreliable links, which are prone to breakage and require more retransmissions, ultimately degrading performance [18].
To address these shortcomings, the concept of Quality-of-Service (QoS) routing has emerged, which aims to find paths that satisfy specific requirements regarding metrics like performance, delay, bandwidth, and reliability [24]. This involves using more sophisticated link metrics that better reflect the quality of a connection. For instance, the Expected Transmission Count (ETX) metric estimates the number of transmissions needed to successfully send a packet over a link, thereby capturing its quality [25]. Similarly, link metrics based on the Packet Delivery Ratio (PDR) can provide a direct measure of a link’s statistical reliability [13]. These advanced metrics form the basis for more intelligent, reliability-oriented routing strategies.

C. Reliability-oriented routing methods

Recognizing the limitations of traditional routing, a new class of reliability-oriented routing protocols has been developed. These protocols explicitly incorporate reliability metrics into their decision-making process to enhance network lifetime and performance. The goal is to move beyond simple path length and consider the health and quality of the nodes and links that form a communication path.

One such advanced protocol is dmin-Routing, a decentralized algorithm designed for discovering and maintaining routes in wireless ad-hoc networks that meet a specified minimum reliability [26]. However, this work does not consider network degradation, and its routing decision is based on a static reliability threshold.

The work by Ergun et al. pioneered a degradation-aware perspective by linking routing decisions to the physical degradation of IoT nodes [27]. Their research posits that routing can be used to steer traffic away from devices experiencing high thermal stress, thereby slowing their degradation and balancing the reliability across the network. This research led to the development of the R3-IoT protocol, a distributed, adaptive routing protocol based on reinforcement learning [28].
However, the reliability modeling in the R3-IoT protocol focuses mainly on node hardware failure and ignores link availability. In addition, the routing decision relies on a series of predictions, including network flow, communication time, expected transition count (ETC), and reliability decrease. The uncertainties within these predictions would strongly affect decision making.

D. Research Gaps

Despite progress, two key challenges remain insufficiently addressed in the literature:
• Degradation Modeling: Existing work often neglects modeling of component failure and degradation over time and usage, particularly at the granularity needed for routing optimization.
• Scalability under Uncertainty: High-dimensional routing problems with multiple sources of uncertainty (such as node reliability, link availability, and queue dynamics) require computationally efficient algorithms. Few existing solutions scale effectively in such complex environments.

This paper addresses these gaps through the integration of predictive reliability modeling with a decentralized multi-agent reinforcement learning framework, enabling robust and scalable routing under degradation.

III. PROBLEM STATEMENT

A. Telecommunication network modeling

Consider an ad hoc wireless telecommunication network $G = (V, E)$ deployed in a mesh topology, where $V$ is the set of vertices and $E$ is the set of telecommunication links. The network is composed of $N_v$ vertices $V = \{v_1, v_2, \ldots, v_{N_v}\}$, which represent routers, switches, base stations, or end-use devices, and $N_e$ edges, which represent cellular or WiFi links. The edge $e_{i,j} \in E$ links vertices $v_i$ and $v_j$. To align the model with practical telecommunication networks, the Barabási-Albert network model is adopted to capture a real-world network topology [32]. This type of network model is widely recognized for its power-law degree distribution, preferential attachment mechanism, and high robustness to random failure.
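As a rough illustration of the preferential-attachment topology described above, the following Python sketch grows a Barabási-Albert-style graph using only the standard library. The function name `barabasi_albert_edges`, the star-shaped seed graph, and the attachment parameter m = 3 are assumptions made for this example and are not taken from the paper; m = 3 is chosen only because it gives 3 × (100 − 3) = 291 edges for 100 vertices, which happens to match the network size reported in Section V.

```python
import random


def barabasi_albert_edges(n, m, seed=None):
    """Grow an undirected graph by preferential attachment.

    Hypothetical helper: starts from a star on m + 1 vertices (hub 0),
    then links each new vertex to m distinct existing vertices chosen
    with probability proportional to their current degree.
    """
    if not 1 <= m < n:
        raise ValueError("need 1 <= m < n")
    rng = random.Random(seed)
    edges = [(0, v) for v in range(1, m + 1)]
    # Every vertex appears here once per incident edge, so a uniform draw
    # from this list is a degree-proportional draw over vertices.
    endpoints = [u for edge in edges for u in edge]
    for new in range(m + 1, n):
        targets = set()
        while len(targets) < m:
            targets.add(rng.choice(endpoints))
        for target in sorted(targets):
            edges.append((target, new))
            endpoints.extend((target, new))
    return edges


# Assumed parameters: 100 vertices, m = 3 attachments per new vertex.
edges = barabasi_albert_edges(100, 3, seed=0)
degree = {}
for u, v in edges:
    degree[u] = degree.get(u, 0) + 1
    degree[v] = degree.get(v, 0) + 1
print(len(edges), "edges; maximum degree", max(degree.values()))
```

Because each edge is stored once as an unordered pair, the resulting list can be read directly as the bidirectional edge set $E$.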
To simplify our model, we assume that all vertices can send and receive data packets and that all edges are bidirectional, so that communication between any two linked vertices is mutual.

1) Data packets generation and delivery: In this telecommunication network, data packets are generated with variable origin and destination vertices. Initially, $M_p$ packets $\{p_1, p_2, \ldots, p_{M_p}\}$ are generated and distributed over all vertices with equal probability. Each packet $p$ has a source vertex $s_p$ and a destination vertex $d_p$. During transmission, a packet passes through a series of intermediate neighboring vertices until it arrives at its destination vertex. Upon arrival, several new packets with randomly assigned destinations are generated with probability $\rho$ at the destination vertex of the delivered packet. The maximum number of packets generated in the network is $M_p^{\max}$.

2) Vertex capacity and edge delay: For the $i$th vertex $v_i$, the sending queue capacity $C_i$ is the maximum number of packets that can be sent per time unit, and the storage buffer size $B_i$ is the maximum number of packets that can be stored at the same time. For the edge $e_{i,j}$, the transmission delay $d_{i,j}(t)$ follows a time-varying sinusoid:

\[ d_{i,j}(t) = d_{i,j}^{(0)} \left(1 + \alpha_{i,j} \sin(\omega_{i,j} t + \phi_{i,j})\right), \]

where $d_{i,j}^{(0)}$ is the base delay, $\alpha_{i,j}$ is the amplitude ($0 \le \alpha_{i,j} < 1$), $\omega_{i,j}$ is the frequency, and $\phi_{i,j}$ is the phase offset. During packet transmission between two vertices, the sending node first informs the receiving node that a packet will be delivered, and the receiving node checks its storage buffer for space. If the storage space is sufficient, the receiving node reserves that space until the packet arrives after the transmission delay $d_{i,j}(t)$. If a packet is sent to a vertex whose storage buffer is full, the packet is dropped, and the sending vertex is notified and requeues the packet in its sending buffer.

B. Reliability and availability modeling

In this work, the vertex reliability is defined as the success rate of a vertex in sending packets. The link availability is defined as the probability of surviving external disruption.

1) Vertex reliability: To model the degradation of vertex reliability, a Weibull degradation reliability function that considers time and utilization factors is used [33]. For $v_i \in V$, the reliability $R_v^i(t)$ over time can be expressed as:

\[ R_v^i(t) = \exp\left[-\left(\frac{m_i}{\eta_{p,i}} + \frac{t}{\eta_{t,i}}\right)^{\beta}\right], \quad \eta_{p,i} > 0,\ \eta_{t,i} > 0,\ \beta > 0. \tag{1} \]

In Equation 1, $m_i$ is the number of packets that have passed through vertex $v_i$ so far. $\eta_{p,i}$ and $\eta_{t,i}$ are the utilization-dependent and time-dependent scale parameters, respectively, accounting for the degradation speed over workload and time. The larger these values, the more slowly the reliability function decreases. $\beta$ is the shape parameter.

2) Edge availability: Because of random external disruptions, such as adverse weather, disruptive magnetic activity, or human error, regional wireless links may be unavailable for short time windows. For edge $e_{i,j}$, the number of external disruptions $D_{i,j}$ within a time interval $t$ follows a Poisson distribution $D_{i,j}(t) \sim \mathrm{Poisson}(\lambda_{i,j} t)$ [34]:

\[ P(D_{i,j}(t) = k) = \frac{(\lambda_{i,j} t)^{k} e^{-\lambda_{i,j} t}}{k!} \tag{2} \]

where $\lambda_{i,j}$ is the disruption rate and $\lambda_{i,j} t$ is the disruption intensity over the time interval $t$. If at least one disruption happens on the link within $t$, all the packets in transmission are dropped and requeued at the sending node. The availability $A_e^{i,j}(t)$ of edge $e_{i,j}$ is:

\[ A_e^{i,j}(t) = 1 - p_{i,j} \tag{3} \]

where $p_{i,j} = 1 - e^{-\lambda_{i,j} t}$ is the probability of at least one disruption happening within $t$, so that $A_e^{i,j}(t) = e^{-\lambda_{i,j} t}$.

C. Objective function and constraints

In this work, the objective of the routing decision is to choose the next vertex for a packet given its destination so as to minimize the average delivery time over the time horizon $T$, subject to the sending capacity $C_i$ and storage capacity $B_i$ constraints:

\[
\begin{aligned}
\min \quad & \frac{1}{M_p^{\max}} \sum_{m=1}^{M_p^{\max}} \sum_{e_{i,j} \in E} \sum_{t=0}^{T} x_{i,j}^{m}(t) \cdot \frac{d_{i,j}(t)}{A_e^{i,j}(t) + \delta} \\
\text{s.t.} \quad & \sum_{m} \sum_{v_j \in V} x_{i,j}^{m}(t) \le C_i, \quad \forall i, t \\
& \sum_{m} \sum_{v_i \in V} x_{i,j}^{m}(t) \le B_j, \quad \forall j, t
\end{aligned}
\tag{4}
\]

where $x_{i,j}^{m}(t) \in \{0, 1\}$ is a binary variable that denotes whether packet $p_m$ is routed from node $i$ to node $j$ at time $t$, and $\delta > 0$ is an infinitesimal.

IV. RELIABILITY-ORIENTED REINFORCEMENT LEARNING

To solve the sequential decision-making routing problem, this research uses an online multi-agent Q-routing reinforcement learning algorithm. The algorithm is based on a Markov decision process and a three-dimensional Q-table. Each node represents an agent and routes packets to its neighbours [29].

A. Markov decision process

The routing problem is modeled as a Markov decision process (MDP) defined by the tuple $Z = \langle \mathcal{S}, \mathcal{A}, \mathcal{R}, \mathcal{P}, \gamma \rangle$ with finite time horizon $T = N_t$.
• State Space $\mathcal{S}$: the state of a packet at time $t$ is $s_t \in \mathcal{S}$, where $s_t = \{x_t, d\}$. $x_t$ is the index of the vertex that holds the packet at the current time $t$, and $d$ is the destination vertex of the packet.
• Action Space $\mathcal{A}$: for a given vertex, an action $a_i$ is the choice of a neighboring vertex to which a packet can be forwarded at time $t$. The action space $\mathcal{A}$ of the vertex is the set of all such actions $\{a_i\}$, where $i$ indexes the available neighbors.
• Reward Space $\mathcal{R}$: the elements of $\mathcal{R}$ are the reward functions $r(s_{t+1} \mid s_t, a_t)$, defined in Equation 5. In this reward function, $q_{i,t}$ is the length of the storage buffer, $q_{eq}$ is the equivalent queue length, $w \times n_{grow}$ is the penalty for queue growth, and $R(t)$ is the reliability of the next vertex.
The additional rewards are 2000 if a packet reaches its destination and −50 if the packet is completely dropped.
• Transition probability matrix $\mathcal{P}$: $P(s_{t+1} \mid s_t, a_t) \to [0, 1]$ is the probability of transitioning from $s_t$ to $s_{t+1}$ given the action $a_t$.
• Discount factor $\gamma$: the discount factor $\gamma \le 1$ weights the value of future rewards.

\[ r(s_{t+1} \mid s_t, a_t) = 50 R(t) + q_{eq} - q_{i,t} - w \times n_{grow} \tag{5} \]

At each time step $t$, where $t = 0, 1, 2, \ldots, N$, the vertex agent $v_i$ makes the routing decision $a_{i,t}$ for all the packets in its sending queue and obtains the rewards $r_{i,t}$.

B. Multi-agent Q-routing algorithm

With the sequential decision-making problem formalized as an MDP, the objective is to determine an optimal policy $\pi^\star(s)$ that maximizes the expected cumulative discounted reward [30]. This is intrinsically linked to the concept of Q-values, which quantify the desirability of taking a specific routing action in a given state. A Q-value, denoted $Q(s_t, a_t)$, represents the expected total discounted future reward obtained by executing action $a_t$ in state $s_t$ and subsequently following an optimal policy. The optimal Q-function $Q^\star(s_t, a_t)$ satisfies the Bellman optimality equation:

\[ Q^\star(s_t, a_t) = r(s_{t+1} \mid s_t, a_t) + \gamma \sum_{s_{t+1}} P(s_{t+1} \mid s_t, a_t) \max_{a_{t+1}} Q^\star(s_{t+1}, a_{t+1}) \tag{6} \]

After executing action $a_t$ in state $s_t$, observing the immediate reward $r(s_{t+1} \mid s_t, a_t)$, and transitioning to state $s_{t+1}$, the Q-value for the state-action pair $Q(s_t, a_t)$ is updated as follows:

\[ Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r(s_{t+1} \mid s_t, a_t) + \gamma \max_{a_{t+1}} Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \right]. \tag{7} \]

In this update, $\alpha \in (0, 1]$ is the learning rate, controlling the extent to which new information overrides existing Q-value estimates. The bracketed term $r(s_{t+1} \mid s_t, a_t) + \gamma \max_{a_{t+1}} Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t)$ is the temporal difference (TD) error, representing the discrepancy between the current Q-value estimate and a more accurate target value; the update moves the estimate by a fraction $\alpha$ of this error.
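The reward in Equation 5 and the update in Equation 7 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `vertex_reliability`, `step_reward`, and `NodeAgent` are hypothetical, each agent is assumed to keep a table indexed by (destination, next hop), the value of the next state is assumed to be supplied by the receiving neighbor as in classical Q-routing, $R(t)$ is assumed to be the vertex reliability of Equation 1, and the numbers used for $w$, $q_{eq}$, and the queue lengths are placeholders. The learning rate 0.2 and discount factor 0.9 are the values listed in Table II.

```python
import math
from collections import defaultdict


def vertex_reliability(m_i, t, eta_p, eta_t, beta):
    """Equation 1: reliability after m_i forwarded packets and elapsed time t."""
    return math.exp(-((m_i / eta_p + t / eta_t) ** beta))


def step_reward(reliability_next, q_eq, q_len, n_grow, w):
    """Equation 5: 50 R(t) + q_eq - q_{i,t} - w * n_grow."""
    return 50.0 * reliability_next + q_eq - q_len - w * n_grow


class NodeAgent:
    """Q-values of one vertex, keyed by (destination, next hop).

    Stacking the tables of all vertices gives a table indexed by
    (current vertex, destination, next hop); this layout is an assumption.
    """

    def __init__(self, neighbors, alpha=0.2, gamma=0.9):
        self.neighbors = list(neighbors)
        self.alpha = alpha
        self.gamma = gamma
        self.q = defaultdict(float)  # unseen pairs start at 0.0

    def best_value(self, dest):
        """Largest Q-value over next hops for packets bound for dest."""
        return max(self.q[(dest, a)] for a in self.neighbors)

    def update(self, dest, action, reward, next_best):
        """Equation 7; returns the TD error (the bracketed term)."""
        td_error = reward + self.gamma * next_best - self.q[(dest, action)]
        self.q[(dest, action)] += self.alpha * td_error
        return td_error


# Placeholder scenario: vertex 0 forwards a packet bound for vertex 3 to vertex 1.
sender = NodeAgent(neighbors=[1, 2])
receiver = NodeAgent(neighbors=[0, 3])
rel = vertex_reliability(m_i=1200, t=300, eta_p=3000, eta_t=3000, beta=2)
reward = step_reward(rel, q_eq=5, q_len=3, n_grow=1, w=2.0)
td = sender.update(dest=3, action=1, reward=reward,
                   next_best=receiver.best_value(3))
print(f"reliability={rel:.4f} reward={reward:.2f} td_error={td:.2f}")
```

Keeping one small table per vertex means an update only needs the reward and a single value reported by the chosen neighbor, which is what allows the agents to learn in a decentralized way.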
The term $\gamma \max_{a_{t+1}} Q(s_{t+1}, a_{t+1})$ provides the estimated optimal value of the next state, which drives the estimate towards $Q^\star(s_t, a_t)$. Under conditions of sufficient exploration (all state-action pairs visited infinitely often) and a suitably decaying learning rate, Q-learning is guaranteed to converge to the optimal Q-function [31].

To encourage exploration of the large action space, agents follow an $\epsilon$-greedy policy:

\[
a =
\begin{cases}
\arg\max_{a} Q(s, a), & \text{with probability } 1 - \epsilon \\
\text{random action}, & \text{with probability } \epsilon
\end{cases}
\tag{8}
\]

The parameter $\epsilon$ decays over time with a decay rate $r_\epsilon$. The $\epsilon$-greedy policy helps prevent the agent from getting stuck in a suboptimal loop by ensuring that it tries different options in a large action space. The key benefit is the balance between exploiting known good actions and exploring new ones, which is crucial for discovering the most effective long-term strategy.

V. EXPERIMENTAL RESULTS

To evaluate the performance of the multi-agent Q-routing algorithm, we conduct a set of experiments on a simulated telecommunication network $G$ with $N_v = 100$ vertices and $N_e = 291$ edges. The number of packets generated initially is $M_p = 3000$, and the maximum number of packets generated in the network is $M_p^{\max} = 36000$. The time horizon is $T = 1000$. The other simulation environment parameters are listed in Table I, and the algorithm parameters of Q-routing are shown in Table II.

The key metrics in the learning process, including average delivery time, delivery rate, and total rewards, are shown in Figs. 1, 2, and 3, respectively. The plot of average delivery time (see Fig. 1) demonstrates the agent’s efficiency. In the first few episodes, the average time is extremely high, which, combined with the low delivery rate, suggests inefficient exploration or failed attempts.
The delivery time then plummets dramatically and stabilizes at a low value (around 25 episodes) for the remainder of the training.

The delivery rate metric (see Fig. 2) provides insight into the agent’s effectiveness. The delivery rate rapidly increases to nearly 100% in the same initial 35-episode period. This shows that the agent quickly learns the fundamental goal of making successful deliveries. The rate then remains stable at this near-perfect level, confirming the agent has mastered the delivery aspect of its task.

The total rewards per episode (see Fig. 3) start at a negative value, indicating that the agent initially fails to send packets correctly and incurs penalties. However, the agent learns very quickly, with the reward showing a steep, linear increase within the first 35 episodes. After this initial phase, the reward stabilizes at a high plateau of approximately $6 \times 10^{7}$, indicating that the agent has successfully converged on a highly effective and consistent policy for maximizing its reward. The learning curve is strongly correlated with the delivery rate.

TABLE I
SIMULATION ENVIRONMENTAL PARAMETERS SETTING

Environmental Parameters                                Values
Sending capacity $C_i$                                  8 ∼ 12
Storage capacity $B_i$                                  70 ∼ 80
Edge base delay $d_{i,j}^{(0)}$                         10
Utilization-dependent scale parameter $\eta_{p,i}$      2400 ∼ 3600
Time-dependent scale parameter $\eta_{t,i}$             3000
Shape parameter $\beta$                                 2
Poisson disruption rate $\lambda_{i,j}$                 0.1 ∼ 0.2

TABLE II
Q-ROUTING ALGORITHM PARAMETERS SETTING

Algorithm Parameters                                    Values
Learning rate $\alpha$                                  0.2
Learning decay rate $r_\alpha$                          0.99
Discount factor $\gamma$                                0.9
Greedy exploration parameter $\epsilon$                 0.6
Greedy decay rate $r_\epsilon$                          0.99995
Number of episodes $N_I$                                100

Fig. 1. Average delivery time over episodes during learning.
Fig. 2. Delivery rate over episodes during learning.
Fig. 3. Total rewards during learning.
To test the proposed Q-routing algorithm, we compare it with a weighted shortest path method, which constructs edge weights from reliability and real-time delay and searches for the optimal path with Dijkstra’s algorithm. The average delivery time, average delivery rate, and computation time results are presented in Figs. 4, 5, and 6, respectively. The computation time measures the speed of finding the optimal path, which relates to the execution complexity of each algorithm: $O(n)$ for Q-routing and $O(N_v^2)$ for the weighted shortest path. The degradation resistance factor $d_r$ shown on the x-axis shifts the utilization-dependent scale parameter $\eta_{p,i}$ for all vertices:

\[ \eta_{p,i,\text{shifted}} = \eta_{p,i,\text{original}} \times (1 + d_r) \tag{9} \]

Fig. 4. Average delivery time comparison between Q-learning and weighted shortest path.

The parameter $d_r$ is designed to evaluate the performance of the Q-routing algorithm under various network reliability levels, where a larger $d_r$ value results in a slower rate of network degradation.

The comparison between Q-routing and the weighted shortest path shows that Q-routing outperforms the baseline on all three key performance indicators. Both algorithms were tested under identical conditions, and the Q-routing model consistently achieves more favorable outcomes. Q-routing maintains a high and stable delivery rate, consistently exceeding 90%, whereas the weighted shortest path method achieves a rate of only approximately 45-50%. Furthermore, Q-routing is substantially faster, with average delivery times roughly half those of the weighted shortest path. As for computational efficiency, Q-routing requires less than 30 seconds of computation time, while the weighted shortest path demands nearly three minutes, with this time increasing alongside degradation resistance.
The smaller error bars associated with Q-routing also indicate more consistent and predictable performance.

In conclusion, the adaptive nature of Q-routing offers a decisive advantage over the static path-finding of the weighted shortest path algorithm. The Q-routing algorithm is not only more effective at ensuring successful and timely deliveries, but is also more computationally efficient. Across the tested degradation resistance factors, Q-routing remains robust and performs better than the weighted shortest path.

Fig. 5. Average delivery rate comparison between Q-learning and weighted shortest path.
Fig. 6. Computation time comparison between Q-learning and weighted shortest path.

VI. CONCLUSION

This paper introduces a proactive, reliability-oriented routing strategy designed to enhance the resilience of telecommunication networks against component failures. The core of our framework is a novel decision-making approach that integrates predictive models of component reliability and availability with traditional network metrics. To address the complexity of this optimization problem, we employ a multi-agent Q-routing algorithm where each network node acts as a decentralized agent, learning an optimal policy to intelligently route traffic and minimize packet delivery time in a dynamically degrading environment.

The Q-routing approach significantly outperforms the weighted shortest path method across all key metrics. It yields a much higher delivery rate, reduces packet delivery times by approximately 50%, and is computationally faster.

While this work demonstrates a significant improvement over traditional methods, several avenues exist for future research. The current implementation relies on a Q-table, which can be inefficient and slow to adapt in highly dynamic or large-scale network environments.
Future iterations could leverage deep reinforcement learning techniques, such as Deep Q-Networks (DQN) and proximal policy optimization (PPO), to better handle state-space complexity and improve adaptability. Furthermore, the simulation was conducted with a finite number of packets; future studies should consider modeling continuous network flows to more accurately reflect real-world traffic loads and their impact on network degradation. Finally, the reliability and degradation models, while effective, could be refined by incorporating more granular, data-driven factors to create a more precise representation of real-world component aging and failure mechanisms.

ACKNOWLEDGMENT

This work was supported in part by the Boeing Company under Grant RG93345.

REFERENCES

[1] A. Uzoka, E. Cadet, and P. U. Ojukwu, "The role of telecommunications in enabling Internet of Things (IoT) connectivity and applications," Comprehensive Research and Reviews in Science and Technology, vol. 2, no. 02, pp. 055-073, 2024.
[2] L. Xing, "Cascading failures in Internet of Things: Review and perspectives on reliability and resilience," IEEE Internet of Things Journal, vol. 8, no. 1, pp. 44-64, 2020.
[3] W. Hou, "Integrated reliability and availability analysis of networks with software failures and hardware failures," Ph.D. dissertation, Dept. Elect. and Comput. Eng., Univ. of Waterloo, Waterloo, ON, Canada, 2003.
[4] M. Chlosta, D. Rupprecht, T. Holz, and C. Pöpper, "LTE security disabled: misconfiguration in commercial networks," Proc. 12th Conf. Security and Privacy in Wireless and Mobile Networks (WiSec), May 2019, pp. 261-266.
[5] C. G. Tuppen, "Energy and telecommunications—an environmental impact analysis," Energy & Environment, vol. 3, no. 1, pp. 70-81, 1992.
[6] S. S. Savas, M. F. Habib, M. Tornatore, F. Dikbiyik, and B. Mukherjee, "Network adaptability to disaster disruptions by exploiting degraded-service tolerance," IEEE Communications Magazine, vol. 52, no. 12, pp. 58-65, 2014.
[7] E. Koks, R. Pant, S. Thacker, and J. W. Hall, "Understanding business disruption and economic losses due to electricity failures and flooding," International Journal of Disaster Risk Science, vol. 10, no. 4, pp. 421-438, 2019.
[8] S. A. Changazi et al., "Optimization of network topology robustness in IoTs: A systematic review," Computer Networks, vol. 246, p. 110568, 2024.
[9] S. K. Chaturvedi, Network Reliability: Measures and Evaluation. Hoboken, NJ: John Wiley & Sons, 2016.
[10] B. S. Awoyemi, A. S. Alfa, and B. T. Maharaj, "Network restoration for next-generation communication and computing networks," Journal of Computer Networks and Communications, vol. 2018, Art. ID 4134878, 2018.
[11] K. S. Trivedi and A. Bobbio, Reliability and Availability Engineering: Modeling, Analysis, and Applications. Cambridge, UK: Cambridge University Press, 2017.
[12] M. Liu and D. M. Frangopol, "Optimizing bridge network maintenance management under uncertainty with conflicting criteria: Life-cycle maintenance, failure, and user costs," Journal of Structural Engineering, vol. 132, no. 11, pp. 1835-1845, 2006.
[13] M. Jacobsson and C. Rohner, "Estimating packet delivery ratio for arbitrary packet sizes over wireless links," IEEE Communications Letters, vol. 19, no. 4, pp. 609-612, 2015.
[14] A. Abd Aziz, Y. A. Sekercioglu, P. Fitzpatrick, and M. Ivanovich, "A survey on distributed topology control techniques for extending the lifetime of battery powered wireless sensor networks," IEEE Communications Surveys & Tutorials, vol. 15, no. 1, pp. 121-144, 2013.
[15] V. C. Gungor, B. Lu, and G. P. Hancke, "Opportunities and challenges of wireless sensor networks in smart grid," IEEE Transactions on Industrial Electronics, vol. 57, no. 10, pp. 3557-3564, 2010.
[16] R. H. Khan and J. Y. Khan, "A comprehensive review of the application characteristics and traffic requirements of a smart grid communications network," Computer Networks, vol. 57, no. 3, pp. 825-845, 2013.
[17] J. Zhao, Y. Jin, K. S. Trivedi, and R. Matias Jr., "Injecting memory leaks to accelerate software failures," Proc. 22nd IEEE Int. Symp. Software Reliability Engineering (ISSRE), Nov. 2011, pp. 260-269.
[18] G. Egeland and P. E. Engelstad, "The availability and reliability of wireless multi-hop networks with stochastic link failures," IEEE Journal on Selected Areas in Communications, vol. 27, no. 7, pp. 1132-1146, 2009.
[19] M. Wildemeersch and J. Fortuny-Guasch, "Radio frequency interference impact assessment on global navigation satellite systems," EC Joint Research Centre, Ispra, Italy, Tech. Rep. JRC56534, pp. 50-51, 2010.
[20] J. Rak et al., "Fundamentals of communication networks resilience to disasters and massive disruptions," Guide to Disaster-Resilient Communication Networks, J. Rak and D. Hutchison, Eds. Cham, Switzerland: Springer, 2020, pp. 1-43.
[21] C. Perkins, E. Belding-Royer, and S. Das, "Ad hoc On-Demand Distance Vector (AODV) Routing," No. RFC 3561, 2003.
[22] T. Clausen and P. Jacquet, Eds., "Optimized Link State Routing Protocol (OLSR)," No. RFC 3626, 2003.
[23] A. Jiang and L. Zheng, "An effective hybrid routing algorithm in WSN: Ant colony optimization in combination with hop count minimization," Sensors, vol. 18, no. 4, p. 1020, 2018.
[24] T. Mazhar et al., "Quality of service (QoS) performance analysis in a traffic engineering model for next-generation wireless sensor networks," Symmetry, vol. 15, no. 2, p. 513, 2023.
[25] X. Ni, K. C. Lan, and R. Malaney, "On the performance of expected transmission count (ETX) for wireless mesh networks," Proc. 3rd Int. Conf. Performance Evaluation Methodologies and Tools (VALUETOOLS), pp. 1-10, 2008.
[26] C. Kohlstruck and R. Gotzhein, "dR min–Routing–A Decentralized Algorithm for Reliability-constrained Routing in Wireless Ad-hoc Networks," Proc. Int. Wireless Commun. and Mobile Comput. (IWCMC), pp. 1386-1393, 2022.
[27] K. Ergun, R. Ayoub, P. Mercati, and T. Rosing, "Improving mean time to failure of IoT networks with reliability-aware routing," Proc. 10th Mediterranean Conf. Embedded Comput. (MECO), Jun. 2021, pp. 1-4.
[28] K. Ergun, R. Ayoub, P. Mercati, and T. Rosing, "Reinforcement learning based reliability-aware routing in IoT networks," Ad Hoc Networks, vol. 132, p. 102869, 2022.
[29] Q. Zhang, Y. Liu, Y. Xiang, and T. Xiahou, "Reinforcement learning in reliability and maintenance optimization: A tutorial," Reliability Engineering & System Safety, vol. 251, p. 110401, 2024.
[30] L. Tan, F. Wei, X. Ma, R. Peng, H. Xiao, and L. Yang, "Systemic Condition-Based Maintenance Optimization Under Inspection Uncertainties: A Customized Multiagent Reinforcement Learning Approach," IEEE Transactions on Reliability, 2025.
[31] C. J. C. H. Watkins and P. Dayan, "Q-learning," Machine Learning, vol. 8, no. 3-4, pp. 279-292, 1992.
[32] R. Albert and A. L. Barabási, "Statistical mechanics of complex networks," Reviews of Modern Physics, vol. 74, no. 1, pp. 47-97, 2002.
[33] W. Ahmad, O. Hasan, U. Pervez, and J. Qadir, "Reliability modeling and analysis of communication networks," Journal of Network and Computer Applications, vol. 78, pp. 191-215, 2017.
[34] S. Zarezadeh, S. Ashrafi, and M. Asadi, "A shock model based approach to network reliability," IEEE Transactions on Reliability, vol. 65, no. 2, pp. 992-1000, 2015.