Reliability-oriented Telecommunication Network
Routing Using Multi-agent Q-Learning

Longyan Tan
Institute for Manufacturing, Department of Engineering

University of Cambridge
Cambridge, United Kingdom

lt592@cam.ac.uk

Aitichya Chandra
Institute for Manufacturing, Department of Engineering

University of Cambridge
Cambridge, United Kingdom

ac2772@cam.ac.uk

Luning Li
Institute for Manufacturing, Department of Engineering

University of Cambridge
Cambridge, United Kingdom

ll669@cam.ac.uk
*Corresponding author

Ajith Kumar Parlikad
Institute for Manufacturing, Department of Engineering

University of Cambridge
Cambridge, United Kingdom

aknp2@cam.ac.uk

Abstract—The uninterrupted operation of telecommunication
networks is critical to modern society, yet network components
are prone to failures that can significantly disrupt services.
Traditional routing protocols often respond reactively to such
failures. This paper proposes a proactive, reliability-aware rout-
ing strategy that enhances network resilience by integrating
predictive models of component reliability and availability with
conventional metrics such as congestion and delay. Due to the
high dimensionality and complexity of the resulting optimization
problem, we employ a decentralized multi-agent reinforcement
learning (MARL) Q-routing algorithm. Each node acts as an
autonomous agent that learns an optimal routing policy to
minimize delivery time in a dynamically degrading environment.
Experiments on a simulated network show that our approach sig-
nificantly outperforms a weighted shortest path baseline in terms
of delivery time, delivery rate, and computational efficiency.

Index Terms—Telecommunication Network, Predictive Rout-
ing, Reliability, Multi-agent Reinforcement Learning

I. INTRODUCTION

The uninterrupted operation of telecommunication networks
is paramount in the modern digital era, underpinning critical
societal functions, economic activities, and daily life [1].
However, network components, such as routers, switches,
cables, and wireless cellular links, are susceptible to fail-
ures arising from a multitude of factors including hardware
degradation [2], software bugs [3], misconfigurations [4], and
environmental events [5]. These failures can lead to significant
service disruptions, performance degradation, and economic
losses [6], [7]. Consequently, ensuring high levels of network
reliability and availability has become a central focus for
network operators and researchers.

The vast majority of existing research in network rout-
ing prioritizes performance metrics by assuming a perfectly
functioning network [8]. However, in real-world deployments,
particularly those operating over extended periods, this
assumption is frequently violated, as the network topology may
change due to component failure or performance degradation
[9]. Consequently, the reliability of the network is not static;
instead, it degrades over time and with use. Traditional net-
work routing protocols, while adept at finding paths based
on metrics like shortest distance or quality of service (QoS)
parameters (e.g., delay, bandwidth), often react to failures
only after they occur. These reactive approaches can lead
to unacceptable delays in service restoration and significant
packet loss compared to proactive approaches [10].

In contrast, this paper introduces a proactive, reliability-
oriented routing framework that accounts for both the time-
dependent and workload-induced degradation of network com-
ponents. Our key innovation is the integration of reliability
and availability models into the routing decision process
using a multi-agent reinforcement learning (MARL) approach.
Specifically, we adopt a Q-routing algorithm in which each
network node acts as an independent agent that learns to
optimize routing decisions based on local experience and
network conditions. This work contributes:

• A novel integration of time- and usage-based reliability
models into a telecommunication routing framework;

• A decentralized MARL-based algorithm for routing in
dynamically degrading networks;

• A comparative case study showing significant improve-
ments in latency, delivery rate, and computational effi-
ciency over a weighted shortest path benchmark.

The remainder of the paper is structured as follows. Section
II reviews related work. Section III presents the problem
formulation and degradation models. Section IV introduces the
reinforcement learning approach. Section V details the
simulation setup and performance results. Section VI concludes and
discusses future research directions.

II. RELATED WORK

A. Telecommunication network reliability and degradation

Reliability is defined as the ability of a system to perform
its required functions under stated conditions for a specified
period [11]. For a communication network, it is a critical
concern that directly affects operational lifespan and
maintenance costs [12], especially in increasingly complex
environments.

Network performance is significantly impacted by a com-
bination of environmental and operational stresses that cause
uneven degradation across its devices. Nodes deployed in chal-
lenging physical locations may experience accelerated aging
due to high ambient temperatures, humidity, or power grid
instability [15]. Simultaneously, operational demands, such as
continuous high traffic loads or intensive processing tasks,
place greater stress on specific devices based on their role in
the network’s topology [16]. This degradation is not limited to
hardware, and it also manifests as software aging, where issues
like memory leaks can gradually erode performance [17]. Over
time, these combined pressures mean that certain nodes
inevitably degrade faster than others, creating performance
bottlenecks that can throttle data flow, increase latency, and
ultimately compromise the entire network’s efficiency and
reliability.

In addition, the availability of wireless links is often highly
dynamic and susceptible to a complex interplay of factors [18].
For example, unforeseen bursts of radio-frequency interference
from nearby industrial equipment can abruptly degrade signal-
to-noise ratios [19]. On a larger scale, natural disasters such
as floods, earthquakes, or severe storms can cause widespread
and prolonged disruptions, not only by physically damaging
infrastructure like cell towers and antennas, but also by altering
the signal propagation environment itself [20]. These factors
can induce rapid and significant fluctuations in link quality,
which in turn cascade into severe performance degradation
for the entire network.

B. Telecommunication network routing paradigm

Routing is a fundamental function in multi-hop networks,
responsible for the discovery and operation of paths that
facilitate end-to-end communication between nodes. Routing
protocols can be categorized into two classes: distance-vector
protocols, such as the Ad-hoc On-demand Distance Vector
(AODV) protocol [21], and link-state protocols, such as the
Optimized Link State Routing (OLSR) protocol [22]. The
primary goal of these protocols is to establish and maintain
routes based on specific metrics.

Many traditional routing algorithms prioritize finding the
shortest path in terms of hop count [23]. While simple and
effective in some contexts, this approach is often suboptimal
in wireless ad-hoc networks. A shorter hop count may favor
long, weak, and consequently unreliable links, which are
prone to breakage and require more retransmissions, ultimately
degrading performance [18].

To address these shortcomings, the concept of Quality-
of-Service (QoS) routing has emerged, which aims to find
paths that satisfy specific requirements regarding metrics
like performance, delay, bandwidth, and reliability [24]. This
involves using more sophisticated link metrics that better
reflect the quality of a connection. For instance, the Expected
Transmission Count (ETX) metric estimates the number of
transmissions needed to successfully send a packet over a
link, thereby capturing its quality [25]. Similarly, link metrics
based on the Packet Delivery Ratio (PDR) can provide a direct
measure of a link’s statistical reliability [13]. These advanced
metrics form the basis for more intelligent, reliability-oriented
routing strategies.

C. Reliability-oriented routing methods

Recognizing the limitations of traditional routing, a new
class of reliability-oriented routing protocols has been devel-
oped. These protocols explicitly incorporate reliability metrics
into their decision-making process to enhance network lifetime
and performance. The goal is to move beyond simple path
length and consider the health and quality of the nodes and
links that form a communication path. One such advanced
protocol is dRmin-Routing, a decentralized algorithm designed
for discovering and maintaining routes in wireless ad-hoc
networks that meet a specified minimum reliability [26].
However, this work does not consider network degradation, and
the routing decision is based on a static reliability threshold.
The work by Ergun et al. pioneered this perspective by linking
routing decisions to the physical degradation of IoT nodes
[27]. Their research posits that routing can be used to steer
traffic away from devices experiencing high thermal stress,
thereby slowing their degradation and balancing the reliability
across the network. This research led to the development of
the R3-IoT protocol, a distributed, adaptive routing protocol
based on reinforcement learning [28]. However, the reliability
modeling in the R3-IoT protocol mainly focuses on node
hardware failure and ignores the link availability. In addition,
the routing decision is made given a series of predictions,
including network flow, communication time, expected
transmission count (ETX), and reliability decrease. The
uncertainties within these predictions would strongly affect
decision making.

D. Research Gaps

Despite progress, two key challenges remain insufficiently
addressed in the literature:

• Degradation Modeling: Existing work often neglects
modeling of component failure and degradation over
time and usage, particularly at the granularity needed for
routing optimization.

• Scalability under Uncertainty: High-dimensional routing
problems with multiple sources of uncertainty, such as node
reliability, link availability, and queue dynamics, require
computationally efficient algorithms. Few existing solutions
scale effectively in such complex environments.

This paper addresses these gaps through the integration
of predictive reliability modeling with a decentralized multi-
agent reinforcement learning framework, enabling robust and
scalable routing under degradation.

III. PROBLEM STATEMENT

A. Telecommunication network modeling
Consider an ad hoc wireless telecommunication network
G = (V,E) deployed in a mesh topology, where V is the set of
vertices and E is the set of telecommunication links. The
network is composed of Nv vertices V = {v1, v2, ..., vNv}, which
represent routers, switches, base stations, or end-use devices,
and Ne edges, which represent cellular or WiFi links. The
edge ei,j ∈ E links vertices vi and vj. To
align the model with the practical telecommunication network,
the Barabási-Albert network model is adopted to capture a
real-world network topology [32]. This type of network model
is widely recognized for its power-law degree distribution,
preferential attachment mechanism, and high robustness to
random failure. To simplify our model, we assume that all the
vertices can send and receive data packets, and all the edges
are bidirectional, indicating that the communication between
any two vertices is mutual.
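
As an illustration of the topology model, the following Python sketch grows a Barabási-Albert graph by preferential attachment using only the standard library. The function name `barabasi_albert`, the seed, and the attachment parameter m = 3 are assumptions made for this example. The source does not state the generator settings; m = 3 is chosen only because it reproduces the Nv = 100 and Ne = 291 reported in Section V.

```python
import random


def barabasi_albert(n, m, seed=None):
    """Return the edge set of a Barabasi-Albert graph with n vertices.

    Each new vertex attaches to m distinct existing vertices chosen with
    probability proportional to their current degree.
    """
    rng = random.Random(seed)
    edges = set()
    # `repeated` lists every vertex once per incident edge, so a uniform
    # draw from it is a degree-proportional draw.
    repeated = []
    targets = list(range(m))  # the first new vertex links to the m seed vertices
    for new in range(m, n):
        for t in targets:
            edges.add((t, new))
        repeated.extend(targets)
        repeated.extend([new] * m)
        chosen = set()
        while len(chosen) < m:
            chosen.add(rng.choice(repeated))
        targets = list(chosen)
    return edges


# Hypothetical settings: m = 3 is an assumption, not a value from the paper.
edges = barabasi_albert(100, 3, seed=0)
print(len(edges))  # (n - m) * m edges
```

Because every edge is bidirectional in the model, each pair (u, v) in the returned set stands for traffic in both directions.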

1) Data packet generation and delivery: In this
telecommunication network, data packets are generated with
variable origin and destination vertices. Initially, Mp packets
{p1, p2, ..., pMp} are generated, with their origins distributed
over all the vertices with equal probability. Each packet p has
a source vertex sp and a destination vertex dp.

During the transmission process, packets pass through a series
of intermediate neighboring vertices until they arrive at their
destination vertices. Upon arrival, several new packets, whose
destinations are randomly assigned, are generated with
probability ρ at the destination vertex of the delivered
packet. The maximum number of packets generated in the network
is $M_p^{\max}$.
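
The generation rule can be sketched as follows. This is a minimal illustration under stated assumptions: the source does not say how many packets "several" is, so the `burst` parameter is hypothetical, and the sketch also assumes that a packet's destination differs from its source. All function names are invented for the example.

```python
import random


def other_vertex(vertex, n_vertices, rng):
    """Draw a vertex uniformly from all vertices except `vertex`."""
    dst = rng.randrange(n_vertices - 1)
    return dst + 1 if dst >= vertex else dst


def initial_packets(n_vertices, n_packets, rng):
    """Create (source, destination) pairs with uniformly drawn sources."""
    packets = []
    for _ in range(n_packets):
        src = rng.randrange(n_vertices)
        packets.append((src, other_vertex(src, n_vertices, rng)))
    return packets


def regenerate(delivered_at, n_vertices, rho, burst, generated, max_packets, rng):
    """With probability rho, spawn up to `burst` packets at the delivery vertex.

    `generated` is the number of packets created so far; the total never
    exceeds `max_packets`.
    """
    if generated >= max_packets or rng.random() >= rho:
        return []
    count = min(burst, max_packets - generated)
    return [(delivered_at, other_vertex(delivered_at, n_vertices, rng))
            for _ in range(count)]
```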
2) Vertex capacity and edge delay: For the ith vertex $v_i$, its
sending queue capacity $C_i$ indicates the maximum number of
packets that can be sent per time unit, and the storage buffer
size $B_i$ refers to the maximum number of packets that can
be stored at the same time. For the edge $e_{i,j}$, its transmission
delay $d_{i,j}(t)$ follows a time-variant sinusoidal wave:

$$d_{i,j}(t) = d_{i,j}^{(0)} \left(1 + \alpha_{i,j} \sin(\omega_{i,j} t + \phi_{i,j})\right)$$

where $d_{i,j}^{(0)}$ is the base delay, $\alpha_{i,j}$ is the amplitude
($0 \le \alpha_{i,j} < 1$), $\omega_{i,j}$ is the frequency,
and $\phi_{i,j}$ is the phase offset.
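
A direct transcription of the delay model is sketched below. The base delay of 10 matches Table I; the amplitude, frequency, and phase used in the usage line are hypothetical values chosen only for illustration, since the source does not report them.

```python
import math


def edge_delay(t, base, amplitude, omega, phase):
    """Sinusoidal edge delay d(t) = base * (1 + amplitude * sin(omega * t + phase))."""
    if not 0.0 <= amplitude < 1.0:
        raise ValueError("amplitude must satisfy 0 <= amplitude < 1")
    return base * (1.0 + amplitude * math.sin(omega * t + phase))


# Hypothetical amplitude, frequency, and phase; only base = 10 is from Table I.
print(edge_delay(25, base=10, amplitude=0.3, omega=2 * math.pi / 100, phase=0.0))
```

The constraint on the amplitude keeps the delay strictly positive, since the sine term can never push the factor below 1 - amplitude.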

During the packet transmission between two vertices, the
sending vertex first informs the receiving vertex that a packet
will be delivered, and the receiving vertex checks its storage
buffer for space. If the space is sufficient, the receiving
vertex reserves it until the packet arrives after $d_{i,j}(t)$
time units. If a packet is sent to a vertex whose storage buffer
is full, it is dropped; the sending vertex is notified and
requeues the packet in its sending buffer.
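
The reservation handshake described above can be sketched as a small state machine. The class and function names are hypothetical, and the choice to requeue a refused packet at the back of the sending queue is an assumption; the source only states that the packet is requeued.

```python
from collections import deque


class Vertex:
    def __init__(self, capacity, buffer_size):
        self.capacity = capacity        # C_i: packets that can be sent per time unit
        self.buffer_size = buffer_size  # B_i: packets stored or reserved at once
        self.stored = 0                 # packets currently held
        self.reserved = 0               # slots promised to packets still in flight
        self.send_queue = deque()

    def try_reserve(self):
        """Reserve one buffer slot if stored plus reserved packets leave room."""
        if self.stored + self.reserved < self.buffer_size:
            self.reserved += 1
            return True
        return False

    def on_arrival(self, packet):
        """Turn a reservation into a stored packet once the edge delay has elapsed."""
        self.reserved -= 1
        self.stored += 1
        self.send_queue.append(packet)


def send_next(sender, receiver):
    """Try to send the head-of-queue packet; return it if in flight, else None."""
    packet = sender.send_queue.popleft()
    if receiver.try_reserve():
        sender.stored -= 1
        return packet
    sender.send_queue.append(packet)  # refused: requeue at the sender (assumed: at the back)
    return None
```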

B. Reliability and availability modeling

In this work, the vertex reliability is defined as the success
rate of a certain vertex sending packets. The link availability
is specified as the probability of survival from external dis-
ruption.

1) Vertex reliability: To model the degradation process of
vertex reliability, a Weibull degradation reliability function
considering time and utilization factors is utilized [33]. For
$v_i \in V$, its reliability $R_v^i(t)$ over time can be expressed as:

$$R_v^i(t) = \exp\left[-\left(\frac{m_i}{\eta_{p,i}} + \frac{t}{\eta_{t,i}}\right)^{\beta}\right], \quad \eta_{p,i} > 0,\ \eta_{t,i} > 0,\ \beta > 0. \tag{1}$$

In Equation 1, mi is the number of packets that have passed
through vertex vi so far. ηp,i and ηt,i are the utilization-
dependent and time-dependent scale parameters, respectively,
accounting for the degradation speed over workload and time.
The larger these scale parameters, the more slowly the
reliability function decreases. β is the shape parameter.
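
For concreteness, Equation 1 can be evaluated with a few lines of Python. The function name is hypothetical. The scale parameters of 3000 and the shape parameter of 2 fall within the ranges of Table I, while the packet count and time in the usage line are arbitrary illustrative values.

```python
import math


def vertex_reliability(m, t, eta_p, eta_t, beta):
    """Weibull-type reliability of a vertex after m forwarded packets and time t."""
    if eta_p <= 0 or eta_t <= 0 or beta <= 0:
        raise ValueError("scale and shape parameters must be positive")
    return math.exp(-((m / eta_p + t / eta_t) ** beta))


# m / eta_p + t / eta_t = 0.5 + 0.5 = 1, so the result is exp(-1), about 0.368.
print(vertex_reliability(m=1500, t=1500, eta_p=3000, eta_t=3000, beta=2))
```

The two degradation sources add inside the bracket, so a heavily used vertex loses reliability faster than an idle one of the same age.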

2) Edge availability: Due to the randomness of external
disruptions, such as adverse weather, severe magnetic
activity, or human error, regional wireless links might be
unavailable for a short time window. For edge $e_{i,j}$, the number
of external disruptions $D_{i,j}(t)$ within a time interval of
length t follows a Poisson distribution,
$D_{i,j}(t) \sim \mathrm{Poisson}(\lambda_{i,j} t)$ [34]:

$$P(D_{i,j}(t) = k) = \frac{(\lambda_{i,j} t)^{k} e^{-\lambda_{i,j} t}}{k!} \tag{2}$$

where $\lambda_{i,j}$ is the disruption rate, so that $\lambda_{i,j} t$ is the
expected number of disruptions within the interval t. If at
least one disruption happens on the link within t, all the
packets in transmission are dropped and requeued in the
sending node. The availability $A_e^{i,j}(t)$ of edge $e_{i,j}$ is:

$$A_e^{i,j}(t) = 1 - p_{i,j} \tag{3}$$

where $p_{i,j} = 1 - e^{-\lambda_{i,j} t}$ is the probability of at least one
disruption happening within t, so that $A_e^{i,j}(t) = e^{-\lambda_{i,j} t}$.
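
The availability model translates into the short sketch below. The function names are hypothetical; the rate of 0.1 lies in the range given in Table I, and the interval length of 10 is an arbitrary illustrative value.

```python
import math
import random


def edge_availability(lam, t):
    """Probability of zero Poisson disruptions on an edge within an interval t."""
    return math.exp(-lam * t)


def link_disrupted(lam, t, rng):
    """Sample whether at least one disruption occurs within the interval t."""
    return rng.random() < 1.0 - edge_availability(lam, t)


# With lam * t = 1 the availability is exp(-1), about 0.368.
print(edge_availability(0.1, 10))
```

Sampling only the event "at least one disruption" is sufficient here, because the model drops all in-flight packets regardless of how many disruptions occur.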

C. Objective function and constraints

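In this work, the objective of the routing decision is to
choose the next vertex for each packet, given its destination,
so as to minimize the average delivery time over the time
horizon T, subject to the sending capacity $C_i$ and storage
capacity $B_i$ constraints:

$$\begin{aligned}
\min \quad & \frac{1}{M_p^{\max}} \sum_{m=1}^{M_p^{\max}} \sum_{e_{i,j} \in E} \sum_{t=0}^{T} x_{i,j}^{m}(t) \cdot \frac{d_{i,j}(t)}{A_e^{i,j}(t) + \delta} \\
\text{s.t.} \quad & \sum_{m} \sum_{v_j \in V} x_{i,j}^{m}(t) \le C_i, \quad \forall i, t \\
& \sum_{m} \sum_{v_i \in V} x_{i,j}^{m}(t) \le B_j, \quad \forall j, t
\end{aligned} \tag{4}$$

where $x_{i,j}^{m}(t) \in \{0, 1\}$ is a binary variable that denotes
whether packet $p_m$ is routed from vertex $v_i$ to vertex $v_j$ at
time t, and $\delta > 0$ is a small positive constant that keeps the
ratio finite when the availability approaches zero.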
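
One way to read the ratio in Equation 4, offered here as an interpretation rather than something the source states: if each attempt to cross an edge is assumed to succeed independently with probability equal to the availability, the number of attempts is geometric with mean 1/A, so d/A approximates the expected crossing time including retries. The sketch below evaluates the objective for a list of recorded hops; the function names and the default value of `delta` are assumptions.

```python
def hop_cost(delay, availability, delta=1e-6):
    """Availability-adjusted cost of one hop: delay / (availability + delta)."""
    return delay / (availability + delta)


def average_cost(hops, n_packets, delta=1e-6):
    """Average objective value over n_packets.

    `hops` holds one (delay, availability) pair for every hop actually
    taken, i.e. every (m, i, j, t) with x = 1 in Equation 4.
    """
    return sum(hop_cost(d, a, delta) for d, a in hops) / n_packets


# Two packets, one hop each: a fully available edge and a half-available edge.
print(average_cost([(10.0, 1.0), (10.0, 0.5)], n_packets=2))
```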


IV. RELIABILITY-ORIENTED REINFORCEMENT LEARNING

To solve the sequential decision-making routing problem,
this research uses an online multi-agent Q-routing
reinforcement learning algorithm. The algorithm is based
on a Markov decision process and a three-dimensional Q-
table. Each node acts as an agent and routes packets to
its neighbours [29].

A. Markov decision process

The telecommunication network routing problem is formulated
as a Markov decision process (MDP) defined by the tuple Z =
⟨S,A,R,P , γ⟩ with finite time horizon T = Nt.

• State Space S: we define the state of a packet at time t as
st ∈ S, where st = {xt, d}. xt is the index of the vertex
that holds the packet at the current time t, and d is the
destination vertex of the packet.

• Action Space A: For a given vertex, an action ai is a
neighboring vertex to which the packet can be forwarded
at time t. The action space A of this vertex is the set of
all such actions {ai}, where i indexes the available
neighbors.

• Reward Space R: the elements of R are the reward
functions r (st+1|st, at), defined in Equation 5. In this
reward function, qi,t is the length of the storage buffer,
qeq represents the equivalent queue length, w × ngrow is
the penalty for queue growth, and R(t) is the reliability
of the next vertex. An additional reward of 2000 is given
if a packet reaches its destination, and −50 if the packet
is completely dropped.

• Transition probability matrix P : P (st+1|st, at)→ [0, 1]
is the probability of moving from st to st+1 given the
action at.

• Discount factor γ: the discount factor γ ≤ 1 determines
the weight given to future rewards.

$$r(s_{t+1} \mid s_t, a_t) = 50\,R(t) + q_{eq} - q_{i,t} - w \times n_{grow} \tag{5}$$

At each time step t, where t = 0, 1, 2, ..., Nt, the vertex
agent vi makes a routing decision ai,t for each packet in
its sending queue and obtains the reward ri,t.
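
The reward of Equation 5, together with the delivery bonus and drop penalty, can be written as a small function. This is a sketch: the argument names are hypothetical, the source does not give values for w or define how the equivalent queue length is computed, and the numbers in the usage line are purely illustrative.

```python
DELIVERY_BONUS = 2000.0  # added when the packet reaches its destination
DROP_PENALTY = -50.0     # added when the packet is completely dropped


def reward(reliability, q_eq, q_len, w, n_grow, delivered=False, dropped=False):
    """Reward of Equation 5 plus the terminal bonus or penalty."""
    r = 50.0 * reliability + q_eq - q_len - w * n_grow
    if delivered:
        r += DELIVERY_BONUS
    if dropped:
        r += DROP_PENALTY
    return r


# Illustrative values only: 50 * 1.0 + 10 - 4 - 2 * 3 = 50.
print(reward(reliability=1.0, q_eq=10, q_len=4, w=2.0, n_grow=3))
```

The reliability term dominates the queue terms for healthy vertices, so the agent is steered towards reliable next hops unless their buffers are much longer than the equivalent queue length.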

B. Multi-agent Q-routing algorithm

Following the formalization of the sequential decision-
making problem as an MDP, the objective is
to determine an optimal policy π⋆ (s) that maximizes the
expected cumulative discounted reward [30]. This is
intrinsically linked to the concept of Q-values, which quantify
the desirability of taking a specific routing action in a given
state. A Q-value, denoted as Q (st, at), represents the expected
total discounted future reward obtained by executing action
at in state st and subsequently following an optimal policy.
The optimal Q-function, Q⋆ (st, at), satisfies the Bellman
optimality equation:

$$Q^{\star}(s_t, a_t) = R(s_{t+1} \mid s_t, a_t) + \gamma \sum_{s_{t+1}} P(s_{t+1} \mid s_t, a_t) \max_{a_{t+1}} Q^{\star}(s_{t+1}, a_{t+1}) \tag{6}$$

After executing action at in state st, observing immediate
reward R (st+1|st, at), and transitioning to state st+1, the Q-
value for the state-action pair Q (st, at) is updated as follows:

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ R(s_{t+1} \mid s_t, a_t) + \gamma \max_{a_{t+1}} Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \right] \tag{7}$$

In this update, α ∈ (0, 1] is the learning rate, controlling
the extent to which new information overrides existing
Q-value estimates. The bracketed term
$R(s_{t+1} \mid s_t, a_t) + \gamma \max_{a_{t+1}} Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t)$
is the temporal difference (TD) error, representing the
discrepancy between the current Q-value estimate and a more
accurate target value; the update moves the estimate by a
fraction α of this error. The term
$\gamma \max_{a_{t+1}} Q(s_{t+1}, a_{t+1})$ provides the discounted
estimate of the optimal value of the next state, which drives
the estimate towards $Q^{\star}(s_t, a_t)$. Under conditions of
sufficient exploration (all state-action pairs visited infinitely
often) and a suitably decaying learning rate, tabular
Q-learning is guaranteed to converge to the optimal
Q-function [31].
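
The update of Equation 7 can be sketched with a table keyed by three indices. The indexing (current vertex, destination, next hop) is an assumption about how the three-dimensional Q-table is organised, chosen because it matches the state {xt, d} and the neighbor-valued action. The class name and the treatment of delivery as a terminal step with zero future value are also assumptions.

```python
from collections import defaultdict


class QTable:
    def __init__(self, neighbors):
        self.neighbors = neighbors   # dict: vertex -> list of neighbouring vertices
        self.q = defaultdict(float)  # (vertex, destination, next_hop) -> Q-value

    def best_value(self, vertex, destination):
        """Largest Q-value over the neighbours of `vertex` for this destination."""
        return max(self.q[(vertex, destination, n)] for n in self.neighbors[vertex])

    def update(self, vertex, destination, next_hop, reward, alpha, gamma,
               terminal=False):
        """Apply one Q-learning step and return the TD error."""
        future = 0.0 if terminal else self.best_value(next_hop, destination)
        key = (vertex, destination, next_hop)
        td_error = reward + gamma * future - self.q[key]
        self.q[key] += alpha * td_error
        return td_error


# Line network 0 - 1 - 2 with alpha = 0.2 and gamma = 0.9 as in Table II.
table = QTable({0: [1], 1: [0, 2], 2: [1]})
table.update(1, 2, 2, reward=2000.0, alpha=0.2, gamma=0.9, terminal=True)
table.update(0, 2, 1, reward=0.0, alpha=0.2, gamma=0.9)
print(table.q[(1, 2, 2)], table.q[(0, 2, 1)])
```

In the usage lines, the delivery reward earned at vertex 1 propagates one hop back to vertex 0 on the second update, which is how each agent learns about paths it never observes directly.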

To encourage the exploration of the large action space,
agents follow an ϵ-greedy policy:

$$a = \begin{cases} \arg\max_{a} Q(s, a), & \text{with probability } 1 - \epsilon \\ \text{random action}, & \text{with probability } \epsilon \end{cases} \tag{8}$$

The parameter ϵ decays over time with a decay rate rϵ. The
ϵ-greedy algorithm prevents the agent from getting stuck in a
suboptimal loop by ensuring it tries out different options in
a large action space. The key benefit is the balance between
exploiting known good actions and exploring new ones, which
is crucial for discovering the most effective long-term strategy.
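
A minimal sketch of the ϵ-greedy rule of Equation 8 follows. The function names are hypothetical. The initial ϵ = 0.6 and decay rate 0.99995 are the values of Table II, but applying the decay multiplicatively once per decision step is an assumption, since the source does not specify the decay schedule.

```python
import random


def epsilon_greedy(q_values, epsilon, rng):
    """Pick a next hop from `q_values`, a dict mapping next hop -> Q-value."""
    if rng.random() < epsilon:
        return rng.choice(list(q_values))     # explore: uniform random neighbour
    return max(q_values, key=q_values.get)    # exploit: highest Q-value


def decayed_epsilon(epsilon0, rate, steps):
    """Exploration rate after `steps` multiplicative decays (assumed schedule)."""
    return epsilon0 * rate ** steps


rng = random.Random(0)
q = {"b": 1.0, "c": 3.0, "d": 2.0}
print(epsilon_greedy(q, decayed_epsilon(0.6, 0.99995, 10000), rng))
```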

V. EXPERIMENTAL RESULTS

To evaluate the performance of the multi-agent Q-routing
algorithm, we conduct a set of experiments on a simulated
telecommunication network G with Nv = 100 vertices and
Ne = 291 edges. The number of packets generated initially is
Mp = 3000, and the maximum number of packets generated in the
network is $M_p^{\max} = 36000$. The time horizon is T = 1000. The
other simulation environmental parameters are listed in Table I,
and the algorithm parameters of Q-routing are shown in Table II.

The key metrics in the learning process, including average
delivery time, delivery rate, and total rewards, are shown in
Figs. 1, 2, and 3, respectively.

The plot of average delivery time (see Fig. 1) demonstrates
the agent’s efficiency. In the first few episodes, the average
time is extremely high, which, combined with the low delivery
rate, suggests inefficient exploration or failed attempts. The
delivery time then plummets dramatically and stabilizes at
a low value (around 25 episodes) for the remainder of the
training.

TABLE I
SIMULATION ENVIRONMENTAL PARAMETERS SETTING

Environmental Parameters                               Values
Sending capacity $C_i$                                 8 ∼ 12
Storage capacity $B_i$                                 70 ∼ 80
Edge base delay $d_{i,j}^{(0)}$                        10
Utilization-dependent scale parameter $\eta_{p,i}$     2400 ∼ 3600
Time-dependent scale parameter $\eta_{t,i}$            3000
Shape parameter $\beta$                                2
Poisson disruption rate $\lambda_{i,j}$                0.1 ∼ 0.2

TABLE II
Q-ROUTING ALGORITHM PARAMETERS SETTING

Algorithm Parameters                                   Values
Learning rate $\alpha$                                 0.2
Learning decay rate $r_{\alpha}$                       0.99
Discount factor $\gamma$                               0.9
Greedy exploration parameter $\epsilon$                0.6
Greedy decay rate $r_{\epsilon}$                       0.99995
Number of episodes $N_I$                               100

The delivery rate metric (see Fig. 2) provides insight into
the agent’s effectiveness. The delivery rate rapidly increases
to nearly 100% in the same initial 35-episode period. This
shows that the agent quickly learns the fundamental goal of
making successful deliveries. The rate then remains stable at
this near-perfect level, confirming the agent has mastered the
delivery aspect of its task.

The total rewards per episode (see Fig. 3) start at a negative
value, indicating that the agent initially fails to send packets
correctly and incurs penalties. However, the agent learns very
quickly, with the reward showing a steep, linear increase
within the first 35 episodes. After this initial phase, the reward
stabilizes at a high plateau of approximately $6 \times 10^{7}$,
indicating that the agent has successfully converged on a highly
effective and consistent policy for maximizing its reward. The
learning curve is strongly correlated with the delivery rate.

Fig. 1. Average delivery time over episodes during learning.

Fig. 2. Delivery rate over episodes during learning.

Fig. 3. Total rewards during learning.

To test the proposed Q-routing algorithm, we compare it
with a weighted shortest path method, which constructs
the edge weights from reliability and real-time delay
and searches for the optimal path using Dijkstra’s
algorithm. The average delivery time, average delivery rate,
and computation time results are presented in Figs. 4, 5, and
6, respectively. The computation time measures the speed
of finding the optimal path, which relates to the execution
complexity of each algorithm: O (n) for Q-routing and
$O(N_v^2)$ for the weighted shortest path. The degradation
resistance factor dr shown on the x-axis shifts the utilization-
dependent scale parameter ηp,i of all the vertices:

$$\eta_{p,i,\text{shifted}} = \eta_{p,i,\text{original}} \times (1 + dr) \tag{9}$$

Fig. 4. Average delivery time comparison between Q-learning and weighted
shortest path.

The parameter dr is designed to evaluate the performance
of the Q-routing algorithm under various network reliability
levels, where a larger dr value results in a slower rate of
network degradation.
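
For reference, a weighted shortest path baseline of this kind can be sketched with Dijkstra's algorithm, alongside the scale shift of Equation 9. The source does not give the exact edge-weight formula, so the weights are taken here as precomputed non-negative numbers, and the function names are hypothetical. The sketch uses a binary heap, whose complexity differs from the quadratic bound quoted above; that bound corresponds to an array-based implementation.

```python
import heapq


def shifted_scale(eta_p, dr):
    """Equation 9: shift the utilization-dependent scale parameter by factor dr."""
    return eta_p * (1.0 + dr)


def weighted_shortest_path(adj, source, target):
    """Dijkstra's algorithm on adj: dict vertex -> dict neighbour -> weight >= 0.

    Returns (path, cost), or (None, inf) if the target is unreachable.
    """
    dist = {source: 0.0}
    prev = {}
    heap = [(0.0, source)]
    done = set()
    while heap:
        d, u = heapq.heappop(heap)
        if u in done:
            continue  # stale heap entry
        done.add(u)
        if u == target:
            break
        for v, w in adj[u].items():
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                prev[v] = u
                heapq.heappush(heap, (nd, v))
    if target not in dist:
        return None, float("inf")
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    return path[::-1], dist[target]


# Illustrative weights only: the two-hop route a-b-c beats the direct edge a-c.
adj = {"a": {"b": 10.0, "c": 25.0},
       "b": {"a": 10.0, "c": 10.0},
       "c": {"a": 25.0, "b": 10.0}}
print(weighted_shortest_path(adj, "a", "c"))
```

Because delays and reliabilities change over time, a baseline of this type has to recompute paths as the weights change, whereas a Q-routing agent only looks up and updates local table entries.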

The comparison shows that Q-routing outperforms the weighted
shortest path on all three key performance indicators under
identical test conditions. Q-routing maintains a high and stable
delivery rate, consistently exceeding 90%, whereas the weighted
shortest path method achieves only approximately 45-50%.
Q-routing is also substantially faster, with average delivery
times roughly half those of the weighted shortest path. In terms
of computational efficiency, Q-routing requires less than 30
seconds of computation time, while the weighted shortest path
needs nearly three minutes, and this time increases with the
degradation resistance factor. The smaller error bars associated
with Q-routing also indicate more consistent and predictable
performance.

In conclusion, the adaptive nature of Q-routing offers a
clear advantage over the static path-finding of the weighted
shortest path algorithm. The Q-routing algorithm is not only
more effective at ensuring successful and timely deliveries, but
is also more computationally efficient. Across the tested
degradation resistance factors, Q-routing remains robust and
performs better than the weighted shortest path.

VI. CONCLUSION

This paper introduces a proactive, reliability-oriented routing
strategy designed to enhance the resilience of
telecommunication networks against component failures. The core
of our framework is a novel decision-making approach that
integrates predictive models of component reliability and
availability with traditional network metrics. To address the
complexity of this optimization problem, we employ a multi-agent
Q-routing algorithm where each network node acts as a
decentralized agent, learning an optimal policy to intelligently
route traffic and minimize packet delivery time in a dynamically
degrading environment. In our simulations, the Q-routing
approach significantly outperformed the weighted shortest path
method across all key metrics: it yielded a much higher delivery
rate, reduced packet delivery times by approximately 50%, and
required less computation time.

Fig. 5. Average delivery rate comparison between Q-learning and weighted
shortest path.

Fig. 6. Computation time comparison between Q-learning and weighted
shortest path.

While this work demonstrates a significant improvement
over traditional methods, several avenues exist for future
research. The current implementation relies on a Q-table,
which can be inefficient and slow to adapt in highly dynamic
or large-scale network environments. Future iterations could
leverage deep reinforcement learning techniques, such as Deep
Q-Networks (DQN) and proximal policy optimization (PPO),
to better handle state-space complexity and improve adaptabil-
ity. Furthermore, the simulation was conducted with a finite
number of packets; future studies should consider modeling
continuous network flows to more accurately reflect real-
world traffic loads and their impact on network degradation.
Finally, the reliability and degradation models, while effective,
could be refined by incorporating more granular, data-driven
factors to create a more precise representation of real-world
component aging and failure mechanisms.

ACKNOWLEDGMENT

This work was supported in part by the Boeing Company
under Grant RG93345.

REFERENCES

[1] A. Uzoka, E. Cadet, and P. U. Ojukwu, ”The role of telecommunications
in enabling Internet of Things (IoT) connectivity and applications,”
Comprehensive Research and Reviews in Science and Technology, vol.
2, no. 02, pp. 055-073, 2024.

[2] L. Xing, ”Cascading failures in Internet of Things: Review and perspec-
tives on reliability and resilience,” IEEE Internet of Things Journal, vol.
8, no. 1, pp. 44-64, 2020.

[3] W. Hou, ”Integrated reliability and availability analysis of networks with
software failures and hardware failures,” Ph.D. dissertation, Dept. Elect.
and Comput. Eng., Univ. of Waterloo, Waterloo, ON, Canada, 2003.

[4] M. Chlosta, D. Rupprecht, T. Holz, and C. Pöpper, ”LTE security
disabled: misconfiguration in commercial networks,” Proc. 12th Conf.
Security and Privacy in Wireless and Mobile Networks (WiSec), May
2019, pp. 261-266.

[5] C. G. Tuppen, ”Energy and telecommunications—an environmental
impact analysis,” Energy & Environment, vol. 3, no. 1, pp. 70-81, 1992.

[6] S. S. Savas, M. F. Habib, M. Tornatore, F. Dikbiyik, and B. Mukherjee,
”Network adaptability to disaster disruptions by exploiting degraded-
service tolerance,” IEEE Communications Magazine, vol. 52, no. 12,
pp. 58-65, 2014.

[7] E. Koks, R. Pant, S. Thacker, and J. W. Hall, ”Understanding business
disruption and economic losses due to electricity failures and flooding,”
International Journal of Disaster Risk Science, vol. 10, no. 4, pp. 421-
438, 2019.

[8] S. A. Changazi et al., ”Optimization of network topology robustness in
IoTs: A systematic review,” Computer Networks, vol. 246, p. 110568,
2024.

[9] S. K. Chaturvedi, Network Reliability: Measures and Evaluation. Hobo-
ken, NJ: John Wiley & Sons, 2016.

[10] B. S. Awoyemi, A. S. Alfa, and B. T. Maharaj, ”Network restoration for
next-generation communication and computing networks,” Journal of
Computer Networks and Communications, vol. 2018, Art. ID 4134878,
2018.

[11] K. S. Trivedi and A. Bobbio, Reliability and Availability Engineering:
Modeling, Analysis, and Applications. Cambridge, UK: Cambridge
University Press, 2017.

[12] M. Liu and D. M. Frangopol, ”Optimizing bridge network maintenance
management under uncertainty with conflicting criteria: Life-cycle main-
tenance, failure, and user costs,” Journal of Structural Engineering, vol.
132, no. 11, pp. 1835-1845, 2006.

[13] M. Jacobsson and C. Rohner, ”Estimating packet delivery ratio for ar-
bitrary packet sizes over wireless links,” IEEE Communications Letters,
vol. 19, no. 4, pp. 609-612, 2015.

[14] A. Abd Aziz, Y. A. Sekercioglu, P. Fitzpatrick, and M. Ivanovich, ”A sur-
vey on distributed topology control techniques for extending the lifetime
of battery powered wireless sensor networks,” IEEE Communications
Surveys & Tutorials, vol. 15, no. 1, pp. 121-144, 2013.

[15] V. C. Gungor, B. Lu, and G. P. Hancke, ”Opportunities and challenges of
wireless sensor networks in smart grid,” IEEE Transactions on Industrial
Electronics, vol. 57, no. 10, pp. 3557-3564, 2010.

[16] R. H. Khan and J. Y. Khan, ”A comprehensive review of the application
characteristics and traffic requirements of a smart grid communications
network,” Computer Networks, vol. 57, no. 3, pp. 825-845, 2013.

[17] J. Zhao, Y. Jin, K. S. Trivedi, and R. Matias Jr., ”Injecting memory leaks
to accelerate software failures,” Proc. 22nd IEEE Int. Symp. Software
Reliability Engineering (ISSRE), Nov. 2011, pp. 260-269.

[18] G. Egeland and P. E. Engelstad, ”The availability and reliability of
wireless multi-hop networks with stochastic link failures,” IEEE Journal
on Selected Areas in Communications, vol. 27, no. 7, pp. 1132-1146,
2009.

[19] M. Wildemeersch and J. Fortuny-Guasch, ”Radio frequency interference
impact assessment on global navigation satellite systems,” EC Joint
Research Centre, Ispra, Italy, Tech. Rep. JRC56534, pp. 50-51, 2010.

[20] J. Rak et al., ”Fundamentals of communication networks resilience to
disasters and massive disruptions,” Guide to Disaster-Resilient Commu-
nication Networks, J. Rak and D. Hutchison, Eds. Cham, Switzerland:
Springer, 2020, pp. 1-43.

[21] C. Perkins, E. Belding-Royer, and S. Das, ”Ad hoc On-Demand Distance
Vector (AODV) Routing,” No. RFC 3561, 2003.

[22] T. Clausen and P. Jacquet, Eds., ”Optimized Link State Routing Protocol
(OLSR),” No. RFC 3626, 2003.

[23] A. Jiang and L. Zheng, ”An effective hybrid routing algorithm in WSN:
Ant colony optimization in combination with hop count minimization,”
Sensors, vol. 18, no. 4, p. 1020, 2018.

[24] T. Mazhar et al., ”Quality of service (QoS) performance analysis in a
traffic engineering model for next-generation wireless sensor networks,”
Symmetry, vol. 15, no. 2, p. 513, 2023.

[25] X. Ni, K. C. Lan, and R. Malaney, ”On the performance of expected
transmission count (ETX) for wireless mesh networks,” Proc. 3rd
Int. Conf. Performance Evaluation Methodologies and Tools (VALUE-
TOOLS), pp. 1-10, 2008.

[26] C. Kohlstruck and R. Gotzhein, ”dRmin-Routing – A Decentralized
Algorithm for Reliability-constrained Routing in Wireless Ad-hoc
Networks,” Proc. Int. Wireless Commun. and Mobile Comput. (IWCMC),
pp. 1386-1393, 2022.

[27] K. Ergun, R. Ayoub, P. Mercati, and T. Rosing, ”Improving mean time
to failure of IoT networks with reliability-aware routing,” Proc. 10th
Mediterranean Conf. Embedded Comput. (MECO), Jun. 2021, pp. 1-4.

[28] K. Ergun, R. Ayoub, P. Mercati, and T. Rosing, ”Reinforcement learning
based reliability-aware routing in IoT networks,” Ad Hoc Networks, vol.
132, p. 102869, 2022.

[29] Q. Zhang, Y. Liu, Y. Xiang, and T. Xiahou, ”Reinforcement learning
in reliability and maintenance optimization: A tutorial,” Reliability
Engineering & System Safety, vol. 251, p. 110401, 2024.

[30] L. Tan, F. Wei, X. Ma, R. Peng, H. Xiao, and L. Yang, ”Systemic
Condition-Based Maintenance Optimization Under Inspection Uncer-
tainties: A Customized Multiagent Reinforcement Learning Approach”,
IEEE Transactions on Reliability, 2025.

[31] C. J. C. H. Watkins and P. Dayan, ”Q-learning,” Machine Learning, vol.
8, no. 3-4, pp. 279-292, 1992.

[32] R. Albert and A.-L. Barabási, ”Statistical mechanics of complex
networks,” Reviews of Modern Physics, vol. 74, no. 1, pp. 47-97, 2002.

[33] W. Ahmad, O. Hasan, U. Pervez, and J. Qadir, ”Reliability modeling
and analysis of communication networks,” Journal of Network and
Computer Applications, vol. 78, pp. 191-215, 2017.

[34] S. Zarezadeh, S. Ashrafi, and M. Asadi, ”A shock model based
approach to network reliability,” IEEE Transactions on Reliability, vol.
65, no. 2, pp. 992-1000, 2015.