Transformer-based Pavement Crack Tracking with Neural-PID
Controller on Vision-guided Robot

Jianqi Zhang1,2, Xu Yang2,3,∗, Wei Wang1, Ioannis Brilakis4, Hainian Wang3 and Ling Ding5

1School of Information Engineering, Chang’an University, Xi’an, 710064, China
2School of Future Transportation, Chang’an University, Xi’an, 710064, China

3School of Highway, Chang’an University, Xi’an, 710064, China
4Laing O’Rourke Centre, Engineering Department, Cambridge University, Cambridge, CB2 1PZ, United Kingdom

5School of Transportation Engineering, Chang’an University, Xi’an, 710064, China

jqzhang@chd.edu.cn, yang.xu@chd.edu.cn, wei.wang@chd.edu.cn, ib340@cam.ac.uk, wanghn@chd.edu.cn

Abstract -

Pavement crack tracking in unstructured road environments remains a crucial and challenging task that plays a vital role in accurate crack sealing for automated pavement crack repair. However, slender cracks suffer from insufficient feature extraction and low tracking efficiency. In this article, a hybrid adaptive control scheme that combines a self-tuning neural network with a proportional–integral–derivative (PID) controller is proposed for dynamic visual tracking of pavement cracks. Specifically, the scheme extracts crack features on the road image plane with the S2TNet system and determines an optimal control input to guide the robot. S2TNet cross-integrates global features through a multi-head attention module and adaptively recalibrates the channel responses of partial feature maps for fusion with the transformer module. Moreover, a Neural–PID controller is designed to adjust the control parameters adaptively, and the scheme was validated on a physical robot platform. Extensive experimental results demonstrated the effectiveness of the proposed method in achieving real-time tracking of pavement cracks.

Keywords -

Crack Tracking; Crack Segmentation; Transformer; Neural–PID Control; Mobile Robot

1 Introduction

Pavement cracks are prevalent and hazardous defects that significantly impact driving safety in highway transportation. They arise primarily from heavy traffic loads, subpar construction practices, climate effects, and inadequate drainage[1, 2]. If cracks are not repaired promptly, rainwater ingress accelerates the deterioration of the pavement structure. Even a small crack can rapidly degrade into a pothole overnight, posing a significant hazard to high-speed driving[3, 4]. Hence, regular maintenance and repair of pavement cracks are imperative to prevent crack deterioration and ensure traffic safety[5, 6]. Manual sealing is the conventional approach to repairing pavement cracks, but it is time-consuming, expensive, and subjective. There is therefore a growing demand for automated and efficient repair methods based on pavement crack tracking.

Recent studies have primarily focused on crack segmentation with convolutional neural network (CNN)-based methods in road environments. For instance, [7] constructed a novel crack segmentation network called CrackW-Net and designed a skip-level round-trip sampling block that can easily be used in various network structures. The mobile robot system developed in [8] can effectively segment pavement cracks in real scenarios at a speed of 25 frames per second. [9] used a three-dimensional (3D) printer as a crack-filling machine. In recent years, path tracking research based on the motion control of mobile devices has become popular. A crack sealing system was designed to control an experimental 3D printer to repair cracks[2]. [10] proposed cross-entropy-based adaptive fuzzy control for crack tracking with VT-UMbot.

Insufficient feature extraction is largely caused by the limited receptive field of CNN segmentation models, and it often leads to coarse segmentation of the cracks. Over the years, researchers have proposed various techniques to improve object detectability. These approaches include encoder-decoder[11], multi-scale attention[12], and multi-scale feature extraction[13]. Additionally, efforts have been made to enhance object feature representation[14] and fusion[15]. Despite these advancements, challenges persist, such as inadequate detection of detailed features and susceptibility to background lighting conditions. On the other hand, low tracking efficiency is caused by the extreme length-to-width ratio and complex topology of slender pavement cracks, which lead to irregular paths. Path tracking research has mainly focused on trajectories that follow regular distribution rules. Recent tracking control methods range from traditional PID to various optimized and improved PID schemes such as fuzzy control[10], genetic algorithms[16] and ant colony algorithms[17]. However, the difficulty of tuning control parameters in specialized environments significantly degrades path tracking performance.

This article presents a pavement crack tracking framework that enhances tracking efficiency in unstructured road scenarios by fusing real-time crack video context features with transformer-based segmentation and by proposing a Neural–PID control strategy for crack tracking. Extensive experiments are conducted to verify that the framework addresses insufficient feature extraction and low tracking efficiency. The contributions of this work are fourfold:

• To address the problem of pavement crack tracking, a joint scheme comprising a transformer-based fusion model and Neural–PID tracking control is proposed. The scheme achieves stable real-time tracking of pavement cracks.

• To improve crack segmentation in challenging road conditions where feature extraction is insufficient, this article introduces a transformer-based fusion model that leverages multi-fusion strategies to address coarse crack feature extraction.

• Considering the slender shape and irregular path of pavement cracks, a Neural–PID tracking control method is proposed to improve tracking performance. Specifically, the control parameters are adjusted adaptively by a neural network.

• Extensive experiments are conducted on the self-created S2T-Crack dataset, and the proposed algorithm is successfully deployed on a self-developed vision-guided robot. The results show that our method achieves state-of-the-art performance.

The remainder of this article is organized as follows. Section 2 reviews related work. Section 3 outlines the detailed design of our methodology. Section 4 presents the experimental validation of our approach. Finally, Section 5 summarizes the article and discusses future directions.

2 Related works

This section reviews the literature relevant to our proposed pavement crack tracking.

Crack Segmentation. Crack segmentation is a crucial distress inspection technique for different infrastructures, including roads, bridges, tunnels, airports and buildings. Numerous crack segmentation methods have been developed based on deep learning. YOLOv5[18] is a single-stage object detection model whose backbone network incorporates Cross-Stage Partial (CSP) and Spatial Pyramid Pooling-Fast (SPPF) modules and whose neck network uses a Feature Pyramid Network (FPN) and a Path Aggregation Network (PAN). A lightweight pavement crack detection model has been proposed to perform the dual tasks of object detection and semantic segmentation[19].

However, CNN models primarily focus on local feature extraction, which may result in information ambiguity and coarse segmentation when long-range dependencies are involved. Therefore, this research fuses YOLOv5 with a transformer to achieve effective crack segmentation.

Vision Transformer. Thanks to its strong representation capabilities, researchers are exploring ways to apply the transformer to computer vision tasks. On various visual benchmarks, transformer-based models perform similarly to or better than CNN-based networks. [20] reviewed visual transformer models by classifying them according to task and analyzing their advantages and disadvantages. A video instance segmentation framework based on the transformer, called VisTR, regards the VIS task as a direct end-to-end parallel sequence decoding/prediction problem[21]. [22] designed a segmentation model called SEgmentation TRansformer (SETR), and extensive experiments show that SETR achieves competitive results on Cityscapes.

Compared with CNNs, the transformer incurs higher computational costs and longer training times. Because crack features are subtle, fine-grained segmentation of cracks is crucial. Therefore, this research introduces self-attention and cross-attention mechanisms to enhance feature extraction.

PID Control. PID control is widely used in the path tracking control of mobile robots. When no model of the robot is available, the PID controller may be the best choice because it is model-free and its parameters can easily be adjusted separately. However, the parameters depend on manually chosen empirical values, and parameter optimization remains a challenge. [23] used an adaptive PID controller that adjusts the front wheel angle according to the tracking error. A robust PID controller for the flight control of quadrotor aircraft is proposed in [24]. A cross-entropy-based adaptive fuzzy control (CEAFC) method is proposed for PID parameter tuning[10].

Traditional PID controllers are susceptible to external disturbances during parameter adjustment, which leads to convergence issues and system uncertainty. To address these challenges, this study proposes the Neural-PID approach to ensure effective tracking performance.


Figure 1. General framework of our proposed scheme for pavement crack tracking on a vision-guided robot. It mainly includes two separate modules: transformer-based crack segmentation (with two branches and three fusion modules) and Neural-PID crack tracking (a three-layer network). All modules are implemented on the unified YOLOv5 framework, and the details of each module are shown in Figure 2. Both the input video images and the test results come from the S2T-Crack dataset.

3 Methodology

This section presents the details of our proposed method. It first describes the overall pavement crack tracking framework and then the two modules that are deployed on a vision-guided robot to achieve crack tracking.

3.1 Framework

This article focuses on two key aspects of crack tracking in road environments. Firstly, it addresses the challenge of achieving accurate crack segmentation in pavement scenarios characterized by slender cracks and complex backgrounds. Secondly, it examines the low tracking efficiency of crack tracking control methods when parameter tuning is limited. To address these challenges, a crack tracking framework is proposed that combines a transformer-based fusion network with a Neural–PID tracking control algorithm. This framework, illustrated in Figure 1, comprises two main modules: transformer-based crack segmentation and Neural–PID tracking control. The feature fusion module combines YOLOv5 with a transformer to encode and decode crack video images, enabling the fusion of image pixels at the feature level. To tune the tracking controller parameters adaptively and more quickly, a three-layer neural network is used. The following subsections describe the framework in detail.

3.2 Crack Segmentation with Transformer

The proposed module combines YOLOv5 with a transformer to encode and decode crack video images, enabling the fusion of image pixels at the feature level. In contrast to the original YOLOv5, this study adopts a two-branch convolutional neural network backbone. This backbone, illustrated by the light-green modules in Figure 2, is designed to extract crack features between video frames from a vision-guided robot. Fusion with the FT modules occurs at three distinct stages, so that the fused features contain both coarse-grained and fine-grained semantic information.

A common layer in the encoder and decoder structure is multi-head attention, which consists of multiple parallel self-attention mechanisms. In self-attention, Q, K, and V are three vectors calculated from the same input (such as a word in a sequence). Specifically, Q, K, and V are obtained by applying a linear transformation (e.g., a fully connected layer) to the embedding of the original input. The three vectors usually have the same dimension, which is chosen during model design. During the computation of self-attention, Q and K are used to calculate attention scores, which represent the relationship between the current position and the other positions. The scores are obtained by taking the dot product of Q and K, dividing the result by 8 (the square root of the key dimension when that dimension is 64, as in the standard transformer), and applying softmax normalization. This process yields a weight for each position. These weights are then used to compute the weighted sum of V, which is the output for the current position. To realize the proposed FT fusion module, the feature extraction network of YOLOv5 is extended and redesigned as a backbone composed of two streams to achieve modal fusion and interaction.
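The self-attention computation described above can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the authors' implementation: the function names `softmax` and `scaled_dot_product_attention` and the toy inputs are assumptions, and the scaling factor is written as the square root of the key dimension, which equals the value 8 quoted in the text when that dimension is 64.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(Q, K, V):
    """Q, K, V are lists of vectors (one vector per position).

    Returns the attended outputs and the attention weights.
    """
    d_k = len(K[0])
    scale = math.sqrt(d_k)  # equals 8 when d_k = 64
    weights = []
    for q in Q:
        # Dot product of the query with every key, then scaling.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in K]
        weights.append(softmax(scores))
    d_v = len(V[0])
    outputs = [
        [sum(w[j] * V[j][c] for j in range(len(V))) for c in range(d_v)]
        for w in weights
    ]
    return outputs, weights


# Toy example (hypothetical values): two positions, two-dimensional vectors.
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
outputs, weights = scaled_dot_product_attention(Q, K, V)
```

Each row of `weights` sums to one, and each position attends most strongly to the key that matches its own query. Multi-head attention runs several such computations in parallel on different linear projections of the input.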

3.3 Neural–PID Control for Crack Tracking

In actual pavement crack path tracking motion control, the control environment is complex and the controlled object is nonlinear and time-varying, so conventional PID control cannot adapt its parameters online and therefore adapts poorly. A multi-layer feedforward neural network trained with error back propagation is called a back propagation (BP) neural network. It performs well in nonlinear mapping tasks such as function approximation and pattern recognition. The BP neural network model has three layers: an input layer, a hidden layer and an output layer. The input layer determines the type and number of inputs. The hidden layer, through its number of layers and its activation functions, makes nonlinear mapping possible. The output layer generates the network outputs. The output of a neuron is usually expressed as a nonlinear function of the weighted combination of its inputs, for example the hyperbolic tangent:

$$ f(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}} \tag{1} $$

The three gain parameters of the PID control scheme must be non-negative and are output by the BP neural network, so the sigmoid function and other functions without negative output values are applied in the output layer.

$$ g(x) = u \cdot \frac{1}{1 + e^{-x}} \tag{2} $$

$$ h(x) = \min(\max(0, x), u) \tag{3} $$

$$ t(x) = u \cdot \frac{e^{x}}{e^{x} + e^{-x}} \tag{4} $$

where u is the upper bound of the output and is used to regulate the output range.
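As a quick check of Equations (2) to (4), the three bounded output functions can be written directly in Python. This is a minimal sketch; the function names follow the symbols in the equations, and the upper bound `u = 2.0` in the example is a hypothetical value chosen only for illustration.

```python
import math


def g(x, u):
    """Scaled sigmoid, Equation (2): output lies in (0, u)."""
    return u / (1.0 + math.exp(-x))


def h(x, u):
    """Clipped linear function, Equation (3): output lies in [0, u]."""
    return min(max(0.0, x), u)


def t(x, u):
    """Scaled exponential ratio, Equation (4): output lies in (0, u).

    Written literally from the equation, so it is intended for moderate |x|.
    """
    return u * math.exp(x) / (math.exp(x) + math.exp(-x))


u = 2.0  # hypothetical upper bound on a PID gain
samples = [-5.0, -1.0, 0.0, 1.0, 5.0]
values = [(x, g(x, u), h(x, u), t(x, u)) for x in samples]
```

All three functions return values between 0 and u, which is what keeps the PID gains non-negative and bounded. Equation (4) is algebraically the same as Equation (2) evaluated at 2x, so it saturates faster than the plain scaled sigmoid.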

The back propagation neural network nonlinearly maps the input, output and error to the three parameters kp, ki and kd of the PID controller. The BP neural network has three neurons in the input layer, five neurons in the hidden layer, and three neurons in the output layer. The commonly used Tanh function is used in the hidden layer. By combining the BPNN with the PID control algorithm, the PID control parameters can be self-tuned online, which improves the pavement crack tracking motion control. The structure of the Neural–PID scheme is shown in Figure 1.
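The 3-5-3 structure described above can be illustrated with a forward pass followed by one PID update. The sketch below rests on several assumptions that are not stated in the article: the class name `NeuralPIDSketch`, the random weight initialization, the use of the scaled sigmoid of Equation (2) in the output layer, and the incremental form of the PID law. The back propagation weight update that performs the online tuning is deliberately omitted.

```python
import math
import random


class NeuralPIDSketch:
    """Forward pass of a 3-5-3 network that outputs (kp, ki, kd)."""

    def __init__(self, gain_upper_bound=1.0, seed=0):
        rng = random.Random(seed)
        self.u_max = gain_upper_bound
        # 5 hidden neurons x 3 inputs, and 3 outputs x 5 hidden neurons.
        self.w_hidden = [[rng.uniform(-0.5, 0.5) for _ in range(3)] for _ in range(5)]
        self.w_out = [[rng.uniform(-0.5, 0.5) for _ in range(5)] for _ in range(3)]
        self.e_prev = 0.0   # error at step k-1
        self.e_prev2 = 0.0  # error at step k-2
        self.control = 0.0  # accumulated control signal

    def gains(self, reference, output, error):
        x = [reference, output, error]
        # Hidden layer with Tanh activation.
        hidden = [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in self.w_hidden]
        raw = [sum(w * hj for w, hj in zip(row, hidden)) for row in self.w_out]
        # Scaled sigmoid keeps every gain inside (0, u_max).
        return tuple(self.u_max / (1.0 + math.exp(-r)) for r in raw)

    def step(self, reference, output):
        e = reference - output
        kp, ki, kd = self.gains(reference, output, e)
        # Incremental PID law (an assumption, common in BP-tuned PID schemes).
        delta = (kp * (e - self.e_prev)
                 + ki * e
                 + kd * (e - 2.0 * self.e_prev + self.e_prev2))
        self.control += delta
        self.e_prev2, self.e_prev = self.e_prev, e
        return self.control, (kp, ki, kd)


controller = NeuralPIDSketch(gain_upper_bound=2.0)
control, (kp, ki, kd) = controller.step(reference=1.0, output=0.0)
```

In a complete scheme the weights in `w_hidden` and `w_out` would be updated at every step by back propagating the tracking error, so that the gains change as the crack path changes.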

4 Experiments

This section evaluates the proposed method through representative benchmarks and validation. The experimental settings are described first. Then, the crack segmentation results are analyzed and discussed. Subsequently, our Neural–PID method is deployed on a vision-guided robot to achieve real-time tracking of pavement cracks.

4.1 Experimental Setting

The model training experiments were conducted on an Intel(R) i9-13900K(F) CPU running at 5.8 GHz, along with an NVIDIA GeForce RTX4090 GPU (24 GB) and the following software versions: CUDA v10.2, cuDNN v8.0.1, PyTorch v2.0, and Python v3.8. The unmanned wheeled robot is equipped with an embedded Nvidia Jetson AGX Xavier computer as the main processor, with the following specifications: an Nvidia Volta GPU with 512 CUDA cores and 64 tensor cores, an 8-core ARM v8.2 CPU, and 32 GB DDR4 memory. To acquire pavement crack video images of the scene in front of the unmanned wheeled robot, a front-mounted Realsense D435i camera with a 135-degree field of view (FOV) and an RGB-D perception unit is used. The embedded environment includes Jetpack 4.4, PyTorch 1.8, Linux Ubuntu 18.04, and ROS Melodic, as shown in Figure 3.

Figure 2. The architecture of YOLOv5 uses a fusion transformer method that encompasses four separate components: backbone, neck, head, result.

Figure 3. Working conditions of our vision-guided robot under different perspectives are displayed.

The evaluation metrics used to assess the performance of our proposed method are Precision (B), Precision (M), Recall (B), Recall (M), and AP (Average Precision). The AP metrics comprise mAP0.5 (B) and mAP0.5 (M), which are the AP at an IoU threshold of 0.5, and mAP0.5:0.95 (B) and mAP0.5:0.95 (M), which are the AP averaged over IoU thresholds from 0.5 to 0.95 in increments of 0.05. The notation (B) denotes a metric computed on the predicted bounding box, corresponding to crack detection. Similarly, the notation (M) denotes a metric computed on the binary mask, corresponding to crack segmentation.
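The IoU thresholds behind the mask metrics can be made concrete with a small Python sketch. It computes the IoU between a predicted and a ground-truth binary mask and counts at how many of the ten thresholds (0.50 to 0.95) the prediction would count as a match. The function name `mask_iou` and the toy masks are assumptions for illustration; a full AP computation additionally ranks predictions by confidence and integrates the precision-recall curve, which is omitted here.

```python
def mask_iou(pred, truth):
    """Intersection over union of two binary masks given as nested lists."""
    intersection = 0
    union = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            if p and t:
                intersection += 1
            if p or t:
                union += 1
    return intersection / union if union else 0.0


# IoU thresholds from 0.50 to 0.95 in increments of 0.05.
IOU_THRESHOLDS = [round(0.5 + 0.05 * i, 2) for i in range(10)]

# Toy masks (hypothetical): the prediction covers 2 of the 3 crack pixels.
predicted_mask = [[1, 1, 0]]
ground_truth_mask = [[1, 1, 1]]

iou = mask_iou(predicted_mask, ground_truth_mask)
matched_thresholds = [thr for thr in IOU_THRESHOLDS if iou >= thr]
```

For a thin crack, missing even a few pixels lowers the IoU quickly, which is one reason the mask metrics in the 0.5:0.95 range are much lower than those at 0.5.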

4.2 Results of Crack Segmentation

This section evaluates the crack segmentation performance of the proposed method. The experimental results are analyzed on the open dataset CFD and the self-built dataset S2T-Crack.

4.2.1 CFD Dataset

The CFD dataset is used for evaluation. It comprises 118 pavement crack images, each with dimensions of 480 pixels by 320 pixels. These images were captured by individuals standing on the road using an iPhone. The ground truths were annotated at the pixel level, a task that demands significant labor. The images are of high quality with a smooth and clean background. Table 1 compares YOLOv5 and our method (Ours) across the pretrained models n, s, m, l, and x. With every pretrained model, our method demonstrated improved performance on the CFD dataset. The best metrics obtained are: [Precision(M) = 0.6818, Recall(M) = 0.5178, mAPval0.5(M) = 0.5304, mAPval0.5:0.95(M) = 0.2453]. Overall, the results on the CFD dataset show that our proposed method performs consistently better across model sizes in pixel-level crack segmentation.
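The improvement reported in Table 1 can be checked with simple arithmetic on the mask mAPval0.5 column. The sketch below assumes that the five rows with the lower scores in Table 1 belong to YOLOv5 and the five rows with the higher scores belong to the proposed method, which is consistent with the best values quoted above; the variable names are illustrative.

```python
# Mask mAPval0.5 values copied from Table 1 (row-to-method mapping is an assumption).
yolov5_map50_mask = {"n": 0.2301, "s": 0.3875, "m": 0.3631, "l": 0.3849, "x": 0.3756}
ours_map50_mask = {"n": 0.4548, "s": 0.4906, "m": 0.5187, "l": 0.5018, "x": 0.5304}

# Absolute gain of the proposed method over YOLOv5 for each pretrained model.
gains = {
    model: round(ours_map50_mask[model] - yolov5_map50_mask[model], 4)
    for model in yolov5_map50_mask
}

best_model = max(ours_map50_mask, key=ours_map50_mask.get)
largest_gain_model = max(gains, key=gains.get)

for model, gain in gains.items():
    print(f"{model}: +{gain:.4f} mask mAP0.5")
```

Under this reading of the table, every model size gains in mask mAPval0.5, with the highest absolute score on the largest model and the largest gain on the smallest one.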

4.2.2 S2T-Crack Dataset

This section also includes a comparative experiment on the self-built S2T-Crack dataset, as presented in Figure 4. Our method demonstrates superior segmentation performance with the pretrained model 's', which has only 6.7M parameters and 15.2M GFLOPs, while the segmentation accuracy remains acceptable. In the segmentation results for three scenes from the S2T-Crack dataset, YOLOv5 segments the cracks only coarsely and ignores certain subtle features, which can yield incomplete masks with breaks or local losses. Our method generates masks that appropriately cover the target cracks, thanks to the use of self-attention (SA) and cross-attention (CA). To further enhance performance, FT modules are integrated to fuse crack features. Our method can therefore generate accurate binary masks in a variety of complex scenes.

4.3 Online Tuning of PID Parameters

This section evaluates the crack tracking performance of the proposed method. The experimental results are compared across different control algorithms.

4.3.1 Comparison of Tracking Control

As shown in Figure 5, compared with CEAFC, the Neural-PID control scheme approaches the ideal solution with a faster convergence rate by iteration 200, indicating that Neural-PID has a stronger deterministic global search ability and finds high-dimensional optimal solutions faster. The results show that the Neural-PID control algorithm is superior to the other three methods.

Table 1. Real-time segmentation results on the CFD dataset.

Method  Pretrained Model  Batch_Size  Precision(B)  Precision(M)  Recall(B)  Recall(M)  mAPval0.5(B)  mAPval0.5(M)  mAPval0.5:0.95(B)  mAPval0.5:0.95(M)  Params/M  GFLOPs/M
YOLOv5  n                 32          0.6688        0.4424        0.4523     0.3158     0.4339        0.2301        0.1644             0.0393             1.9       6.7
YOLOv5  s                 16          0.7254        0.4621        0.4474     0.4474     0.4944        0.3875        0.2456             0.0854             7.4       25.7
YOLOv5  m                 8           0.7326        0.4562        0.4645     0.3947     0.4750        0.3631        0.2586             0.0545             21.7      69.8
YOLOv5  l                 2           0.7289        0.4637        0.4737     0.4211     0.5032        0.3849        0.2561             0.0911             47.3      146.4
YOLOv5  x                 2           0.7288        0.4726        0.5256     0.4211     0.4865        0.3756        0.2871             0.0822             88.2      264
Ours    n                 32          0.7153        0.5958        0.5037     0.4167     0.5723        0.4548        0.3264             0.1967             2.0       6.9
Ours    s                 16          0.7982        0.5653        0.4943     0.4817     0.5921        0.4906        0.3485             0.1873             7.5       25.7
Ours    m                 8           0.7705        0.5831        0.5736     0.4524     0.5257        0.5187        0.3356             0.2294             21.8      69.9
Ours    l                 2           0.7657        0.6024        0.5975     0.4688     0.5354        0.5018        0.3721             0.2453             47.4      146.7
Ours    x                 2           0.7724        0.6818        0.5487     0.5178     0.5677        0.5304        0.3953             0.2102             88.4      265


Figure 4. Visualization of segmentation results using YOLOv5 and our proposed method on our self-built S2T-Crack dataset.

According to the convergence curve, the Neural-PID algorithm needs 60 iterations to find a local optimal solution and 90 iterations to escape it. Compared with the 150 iterations required by the CEAFC method, this is a substantial reduction. Therefore, Neural-PID can escape local optimal solutions and improve the robustness of crack tracking control.

Figure 5. The comparison results of algorithm opti-
mization.

4.3.2 Analysis of Tracking Error

Table 2. The comparison results of crack tracking error (mean absolute error in mm) for each control method.

Crack ID  Segmentation Model  PID    Fuzzy PID  CEAFC  Neural-PID
#1        n                   9.71   5.81       4.68   4.47
#1        s                   9.57   5.73       4.54   4.12
#1        m                   9.93   6.07       4.73   4.59
#1        l                   10.24  6.44       4.91   4.75
#1        x                   11.86  6.76       5.16   5.01
#2        n                   13.12  6.21       5.03   4.94
#2        s                   12.86  6.19       4.85   4.63
#2        m                   13.38  6.58       5.17   4.86
#2        l                   13.89  6.91       5.43   5.21
#2        x                   14.31  7.35       5.79   5.57
#3        n                   15.08  7.73       6.25   5.94
#3        s                   14.59  7.51       5.86   5.67
#3        m                   15.36  8.09       6.57   6.29
#3        l                   16.18  8.67       6.93   6.76
#3        x                   16.85  7.28       7.06   6.81

Experiments are performed on real roads to verify the performance of road crack tracking, as shown in Table 2. The mean absolute error is used as the performance evaluation index. During tracking, the road crack tracking error of the unmanned wheeled robot under the proposed method is compared with that under the other control methods. Crack #1 is a straight pavement crack. For crack #1, our algorithm achieves its smallest mean absolute tracking error with the pretrained model 's', with a measured value of 4.12 mm. Crack #2 is a curved pavement crack. For crack #2, our algorithm achieves its smallest mean absolute error with the pretrained model 's', with a measured value of 4.63 mm. Crack #3 is a continuously turning pavement crack. For crack #3, our algorithm achieves its smallest mean absolute error with the pretrained model 's', with a measured value of 5.67 mm.
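The evaluation index and the per-crack comparison can be reproduced with a few lines of Python. The function `mean_absolute_error` and the sample deviations are hypothetical illustrations of how the index is computed from per-frame lateral deviations; the per-model values are the Neural-PID column of Table 2.

```python
def mean_absolute_error(deviations_mm):
    """Mean of the absolute tracking deviations, in millimetres."""
    return sum(abs(d) for d in deviations_mm) / len(deviations_mm)


# Hypothetical per-frame deviations between the robot path and the crack centreline.
example_mae = mean_absolute_error([3.0, -5.0, 4.0])

# Neural-PID mean absolute errors (mm) from Table 2, per crack and segmentation model.
neural_pid_mae_mm = {
    "#1": {"n": 4.47, "s": 4.12, "m": 4.59, "l": 4.75, "x": 5.01},
    "#2": {"n": 4.94, "s": 4.63, "m": 4.86, "l": 5.21, "x": 5.57},
    "#3": {"n": 5.94, "s": 5.67, "m": 6.29, "l": 6.76, "x": 6.81},
}

# Segmentation model with the smallest error for each crack.
best_model_per_crack = {
    crack: min(errors, key=errors.get) for crack, errors in neural_pid_mae_mm.items()
}
```

The lookup confirms that the 's' model gives the smallest Neural-PID error for all three cracks, and that the error grows from the straight crack to the continuously turning one.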

5 Conclusions

This article addresses two critical issues in road crack tracking: insufficient feature extraction and low tracking efficiency. To overcome these challenges, the research first enhances pavement crack feature extraction from crack video images with our transformer-based crack segmentation method. By combining SA and CA and leveraging the FT model, the quality of the binary masks in instance segmentation is improved, enabling fine-grained segmentation of pavement cracks. With the proposed Neural-PID controller, our method is deployed on an NVIDIA AGX Xavier to enable real-time tracking of actual pavement cracks on a vision-guided robot. Future research will consider road crack depth images and explore alternative control methods to enhance the accuracy and robustness of the tracking control algorithm. The developed vision-guided robot can be integrated with repair mechanisms to carry out road crack repairs.

Acknowledgements

The study presented in the article was partially supported by the National Key Research and Development Program of China (No.2021YFB2601000), National Natural Science Foundation of China (No.52078049, No.52378431), Natural Science Foundation of Shaanxi Province (2022JM-193), Fundamental Research Funds for the Central Universities, CHD (No.300102210302, No.300102210118), the 111 Project of Sustainable Transportation for Urban Agglomeration in Western China (No.B20035).

References

[1] Jingwei Liu, Xu Yang, Stephen Lau, Xin Wang,

Sang Luo, Vincent Cheng-Siong Lee, and Ling

Ding. Automated pavement crack detection and

segmentation based on two-step convolutional neu-

ral network. Computer-Aided Civil and Infras-

tructure Engineering, 35(11):1291–1305, 2020.

doi:10.1111/mice.12622.

[2] Jingwei Liu, Xu Yang, Xin Wang, and Jian Wei Yam.

A laboratory prototype of automatic pavement crack

sealing based on a modified 3D printer. International

Journal of Pavement Engineering, 23(9):2969–2980,

2022. doi:10.1080/10298436.2021.1875225.

[3] Jinchao Guan, Xu Yang, Ling Ding, Xiaoyun Cheng,

Vincent C.S. Lee, and Can Jin. Automated pixel-level

pavement distress detection based on stereo vision

and deep learning. Automation in Construction, 129:

103788, 2021. doi:10.1016/j.autcon.2021.103788.

[4] Jinchao Guan, Xu Yang, Pengfei Liu, Markus Oeser,

Han Hong, Yi Li, and Shi Dong. Multi-scale as-

phalt pavement deformation detection and measure-

ment based on machine learning of full field-of-

view digital surface data. Transportation Research

Part C: Emerging Technologies, 152:104177, 2023.

doi:10.1016/j.trc.2023.104177.

[5] Zhihao Pan, Jinchao Guan, Xu Yang, Kang

Fan, Jeremy C.H. Ong, Ningqun Guo, and

Xin Wang. One-stage 3D profile-based pave-

ment crack detection and quantification. Au-

tomation in Construction, 153:104946, 2023.

doi:10.1016/j.autcon.2023.104946.

[6] Jianqi Zhang, Xu Yang, Wei Wang, Jinchao Guan,

Ling Ding, and Vincent C. S. Lee. Automated

guided vehicles and autonomous mobile robots

for recognition and tracking in civil engineering.

Automation in Construction, 146:104699, 2023.

doi:10.1016/j.autcon.2022.104699.

[7] Chengjia Han, Tao Ma, Ju Huyan, Xiaoming Huang,

and Yanning Zhang. CrackW-Net: A Novel Pave-

ment Crack Image Segmentation Convolutional Neu-

ral Network. IEEE Transactions on Intelligent

Transportation Systems, 23(11):22135–22144, 2022.

doi:10.1109/TITS.2021.3095507.

[8] Guijie Zhu, Jiacheng Liu, Zhun Fan, Duan Yuan,

Peili Ma, Meihua Wang, Weihua Sheng, and Kelvin

C. P. Wang. A lightweight encoder–decoder network

for automatic pavement crack detection. Computer-

Aided Civil and Infrastructure Engineering, pages

1–23, 2023. doi:10.1111/mice.13103.



[9] Frank K.A. Awuah and Alvaro Garcia-Hernández.

Machine-filling of cracks in asphalt concrete.

Automation in Construction, 141:104463, 2022.

doi:10.1016/j.autcon.2022.104463.

[10] Jianqi Zhang, Xu Yang, Wei Wang, Jinchao Guan,

Wenbo Liu, Hainian Wang, Ling Ding, and Vin-

cent C. S. Lee. Cross-entropy-based adaptive fuzzy

control for visual tracking of road cracks with

unmanned mobile robot. Computer-Aided Civil

and Infrastructure Engineering, pages 1–20, 2023.

doi:10.1111/mice.13108.

[11] Firdes Çelik and Markus König. A sigmoid-

optimized encoder–decoder network for crack

segmentation with copy-edit-paste transfer

learning. Computer-Aided Civil and Infras-

tructure Engineering, 37(14):1875–1890, 2022.

doi:10.1111/mice.12844.

[12] Xinzi Sun, Yuanchang Xie, Liming Jiang, Yu Cao,

and Benyuan Liu. DMA-Net: DeepLab With

Multi-Scale Attention for Pavement Crack Seg-

mentation. IEEE Transactions on Intelligent

Transportation Systems, 23(10):18392–18403, 2022.

doi:10.1109/TITS.2022.3158670.

[13] Bo Chen, Hua Zhang, Guijin Wang, Jianwen Huo,

Yonglong Li, and Linjing Li. Automatic concrete in-

frastructure crack semantic segmentation using deep

learning. Automation in Construction, 152:104950,

2023. doi:10.1016/j.autcon.2023.104950.

[14] Honghu Chu and Pang-jo Chun. Fine-grained crack

segmentation for high-resolution images via a mul-

tiscale cascaded network. Computer-Aided Civil

and Infrastructure Engineering, pages 1–20, 2023.

doi:10.1111/mice.13111.

[15] Dongho Kang, Sukhpreet S. Benipal, Dharshan L.

Gopal, and Young-Jin Cha. Hybrid pixel-level

concrete crack segmentation and quantification

across complex backgrounds using deep learning.

Automation in Construction, 118:103291, 2020.

doi:10.1016/j.autcon.2020.103291.

[16] Zhihan Zhang and Di Bai. Optimization of Im-

proved PID Control Strategy Based on Genetic

Algorithm. Journal of Physics: Conference Se-

ries, 2417(1):012025, 2022. doi:10.1088/1742-

6596/2417/1/012025.

[17] YeFei Kang, ZhiBin Li, and Tao Wang. Application

of PID Control and Improved Ant Colony Algorithm

in Path Planning of Substation Inspection Robot.

Mathematical Problems in Engineering, 2022:1–10,

2022. doi:10.1155/2022/9453219.

[18] Ultralytics. ultralytics/yolov5: v7.0 - YOLOv5

SOTA Realtime Instance Segmentation. https:

//github.com/ultralytics/yolov5.com,

2022. URL https://doi.org/10.5281/

zenodo.7347926. Accessed: 7th May, 2023.

[19] Yuchuan Du, Shan Zhong, Hongyuan Fang, Nian-

nian Wang, Chenglong Liu, Difei Wu, Yan Sun,

and Mang Xiang. Modeling automatic pavement

crack object detection and pixel-level segmentation.

Automation in Construction, 150:104840, 2023.

doi:10.1016/j.autcon.2023.104840.

[20] Kai Han, Yunhe Wang, Hanting Chen, Xinghao

Chen, Jianyuan Guo, Zhenhua Liu, Yehui Tang,

An Xiao, Chunjing Xu, Yixing Xu, Zhaohui Yang,

Yiman Zhang, and Dacheng Tao. A Survey on Vision

Transformer. IEEE Transactions on Pattern Analy-

sis and Machine Intelligence, 45(1):87–110, 2023.

doi:10.1109/TPAMI.2022.3152247.

[21] Yuqing Wang, Zhaoliang Xu, Xinlong Wang, Chun-

hua Shen, Baoshan Cheng, Hao Shen, and Huaxia

Xia. End-to-End Video Instance Segmentation

with Transformers. In 2021 IEEE/CVF Confer-

ence on Computer Vision and Pattern Recogni-

tion (CVPR), pages 8737–8746, Virtual, Online,

United States, June 2021. IEEE Computer Society.

doi:10.1109/CVPR46437.2021.00863.

[22] Sixiao Zheng, Jiachen Lu, Hengshuang Zhao, Xi-

atian Zhu, Zekun Luo, Yabiao Wang, Yanwei

Fu, Jianfeng Feng, Tao Xiang, Philip H.S. Torr,

and Li Zhang. Rethinking Semantic Segmen-

tation from a Sequence-to-Sequence Perspective

with Transformers. In 2021 IEEE/CVF Confer-

ence on Computer Vision and Pattern Recogni-

tion (CVPR), pages 6877–6886, Virtual, Online,

United States, June 2021. IEEE Computer Society.

doi:10.1109/CVPR46437.2021.00681.

[23] Wael Farag. Complex Trajectory Tracking Using

PID Control for Autonomous Driving. International

Journal of Intelligent Transportation Systems Re-

search, 18(2):356–366, 2020. doi:10.1007/s13177-

019-00204-2.

[24] Salman Bari, Syeda Shabih Zehra Hamdani,

Hamza Ullah Khan, Mutte Ur Rehman, and Ha-

roon Khan. Artificial neural network based self-

tuned PID controller for flight control of quad-

copter. In 2019 International Conference on En-

gineering and Emerging Technologies (ICEET),

pages 1–5, Lahore, Pakistan, February 2019. In-

stitute of Electrical and Electronics Engineers Inc.

doi:10.1109/CEET1.2019.8711864.

