CRAAC: Consistency Regularised Active Learning with Automatic Corrections
for Real-life Road Image Annotations

Percy Lam
University of Cambridge
phl25@cam.ac.uk

Sooyong Park
University of Cambridge
martellw2ks@gmail.com

Weiwei Chen
University College London
weiwei.chen@ucl.ac.uk

Lavindra de Silva
University of Cambridge
lpd25@cam.ac.uk

Ioannis Brilakis
University of Cambridge
ib340@cam.ac.uk

Abstract

In annotating real-life large, noisy and domain-specific
images for digitising infrastructure, substantial human ef-
fort persists despite past advancements. This research pro-
vides practical and interpretable scores for human anno-
tators, enabling flexible annotation strategies, improving
automation and reducing the effort required to create and
correct image labels. The authors present the CRAAC so-
lution: Consistency Regularised Active learning and Au-
tomatic Corrections, which builds on Mask R-CNN with
three additional modules: consistency regularisation, scor-
ing modules for active learning and automatic corrections.
Experiments on our pavement image dataset, which has
a low silhouette score of 0.146 and qualitative annotation
inconsistencies, show that CRAAC reduces the human effort in
mouse clicks by 5-11% and improves the quality metrics of mAP
and AR by approx. 40% over the original Mask R-CNN. The automatic
correction further reduces the performance variation.

1. Introduction

Digitisation is seen as a key solution [18] for enhanc-
ing the economic, environmental and safety performance
[17] of our critical road infrastructure. The state-of-the-art
(SOTA) data preparation process requires much manual ef-
fort that may potentially negate the benefits of digitisation,
inspiring the current research.

This research specifically targets automation in annotating
real-life road images from highway authorities, which are
large, noisy and domain-specific. These images require
annotations, the markup of image features that machine
learning models are trained to automatically recognise [24].
This project adopts instance segmentation [24]
labels. Throughout the paper, pseudo-labels refer to la-
bels predicted from a model without manual corrections,
whereas ground truth labels are the accepted labels af-
ter manual corrections or annotations. Automation enables
images to be labelled with less or no human input.

SOTA image annotation involves a range of automation.
Open-set annotation, such as PaliGemma [29],
detects segmentation masks from text prompts of desired
generic objects. Labelling non-generic objects involves human
annotators [1] with a range of semi-automatic assistive
tools [21], such as inferences from pre-trained models
[22] and automatic polygon adjustments [30]. These tools,
however, perform inconsistently on large, noisy, domain-
specific datasets. They also cannot recommend which im-
age improves the tool’s inference the most. The quests to
utilise human review efforts and acquire better explanations
for failed pseudo-labels underpin the research.

This research makes the following contributions:

1. Design a pipeline that reduces human effort in both
generating and correcting annotations, by querying the
most advantageous images and mimicking past correc-
tions when reviewing pseudo-labels

2. Generate practical and interpretable inference scores
through bbox and mask scoring modules that enable
flexible annotation strategies

2. Related Works

Deep learning architectures for 2D image instance segmentation
are available as single-stage detectors, such as YOLOv7 [31],
or two-stage detectors, such as Mask R-CNN [8]. Research teams
applied these detectors to prepare labelled datasets and detect
difficult objects such as pavement defects [2, 20, 38].
Advancements in inferring and correcting pseudo-labels prompt
our research.

2.1. Enhancements in creating pseudo-labels

Research teams have been enhancing pseudo-label creation
by semi-supervised learning (SSL). SSL leverages
unlabelled images to improve predictions from a small la-
belled dataset [19] by two approaches. The first approach
implicitly extracts information in the unlabelled data with-
out labelling them, such as consistency regularisation that
utilises losses between predictions of the original image and
its augmented version [4, 19]. The extra loss can thus bol-
ster the performance of a trained model, single-stage or two-
stage detectors alike [9], or be used as a score for selecting
images for human review [7].

The second approach observes the distribution of the un-
labelled data without resorting to the training process and
enables labelling strategies such as active learning [32].
This technique selects the most informative or representa-
tive [41] pseudo-labelled instance to query from a large un-
labelled dataset, thereby sparing the need to review every
pseudo-label. Researchers improved the process by find-
ing better sampling methods [3], combining with other SSL
methods [10, 13], eliminating redundant predictions [36]
and considering better loss functions and terms. Training
losses can be modified by adding new loss terms [7] and
quantifying losses by scores [32, 39, 40].

Open-set annotation enables image annotation through
vision-language training. Based on text-to-image embeddings,
zero-shot object detection enables bounding boxes
(bbox) and/or masks to be detected from an input text
prompt [14, 26, 34, 37].

2.2. Enhancements in correcting pseudo-labels

In contrast with pseudo-label creation, pseudo-label corrections
by humans are still prevalent in domain-specific
applications, such as labelling fruits [15], roots [27] and
ulcers [11]. While some researchers such as Wu [35] at-
tempted to correct pseudo-labels automatically by simple
intersection and/or union of polygons, automatic correction
is still an open field for innovations.

The closest resemblance is pseudo-label refinement,
common in SSL or unsupervised domain adaptation. This
technique adjusts pseudo-labels by training another projec-
tion head with auxiliary tasks [5, 33] or applying clustering
to past ground truths [16]. Pseudo-label refinement however
only assesses the final corrected state and does not consider
the change from its original to the final state.

Overall, the current techniques cannot annotate domain-specific
datasets fully automatically. Specifically, a gap remains in
enabling human annotators to identify problems with images and
devise appropriate annotation strategies to complete the remaining
annotations. Existing research has also not fully captured the
value of past human corrections on pseudo-labels.

3. Method

Our proposed solution extends from the two-stage detec-
tor Mask R-CNN with three additional components. The
two-stage architecture allows modules to be modified in-
dependently without interfering with one another, while its
Region Proposal Network (RPN) provides proposal bboxes
to seek missing instances. The first additional compo-
nent, consistency regularisation (Sec. 3.1), aims to improve
the predictor’s performance by leveraging unlabelled data
in the classifier training, thereby outputting better pseudo-labels
that need fewer alterations. The second component, the scoring
module (Sec. 3.2), estimates image informativeness and outputs
five scores for ranking images in active learning, so that human
annotators can interpret the informativeness of unlabelled images,
select useful images for review and save reviewing effort. The third
component, auto-correction (Sec. 3.3), records previous manual
corrections on predicted pseudo-labels and mimics the corrections
on the rest of the predictions. The three components combined form
the CRAAC solution, as illustrated in Fig. 1.

3.1. Consistency Regularisation (CNS)

The authors introduce CNS loss in the classifier train-
ing to improve its prediction performance. The CNS loss
is calculated by adapting Jeong’s [9] structure for two-stage
detectors in Fig. 2. The structure first augments the input
image with a horizontal flip (xflip) and creates feature maps
for both images. The RPN generates region proposals for
the original image and flips these regions horizontally for the
augmented image. The classifier then predicts class probability
vectors for the proposals of both images, which are passed through
a softmax before the CNS classification loss is calculated.

Following Jeong [9], the CNS loss comprises two losses:
the loss due to the difference in predicted class probability
vectors (unsupervised class loss, $L^{cns}_{cls}$) and the
difference between the four scalars (bbox centre $x$, $y$ and bbox
width $w$ and height $h$) of the bboxes (unsupervised bbox
loss, $L^{cns}_{bbox}$) in the original versus the augmented image.
The two losses are then added with weight $\beta$ to the supervised
loss of Mask R-CNN, $L_{sup}$, to train the classifier in
Eq. (1). The loss is introduced to help differentiate features
and improve the classifier for better pseudo-labels. The
RPN is only trained with $L_{sup}$ to avoid the
correspondence-matching problem.

$$L_{total} = L_{sup} + \beta L_{CNS} = L_{sup} + \beta \left( L^{cns}_{cls} + L^{cns}_{bbox} \right) \tag{1}$$
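As a rough illustration of Eq. (1), the sketch below computes the two consistency terms for matched proposals in plain Python. It is not the authors' implementation. It assumes a Jensen-Shannon divergence for the class term and a mean squared difference for the bbox term, with the horizontal offset of the flipped image negated before comparison; the function names and the default value of `beta` are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over one list of class logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence between two probability vectors (assumed choice)."""
    mid = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(u, v):
        return sum(a * math.log((a + eps) / (b + eps)) for a, b in zip(u, v))

    return 0.5 * kl(p, mid) + 0.5 * kl(q, mid)

def cns_losses(logits_orig, logits_flip, deltas_orig, deltas_flip):
    """Return (class term, bbox term) averaged over matched proposals.

    Each delta is (dx, dy, dw, dh). The dx of the flipped image is negated
    so that it can be compared with the original (assumed flip convention).
    """
    n = len(logits_orig)
    l_cls = sum(
        js_divergence(softmax(a), softmax(b))
        for a, b in zip(logits_orig, logits_flip)
    ) / n
    l_bbox = 0.0
    for d, f in zip(deltas_orig, deltas_flip):
        mirrored = (-f[0], f[1], f[2], f[3])
        l_bbox += sum((a - b) ** 2 for a, b in zip(d, mirrored)) / 4
    return l_cls, l_bbox / n

def total_loss(l_sup, l_cls, l_bbox, beta=1.0):
    """Eq. (1): supervised loss plus the weighted consistency loss."""
    return l_sup + beta * (l_cls + l_bbox)
```

Perfectly consistent predictions on an image and its flipped copy give zero for both terms, so the total reduces to the supervised loss.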


Figure 1. The Overall Schematic Structure of CRAAC and Adopted Loss for Training

Figure 2. CNS Structure for Two-
stage Detectors by Jeong [9]

Figure 3. The Five Scoring Components for the Active Learning Pipeline

3.2. Active Learning (AL)

Besides improving the classifier, the authors also adopt
active learning (AL) to select the most informative images
and reduce human review. Image informativeness is measured by the
uncertainty and distribution scores in Fig. 3. The uncertainty
scores show the uncertainty in classification (C), bbox regression
(B) and mask prediction (M). The distribution scores quantify an
image's relevance based on the scarcity of the predicted category
(V) and the number of predicted instances in an image (A).

On the uncertainty scores, besides C provided by the
original Mask R-CNN, the authors estimate B and M
by training parallel scoring modules extending from Yoo
[39]. This explains the sources of model uncertainty better
and enables unlabelled images to be used, compared with
Wang’s [32] approach that relied on labelled ground truth.

The scoring module extends parallel from the box and
mask head in Mask R-CNN as shown in Fig. 4 and consists
of structures demonstrated in the supplementary material.
The inputs are feature maps from the box and mask head,
as well as the box regression loss ($L^{boxhead}_{bbox}$) and mask
loss ($L^{maskhead}_{mask}$) of Mask R-CNN. The modules then predict
non-negative scalar scores $B$ and $M$. When training the box
and mask scoring modules, the box scoring module computes
the Huber loss $L^{module}_{bbox}$ between $L^{boxhead}_{bbox}$ and $B$,
and similarly $L^{module}_{mask}$ from the mask scoring module in Eq. (2).

$$L_{total} = L^{boxhead}_{bbox} + L^{maskhead}_{mask} + \alpha \left( L^{module}_{bbox} + L^{module}_{mask} \right) \tag{2}$$

where $\alpha$ weighs the effect of scoring module losses.
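A minimal sketch of Eq. (2) for a single image follows, assuming scalar head losses and scalar predicted scores. The Huber threshold `delta` and the default `alpha` are placeholder values that the text does not state, and the function names are hypothetical.

```python
def huber(pred, target, delta=1.0):
    """Huber loss between a predicted score and the loss it should estimate."""
    err = abs(pred - target)
    if err <= delta:
        return 0.5 * err ** 2
    return delta * (err - 0.5 * delta)

def scoring_total_loss(l_box, l_mask, score_b, score_m, alpha=1.0):
    """Eq. (2): head losses plus alpha times the scoring module losses.

    l_box, l_mask: box regression and mask losses from the Mask R-CNN heads.
    score_b, score_m: the scores B and M predicted by the scoring modules.
    """
    l_module_box = huber(score_b, l_box)
    l_module_mask = huber(score_m, l_mask)
    return l_box + l_mask + alpha * (l_module_box + l_module_mask)
```

When the predicted scores equal the head losses, the module terms vanish, which is the behaviour the scoring modules are trained towards.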
At inference, besides the model uncertainty scores $C$, $B$
and $M$, the category value $V$ and the label abundance $A$
represent the data distribution. $V$ is the inverse of the
relative frequency of the predicted category in the labelled set.
Intuitively, images with more predicted instances (higher $A$)
in scarcer categories (higher $V$) are more informative.

$$V_i = \left( \frac{\text{no. of instances in category } i}{\text{total no. of instances in all categories}} \right)^{-1}$$
The total informativeness score $S_i$ for each image $i$ with
$j \in \{1, \ldots, N\}$ predicted instances is therefore the sum of the
five informative scores $C$, $B$, $M$, $V$, $A$ in Eq. (3).

$$S_i(C_i, B_i, M_i, A_i, V_i) = w_B B_i + w_M M_i + w_A A_i + \frac{1}{N} \sum_{j=1}^{N} \left( w_C C_{i,j} + w_V V_{Cat(j)} \right) \tag{3}$$

Note that the equation calculates $B$, $M$ and $A$ per image,
and $C$ and $V$ per instance. The $w$ denotes fixed weights
applied to each of the scores.
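The score in Eq. (3) can be sketched as below. The weights are those reported in Sec. 4.3. Treating $A$ as the raw count of predicted instances and skipping categories with no labelled instances are assumptions of this sketch, and the function names are hypothetical.

```python
from collections import Counter

# Weights reported in Sec. 4.3.
WEIGHTS = {"B": 0.5, "M": 0.5, "A": 0.1, "C": 1.0, "V": 1.0}

def category_values(labelled_categories):
    """V per category: inverse relative frequency in the labelled set.

    Categories absent from the labelled set get no entry here; how to
    handle them is not specified in the text.
    """
    counts = Counter(labelled_categories)
    total = sum(counts.values())
    return {cat: total / n for cat, n in counts.items()}

def informativeness(score_b, score_m, instances, values, w=WEIGHTS):
    """Eq. (3) for one image.

    instances: list of (category, C) pairs, one per predicted instance.
    values: mapping from category to V.
    """
    n = len(instances)
    abundance = n  # A taken as the raw instance count (assumption)
    per_instance = 0.0
    if n:
        per_instance = sum(
            w["C"] * c + w["V"] * values[cat] for cat, c in instances
        ) / n
    return w["B"] * score_b + w["M"] * score_m + w["A"] * abundance + per_instance
```

For example, with a labelled set of three instances of one category and one of another, the scarcer category receives $V = 4$ while the common one receives $V = 4/3$, so an image predicted to contain the scarcer category ranks higher, all else being equal.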

After training the classifier and the scoring modules in
each iteration, the model predicts pseudo-labels for all un-
labelled images and calculates the five informative scores.
After applying the weights in Eq. (3), the algorithm ranks
the unlabelled images by their informative score. The top
n (20 in Tab. 2) unlabelled images are the least certain im-
ages, quantified by their high box and mask loss (in B and
M) and likely large A, as discussed in Sec. 4.5. They are selected
for review; in the experiment, their pseudo-labels are
replaced by the ground truth labels and added to the labelled
dataset for the next training iteration. This aims to
maximise the improvement of the classifier with the fewest
images, hence requiring the least manual review
and producing better pseudo-labels in later iterations.

Conversely, the most certain images are the bottom n
(also 20 in Tab. 2) images with at least one pseudo-label.
They become the correction templates in AC as the predictions
tend to be more reliable (lower B and M, Sec. 4.5).
Restricting the templates to these images avoids the circumstantial
corrections that would be replicated if all instances were used
(Sec. 4.7) and focuses auto-correction on common pseudo-label errors.
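The selection described in this section can be sketched as follows, assuming the per-image scores from Eq. (3) are already available. Excluding the least certain images from the most certain pool and drawing the random images from whatever remains are assumptions of this sketch; the function name and the `seed` parameter are hypothetical.

```python
import random

def select_for_iteration(scores, n_pseudo_labels, n=20, n_random=60, seed=0):
    """Pick least certain, most certain and random images for one iteration.

    scores: image id -> total informativeness score.
    n_pseudo_labels: image id -> number of predicted instances.
    """
    ranked = sorted(scores, key=scores.get, reverse=True)  # least certain first
    least_certain = ranked[:n]
    chosen = set(least_certain)

    # Most certain: lowest scores among images with at least one pseudo-label.
    candidates = [
        i for i in reversed(ranked)
        if n_pseudo_labels[i] > 0 and i not in chosen
    ]
    most_certain = candidates[:n]
    chosen.update(most_certain)

    remaining = [i for i in ranked if i not in chosen]
    rng = random.Random(seed)
    random_pick = rng.sample(remaining, min(n_random, len(remaining)))
    return least_certain, most_certain, random_pick
```

The defaults mirror the 20 least certain, 20 most certain and 60 random images of the CRA setup in Tab. 2.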

3.3. Automatic Corrections (AC)

The AC module gauges and mimics annotators' modifications
to save human effort on repetitive corrections. From the correction
templates of the most certain images, the module reads the ground
truth (manual correction) and records the correction action.
Correction actions include the deletion, addition and category
changes in [12], and additionally bbox resizing. The module records
the category, bbox details and the 1024-length embedding of every
instance in the correction template before and after the correction.

At inference, the module applies the trained model to
predict instances and region proposals on the testing set images.
In the transverse crack example in Fig. 5, the cosine similarity
is calculated between the embedding of the proposal bbox in the
testing image and the embeddings of instances in the correction
templates. If the pairwise similarity exceeds a preset threshold
(see Sec. 4.3), and if the width-to-height ratio (for both predicted
instances and region proposals) and the predicted category (for
predicted instances only) are similar to those of the correction
template, the correction template enters the clustering algorithm.
The combined use of computed similarity and physical attribute data
helps distinguish physically different instances with similar
computational features, such as longitudinal and transverse cracks.
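The gating step above might look like the following sketch. The text does not state how similar the width-to-height ratios must be, so the log-ratio tolerance `ratio_tol` is a hypothetical parameter, and "similar category" is read here as an exact match. The default similarity threshold of 0.7 follows the addition threshold listed for CRAAC in Tab. 2.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def aspect_ratio(bbox):
    """Width-to-height ratio of an (x1, y1, x2, y2) box."""
    x1, y1, x2, y2 = bbox
    return (x2 - x1) / (y2 - y1)

def candidate_templates(test, templates, sim_threshold=0.7, ratio_tol=0.5):
    """Return (similarity, template) pairs that pass all three checks.

    test and templates are dicts with keys 'embedding', 'bbox' and
    'category'; a region proposal has category None, so the category
    check is skipped for it.
    """
    kept = []
    for t in templates:
        sim = cosine_similarity(test["embedding"], t["embedding"])
        if sim <= sim_threshold:
            continue
        ratio_gap = math.log(aspect_ratio(test["bbox"]) / aspect_ratio(t["bbox"]))
        if abs(ratio_gap) > ratio_tol:
            continue
        if test["category"] is not None and test["category"] != t["category"]:
            continue
        kept.append((sim, t))
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return kept
```

The aspect ratio check is what separates a wide transverse crack from a tall longitudinal one even when their embeddings are nearly identical.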

The clustering algorithm decides the correction action.
The embedding of the predicted instance or the proposal
bbox of the testing image (the testing feature) enters a k-NN
classifier, where k = 3 (delete, add and change category;
a size change keeps the predicted instance). The testing
feature clusters with the embeddings of similar instances
from the correction templates (the correction features). The
closest cluster recommends the action on the predicted in-
stance/proposal.

• A predicted instance is deleted or changes its category
if the closest template recommends so.

• A region proposal is added if it is the closest to a tem-
plate recommending an addition.

• No changes are made otherwise.

The testing feature in Fig. 5 is the closest to the addition
cluster and is hence added. If the predicted instance is sim-
ilar to fewer than three correction template instances, there
are not enough templates to form clusters so the algorithm
follows the most similar template.
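One possible reading of this decision rule is sketched below: a majority vote over the three most similar correction features, with a fallback to the single most similar template when fewer than three are available. The vote and its tie-break (the closest template wins) are assumptions of this sketch rather than the authors' exact clustering procedure, and the action labels are hypothetical names.

```python
import math
from collections import Counter

def cosine_similarity(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def recommend_action(test_embedding, templates, k=3):
    """Recommend 'delete', 'add', 'change_category' or 'keep'.

    templates: list of (embedding, action) pairs that already passed the
    similarity, aspect ratio and category checks.
    """
    if not templates:
        return "keep"  # no changes are made otherwise
    ranked = sorted(
        templates,
        key=lambda t: cosine_similarity(test_embedding, t[0]),
        reverse=True,
    )
    if len(ranked) < k:
        return ranked[0][1]  # too few templates: follow the most similar one
    # Counter keeps first-seen order among equal counts, so ties fall back
    # to the action of the closest template.
    votes = Counter(action for _, action in ranked[:k])
    return votes.most_common(1)[0][0]
```

A resized template would carry the action `keep`, since a size change leaves the predicted instance in place.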

The remaining predicted instances are post-processed
similarly to [12]. The post-processing merges bboxes and
masks of the same class with an area difference and centroid
distance below a threshold (empirically set to 0.4 and
0.2 × min(image height, image width)). This is to account
for enlarged masks with wiggly cracks and magnified in-
stances in the near field while ensuring actual fragmented
masks are merged (see supplementary material). The post-
processed outcome is evaluated against the ground truth of
the testing set in this experiment or becomes the pseudo-
labels for actual image annotation. The AC module thus
helps avoid repetitive corrections by humans.
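The merge criterion in the post-processing can be sketched for bboxes as below. Interpreting the area difference as a relative difference against the larger box is an assumption, since the text only gives the threshold value of 0.4. Mask merging is omitted, and the function names are hypothetical.

```python
import math

def area(bbox):
    """Area of an (x1, y1, x2, y2) box."""
    return max(0.0, bbox[2] - bbox[0]) * max(0.0, bbox[3] - bbox[1])

def centroid(bbox):
    """Centre point of an (x1, y1, x2, y2) box."""
    return ((bbox[0] + bbox[2]) / 2, (bbox[1] + bbox[3]) / 2)

def should_merge(a, b, img_h, img_w, area_tol=0.4, dist_tol=0.2):
    """Check whether two predictions of the same class should be merged.

    a, b: dicts with keys 'category' and 'bbox'.
    """
    if a["category"] != b["category"]:
        return False
    area_a, area_b = area(a["bbox"]), area(b["bbox"])
    largest = max(area_a, area_b)
    if largest == 0:
        return False
    rel_area_diff = abs(area_a - area_b) / largest  # assumed definition
    (ax, ay), (bx, by) = centroid(a["bbox"]), centroid(b["bbox"])
    distance = math.hypot(ax - bx, ay - by)
    return rel_area_diff < area_tol and distance < dist_tol * min(img_h, img_w)

def merge_boxes(a, b):
    """Smallest box enclosing both input boxes."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))
```

Two fragments of the same crack that lie close together and have comparable areas would be merged into one enclosing box, while distant or differently sized detections are left untouched.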

4. Experiments

This section analyses and discusses the dataset, the performance
metrics, the testing setup and the resulting performance.

4.1. Dataset Analysis

The authors experimented with pavement images collected
from A12 Mountnessing, United Kingdom (6.6 km × 2 lanes) and
annotated at the level of instance segmentation. The images were
taken by a vehicle-mounted pavement camera, the Trimble MX9 [28],
travelling at lane speed.


Figure 4. The Structure of the Scoring Module and the Automatic Correction Module

Figure 5. The Operating Mechanism of the Automatic Correction Module

Model training began after labelling the initial training
set of 101 positive images (images that contain instances). For
experimentation, all images were annotated in advance. 209
positive images were separated as the testing set for all
iterations. The remaining 1663 positive images in the dataset
were queried for human review and transferred to the labelled set
until all positive images were selected for training. The image
selection depended on the testing setup in Sec. 4.3. The annotated
dataset included four crucial categories of defects for road
maintenance, namely crack transverse, crack longitudinal, potholes
and patch. Tab. 1 shows a summary of the image and instance counts.

Table 1. Defect distribution of the initial training, testing and the
remainder set

Batch                  Init. Train  Testing  Remainders
Nos. Images                   1000      500       10181
Positive Images                101      209        1663
Nos. Instances                 133      419        2718
Instance distribution
  crack transverse              82      151         888
  crack longitudinal            45       68         889
  potholes                       0       19         101
  patch                          6      181         840

4.2. Benchmarking Metrics

The experiments tested the hypothesis of saving human
effort while maintaining the quality of annotation. The required
human effort was quantified by the number of mouse clicks needed
to perform the actions in Sec. 3.3 that amend pseudo-labels into
their final ground truth.

1. Deletion: 1 click (the bin button)

2. Addition: 5-7 clicks, the median clicks to create a mask
across categories (see the supplementary material)

3. Category change: 2 clicks (open the dropdown list, then
choose a new category)

4. Change size (resize the mask): taken as 4 clicks (drag
four corners to their new positions)
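The click accounting above can be tallied with a short sketch. The per-category addition costs are in the supplementary material and are not reproduced here, so the default of 6 clicks is a placeholder inside the stated 5-7 range; the function and action names are hypothetical.

```python
# Fixed click costs from the list above.
CLICK_COST = {"delete": 1, "change_category": 2, "resize": 4}

def count_clicks(corrections, add_clicks=None, default_add=6):
    """Total mouse clicks for a sequence of corrections.

    corrections: iterable of (action, category) pairs.
    add_clicks: optional mapping of category -> clicks for an addition.
    default_add: placeholder cost when a category has no entry (assumption).
    """
    add_clicks = add_clicks or {}
    total = 0
    for action, category in corrections:
        if action == "add":
            total += add_clicks.get(category, default_add)
        else:
            total += CLICK_COST[action]
    return total
```

Because an addition costs several times more than a deletion, a missed instance weighs more on this metric than a spurious one.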

The quality of the predicted pseudo-labels is reported as the
mean average precision over categories (mAP50) and the average
recall for 100 detections per image (AR50,maxdet=100),
both at the IoU threshold of 0.5 against the actual ground truths.
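To make the IoU threshold concrete, the sketch below computes recall at IoU 0.5 for a single image and a single category, using bboxes for brevity although the paper evaluates instance segmentation. The greedy score-ordered matching is a simplification and the function names are hypothetical; it is not a full mAP or AR implementation.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def recall_at_iou(predictions, ground_truths, threshold=0.5, max_det=100):
    """Fraction of ground truths matched by the top max_det predictions.

    predictions: list of (score, bbox); each ground truth matches at most once.
    """
    if not ground_truths:
        return 0.0
    ranked = sorted(predictions, key=lambda p: p[0], reverse=True)[:max_det]
    matched = set()
    for _, box in ranked:
        best, best_iou = None, threshold
        for g, gt in enumerate(ground_truths):
            if g in matched:
                continue
            overlap = iou(box, gt)
            if overlap >= best_iou:
                best, best_iou = g, overlap
        if best is not None:
            matched.add(best)
    return len(matched) / len(ground_truths)
```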

4.3. Testing Setup

Experiments began by training on the initial training set in
Sec. 4.1 with the COCO pretrained weights. The model
selected around 100 images per training iteration and transferred
the selected images to the labelled set. The model was then
fine-tuned on the entire new labelled set in the next training
iteration, and the process iterated until the whole positive
image set was labelled and trained. Different experiments
were conducted, summarised as follows and in Tab. 2.

• Mask R-CNN: original (add 100 random images per iter)

• Sampling: Use original Mask R-CNN scores for sampling
(select 20 least certain + 80 random images)

• AL: Mask R-CNN + AL only (with the scoring module.
Select 20 least certain + 80 random images)

• CRA: Mask R-CNN + CNS + AL (Consistency
Regularised Active learning. With CNS. Select 20 least
certain + 20 most certain + 60 random images)

• CRAAC: Mask R-CNN + CNS + AL + AC (CRA with
Auto-Corrections. Setup as CRA, use the most certain 20
as correction templates)

• CRAAC 2: Mask R-CNN + CNS + AL + AC (Select 20
least certain + 80 random images for training. Take the
most certain 20 as correction templates but not added to
the next training iteration)

Effects of the number of sampled images were studied
in [12]. The 20 most and 20 least certain images were selected to
ensure sufficient samples for improving the classifier and
serving as correction templates in the AC, without being so many
that the templates include corrections suitable only for edge
cases. The rest of the 100 images added in each iteration were
randomly selected from the unlabelled dataset.

The similarity thresholds for the permissible actions and
the objectness of region proposals in the AC could be ad-
justed. The current set of thresholds was selected to salvage
missing instances and avoid accidental deletions, as well as
to maintain low mouse clicks across training sets of differ-
ent sizes, as shown in the supplementary material.

Referring to Eq. (3), the empirically found weights for
all tests involving AL were: $w_B = 0.5$, $w_M = 0.5$, $w_A = 0.1$,
$w_C = 1$, $w_V = 1$. These weights were chosen to provide scores
with a balance between the model uncertainty and data distribution.
A higher $w_B$ and $w_M$ will cause a stronger bias to model
uncertainty, outweighing the importance of finding instances of
scarcer categories. Sec. 4.5 suggested that $w_V$ could potentially
be altered to prioritise finding instances of a particular desired
category.

All models were generally trained with the same learn-
ing rates and settings, detailed in supp material. The CNS
and the scoring modules were both trained for 5000 steps
in the first iteration and 1500 and 2000 steps subsequently.
The experiments were conducted on a desktop computer
with a single NVIDIA GeForce RTX 3080 Ti with 12 GB of
VRAM, typically with a batch size of 6.

4.4. Performance

Figs. 6 to 8 show the precision (mAP50), recall (AR50)
and required mouse clicks across testing setups. Results
generally improve with more training and plateau at approx.
1000 positive images. The original model saved about 20% of the
human effort of full manual labelling, with a further 5-11%
reduction by using AL, CNS and/or AC.

4.4.1 Original Mask R-CNN as compared with the rest

Figs. 6 to 8 show that the other testing setups outperformed the
original Mask R-CNN (the light green curve) by 40-50% in
mAP and AR and 5-11% in mouse clicks. From 400 trained
images onwards, the original model improved more slowly than the
other setups, because AL prioritised training on the more
informative images through the weighting in Eq. (3). At around 1030
images, the other setups had picked up 50-100 more instances than
the original model. When a model was trained with fewer useful
images, it descended to a local minimum with worse performance
and cascaded the effect to further training iterations.

Even by sampling with scores in the original Mask R-
CNN, the initial front-loaded gain in precision and recall
was exceeded by set-ups with the three modules from ap-
prox. 800 images onwards. The mouse click counts also
tended to level with Mask R-CNN towards the end of train-
ing, trailing behind other set-ups. This indicated the contri-
bution of the three modules in training classifiers for better
pseudo-labels and more effective sampling and correction.

4.4.2 AL as compared with the rest with CR

Figs. 6 to 8 show that the model with only active learning
(the pink curve) matched in quality and human effort with
models with CNS training (CRA or CRAAC) before 1000
images. AL had likely captured more high-value images in
the early stages of training, front-loading the performance
gain. As training continued, the early advantage diminished
and the quality metrics, especially AR50, gradually fell below
models with CNS training. At the end of the training,
models with CNS yielded 6-18% growth of AP and 5-17%
growth of AR compared with the model with only AL.

Table 2. The testing configurations

Setup       CNS  AL  AC  Least    Most     Random  AC thresholds       Objectness
                         certain  certain          (del, add, chcats)  logit
Mask R-CNN  ×    ×   ×   N/A      N/A      100
Sampling    ×    ×   ×   20       0        80
AL          ×    ✓   ×   20       0        80
CRA         ✓    ✓   ×   20       20       60
CRAAC       ✓    ✓   ✓   20       20       60      1.0, 0.7, 0.7       0.04
CRAAC 2     ✓    ✓   ✓   20       0*       80      1.0, 0.8, 0.7       0.02

CNS: Consistency Regularisation. AL: Active Learning. AC: Auto Correction.
The Least certain, Most certain and Random columns give the images per iteration.
Remarks. Sampling: sample with only in-built Mask R-CNN C, V and A.
CRAAC 2 (*): 20 most certain reviewed for correction templates, not trained.

Figure 6. Precision of experiments, in mAP50

4.4.3 CRA only compared with the rest with Auto-
matic Correction

The performance of the base CRA model underpinned the
AC module, which ran on its weights without extra training.
The AC module was encouraged to add instead of delete instances
because original predictions were by design superior to
abandoned region proposals and a manual deletion costs fewer
mouse clicks than a manual addition. Overall, the AC smoothed the
performance gain with trained images and traded precision for
recall within approximately 10% (CRA as compared with CRAAC and
CRAAC 2). The human effort saving with AC was generally in line
with CRA alone, or marginally better in the early stage of
CRAAC 2. Between configurations of AC (CRAAC against CRAAC 2),
experiments showed that using the least certain images and not
the most certain images for training yielded better performance.
More in-depth sensitivity tests on the AC module are in the
supplementary material.

4.5. Scores for Granular Sources of Uncertainty

Besides model performance, the CRAAC helps annotators
understand the sources of pseudo-label uncertainty
through the five interpretable scores from AL in Fig. 3.
Tab. 3 shows the contribution of prediction uncertainty
(B + M) and category value (V) to the total score at the
early (400 images), mid (1000 images) and end stages of
CRA training. The top 20 unlabelled images had higher total
scores dominated by prediction uncertainty. Choosing them for
human review made sense because the high prediction uncertainty
meant the model struggled with the images, and this was unrelated
to the data distribution. Labels of the reviewed images could be
ascertained and added to the training set, clearing the model
uncertainty.

Figure 7. Recall of experiments, in AR50

Figure 8. Required mouse click counts

Conversely, to gather instances in scarcer categories,
users could choose images with higher data distribution scores
or even raise the weighting $w_V$ for the desired category.
This was why the AC adopted the most certain 20 images
(the bottom 20), which had a larger proportion of V, as correction
templates. The least certain images may contain circumstantial
corrections to be avoided in most images. Putting extra weights on
desired categories also helped extract their instances earlier in
training. A parametric study (supp. material) showed that if a 5x
weight was added to "potholes", the model would extract more
"potholes" from the unlabelled set (72% of the full dataset at 5x
vs 50% at 1x) and raise AR50 of "potholes" from 11% to 42% in
mid-stage training. This weighting effect attenuated towards the
end of training as all instances were eventually trained.

Table 3. Score distribution in different CRA training stages

            400 images           1000 images          End (1728 images)
            B+M   V    Total     B+M   V    Total     B+M   V    Total
                       score                score                score
Top 20      64%   15%  3.86      57%   23%  3.25      52%   29%  2.26
Top 100     61%   18%  3.46      53%   26%  2.88      51%   31%  2.19
Bottom 20   35%   51%  1.24      47%   39%  1.41

Figure 9. Instance distribution of A12 and A14 dataset

4.6. Validation

The authors additionally performed validation tests on
another equally problematic pavement image dataset from
A14 Tothill in the UK. Compared with A12 Mountnessing, which is
overlain with asphalt, A14 Tothill was built with a bare concrete
surface and exhibited a dominating class bias towards
patches (Fig. 9). Experiments were conducted in similar setups
to Tab. 2, with results showing that all setups saved 5-9% of
mouse clicks relative to the original Mask R-CNN model and
outperformed it in average precision and recall by approx.
20-30%. The AL setup in validation extracted significantly
more instances in the early and mid-stage training,
leading to comparable results in AL, CRA and CRAAC.
The supplementary material records details of the validation test.

4.7. Limitations

The noisiness of the dataset hampers the model. Besides
the inherent difficulties in detecting dark grey cracks
or potholes on a lighter grey asphalt surface, human annotation
inconsistencies aggravated the noisiness. These commonly
include the need for larger bboxes to cover wiggly
cracks, circumstantial corrections that should not be replicated
and fragmented masks. Quantitatively, the silhouette
score [23] was calculated to show the intra- and inter-category
sparseness of the dataset. The average silhouette
score for all categories was 0.146, which was much worse
than a good threshold of 0.5 [6] or the silhouette scores of
common datasets such as the Iris or the S-1 dataset [25].
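For readers unfamiliar with the metric, the mean silhouette score can be computed as in the sketch below. It assumes Euclidean distance and takes the category of each instance as its cluster label; which features the authors used to obtain the value of 0.146 is not stated here, so the inputs are hypothetical.

```python
import math

def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (Euclidean distance).

    For each point, a is the mean distance to its own cluster and b is the
    smallest mean distance to any other cluster; its coefficient is
    (b - a) / max(a, b). Points alone in their cluster score 0.
    """
    n = len(points)
    scores = []
    for i in range(n):
        by_label = {}
        for j in range(n):
            if i != j:
                by_label.setdefault(labels[j], []).append(
                    math.dist(points[i], points[j])
                )
        own = by_label.pop(labels[i], [])
        if not own or not by_label:
            scores.append(0.0)
            continue
        a = sum(own) / len(own)
        b = min(sum(d) / len(d) for d in by_label.values())
        denom = max(a, b)
        scores.append((b - a) / denom if denom else 0.0)
    return sum(scores) / n
```

Values near 1 indicate compact, well-separated categories, while values near 0 indicate overlapping ones, which is how the reported 0.146 should be read.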

The proposed modules could be adapted for other architectures
in the future. While Jeong [9] also proposed a CNS
for a single-stage detector, more work may be required to
extract the bbox and mask loss for the scoring modules in
other architectures. The AC operated independently from
trained models, so it could theoretically be deployed in other
architectures, except that users may need to extract proposals
for the addition action from a position other than the RPN.

5. Conclusion

This research aims to reduce human input in annotating
real-life large, noisy and domain-specific image datasets
while maintaining annotation quality. The authors propose a
solution, CRAAC: Consistency Regularised Active learning with
Automatic Corrections. The solution incorporates consistency
regularisation to improve the classifier prediction, attaches
a parallel scoring module to output interpretable
scores for selecting informative images in active learning
and automatically corrects pseudo-labels from past human
corrections. Results show that CRAAC improves mAP
and AR by 40-50% over the original Mask R-CNN
and reduces human effort by 5-11%. Adopting only the proposed
active learning module matches the performance of
the setup that also adopts consistency regularisation at the
beginning but loses precision and recall by 5-18% towards
the end of training. Automatic correction yields some gain
in recall and generally attenuates the fluctuation in performance.
The ability to estimate scores from labelled and unlabelled
images enables annotators to better understand the
labelling uncertainty in detail and devise a more suitable
annotation strategy. The CRAAC solution overall reduces
manual labour in annotating images, addresses the needs
in creating and correcting pseudo-labels and generates
interpretable scores for annotation. The solution can be further
polished to become a data annotation tool for labelling
domain-specific datasets.

Acknowledgement

The dataset is provided courtesy of National Highways,
with partial manual annotation assisted by Mr. Runqi
Chen. The author (P Lam) is funded by the UK Engineering
and Physical Sciences Research Council (EPSRC) Centre
for Doctoral Training in Future Infrastructure and Built
Environment: Resilience in a Changing World (FIBE2)
[grant number EP/S02302X/1] and sponsored by National
Highways, Costain and Trimble Solutions. This work
is supported by the Digital Roads, UK EPSRC [grant number
EP/V056441/1].


References
[1] Amazon Web Services. Amazon SageMaker Data Labeling:

Create high-quality datasets for training machine learning
models, 7 2023. 1

[2] Deeksha Arya, Hiroya Maeda, Sanjay Kumar Ghosh, Durga
Toshniwal, Alexander Mraz, Takehiro Kashiyama, and
Yoshihide Sekimoto. Deep learning-based road damage de-
tection and classification for multiple countries. Automation
in Construction, 132, 12 2021. 2

[3] Jordan T. Ash, Chicheng Zhang, Akshay Krishnamurthy,
John Langford, and Alekh Agarwal. Deep Batch Active
Learning by Diverse, Uncertain Gradient Lower Bounds. In
8th International Conference on Learning Representations,
6 2019. 2

[4] Hritam Basak and Zhaozheng Yin. Pseudo-label Guided
Contrastive Learning for Semi-supervised Medical Image
Segmentation. In IEEE/CVF Conference on Computer Vi-
sion and Pattern Recognition, pages 19786–19797, 2023. 2

[5] Yongxing Dai, Jun Liu, Yan Bai, Zekun Tong, and Ling Yu
Duan. Dual-Refinement: Joint Label and Feature Re-
finement for Unsupervised Domain Adaptive Person Re-
Identification. IEEE Transactions on Image Processing,
30:7815–7829, 2021. 2

[6] Edwin S. Dalmaijer, Camilla L. Nord, and Duncan E. Astle.
Statistical power for cluster analysis. BMC Bioinformatics,
23(1), 12 2022. 8

[7] Ismail Elezi, Zhiding Yu, Anima Anandkumar, Laura Leal-
Taixé, and Jose M Alvarez. Not All Labels Are Equal: Ratio-
nalizing The Labeling Costs for Training Object Detection.
In IEEE/CVF Conference on Computer Vision and Pattern
Recognition, pages 14472–14481, New Orleans, LA, USA,
2022. 2

[8] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross
Girshick. Mask R-CNN. In Proceedings of the IEEE
International Conference on Computer Vision, volume
2017-October, pages 2980–2988. Institute of Electrical and
Electronics Engineers Inc., 12 2017. 1

[9] Jisoo Jeong, Seungeui Lee, Jeesoo Kim, and Nojun Kwak.
Consistency-based Semi-supervised Learning for Object De-
tection. In Conference on Neural Information Processing
Systems, Vancouver, Canada, 2019. 2, 3, 8

[10] Qiuye Jin, Mingzhi Yuan, Qin Qiao, and Zhijian Song. One-
shot active learning for image segmentation via contrastive
learning and diversity-based sampling. Knowledge-Based
Systems, 241, 4 2022. 2

[11] Adrian Krenzer, Kevin Makowski, Amar Hekalo, Daniel Fit-
ting, Joel Troya, Wolfram G. Zoller, Alexander Hann, and
Frank Puppe. Fast machine learning annotation in the medi-
cal domain: a semi-automated video annotation tool for gas-
troenterologists. BioMedical Engineering Online, 21(1), 12
2022. 2

[12] Percy Lam, Weiwei Chen, Lavindra de Silva, and Ioannis
Brilakis. Correcting Road Image Annotations. In Apollo -
University of Cambridge Repository, 2024. 4, 6

[13] Yuan-Hong Liao, Amlan Kar, and Sanja Fidler. Towards
Good Practices for Efficiently Annotating Large-Scale Image
Classification Datasets. In IEEE/CVF Conference on
Computer Vision and Pattern Recognition, Nashville, TN,
USA, 6 2021. IEEE. 2

[14] Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao
Zhang, Jie Yang, Chunyuan Li, Jianwei Yang, Hang Su, Jun
Zhu, and Lei Zhang. Grounding DINO: Marrying DINO
with Grounded Pre-Training for Open-Set Object Detection.
arXiv preprint, 3 2023. 2

[15] Normaisharah Mamat, Mohd Fauzi Othman, Rawad Abdul-
ghafor, Ali A. Alwan, and Yonis Gulzar. Enhancing Image
Annotation Technique of Fruit Classification Using a Deep
Learning Approach. Sustainability (Switzerland), 15(2), 1
2023. 2

[16] Islam Nassar, Munawar Hayat, Ehsan Abbasnejad, Hamid
Rezatofighi, and Gholamreza Haffari. PROTOCON: Pseudo-
label Refinement via Online Clustering and Prototypical
Consistency for Efficient Semi-supervised Learning. In Pro-
ceedings of the IEEE/CVF Conference on Computer Vision
and Pattern Recognition (CVPR), pages 11641–11650, 6
2023. 2

[17] National Highways. Connecting the Country: Our Long
Term Strategic Plan to 2050. Technical report, National
Highways, Guildford, UK, 5 2023. 1

[18] National Highways. New plan maps our vision for the future,
5 2023. 1

[19] Yassine Ouali, Céline Hudelot, and Myriam Tami. An
Overview of Deep Semi-Supervised Learning. 6 2020. 2

[20] Vung Pham, Du Nguyen, and Christopher Donan. Road
Damages Detection and Classification with YOLOv7. 10
2022. 2

[21] Alberto Rizzoli. 13 Best Image Annotation Tools of 2023
[Reviewed], 2023. 1

[22] Roboflow Inc. Roboflow Annotate: Quickly Label Training
Data and Export To Any Format, 2023. 1

[23] Peter J Rousseeuw. Silhouettes: a graphical aid to the inter-
pretation and validation of cluster analysis. Technical report,
1987. 8

[24] Anthony Scalabrino. Image Annotation for Computer Vi-
sion, 2024. 1

[25] Ketan Rajshekhar Shahapure and Charles Nicholas. Cluster
quality analysis using silhouette score. In Proceedings - 2020
IEEE 7th International Conference on Data Science and Ad-
vanced Analytics, DSAA 2020, pages 747–748. Institute of
Electrical and Electronics Engineers Inc., 10 2020. 8

[26] Piotr Skalski and James Gallagher. YOLO-World: Real-
Time, Zero-Shot Object Detection, 2 2024. 2

[27] Abraham George Smith, Eusun Han, Jens Petersen, Niels
Alvin Faircloth Olsen, Christian Giese, Miriam Athmann,
Dorte Bodin Dresbøll, and Kristian Thorup-Kristensen.
RootPainter: deep learning segmentation of biological
images with corrective annotation. New Phytologist,
236(2):774–791, 10 2022. 2

[28] Trimble Inc. Trimble MX9 Mobile Mapping Solution. Tech-
nical report, USA, 2022. 4

[29] Leo Ueno and Trevor Lynn. PaliGemma: An Open
Multimodal Model by Google, 5 2024. 1

[30] V7 Labs. Auto Annotation: Auto-Annotate Complex
Objects 10x Faster, 2023. 1

[31] Chien-Yao Wang, Alexey Bochkovskiy, and Hong-
Yuan Mark Liao. YOLOv7: Trainable bag-of-freebies sets
new state-of-the-art for real-time object detectors. 7 2022. 1

[32] Jun Wang, Shaoguo Wen, Kaixing Chen, Jianghua Yu, Xin
Zhou, Peng Gao, Changsheng Li, and Guotong Xie. Semi-
supervised Active Learning for Instance Segmentation via
Scoring Predictions. In British Machine Vision Virtual Con-
ference, Virtual, 12 2020. 2, 3

[33] Shanshan Wang, Cheng Li, Rongpin Wang, Zaiyi Liu,
Meiyun Wang, Hongna Tan, Yaping Wu, Xinfeng Liu, Hui
Sun, Rui Yang, Xin Liu, Jie Chen, Huihui Zhou, Ismail
Ben Ayed, and Hairong Zheng. Annotation-efficient deep
learning for automatic medical image segmentation. Nature
Communications, 12(1), 12 2021. 2

[34] Weihan Wang, Qingsong Lv, Wenmeng Yu, Wenyi Hong, Ji
Qi, Yan Wang, Junhui Ji, Zhuoyi Yang, Lei Zhao, Xixuan
Song, Jiazheng Xu, Bin Xu, Juanzi Li, Yuxiao Dong, Ming
Ding, and Jie Tang. CogVLM: Visual Expert for Pretrained
Language Models. 11 2023. 2

[35] Haoyan Wu, Shikui Wei, Chuangchuang Tan, and Yao Zhao.
Pseudo-label Correction from Pixel to Image. In CTISC
2022 - 2022 4th International Conference on Advances in
Computer Technology, Information Science and Commu-
nications. Institute of Electrical and Electronics Engineers
Inc., 2022. 2

[36] Jiaxi Wu, Jiaxin Chen, and Di Huang. Entropy-based Ac-
tive Learning for Object Detection with Progressive Diver-
sity Constraint. In Proceedings of the IEEE Computer Soci-
ety Conference on Computer Vision and Pattern Recognition,
volume 2022-June, pages 9387–9396. IEEE Computer Soci-
ety, 2022. 2

[37] Jiarui Xu, Shalini De Mello, Sifei Liu, Wonmin Byeon,
Thomas Breuel, Jan Kautz, and Xiaolong Wang. GroupViT:
Semantic Segmentation Emerges from Text Supervision.
In IEEE/CVF Conference on Computer Vision and Pattern
Recognition, pages 18113–18123, New Orleans, LA, USA,
6 2022. IEEE. 2

[38] Cancan Yi, Jun Liu, Tao Huang, Han Xiao, and Hui Guan.
An efficient method of pavement distress detection based on
improved YOLOv7. Measurement Science and Technology,
34(11), 11 2023. 2

[39] Donggeun Yoo and In So Kweon. Learning Loss for Active
Learning. In IEEE/CVF Conference on Computer Vision and
Pattern Recognition, Long Beach, CA, USA, 2019. IEEE. 2,
3

[40] Weiping Yu, Sijie Zhu, Taojiannan Yang, and Chen Chen.
Consistency-based Active Learning for Object Detection. In
IEEE Computer Society Conference on Computer Vision and
Pattern Recognition Workshops, volume 2022-June, pages
3950–3959. IEEE Computer Society, 2022. 2

[41] Zhi Hua Zhou. A brief introduction to weakly supervised
learning, 1 2018. 2