CRAAC: Consistency Regularised Active Learning with Automatic Corrections for Real-life Road Image Annotations

Percy Lam, University of Cambridge, phl25@cam.ac.uk
Sooyong Park, University of Cambridge, martellw2ks@gmail.com
Weiwei Chen, University College London, weiwei.chen@ucl.ac.uk
Lavindra de Silva, University of Cambridge, lpd25@cam.ac.uk
Ioannis Brilakis, University of Cambridge, ib340@cam.ac.uk

Abstract

Annotating real-life large, noisy and domain-specific images for digitising infrastructure still demands substantial human effort despite past advancements. This research provides practical and interpretable scores for human annotators, enabling flexible annotation strategies, improving automation and reducing the effort required to create and correct image labels. The authors present the CRAAC solution: Consistency Regularised Active learning and Automatic Corrections, which builds on Mask R-CNN with three additional modules: consistency regularisation, scoring modules for active learning and automatic corrections. Experiments on our pavement image dataset, which has a low silhouette score of 0.146 and qualitative annotation inconsistencies, show that CRAAC reduces human effort, measured in mouse clicks, by 5-11% and improves the quality metrics of mAP and AR by approx. 40% over the original Mask R-CNN. The automatic correction further reduces the performance variation.

1. Introduction

Digitisation is seen as a key solution [18] for enhancing the economic, environmental and safety performance [17] of our critical road infrastructure. The state-of-the-art (SOTA) data preparation process requires so much manual effort that it may negate the benefits of digitisation, which motivates the current research.

This research specifically targets automation in annotating real-life road images from highway authorities, which are hence large, noisy and domain-specific.
These images require annotations, the markup of image features that machine learning models are trained to automatically recognise [24]. This project adopts instance segmentation [24] labels. Throughout the paper, pseudo-labels refer to labels predicted from a model without manual corrections, whereas ground truth labels are the accepted labels after manual corrections or annotations. Automation enables images to be labelled with less or no human input.

SOTA image annotation involves a range of automation. Open-set annotation such as PaliGemma [29] detects segmentation masks from text prompts of desired generic objects. Labelling non-generic objects involves human annotators [1] with a range of semi-automatic assistive tools [21], such as inferences from pre-trained models [22] and automatic polygon adjustments [30]. These tools, however, perform inconsistently on large, noisy, domain-specific datasets. They also cannot recommend which image would improve the tool's inference the most. The quest to make better use of human review effort and to acquire better explanations for failed pseudo-labels underpins the research.

This research makes the following contributions:

1. Design a pipeline that reduces human effort in both generating and correcting annotations, by querying the most advantageous images and mimicking past corrections when reviewing pseudo-labels
2. Generate practical and interpretable inference scores through bbox and mask scoring modules that enable flexible annotation strategies

2. Related Works

Deep learning architectures for 2D image instance segmentation are available as single-stage detectors, such as YOLOv7 [31], or two-stage detectors, such as Mask R-CNN [8]. Research teams have applied these detectors to prepare labelled datasets and detect difficult objects such as pavement defects [2, 20, 38]. Advancements in inferring and correcting pseudo-labels prompt our research.

2.1.
Enhancements in creating pseudo-labels

Research teams have been enhancing pseudo-label creation by semi-supervised learning (SSL). SSL leverages unlabelled images to improve predictions from a small labelled dataset [19] by two approaches. The first approach implicitly extracts information from the unlabelled data without labelling them, such as consistency regularisation, which utilises losses between the predictions of the original image and its augmented version [4, 19]. The extra loss can thus bolster the performance of a trained model, single-stage or two-stage detectors alike [9], or be used as a score for selecting images for human review [7].

The second approach observes the distribution of the unlabelled data without resorting to the training process and enables labelling strategies such as active learning [32]. This technique selects the most informative or representative [41] pseudo-labelled instances to query from a large unlabelled dataset, thereby sparing the need to review every pseudo-label. Researchers improved the process by finding better sampling methods [3], combining it with other SSL methods [10, 13], eliminating redundant predictions [36] and considering better loss functions and terms. Training losses can be modified by adding new loss terms [7] and by quantifying losses as scores [32, 39, 40].

Open-set annotation enables image annotation through vision-language training. Based on text-to-image embeddings, zero-shot object detection enables bounding boxes (bbox) and/or masks to be detected from an input text prompt [14, 26, 34, 37].

2.2. Enhancements in correcting pseudo-labels

In contrast with pseudo-label creation, pseudo-label corrections by humans are still prevalent in domain-specific applications, such as labelling fruits [15], roots [27] and ulcers [11].
While some researchers such as Wu [35] attempted to correct pseudo-labels automatically by simple intersection and/or union of polygons, automatic correction is still an open field for innovation.

The closest resemblance is pseudo-label refinement, common in SSL or unsupervised domain adaptation. This technique adjusts pseudo-labels by training another projection head with auxiliary tasks [5, 33] or applying clustering to past ground truths [16]. Pseudo-label refinement, however, only assesses the final corrected state and does not consider the change from the original to the final state.

Overall, current techniques cannot annotate domain-specific datasets fully automatically. Specifically, a gap remains in enabling human annotators to identify problems with images and devise appropriate annotation strategies to complete the remaining annotations. Existing research also does not fully capture the value of past human corrections on pseudo-labels.

3. Method

Our proposed solution extends the two-stage detector Mask R-CNN with three additional components. The two-stage architecture allows modules to be modified independently without interfering with one another, while its Region Proposal Network (RPN) provides proposal bboxes for recovering missing instances. The first additional component, consistency regularisation (Sec. 3.1), aims to improve the predictor's performance by leveraging unlabelled data in the classifier training, thereby outputting better pseudo-labels that need fewer alterations. The second component, the scoring module, estimates the image informativeness (Sec. 3.2) and outputs 5 scores for ranking images in active learning, so human annotators can interpret the informativeness of unlabelled images, select useful images for review and save reviewing effort. The auto-correction component (Sec.
3.3) records previous manual corrections on predicted pseudo-labels and mimics the corrections on the rest of the predictions. The three components combined form the CRAAC solution as illustrated in Fig. 1.

3.1. Consistency Regularisation (CNS)

The authors introduce the CNS loss in the classifier training to improve its prediction performance. The CNS loss is calculated by adapting Jeong's [9] structure for two-stage detectors in Fig. 2. The structure first augments the input image with a horizontal flip (xflip) and creates feature maps for both images. The RPN generates region proposals for the original image and flips the regions horizontally for the augmented image. The classifier then predicts class probability vectors for the proposals of both images, which are passed through a softmax for the CNS classification loss calculation.

Following Jeong [9], the CNS loss comprises two losses: the loss due to the difference in predicted class probability vectors (unsupervised class loss, $L^{cns}_{cls}$) and the difference between the four scalars (bbox centre x, y and bbox width w and height h) of the bboxes (unsupervised bbox loss, $L^{cns}_{bbox}$) in the original versus the augmented image. The two losses are then added with weight $\beta$ to the supervised loss of Mask R-CNN, $L_{sup}$, to train the classifier in Eq. (1). The loss is introduced to help differentiate features and improve the classifier for better pseudo-labels. The RPN is only trained with $L_{sup}$ to avoid the correspondence-matching problem.

$$L_{total} = L_{sup} + \beta L_{CNS} = L_{sup} + \beta\,(L^{cns}_{cls} + L^{cns}_{bbox}) \tag{1}$$

Figure 1. The Overall Schematic Structure of CRAAC and Adopted Loss for Training
Figure 2. CNS Structure for Two-stage Detectors by Jeong [9]
Figure 3. The Five Scoring Components for the Active Learning Pipeline

3.2. Active Learning (AL)

Besides improving the classifier, the authors also adopt active learning (AL) to select the most informative images to reduce human review.
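The consistency loss of Sec. 3.1 and its combination in Eq. (1) can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the Jensen-Shannon divergence for the class term, the mean squared difference for the bbox term, the mapping of the flipped centre back with `image_width - cx`, and all function names are assumptions made for the example.

```python
import math


def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence between two class-probability vectors (assumed distance)."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(u, v):
        return sum(a * math.log((a + eps) / (b + eps)) for a, b in zip(u, v))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def unflip_box(box, image_width):
    """Map a (cx, cy, w, h) box predicted on the x-flipped image back to the original frame."""
    cx, cy, w, h = box
    return (image_width - cx, cy, w, h)


def cns_loss(probs, probs_flip, boxes, boxes_flip, image_width):
    """Return (L_cns_cls, L_cns_bbox) averaged over matched proposals."""
    n = len(probs)
    l_cls = sum(js_divergence(p, q) for p, q in zip(probs, probs_flip)) / n
    l_bbox = sum(
        sum((a - b) ** 2 for a, b in zip(box, unflip_box(box_f, image_width))) / 4
        for box, box_f in zip(boxes, boxes_flip)
    ) / n
    return l_cls, l_bbox


def total_loss(l_sup, l_cls, l_bbox, beta):
    """Eq. (1): supervised loss plus the beta-weighted consistency losses."""
    return l_sup + beta * (l_cls + l_bbox)
```

When the predictions on the original and flipped images agree, both consistency terms vanish and the total reduces to the supervised loss; any disagreement adds a positive penalty scaled by `beta`.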
Image informativeness is measured by the uncertainty and distribution scores in Fig. 3. The uncertainty scores show the uncertainty in classification (C), bbox regression (B) and mask prediction (M). The distribution scores quantify an image's relevance based on the scarcity of the predicted category (V) and the number of predicted instances in an image (A).

On the uncertainty scores, besides C provided by the original Mask R-CNN, the authors estimate B and M by training parallel scoring modules extending from Yoo [39]. This explains the sources of model uncertainty better and enables unlabelled images to be used, compared with Wang's [32] approach that relied on labelled ground truth.

The scoring module extends in parallel from the box and mask heads in Mask R-CNN as shown in Fig. 4 and consists of structures detailed in the supplementary material. The inputs are feature maps from the box and mask heads, as well as the box regression loss ($L^{boxhead}_{bbox}$) and mask loss ($L^{maskhead}_{mask}$) of Mask R-CNN. The modules then predict non-negative scalar scores B and M. When training the box and mask scoring modules, the box scoring module computes the Huber loss $L^{module}_{bbox}$ between $L^{boxhead}_{bbox}$ and B, and the mask scoring module similarly computes $L^{module}_{mask}$ between $L^{maskhead}_{mask}$ and M, as in Eq. (2).

$$L_{total} = L^{boxhead}_{bbox} + L^{maskhead}_{mask} + \alpha\,(L^{module}_{bbox} + L^{module}_{mask}) \tag{2}$$

where $\alpha$ weighs the effect of the scoring module losses.

At inference, besides the model uncertainty scores C, B and M, the category value V and the label abundance A represent the data distribution. V is the inverse of the relative frequency of the predicted category in the labelled set. Intuitively, images with more predicted instances (higher A) in scarcer categories (higher V) are more informative.
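The scoring-module loss of Eq. (2) and the category value V described above can be sketched as follows. This is a simplified illustration on scalars, assuming a Huber threshold of `delta = 1.0`; the function names are hypothetical and the real modules operate on feature maps inside the network.

```python
from collections import Counter


def huber(pred, target, delta=1.0):
    """Huber loss between a predicted score and its target loss value."""
    d = abs(pred - target)
    return 0.5 * d * d if d <= delta else delta * (d - 0.5 * delta)


def scoring_total_loss(l_box, l_mask, B, M, alpha):
    """Eq. (2): head losses plus alpha-weighted Huber losses of the scoring modules."""
    return l_box + l_mask + alpha * (huber(B, l_box) + huber(M, l_mask))


def category_values(labelled_categories):
    """V per category: inverse relative frequency in the labelled set."""
    counts = Counter(labelled_categories)
    total = sum(counts.values())
    return {cat: total / n for cat, n in counts.items()}
```

For instance, with the initial training set of Tab. 1 (82 transverse cracks, 45 longitudinal cracks and 6 patches), `category_values` gives the scarce "patch" category a value of 133/6, roughly 22, against about 1.6 for transverse cracks.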
$$V_i = \left(\frac{\text{no. of instances in category } i}{\text{total no. of instances in all categories}}\right)^{-1}$$

The total informativeness score $S_i$ for each image $i$ with $j \in \{1, ..., N\}$ predicted instances is therefore the weighted sum of the five informative scores C, B, M, V, A in Eq. (3).

$$S_i(C_i, B_i, M_i, A_i, V_i) = w_B B_i + w_M M_i + w_A A_i + \frac{1}{N}\sum_{j=1}^{N}\left(w_C C_{i,j} + w_V V_{Cat(j)}\right) \tag{3}$$

Note that the equation calculates B, M and A per image, and C and V per instance. The $w$ terms denote fixed weights applied to each of the scores.

After training the classifier and the scoring modules in each iteration, the model predicts pseudo-labels for all unlabelled images and calculates the five informative scores. After applying the weights in Eq. (3), the algorithm ranks the unlabelled images by their informative score. The top n (20 in Tab. 2) unlabelled images are the least certain images, quantified by their high box and mask loss (in B and M) and likely large A as discussed in Sec. 4.5. They are selected for review; in the experiment, their pseudo-labels are replaced by the ground truth labels and added to the labelled dataset for the next training iteration. This aims to maximise the improvement of the classifier with the fewest images, hence requiring the least manual review and producing better pseudo-labels in later iterations.

Conversely, the most certain images are the bottom n (also 20 in Tab. 2) images with at least one pseudo-label. They become the correction templates in AC as the predictions tend to be more reliable (lower B and M, Sec. 4.5). The selection prevents the circumstantial corrections that would arise if all instances were used (Sec. 4.7) and performs auto-correction on common pseudo-label errors.

3.3. Automatic Corrections (AC)

The AC module gauges and mimics annotators' modifications to save human effort on repetitive corrections. From the correction templates of the most certain images, the module reads the ground truth (manual correction) and records the correction action.
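The scoring and ranking steps of Sec. 3.2 can be sketched as below. This is a minimal illustration of Eq. (3), assuming the per-image scores B, M, A and the per-instance scores C are already available as plain numbers and that V is a dictionary of category values; the function names and the dictionary layout are hypothetical, while the weights are those reported in Sec. 4.3.

```python
# Weights reported in Sec. 4.3 for all tests involving AL.
WEIGHTS = {"B": 0.5, "M": 0.5, "A": 0.1, "C": 1.0, "V": 1.0}


def image_score(B, M, A, instance_C, instance_cats, V, w=WEIGHTS):
    """Eq. (3): per-image terms plus the mean of the per-instance terms.

    instance_C    -- classification uncertainty of each predicted instance
    instance_cats -- predicted category of each instance (must be a key of V)
    """
    n = len(instance_C)
    per_instance = 0.0
    if n:  # images without predictions contribute no per-instance term
        per_instance = sum(
            w["C"] * c + w["V"] * V[cat] for c, cat in zip(instance_C, instance_cats)
        ) / n
    return w["B"] * B + w["M"] * M + w["A"] * A + per_instance


def rank_by_score(scores):
    """Return image ids ordered from least certain (highest score) to most certain."""
    return sorted(scores, key=scores.get, reverse=True)
```

The head of the returned ranking corresponds to the least certain images queried for review, and the tail (restricted to images with at least one pseudo-label) to the candidates for correction templates.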
Correction actions include the deletion, addition and category changes in [12] and, additionally, resized bboxes. The module records the category, bbox details and the 1024-length embedding of every instance in the correction template before and after the correction.

At inference, the module applies the trained model to predict instances and region proposals on the testing set images. In the transverse crack example in Fig. 5, the cosine similarity is calculated between the embedding of the proposal bbox in the testing image and the embeddings of instances in the correction templates. If the pairwise similarity exceeds a preset threshold (see Sec. 4.3), and if the width-to-height ratio (both predicted instance and region proposal) and the predicted category (predicted instance only) are similar to the correction template, the correction template enters the clustering algorithm. The combined use of computed similarity and physical attribute data helps distinguish physically different instances with similar computational features, such as longitudinal and transverse cracks.

The clustering algorithm decides the correction action. The embedding of the predicted instance or the proposal bbox of the testing image (the testing feature) enters a k-NN classifier, where k = 3 (delete, add and change category; a size change keeps the predicted instance). The testing feature clusters with the embeddings of similar instances from the correction templates (the correction features). The closest cluster recommends the action on the predicted instance/proposal.

• A predicted instance is deleted or changes its category if the closest template recommends so.
• A region proposal is added if it is the closest to a template recommending an addition.
• No changes are made otherwise.

The testing feature in Fig. 5 is the closest to the addition cluster and is hence added.
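The decision logic above can be sketched as follows. This is a simplified stand-in, not the authors' module: templates are filtered by cosine similarity and width-to-height ratio, then the action is taken from a majority vote of the three most similar templates, falling back to the single most similar template when fewer than three survive the filter. The template dictionary layout, the `aspect_tol` tolerance and the function names are assumptions, and the category check for predicted instances is omitted for brevity.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def recommend_action(feature, aspect, templates, thresholds, aspect_tol=0.5, k=3):
    """Return 'delete', 'add', 'change' or 'keep' for one testing feature.

    templates  -- list of dicts with keys 'embedding', 'aspect', 'action'
    thresholds -- similarity threshold per action, e.g. {'delete': 1.0, 'add': 0.7, 'change': 0.7}
    """
    candidates = []
    for t in templates:
        sim = cosine(feature, t["embedding"])
        similar_shape = abs(aspect - t["aspect"]) <= aspect_tol * t["aspect"]
        if sim >= thresholds[t["action"]] and similar_shape:
            candidates.append((sim, t["action"]))
    if not candidates:
        return "keep"  # no sufficiently similar template: leave the prediction as is
    candidates.sort(key=lambda c: c[0], reverse=True)
    if len(candidates) < k:
        return candidates[0][1]  # too few templates to vote: follow the most similar one
    votes = Counter(action for _, action in candidates[:k])
    return votes.most_common(1)[0][0]
```

The thresholds in the docstring follow the CRAAC row of Tab. 2; a deletion threshold of 1.0 makes deletions rare, which matches the stated aim of avoiding accidental deletions.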
If the predicted instance is similar to fewer than three correction template instances, there are not enough templates to form clusters, so the algorithm follows the most similar template.

The remaining predicted instances are post-processed in a similar way to [12]. The post-processing merges bboxes and masks of the same class with an area difference and centroid distance below a threshold (empirically set to 0.4 and 0.2 × min(image height, image width)). This accounts for enlarged masks with wiggly cracks and magnified instances in the near field while ensuring that actual fragmented masks are merged (see supplementary material). The post-processed outcome is evaluated against the ground truth of the testing set in this experiment, or becomes the pseudo-labels in actual image annotation. The AC module thus helps avoid repetitive corrections by humans.

4. Experiments

This section analyses and discusses the dataset, the performance metrics, the testing setup and the resulting performance.

4.1. Dataset Analysis

The authors experimented with pavement images collected from A12 Mountnessing, United Kingdom (6.6 km × 2 lanes) and annotated at the level of instance segmentation. The images were taken with a vehicle-mounted pavement camera, the Trimble MX9 [28], travelling at lane speed.

Figure 4. The Structure of the Scoring Module and the Automatic Correction Module
Figure 5. The Operating Mechanism of the Automatic Correction Module

Model training began after labelling the initial training set of 101 positive images (images containing instances). For experimentation, all images were annotated in advance. 209 positive images were separated as the testing set for all iterations. The remaining 1663 positive images in the dataset would be queried for human review and transferred to the labelled set until all positive images were selected for training. The image selection depended on the testing setup in Sec. 4.3.
The annotated dataset included four crucial categories of defects for road maintenance, namely crack transverse, crack longitudinal, potholes and patch. Tab. 1 shows a summary of the image and instance counts.

Table 1. Defect distribution of the initial training, testing and the remainder set

| Batch | Init. Train | Testing | Remainders |
| Nos. Images | 1000 | 500 | 10181 |
| Positive Images | 101 | 209 | 1663 |
| Nos. Instances | 133 | 419 | 2718 |
| Instance distribution: crack transverse | 82 | 151 | 888 |
| Instance distribution: crack longitudinal | 45 | 68 | 889 |
| Instance distribution: potholes | 0 | 19 | 101 |
| Instance distribution: patch | 6 | 181 | 840 |

4.2. Benchmarking Metrics

The experiments tested the hypothesis of saving human effort while maintaining the quality of annotation. The required human effort was quantified by the number of mouse clicks required to perform actions in Sec. 3.3 to amend pseudo-labels into their final ground truth.

1. Deletion: 1 click (the bin button)
2. Addition: 5-7 clicks, the median clicks to create a mask across categories, see the supp material
3. Category change: 2 clicks (open the dropdown list, then choose a new category)
4. Change size (resize the mask): taken as 4 clicks (drag four corners to their new positions)

The quality of the predicted pseudo-labels is reported in the mean average precision over categories (mAP50) and average recall for 100 detections per image (AR50, maxdet=100) at the IoU threshold of 0.5 against the actual ground truths.

4.3. Testing Setup

Experiments began by training the initial training set in Sec. 4.1 with the COCO pretrained weights. The model selected around 100 images per training iteration and transferred the selected images to the labelled set. The model was then fine-tuned by the entire new labelled set in the next training iteration, and iterated until the whole positive image set was labelled and trained. Different experiments were conducted, summarised as follows and in Tab. 2.
• Mask R-CNN: original (add 100 random images per iteration)
• Sampling: Use original Mask R-CNN scores for sampling (select 20 least certain + 80 random images)
• AL: Mask R-CNN + AL only (with the scoring module; select 20 least certain + 80 random images)
• CRA: Mask R-CNN + CNS + AL (Consistency Regularised Active learning; with CNS; select 20 least certain + 20 most certain + 60 random images)
• CRAAC: Mask R-CNN + CNS + AL + AC (CRA with Auto-Corrections; setup as CRA, using the most certain 20 as correction templates)
• CRAAC 2: Mask R-CNN + CNS + AL + AC (select 20 least certain + 80 random images for training; the most certain 20 are taken as correction templates but not added to the next training iteration)

Effects of the number of sampled images were studied in [12]. The 20 most and 20 least certain images were selected to ensure sufficient samples for improving the classifier and serving as correction templates in the AC, while not being so numerous that the corrections become suitable only for edge cases. The rest of the 100 images added in each iteration were randomly selected from the unlabelled dataset.

The similarity thresholds for the permissible actions and the objectness of region proposals in the AC could be adjusted. The current set of thresholds was selected to salvage missing instances and avoid accidental deletions, as well as to maintain low mouse clicks across training sets of different sizes, as shown in the supplementary material.

Referring to Eq. (3), the empirically found weights for all tests involving AL were: $w_B = 0.5$, $w_M = 0.5$, $w_A = 0.1$, $w_C = 1$, $w_V = 1$. These weights were chosen to provide scores with a balance between the model uncertainty and data distribution. A higher $w_B$ and $w_M$ would cause a stronger bias towards model uncertainty, outweighing the importance of finding instances of scarcer categories. Sec. 4.5 suggests that $w_V$ could potentially be altered to prioritise finding instances of a particular desired category.
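The per-iteration image selection of the configurations above can be sketched as a single helper. This is a minimal illustration, assuming `ranked_ids` is already ordered from least to most certain and already restricted to the images eligible as templates; the function name and the fixed `seed` are hypothetical.

```python
import random


def select_batch(ranked_ids, n_least=20, n_most=20, batch_size=100, seed=0):
    """Split one iteration's batch into least certain, most certain and random images.

    ranked_ids -- unlabelled image ids ordered from least certain to most certain
    """
    least = ranked_ids[:n_least]
    most = ranked_ids[len(ranked_ids) - n_most:] if n_most else []
    chosen = set(least) | set(most)
    pool = [i for i in ranked_ids if i not in chosen]
    # Fill the rest of the batch with random images from the remaining pool.
    n_random = max(min(batch_size - len(chosen), len(pool)), 0)
    rng = random.Random(seed)
    return least, most, rng.sample(pool, n_random)
```

Calling it with `n_most=20` reproduces the 20 + 20 + 60 split of CRA and CRAAC, while `n_most=0` gives the 20 + 80 split of the Sampling and AL setups.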
All models were generally trained with the same learning rates and settings, detailed in the supp material. The CNS and the scoring modules were both trained for 5000 steps in the first iteration and 1500 and 2000 steps subsequently. The experiments were conducted on a desktop computer with a single NVIDIA GeForce RTX3080Ti with 12GB of VRAM, typically with a batch size of 6.

4.4. Performance

Figs. 6 to 8 show the precision (mAP50), recall (AR50) and required mouse clicks across testing setups. Results generally improve with more training and plateau at approx. 1000 positive images. The original model saved about 20% of human effort compared with full manual labelling, with a further 5-11% reduction by using AL, CNS and/or AC.

4.4.1 Original Mask R-CNN as compared with the rest

Figs. 6 to 8 show that the other testing setups outperformed the original Mask R-CNN (the light green curve) by 40-50% in mAP and AR and by 5-11% in mouse clicks. From 400 trained images onwards, the original model improved more slowly because AL prioritised training on the more informative images through the weighting in Eq. (3). At around 1030 images, the other setups had picked up 50-100 more instances than the original model. When a model was trained with fewer useful images, it descended to a local minimum with worse performance, and the effect cascaded to further training iterations.

Even with sampling by the scores in the original Mask R-CNN, the initial front-loaded gain in precision and recall was exceeded by setups with the three modules from approx. 800 images onwards. The mouse click counts also tended to level with Mask R-CNN towards the end of training, trailing behind the other setups. This indicated the contribution of the three modules in training classifiers for better pseudo-labels and more effective sampling and correction.

4.4.2 AL as compared with the rest with CR

Figs.
6 to 8 show that the model with only active learning (the pink curve) matched models with CNS training (CRA or CRAAC) in quality and human effort before 1000 images. AL had likely captured more high-value images in the early stages of training, front-loading the performance gain. As training continued, the early advantage diminished and the quality metrics, especially AR50, gradually fell below those of models with CNS training. At the end of the training, models with CNS yielded 6-18% growth of AP and 5-17% growth of AR compared with the model with only AL.

Table 2. The testing configurations (✓ = module used, × = module not used)

| Setup | Consistency Regularisation | Active Learning | Auto Correction | Least certain images | Most certain images | Random images | Remarks | AC thresholds (del, add, chcats) | AC objectness logit |
| Mask R-CNN | × | × | × | N/A | N/A | 100 | | | |
| Sampling | × | × | × | 20 | 0 | 80 | Sample with only in-built Mask R-CNN C, V and A | | |
| AL | × | ✓ | × | 20 | 0 | 80 | | | |
| CRA | ✓ | ✓ | × | 20 | 20 | 60 | | | |
| CRAAC | ✓ | ✓ | ✓ | 20 | 20 | 60 | | 1.0, 0.7, 0.7 | 0.04 |
| CRAAC 2 | ✓ | ✓ | ✓ | 20 | 0* | 80 | *20 most certain reviewed for correction templates, not trained | 1.0, 0.8, 0.7 | 0.02 |

Figure 6. Precision of experiments, in mAP50

4.4.3 CRA only compared with the rest with Automatic Correction

The performance of the base CRA model underpinned the AC module, which ran on its weights without extra training. The AC module was encouraged to add instead of delete instances because original predictions were by design superior to abandoned region proposals and deletion costs fewer mouse clicks. Overall, the AC smoothed the performance gain with trained images and traded precision for more recall within approximately 10% (CRA as compared with CRAAC and CRAAC 2). The human effort saving with AC was generally in line with just CRA or marginally better in the early stage of CRAAC 2. Between configurations of AC (CRAAC against CRAAC 2), experiments showed that using the least certain images and not the most certain images for training yielded better performance.
More in-depth sensitivity tests on the AC module are in the supplementary material.

4.5. Scores for Granular Sources of Uncertainty

Besides model performance, CRAAC helps annotators to know the sources of pseudo-label uncertainty through the five interpretable scores from AL in Fig. 3. Tab. 3 shows the contribution of prediction uncertainty (B + M) and category value (V) to the total score at the early (400 images), mid (1000 images) and end stage of CRA training. The top 20 unlabelled images had higher total scores dominated by prediction uncertainty. Choosing them for human review made sense because the high prediction uncertainty meant the model struggled with the images, independently of the data distribution. Labels of the reviewed images could be ascertained and added to the training set, clearing the model uncertainty.

Figure 7. Recall of experiments, in AR50
Figure 8. Required mouse click counts

Conversely, to gather instances in scarcer categories, users could choose images with a larger data distribution score or even raise the weighting $w_V$ for the desired category. This was why the AC adopted the most certain 20 images (the bottom 20), with a larger proportion of V, as correction templates. The least certain images may contain circumstantial corrections to be avoided in most images.

Table 3. Score distribution in different CRA training stages
Values per stage: B+M, V, Total Score. Stages: 400 images | 1000 images | End (1728 images)
Top 20: 64%, 15%, 3.86 | 57%, 23%, 3.25 | 52%, 29%, 2.26
Top 100: 61%, 18%, 3.46 | 53%, 26%, 2.88 | 51%, 31%, 2.19
Bottom 20: 35%, 51%, 1.24 | 47%, 39%, 1.41

Figure 9. Instance distribution of A12 and A14 dataset

Putting extra weights on desired categories also helped extract their instances earlier in training. A parametric study (supp.
material) showed that if a 5x weight was added to "potholes", the model would extract more "potholes" from the unlabelled set (72% of the full dataset at 5x vs 50% at 1x) and raise the AR50 of "potholes" from 11% to 42% in mid-stage training. This weighting effect attenuated towards the end of training as all instances were eventually trained.

4.6. Validation

The authors additionally performed validation tests on another equally problematic pavement image dataset on A14 Tothill in the UK. Compared with A12 Mountnessing, which is overlain with asphalt, A14 Tothill was built with a bare concrete surface and exhibited a dominating class bias towards patches (Fig. 9). Experiments were conducted in setups similar to Tab. 2, with results showing that all setups saved 5-9% of mouse clicks relative to the original Mask R-CNN model and outperformed it in average precision and recall by approx. 20-30%. The AL setup in validation extracted significantly more instances in the early and mid-stage training, leading to comparable results in AL, CRA and CRAAC. The supp materials record details of the validation test.

4.7. Limitations

The noisiness of the dataset hampers the model. Besides the inherent difficulties in detecting dark grey cracks or potholes on a lighter grey asphalt surface, human annotation inconsistencies aggravated the noisiness. These commonly include the need for larger bboxes to cover wiggly cracks, circumstantial corrections that should not be replicated, and fragmented masks. Quantitatively, the silhouette score [23] was calculated to show the intra- and inter-category sparseness of the dataset. The average silhouette score for all categories was 0.146, much worse than the 0.5 threshold regarded as good [6] or the silhouette score of common datasets such as the Iris or the S-1 dataset [25].

The proposed modules could be adapted for other architectures in the future.
While Jeong [9] also proposed a CNS for a single-stage detector, more work may be required to extract the bbox and mask loss for the scoring modules in other architectures. The AC operates independently from trained models, so it could theoretically be deployed in other architectures, except that users may need to extract proposals for the addition action from a position other than the RPN.

5. Conclusion

This research aims to reduce human input in annotating real-life large, noisy and domain-specific image datasets while maintaining annotation quality. The authors propose a solution, CRAAC: Consistency Regularised Active learning with Automatic Corrections. The solution incorporates consistency calculation to improve the classifier prediction, attaches a parallel scoring module to output interpretable scores for selecting informative images in active learning, and automatically corrects pseudo-labels from past human corrections. Results show that CRAAC improves mAP and AR by 40-50% over the original Mask R-CNN and reduces human effort by 5-11%. Adopting only the proposed active learning module matches the performance of the setup that also adopts consistency regularisation at the beginning, but loses precision and recall by 5-18% towards the end of training. Automatic correction yields some gain in recall and generally attenuates the fluctuation in performance. The ability to estimate scores from labelled and unlabelled images enables annotators to better understand the labelling uncertainty in detail and devise a more suitable annotation strategy. The CRAAC solution overall reduces manual labour in annotating images, addresses the needs in creating and correcting pseudo-labels and generates interpretable scores for annotation. The solution can be further polished to become a data annotation tool for labelling domain-specific datasets.
Acknowledgement

The dataset is provided courtesy of the National Highways, with partial manual annotation assisted by Mr. Runqi Chen. The author (P Lam) is funded by the UK Engineering and Physical Sciences Research Council (EPSRC) Centre for Doctoral Training in Future Infrastructure and Built Environment: Resilience in a Changing World (FIBE2) [grant number EP/S02302X/1] and sponsored by the National Highways, Costain and Trimble Solutions. This work is supported by the Digital Roads, UK EPSRC [grant number EP/V056441/1].

References

[1] Amazon Web Services. Amazon SageMaker Data Labeling: Create high-quality datasets for training machine learning models, 7 2023.
[2] Deeksha Arya, Hiroya Maeda, Sanjay Kumar Ghosh, Durga Toshniwal, Alexander Mraz, Takehiro Kashiyama, and Yoshihide Sekimoto. Deep learning-based road damage detection and classification for multiple countries. Automation in Construction, 132, 12 2021.
[3] Jordan T. Ash, Chicheng Zhang, Akshay Krishnamurthy, John Langford, and Alekh Agarwal. Deep Batch Active Learning by Diverse, Uncertain Gradient Lower Bounds. In 8th International Conference on Learning Representations, 6 2019.
[4] Hritam Basak and Zhaozheng Yin. Pseudo-label Guided Contrastive Learning for Semi-supervised Medical Image Segmentation. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19786–19797, 2023.
[5] Yongxing Dai, Jun Liu, Yan Bai, Zekun Tong, and Ling Yu Duan. Dual-Refinement: Joint Label and Feature Refinement for Unsupervised Domain Adaptive Person Re-Identification. IEEE Transactions on Image Processing, 30:7815–7829, 2021.
[6] Edwin S. Dalmaijer, Camilla L. Nord, and Duncan E. Astle. Statistical power for cluster analysis. BMC Bioinformatics, 23(1), 12 2022.
[7] Ismail Elezi, Zhiding Yu, Anima Anandkumar, Laura Leal-Taixé, and Jose M Alvarez. Not All Labels Are Equal: Rationalizing The Labeling Costs for Training Object Detection.
In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14472–14481, New Orleans, LA, USA, 2022.
[8] Kaiming He, Georgia Gkioxari, Piotr Dollar, and Ross Girshick. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, volume 2017-October, pages 2980–2988. Institute of Electrical and Electronics Engineers Inc., 12 2017.
[9] Jisoo Jeong, Seungeui Lee, Jeesoo Kim, and Nojun Kwak. Consistency-based Semi-supervised Learning for Object Detection. In Conference on Neural Information Processing Systems, Vancouver, Canada, 2019.
[10] Qiuye Jin, Mingzhi Yuan, Qin Qiao, and Zhijian Song. One-shot active learning for image segmentation via contrastive learning and diversity-based sampling. Knowledge-Based Systems, 241, 4 2022.
[11] Adrian Krenzer, Kevin Makowski, Amar Hekalo, Daniel Fitting, Joel Troya, Wolfram G. Zoller, Alexander Hann, and Frank Puppe. Fast machine learning annotation in the medical domain: a semi-automated video annotation tool for gastroenterologists. BioMedical Engineering Online, 21(1), 12 2022.
[12] Percy Lam, Weiwei Chen, Lavindra De Silva, and Ioannis Brilakis. Correcting Road Image Annotations. In Apollo - University of Cambridge Repository, 2024.
[13] Yuan-Hong Liao, Amlan Kar, and Sanja Fidler. Towards Good Practices for Efficiently Annotating Large-Scale Image Classification Datasets. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 6 2021. IEEE.
[14] Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Chunyuan Li, Jianwei Yang, Hang Su, Jun Zhu, and Lei Zhang. Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection. arXiv preprint, 3 2023.
[15] Normaisharah Mamat, Mohd Fauzi Othman, Rawad Abdulghafor, Ali A. Alwan, and Yonis Gulzar. Enhancing Image Annotation Technique of Fruit Classification Using a Deep Learning Approach. Sustainability (Switzerland), 15(2), 1 2023.
[16] Islam Nassar, Munawar Hayat, Ehsan Abbasnejad, Hamid Rezatofighi, and Gholamreza Haffari. PROTOCON: Pseudo-label Refinement via Online Clustering and Prototypical Consistency for Efficient Semi-supervised Learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 11641–11650, 6 2023.
[17] National Highways. Connecting the Country: Our Long Term Strategic Plan to 2050. Technical report, National Highways, Guildford, UK, 5 2023.
[18] National Highways. New plan maps our vision for the future, 5 2023.
[19] Yassine Ouali, Céline Hudelot, and Myriam Tami. An Overview of Deep Semi-Supervised Learning. 6 2020.
[20] Vung Pham, Du Nguyen, and Christopher Donan. Road Damages Detection and Classification with YOLOv7. 10 2022.
[21] Alberto Rizzoli. 13 Best Image Annotation Tools of 2023 [Reviewed], 2023.
[22] Roboflow Inc. Roboflow Annotate: Quickly Label Training Data and Export To Any Format, 2023.
[23] Peter J Rousseeuw. Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. Technical report, 1987.
[24] Anthony Scalabrino. Image Annotation for Computer Vision, 2024.
[25] Ketan Rajshekhar Shahapure and Charles Nicholas. Cluster quality analysis using silhouette score. In Proceedings - 2020 IEEE 7th International Conference on Data Science and Advanced Analytics, DSAA 2020, pages 747–748. Institute of Electrical and Electronics Engineers Inc., 10 2020.
[26] Piotr Skalski and James Gallagher. YOLO-World: Real-Time, Zero-Shot Object Detection, 2 2024.
[27] Abraham George Smith, Eusun Han, Jens Petersen, Niels Alvin Faircloth Olsen, Christian Giese, Miriam Athmann, Dorte Bodin Dresbøll, and Kristian Thorup-Kristensen. RootPainter: deep learning segmentation of biological images with corrective annotation. New Phytologist, 236(2):774–791, 10 2022.
[28] Trimble Inc. Trimble MX9 Mobile Mapping Solution. Technical report, USA, 2022.
[29] Leo Ueno and Trevor Lynn. PaliGemma: An Open Multimodal Model by Google, 5 2024.
[30] V7 Labs. Auto Annotation: Auto-Annotate Complex Objects 10x Faster, 2023.
[31] Chien-Yao Wang, Alexey Bochkovskiy, and Hong-Yuan Mark Liao. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. 7 2022.
[32] Jun Wang, Shaoguo Wen, Kaixing Chen, Jianghua Yu, Xin Zhou, Peng Gao, Changsheng Li, and Guotong Xie. Semi-supervised Active Learning for Instance Segmentation via Scoring Predictions. In British Machine Vision Virtual Conference, Virtual, 12 2020.
[33] Shanshan Wang, Cheng Li, Rongpin Wang, Zaiyi Liu, Meiyun Wang, Hongna Tan, Yaping Wu, Xinfeng Liu, Hui Sun, Rui Yang, Xin Liu, Jie Chen, Huihui Zhou, Ismail Ben Ayed, and Hairong Zheng. Annotation-efficient deep learning for automatic medical image segmentation. Nature Communications, 12(1), 12 2021.
[34] Weihan Wang, Qingsong Lv, Wenmeng Yu, Wenyi Hong, Ji Qi, Yan Wang, Junhui Ji, Zhuoyi Yang, Lei Zhao, Xixuan Song, Jiazheng Xu, Bin Xu, Juanzi Li, Yuxiao Dong, Ming Ding, and Jie Tang. CogVLM: Visual Expert for Pretrained Language Models. 11 2023.
[35] Haoyan Wu, Shikui Wei, Chuangchuang Tan, and Yao Zhao. Pseudo-label Correction from Pixel to Image. In CTISC 2022 - 2022 4th International Conference on Advances in Computer Technology, Information Science and Communications. Institute of Electrical and Electronics Engineers Inc., 2022.
[36] Jiaxi Wu, Jiaxin Chen, and Di Huang. Entropy-based Active Learning for Object Detection with Progressive Diversity Constraint. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, volume 2022-June, pages 9387–9396. IEEE Computer Society, 2022.
[37] Jiarui Xu, Shalini De Mello, Sifei Liu, Wonmin Byeon, Thomas Breuel, Jan Kautz, and Xiaolong Wang. GroupViT: Semantic Segmentation Emerges from Text Supervision.
In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18113–18123, New Orleans, LA, USA, 6 2022. IEEE.
[38] Cancan Yi, Jun Liu, Tao Huang, Han Xiao, and Hui Guan. An efficient method of pavement distress detection based on improved YOLOv7. Measurement Science and Technology, 34(11), 11 2023.
[39] Donggeun Yoo and In So Kweon. Learning Loss for Active Learning. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 2019. IEEE.
[40] Weiping Yu, Sijie Zhu, Taojiannan Yang, and Chen Chen. Consistency-based Active Learning for Object Detection. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, volume 2022-June, pages 3950–3959. IEEE Computer Society, 2022.
[41] Zhi Hua Zhou. A brief introduction to weakly supervised learning, 1 2018.