Evaluating Artificial Intelligence in Breast Cancer Screening

Dr Sarah Elizabeth Hickman
Department of Radiology
University of Cambridge

This dissertation is submitted for the degree of Doctor of Philosophy

Clare College
June 2022

Declaration

This dissertation is the result of my own work and includes nothing which is the outcome of work done in collaboration except as declared in the preface and specified in the text.

It is not substantially the same as any that I have submitted, or, is being concurrently submitted for a degree or diploma or other qualification at the University of Cambridge or any other University or similar institution except as declared in the preface and specified in the text. I further state that no substantial part of my dissertation has already been submitted, or, is being concurrently submitted for any such degree, diploma or other qualification at the University of Cambridge or any other University or similar institution except as declared in the preface and specified in the text.

It does not exceed the prescribed word limit for the Degree Committee for Clinical Medicine and Veterinary Medicine.

Sarah Elizabeth Hickman
June 2022

Abstract

Evaluating Artificial Intelligence in Breast Cancer Screening
Dr Sarah Elizabeth Hickman

This thesis evaluates the application and performance of artificial intelligence (AI) in breast cancer screening. Breast cancer screening is conducted on a population scale using mammographic imaging for the earlier detection of breast cancer and has been shown to reduce mortality. A shortage of trained radiologists, together with the demands of double reading, means that an approach to alleviate pressures within the breast screening workflow is sought. In addition, interval cancers occur at an estimated rate of 3.7 per 1000 women screened in the UK, so methods to improve the sensitivity of screening and detect cancers earlier are also needed.
Over the past decade, AI algorithms have demonstrated performance comparable to that of human readers and could provide a method for an adapted screening workflow to improve both the efficiency and efficacy of screening. However, the 2021 National Screening Committee (NSC) report concluded that there was insufficient evidence to support the adoption of AI into the UK breast screening programme. This thesis aims to fill the gaps in evidence highlighted in the NSC report for the performance of AI algorithms within a UK breast cancer screening population, as well as to explore the various potential workflow deployment approaches of AI in the screening programme.

I start by conducting a systematic review and meta-analysis of the current literature investigating the performance of stand-alone AI applications in breast cancer screening for detection and diagnosis as well as triage approaches. I then describe the creation of a large-scale independent medical imaging database which is used in the studies throughout this thesis. The remainder of the thesis describes the results of three retrospective studies evaluating three different commercial AI algorithms. The first study assesses the ability of AI to detect interval cancers at the previous screen. The second study investigates the performance of AI as a stand-alone screen reader. The third study evaluates the proportion of cases identified for both high sensitivity rule-out and high specificity rule-in triage, as well as the proportion of cancers missed at these thresholds. Overall, the results of this thesis will inform discussions around the use of AI in the UK breast screening programme as well as the design of future prospective trials.

Acknowledgements

Firstly, a special thank you to my PhD supervisor Professor Fiona Gilbert. The support, encouragement and guidance you have provided over the course of my PhD have been invaluable.
It has been a pleasure to be supervised by such an eminent researcher in the field and your expertise has been pivotal to the success of this project. These past three years have given me knowledge, skills and confidence that will be useful for the rest of my life.

To Richard Black and Dr Nicholas Payne, I continue to be in awe of your endless knowledge of computer-related systems. Your help has been immense and I am grateful for everything you have taught me as well as all your contributions to this research.

To Dr Yuan Huang, I am so grateful for all your statistical help in various projects over the course of my PhD, as well as your patience when having to repeatedly explain statistical tests to me.

To Dr Martin Graves, Dr Andrew Priest, Dr Josh Kaggie and Bahman Kasmai, thank you for your unwavering support and guidance.

I would like to thank Dr Ramona Woitek, Dr Iris Allajbeu, Dr Muzna Nanaa, Liana Hough, Dr Angelica I Aviles-Rivero, Dr Lorena Escudero Sánchez, and Sue Hudson for their contributions to various projects over the course of my PhD.

Thank you to the whole BRAID trial team, especially Jaimie Taylor: thank you for always being keen to discuss fantasy football or an NFL game.

To all the members of the Cambridge Cohort Database Access Committee: Esme Radin, Dr Arne Juette, Kathryn Taylor, Adam Loveday, thank you for your advice and insights.

A special thank you to Carolyn Read and Helen Street. I remember clearly sitting in both your offices three years ago trying to work out how to get the database approved; your guidance, proofreading of documents and general support through this process made this possible.

I would like to thank both the Cambridge and Norwich Breast Unit teams, particularly Lisa Tatham, Heather Couzens and Mandy Ballantyne, who have helped run numerous NBSS queries and KC62 reports for me.
To Dr Mary Kasanicki and Dr Mona Alexander, thank you for persevering with our collaboration contracts throughout this work and teaching me what a contract is.

To Dr Gaby Baxter and Dr Dimitri Kessler, thank you for always being there for snacks and a chat or to pull me back from that puzzled look on my face and bounce an idea off.

Sarah Perkins, thank you for helping me organise everything and keeping me on track.

Thank you to the NIHR Cambridge Biomedical Research Centre (BRC) PPI team, especially Dr Amanda Stranks, as well as the patients and public who have participated in our events.

To the companies who collaborated on this work, thank you for your generosity in making your algorithms available and for giving your time.

I would like to thank CRUK and the NIHR Cambridge BRC for funding my PhD.

Thank you to all the women whose de-identified data was used in this research. I hope this work will lead to improvements in breast cancer screening over time. It has been a privilege to work in this field and witness the incredible work the NHS Breast Screening Programme carries out.

To the Cambridge University Association Football Club second team (aka the Eagles), thank you for giving me the opportunity to play in such a wonderful and welcoming team, and three out of three varsity wins against Oxford is not bad!

To the absolute legends that are my closest friends, thank you for keeping me smiling the whole way.

To my kind and wonderful sister Katy, thank you for supporting me, listening to my ramblings and always knowing how to make me feel better. To my brother Rob, thank you for providing me with the two most adventurous cats to keep me entertained whilst writing.

Lastly, to my kind, generous and extraordinary parents, everything I have been able to achieve has been due to your love and support, thank you.

Publications

Publications arising from this thesis

S E Hickman, R Woitek, E P V Le et al. Machine learning algorithms for workflow applications in screening mammography: a systematic review and meta-analysis. Radiology. 2021. https://doi.org/10.1148/radiol.2021210391

S Hickman, G Baxter, F J Gilbert. Adoption of artificial intelligence in breast imaging: evaluation, ethical constraints and limitations. British Journal of Cancer. 2021. https://doi.org/10.1038/s41416-021-01333-w

Publications arising from work unrelated to this thesis

F J Gilbert, S E Hickman, G C Baxter et al. Opportunities in cancer imaging: risk-adapted breast imaging in screening. Clinical Radiology. 2021. https://doi.org/10.1016/j.crad.2021.02.013

I Allajbeu, S E Hickman, N Payne et al. Automated Breast Ultrasound: Technical Aspects, Impact on Breast Screening, and Future Perspectives. Breast Cancer Screening and Imaging. 2021. https://doi.org/10.1007/s12609-021-00423-1

I Dayan, H R Roth, A Zhong … S E Hickman et al. Federated learning for predicting clinical outcomes in patients with COVID-19. Nature Medicine. 2021. https://doi.org/10.1038/s41591-021-01506-3

E P V Le, Y Wang, Y Huang, S Hickman et al. Artificial Intelligence in Breast Imaging. Clinical Radiology. 2019. https://doi.org/10.1016/j.crad.2019.02.006

Presentations

Oral

S E Hickman, N R Payne, Y Huang et al. A benchmarking study to evaluate the performance of two artificial intelligence algorithms for interval cancer detection in a UK breast screening setting. Radiological Society of North America Conference, Chicago, USA, December 2021.

S Hickman, J G Mainprize, R Black et al. Mammographic case conspicuity, a comparison between a radiologist’s assessment and a Masking Index. European Congress of Radiology Conference, virtual, July 2020.

S Hickman, M Pantelidou, R Black et al. Masking Risk Index: an evaluation to guide supplemental imaging for breast screening. British Society of Breast Radiology Annual Scientific Conference, Bristol, UK, November 2019.
(Prize: best oral presentation) Poster S Hickman, S Hudson, N R Payne et al. The creation of a breast screening image database – The Cambridge Cohort – Mammography East Anglia Digital Imaging Archive (CC-MEDIA). British Society of Breast Radiology Annual Scientific Conference, virtual, November 2021. S Hickman, R A Woitek, Y R Im et al. Independent machine learning algorithms for workflow adaptation in breast screening mammography: a systematic review and meta-analysis. European Congress of Radiology, virtual, March 2021. 8 Key collaborator contributions Code for the analysis in the systematic review and meta-analysis was written by Dr Gaby Baxter. Code written by Sue Hudson, Dr Andrew Priest, Dr Nicholas Payne, and Dr Lorena Escudero Sánchez was used in the creation of the CC-MEDIA database. The set-up of the algorithm testing infrastructure at the University of Cambridge was made possible by work from Richard Black and Dr Nicholas Payne. Three commercial companies provided their artificial intelligence algorithms for analysis as part of this thesis. Dr Yuan Huang provided statical supervision for the analysis performed in Chapters 5-7. 9 Table of contents Declaration ........................................................................................................................ 2 Abstract............................................................................................................................. 3 Acknowledgements ........................................................................................................... 4 Publications ....................................................................................................................... 6 Presentations .................................................................................................................... 7 Key collaborator contributions ........................................................................................... 
8 Table of contents ............................................................................................................... 9 List of figures ................................................................................................................... 13 List of tables .................................................................................................................... 16 Commonly used abbreviations ......................................................................................... 19 Chapter 1 – Introduction .................................................................................................. 21 1.1 Breast cancer ...................................................................................................................... 21 1.1.1 Breast cancer overview ..................................................................................................................... 21 1.1.2 Breast cancer classification ............................................................................................................... 21 1.2 Breast cancer screening ....................................................................................................... 24 1.2.1 Breast cancer screening programmes ............................................................................................... 24 1.2.2 Mammography .................................................................................................................................. 26 1.2.3 Mammographic breast density ......................................................................................................... 28 1.2.4 Risk prediction ................................................................................................................................... 30 1.2.5 Interval cancers ................................................................................................................................. 
31 1.3 Artificial intelligence in breast cancer screening .................................................................. 31 1.3.1 Introduction to artificial intelligence ................................................................................................. 31 1.3.2 History of computer aided detection systems in breast cancer screening ....................................... 34 1.3.3 Deep Learning applications to breast cancer screening .................................................................... 35 1.4 Thesis aims and outline ....................................................................................................... 37 Chapter 2 – Adoption of artificial intelligence in breast imaging: evaluation, ethical constraints and limitations .............................................................................................. 40 2.1 Introduction ........................................................................................................................ 40 2.2 Evaluation of artificial intelligence in breast imaging ........................................................... 41 2.2.1 Retrospective evaluation ................................................................................................................... 41 2.2.2 Prospective evaluation ...................................................................................................................... 43 2.2.3 Key considerations for clinical evaluation ......................................................................................... 44 2.3 The breast imaging pathway and AI ..................................................................................... 45 2.3.1 Screening ........................................................................................................................................... 
45 2.3.2 Risk stratification ............................................................................................................................... 46 2.3.3 Monitoring and prognostication ....................................................................................................... 46 2.4 Ethical and legal constraints ................................................................................................ 47 10 2.4.1 Guidance level ................................................................................................................................... 47 2.4.2 Algorithm level .................................................................................................................................. 47 2.4.3 Who controls the data? ..................................................................................................................... 48 2.4.4 Clinical level ....................................................................................................................................... 49 2.5 Practical challenges and limitations ..................................................................................... 50 2.5.1 Technical level ................................................................................................................................... 50 2.5.2 Clinical level ....................................................................................................................................... 51 2.5.3 Governance level ............................................................................................................................... 51 2.6 Conclusion .......................................................................................................................... 
52 Chapter 3 – Machine learning for workflow applications in screening mammography: systematic review and meta-analysis .............................................................................. 53 3.1 Introduction ........................................................................................................................ 53 3.2 Materials and methods ....................................................................................................... 55 3.2.1 Literature search ............................................................................................................................... 55 3.2.2 Study selection .................................................................................................................................. 55 3.2.3 Data extraction .................................................................................................................................. 55 3.2.4 Meta-analysis .................................................................................................................................... 56 3.2.5 Quality assessment ............................................................................................................................ 56 3.2.6 Statistical analysis .............................................................................................................................. 56 3.3 Results ................................................................................................................................ 57 3.3.1 Statistical selection and data extraction ........................................................................................... 57 3.3.2 Quality assessment ............................................................................................................................ 
68 3.3.3 Statistical analysis .............................................................................................................................. 69 3.4 Discussion ........................................................................................................................... 70 3.4.1 Limitations ......................................................................................................................................... 72 3.5 Conclusion .......................................................................................................................... 72 Chapter 4 – Developing a mammographic imaging database – The Cambridge Cohort – Mammography East Anglia Digital Imaging Archive ........................................................ 74 4.1 Aims .................................................................................................................................... 74 4.2 Introduction ........................................................................................................................ 74 4.3 Methods ............................................................................................................................. 76 4.3.1 Database approval ............................................................................................................................. 76 4.3.2 Database governance ........................................................................................................................ 76 4.3.3 Patient and public involvement work ................................................................................................ 77 4.3.4 Database sites ................................................................................................................................... 80 4.3.5 Database creation ............................................................................................................................. 
81 4.4 Results ................................................................................................................................ 85 4.4.1 Database image content ................................................................................................................... 85 4.4.2 Database content - Interval cancers ................................................................................................. 86 4.4.3 Database content - Screen detected cancers ................................................................................... 88 4.4.4 Database content - Ethnicity ............................................................................................................ 89 4.4.5 Database content - Mammographic breast density ......................................................................... 91 4.4.6 Database content - Histopathological information .......................................................................... 92 4.5 Technical setup of an AI algorithm testing environment ...................................................... 92 4.6 Uses of the database ........................................................................................................... 92 4.7 Discussion ........................................................................................................................... 93 4.7.1 Overall discussion .............................................................................................................................. 93 11 4.7.2 Limitations ......................................................................................................................................... 93 4.8 Conclusion .......................................................................................................................... 
94 Chapter 5 – Performance of artificial intelligence algorithms for interval cancer detection ........................................................................................................................................ 95 5.1 Aims .................................................................................................................................... 95 5.2 Introduction ........................................................................................................................ 95 5.3 Methods ............................................................................................................................. 96 5.3.1 Sample size ........................................................................................................................................ 96 5.3.2 Data ................................................................................................................................................... 96 5.3.3 Ground truth ..................................................................................................................................... 98 5.3.4 AI tools ............................................................................................................................................... 98 5.3.5 Thresholds ......................................................................................................................................... 99 5.3.6 Statistical analysis ............................................................................................................................ 100 5.3.7 Reporting ......................................................................................................................................... 101 5.4 Results .............................................................................................................................. 
101 5.4.1 Data ................................................................................................................................................. 101 5.4.2 Algorithm results ............................................................................................................................. 103 5.4.3 Combined algorithm results ............................................................................................................ 107 5.4.4 Sub-group analysis ........................................................................................................................... 108 5.4.5 Failure analysis ................................................................................................................................ 111 5.5 Discussion ......................................................................................................................... 112 5.6 Conclusion ........................................................................................................................ 114 Chapter 6 – Performance of stand-alone deep learning algorithms in a UK screening cohort for detection and diagnosis ........................................................................................... 116 6.1 Aims .................................................................................................................................. 116 6.2 Introduction ...................................................................................................................... 116 6.3 Methods ........................................................................................................................... 117 6.3.1 Sample size ...................................................................................................................................... 
117 6.3.2 Data ................................................................................................................................................. 117 6.3.3 Ground truth ................................................................................................................................... 119 6.3.4 AI tools ............................................................................................................................................. 120 6.3.5 Thresholds ....................................................................................................................................... 120 6.3.6 Statistical analysis ............................................................................................................................ 122 6.3.7 Reporting ......................................................................................................................................... 123 6.4 Results .............................................................................................................................. 123 6.4.1 Data ................................................................................................................................................. 123 6.4.2 Algorithm results ............................................................................................................................. 126 6.4.3 Scenario D 99.0% specificity auto recall threshold .......................................................................... 131 6.4.4 Combined algorithm results ............................................................................................................ 132 6.4.5 Sub-group analysis ........................................................................................................................... 134 6.4.6 Failure analysis ................................................................................................................................ 
6.5 Discussion
6.5.1 Overall performance
6.5.2 Further analysis
6.5.3 Limitations
6.5.4 Future work
6.6 Conclusion
Chapter 7 – Performance of stand-alone artificial intelligence algorithms in a UK screening cohort for high sensitivity and high specificity triage
7.1 Aims
7.2 Introduction
7.3 Methods
7.3.1 Data
7.3.2 Ground truth
7.3.3 AI tools
7.3.4 Thresholds
7.3.5 Statistical analysis
7.3.6 Reporting
7.4 Results
7.4.1 Data
7.4.2 Rule-out triage – Threshold 1 and 2
7.4.3 Rule-out triage – Threshold 3 and 4
7.4.4 Rule-in triage
7.4.5 Combined approach
7.4.6 Sub-group analysis
7.4.7 Failure analysis
7.5 Discussion
7.5.1 Overall performance
7.5.2 Further analysis
7.5.3 Limitations
7.6 Conclusion
Chapter 8 – Contributions, Future Work and Conclusions
8.1 Contributions to knowledge
8.2 Future work
8.2.1 AI in the NHS
8.2.2 Retrospective studies
8.2.3 Prospective studies
8.2.4 Future work - AI research questions
8.3 Conclusions
References
List of figures
Figure 1-1 – Breast cancer anatomy
Figure 1-2 – Production of x-rays diagram
Figure 1-3 – Example of a two-view full field digital mammogram (FFDM)
Figure 1-4 – Examples of breast imaging-reporting and data system (BI-RADS) 5th edition mammographic breast density categories
Figure 1-5 – Artificial intelligence (AI) hierarchy of terms
Figure 1-6 – Overview of the architecture of a Convolutional Neural Network (CNN)
Figure 1-7 – Application of Deep Learning (DL) Computer Aided Detection (CAD) algorithms to breast cancer screening
Figure 2-1 – Broad and narrow artificial intelligence (AI) applications to breast imaging
Figure 3-1 – Multi-time (left) and multi-view (right) point data that are produced by 2D standard-view mammography and can be analysed at different levels
Figure 3-2 – Preferred Reporting Items for Systematic Reviews and Meta-analysis for Diagnostic Test Accuracy (PRISMA-DTA) flow diagram
Figure 3-3 – (a) Prediction model Risk Of Bias ASsessment Tool (PROBAST) and (b) Quality Assessment of Diagnostic Accuracy Studies-2 (QUADAS-2) assessment
Figure 3-4 – Checklist for Artificial Intelligence in Medical Imaging (CLAIM) assessment
Figure 3-5 – Summary Receiver Operating Characteristic (ROC) curves
Figure 4-1 – Cambridge Science Festival Event questions
Figure 4-2 – National patient survey question regarding acceptability of data fields
Figure 4-3 – National patient survey questions regarding commercial involvement
Figure 4-4 – Data flow of the CC-MEDIA data collection
Figure 4-5 – Nomenclature of case de-identification within the CC-MEDIA database
Figure 4-6 – Timeline of mammography data changes over time at Cambridge and Norwich National Health Service Breast Screening Programme (NHSBSP) sites
Figure 4-7 – Time to diagnosis (months) for interval cancers (IC) at a) Cambridge and b) Norwich
Figure 4-8 – Ethnicity data distribution at Cambridge using National Breast Screening System (NBSS) and Electronic Health Record (EHR) EPIC data
Figure 4-9 – Breast imaging-reporting and data system (BI-RADS) 5th edition mammographic density distribution for cases in one year (2017) of data at Cambridge with both raw and processed four views mammograms available [n = 18246]
Figure 5-1 – Standards for Reporting of Diagnostic Accuracy Studies (STARD) flow diagram of cases included and excluded in this study
Figure 5-2 – Example of cases included in the study
Figure 5-3 – Proposed workflow image for testing the artificial intelligence (AI) systems as stand-alone readers for interval cancer (IC) detection
Figure 5-4 – Receiver operating characteristic (ROC) curves for all three artificial intelligence (AI) algorithms at each site
Figure 5-5 – Cambridge data testing density plots for each artificial intelligence (AI) algorithm
Figure 5-6 – Norwich data testing density plots for each artificial intelligence (AI) algorithm
Figure 5-7 – Combined model receiver operating characteristic (ROC) curve on Cambridge data compared to individual artificial intelligence (AI) algorithms (DL-1, DL-2, DL-3) performance
Figure 5-8 – Combined model receiver operating characteristic (ROC) curve on Norwich data compared to individual artificial intelligence (AI) algorithms (DL-1, DL-2, DL-3) performance
Figure 5-9 – Proportional Euler diagram of each artificial intelligence (AI) algorithms interval cancer (IC) detection
Figure 5-10 – False negative case, which was not detected by all three commercial artificial intelligence (AI) algorithms
Figure 5-11 – True positive case, which was detected by all three commercial artificial intelligence (AI) algorithms
Figure 6-1 – Mediolateral oblique (MLO) views of mammogram artefacts removed from the study
Figure 6-2 – Standards for Reporting of Diagnostic Accuracy Studies (STARD) flow diagram of cases included and excluded in this study
Figure 6-3 – Proposed workflow deployment of a stand-alone computer aided detection and diagnosis (CADe+x) artificial intelligence (AI) algorithm
Figure 6-4 – Receiver operating characteristic (ROC) curves per artificial intelligence (AI) algorithm
Figure 6-5 – Receiver operating characteristic (ROC) curves per site
Figure 6-6 – Precision recall curves (PRC)
Figure 6-7 – Individual artificial intelligence (AI) algorithm score distributions normalised from 0-10
Figure 6-8 – Combined model receiver operating characteristic (ROC) curves on Cambridge data
Figure 6-9 – Combined model receiver operating characteristic (ROC) curves on Norwich data
Figure 6-10 – Venn diagram – not proportional
Figure 6-11 – Missing case analysis, case missed by both artificial intelligence (AI) and human readers
Figure 6-12 – Missing case analysis, case missed by all human readers and detected by all artificial intelligence (AI) algorithms
Figure 6-13 – Missing case analysis, case missed by all artificial intelligence (AI) algorithms
Figure 7-1 – Standards for Reporting of Diagnostic Accuracy Studies (STARD) flow diagram of cases included and excluded in this study
Figure 7-2 – Cancer outcomes for study cohort
Figure 7-3 – Proposed workflow deployment approaches for stand-alone artificial intelligence (AI) systems as triage tools
Figure 7-4 – Receiver operating characteristic (ROC) curves for screen detected cancers (SDCs) as cases
Figure 7-5 – Receiver operating characteristic (ROC) curves for screen detected cancers (SDCs), next round cancers (NRCs) and interval cancers (ICs) as cases
Figure 7-6 – Plots for rule out triage thresholds
Figure 7-7 – Plots for rule in triage thresholds – Screen detected cancers (SDCs), next round cancers (NRCs) and interval cancers (ICs)
Figure 7-8 – Violin plots for the combined approach of Scenario C and E for both rule in and rule out triage by an artificial intelligence (AI) algorithm
Figure 7-9 – Partial receiver characteristic (pROC) curves
Figure 7-10 – Venn diagram – not proportional, for screen detected cancers (SDCs) missed at threshold 1, Scenario B
Figure 7-11 – Missing case analysis, case missed by artificial intelligence (AI)
Figure 7-12 – Venn diagram – not proportional, for a) interval cancers (ICs) and b) next round cancers (NRCs) detected at the 94.0% specificity threshold Scenario E

List of tables
Table 1-1 – Breast cancer molecular subtypes classification
Table 1-2 – Breast cancer screening programmes and committee recommendations
Table 2-1 – Datasets publicly and privately available for breast imaging
Table 2-2 – Prospective studies for the use of artificial intelligence (AI) in breast imaging
Table 2-3 – Reporting criteria adapted for artificial intelligence (AI) studies
Table 3-1 – Computer aided triage (CADt) algorithm details and results. Algorithm performance compared to reader performance for all included studies
Table 3-2 – Computer aided triage (CADt) test set data characteristics of all included studies
Table 3-3 – Computer aided detection (CADe) and Computer aided diagnosis (CADx) algorithm details and results. Algorithm performance compared to reader performance for all included studies
Table 3-4 – Computer aided detection (CADe) and Computer aided diagnosis (CADx) test set data characteristics of all included studies
Table 4-1 – Mammographic imaging database characteristics
Table 4-2 – Number of exams per site available with images currently held in the CC-MEDIA database
Table 4-3 – Cambridge and Norwich CC-MEDIA database 2011-2020 compared to the KC62 report at both sites
Table 4-4 – Interval cancers (ICs) at Cambridge and Norwich with imaging data 2011-2020 in CC-MEDIA
Table 4-5 – Screen detected cancers (SDCs) at Cambridge and Norwich with imaging data 2011-2020 in CC-MEDIA
Table 4-6 – Ethnicity information from National Breast Screening System (NBSS) and Electronic Health Record (EHR) EPIC data at Cambridge
Table 5-1 – Artificial intelligence (AI) algorithm characteristics
Table 5-2 – Summary of testing dataset characteristics
Table 5-3 – Interval cancer (IC) characteristics by case
Table 5-4 – Interval cancer (IC) characteristics by lesions
Table 5-5 – Cambridge data testing of three artificial intelligence (AI) algorithms
Table 5-6 – Norwich data testing of three artificial intelligence (AI) algorithms
Table 5-7 – Subgroup analysis of cases using all interval cancer (IC) data from both Cambridge and Norwich sites
Table 5-8 – Subgroup analysis of lesions using all interval cancer (IC) data from both Cambridge and Norwich sites
Table 6-1 – Summary of testing dataset characteristics
Table 6-2 – Cancer characteristics by lesions and cases
Table 6-3 – Interval cancer (IC) characteristics by lesions and cases
Table 6-4 – Stand-alone artificial intelligence (AI) algorithm application compared to the single first reader – threshold 1
Table 6-5 – Stand-alone artificial intelligence (AI) algorithm application compared to the single first reader – threshold 2
Table 6-6 – Artificial intelligence (AI) algorithm (at threshold 2) combined with the single first reader (+/- arbitration where discordance) compared to double reading performance
Table 6-7 – Perturbation analysis when adjusting the specificity threshold for the artificial intelligence (AI) algorithm, then combining with the first reader and final action arbitration decision if there is discordance
Table 6-8 – Artificial intelligence (AI) algorithm (at threshold 2) combined with the single first reader (+/- arbitration where discordance below 99.0% specificity for the algorithm and above 96.6% specificity) with cases auto recalled above the 99.0% specificity threshold (threshold 3) compared to double reading performance
Table 6-9 – DeLong’s test comparison results for DL-1, DL-2, DL-3 compared to the Combined model performance on Cambridge data
Table 6-10 – DeLong’s test comparison results for DL-1, DL-2, DL-3 compared to the Combined model performance on Norwich data
Table 6-11 – Sub group analysis of DL-1, DL-2, DL-3 set at the first reader specificity threshold of 96.6% (threshold 1) for screen detected cancers (SDCs)
Table 6-12 – Sub group analysis of DL-1, DL-2, DL-3 set at the first reader specificity threshold of 96.6% (threshold 1) for interval cancers (IC)
Table 6-13 – Sub group analysis of DL-1, DL-2, DL-3 set at the first reader specificity threshold of 96.6% (threshold 1) for interval cancer (IC) specific categories
Table 7-1 – Summary of testing dataset characteristics
Table 7-2 – Screen detected (SDC) and next round cancer (NRC) characteristics by lesions and cases
Table 7-3 – Interval cancer (IC) characteristics by lesions and cases
Table 7-4 – Double and single first reader performance at both Cambridge and Norwich
Table 7-5 – Results at 1) 99.0% sensitivity threshold 1 and 2) 99.9% sensitivity threshold 2
Table 7-6 – Results for DL-1, DL-2 and DL-3 at the 99.0% sensitivity (threshold 1) Scenario B
Table 7-7 – Results for DL-1, DL-2 and DL-3 at the 99.9% sensitivity (threshold 2) Scenario B
Table 7-8 – Results for DL-1, DL-2 and DL-3 at the 99.0% sensitivity (threshold 1) Scenario C
Table 7-9 – Results at 1) 85.0% sensitivity (threshold 3) and 2) results at 70.0% specificity (threshold 4)
Table 7-10 – Results for DL-1, DL-2 and DL-3 at the 85.0% sensitivity (threshold 3) Scenario C
Table 7-11 – Results for DL-1, DL-2 and DL-3 at the 70.0% specificity (threshold 4) Scenario C
Table 7-12 – Scenario D perturbations of specificity with screen detected cancers (SDCs), interval cancers (ICs) and next round cancers (NRCs) as cases
Table 7-13 – Scenario D perturbations of specificity with screen detected cancers (SDCs), interval cancers (ICs) and next round cancers (NRCs) as cases – additional cancers detected
Table 7-14 – Scenario E perturbations of specificity with screen detected cancers (SDCs), interval cancers (ICs) and next round cancers (NRCs) as cases
Table 7-15 – Scenario E perturbations of specificity with screen detected cancers (SDCs), interval cancers (ICs) and next round cancers (NRCs) as cases – additional cancers detected
Table 7-16 – Combined approach of Scenario C and E for both rule in and rule out triage by an artificial intelligence (AI) algorithm
Table 7-17 – Partial area under the receiver operator characteristic (pAUROC) curve results
Table 7-18 – Sub group analysis of DL-1, DL-2, DL-3 set at the threshold of 99.0% sensitivity (threshold 1) using Scenario B for the screen detected cancers (SDCs) missed
Table 7-19 – Sub group analysis of DL-1, DL-2, DL-3 set at 96.0% specificity threshold, using Scenario E for the interval cancers (ICs) detected
Table 7-20 – Sub group analysis of DL-1, DL-2, DL-3 set at 96.0% specificity threshold, using Scenario E for the next round cancers (NRCs) detected

Commonly used abbreviations
AI: Artificial Intelligence
AUROC: Area Under the Receiver Operating Characteristic Curve
AUPRC: Area Under the Precision Recall Curve
BI-RADS: Breast Imaging-Reporting and Data System
BSP: Breast Screening Programme
BRAID: Breast Screening – Risk Adaptive Imaging for Density
CC: Craniocaudal
CADe: Computer Aided Detection
CADx: Computer Aided Diagnosis
CADt: Computer Aided Triage
CC-MEDIA: The Cambridge Cohort – Mammography East Anglia Digital Imaging Archive
CLAIM: Checklist for Artificial Intelligence in Medical Imaging
CNN: Convolutional Neural Network
CI: Confidence Interval
DAC: Database Access Committee
DBT: Digital Breast Tomosynthesis
DICOM: Digital Imaging and Communications in Medicine
DL: Deep Learning
EHR: Electronic Health Records
FFDM: Full Field Digital Mammography
FRC: Future Round Cancer
IC: Interval Cancer
MRI: Magnetic Resonance Imaging
ML: Machine Learning
MLO: Mediolateral Oblique
NHS: National Health Service
NRC: Next Round Cancer
NRIC: Next Round Interval Cancer
NSC: National Screening Committee
OMI-DB: The Optimam Mammography Image Database
pAUC: Partial Area Under the Receiver Operating Characteristic Curve
PACS: Picture Archiving Communication Systems
PPI: Patient and Public Involvement
PROBAST: Prediction model Risk Of Bias ASsessment Tool
QUADAS-2: Quality Assessment of Diagnostic Accuracy Studies 2
ROC: Receiver Operating Characteristics
TC: Tyrer-Cuzick
SDC: Screen Detected Cancer
STARD: Standards for Reporting Diagnostic Accuracy Studies
VBD: Volumetric Breast Density

Chapter 1 – Introduction

1.1 Breast cancer

1.1.1 Breast cancer overview

Breast cancer is the most common malignancy diagnosed in women, with 2.3 million new diagnoses globally each year [1]. Approximately 1 in every 8 women will be diagnosed with breast cancer in their lifetime, such that it accounts for 15.2% of all new cancers diagnosed in the UK, with 45,000 women diagnosed each year [2,3]. It is the leading cause of cancer-related death amongst women as well as the fifth leading cause of cancer death worldwide, equating to 685,000 deaths in 2020 [1,3–5]. Risk factors for the development of breast cancer include: female sex, age, lifestyle (e.g. alcohol and smoking), family history, genetic mutations (e.g. BReast CAncer gene (BRCA)), increased breast density, history of breast disease, hormone exposure and expression, and radiation exposure [1,6,7].

Breast cancer can be detected through two routes. The first is through symptomatic presentation, where a woman presents with a painless lump, skin changes or nipple discharge [1]. The second route is asymptomatic detection as part of a screening programme using imaging, most commonly mammography, leading to the earlier detection of cancer before the onset of symptoms [3].

Breast cancer is a heterogeneous disease with a diverse range of morphological imaging features as well as biological and molecular tumour sub-types [8]. The complexity of this disease means a targeted response at each stage of the care pathway, in which imaging has a key role, is required to achieve the best outcomes. Survival rates continue to improve with advances in screening, imaging techniques for diagnosis and monitoring, as well as the development of novel and targeted therapies for treatment [9]. The five-year survival rate is around 85.0% in the UK; however, it is only 26.2% for women diagnosed with stage four disease [10,11].
The early detection of breast cancer, through methods such as mammographic screening, is proven to reduce both morbidity and mortality [1,7,12,13].

1.1.2 Breast cancer classification

Breast cancer can be characterised by histopathological type, grade, immunohistochemical profile and gene expression, as well as anatomical extent / staging [3,7]. Correct classification allows for the prediction of response to treatment as well as overall prognostication of survival, using tools such as Adjuvant online, PREDICT and the Nottingham prognostic index. This allows for the targeted selection of treatment, which can include radiotherapy, chemotherapy, hormone therapy, targeted biological therapy, and surgery [3,14].

Most breast cancers arise from the epithelial lining of the terminal ductal lobular unit (TDLU) and are invasive cancers, either invasive ductal carcinoma or invasive lobular carcinoma (Figure 1-1). Invasive carcinomas extend beyond the basement membrane into the surrounding tissues and can potentially metastasise. Invasive ductal carcinoma (IDC), otherwise known as invasive carcinoma of no special type (NST), is the most common type of breast cancer (70-80%), whereas 5-15% of breast cancers are invasive lobular carcinomas [14–17]. The precursor to invasive carcinoma is non-invasive carcinoma, where the cancer cells have not spread beyond the originating structure and remain ‘in-situ’ [3,6].

Figure 1-1 – Breast cancer anatomy. Showing the different anatomical and histological structures, as well as the types of cancers that arise from these structures. DCIS: Ductal carcinoma in situ, LCIS: Lobular carcinoma in situ, TDLU: Terminal ductal lobular unit. Adapted from Harbeck et al [3] and Feng et al [6].

Histological type classification is made using the tumour cell type, architectural features and the immunohistochemical profile [14].
The World Health Organization (WHO) classification of tumours series fifth edition, published in 2019, divides breast cancers into the following two main histological type categories: invasive breast carcinoma of NST and invasive breast carcinomas of special type [18,19]. Invasive breast carcinoma of NST includes: pleomorphic, oncocytic, lipid-rich, glycogen-rich clear cell, sebaceous, osteoclast-like stromal giant cells, or carcinomas with choriocarcinomatous or melanotic patterns, as well as medullary features, which are considered tumour-infiltrating lymphocyte rich invasive breast carcinoma of NST (TIL-rich IBC-NST). Invasive breast carcinomas of special type include: lobular, tubular, cribriform, metaplastic, apocrine, mucinous, papillary, and micropapillary [18,20]. Tumours can also be of mixed type, such that they contain multiple subtypes, and the proportion of each should be reported [16]. Histological type alone does not provide enough information regarding the true heterogeneity seen in breast cancers, and thus the additional classification categories are used for prognostication to determine treatments and predict survival.

Tumour grade provides information regarding the degree of differentiation of the tumour cells from normal breast epithelial cells [17]. The histological grading system used by breast pathologists is the Nottingham Grading System (NGS), which was developed by Elston and Ellis and modified from the Scarff-Bloom-Richardson grading system [14,21]. The NGS assesses three components of tumour morphology (in invasive cancers only): tubule formation, nuclear pleomorphism, and mitotic count [17]. Each component is scored from 1-3, and adding these scores together gives the total score, which relates to the overall grade (grade 1 = scores 3-5, grade 2 = scores 6-7, and grade 3 = scores 8-9) [22]. Grade 3 tumours are often larger and grow more rapidly, leading to a worse prognosis [17].

The St. Gallen International Expert Consensus molecular subtype definitions classify breast cancer into five categories based on the results of immunohistochemistry for the following markers: oestrogen receptor (ER), progesterone receptor (PR), human epidermal growth factor receptor 2 (HER2 / ERBB2), as well as Ki-67, which is a proliferation marker protein [3,23–26]. The five categories are outlined in Table 1-1. Luminal A is the most common subtype and has the best prognosis out of the five categories.

| Subtype | ER | PR | HER2 | Ki-67 | Prognosis |
| --- | --- | --- | --- | --- | --- |
| Luminal A | + | + | - | Low | Good |
| Luminal B (HER2 -) | + | + | - | High | Intermediate |
| Luminal B (HER2 +) | + | + | + | High | Poor |
| Triple negative (Basal) | - | - | - | High | Poor |
| HER2-enriched | - | - | + | Moderate-High | Poor |
| Normal-like | + | + | - | Low | Intermediate |

Table 1-1 – Breast cancer molecular subtypes classification. +: Positive, -: Negative. Adapted from Dai et al [23], Harbeck et al [3] and Feng et al [6].

An anatomical staging classification system published by the American Joint Committee on Cancer (AJCC) uses the extent of the primary tumour (Tis to T4), regional lymph nodes (N0 to N3) and metastases (M0 or M1), resulting in a TNM status which relates to the five stage categories (0-IV) [27]. The eighth edition of the AJCC classification system, published in 2017, incorporated changes to account for the prognostic stage, which includes: tumour grade using the NGS, biomarkers, and multigene panels (e.g. Oncotype DX) [27–29]. The incorporation of this prognostic information can result in a ‘stage migration’, such that triple negative cancers (ER -, PR -, HER2 -) or grade 3 cancers would be ‘upstaged’ and HER2 positive or grade 1 cancers would be ‘downstaged’. Future classification, using techniques such as next generation sequencing, will allow for the identification of additional breast cancer sub-types to further tailor treatment approaches [30].
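As an illustration only, the Nottingham Grading System arithmetic described in section 1.1.2 can be sketched in Python. The function name and argument names are assumptions made for this sketch and do not come from any pathology software; the score-to-grade cut-offs are those quoted in the text (grade 1 = scores 3-5, grade 2 = scores 6-7, grade 3 = scores 8-9).

```python
def nottingham_grade(tubule_formation: int,
                     nuclear_pleomorphism: int,
                     mitotic_count: int) -> int:
    """Map three NGS component scores (each 1-3) to an overall grade (1-3).

    Hypothetical helper for illustration; not a clinical tool.
    """
    components = (tubule_formation, nuclear_pleomorphism, mitotic_count)
    for score in components:
        if score not in (1, 2, 3):
            raise ValueError("each component score must be 1, 2 or 3")

    total = sum(components)  # total score ranges from 3 to 9
    if total <= 5:
        return 1  # total scores 3-5
    if total <= 7:
        return 2  # total scores 6-7
    return 3      # total scores 8-9
```

For example, component scores of 2, 3 and 2 sum to 7, which corresponds to grade 2.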
1.2 Breast cancer screening

1.2.1 Breast cancer screening programmes

Population based screening programmes are designed around the ten Wilson and Jungner principles, published by the WHO in 1968, and subsequent modifications of these principles31. National screening programmes are conducted to identify certain diseases in an asymptomatic population, for earlier diagnosis and to facilitate prompt treatment32. Breast cancer screening aims to “maximise the success of treatment, reducing mortality from breast cancer”33. Most European countries as well as many countries across the globe have either a national or regional population breast cancer screening programme24, although it is recognised that these types of organised screening programmes are limited to high-income countries34. The ‘Marmot review’ 2012 and recent meta-analyses of randomised controlled trials estimate a 15-30% reduction in mortality due to mammographic screening35,36. A study published by Duffy et al in 2020 found, in a Swedish screening population, a 34% reduction in 10-year mortality through participation in screening, which was independent of changes in treatment37. However, the risks of harm from screening, such as false positives resulting in increased patient anxiety, overdiagnosis (estimated at 11-19%), and overtreatment, whilst difficult to truly gauge, need to be balanced against the benefits of screening36,38,39.

This research thesis primarily considers the UK National Health Service Breast Screening Programme (NHSBSP). Following the recommendations of the ‘Forrest Report’ 1986, the NHSBSP commenced in 1988, with women aged 50-64 years old screened every three years with a one view mediolateral oblique (MLO) screen film mammogram read by a single reader40–42. The NHSBSP has evolved over time and now uses two-view full field digital mammograms (FFDM) as well as double reading of each mammogram.
Women aged 50-70 years old are invited to attend every three years at one of the seventy-five screening units across the UK43. Women can also self-refer after the age of 70 years old and continue with screening every three years. This deviates from the current European Commission Initiative on Breast Cancer (ECIBC) guidelines, where screening is recommended every two years for women aged 50-69 years old, every three years for women aged 70-74 years old and every two to three years for women aged 45-49 years old38,44,45. Guidance from the American College of Radiology (ACR) and Society of Breast Imaging (SBI) differs too: the ACR and SBI recommend that annual mammography starts at the age of 40 years old46. A summary of the variations between different screening programmes and committee recommendations is shown in Table 1-2.

Programme / committee   Interval               Age      Modality
UK43                    Triennial              50-70+   FFDM
Sweden47                Biennial               40-74    FFDM
Netherlands48           Biennial               50-75    FFDM
Norway49                Biennial               50-69    FFDM
Australia50             Opt-in                 40-49    FFDM
                        Biennial               50-74    FFDM
                        Opt-in                 74+      FFDM
China51                 Annual-Triennial       20-39    Examination (self-exam monthly)
                        Annual-Biennial        40-69    FFDM + *USS
                        Annual                 40-69    Examination (self-exam monthly)
                        Annual                 70+      Examination (self-exam monthly)
ECIBC44,45              Biennial / Triennial   45-49    FFDM
                        Biennial               50-69    FFDM
                        Triennial              70-74    FFDM
ACR46                   Annual                 40+      FFDM
USPSTF52                Individual decision    40-49    FFDM
                        Biennial               50-74    FFDM
CTFPHC53                Shared decision        40-49    FFDM
                        Biennial / Triennial   50-74    FFDM

Table 1-2 – Breast cancer screening programmes and committee recommendations. Detailing the recommendations for the frequency, age and modality of screening from different screening programmes and screening committee recommendations. Adapted from Clift et al39. ACR: American College of Radiology, CTFPHC: Canadian Task Force on Preventive Health Care, ECIBC: European Commission Initiative on Breast Cancer, FFDM: Full field digital mammography, USS: Ultrasound, USPSTF: US Preventive Services Task Force. *In patients with dense breasts only.
Approximately 2.5 million women aged 50-70 years old are invited for screening as part of the NHSBSP each year, with 1.8 million women attending screening (~71% uptake) and 66,000 (~3.7%) being recalled to attend an assessment clinic33,54. The NHSBSP has set a standard to ensure the number of women recalled to assessment is not too high through maximising specificity. The recall rate is set at an acceptable level of 10% for prevalent (first round of screening) and 7% for incident (not first round of screening) screens, as well as an achievable level of 7% for prevalent and 5% for incident screens33. Each year ~15,000 cancers are diagnosed through screening (0.8% of women screened) and the NHSBSP has set an age standardised detection ratio for invasive cancer of ≥ 1.00 for acceptable and ≥ 1.40 for achievable, in women aged 50-70 years old33,54. The screening programme aims to detect clinically significant cancers to reduce mortality, so it is important that small cancers, defined by the NHSBSP as invasive cancers < 15 mm, are detected. The NHSBSP standard is set at an age standardised detection ratio of ≥ 1.00 for acceptable and ≥ 1.40 for achievable regarding the detection of small invasive cancers. Approximately 51.9% of invasive cancers detected through screening are small33,54.

In the NHSBSP each mammogram is read on 5-megapixel monitors by two expert readers55. Every reader must report 5,000 mammograms each year and also undertake the PERsonal perFORmance in Mammographic Screening (PERFORMS) assessment to monitor reader standards56. Double reading can either be blinded (independent), such that the readers are unable to see the other reader’s decision, or unblinded (dependent), so that the readers are able to see each other’s decision. Where there is discordance between readers, arbitration / consensus reading is carried out by a third reader or group of readers.
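The double reading workflow described above can be expressed as a short Python sketch. This is an illustration under stated assumptions rather than NHSBSP logic: the names `double_read`, `RECALL` and `NORMAL` are hypothetical, and the sketch assumes that concordant decisions stand without arbitration, which the text does not specify and which may differ between screening units.

```python
from typing import Callable

RECALL = "recall"
NORMAL = "normal"


def double_read(first_read: str,
                second_read: str,
                arbitrate: Callable[[str, str], str]) -> str:
    """Combine two screen reads into a final decision.

    Assumption: concordant reads are accepted as they are; discordant
    reads are passed to an arbitration step (a third reader or group).
    """
    for decision in (first_read, second_read):
        if decision not in (RECALL, NORMAL):
            raise ValueError(f"unknown decision: {decision!r}")
    if first_read == second_read:
        return first_read
    return arbitrate(first_read, second_read)


if __name__ == "__main__":
    # Hypothetical arbiter that recalls every discordant case.
    always_recall = lambda first, second: RECALL
    print(double_read(NORMAL, RECALL, always_recall))
```

Passing the arbitration step in as a function keeps the sketch agnostic about whether arbitration is performed by a single third reader or by consensus.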
Double reading is shown to pick up an additional 9% of cancers compared to single reading55,57. However, there is a national shortage of screen readers in the UK, with this shortage expected to increase over the next five years58.

A woman’s lifetime risk of developing breast cancer is ~11%59. In the UK breast screening is currently adapted for those with a high risk (> 30% lifetime risk) of breast cancer, where women are offered annual magnetic resonance imaging (MRI) and / or mammography depending on their age38. Moderate risk (17-30% lifetime risk) women are also offered annual mammography59. Further risk adaptation using breast density, polygenic risk scores as well as risk prediction models is currently being investigated and is discussed later in this chapter.

1.2.2 Mammography

A mammogram is an image of the breast acquired using low energy x-rays, and mammography is the primary imaging technique used world-wide for breast cancer screening. The process of x-ray production in mammography is as follows: electrons are released from a filament via a process of thermionic emission and are then accelerated away from the negatively charged cathode across the vacuum tube towards the positively charged anode. The tube voltage is the potential difference between the anode and cathode which causes the acceleration of the electrons across the tube towards the anode when they are emitted60. The electrons then hit the rotating anode target (which can be made out of a number of materials e.g. tungsten, molybdenum, or rhodium), which causes the emission of x-rays through Bremsstrahlung radiation61. The anode is rotated to dissipate the heat generated by the bombardment of electrons. These x-rays are then directed towards the patient via a window in the lead shielding. Low and high energy photons are removed via a filter and the beam is focused using a collimator, as shown in Figure 1-2.
The x-ray beam is attenuated differently by different tissues in the body, giving contrast in the resulting image. X-rays are attenuated to a greater degree by denser fibroglandular tissue than by fatty tissue, such that dense fibroglandular tissue appears whiter in the x-ray image. The same is true of many lesions and microcalcifications, which is why mammography is an appropriate imaging modality for screening. The target / filter combination and the exposure factors (tube voltage, tube current and exposure time) can be adjusted to increase the image quality whilst maintaining an acceptable level of dose for each woman. The recommended dose per mammogram according to the National Diagnostic Reference Levels (NDRLs) is 2.5 mGy (mean glandular dose)62.

Figure 1-2 – Production of x-rays diagram. Showing the different components of an x-ray machine for mammography. Adapted from Radiology Cafe60.

A mammogram is made up of the craniocaudal (CC) and MLO views of the right and left breast, creating a two-view mammogram63. Mammography is held to high technical quality standards to ensure the images are of adequate quality to allow for interpretation and to minimise errors64. These standards include ensuring there are no skin folds, blurring or artefacts included in the image as well as ensuring that the whole breast is included and the nipple is in profile in at least one view. Methods to ensure the whole breast is included entail using the posterior nipple line (PNL) and making sure the inframammary angle is included in the MLO view, as shown in Figure 1-3. Common mammographic signs of cancer are: an irregular ill-defined mass, spiculated mass, microlobulated mass, asymmetry, fine linear pleomorphic microcalcification, and architectural distortion.

Figure 1-3 – Example of a two-view full field digital mammogram (FFDM). Image quality markers are shown for the posterior nipple line, inframammary angle and the nipple in profile.
CC: craniocaudal, MLO: mediolateral oblique, PNL: posterior nipple line.

1.2.3 Mammographic breast density

Mammographic breast density is a radiographic representation of the proportion of fibroglandular tissue (fibrous connective tissue (stroma), and glandular tissue (terminal ductal lobular units)) to fatty tissue in the breast, where density is represented by areas that are radiopaque38,65. Variations in density between individuals occur due to genetic predisposition and ethnicity as well as changes in weight, nutrition, hormone exposure / expression, and age66,67. It is not only the amount and distribution, but also the heterogeneity and texture of the fibroglandular tissue that is important68.

Breast density can be measured from mammographic images using visual methods (ACR Breast Imaging-Reporting and Data System (BI-RADS) 5th edition lexicon (2013) / Visual Analogue Scale (VAS) / Wolfe classification (1976) / Boyd classification (1995) / Tabar classification (1997)), which are subjective and therefore show inter- and intra-reader variability in reporting68–71. When using the BI-RADS lexicon the greatest discordance is reported in the middle two categories (b and c)72,73. The greatest proportion of women attending screening also have a mammographic breast density in these middle two BI-RADS categories (b and c)74,75. Four different FFDM images demonstrating the four BI-RADS breast density categories are shown in Figure 1-4.

Figure 1-4 – Examples of breast imaging-reporting and data system (BI-RADS) 5th edition mammographic breast density categories. a) Almost entirely fatty, b) scattered areas of fibroglandular density, c) heterogeneously dense, d) extremely dense.

Alternatively, breast density can be measured using semi-automated or fully automated systems, which provide a more consistent output76.
Planimetry, semi-automated thresholding techniques (Cumulus and Medena) and fully automated systems (Volpara™ and Quantra™) can be used with varying levels of human interaction to provide quantitative measures of mammographic breast density68,77. Fully automated systems, such as Volpara™ and Quantra™, provide density as either an area or volumetric breast density (VBD) measure. These quantitative measures can then be mapped into the BI-RADS density categories. Previous studies have shown variable agreement between automated systems and radiologists’ density assessment (k = 0.46-0.57)78. In addition, these measurements can be affected by positioning, radiographic factors such as kVp (tube voltage) and mAs (tube current and exposure time) as well as the incorporation of nonstandard views38,76.

Most automated systems require raw (“for processing”) FFDM data, which is not kept routinely due to storage space requirements. A raw image is proportional to the x-ray attenuation detector signal. The raw FFDM is then processed to create “for presentation” images by the mammography vendor’s algorithm68. New deep learning (DL) density systems have started to use the processed FFDM images to calculate mammographic breast density79–82. Lehman et al demonstrated good agreement (k = 0.85; 95% CI 0.84-0.86) and acceptance (90%) with radiologists when implementing a new DL density algorithm in clinical practice, reviewing > 10,000 mammograms82,83. When this model was then externally validated there was a high rate of agreement by both academic (94.9%) and community (90.7%) radiologists84.

Increased breast density reduces the sensitivity of mammography as overlapping fibroglandular tissue can obscure the detection of a breast cancer (known as “masking”). Sensitivity reduces from 75.0-98.0% to 30.4-66.0% from the lowest to highest BI-RADS density categories (a to d)85,86. Breast density is also an independent risk factor for developing breast cancer87,88.
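The agreement values (k) quoted above for density assessment can be made concrete with a short sketch. The Python below computes an unweighted Cohen's kappa between two sets of BI-RADS density categories. This is an assumption for illustration: the cited studies may have used a weighted variant, and the function name and example data are hypothetical.

```python
from collections import Counter
from typing import Sequence


def cohens_kappa(rater_a: Sequence[str], rater_b: Sequence[str]) -> float:
    """Unweighted Cohen's kappa: (observed - expected) / (1 - expected)."""
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("both raters must label the same, non-empty set of cases")
    n = len(rater_a)
    # Proportion of cases on which the two raters agree.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance, from each rater's category frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    categories = set(counts_a) | set(counts_b)
    expected = sum(counts_a[c] * counts_b[c] for c in categories) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical category throughout
    return (observed - expected) / (1.0 - expected)


if __name__ == "__main__":
    # Hypothetical BI-RADS density categories for eight mammograms.
    radiologist = ["a", "b", "b", "c", "c", "c", "d", "b"]
    automated = ["a", "b", "c", "c", "c", "b", "d", "b"]
    print(round(cohens_kappa(radiologist, automated), 2))
```

Because the BI-RADS categories are ordered, a weighted kappa that penalises a disagreement between a and d more heavily than one between b and c is often preferred; the unweighted form is shown only because it is the simplest to follow.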
A systematic review and meta-analysis of forty-two studies demonstrated the relative risk of developing breast cancer was 2.9 and 4.6 in women with a Percentage Mammographic Density (PMD) of 50-74% and ≥ 75% respectively, relative to women with PMD < 5%87,89. VAS (OR 4.4 (95% CI 2.7-7.0)), Densitas % (OR 2.17 (95% CI 1.41-3.33)), Volpara % (OR 2.42 (95% CI 1.56-3.78)) and BI-RADS (OR 2.3 (95% CI 1.9-2.8)) measures have been shown to be strong predictors of breast cancer risk78,79.

Legislation passed by the USA Congress in 2019 requires the reporting of density as part of the USA breast screening programme90,91. Women classified as having dense breasts (BI-RADS c or d) in USA screening are recommended to discuss with their doctor whether they should undergo additional imaging, as a cancer could potentially have been obscured by the dense breast tissue38. In 2022 the European Society of Breast Imaging (EUSOBI) recommended that women “should be informed about their breast density” and that women aged 50-70 with “extremely dense breasts” should be offered screening breast MRI “every 2 to 4 years”92.

1.2.4 Risk prediction

Opportunities for risk adapted screening, using different measures for risk stratification as well as different imaging modalities or screening frequencies for cancer detection, are currently under investigation38,93. Methods for risk stratification include using breast density alone, which involves triaging the women in the highest density categories (c and d) for supplemental imaging with more sensitive imaging modalities (e.g. MRI or ultrasound)94. Alternatively, mammographic breast density can be incorporated into risk prediction models (e.g. Tyrer-Cuzick (TC)) to increase the predictive power (TC + BI-RADS OR 1.55 (95% CI 1.33-1.80), TC + Volpara VBD OR 1.40 (95% CI 1.21-1.61) vs TC alone OR 1.27 (95% CI 1.14-1.40))95,96.
Yala et al demonstrated that their DL model (Mirai), which uses the mammographic imaging data only and no additional risk prediction fields, achieved a one-year cancer risk prediction C-index of 0.75-0.84 at seven separate sites in four continents when used alone, compared to the TC C-index of 0.62, which was tested on data from only one USA site93. Furthermore, using the data from one USA site and thresholding Mirai at the TC specificity of 85.2%, Mirai achieved a sensitivity of 39.7% (95% CI 32.9%-46.5%) compared to TC which achieved a sensitivity of 22.9% (95% CI 15.9%-29.6%)93. However, Mirai was both developed and tested only on Hologic mammograms and so further generalisability testing using different mammographic vendors is required to account for variability of post-processing93.

Polygenic risk scores can be calculated through sequencing a pre-defined panel of single nucleotide polymorphisms97. The incorporation of polygenic risk scores into risk prediction models (e.g. TC or Breast and Ovarian Analysis of Disease Incidence and Carrier Estimation Algorithm (BOADICEA)) resulted in improved risk stratification accuracy (area under the receiver operating characteristic curve (AUROC) 0.691 to 0.697)98; however, it can also lead to overestimation of risk in certain risk categories (e.g. high risk expected to observed number of cases (E/O) 1.54 (0.81-2.29)). A collaborative approach using breast density, polygenic risk scores, and risk prediction models / DL models is likely to further increase the accuracy of screening risk stratification38,94,96,97,99. With increasing accuracy the feasibility and cost-effectiveness of risk stratified screening also improves38.

1.2.5 Interval cancers

Interval cancers (ICs) are defined as those occurring between screening rounds (“a breast cancer diagnosed in the interval between scheduled screening episodes in women who have been screened and issued with a normal screening result”)38,100,101.
An estimated 6,000 ICs occur in the NHSBSP each year with an average time to diagnosis of 644 days, such that the highest proportions are diagnosed in the second (42.0%) and third years (39.0%) after screening102,103. Updated Public Health England (PHE) guidance in 2017 requires all ICs to be reviewed in order to ascertain whether or not a cancer had been missed at the original screen read33. A score is applied: (1) normal / benign (77.0%), (2) uncertain (16.0%), and (3) suspicious (7.0%), with a ‘duty of candour’ requiring all patients to be informed of a suspicious finding33. ICs are often of higher grade and larger lesion size than screen detected cancers and therefore have a worse prognosis103. Increased mammographic breast density is associated with increased risk of IC development101,104–106. ICs are a key measure of the performance of a screening programme and the NHSBSP has set an acceptable IC rate target of 0.65 per 1000 women screened in the first 12 months, 1.40 per 1000 at 12-24 months and 1.65 per 1000 at 24-36 months, targets which have increased to reflect the changes in incidence over time33,100,101. Therefore, a total IC rate of 3.7 per 1000 women screened is expected in the NHSBSP.

1.3 Artificial intelligence in breast cancer screening

1.3.1 Introduction to artificial intelligence

Alan Turing first proposed the question, “Can machines think?”, in 1950107. The term Artificial Intelligence (AI) is thought to have first been used in 1956 at the Dartmouth Summer Research Project on Artificial Intelligence108. Progression in the field of AI has recently accelerated due to the availability of large datasets and sufficient computing power, as well as growing interest and funding to develop algorithms to automate everyday tasks. AI is an umbrella term for Machine Learning (ML) and DL disciplines, as shown in Figure 1-5. AI is the theory and development of computer systems able to perform tasks normally requiring human intelligence, such as visual perception and decision-making.
AI is strictly defined by ISO/IEC TR 24028:2020 as the “capability of an engineered system to acquire, process and apply knowledge and skills”109,110. ML is a ‘sub-field’ of AI, where algorithms learn and improve autonomously through the provision of data, without ‘explicit programming’111. Examples of traditional ML techniques include: support vector machines, k nearest neighbours, principal component analysis, and decision trees. DL is a subset of ML that uses multiple algorithms working in a neural network architecture with many layers to extract high level features from data and carry out hierarchical learning112.

Figure 1-5 – Artificial intelligence (AI) hierarchy of terms. Image used with permission of the National Breast Imaging Academy e-LfH programme.

DL has been applied to computer vision tasks using Convolutional Neural Networks (CNNs), achieving good performance. The 2012 ImageNet Large Scale Visual Recognition Challenge (ILSVRC) is seen as the catalyst for this innovation: the AlexNet algorithm reduced the error rate in an image recognition task to 15.3%, compared with an error rate of 26.2% achieved by the team that came second113. CNN algorithms are increasingly being applied to everyday image recognition tasks such as self-driving cars, facial recognition, and automated text translation.

A typical CNN consists of multiple layers. The first layer is the input layer. Next there are the convolutional, detector, pooling and fully connected layers, which are the hidden layers. Lastly, there is the output layer. In the input layer an image is provided to the algorithm. Next, in the convolution layers a kernel passes over the image to extract high level features111. An activation function can then be applied in the detector layer, such as a rectified linear unit (ReLU), where all negative values in the feature map are replaced with zero, adding nonlinearity to the data.
The feature map is then passed to the pooling operation (e.g. max pooling, sum pooling or average pooling) for feature reduction / down sampling, providing translation invariance. This map is then passed to the fully connected layer where a classifier function (e.g. softmax activation function) is applied. There can be hundreds of hidden layers in a CNN112,114. The output from the algorithm is then provided in the output layer as a classification result. Iterative adjustments of the algorithm take place through a loss error function and back-propagation processes in training. An overview of a CNN architecture is shown in Figure 1-6.

Figure 1-6 – Overview of the architecture of a Convolutional Neural Network (CNN). Adapted from National Breast Imaging Academy – Computer-Aided Detection (CAD) and Artificial Intelligence (AI) module.

Radiology is a digitally advanced field of medicine as well as being a mainly visual-based specialty, with ~45 million radiological images reported each year in England (the most common of which is plain film x-rays with ~1.86 million reported annually)115. Thus, radiological image interpretation is a prime candidate for the application of DL to aid in automating radiology tasks. CNNs have been used in detection, diagnosis, and segmentation-based tasks in radiology111. Medical images, such as mammograms, do differ from everyday images. They are of higher resolution, for example mammograms contain approximately 2600 x 2000 pixels, and the area to be detected (disease) in medical images is a relatively small area of the total image. Moreover, mammograms are more complex than natural images due to the high variability in patterns and the difference in features and task requirements, making this a challenging task116,117.
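The sequence of CNN operations described above (convolution, ReLU detector, max pooling, fully connected layer with softmax) can be traced in a minimal forward-pass sketch. This is a toy illustration in pure Python and not a trainable network: the kernel, weights and image are hypothetical values, there is a single convolution stage, and training by back-propagation is omitted entirely.

```python
import math
from typing import List

Matrix = List[List[float]]


def convolve2d(image: Matrix, kernel: Matrix) -> Matrix:
    """Slide the kernel over the image (no padding, stride 1)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h, out_w = len(image) - kh + 1, len(image[0]) - kw + 1
    return [[sum(image[i + di][j + dj] * kernel[di][dj]
                 for di in range(kh) for dj in range(kw))
             for j in range(out_w)] for i in range(out_h)]


def relu(feature_map: Matrix) -> Matrix:
    """Detector stage: replace every negative value with zero."""
    return [[max(0.0, v) for v in row] for row in feature_map]


def max_pool(feature_map: Matrix, size: int = 2) -> Matrix:
    """Down-sample by keeping the maximum of each size x size block."""
    out_h, out_w = len(feature_map) // size, len(feature_map[0]) // size
    return [[max(feature_map[i * size + di][j * size + dj]
                 for di in range(size) for dj in range(size))
             for j in range(out_w)] for i in range(out_h)]


def fully_connected(features: List[float], weights: Matrix,
                    biases: List[float]) -> List[float]:
    """One output per class: weighted sum of the flattened features."""
    return [sum(w * x for w, x in zip(row, features)) + b
            for row, b in zip(weights, biases)]


def softmax(logits: List[float]) -> List[float]:
    """Convert class scores into probabilities that sum to one."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def forward(image: Matrix, kernel: Matrix, weights: Matrix,
            biases: List[float]) -> List[float]:
    pooled = max_pool(relu(convolve2d(image, kernel)))
    flat = [v for row in pooled for v in row]
    return softmax(fully_connected(flat, weights, biases))
```

For a 5 x 5 input and a 2 x 2 kernel, the convolution yields a 4 x 4 feature map, pooling reduces it to 2 x 2, and the fully connected layer therefore needs four weights per output class. Real mammography models stack many such stages and learn the kernel and weight values during training.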
Imaging data for training is becoming increasingly available, such as the ImageNet database which contains over 15 million labelled natural images; however, there is still a limitation in the availability of ethically approved curated medical image datasets118. To overcome this limitation algorithms can first be pre-trained on datasets such as ImageNet, or other publicly available imaging datasets, and then re-trained on representative medical imaging data through transfer learning. Algorithm training can take place by supervised, semi-supervised, or unsupervised approaches requiring varying levels of data annotation. In supervised learning, detailed lesion level annotations and labels are provided, whereas in unsupervised learning no annotations or labels are provided and the algorithm itself identifies the pertinent image features from which to classify the image. Semi-supervised learning provides a hybrid approach111,114. Such labels also provide the “ground truth” when testing algorithm performance. This ground truth is seen as the absolute outcome of a case, and can consist of expert radiologist annotation, time follow-up or histopathological outcome.

van Leeuwen et al reported that there were over 100 Conformité Européenne (CE) marked AI products for radiological applications in 2021; however, only 36 had peer-reviewed evidence, and of the available publications 49% of studies were performed “independently from the vendor”109. Kim et al reported that only 6% of published studies relating to the evaluation of AI algorithms included external validation, in which the AI is tested on data from a different institution (geographical) or time period (temporal) to the training data119. Thus, further unbiased evidence from large external studies conducted independently of the commercial vendor is needed across the radiology AI field.
1.3.2 History of computer aided detection systems in breast cancer screening

Research into the use of Computer Aided Detection (CAD), also known as CADe systems, in medical imaging commenced in the 1960s, with the first CAD mammography system (Hologic R2 (Image Checker M1000)) receiving FDA clearance in 1998120. Traditional CAD systems, based on hand-crafted features, provided prompts for radiologists, such as a D symbol for calcification and a * symbol for mass, to mark areas of increased suspicion in order to reduce reader oversight, acting as a “second-look”. These initial systems had high sensitivity for calcifications (98.0-100%) and could also detect masses (88.0-92.0% sensitivity), but few could detect features of asymmetry or distortion121,122. A high number of false positive prompts, due to low specificity, resulted in reader fatigue, distraction, and loss of confidence in CAD systems. Thus, overall performance when using a CAD system is dependent on the decision-making process of the reader, the accuracy of the system, and the interaction between the two123,124. There is also the possibility of over-reliance on the system leading to a loss in the synergy between the computer system and human reader required to maintain a high level of sensitivity, as lesions could be overlooked if not marked125.

In 2008 between 74.0% and 91.4% of USA mammograms were read using a CAD based system in conjunction with a single reader, and an updated survey in 2016 found 92.3% of screening centres used CAD systems126,127. However, the adoption of CAD systems has been limited across other screening programmes in the world, including the NHSBSP, due to the effect of increasing recall rates and thus deemed lack of cost effectiveness128.
Lehman et al investigated the use of CAD within the USA Breast Cancer Surveillance Consortium between 2003 and 2009, with 495,818 mammograms interpreted with CAD and 129,807 without CAD by 271 radiologists at 66 facilities. They demonstrated a reduction in sensitivity, from 87.3% without CAD to 85.3% with CAD, as well as an increase in recall rate125. A pooled analysis of ten studies (2001-2008) by Taylor et al (2008) demonstrated that single reading plus CAD vs single reading resulted in a significant increase in recall rates (OR 1.10, 95% CI 1.09-1.12, p < 0.001), and that double reading with consensus / arbitration resulted in a significant improvement in recall rates compared to single reading plus CAD129. CAD systems demonstrated a varied performance for cancer detection, and in the pooled analysis as part of the review, no statistically significant difference in cancer detection rate was found (OR 1.04, 95% CI 0.96-1.13, p = 0.35)129. CAD systems also face the same reduction in sensitivity due to increased mammographic density that human readers incur, and CAD prompts have been shown to increase overall reading time by approximately 10-20 seconds130–132.

1.3.3 Deep Learning applications to breast cancer screening

The traditional CAD systems are now being superseded by DL CAD algorithms which have improved sensitivity and specificity for the detection of cancers133. The majority of DL algorithms with Food and Drug Administration (FDA) or CE mark approval are for clinical decision support system (CDSS) applications, similar to traditional CAD systems, where the algorithm supports the reader by providing prompts to locate the cancer. However, there are multiple other applications of the latest DL CAD systems for mammography interpretation to aid readers and improve the efficiency and efficacy of breast screening, which are shown in Figure 1-7.
These include the use of DL CAD systems as stand-alone readers for DL CAD triage (CADt), to prioritise work lists and pre-populate image reports for normal studies to improve programme efficiency. Studies have shown that these DL CADt systems can operate as stand-alone readers to both rule out a high proportion of normal cases (17.0%-60.0%) whilst missing a small proportion of cancer cases (0.0-7.0%), as well as rule in a small proportion of cases (1.0-5.0%) highly suspicious of cancer (13.0%-32.0% of next round cancers (NRCs) and ICs) for further assessment134,135. In addition, DL CAD detection and diagnosis (CADe+x) algorithms could operate as stand-alone systems to replace a reader in a double reading system. This approach has the potential to improve both efficiency and efficacy, as these systems could replace one reader as well as reduce the rate of ICs. A study by Lång et al found 11.2% of ICs could be detected at the screening mammogram when the DL CADe+x algorithm was set at a 4.0% recall rate. As outlined earlier in this chapter (sections 1.2.3 and 1.2.4), DL density algorithms could potentially be used to risk stratify the population for adapted screening and the application of supplemental imaging. These DL algorithms could either be used alone or in combination with other risk information, such as mammographic breast density, polygenic risk scores and other risk prediction models.

Figure 1-7 – Application of Deep Learning (DL) Computer Aided Detection (CAD) algorithms to breast cancer screening. Image used with permission of the National Breast Imaging Academy e-LfH programme.

The UK National Screening Committee (NSC) report 2021 on the “Use of artificial intelligence for mammographic image analysis in breast cancer screening – Rapid review and evidence map” did “not recommend using AI in the NHSBSP”136.
This was due to: a lack of evidence relating to the accuracy of AI in clinical practice, the varying reported performance of AI in different settings, the lack of UK based evidence, lack of quality evidence, and lack of evidence pertaining to AI performance for different types of breast cancer as well as performance in different groups of women (e.g. “different ethnic groups”)136. All studies included in the report were at high risk of bias using the QUADAS-2 tool for assessment, and the authors highlighted the importance of external validation using geographically different datasets as well as pre-specified test thresholds to limit bias. All evidence provided in the report was retrospective and the report highlighted that a number of studies used enriched cancer cohorts for testing. Enriched cohorts consist of an increased cancer proportion which is “atypical of a screening population”; this was defined in the report as a cancer percentage of more than 3.0%. A repeat review was recommended for 1-3 years’ time from the date of publication to review the latest evidence.

The 2016 “Digital Mammography Dialogue on Reverse Engineering Assessment and Methods” (DM DREAM) challenge, although not conducted using UK data, did provide an external validation study on a large representative cohort, from two different countries and screening programmes, for the testing of multiple DL algorithms137. The DREAM challenge included 126 different teams, who each developed their own DL CADe+x algorithm for the prediction of cancer development within the next 12 months. The curated datasets consisted of 144,231 and 166,578 screening FFDMs, with prior exams and clinical information available, from a USA (Kaiser Permanente Washington (KPW)) and Swedish (Karolinska Institute (KI)) screening site respectively137. 1.1% of the mammograms in the KPW and KI datasets were cancers and no image level annotations were available.
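The idea of a pre-specified test threshold, noted above as a way to limit bias, can be sketched in Python. The sketch assumes an algorithm outputs a continuous suspicion score per case and that a case is flagged when its score is at or above the threshold; the function names, scores and labels are hypothetical, and in practice the threshold would be fixed on one dataset and then applied unchanged to a separate test set.

```python
import math
from typing import Sequence, Tuple


def threshold_for_sensitivity(scores: Sequence[float],
                              labels: Sequence[int],
                              target_sensitivity: float) -> float:
    """Highest threshold at which sensitivity is at least the target.

    labels: 1 = cancer, 0 = normal. A case is flagged if score >= threshold.
    """
    cancer_scores = sorted((s for s, y in zip(scores, labels) if y == 1),
                           reverse=True)
    if not cancer_scores:
        raise ValueError("at least one cancer case is required")
    # Number of cancers that must be flagged to reach the target.
    needed = max(1, math.ceil(target_sensitivity * len(cancer_scores)))
    return cancer_scores[needed - 1]


def sensitivity_specificity(scores: Sequence[float],
                            labels: Sequence[int],
                            threshold: float) -> Tuple[float, float]:
    """Sensitivity and specificity when flagging score >= threshold."""
    tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= threshold)
    fn = sum(1 for s, y in zip(scores, labels) if y == 1 and s < threshold)
    tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < threshold)
    fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)


if __name__ == "__main__":
    # Hypothetical scores for four cancers and four normal cases.
    scores = [0.9, 0.8, 0.4, 0.2, 0.7, 0.3, 0.1, 0.05]
    labels = [1, 1, 1, 1, 0, 0, 0, 0]
    cutoff = threshold_for_sensitivity(scores, labels, 0.75)
    print(cutoff, sensitivity_specificity(scores, labels, cutoff))
```

The same two functions cover rule-out and rule-in triage: a rule-out threshold would be chosen for very high sensitivity, and a rule-in threshold could be chosen analogously from the normal cases for very high specificity. Note that specificity measured on an enriched cohort does not translate directly to a screening population with a much lower cancer prevalence.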
Each image was assigned a binary image label ground truth from histopathology results (cancer) or follow-up of ≥ 12 months (normal)133. The KPW dataset as well as other publicly available datasets were used for model training. The top twenty algorithms were evaluated using the KI dataset. The top performing algorithm on the KPW dataset also achieved the top performance on the KI dataset, demonstrating the generalisability of this algorithm. No challenge algorithm outperformed reader performance when the threshold was set at single reader sensitivity (77.1%), with the top performing algorithm achieving a specificity of 88.0% compared to the single reader specificity of 96.7%. Hybridisation of the best performing algorithms into the challenge ensemble method achieved a specificity of 92.5% (at the single reader sensitivity (77.1%))137.

It is important to identify the limitations of DL CAD algorithms to enable the radiologist to be vigilant prior to implementation in clinical workflow. The latest systems are still susceptible to the limitations that both traditional CAD and human readers face, such as the reduction in sensitivity with increasing breast density138. The true impact of these latest systems on reading time, recall rates, biopsy rates, cost effectiveness and cancer detection is unknown and prospective evaluations are required to fully assess the clinical impact of DL CAD algorithms on breast cancer screening targets. It is also pertinent that DL CAD algorithms do not exacerbate the potential harms of screening such as overdiagnosis, overtreatment and false positive recalls to assessment which lead to patient anxiety. Lastly, the gaps in evidence highlighted by the UK NSC report 2021 need to be addressed before DL CAD algorithms are introduced into the NHSBSP.

1.4 Thesis aims and outline

The focus of this thesis is the evaluation of AI algorithms (specifically DL CAD algorithms) for breast cancer screening applications.
The developments in DL mean that AI algorithms have been created for numerous mammography screen reading tasks. However, more evidence is needed to determine the best way to evaluate and monitor these AI algorithms to ensure acceptable performance for deployment into breast screening programmes as well as which applications of AI algorithms are most suitable for breast cancer screening in different countries. In this thesis, three different commercial AI algorithms are evaluated using a large retrospective dataset from two NHSBSP sites for three different proposed AI algorithm applications in breast cancer screening. For consistency each AI algorithm is assigned a unique identifier (DL-ID) for this thesis which remains consistent throughout all chapters.

The main aims of this thesis are:

1. To investigate the performance of AI algorithms for stand-alone reader applications in breast cancer screening, through a systematic review and meta-analysis, to determine the current performance achieved, the datasets used in testing, as well as gaps in reported evidence.
2. To develop a representative UK screening mammographic imaging database that can be used for retrospective benchmark testing of AI algorithms.
3. To investigate the performance of three AI algorithms for the detection of ICs in breast cancer screening.
4. To benchmark the performance of three AI algorithms to be used as a stand-alone reader as well as in collaboration with human readers for breast cancer screening.
5. To evaluate the performance of three AI algorithms to triage low priority cases that do not require human reading as well as high priority cases that can bypass reading to enhanced assessment, whilst maintaining an acceptable sensitivity and specificity.

Chapter 1 provided an overview of breast cancer and breast cancer screening as well as the main image screening technique of mammography.
In addition, the history of CAD systems in breast cancer screening programmes and the advances in DL methods that underpin the latest AI algorithm approaches to breast cancer screening were discussed.

Chapter 2 covers the ethical, legal and regulatory challenges surrounding the use of AI algorithms in breast cancer screening and the development of medical imaging databases required to evaluate AI algorithm performance.

Chapter 3 is a systematic review and meta-analysis of the stand-alone applications of AI algorithms in breast cancer screening. The diagnostic performances of different AI algorithms are compared and the databases used for testing, study methodology and reporting quality are evaluated.

Chapter 4 outlines the construction and contents of a mammographic medical imaging database which is subsequently used in Chapters 5-7 for testing the performance of different AI algorithms.

Chapter 5 presents the results from an experiment to evaluate the performance of AI algorithms for IC detection. This chapter also investigates the use of different thresholding methods to identify the operating point for each AI algorithm.

Chapter 6 details the results from an experiment to investigate the performance of AI algorithms for detection and diagnosis as a stand-alone reader compared to human reader performance. It also evaluates the sub-groups of cancers detected by each AI algorithm.

Chapter 7 describes the results from an experiment to use AI algorithms in a triage-based task for both normal and highly suspicious case identification. In addition, the cancers missed by each algorithm are reviewed.

Chapter 8 summarises the research presented in this thesis and its implications. This is followed by a section on the recommended direction of future work.
Chapter 2 – Adoption of artificial intelligence in breast imaging: evaluation, ethical constraints and limitations

This chapter explores how artificial intelligence (AI) is being applied and evaluated in breast imaging. Key ethical and legal challenges at the algorithm, data and clinical levels which need to be considered for the implementation of AI in everyday breast screening are discussed. The barriers and limitations currently facing this field from a technical, clinical and governance perspective are also outlined. Contents of this chapter have been published in British Journal of Cancer139.

2.1 Introduction

In breast oncology, a multidisciplinary team approach is essential, with imaging playing a key role in the care pathway for the screening, diagnosis, staging, monitoring and follow-up of malignancies. Novel imaging techniques of increasing complexity have resulted in longer reporting times. This, coupled with a shortage of radiologists and exponential growth in imaging requests, has led to an increasing demand on radiology departments. Recently, there has been a huge interest in using Artificial Intelligence (AI) for breast imaging to address these pressures, in a speciality where timing is critical and resources are finite140. The term AI covers both machine learning and deep learning141. It is the advances in deep learning for image interpretation that have resulted in the massive growth in interest for use in breast imaging112. AI applications can be broken down into two categories (Figure 2-1).

Figure 2-1 – Broad and narrow artificial intelligence (AI) applications to breast imaging.

The first category is “broad AI”, which lends itself to the administrative and organisational tasks within the imaging pathway. These systems can be used to replace repetitive and routine tasks such as appointment booking, contrast adjustment and image quality checks.
The second category is “narrow AI”, which covers computer-aided detection (CADe), diagnosis (CADx), and triaging worklists (CADt) as well as predicting treatment response and segmenting lesions112. These AI systems can be used as aids for clinicians or be used autonomously. Ultimately these AI solutions aim to improve the patient’s outcomes as well as the healthcare system’s efficiency. The latest advances in computer processing and the increased availability of data have been pivotal for developing AI-CAD (CADe and CADx) systems142,138. It is important to remain vigilant to the potential bias and ethical questions that arise when using this technology as well as the challenges of incorporating such systems into pre-existing workflows143,144. These overarching challenges need to be explored in order to facilitate discussion and drive engagement by clinicians, computer scientists, responsible national agencies and National Health Service (NHS) Trusts145. This article reviews how AI has been applied and evaluated using breast imaging as an exemplar. We then consider the ethical and legal challenges at the algorithm, data and clinical levels. Lastly, we discuss the barriers and limitations currently facing this field from a technical, clinical and governance perspective.

2.2 Evaluation of artificial intelligence in breast imaging

2.2.1 Retrospective evaluation

Retrospective testing on internal or external datasets is essential when assessing new AI tools for clinical imaging142,146. An algorithm is often trained and tested on an internal dataset which has been divided into an 80:20 split146. This means that the training data is not used to test the algorithm; otherwise this would result in bias and an overestimation in performance147.
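The 80:20 division described above can be illustrated with a minimal sketch. It assumes, as one reasonable choice not stated in the text, that the split is made at the patient level so that images from the same woman never appear in both the training and test sets. The record fields (`patient_id`, `image_id`) and the function name `split_by_patient` are hypothetical and used for illustration only.

```python
import random


def split_by_patient(records, test_fraction=0.2, seed=0):
    """Split image records into train/test sets with no patient in both.

    Assumption for illustration: each record is a dict with a
    'patient_id' key; splitting by patient avoids leakage between sets.
    """
    patient_ids = sorted({r["patient_id"] for r in records})
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    rng.shuffle(patient_ids)
    n_test = max(1, round(len(patient_ids) * test_fraction))
    test_ids = set(patient_ids[:n_test])
    train = [r for r in records if r["patient_id"] not in test_ids]
    test = [r for r in records if r["patient_id"] in test_ids]
    return train, test


# Synthetic example: 100 hypothetical patients with four images each.
records = [
    {"patient_id": f"P{i // 4:03d}", "image_id": f"IMG{i:04d}"}
    for i in range(400)
]
train, test = split_by_patient(records)
train_patients = {r["patient_id"] for r in train}
test_patients = {r["patient_id"] for r in test}
print(len(train), len(test), train_patients & test_patients)
```

Splitting on the patient identifier rather than on individual images is a design choice: an image-level split would place different views of the same breast on both sides of the divide, which is one way the overestimation of performance mentioned above can arise.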
Ideally external datasets consisting of new unseen data which has not been used for algorithm development are used to ascertain the generalisability of an algorithm in different populations with images from different manufacturers (see Ethical and legal constraints – Algorithm level for more information)146,148. It is also important to distinguish between testing that is conducted internally (by the AI developers) and externally (by an independent institution). External testing can limit bias and also allow for the comparison of multiple algorithms with similar applications149. Data that is representative of the population, structured, annotated and ready to use is limited, existing in only a small number of institutions (Table 2-1)150. New imaging portals and repositories, such as the Health Data Research Innovation Gateway, have been set up to try to address these data gaps and are key to developing a data ecosystem to meet the demand151. Principles such as FAIR (Findability, Accessibility, Interoperability, and Reusability) aim to guide data extraction as well as long-term management and sharing, in order to obtain the “maximum benefit” from datasets152. However, a balance must be found in this ecosystem between the implementation of FAIR principles and the often-strict controls put in place by Information Governance teams and ethics committees when creating imaging repositories.
Dataset | Country | Year of studies | Modality | Number of cases | Number of images
The Mammographic Image Analysis Society Digital Mammogram Database (MIAS)153 | UK | 1994 | SF-MG | 161 | 322
Curated Breast Imaging Subset of the Digital Database for Screening Mammography (CBIS-DDSM)154 | USA | 1999 (updated 2016) | SF-MG | 1566 | 10239
Investigation of Serial Studies to Predict Your Therapeutic Response with Imaging and moLecular Analysis (ISPY1 (ACRIN 6657))155 | USA | 2002-2006 | MRI | 222 | 386528
InBreast156 | Portugal | 2008-2010 | FFDM | 115 | 410
Cohort of Screen-Aged Women (CSAW)157 | Sweden | 2008-2015 | FFDM | 499807 | >2000000
The OPTIMAM Mammography Image Database (OMI-DB)150 | UK | 2010-2019 | FFDM | 151403 | >2000000
New York University Breast Cancer Screening Dataset (NYU BCSD v1.0)158 | USA | 2010-2017 | FFDM | 141473 | 1001093
Breast Cancer Digital Repository (BCDR)159 | Portugal | NA | SF-MG, FFDM | 1010, 724 | 3703, 3612
The Cancer Genome Atlas Breast Invasive Carcinoma (TCGA-BRCA)160 | USA | NA | MRI, MG | 139 | 230167

Table 2-1 – Datasets publicly and privately available for breast imaging. FFDM: Full Field Digital Mammography, MG: Mammography, MRI: Magnetic Resonance Imaging, NA: Not Available, SF: Screen Film.

The performance of an algorithm can be compared against two outcomes, 1) the ground truth and 2) the radiologist’s performance146,147. The ground truth or “gold standard” is seen as the ‘absolute’ outcome of a case (for example cancer or no cancer) but variations of the ground truth between healthcare systems occur due to differences in standard of care guidelines, histopathology reporting criteria, imaging procedures conducted (e.g. use of Magnetic Resonance Imaging (MRI) versus ultrasound) and screening frequency (e.g. range from 12-36 months). The radiologist’s performance sets a “clinically relevant threshold” for AI performance to be compared against and is essential to understand the potential impact of using such systems in real-time workflows (for example double reading in the UK breast screening programme)148,161,162.
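One way to make a reader-derived “clinically relevant threshold” concrete is to fix the algorithm's operating point at the reader's specificity and then compare sensitivities. The sketch below is illustrative only: the scores, the reader figures and the function names (`threshold_at_specificity`, `sensitivity_at`) are hypothetical, and a case is assumed to be recalled when its score is at or above the threshold.

```python
def threshold_at_specificity(normal_scores, target_specificity):
    """Return the lowest threshold whose specificity meets the target.

    Assumption: a case is recalled when score >= threshold, so the
    specificity is the share of normal cases scoring below the threshold.
    """
    candidates = sorted(set(normal_scores)) + [float("inf")]
    for t in candidates:
        spec = sum(1 for s in normal_scores if s < t) / len(normal_scores)
        if spec >= target_specificity:
            return t
    return float("inf")


def sensitivity_at(cancer_scores, threshold):
    """Share of cancer cases recalled at the given threshold."""
    return sum(1 for s in cancer_scores if s >= threshold) / len(cancer_scores)


# Hypothetical AI scores for normal and cancer cases.
normal_scores = [0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.4, 0.5, 0.6, 0.8]
cancer_scores = [0.95, 0.85, 0.7, 0.45]

# Hypothetical reader operating point used as the benchmark.
reader_specificity = 0.9
reader_sensitivity = 0.75

threshold = threshold_at_specificity(normal_scores, reader_specificity)
ai_sensitivity = sensitivity_at(cancer_scores, threshold)
print(threshold, ai_sensitivity, ai_sensitivity >= reader_sensitivity)
```

In this toy example the algorithm, matched to the reader's specificity, does not reach the reader's sensitivity, so it would not meet the benchmark. The same comparison can be run the other way round by matching on sensitivity and comparing specificity.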
However, in screening when using the radiologist’s assessment as the gold standard, there is potential to introduce bias in favour of the radiologist, where only those patients recalled by the radiologist can be diagnosed by the AI. When trying to prove the superior performance of AI compared to radiologists, interval cancers need to be included in testing sets. Experienced radiologists’ reports should also be included to allow for the comparison against representative programme reader performance, and not just prove that the AI is superior to average or non-specialist performance. Algorithms need to meet or exceed these thresholds in order to show a potential benefit before their adoption into healthcare systems is considered.

2.2.2 Prospective evaluation

Whilst testing on retrospective datasets provides a ‘snapshot’ of possible performance, the nuances of medical pathways cannot be underestimated. Prospective testing in real-time is essential to fully understand the influence of AI on human performance and the interaction between the two142. There are few prospective studies on the use of AI in radiology (Table 2-2), with a recent systematic review only reporting one randomised trial registration and two non-randomised prospective studies in radiology163.

AI | Country | Imaging modality | Stage of care pathway | Estimated completion | Trial ID (ClinicalTrials.gov)
Samsung (Seoul, South Korea) S-Detect™ | China | Ultrasound | Diagnosis | February 2020 | NCT03887598
Unknown | China | Mammography | Detection & Diagnosis | November 2020 | NCT03708978
Unknown | Russia | Mammography (+ others) | Detection | December 2020 | NCT04489992
Unknown | China | ABUS | Screening | August 2025 | NCT04527510
Kheiron (London, UK) Mia™ | UK | Mammography | Screening | Unknown | Unknown – part of the AI Award144,164

Table 2-2 – Prospective studies for the use of artificial intelligence (AI) in breast imaging. ABUS: Automated Breast Ultrasound.
To ensure the clarity of reporting results from these studies, pre-existing reporting standards have been adapted and include the Consolidated Standards of Reporting Trials-AI (CONSORT-AI), Standard Protocol Items: Recommendations for Interventional Trials-AI (SPIRIT-AI) and the Checklist for Artificial Intelligence in Medical Imaging (CLAIM)165–167. The Transparent Reporting of a Multivariable Prediction Model for Individual Prognosis or Diagnosis-Machine Learning (TRIPOD-ML) and Standards for Reporting Diagnostic Accuracy Studies–AI (STARD-AI) are also currently under development (Table 2-3)168,169.

Guideline | Publication date | Application | Number of items | Link
CONSORT-AI165 | 2020 | Randomised trials | 25 original, 14 new | https://www.equator-network.org/reporting-guidelines/consort-artificial-intelligence/
SPIRIT-AI166 | 2020 | Clinical trial protocols | 51 original, 15 new | https://www.equator-network.org/reporting-guidelines/spirit-artificial-intelligence/
CLAIM167 | 2020 | AI studies in radiology | 42 | https://pubs.rsna.org/doi/full/10.1148/ryai.2020200029
TRIPOD-ML168 | Pending | Clinical prediction model evaluation | - | https://www.tripod-statement.org
STARD-AI169 | Pending | Diagnostic accuracy studies | - | -

Table 2-3 – Reporting criteria adapted for artificial intelligence (AI) studies.

Performance of AI is often measured in terms of sensitivity, specificity, area under the receiver operating characteristic curve (AUROC) and computation time (time taken to process data). Where AI is used by a radiologist, the effect on performance is measured in the same way (sensitivity, specificity, and AUROC) with the additional measure of reading time by the radiologist. The AUROC provides a summary estimate of diagnostic accuracy, taking into account both the sensitivity and specificity to demonstrate how well the algorithm can differentiate between cancer or not cancer across all thresholds170. It provides a measure between 0 and 1, where a higher score means a better classification146.
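The measures described above can be computed directly from algorithm scores and ground truth labels. The following is a minimal sketch on made-up data: the scores, labels and the 0.5 threshold are hypothetical, and the AUROC is calculated using its rank interpretation, that is, the probability that a randomly chosen cancer case receives a higher score than a randomly chosen normal case (ties counted as one half).

```python
def sensitivity_specificity(scores, labels, threshold):
    """Sensitivity and specificity when score >= threshold counts as positive."""
    tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= threshold)
    fn = sum(1 for s, y in zip(scores, labels) if y == 1 and s < threshold)
    tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < threshold)
    fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)


def auroc(scores, labels):
    """AUROC as the share of (cancer, normal) pairs ranked correctly."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties contribute half a correct ranking
    return wins / (len(pos) * len(neg))


# Hypothetical scores: label 1 = cancer, label 0 = normal.
scores = [0.9, 0.8, 0.35, 0.7, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 1, 0, 0, 0, 0, 0]

sens, spec = sensitivity_specificity(scores, labels, threshold=0.5)
area = auroc(scores, labels)
print(f"sensitivity={sens:.3f} specificity={spec:.3f} AUROC={area:.3f}")
```

The pairwise loop is quadratic in the number of cases and is written for readability; for datasets of screening size a rank-based implementation from an established statistics library would be the more practical choice.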
However, the AUROC is subject to certain pitfalls. It is not “intuitive” to interpret clinically, and theoretically algorithms with different sensitivities and specificities can have the same AUROC170. Therefore, alternative measures such as “net benefit” have been proposed as well as routine reporting of sensitivity and specificity, which allow for direct clinical comparison170. Lastly, for both the algorithm alone and when used by the clinician, the effect on nationally reported standards (e.g. cancer detection rate, recall rates, tumour size and lymph node status) should be evaluated as part of prospective studies146.

2.2.3 Key considerations for clinical evaluation

Screening AI systems could be cost-effective by improving early detection of important “killer” cancers (higher grade), potentially improving long term survival. However, the substantial investment in AI development, IT infrastructure, and continuous monitoring needs to be costed; therefore cost-effectiveness requires careful evaluation147,171. The ease of integrating AI into pre-existing hospital systems, such as radiology information systems and Picture Archiving and Communications Systems (PACS), health-records and administrative systems, is another key consideration162,171. Wider measures to include in clinical evaluation are patient acceptability and effect on uptake of screening programmes as well as the training required for radiologists to be able to use and interpret AI tools54. Continuous monitoring to ensure adherence to national standards needs to be in place to observe both static and adaptive (‘learning on the fly’) AI when used in real-time workflows (see Ethical and legal constraints – Algorithm level for more information)162,172. Each hospital could have an infrastructure to evaluate and monitor algorithms, but this is unlikely to be feasible in many hospitals due to the data storage requirements and lack of technical expertise and resources to set up such an environment.
A centralised testing system at designated centres, using pre-set national standard thresholds for different AI algorithm applications, would be a more sustainable approach. As outlined above, the steps in the evaluation pathway of AI are clear, requiring retrospective, prospective and continuous real-time testing. However, the caveats of testing such as how to access suitable datasets and defining “clinically relevant thresholds” still need to be agreed. In the UK, NHSX has set up ‘AI Labs’ to begin conducting centralised and standardised testing procedures173,174.

2.3 The breast imaging pathway and AI

2.3.1 Screening

AI has been used in radiology since the 1990s with initial CADe tools in mammographic screening prompting readers to look again at areas of concern in the image128. More recent AI systems can now meet and exceed the performance of radiologists for stand-alone cancer detection in screening mammography, achieving a sensitivity from 0.562 to 0.819 with a specificity of 0.843 to 0.966 (set at first reader specificity)138,149. However, this is not the case for all national screening programmes137. In a retrospective international crowdsource competition, the performance of multiple algorithms was compared on a standardised test set from Sweden. An ensemble algorithm was built by concatenating the eight best individual performing algorithms, which was shown to outperform the top single algorithm, but not the clinicians’ performance137. In the UK 2.2 million mammograms are taken each year and read by two radiologists, putting a high demand on an already stretched workforce54,140. The majority of screening mammograms are normal54,175. A more efficient method is sought whilst maintaining current cancer detection and recall standards. AI can now reliably triage ‘normal’ mammograms (47% to 60%), which would mean that these would not need to be reviewed by two or possibly even one radiologist134,135.
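The trade-off in a rule-out triage task can be sketched as follows: for a chosen threshold, the proportion of all cases removed from the reading list is reported alongside the proportion of cancers that would fall below the threshold and be missed. The data, the thresholds and the function name `rule_out_summary` are hypothetical and do not correspond to any of the cited systems.

```python
def rule_out_summary(scores, labels, threshold):
    """Summarise a rule-out triage threshold.

    Assumption: cases scoring below the threshold are labelled 'normal'
    and removed from human reading; label 1 = cancer, label 0 = normal.
    """
    ruled_out = [y for s, y in zip(scores, labels) if s < threshold]
    n_cancers = sum(labels)
    return {
        "proportion_triaged": len(ruled_out) / len(scores),
        "cancers_missed": sum(ruled_out) / n_cancers,
    }


# Hypothetical AI scores for ten screening cases, two of which are cancers.
scores = [0.02, 0.05, 0.08, 0.1, 0.15, 0.3, 0.45, 0.6, 0.7, 0.9]
labels = [0, 0, 0, 0, 0, 0, 0, 1, 0, 1]

conservative = rule_out_summary(scores, labels, threshold=0.2)
aggressive = rule_out_summary(scores, labels, threshold=0.65)
print(conservative)
print(aggressive)
```

In this toy example the lower threshold removes half of the cases without missing a cancer, whereas the higher threshold removes more cases at the cost of missing one of the two cancers, which is the kind of balance a triage operating point has to strike.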
Whilst estimated to only miss up to 7% of cancers, the CADt algorithms could drastically improve the efficiency of breast screening. However, questions remain around what an acceptable miss rate would be for algorithms when used in routine screening. AI tools previously used for mammography have been adapted for other screen imaging techniques such as Digital Breast Tomosynthesis (DBT), which has longer reading times that can be decreased by around 50% using AI176. MRI is used for the screening of high-risk women, particularly those with a familial risk of breast cancer or BRCA1/BRCA2 carriers. Deep learning algorithms can find visual patterns in images and have been used to detect and diagnose breast cancer to produce a fully automated MRI AI-CAD system177–179.

2.3.2 Risk stratification

Screening can be tailored according to a woman’s breast cancer risk. Risk factors for developing breast cancer include breast density, family history, lifestyle factors (e.g., alcohol and smoking), genetic mutations, hormone exposure and expression180,181. Breast cancers can also go undetected due to dense breast tissue obscuring the view of a cancer on a mammogram, called ‘Masking’68. AI density measures can provide quantitative scores or category scores such as BI-RADS, which can provide a more consistent interpretation than a radiologist68,182. It may be possible for the latest density tools to detect women who are at the highest risk of ‘masking’ and more likely to develop a cancer that could progress to later stage disease68. Automated breast density can also be incorporated into existing prediction models (BOADICEA and Tyrer-Cuzick) to improve performance and assist in the implementation of targeted screening as well as the use of supplemental imaging182. The ‘Measurement Challenge’ aims to compare automated density measures which have been shown to overcome the inconsistencies in human reporting as well as being able to predict breast cancer risk183.
2.3.3 Monitoring and prognostication

MRI is routinely used in the monitoring of response to neoadjuvant chemotherapy, with patients imaged before, during, and after treatment. Deep learning algorithms have been implemented to evaluate pathological complete response to chemotherapy using post-treatment MRI with an AUROC of 0.98184, which could affect the extent of post-treatment surgery, or potentially reduce the need for surgical excision at all. A number of studies have used deep learning to identify features from pre-treatment MRI that are predictive of response in an unsupervised fashion185–187. Early prediction of response to different types of chemotherapy could avoid unnecessary toxicity and cost from ineffective treatment as well as enable a more personalised approach to treatment. AI has also been used in prognostication to predict recurrence (Oncotype DX recurrence score) from MRI188. However, given the moderate accuracy of these techniques (0.77-0.93), further work is required before their integration into clinical practice.

The evidence base for the performance and possible applications of AI to breast imaging is rapidly evolving. Systems acting as stand-alone readers show promise in decreasing workload, whilst systems to predict treatment response could guide tailored treatment strategies. In addition, systems to identify those at greatest risk of a cancer being missed or developing cancer may aid in the application of a targeted screening approach.

2.4 Ethical and legal constraints

2.4.1 Guidance level

The Department for Health and Social Care, and international collaborations such as the Global Partnership on Artificial Intelligence, have developed guidance for implementing digital technology including AI189.
They highlight the need for oversight and continued patient involvement to guide the development of “human-centric” AI which is essential to maintain the trust of the public, and avoid a repeat of previous controversies such as inappropriate data sharing190–192.

2.4.2 Algorithm level

There is a danger of innate latent bias built into certain systems, especially if these have been developed on datasets that underrepresent certain populations (with a lack of diversity in age, ethnicity and socioeconomic background) and therefore lack the ability to generalise193. This could be further compounded by the limited diversity within the scientific workforce itself which underrepresents the “interests and needs of the population as a whole”194. Outcomes based on pre-existing inequalities could be exacerbated by the skewed outcome being fed back into the algorithm, creating negative reinforcement, thus limiting the fairness of an algorithm193. This can lead to algorithmic decisions that amplify discrimination and health inequalities. The data used in testing should therefore encompass a representative relevant population and the components of the dataset used should be explicitly reported alongside the results. A recent paper provides an example of such documentation, where an AI-CAD mammography algorithm trained on data from South Korea, USA and UK primarily using data from GE machines, achieved the best performance compared to other algorithms (sensitivity (81.9%) at the reader specificity (96.6%)), when tested on data from Sweden on only Hologic machines, demonstrating generalisability149. Algorithms that have the ability to ‘learn on the fly’ may also become more biased over time due to ‘performance drift’, thus potentially limiting their generalisability172,194.
‘Learning on the fly’ could potentially be beneficial to adjust algorithms to the local systems in which they are being used but this will also require close observation through regular audits to monitor for detrimental ‘performance drift’147,162. Transparency around how an algorithm reaches a decision, its architecture and source code availability is often limited by intellectual property clauses to protect proprietary information174. The opaqueness of an algorithm’s deduction can be clarified by using saliency maps, which highlight (e.g. heatmap) the part of the image which the algorithm has used to make its decision, ensuring that the algorithm is using the correct part of the image to make its clinical deduction and not “noise” in the image such as a clip, artifact or label195. Initial checks built into the algorithm, ensuring the image is of sufficient technical quality from which to deduce an interpretation similar to the checks performed by radiologists, are also an important step for robust interpretation. A reliable algorithm providing consistent, clear and reproducible results, so as not to cause ambiguity in decision making, is key to improving confidence in these systems.

2.4.3 Who controls the data?

In the UK there is an understanding that NHS Trusts will govern, control and use patient data in an anonymised format to conduct research for patient benefit143,196. There is also an understanding that patient data will be protected and overseen by Information Governance teams at NHS Trusts144,173. Extracting data from the fragmented silos of the NHS remains a challenging task due to the lack of interoperability between systems197. Data relating to an individual’s health is defined as ‘special category’ data and requires additional procedures and safeguards including data minimisation, proportionality, and necessity198–200. Data from which an individual can be recognised is termed Personal Identifiable Data (PID).
This data is often pseudonymised or de-identified for healthcare research to remove identifiers and replace them with a new random identifier (e.g., Trial ID), ensuring privacy is upheld201. Where consent from individuals for data use cannot be feasibly obtained, provisions are in place to obtain access to PID in order to create large datasets202. Regulation has emphasised the importance of Patient and Public Involvement (PPI) when using patient data for research, especially in the context of unconsented data use202. Feedback provided by PPI can be used to enhance the communication between the public and healthcare sector, particularly around the distribution of a data notification and objection mechanism174,202. Studies carried out by organisations such as the Wellcome Trust show that the public acknowledge a lack of understanding and hesitancy regarding the uses of health data, particularly when data is shared with and accessed by commercial companies203. National data opt-outs, proposed as part of the Caldicott Review (2016), give patients the option for their data to not be processed204. Recently the National Data Guardian opened a consultation to revisit the seven Caldicott principles that guide the use of PID and to ensure that public ‘expectations’ are considered when using confidential information205. However, additional steps need to be taken to inform and educate the public around data use in healthcare so they can be empowered to explore these options.

The expected economic trade-off within the NHS in terms of financial payment, shareholding position or fees for product procurement should be outlined as part of national policies, allowing the potential benefits from sharing valuable NHS data when collaborating with the commercial sector to be realised147,189. It is important to ensure this benefit is fairly distributed across the whole of the NHS to avoid widening gaps in available resources at different Trusts145,197.
Linked data across multiple fields such as imaging, genetic and clinical records are of increasing importance for the development of risk prediction models for both prognosis and treatment response. Higher accuracy has been achieved by algorithms when multiple data types are used in training to provide ‘rich’ risk factor information206. Conversely, an understanding of how much data is too much data is required. For example, linking genetics, demographics, home monitoring and smart watch data may mean data is no longer de-identified. In addition, it must be understood that even data collected in large quantities may still be unrepresentative due to the lack of access to healthcare and ability to participate in research for different populations194.

Data provenance, whilst currently not at the forefront of discussions, could become an increasingly tangled web to unwind. Individual Trust data that is currently being used for training algorithms could at the same time be incorporated into the development of centralised evaluation datasets, resulting in a concealed overlap. The ability to track data back to the source and see all of its uses since it left the source via a flag-based system is needed. However, such systems do not currently exist and would not be easy to integrate, let alone to apply to data which has already been processed.

2.4.4 Clinical level

Clinical acumen must not be lost. AI and clinicians must work in tandem so that if one system fails (e.g. AI) the safety-net of the other system (e.g. radiologists) is in place to avoid harm. However, when AI systems operate alongside clinicians there is a possibility of the clinician becoming over-dependent and of automation bias occurring145,193. In addition, radiologists might become distracted by prompts from AI, increasing reading time and potentially adversely affecting reader performance125.
Where these systems are designed to act independently, human supervision via ‘pit-stop’ analysis of a select cohort of patients, in an audit-like fashion, is essential in order to maintain patient safety. The logging and reporting of errors is a potential area of AI automation, where the human oversight required for the monitoring of AI will necessitate vast amounts of time and resources. Nonetheless, in time automation might replace certain aspects of entire jobs. This is juxtaposed against the creation of jobs in the field of healthcare informatics, to create datasets and facilitate the incorporation of AI into hospitals174,191. A potential overarching benefit from automation could be that more time is freed up for clinician interaction with patients and interventions such as image guided biopsies.

A broader question exists around notifying patients when AI is used in making diagnostic and treatment decisions. Will a patient feel worse if a cancer is missed by an AI tool compared to a human reader? Another consideration is that in certain healthcare systems the prediction of cancer risk could impact patient insurance policies as well as patient mental health by causing anxiety. Therefore, prior to calculations such as the risk of developing a disease, should the patient have to approve this analysis following counselling by a healthcare professional, similar to procedures currently provided for genetic testing? Overall, these ethical and legal dilemmas should not be underestimated and the provision of guidance from national agencies to tackle these, taking into account views from patients, commercial companies and clinicians, is essential.

2.5 Practical challenges and limitations

2.5.1 Technical level

Whilst the NHS has state-of-the-art scanners and treatments, it is also still reliant on certain record systems that are paper-based.
Thus, technological advancement is a pivotal challenge facing the NHS to allow for the integration of new technology and the flexibility for exporting data on a mass scale207. Modifications to IT capabilities and digitisation of records are vital and should allow for communication and coordination between Trusts207,208. The NHS is also a tightly sealed system; however, companies will need access to update and modify their algorithms. Conversely, caution is needed when opening up systems due to the increased vulnerability to “cyber-attacks”209. How this external access is overseen and governed is a current technical and logistical challenge. While the majority of data processing within the NHS at present occurs onsite, ‘big data’ processing for image analysis requires the procurement of Graphics Processing Units (GPUs) at Trusts or within cloud-based systems, which may entail the processing of data offsite207. In addition, capacity for larger data storage is needed for the curation of datasets and the storage of additional image analysis provided by algorithms. A lack of clarity still exists around suitable environments and encryption for data storage as well as the level of de-identification required. When de-identifying imaging data it is necessary to retain data that is essential for image viewing, such as the private Digital Imaging and Communications in Medicine (DICOM) tags, whilst ensuring all PID is removed210. As imaging becomes more advanced it is important to ensure that patients cannot be re-identified via the possibilities of image reconstruction, such as reconstructing facial features from Computed Tomography (CT) or MRI head scans.

2.5.2 Clinical level

A new multidisciplinary team will need to be developed and trained including clinical scientists and informaticians to work with clinicians to incorporate AI analysis into care decisions143,211.
Advancing and generating new technical expertise will require access to training programmes and retention of highly skilled staff who currently relocate to industry174,212. Programmes such as the NHS Digital Academy are designed to upskill healthcare professionals in areas of digital health as well as leadership and management as part of a national learning programme143,211. The training of radiologists is also set to change with the recent incorporation of AI into the national curriculum213. An openness from commercial companies to disclose the limitations of their algorithms, and training for radiologists in how to interpret these, are vital145,194. The use of AI itself to train radiologists or even provide continuous performance monitoring of radiologists are possibilities that need further exploration. Conversely, whether the adoption of such technology will require radiologists to reach a higher level of performance to keep ahead of AI is subject to ongoing speculation.

2.5.3 Governance level

Worldwide, healthcare systems are moving forward at great pace to try to utilise this technology, with national funding efforts to develop an AI healthcare ‘ecosystem’. In the UK, this has been facilitated by collaborations from the Accelerated Access Collaborative and NHSX with the formation of the NHS AI labs173,174. The same two bodies have also partnered with the NIHR (National Institute for Health Research) for the provision of an AI Award, to spur investment into promising commercial companies164. The recently published NHSX ‘Buyers Guide’ provides a much-needed resource for Trusts when procuring AI technology147. A proposed checklist also published alongside the buyer’s guide gives Trusts a procedure to help ensure vital steps of due diligence are taken, such as setting up insurance cover. However, the evidence base for the overall cost benefit of implementing such systems is limited, and more robust evidence is needed to ensure systems are cost-effective.
The legal accountability of algorithms has been at the forefront of healthcare professionals’ questions, as no clear guidance has been produced189. Discussions around the use of AI alongside a radiologist point towards the ultimate responsibility lying with the clinicians, but no specifics have been detailed as to how this would fit with NHS indemnity144,145. For both clinical decision support systems working alongside the radiologist and independent stand-alone systems, further guidance as to the accountability of the companies who developed the algorithm and NHS Trusts using the AI is needed. Reviews of “accidents” and “near misses” arising from the use of AI should be included in department discrepancy meetings. How this is then fed back to companies, to facilitate algorithm improvement, needs to be thought through before such events occur.

2.6 Conclusion

There are many steps to be taken by an array of national agencies, professional bodies and individual NHS Trusts before AI will become commonplace in breast oncological imaging to help mitigate the growing pressures facing radiology. Whilst promise is shown with algorithms across a range of imaging modalities reaching and in certain cases exceeding human performance, and even performing tasks not feasible for an individual, independent prospective testing against national benchmarks is needed. Technical integration and upskilling of the healthcare workforce are essential for AI adoption. The different ethical and legal dilemmas at the algorithm, data and clinical level should continue to be discussed and guidance updated for healthcare professionals to follow. Further research is needed not only to understand the health economic implications and the testing required to ensure that systems meet the required performance thresholds, but also to ensure that latent bias is avoided. Lastly, the legal accountability should be clearly stated for companies and healthcare professionals when using such systems.
Chapter 3 – Machine learning for workflow applications in screening mammography: systematic review and meta-analysis

Advances in computer processing and improvements in data availability have led to the development of machine-learning (ML) techniques for mammographic imaging. This chapter systematically evaluates the literature for the performance of stand-alone ML applications for the screening mammography workflow. Retrospective studies demonstrate that the performance of stand-alone ML applications in screening mammography can reach reader performance and provide a mechanism for case triage, which merits investigation with prospective studies. Contents of this chapter have been published in Radiology133 and presented at the European Congress of Radiology 2021 [abstract number - #C-14869].

3.1 Introduction

There are now more than five Food and Drug Administration approved algorithms for mammographic interpretation, primarily to be used as clinical decision support systems214. Research has demonstrated that these machine-learning (ML) computer-aided detection (CAD) algorithms can reach and even exceed clinician performance, providing an independent definitive output (case level decision) on 2D standard-view mammography (mediolateral oblique and craniocaudal) data112,215, Figure 3-1. This could allow for ML stand-alone computer-aided detection (CADe) and computer-aided diagnosis (CADx), or, when ML algorithms are set at a high sensitivity, for the automated case-based computer-aided triage (CADt) of mammograms within the screen reading workflow216.

Figure 3-1 – Multi-time (left) and multi-view (right) point data that are produced by 2D standard-view mammography and can be analysed at different levels.
Many countries have implemented breast screening to detect cancer at an earlier stage, albeit with differing screening processes, such as single reading in the USA and double reading in many European countries, with screening starting at varied ages (40-50 years) and differing intervals between screening (annual, biennial and triennial)45,74,175,217. Mammography remains the most common imaging modality used, although its cost-effectiveness is debated due to false-positive findings, overdiagnosis, and false-negative findings (interval cancers)36,218. Human readers (for example radiologists and reporting radiographers in the UK) are under increasing pressure due to rising workloads, demands from busy clinics, strict screening programme targets as well as staff shortages140. Alternatives to double reading of mammograms are being sought to further alleviate pressure, including single reading using CAD prompts, stand-alone ML algorithms with a second reader or CADt triage with various reader combinations215. Studies investigating the use of traditional CAD mammography systems demonstrated no significant improvement in reader performance and, although sensitivity was similar to that of double reading, given the increase in recall rates these systems were deemed not cost-effective125,128. Additional limitations of traditional CAD systems include high rates of false-positive prompts, limited reproducibility of prompts, increased reading time, as well as a CAD preference for calcification detection over soft-tissue masses and architectural distortion219,220. Traditional CAD systems were trained using handcrafted features extracted from human delineations. The latest ML methods can use pre-trained deep learning networks and automatically delineated cancer regions via iterative interactive software to rely upon learned features, and have the potential to overcome the limitations of traditional CAD systems.
However, how these new ML systems should be used in real-time workflows is still unclear. One route could be to improve efficiency of the workflow by operating as stand-alone systems. Although the performance expected of such stand-alone ML applications in a screening workflow is yet to be agreed upon, a system should meet a “clinically relevant threshold”161. In general, recall rates should not be increased due to the huge impact on workload, thus algorithms with lower specificity would require human intervention to reduce recalls139,161. Therefore, making a definitive decision on whether current systems reach the standard required for routine workflow use is challenging. We conducted a systematic review and meta-analysis to investigate whether or not ML algorithms (CADe and CADx) are as sensitive and specific as radiologists in detecting breast cancer in subjects undergoing screening mammography. In addition, we evaluated the application of stand-alone ML algorithms (CADt) used in breast cancer screening for mammography interpretation and the impact of ML algorithms if adopted into clinical practice. Furthermore, we aimed to identify areas of bias and gaps in the reported evidence. Appendix 1 contains a glossary of terms.

3.2 Materials and methods

This systematic review and meta-analysis was reported in accordance with the Preferred Reporting Items for Systematic Reviews and Meta-analysis for Diagnostic Test Accuracy (PRISMA-DTA) guidance221. The review protocol was registered on PROSPERO (CRD42019156016), Appendix 2.

3.2.1 Literature search

Digital literature databases including Ovid-EMBASE, Ovid-MEDLINE, Scopus, Web of Science, and CENTRAL were searched from January 2012 to September 2020, with the final search conducted on September 3, 2020. This period was chosen to include the advancements in ML algorithms for medical image interpretation and increased mammographic data availability142,215.
Hand searches of included article references, a grey literature search of computer science databases (DBLP computer science bibliography, ACM Digital Library, and IEEE Xplore Digital Library), and a search of a pre-print literature database (arXiv) were also conducted for the same time period. The search strategy is detailed in Appendix 3.

3.2.2 Study selection

To limit bias, all publication types and all study designs were included, with no language restriction or dataset age limit applied. Eligibility criteria included women imaged using mammography for screening or diagnosis of breast cancer and an ML algorithm applied as a stand-alone workflow application (CADe and CADx or CADt), with sufficient information reported on the performance of stand-alone ML algorithms and reader performance, or on the simulated impact on reader performance and workflow, to allow for comparison. Any ground truth (e.g., histopathology) was accepted. Because data are available at multiple levels, Figure 3-1, we included algorithms only if they provided an interpretation at the case or exam level to enable comparison with clinician performance as reported in screening programmes. Two independent reviewers undertook the initial title and abstract screening (SEH, a physician with 2 years’ experience, and one of EPVL, CL or YRI, medical students), with discordance arbitration by a third reviewer (EPVL, CL or YRI). Independent full text review was then undertaken by two reviewers (SEH and RW, a radiologist with 11 years’ experience), with discordance arbitration by a third reviewer (FJG, a senior radiologist with over 30 years’ experience).

3.2.3 Data extraction

A pre-designed data-extraction spreadsheet was used by the reviewers (SEH and RW) and checked by a third reviewer (AIAR, a computer scientist with 4 years’ experience), Appendix 4. Results were only extracted for studies where algorithm performance was compared to readers or the impact on reader workflow and performance was reported.
If studies reported multiple stand-alone algorithms, results for all algorithms were extracted.

3.2.4 Meta-analysis

For the meta-analysis, CADe and CADx algorithm performance was evaluated by adapting the method described in Liu et al142. The primary meta-analysis compared the best performing algorithm of each study, at the test stage using screening mammography data, with the performance of readers. Details of the primary meta-analysis study selection are available in Appendix 5. The secondary meta-analysis extended the primary meta-analysis and compared the performance of all reported algorithms and readers in all stand-alone CADe and CADx studies which used external datasets (for addressing the generalisation capabilities of the techniques), with no limitations of ground truth.

3.2.5 Quality assessment

Risk of bias and quality assessment of all included studies took place using Quality Assessment of Diagnostic Accuracy Studies-2 (QUADAS-2)222,223 and Prediction model Risk Of Bias ASsessment Tool (PROBAST)224 by two reviewers (SEH and RW), with discussion between reviewers to resolve discordance. Signalling questions for QUADAS-2 were adapted for ML studies. PROBAST questions were adapted using the technique in Nagendran et al163. However, as our review focused on mammography ML, applicability was assessed in all fields except the predictor field. The Checklist for Artificial Intelligence in Medical Imaging (CLAIM)167 guide was used by two reviewers (SEH and AIAR), with discussion between reviewers to resolve discordance. An overall reporting score for all parameters was generated as well as for eight key fields identified, and common areas under-reported were documented.

3.2.6 Statistical analysis

All statistical analyses were implemented in R (version 4.0.3; R Project for Statistical Computing, Vienna, Austria)225 using the ‘mada’226 and ‘boot’227 packages.
Normal and benign exams were combined and 2x2 contingency tables were created by calculating true-positive, true-negative, false-positive, and false-negative findings from the reported dataset characteristics and the sensitivity and specificity provided, ensuring there was an integer (whole) number of cases. The heterogeneity of the included studies in the quantitative analysis was measured using the I² statistic and Cochran’s Q test, where high heterogeneity was defined by I² > 50% and p < 0.05 for Cochran’s Q test. The estimated pooled sensitivity, specificity, and area under the receiver operating characteristic curve (AUROC) were calculated for both readers and ML algorithms using a bivariate random effects model by Reitsma et al228 with 95.0% CIs. Bootstrapping with 100 iterations was used to generate 95.0% CI for AUROC and a t-test was used to compare the ML algorithm and reader sensitivity and specificity, with a p-value < 0.05 indicating a significant difference. Summary receiver operating characteristic plots were constructed for both primary and secondary analyses for pooled reader and ML algorithm performance.

3.3 Results

3.3.1 Study selection and data extraction

A PRISMA diagram, Figure 3-2, demonstrates the study inclusion process. The search of electronic literature databases and computer science databases returned 7629 records. Removal of duplicates resulted in 4318 records. After screening of titles and abstracts, 4286 records were excluded; the remaining 32 full texts were reviewed, and 14 articles were included in the qualitative review. References of included studies can be found in Appendix 6. Of the 14 included articles, 8 studies reported stand-alone CADe and CADx algorithm performance and 7 studies reported the use of a CADt system; 1 article reported on both stand-alone CADe and CADx and CADt.
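As an illustration of the contingency-table step described in Section 3.2.6, the following is a minimal Python sketch of how whole-number 2x2 counts might be rebuilt from a reported sensitivity, specificity and dataset composition. It is not the R code used for this review: the function name `reconstruct_two_by_two`, the nearest-integer rounding rule and the example inputs are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TwoByTwo:
    tp: int  # cancers called positive
    fn: int  # cancers called negative
    fp: int  # non-cancers called positive
    tn: int  # non-cancers called negative


def reconstruct_two_by_two(n_cancer: int, n_non_cancer: int,
                           sensitivity: float, specificity: float) -> TwoByTwo:
    """Rebuild integer 2x2 counts from reported accuracy figures.

    n_non_cancer is assumed to already combine normal and benign exams.
    Counts are rounded to the nearest whole case (an assumption of this sketch).
    """
    if not (0.0 <= sensitivity <= 1.0 and 0.0 <= specificity <= 1.0):
        raise ValueError("sensitivity and specificity must be proportions in [0, 1]")
    tp = round(sensitivity * n_cancer)
    tn = round(specificity * n_non_cancer)
    # The remaining cells follow from the fixed row totals.
    return TwoByTwo(tp=tp, fn=n_cancer - tp, fp=n_non_cancer - tn, tn=tn)


# Made-up inputs for illustration; not taken from any included study.
table = reconstruct_two_by_two(n_cancer=200, n_non_cancer=800,
                               sensitivity=0.80, specificity=0.90)
print(table)
```

Deriving the false-negative and false-positive cells by subtraction keeps the row totals equal to the reported number of cancer and non-cancer cases. Python's `round` resolves exact halves to the nearest even integer, so a different tie-breaking rule may be preferred in practice.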
Five studies for stand-alone CADe and CADx provided enough information to be included in the primary meta-analysis and 6 studies for the secondary meta-analysis (algorithm [n = 17] and reader [n = 15]). The included articles were published between 2017 and 2020, with 3/14 (21%) articles published on a pre-print platform (arXiv). A total of 16 algorithms including 12 unique algorithms were included in this review, with 2 algorithms reported multiple times using different versions. All included studies were conducted retrospectively. Generalisability was demonstrated in 4 studies where algorithms were tested on datasets from a different country to the training dataset. All datasets used for reader comparison testing were private. Of the 14 articles, 8 (57%) evaluated algorithms on external datasets only, with a further 2 (14%) using both internal and external datasets. Cancer prevalence within testing datasets varied from 0.6% to 50.0% and the total testing dataset size ranged from 240 exams to 113,663 cases (*cohort size simulated using bootstrapping). The comparator of readers ranged in number (4-101), experience (1-44 years), and specialisation (general or breast) for all studies. The algorithm code was available in 9/14 (64%) articles. This included algorithms that were commercially available in 6/14 (43%) articles or where code was available on a public repository in 3/14 (21%) articles. Commonly used architectures included ResNet, RetinaNet and MobileNet, which are all types of convolutional neural network.

Independent CADt studies reported that between 17%-91% of normal mammograms could be identified, while missing 0%-7% of cancers, Tables 3-1 and 3-2. For CADe and CADx tasks, 8 studies reported the algorithms' AUROCs between 0.69 and 0.96, Tables 3-3 and 3-4.

Figure 3-2 – Preferred Reporting Items for Systematic Reviews and Meta-analysis for Diagnostic Test Accuracy (PRISMA-DTA) flow diagram.
For studies included in the identification, de-duplication, screening, and data-extraction stages of this review. CADe: computer-aided detection, CADx: computer-aided diagnosis, CADt: computer-aided triage, ML: Machine Learning, WOS: Web of Science. *Studies could have been excluded for multiple reasons.

Table 3-1 – Computer aided triage (CADt) algorithm details and results. Algorithm performance compared to reader performance for all included studies. a) screening mammograms, b) screening mammograms used from screening recalled cases c) screening and diagnostic mammograms. AUROC: Area Under the receiver operating characteristic curve, CV: Cross Validation, DL: Deep Learning, FN: False Negative, FNR: False Negative Rate, FP: False Positive, FPR: False Positive Rate, ML: Machine Learning, N: number, NPV: Negative Predictive Value, NA: information not available, PPV: Positive Predictive Value, TN: True Negative, TP: True Positive, v: Version. * caveat or another reported format.

Group | Reference | Journal | Machine Learning Technique | Task | Data Partition Level | Sample Size Development Images (Case) [Exams] | Retrospective / Prospective Testing | Internal / External Testing | Test Threshold | Test N Reader (Experience) | Test Reader Country Format | Test Validation Method | % Normal Triage (Work-load Reduction) | Evaluation (% Missed Cancers) | Code Available (Location)
a) | Yala 2019 (ref. 229) | Radiology | DL: ResNet18 | Triage normals | Patient Level | Total = 238 117 (63 852); Training = 212 276 (56 831); Validation = 25 841 (7 021) | Retrospective | Internal | “Minimum probability score of a radiologist TP assessment on validation set” | 23 (1 - 31 years) | USA single | Hold-out method | 19.3% | (1.1%); Sensitivity: 90.1% (172 of 191; 95% CI: 86.0%, 94.3%); Specificity: 94.2% (24 814 of 26 349; 95% CI: 94.0%, 94.6%) | GitHub (https://github.com/yala/OncoNet_public)
a) | McKinney 2020 (ref. 138) | Nature | DL: ensemble ResNet-(V2-50 and V1-50), MobileNetV2, RetinaNet | Triage normals | Patient Level | UK: Training = (13 918); Validation = (62 866); USA: Training = (55.0% of 22 225); Validation = (15.0% of 22 225) | Retrospective | *Internal / External | UK NPV 99.9%; USA NPV 99.9% | UK 51 (5 – 20+ years); USA (1 – 30 years) | UK double; USA single | Hold-out method | UK 41.0%; USA 35.0% | ML (vs Reader): ΔAUROC = +0.115 (CI 0.06-0.18, p < 0.001); FP reduction (5.7% and 1.2%); FN reduction (9.4% and 2.7%) (USA and UK) | NA
a) | Balta 2020 (ref. 230) | Proceedings of SPIE | *DL Unclear Architecture; Commercial System Transpara (v 1.6.0) | *Triage normals to single reading | Patient Level | *Unclear; the commercial system was directly used | Retrospective | *Internal / External | 7 | 6 | Germany double | External Validation | (32.6%) | (0.0%); ML decreased: recall rate 11.8% (p < 0.001); PPV 10.5% (p < 0.001) | Commercially available (https://screenpoint-medical.com/in-practice/)
a) | Dembrower 2020 (ref. 134) | Lancet Digital Health | *DL Unclear Architecture; Commercial System Lunit (v 5.5.0) | Triage normals | Patient Level | Training [170 230] | Retrospective | External | NA | NA | Sweden double | External Validation | > 60.0% | Missed cancer at 60.0%, 70.0%, 80.0%: (0.0%; 0.3% (CI 0.0-4.3); 2.6% (CI 1.1-5.4)) | Commercially available (https://www.lunit.io)
b) | Kyono [2] 2018 (ref. 231) | arXiv | DL: InceptionResNetV2, Multi-Task Learning | *Triage all cases | Patient Level | (Training 90.0%, Validation 10.0%) (100.0% = 7 162) | Retrospective | Internal | Least patients seen by radiologist without degrading radiologists FPR or FNR | *Detail provided in Kyono [1] 2019 | *Single multi-reader | Hold-out method | (42.8%) | Cohen's Kappa = 0.716; F1 - Score = 0.757; TP = 120; TN = 803; FP = 41; FN = 36 | NA
b) | Kyono [1] 2019 (ref. 232) | Journal of the American College of Radiology | DL: *Inception ResNetV2 Multi-Task Learning | Triage normals | Patient Level | *Unclear; Training = (5 060) + 8/10 fold training + 1/10 fold validation out of (2 000) | Retrospective | Internal | NPV > 99.0% | > 30 (> 2 years) | *Single multi-reader | 10 – fold CV | 34.0% (CI: 25.0%-43.0%); Low prevalence: 91.0% (CI: 88.0%-94.0%) | *NPV < 99.0% | NA
c) | Rodriguez-Ruiz [2] 2019 (ref. 135) | European Radiology | *DL Unclear Architecture; Commercial System Transpara (v 1.4.0) | Triage normals | Patient Level | *Unclear data partition for Training and Validation out of [189 000] | Retrospective | *Internal / External | 5; 2 | 101 (52.0% USA, 48.0% Europe); further detail provided in Rodriguez-Ruiz [1] 2019 | Single multi-reader | External validation | Threshold of 5 = 47.0%; Threshold of 2 = 17.0% | Threshold of 5 = (7.0%); Threshold of 2 = (1.0%) | Commercially available (https://screenpoint-medical.com/in-practice/)

Table 3-2 – Computer aided triage (CADt) test set data characteristics of all included studies. a) screening mammograms, b) screening mammograms used from screening recalled cases c) screening and diagnostic mammograms. CSAW: Cohort of Screen Aged Women, DBT: Digital Breast Tomosynthesis, FFDM: Full Field Digital Mammography, FHx: Family History, FU: Follow-up, HP: Histopathology, N: number, NA: information not available, OPTIMAM: OPTIMAM Medical Image Database, SD: Standard Deviation, SF: Screen Film, TOMMY: TOMosynthesis with digital MammographY. *caveat or another reported format.

Group | Reference | Test Database | Test Data Internal / External | Test Data Country | Test Data N Centres | Test Data Year of Studies | Test Data N Images (Cases) [Exams] | Test Data N Cancer Images (Cases) [Exams] | Test Data Vendor | Test Data SF / FFDM | Test Data Processed | Test Data Screen / Diagnostic Mammograms | Test Data Age of Patients | Test Data Density | Test Data Ground Truth
a) | Yala 2019 (ref. 229) | Private | Internal | USA | 1 | 2009 - 2016 | 26 540 (7 176) | 191 (187) (2.6%) | Hologic | FFDM | Processed | Screen | (mean 57.8 years) (SD ± 10.9) | Yes | HP / FU > 1 year
a) | McKinney 2020 (ref. 138) | OPTIMAM (Private) + Northwestern Memorial Hospital (Private) | Internal | UK, USA | UK 2; USA 1 | UK 2012 - 2015; USA 2001 - 2018 | UK: (*25 856); USA: (*3 097) | UK: (*414) (1.6%); USA: (*686) (22.2%) | Hologic, GE, Siemens | FFDM | Processed | Screen | NA | Yes *USA only | HP / FU > 1 year
a) | Balta 2020 (ref. 230) | Private | External* | Germany | 1 | 2018 | [17 895] | (114) (0.6%) | Hologic, Siemens | FFDM | NA | Screen | NA | NA | HP / no FU
a) | Dembrower 2020 (ref. 134) | CSAW (Private) | External | Sweden | 1 | 2009 - 2015 | (7 364) (simulated 75 534) | (547) (0.7%) | Hologic | FFDM | NA | Screen | 40 – 74 (median 53.6) (IQR 15.4) | Yes | HP / FU > 2 years
b) | Kyono [2] 2018 (ref. 231) | TOMMY (Private) | Internal | UK | 6 | NA | (1 000) | (156) (15.6%) | NA | FFDM | Processed | Screen (Recalled for assessment and FHx) | 40 - 73 | Yes | *HP / 3 x reader review of 2D and DBT
b) | Kyono [1] 2019 (ref. 232) | TOMMY (Private) | Internal | UK | 6 | NA | *Unclear 1/10 fold out of (2 000) | (300) (15.0%) | NA | FFDM | Processed | Screen (Recalled for assessment and FHx) | 40 - 73 | Yes | HP / 3 x reader review of 2D and DBT
c) | Rodriguez-Ruiz [2] 2019 (ref. 135) | Private | External | Seven countries, further detail provided in Rodriguez-Ruiz [1] 2019 | NA | NA | [2 654] | [653] [24.6%] | GE, Hologic, Sectra, Siemens | FFDM | Processed | *Both (50.0% screen, 50.0% clinical) | Detail provided in Rodriguez-Ruiz [1] 2019 | NA | HP / FU > 1 year

Table 3-3 – Computer aided detection (CADe) and Computer aided diagnosis (CADx) algorithm details and results. Algorithm performance compared to reader performance for all included studies. a) Screening mammograms, b) screening and diagnostic mammograms c) mammography and ultrasound used for screening. CEM: Challenge Ensemble Method, CI: Confidence Interval, CV: Cross Validation, DL: Deep Learning, DDSM: Digital Database for Screening Mammography, KPW: Kaiser Permanente Washington, N: number, NPV: Negative Predictive Value, NA: information not available, OPTIMAM: OPTIMAM Medical Image Database, SAID: Stand-alone AI Detection, v: Version. * caveat or other reported format. † The results of studies included in the primary meta-analysis.

Group | Reference | Journal | Machine Learning Technique | Task | Decision | N Development Images (Cases) [Exams] | Retrospective / Prospective Testing | Internal / External Testing | Test N Reader (Experience) | Test Reader Country Format | Test Validation Method | AUROC ML vs Reader | Sensitivity ML vs Reader | Specificity ML vs Reader | Code Available (Location)
a) | Geras 2017 (ref. 116) | arXiv | DL: Customised CNN | SAID | Per case | Training 721 186 [164 224]; Validation 108 276 [24 552] | Retrospective | Internal | 4 | Single - multi-reader | Hold-out method | macAUC 0.688 vs 0.704 | NA | NA | GitHub (https://github.com/nyukat/BIRADS_classifier)
a) | Lotter 2019 (ref. 233) | arXiv | DL: (ResNet-50 + RetinaNet) | SAID | Per case | (97 769) | Retrospective | *Internal / External | 5 (2 - 15 years) | Single - multi-reader | External validation + Bootstrapping | †Test 1 ML: 0.95 (CI 0.92, 0.97); Test 2 ML: 0.77 (CI 0.71, 0.82) | †Test 1. +14.2% (CI 9.2%-18.5%, p < 0.001) ML over Reader; Test 2. +17.5% (CI 6.0%-26.2%, p < 0.001) ML over Reader | †Test 1. +24.0% (CI 17.4%-30.4%, p < 0.001) ML over Reader; Test 2. +16.2% (CI 7.3%-24.6%, p < 0.001) ML over Reader | NA
a) | Rodriguez-Ruiz [3] 2019 (ref. 234) | Radiology | *DL Unclear Architecture; Commercial System Transpara (v 1.3.0) | SAID | Per case | *Unclear data partition for Training and Validation out of [18 000] | Retrospective | *Internal / External | 14 (11 specialists, 3 – 25 years) | Single - multi-reader | External validation | 0.89 vs 0.87 (p = 0.33) | 83.0% (CI 81.0%-85.0%) *Reader only | 77.0% (CI 75.0%-79.0%) *Reader only | Commercially available (https://screenpoint-medical.com/in-practice/)
a) | McKinney 2020 (ref. 138) | Nature | DL: ensemble ResNet-(V2-50 and V1-50), MobileNetV2, RetinaNet | SAID | Per case | UK: Training (13 918); Validation (62 866); US: Training (55.0% of 22 225); Validation (15.0% of 22 225) | Retrospective | *Internal / External | UK 51 (5 - 20+ years); USA (1 – 30 years); Reader study 6 (4 – 15 years) | UK double; USA single; Reader study single – multi-reader | Hold-out method + External validation | ML UK: AUROC = 0.89 (CI 0.87 - 0.91); USA (w/training UK+US): AUROC = 0.81 (CI 0.79-0.83); †(UK training only: AUROC = 0.76 (CI 0.73-0.78)); Reader study ML vs Reader: ΔAUROC = +0.115 (CI 0.06-0.18, p < 0.001) | †(+8.1%, p < 0.001); ML improvement % over Reader range [min-max]: [0.0-9.4] | †(+3.5%, p = 0.02); ML improvement % over Reader range [min-max]: [-3.4-5.7] | *NA
a) | Schaffter 2020 (ref. 137) | JAMA Network Open | DL: CEM Ensemble (8 networks including VGG, Faster-RCNN); DL: Customised VGG network | SAID | Per case | KPW (59 923) [100 974] + DDSM + Other datasets (e.g. OPTIMAM) | Retrospective | External | USA screen readers; Sweden screen readers | USA single; Sweden double (*reported single first reader) | Hold-out method + External validation | KPW (CEM) 0.90, (Top performing) 0.86; †KI (CEM) 0.92, (Top performing model) 0.90 | KPW *Reader sensitivity 85.9%; †KI *First reader 77.1%, Reader consensus 83.9% | KPW (CEM) 76.1%, (Top performing) 66.3% vs 90.5%; †KI (CEM) 92.5%, (Top performing model) 88.0%, 81.2% vs *†First reader 96.7%, Reader consensus 98.5% | GitHub (https://github.com/Sage-Bionetworks/DigitalMammographyEnsemble)
a) | Salim 2020 (ref. 149) | JAMA Oncology | DL: (1) ResNet34 (2) MobileNet (3) Unknown | SAID | Per case | (1) 752 000 (2) 239 000 (3) 112 000 | Retrospective | External | Sweden screen readers; 25 1st reader, 20 2nd reader | Sweden double | External validation + Bootstrapping | (1) 0.96 (2) 0.92 (3) 0.92 | ML: (1) †81.9% (p = 0.03) (p = 0.11) (2) 67.0% (3) 67.4% vs †First reader 77.4%, Reader consensus 85.0% | ML: (1) †96.6% (2) 96.6% (3) 96.7% vs †First reader 96.6%, Reader consensus 98.5% | NA
b) | Rodriguez-Ruiz [1] 2019 (ref. 235) | Journal of the National Cancer Institute | *DL Unclear Architecture; Commercial System Transpara (v 1.4.0) | SAID | Per case | *Unclear data partition for Training and Validation out of [189 000] | Retrospective | *Internal / External | 101; *95 for sensitivity and specificity (1 – 44 years) | Single - multi-reader | External validation | †0.84 (CI 0.82-0.86) vs 0.81 (CI 0.79-0.84) | †75.0%–86.0% vs 76.0%–84.0% | †49.0% – 79.0% *Clinician specificity | Commercially available (https://screenpoint-medical.com/in-practice/)
c) | Kim 2019 (ref. 236) | Lancet Digital Health | DL: (ResNet-34); Commercial System Lunit | SAID | Per case | Total [166 968]; Training [152 693]; Validation [14 275] | Retrospective | *Internal / External | 14 (7 specialists, > 6 months) | Single - multi-reader | External validation | 0.94 (CI 0.92–0.97) vs 0.81 (CI 0.77–0.85, p < 0.001) | 88.8% vs 75.3% (p < 0.001) | 81.9% vs 72.0% (p = 0.002) | Commercially available (https://www.lunit.io)

Table 3-4 – Computer aided detection (CADe) and Computer aided diagnosis (CADx) test set data characteristics of all included studies a) Screening mammograms, b) screening and diagnostic mammograms c) mammography and ultrasound used for screening. CSAW: Cohort of Screen Aged Women, FFDM: Full Field Digital Mammography, FU: Follow-up, HP: Histopathology, KI: Karolinska Institute, KPW: Kaiser Permanente Washington, N: number, NA: information not available, NYU: New York University, OPTIMAM: OPTIMAM Medical Image Database, SD: Standard Deviation, SF: Screen Film. *caveat or other reported format.

Group | Reference | Test Database | Test Data Internal / External | Test Data Country | Test Data N Centres | Test Data Year of Studies | Test Data N Images (Cases) [Exams] | Test Data N Cancer Images (Cases) [Exams] | Test Data Vendor | Test Data SF / FFDM | Test Data Processed | Test Data Screen / Diagnostic Mammograms | Test Data Age of Patients | Test Data Density | Test Data Ground Truth
a) | Geras 2017 (ref. 116) | NYU (Private) | Internal | USA | 5 | 2010 - 2016 | [500] | NA | NA | FFDM | Processed | Screen | 19 - 99 (mean 57.2) (SD 11.6) | NA | *BIRADS score (0, 1, 2)
a) | Lotter 2019 (ref. 233) | Private | External | USA | 1 | 2011 - 2014 | Test 1. (“Index”) [285]; Test 2. (“Pre-index 12-24 month prior”) [274] | Test 1. [131] [46.0%]; Test 2. [120] [43.8%] | NA | FFDM | Processed | Screen | NA | NA | HP / FU > 1 year
a) | Rodriguez-Ruiz [3] 2019 (ref. 234) | Private | External | USA, Europe | USA 1; Europe 1 | USA 2013 – 2017; Europe 2014 - 2015 | [240] | [100] [41.7%] | Hologic, Siemens | FFDM | Processed | Screen | 39 – 89 (mean 61.0) | Yes | HP / FU > 1 year
a) | McKinney 2020 (ref. 138) | OPTIMAM (Private) + Northwestern Memorial Hospital (Private) | *Internal / External | USA, UK | UK 2; USA 1 | UK 2012 - 2015; USA 2001 - 2018 | UK: (25 856); USA: (3 097); Reader study USA (*500) | UK: (414) (1.6%); US: (686) (22.2%); Reader study USA (*125) (25.0%) | GE, Hologic, Siemens | FFDM | Processed | Screen | NA | Yes *USA only | HP / FU > 1 year
a) | Schaffter 2020 (ref. 137) | KPW (Private); KI (Private) | *Internal / External | USA, Sweden | KPW 1; KI 2 | KPW NA; KI 2008 - 2012 | KPW (25 657) [43 257]; KI (*68 008) [166 578] | KPW (283) (1.1%); KI (*780) (1.1%) | NA | FFDM | Processed | Screen | KPW 40 - 74 (mean 58.4) (SD 9.7); KI 40 - 74 (mean 53.3) (SD 9.4) | NA | HP / FU > 1 year
a) | Salim 2020 (ref. 149) | CSAW (Private) | External | Sweden | 1 | 2008 - 2015 | (8 805) (Simulated 113 663) | (739) (Simulated 0.7%) | Hologic | FFDM | Processed | Screen | 40 - 74 (median 54.5) | Yes | HP / FU > 2 years
b) | Rodriguez-Ruiz [1] 2019 (ref. 235) | Private | External | Sweden, UK, Netherlands, USA, Italy, Spain, Austria | NA | NA | [2 652]; [*2 389] *for sensitivity and specificity | [653] [24.6%]; [*610] [24.6%] | GE, Hologic, Sectra, Siemens | FFDM | Processed | *Both (Some unilateral only) | 30 - 92 | Yes | HP / FU > 1 year
c) | Kim 2019 (ref. 236) | Private | *External | South Korea | 2 | 2009 - 2018 | [320] | [160] [50.0%] | GE, Hologic | FFDM | NA | Screen (*including US) | (mean 53.2) (SD 10·0) | Yes | *Mammography / USS detected + HP

3.3.2 Quality assessment

The PROBAST and QUADAS-2 tools were applied to all included articles in this review, and summary results of assessments are shown in Figure 3-3 and Appendix 7. Applying both tools identified a high risk of bias for analysis, as well as high bias and applicability concerns for the index test, participants and patient selection, Figure 3-3. Reasons for high bias and applicability include 8/14 (57%) articles with cancer-enriched cohorts, 5/14 (36%) articles that tested the algorithm on an internal dataset, and 3/14 (21%) articles that did not pre-set the algorithm threshold in CADt studies. According to PROBAST assessment, articles were reported to have an overall low (7%), unclear (7%), and high risk (86%) of bias.

Figure 3-3 – (a) Prediction model Risk Of Bias ASsessment Tool (PROBAST) and (b) Quality Assessment of Diagnostic Accuracy Studies-2 (QUADAS-2) assessment. For 14 included articles, each category is represented as a percentage of the number of articles that have high, low, or unclear levels of bias.

Critical appraisal of the reporting quality in the 14 included articles using the 42 parameters of CLAIM found scores ranging from 22 to 34, with an average total score of 30/42 (71%).
The points most commonly under-reported included robustness or sensitivity analysis, methods for explainability or interpretability, and protocol registration. Methods for explainability (e.g., saliency maps) to provide transparency of the algorithm’s deduction were reported in 3 articles. Only 50% of articles reported all eight key fields, Figure 3-4.

Figure 3-4 – Checklist for Artificial Intelligence in Medical Imaging (CLAIM) assessment. Results for 14 articles included in this review across 8 key categories identified from the checklist. A score of 1 was given if complete information was provided, and 0 where no information was provided. The x-axis indicates the percentage of articles in the review which included information about the eight key categories detailed in the y-axis.

3.3.3 Statistical analysis

Low heterogeneity was found for both algorithms and readers in the primary and the secondary analyses (I² 0.0%-0.6% and Cochran’s Q test p = 0.45-0.78). An estimated 185,252 cases from 3 countries with > 39 readers were included in the primary meta-analysis. The pooled summary estimates for sensitivity, specificity, and AUROC were 75.4% (95% CI 65.6-83.2), 90.6% (95% CI 82.9-95.0), and 0.89 (95% CI 0.84-0.98), respectively for ML algorithms. For readers, the pooled sensitivity, specificity, and AUROC were 73.0% (95% CI 60.7-82.6), 88.6% (95% CI 72.4-95.8), and 0.85 (95% CI 0.78-0.97), respectively, Figure 3-5. The differences in sensitivity and specificity were not statistically significant, p-value = 0.11 and 0.40 respectively. Algorithm performance thresholds were set at the reported reader sensitivity / specificity in 4 studies.
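For readers less familiar with the heterogeneity statistics quoted above, the conventional relationship between Cochran’s Q and I² is sketched below. This is the standard textbook definition and is assumed here for illustration; the exact computation used by the meta-analysis software is not stated in this chapter. The symbol k (number of pooled results) is introduced only for this sketch.

```latex
% Standard definition, assumed for illustration:
% Q is Cochran's Q statistic and k is the number of pooled results.
I^2 = \max\!\left(0,\; \frac{Q - (k - 1)}{Q}\right) \times 100\%
```

Under this definition, I² is 0% whenever Q does not exceed its degrees of freedom (k − 1), which is consistent with values close to zero being read as low heterogeneity.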
When including all available results from CADe and CADx studies conducted using external datasets that provided a direct comparison between ML algorithms and readers for a secondary meta-analysis, the pooled sensitivity, specificity, and AUROC were 80.4% (95% CI 75.5-84.6), 82.1% (95% CI 72.7-88.8), and 0.86 (95% CI 0.84-0.90) for algorithms. For readers the pooled sensitivity, specificity, and AUROC were 78.5% (95% CI 73.8-82.5), 82.6% (95% CI 69.2-90.9), and 0.84 (95% CI 0.81-0.88), Figure 3-5. The differences in sensitivity and specificity were not statistically significant, p-value = 0.70 and 0.73 respectively. Summary tables and additional information are available in Appendix 8-11.

[Figure 3-4 chart: “CLAIM - Eight key categories”. Categories: Data sources; Data preprocessing steps; How data were assigned to partitions; Level at which data partitions are disjoint; Detailed description of model; Details of training approach; Method of selecting the final model; Metrics of model performance. Scale: 0 to 100% of articles. Legend: Complete, No information.]

Figure 3-5 – Summary Receiver Operating Characteristic (sROC) curves. Primary meta-analysis: (a) algorithm and (b) reader results in 5 studies, reported for the top performing machine learning (ML) algorithm tested on an external data set, compared to reader performance for computer-aided detection (CADe) and computer-aided diagnosis (CADx) applications, with a ground truth of > 1 year follow-up and histopathology. Secondary meta-analysis: (c) 17 algorithm reported results and (d) 15 reader reported results from included studies for CADe and CADx applications tested externally. The black line represents sROC, the blue line represents confidence interval, the red dot represents the summary estimate, and the black crosses represent the individual results.

3.4 Discussion

We found the performance of mammography screening algorithms is reaching equivalence to readers in stand-alone CADe and CADx tasks.
Comparing our results to two recently published reader performance studies demonstrated that while the pooled sensitivity of algorithms (75.4%) was higher than that of pooled readers (73.0%) and single reading in Sweden (73.0%)175, it was inferior to both single reading in USA (86.9%)74 and double reading with consensus in Sweden (85.0%)175. The pooled specificity of algorithms (90.6%) was superior to pooled readers (88.6%) and single reading in USA (88.9%)74, but inferior to both single (96.0%) and double reading with consensus in Sweden (98.0%)175. Therefore, further improvements are needed to make sure ML systems meet the ‘clinically relevant thresholds’ of current reader performance and screening programme targets. Our findings are similar to those of a systematic review and meta-analysis comparing deep learning applications across all medical imaging to “health-care professionals”, which came to a similar conclusion and highlighted the importance of continued external testing142. Algorithms are also performing tasks not feasible for readers, such as high-volume normal case triage, with no detrimental change when reader performance was extrapolated in an adapted screening workflow (using machine only reading of cases assigned to be normal as an alternative to single or double reading)215. However, the acceptable “miss” rate for a system, similar to the interval cancer targets, should be agreed and specified for machine only reading of normal mammograms before clinical adoption. The biggest barrier may be public understanding of the concept of acceptable “misses”.

No prospective studies have yet been reported, many studies are still conducted with retrospective internal testing, and few studies are conducted by an independent party where multiple algorithms are cross-compared using external datasets149.
In addition, most of the studies used enriched cancer cohorts for testing, which do not include the class imbalance of cancers to healthy controls in screening. Thus, these datasets may not provide a realistic representation from which to infer model performance in clinical implementation, limiting generalisability, clinical applicability and feasibility of workflow translation. Our findings highlight the need for well-designed prospective randomised and non-randomised controlled trials to be conducted across different breast screening programmes. These prospective studies should include representative case proportions, to replicate the class imbalance in screening, with readers of varying experience interacting with ML algorithm outputs within the clinical workflow. This will allow performance to be assessed as well as technological feasibility, reading time, reader acceptability and effect on reader performance139. Prospective studies investigating ML applications for mammographic screening are currently underway in the UK, Norway, Sweden, China and Russia with results pending237–239.

Most articles were from 2019 onwards, reflecting the exponential growth in publications since major milestones such as the ImageNet118 and DREAM112,240 challenges. Although computer code was available in 64.0% of articles, only 21.0% of code was available on an open-source platform. However, the provision of code alone does not result in a deployable model, which also requires the trained weights and the threshold at which the algorithm performance was determined, limiting reproducibility and transparency241,242. Large datasets were used for testing but the majority of these are private, which limits the ability to replicate results.

Two commonly used tools for bias assessment found high risk of bias due to cancer enriched cohorts, use of internal datasets as well as due to the algorithm threshold in triage studies not being pre-set.
Therefore, these results may not be applicable and generalisable to all breast screening populations223. We applied a specific Artificial Intelligence (AI) medical imaging reporting guideline (CLAIM), to critically appraise AI medical imaging literature. It should be noted that CLAIM was published after more than half of the articles in this review were published. Therefore, we have not presented the results of each individual study but have used this as a foundation to find underreported areas within the current literature, as well as confirm the applicability of CLAIM for ML mammography studies167. 3.4.1 Limitations The meta-analysis was limited by both the small number of eligible studies and because the contingency tables were constructed using reported sensitivity, specificity, total cases and malignant cases to provide estimated integers (whole numbers) for calculating true-positives, true-negatives, false-positives, and false-negatives. The primary meta-analysis included studies where reader performance did not reach the level reported in national screening standards, therefore it is possible that the relative improved performance of ML algorithms is overestimated, and the performance of readers is underestimated as part of this analysis. The primary analysis also used only the highest performing (based on test performance) algorithm if multiple algorithms were tested, and therefore may be slightly biased towards the selected algorithms. The secondary meta-analysis incorporated multiple algorithms and readers from the same study, on the same population, which could potentially lead to overrepresentation. Therefore, the results from the meta-analysis should be interpreted with caution. Lastly, for the secondary meta-analysis both screening and diagnostic mammograms were included in studies, as well as in one study women were screened using mammography and ultrasound, both of which would impact on the expected performance metrics. 
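The first limitation above, estimating whole-number contingency table cells from reported summary metrics, can be illustrated with a minimal Python sketch. The function name and the example figures are hypothetical and are not taken from any included study; the rounding rule is also an assumption, since the text only states that estimated integers were used.

```python
def reconstruct_contingency(sensitivity, specificity, n_total, n_malignant):
    """Estimate integer 2x2 cells from reported metrics (illustrative only).

    Assumption: cells are rounded to the nearest whole number. Python's
    round() resolves exact .5 ties to the nearest even integer.
    """
    n_non_malignant = n_total - n_malignant
    tp = round(sensitivity * n_malignant)      # cancers correctly flagged
    fn = n_malignant - tp                      # cancers missed
    tn = round(specificity * n_non_malignant)  # non-cancers correctly cleared
    fp = n_non_malignant - tn                  # non-cancers flagged
    return {"TP": tp, "FN": fn, "TN": tn, "FP": fp}


# Hypothetical study: 1,000 cases, 100 malignant, sensitivity 80%, specificity 90%.
cells = reconstruct_contingency(0.80, 0.90, n_total=1000, n_malignant=100)
print(cells)
```

Because each cell is rounded, the sensitivity and specificity recomputed from the estimated table can differ slightly from the reported values, which is one reason the pooled estimates should be interpreted with caution.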
3.5 Conclusion

There is a growing evidence base that stand-alone ML performance is comparable to reader performance and that ML can undertake triage tasks at a volume and speed not feasible for human readers. Although only retrospective trials have been conducted, the potential for algorithms to perform at the level of or even exceed the performance of a reader within the real-time breast screening workflow is realistic. However, further robust prospective data are critical to understanding where algorithm thresholds are set, and are required to examine the interaction between human readers and algorithms, as well as the effect on reader performance and patient outcomes over time.

Chapter 4 – Developing a mammographic imaging database – The Cambridge Cohort – Mammography East Anglia Digital Imaging Archive

4.1 Aims

In this chapter the design, construction, governance, and content of The Cambridge Cohort – Mammography East Anglia Digital Imaging Archive (CC-MEDIA) is outlined. The CC-MEDIA database aims to provide extensive clinical metadata and multiple imaging episode data, fulfilling a need for a large-scale, external representative UK breast screening dataset for benchmarking. This allows for reproducible, independent testing and feedback of Artificial Intelligence (AI) models being developed for breast cancer screening. As a result of this work setting up the CC-MEDIA database, the methods and procedures developed have informed the University of Cambridge and Cambridge University Hospitals NHS Foundation Trust research ethics procedures regarding medical imaging databases. Contents of this chapter have been presented at the 2021 British Society of Breast Radiology Annual Scientific Conference243.

4.2 Introduction

Medical imaging has become an integral part of patient care, with 45.2 million imaging tests taking place between September 2018 and September 2019 in the UK115.
The visual field of radiology lends itself to AI research, for both the visual task at hand as well as due to the abundance of medical imaging data available. High quality medical imaging databases are therefore increasingly important for AI algorithm development and testing. The 2022 Goldacre report highlighted the importance of development of Trusted Research Environments (TRE) to fully utilise the digital data available within the National Health Service (NHS). The report acknowledged the current complex and convoluted ethical approval and governance required as well as the necessary expertise to build large reusable datasets244.

Initiatives across medical imaging research have led to the development of large imaging databases for chest x-ray (MIMIC-CXR), MRI knee (MRNet), CT head (RSNA 2019 brain haemorrhage challenge), and mammography (CSAW)157,245–247. Early mammographic imaging databases date back to 1994 but only contained a very small volume of digitised screen film images, from the USA and UK248,249. Recent mammographic imaging databases contain data in Full Field Digital Mammography (FFDM) Digital Imaging and Communications in Medicine (DICOM) format, from countries around the world at a larger scale of > 1,000,000 images157,158,250. The increase in data availability in the last ten years has contributed to the development of more accurate algorithms as well as allowing for the reproducibility and generalisability testing of AI algorithms in different screening programmes. Mammographic imaging databases now contain a diverse range of mammography machine manufacturers, screening programmes (annual, biennial, triennial), and countries (UK, Sweden, USA, Spain, Portugal), as detailed in Table 4-1.
| Dataset | Country | Year | SF / FFDM | Cases | Density | Age | Annotations |
|---|---|---|---|---|---|---|---|
| MIAS 248 | UK | 1994 | SF | 161 (I = 322) (C = ~52) | Y | NA | Y |
| DDSM 249 | USA | 1998-1999 | SF | 2620 (I = 10480) (C = 914) | Y | NA | Y |
| CBIS-DDSM 251,252 | USA | 2016 | SF | 1644 (I = 10239) (C = 758) | Y | NA | Y |
| InBreast 156,253 | Portugal | 2008-2010 | FFDM | 115 (I = 410) (C = NA) | Y | NA | Y |
| BCDR-FM 254,255 | Portugal + Spain | 2009-2013 | SF | 1010 (I = 2702) (C = NA) | Y | 20-90 | Y |
| BCDR-DM 254,255 | Portugal + Spain | 2009-2013 | FFDM | 724 (I = 3612) (C = NA) | Y | 27-92 | Y |
| OMI-DB 250 | UK | 2011-2020 | FFDM + DBT + MRI | 172282 (I > 3000000) (C = 8586) | Y | 30-84 | Y |
| CSAW 157 | Sweden | 2008-2015 | FFDM | 499807 (I > 2000000) (C = 10582) | NA | 40-74 | Y |
| NYU BCSD v1.0 158 | USA | 2010-2017 | FFDM | 141473 (I > 1000000) (C = 1221*) | Y | 16-99 | Y |
| EMBED 256 | USA | 2013-2020 | FFDM + DBT | 115910 (I > 3000000) (C = 3733) | Y | > 18 | Y |

Table 4-1 – Mammographic imaging database characteristics. BCDR: Breast Cancer Digital Repository, CBIS-DDSM: Curated Breast Imaging Subset of DDSM, CSAW: Cohort of Screen-Aged Women, C: Cancers, DDSM: Digital Database for Screening Mammography, DBT: Digital breast tomosynthesis, DM: Digital mammography, EMBED: EMory BrEast imaging Dataset, FFDM: Full field digital mammography, FM: Film mammography, I: Images, MIAS: The Mammographic Image Analysis Society Digital Mammogram Database, NYU BCSD: New York University Breast Cancer Screening Dataset, NA: Not available, OMI-DB: The Optimam Mammography Image Database, SF: Screen film, Y: Yes. *breasts not cases.

Different levels of data are available for mammography images (case level, exam level, per breast level, per image level) with certain datasets also providing image level annotations either via bounding box or pixel level regions of interest. The ground truth of the data is an important component of medical imaging databases. Routinely in breast screening AI testing the ground truth is set at two levels.
For a case to be defined as ‘normal’ there has to be a sufficient time interval with an outcome of a normal screen, and for ‘cancers’ there should be a histopathological diagnosis outcome.

The NHS Breast Screening Programme (NHSBSP) is a three yearly (triennial) breast screening programme, inviting women aged 50-70 to participate and using two-view FFDM. Due to the NHSBSP age extension trial (AgeX), running from 2011 to 2016, women aged 47-49 years old were also invited to screening. In addition, women aged more than 70 years old can self-refer into the screening programme54,257. The NHSBSP is carried out at seventy-five sites across the country within the NHS system that allows for women to be tracked over time and linkage between different data sources using personal identification numbers (NHS number). All mammograms are double read by two expert readers (e.g. radiologists, consultant radiographers, and breast clinicians) either independently or dependently.

The CC-MEDIA database captures the true distribution of the NHSBSP by consecutively collecting screening mammograms for women aged 47 years and older who attended screening at two NHSBSP sites between 2011 and 2020. This facilitates the independent testing of multiple AI algorithms, for different breast screening applications, using large, representative cohorts with extensive follow up to allow for accurate ground truth identification.

4.3 Methods

4.3.1 Database approval

Ethical approval for this database was obtained from the Health Research Authority (HRA) Confidentiality Advisory Group (CAG), HRA Research Ethics Committee (REC) and Public Health England (PHE) Research Advisory Committee (RAC). IRAS Reference – 258761.
• HRA REC – reference 20/LO/0104 – approval date 03/04/2020
• PHE RAC – reference BSPRAC_090 – approval date 03/04/2020
• HRA CAG – reference 20/CAG/0009 – approval date 11/06/2020

A formal agreement was put in place between Cambridge and Norwich hospital trusts relating to the use of data within this database. Consent was not obtained from individual patients for the creation and use of this database, as the data that is retained within the final Trial Database is in a de-identified format and Section 251 approval was received from the HRA CAG. The database has received initial 5-year approval until 2025 and yearly reports are submitted to the ethics committees to maintain support.

4.3.2 Database governance

The database is overseen by The Cambridge Cohort Database Access Committee (DAC). The DAC ensures that the management of the database is in line with current regulations for data governance and patient safety. The DAC includes the principal investigators, representatives from the breast screening units and data managers at both sites, research governance leads at the university and hospital, as well as a lay member. The DAC is responsible for reviewing applications for data access by internal and external sources as well as determining the terms of access given. As the data is unconsented and sensitive, data is not routinely released to external institutions. If a request for processing using a small volume of data is approved by the DAC, e.g. for the specific purpose of company AI tool validation on Philips data, a data sharing agreement is put in place and the small volume of data [n = ~100 cases] is re-anonymised and transferred via a secure process (e.g. secure file transfer protocol (SFTP)).
4.3.3 Patient and public involvement work

Throughout the development of this database extensive patient and public involvement (PPI) work has been undertaken to ensure the views of the patients included in this database and those of the general public are taken into account regarding the management and use of their data. The feedback received from our PPI events has improved the way we explain to people how their patient data is used as part of this research. It also improved explanations regarding how data moves from the hospital to the university, who has access to the data, and how the data is then used in a de-identified format (where all the information that could be used to identify an individual has been removed or amended). Exploring patient acceptability of different aspects of data use has helped ensure we are working both within the public’s expectations as well as in line with national ethical requirements. All the PPI work was carried out with the support and guidance of the National Institute for Health and Care Research (NIHR) Cambridge Biomedical Research Centre (BRC) PPI team.

The first PPI activity was a formal review by the NIHR Cambridge BRC PPI panel of the project lay summary (15/11/2019). The NIHR Cambridge BRC PPI panel is a group of around 60 members of the general public from Cambridge and the surrounding areas who are interested in research. They provide their thoughts and opinions on research projects based on their own personal experiences. Seven panel members reviewed the lay summary and provided feedback. This included clarifying the terms used in our patient facing material to make these more accessible, for example providing an explanation around the term “de-identified”. They also raised queries around data flows, commercial involvement, and how commercial companies will access the data.
Following on from this initial activity a Cambridge Science Festival public forum was held on 9th March 2020 to gain insights into the public’s views on “Harnessing Big Clinical Data In Medicine. Can AI Improve Breast Cancer Screening?”. Thirty-seven members of the public attended: 58.0% were female and 42.0% male; 53.0% were aged 18-29, 22.0% aged 30-49, and 26.0% were aged more than 50 years old. Throughout the session responses to questions were collected from the audience using TurningPoint software (which uses interactive clickers to record anonymised results). This interactive feedback helped to understand how certain terms were phrased and where further explanation should be provided for the complex flow of data in this research. The audience was asked about the acceptability of different organisations having access to the database, such as the university, hospital, and commercial companies. There was a high proportion of agreement for the university and hospital, with 63.9% of the audience strongly agreeing and 16.7% agreeing; however, there was a mixed picture for the commercial companies, with only 18.8% strongly agreeing and 28.1% agreeing, Figure 4-1.

Figure 4-1 – Cambridge Science Festival Event questions. a) “Would you support the use of your medical images by a hospital or university (in a fully anonymised format, stored in a secure location) to be used for developing algorithms without your consent?”, b) “Would you support the use of your medical images by a commercial company (in a fully anonymised format, stored in a secure location) to be used for developing algorithms without your consent?”

Based on the questions outlined in Figure 4-1, further work was carried out to clarify the role of commercial companies in this research.
In particular, it was clarified that de-identified data would only be released in small proportions to external companies, and that the data would be held securely within the University of Cambridge so that the algorithms are brought to the data and only those with approved access could see and use the data. A glossary of commonly used terms was developed for our project following this event to be used for future PPI communication, and an anonymised report which summarised results from our question-and-answer sessions was submitted to HRA REC, HRA CAG and PHE RAC for initial ethical approval.

A national patient survey called “The AI Survey - The use of patient data in breast cancer screening artificial intelligence research” was conducted from October 2020. The survey was disseminated through the NIHR Cambridge BRC team, Independent Cancer Patient Voices (ICPV), Breast Cancer Now, Addenbrooke’s Cancer Patient Partnership Group, and Cancer Research UK, to patients who were eligible for breast screening, had previously attended breast screening, or had previously been diagnosed with breast cancer. The survey was hosted using the university Qualtrics platform from 27/10/2020 to 31/01/2021. In total 46 responses were received. The survey highlighted areas for further improvement surrounding the terms used and layout of the patient facing material. We were able to demonstrate the improved clarity of the updated lay summary. In addition, the survey assessed the acceptability of the data fields collected in the database, Figure 4-2. Patients were very likely to accept the use of all fields to be collected in the database. However, 2.0%-6.1% of patients were unlikely or very unlikely to accept the use of family history or additional healthcare information e.g. information relating to other health conditions such as medication.

Figure 4-2 – National patient survey question regarding acceptability of data fields.
“How likely are you to accept the secure storage and use for research (algorithm testing and development) of each field of your de-identified healthcare data?”.

Figure 4-3 – National patient survey questions regarding commercial involvement. “Would you accept the use of your healthcare data (securely stored in a de-identified format without your consent) for algorithm testing research in the following circumstances:”

However, we acknowledged the lay summary was felt to be too long and so we produced a shorter version with all the key information still included. Similar results to those found at the science festival public forum were found in the survey regarding “Would you accept the use of your healthcare data (securely stored in a de-identified format without your consent) for algorithm testing research in the following circumstances”, Figure 4-3. This further supports the acceptance of the approach of bringing the algorithms to the data that was taken in this research.

Follow-up small discussion forum groups were held via Zoom in January and February 2021 for people who contacted the research team following the survey. These discussions highlighted that there is hesitancy regarding commercial involvement in this research, mainly regarding concerns over data privacy. However, there was also an understanding of the need to involve commercial companies to enable this type of research to progress, and for the implementation of such technology within the NHS. Panel members noted the increased public awareness and acceptance of commercial collaborations within healthcare research, following the work to develop vaccines during the Covid-19 pandemic. The work involving commercial companies was further clarified in patient facing documents, such as which information commercial companies will have access to and how this access would be controlled.
Those who attended the Zoom events kindly helped in further developing the updated versions of patient facing material and all documentation was made available on the University’s departmental website.

4.3.4 Database sites

Two sites in East Anglia, England, participated in the creation of the CC-MEDIA imaging database:
• Cambridge (Cambridge University Hospitals NHS Foundation Trust, including Cambridge Breast Unit)
• Norwich (Norfolk and Norwich University Hospitals NHS Foundation Trust, including Norwich Breast Unit)

The average round length for screening at both sites is 34-36 months. Neither site participated in the AgeX trial; however, screening was offered to those aged 47 years and older in the region within the study time period.

Cambridge breast screening implements double reading of all mammograms, with the second reader able to see the outcome from the first reader, thus reading is dependent. Arbitration takes place for all cases recalled as well as for cases where there is discordance between the two initial readers, with a panel of up to four readers. During the Covid-19 pandemic Cambridge breast screening was paused twice, first from 23/03/2020 to 16/07/2020, and secondly from 11/01/2021 to 22/02/2021.

Norwich breast screening uses double reading of all mammograms, with the second reader not being able to see the outcome from the first reader, thus reading is independent. Arbitration takes place only for cases where there is discordance between readers, with an average panel size of five readers. During the Covid-19 pandemic Norwich breast screening was paused from 20/03/2020 to 16/07/2020.

When using the database, the first reader was used as the independent reader to be combined with the algorithm, arbitration was determined if there was a disagreement between readers, and trainee readers were replaced with trained readers to allow comparison using both sites’ data.
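The rule described above for using the database, in which the first reader is combined with the algorithm and arbitration applies when the two disagree, can be sketched as follows. This is a minimal illustration under stated assumptions and not the code used in this thesis: the function names, the AI score scale and the threshold of 0.50 are all hypothetical.

```python
def ai_recall(ai_score, threshold):
    """Binarise a continuous AI score at a pre-set threshold (hypothetical values)."""
    return ai_score >= threshold


def final_decision(first_reader_recall, second_read_recall, arbitration_recall=None):
    """Combine two reads; discordant reads are settled by the arbitration outcome."""
    if first_reader_recall == second_read_recall:
        return first_reader_recall
    if arbitration_recall is None:
        raise ValueError("Discordant reads require an arbitration outcome")
    return arbitration_recall


# Example: the first human reader does not recall, the AI does, arbitration recalls.
decision = final_decision(
    first_reader_recall=False,
    second_read_recall=ai_recall(0.81, threshold=0.50),
    arbitration_recall=True,
)
print(decision)
```

The sketch applies arbitration only on discordance; the Cambridge practice of also arbitrating every recalled case is deliberately left out to keep the example short.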
4.3.5 Database creation

The database consists of cases aged 47 years or older who attended screening between 2011 and 2020. Data was collected at two NHSBSP sites (Cambridge and Norwich) to create a centrally stored database for external testing of multiple AI algorithms. Patients who attended routine screening (triennial) as well as patients who attended high risk screening and subsequently were transferred to routine screening were incorporated into this database.

The main sources of information were the picture archiving and communication system (PACS) for Digital Imaging and Communications in Medicine (DICOM) image data as well as DICOM header and tags, the National Breast Screening System (NBSS) for breast screening metadata, and Electronic Health Records (EHR) – EPIC / LARDR – for additional clinical metadata, Figure 4-4. A screening episode is defined as anything that occurs from the time a woman is invited to screening, the screen itself and any assessments, as well as diagnoses and treatments that occur as a consequence of screening. Outcomes for all case episodes were followed up using data from the NBSS until April 2022.

Figure 4-4 – Data flow of the CC-MEDIA data collection. DICOM: Digital imaging and communications in medicine, EHR: Electronic health records, HPC RFS: High performance computing research file store, NBSS: National breast screening system, NHSBSP: National health service breast screening programme, PACS: Picture archiving and communication system. Adapted from Halling-Brown et al250.

An environment was set up within the NHS firewall at each hospital site to facilitate the transfer of images from PACS to a secure store (using code developed by Dr Andrew Priest, Medical Physicist at Cambridge University Hospital NHS Foundation Trust) which entailed a static IP address, MATLAB (The Mathworks, Inc., Natick, Massachusetts, United States.
http://www.mathworks.com, version 2019a) with the Image Processing Toolbox, and the DCMTK tool kit (echoscu, findscu, movescu, storescu) (version 3.6.2 and 3.6.4)258. Cases were identified from NBSS using existing and new project specific Crystal Report queries (developed by Sue Hudson, PAS Consulting London). The personal identifiers (NHS number and study date) from NBSS were then used to query PACS and then retrieve the DICOM image data into the on-hospital-site secure store. DICOM imaging data included the standard two-view processed (“for presentation”) mammogram screening images as well as available additional views and raw (“for processing”) data. All the images were stored in a compressed Joint Photographic Experts Group (JPEG) lossless format. All image data, including the DICOM header and tag information, was de-identified by adapting the basic profile provided in DICOM PS3.15, such that all identifiable information was removed (Appendix 12)259. Care was taken when handling latent identifiers, such as date of screening, in order to ensure anonymity was achieved whilst retaining longitudinal information.

Additional NBSS Crystal Report queries were used to extract clinical metadata from NBSS for the whole screening episode. The clinical metadata fields from NBSS provided a ground truth for each case. The clinical metadata was de-identified using Python (Python Software Foundation, http://www.python.org, version 3.8)260 based scripts (developed by Dr Lorena Escudero Sánchez, Research Associate at University of Cambridge) and Excel (Microsoft Corporation, https://office.microsoft.com/excel) functions within the on-site secure store. The de-identification processes for both the image data and clinical metadata occurred prior to the transfer of any data from the hospital site. The study nomenclature allows for the easy tracking and re-joining of data for analysis, where each case is assigned a trial ID (case ID), exam ID and image ID, Figure 4-5.
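The handling of latent identifiers such as screening dates is not detailed in this section. One commonly used approach that keeps longitudinal information is to shift all dates belonging to a case by the same random offset, so that intervals between screens are preserved while the true dates are hidden. The Python sketch below illustrates that idea only; it is an assumption for illustration and not necessarily the method used for CC-MEDIA, and the function name, the 180 day limit and the example dates are hypothetical.

```python
import datetime as dt
import random


def shift_case_dates(exam_dates, rng, max_shift_days=180):
    """Shift every date of one case by the same random offset (illustrative)."""
    offset = dt.timedelta(days=rng.randint(-max_shift_days, max_shift_days))
    return [d + offset for d in exam_dates]


rng = random.Random(2022)  # seeded only to make the example repeatable
original = [dt.date(2014, 3, 10), dt.date(2017, 2, 27)]  # hypothetical screens
shifted = shift_case_dates(original, rng)

# The interval between screens is unchanged by the shift.
print((original[1] - original[0]).days, (shifted[1] - shifted[0]).days)
```

In such a scheme the per-case offset would need to be stored only in the on-site look up key, since anyone holding it could recover the original dates.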
Figure 4-5 – Nomenclature of case de-identification within the CC-MEDIA database. The site code varied for Cambridge (CC05 or CC06) and Norwich (CC07 or CC08). The trial code was randomly assigned for each case. The exam code increased sequentially for each episode. Series number was taken from the DICOM series tag. MGP: Mammographic processed, MGR: Mammographic raw. LCC: Left craniocaudal, RCC: Right craniocaudal, RMLO: Right mediolateral oblique, LMLO: Left mediolateral oblique.

All data was then encrypted in the on-hospital-site secure store, using Advanced Encryption Standard (AES)-256 encryption, and subsequently transferred to the University of Cambridge high performance computing (HPC) research file store (RFS). A look-up key store remained at each site within a separate area in the on-hospital-site secure store to allow for additional data linkage whilst building the database. Following the completion of the database this look-up key store will be securely held by the principal investigator at each site. Once the image data was transferred, the DICOM headers and tags were extracted from the image data using the DCMTK dcmdump utility (version 3.6.5) [258]. The DICOM dump data was then adapted into an image metadata file using MATLAB code (developed by Dr Nicholas Payne, Research Associate at University of Cambridge). The DICOM metadata files were then stored in the HPC RFS alongside the de-identified DICOM image data and clinical metadata.

When collecting the image data, firstly all interval cancers (ICs) and screen detected cancers (SDCs) from both sites were collected from 2011-2020. Secondly, a set of cases that were age and year matched to the ICs in the database was extracted. Following the completion of the cancer retrieval (ICs and SDCs), all cases from specified year cohorts were consecutively collected. The most recent year cohort with a complete follow-up time period was collected first, hence starting with 2017 from both Cambridge and Norwich.
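The age and year matching of normal cases to ICs mentioned above can be sketched as follows. This is a minimal illustration rather than the script used for the database: it assumes exact matching on age in years and screening year, sampling without replacement, and three controls per IC (the ratio described for the study in Chapter 5). The function name, the dictionary keys and the `seed` parameter are hypothetical.

```python
import random
from collections import defaultdict


def select_matched_controls(ic_cases, candidates, ratio=3, seed=0):
    """Randomly pick up to `ratio` controls per IC with the same age and screen year.

    Both inputs are lists of dicts with the (assumed) keys
    "case_id", "age" and "screen_year". Controls are used at most once.
    """
    rng = random.Random(seed)  # fixed seed so the selection is reproducible

    # Group candidate normal cases by the matching key.
    pool = defaultdict(list)
    for cand in candidates:
        pool[(cand["age"], cand["screen_year"])].append(cand["case_id"])
    for ids in pool.values():
        rng.shuffle(ids)

    matches = {}
    for ic in ic_cases:
        available = pool[(ic["age"], ic["screen_year"])]
        n_take = min(ratio, len(available))  # fewer if the pool runs short
        matches[ic["case_id"]] = [available.pop() for _ in range(n_take)]
    return matches
```

Sampling without replacement keeps each normal case in at most one matched set; where fewer than `ratio` candidates exist for a given age and year, the sketch simply returns the ones available rather than relaxing the matching criteria.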
Subsequently, year cohorts were collected in a specific order at each site to avoid overlap with an existing database, The Optimam Mammography Image Database (OMI-DB), where data had previously been collected. Data was collected from Cambridge by the OMI-DB database team between 2012 and 2016 [250]. Thus, to date, the six sets of data that have been extracted are:
• CC05 – Cambridge ICs and year and age matched normal controls 2011-2020
• CC06 – Cambridge SDCs 2011-2020
• CC06 – Cambridge year cohorts 2017-2018
• CC07 – Norwich ICs and year and age matched normal controls 2011-2020
• CC08 – Norwich SDCs 2011-2020
• CC08 – Norwich year cohorts 2014-2018

Figure 4-6 – Timeline of mammography data changes over time at Cambridge and Norwich National Health Service Breast Screening Programme (NHSBSP) sites. SF: Screen film, OMI-DB: The Optimam Mammography Image Database, FU: Follow-up, FFDM: Full field digital mammography, NBSS: National breast screening system. *Only at Cambridge site.

When using the image cases in studies, the cohort was first identified using the clinical metadata file, and then all images for the cases were copied and decrypted in a separate area on the secure HPC RFS store to retain the completeness of the original data. Figure 4-6 details the important changes at the database sites over the study time period. Due to the change from SF mammograms to FFDM in 2011/2012 at both sites, there was limited availability of image data over this time period. In addition, raw data was only collected at Cambridge and only between 2014 and 2019.

4.4 Results

4.4.1 Database image content

Image data collection started on 11/12/2020 and is ongoing. The information reported in this chapter is up to date as of 05/05/2022.
| | Norwich 2014 | Norwich 2015 | Norwich 2016 | Norwich 2017 | Cambridge 2017 | Norwich 2018 | Cambridge 2018 |
|---|---|---|---|---|---|---|---|
| Total | | | | | | | |
| Exams | 27214 | 28926 | 25915 | 22936 | 18803 | 26901 | 21218 |
| Images | 116013 | 122878 | 107043 | 94492 | 151917 | 110185 | 171521 |
| Manufacturer | | | | | | | |
| GE | 27210 (99.99%) | 28926 (100%) | 25915 (100%) | 22936 (100%) | 297 (1.6%) | 26901 (100%) | 372 (1.8%) |
| Philips | 0 (0.0%) | 0 (0.0%) | 0 (0.0%) | 0 (0.0%) | 18506 (98.4%) | 0 (0.0%) | 20846 (98.2%) |
| Fujifilm | 4 (0.01%) | 0 (0.0%) | 0 (0.0%) | 0 (0.0%) | 0 (0.0%) | 0 (0.0%) | 0 (0.0%) |
| FFDM | | | | | | | |
| Raw FFDM images | 154 (0.13%) | 259 (0.21%) | 212 (0.20%) | 243 (0.26%) | 75972 (50.01%) | 0 (0.0%) | 85735 (49.99%) |
| Processed FFDM images | 115859 (99.87%) | 122619 (99.79%) | 106831 (99.80%) | 94249 (99.74%) | 75945 (49.99%) | 110185 (100%) | 85786 (50.01%) |
| Breast Implants | | | | | | | |
| Implant images | 768 (0.7%) | 1247 (1.0%) | 1482 (1.4%) | 1169 (1.2%) | 1894 (1.3%) | 2958 (2.7%) | 251 (1.2%) |
| Age at Screening | | | | | | | |
| 47-49 | 3033 (11.2%) | 2613 (9.0%) | 2307 (8.9%) | 106 (0.5%) | 42 (0.2%) | 14 (0.05%) | 96 (0.5%) |
| 50-59 | 10854 (39.9%) | 11699 (40.4%) | 11553 (44.6%) | 10841 (47.2%) | 9857 (52.4%) | 11887 (44.2%) | 10759 (50.7%) |
| 60-69 | 10211 (37.5%) | 12390 (42.8%) | 10242 (39.5%) | 9441 (41.2%) | 7659 (40.7%) | 11080 (41.2%) | 8235 (38.8%) |
| 70+ | 3116 (11.5%) | 2224 (7.7%) | 1813 (7.0%) | 2548 (11.1%) | 1245 (6.6%) | 3920 (14.6%) | 2128 (10.0%) |
| Cancers | | | | | | | |
| Normal | 26882 (98.8%) | 28670 (99.1%) | 25619 (98.8%) | 22662 (98.8%) | 18551 (98.7%) | 26560 (98.7%) | 20947 (98.7%) |
| SDC | 208 (0.8%) | 152 (0.5%) | 198 (0.8%) | 189 (0.8%) | 158 (0.8%) | 225 (0.8%) | 188 (0.9%) |
| IC | 124 (0.5%) | 104 (0.4%) | 98 (0.4%) | 85 (0.4%) | 94 (0.5%) | 116 (0.4%) | 85 (0.4%) |

Table 4-2 – Number of exams per site available with images currently held in the CC-MEDIA database. Interval cancers were diagnosed within 40 months of screening. FFDM: Full field digital mammography, IC: Interval cancer, SDC: Screen detected cancer.

In total the core database (CC06 and CC08) contains 323,438 images, 40,021 exams, and 39,982 cases from Cambridge, and 550,611 images, 131,892 exams, and 87,046 cases from Norwich.
Thus in total the database contains 874,049 images, 171,913 exams and 127,028 cases, of which 1,318 are SDC cases and 706 are IC cases (Table 4-2). Out of all the exams in the database 82,190 (47.8%) have one instance, 89,606 (52.1%) have two instances, with only a very small proportion having three (75 (0.04%)), four (32 (0.02%)) or five (10 (0.01%)) instances. The age range of cases in the cohort was 47-95 years old (median = 59 years old).

An annual report, the KC62 (The NHS Breast Screening Programme Central Return Data Set), is created by each screening programme and details “information on women invited for Breast Screening, the outcome of the Breast Screening and further information on each cancer detected” [261]. Comparing the volume of cases with the KC62 data from both sites shows that the distribution of collected cases was similar to the true distribution of cases that attended screening. Thus the database was representative of the screening carried out at each NHSBSP site (Table 4-3).

| Year | NHSBSP Screened [262] | Cambridge KC62 Screened | Cambridge CC-MEDIA | Norwich KC62 Screened | Norwich CC-MEDIA |
|---|---|---|---|---|---|
| 2011-2012 | 1940603 | 17134 | - | 25900 | - |
| 2012-2013 | 1970955 | 17475 | - | 25798 | - |
| 2013-2014 | 2079271 | 19590 | - | 26823 | 7912* |
| 2014-2015 | 2105454 | 21972 | - | 26070 | 26133 |
| 2015-2016 | 2161268 | 19370 | - | 29150 | 29065 |
| 2016-2017 | 2199342 | 18389 | 4979* | 25584 | 25573 |
| 2017-2018 | 2138434 | 19035 | 18970 | 22471 | 22286 |
| 2018-2019 | 2234514 | 20830 | 16072* | 27371 | 20923* |
| 2019-2020 | 2123589 | 15144 | - | 23675 | - |
| Total exams | 18953430 | 168939 | 40021 | 232842 | 131892 |

Table 4-3 – Cambridge and Norwich CC-MEDIA database 2011-2020 compared to the KC62 report at both sites. The KC62 reports programme performance from 01/04/YYYY to 31/03/YYYY at each NHSBSP site.
KC62 data is taken from Table-T of the annual KC62 report, which reports the sum of tables A-F2: first invite for routine screening, routine invitation to previous non-attenders, return invitation to previous attenders (last screen within 5 years and last screen more than 5 years previously), short term recall, and self / GP referrals for women not previously screened or previously screened (last screen within 5 years or last screen more than 5 years previously). *Fields that have incomplete year data.

4.4.2 Database content - Interval cancers

ICs are a key measure of screening programme performance. The acceptable IC rate set by the NHSBSP is 3.7/1000 women screened [101,103]. As defined by the NHSBSP, ICs can occur at any time from the last negative screen to 40 months post screen. Figure 4-7 shows the time to diagnosis at Cambridge and Norwich by months. IC image data available within the CC-MEDIA database is shown in Table 4-4.

Figure 4-7 – Time to diagnosis (months) for interval cancers (IC) at a) Cambridge and b) Norwich.

| | Cambridge n (%) | Norwich n (%) |
|---|---|---|
| Total exams n | 611 | 561 |
| Total images n | 3937 | 2350 |
| Year of Screening | | |
| 2010-2011 | 2* (0.3%) | 0* (0.0%) |
| 2011-2012 | 26* (4.3%) | 0* (0.0%) |
| 2012-2013 | 75* (12.3%) | 24* (4.3%) |
| 2013-2014 | 92 (15.1%) | 100 (17.8%) |
| 2014-2015 | 106 (17.3%) | 104 (18.5%) |
| 2015-2016 | 86 (14.1%) | 104 (18.5%) |
| 2016-2017 | 62 (10.1%) | 94 (16.8%) |
| 2017-2018 | 85 (13.9%) | 76 (13.6%) |
| 2018-2019 | 53* (8.7%) | 48* (8.6%) |
| 2019-2020 | 23* (3.8%) | 11* (2.0%) |
| 2020-2021 | 1* (0.2%) | 0* (0.0%) |
| Age at Screening | | |
| 47-49 | 67 (11.0%) | 54 (9.6%) |
| 50-59 | 262 (42.9%) | 206 (36.7%) |
| 60-69 | 230 (37.6%) | 229 (40.8%) |
| 70+ | 52 (8.5%) | 72 (12.8%) |
| Manufacturer | | |
| GE | 23 (3.8%) | 557 (99.3%) |
| Philips | 553 (90.5%) | 0 (0.0%) |
| Hologic | 15 (2.5%) | 4 (0.7%) |
| Sectra | 20 (3.3%) | 0 (0.0%) |
| FFDM | | |
| Raw FFDM images | 1478 (37.5%) | 0 (0.0%) |
| Processed FFDM images | 2459 (62.5%) | 2350 (100%) |
| Implants | | |
| Implant images | 17 (0.4%) | 28 (1.2%) |

Table 4-4 – Interval cancers (ICs) at Cambridge and Norwich with imaging data 2011-2020 in CC-MEDIA.
Interval cancers were diagnosed within 40 months of screening, leading to 5 cases excluded from Cambridge and 4 cases from Norwich that were diagnosed > 40 months. FFDM: Full field digital mammography. *Fields that have incomplete year data.

As shown in the table there is good coverage of image data availability from 2013 to 2018. In addition, four different mammographic machine vendors are included over this time period; however, the majority are Philips at Cambridge and GE at Norwich. Information regarding IC rates is provided in this database from the NBSS local site data. Ethical approval has been obtained to apply for additional information from the Screening History Information Management system (SHIM) and National Cancer Registry (NCRAS) in the future.

4.4.3 Database content - Screen detected cancers

SDCs are cancers that are recalled at the screening episode and diagnosed at the assessment clinic, where a triple assessment is carried out comprising clinical examination, further imaging (e.g. ultrasound), and biopsy. It is estimated that in the triennial NHSBSP, SDCs occur at a rate of 8/1000 women screened [54,263]. The SDC image data that is available for each site within the CC-MEDIA database is shown in Table 4-5. As shown in the table there is good coverage of SDC data from 2013 to 2020 at both sites. In addition, five different mammographic machine vendors are included over this time period; however, the majority are Philips at Cambridge and GE at Norwich.
| | Cambridge KC62 n (%) | Cambridge n (%) | Norwich KC62 n (%) | Norwich n (%) |
|---|---|---|---|---|
| Total exams n | 1539 | 1286 | 2179 | 1551 |
| Total images n | - | 8327 | - | 6528 |
| Year of Screening | | | | |
| 2010-2011 | 123 (8.0%) | 1* (0.08%) | 202 (9.3%) | 0* (0.0%) |
| 2011-2012 | 148 (9.6%) | 52* (4.0%) | 204 (9.4%) | 0* (0.0%) |
| 2012-2013 | 131 (8.5%) | 129 (10.0%) | 186 (8.5%) | 52* (3.4%) |
| 2013-2014 | 148 (9.6%) | 143 (11.1%) | 180 (8.3%) | 168 (10.8%) |
| 2014-2015 | 162 (10.5%) | 168 (13.1%) | 201 (9.2%) | 203 (13.1%) |
| 2015-2016 | 166 (10.7%) | 163 (12.7%) | 188 (8.6%) | 187 (12.1%) |
| 2016-2017 | 144 (9.4%) | 148 (11.5%) | 201 (9.2%) | 197 (12.7%) |
| 2017-2018 | 162 (10.5%) | 161 (12.5%) | 198 (9.1%) | 200 (12.9%) |
| 2018-2019 | 184 (12.0%) | 196 (15.2%) | 249 (11.4%) | 247 (15.9%) |
| 2019-2020 | 97 (6.3%) | 96 (7.5%) | 202 (9.3%) | 207 (13.3%) |
| 2020-2021 | 74 (4.8%) | 29* (2.3%) | 186 (8.5%) | 90* (5.8%) |

| | Cambridge n (%) [n = 1286] | Norwich n (%) [n = 1551] |
|---|---|---|
| Age at Screening | | |
| 47-49 | 74 (5.7%) | 69 (4.5%) |
| 50-59 | 469 (36.5%) | 509 (32.8%) |
| 60-69 | 589 (45.8%) | 717 (46.2%) |
| 70+ | 154 (12.0%) | 256 (16.5%) |
| Manufacturer | | |
| GE | 33 (2.6%) | 1548 (99.8%) |
| Philips | 1190 (92.5%) | 0 (0.0%) |
| Hologic | 22 (1.7%) | 3 (0.2%) |
| Siemens | 1 (0.08%) | 0 (0.0%) |
| Sectra | 40 (3.1%) | 0 (0.0%) |
| FFDM | | |
| Raw FFDM images | 3133 (37.6%) | 13 (0.2%) |
| Processed FFDM images | 5194 (62.4%) | 6515 (99.8%) |
| Implants | | |
| Implant images | 45 (0.5%) | 25 (0.4%) |

Table 4-5 – Screen detected cancers (SDCs) at Cambridge and Norwich with imaging data 2011-2020 in CC-MEDIA. The KC62 reports programme performance from 01/04/YYYY to 31/03/YYYY at each NHSBSP site. KC62 data is taken from Table-T of the annual KC62 report, which reports the sum of tables A-F2: first invite for routine screening, routine invitation to previous non-attenders, return invitation to previous attenders (last screen within 5 years and last screen more than 5 years previously), short term recall, and self / GP referrals for women not previously screened or previously screened (last screen within 5 years or last screen more than 5 years previously). FFDM: Full field digital mammography. *Fields that have incomplete year data.
4.4.4 Database content - Ethnicity

Ethnicity information is sparsely available within the NBSS output from Cambridge and no information was available from Norwich (Table 4-6). A similarly limited level of ethnicity data availability was found when searching EPIC, the EHR system at Cambridge (Figure 4-8). This limited availability of data meant it was not possible in the studies detailed in Chapters 5-7 to evaluate AI tools for bias relating to ethnicity. In addition, as the data included in this study is only from two sites in East Anglia, England, it is not representative of the UK population. This is further outlined in the 2011 Census of 25 million households in England and Wales, where it was reported “people from the White ethnic group were more likely to live in the South East than any other region” [264].

| Ethnicity code | Cambridge NBSS [n = 40021] | Cambridge EHR EPIC [n = 83662] |
|---|---|---|
| A = White – British | 18200 (45.5%) | 45424 (54.3%) |
| B = White – Irish | 212 (0.5%) | 381 (0.5%) |
| C = White – Any other White background | 770 (1.9%) | 2329 (2.8%) |
| D = Mixed – White and Black Caribbean | 27 (0.07%) | 32 (0.04%) |
| E = Mixed – White and Black African | 19 (0.05%) | 19 (0.02%) |
| F = Mixed – White and Asian | 66 (0.2%) | 78 (0.09%) |
| G = Mixed – Any other Mixed background | 35 (0.09%) | 129 (0.2%) |
| H = Asian or Asian British – Indian | 125 (0.3%) | 350 (0.4%) |
| J = Asian or Asian British – Pakistani | 31 (0.08%) | 75 (0.09%) |
| K = Asian or Asian British – Bangladeshi | 21 (0.05%) | 55 (0.07%) |
| L = Asian or Asian British – Any other Asian background | 122 (0.3%) | 396 (0.5%) |
| M = Black or Black British – Caribbean | 54 (0.1%) | 140 (0.2%) |
| N = Black or Black British – African | 59 (0.2%) | 181 (0.2%) |
| P = Black or Black British – Any other Black background | 6 (0.01%) | 61 (0.07%) |
| R = Other ethnic groups – Chinese | 176 (0.4%) | 420 (0.5%) |
| S = Other ethnic groups – Any other group | 105 (0.3%) | 305 (0.4%) |
| Z = Not stated | 257 (0.6%) | 5193 (6.2%) |
| Missing | 19736 (49.3%) | 28094 (33.6%) |

Table 4-6 – Ethnicity information from National Breast Screening System (NBSS) and
Electronic Health Record (EHR) EPIC data at Cambridge. NBSS: National breast screening system, EHR: Electronic health record.

Figure 4-8 – Ethnicity data distribution at Cambridge using National Breast Screening System (NBSS) and Electronic Health Record (EHR) EPIC data. a) NBSS, b) EPIC EHR. Ethnicity codes are provided in Table 4-6. NBSS: National breast screening system.

4.4.5 Database content - Mammographic breast density

Density is not routinely reported by readers in the NHSBSP. However, the Breast Imaging-Reporting and Data System (BI-RADS) 5th edition density score was obtained for all cases in the CC-MEDIA database. Raw DICOM data was processed by Volpara (research version - VolparaResearch32_L30Enabled_v2, Wellington, New Zealand) to generate the Volumetric Breast Density (VBD) of each case. The VBD was then converted into the Volpara Density Grade, which is consistent with BI-RADS 5th edition. Processed DICOM data was processed by one of the AI algorithm systems (DL-3) used in this research to generate the BI-RADS 5th edition density score for each case. Figure 4-9 shows the distribution of one year's worth (2017) of data from Cambridge where both raw and processed data was available.

Figure 4-9 – Breast imaging-reporting and data system (BI-RADS) 5th edition mammographic density distribution for cases in one year (2017) of data at Cambridge with both raw and processed four-view mammograms available [n = 18246]. a) Volpara raw density distribution, b) DL-3 processed density distribution. Cases with breast implants were removed from this dataset.

The Volpara density distribution is similar to the population mammographic density distribution reported in previous publications [75,265], whereas the density distribution from DL-3 using processed data was shifted to the left, giving overall lower density assessments for cases.
4.4.6 Database content - Histopathological information

The clinical metadata collected from each site included the invasive status (ICD-10 code), histological grade (assigned using the Nottingham grading system), and histological size for cancer cases. Where there were gaps in the data, the missing data was collected by hand from histopathology reports on the EHR systems at each site. Histopathological information was taken from the surgical pathology report where available. If the surgical histopathology was unavailable, the core biopsy histopathology was used. Information regarding the use of neoadjuvant chemotherapy and hormone therapy was not available alongside this information, and so the reported histopathological size and grade could differ from those at the time of diagnosis for some cases. Furthermore, cancers diagnosed during the Covid-19 pandemic were treated with increased use of hormone therapy whilst the availability of operations was limited; this would also have an impact on the histopathological size and grade of cancers.

4.5 Technical setup of an AI algorithm testing environment

An AI algorithm testing environment was set up at the University of Cambridge (developed by Richard Black, Medical Physicist at Cambridge University Hospitals NHS Foundation Trust). Two computers were available in this environment with the following technical setup:
• System 1 – OpenSUSE Leap 15.3 operating system, 12 central processing units (CPU), 32 GB random access memory (RAM).
• System 2 – OpenSUSE Leap 15.3 operating system, 56 CPU, 1024 GB RAM, 3 NVIDIA Quadro RTX 8000 graphics cards.

On both systems the following software was installed to allow for company installation as well as data processing: TeamViewer, Docker, dcmtk 3.6.5, libvirtd v7.1.0, and qemu-kvm v5.2.0 (virtualisation).

4.6 Uses of the database

To date the database has been used for the following research applications.
Those applications with an asterisk (*) next to them are detailed in the remaining chapters of this thesis; the remaining applications are part of ongoing work by other researchers.
• *Benchmark existing AI algorithms for interval and next round cancer detection – Chapters 5, 6 and 7
• *Benchmark existing AI algorithms for stand-alone cancer detection – Chapter 6
• *Benchmark existing AI algorithms for screening triage – Chapter 7
• *To assess the relationship between AI algorithm accuracy and mammographic breast density – Chapters 5, 6 and 7
• To evaluate the accuracy of AI algorithm prompt location for the detection of cancer
• To evaluate breast density tools for both raw and processed data
• To evaluate breast cancer screening risk stratification tools
• To evaluate the impact of prior image availability on AI algorithm performance

4.7 Discussion

4.7.1 Overall discussion

Developing a large multi-site mammographic imaging database is a complex task, involving numerous governance, approval and technical setup requirements. The involvement of patients and the public in the setup highlighted the importance of clear communication regarding access and processing of data, as well as the acceptability of using data without consent and with commercial collaborators for this type of research. In addition, the formation of the DAC means the data is treated with a high level of governance oversight from staff with expertise at both sites to ensure the security and correct use of the data in research. The systematic collection of a large representative cohort for breast cancer screening provides an extensive resource for AI algorithm benchmarking as well as for feedback to AI companies regarding their performance to allow for the further development of algorithms.
The availability of SDCs as well as ICs and next round cancers (NRCs) over the ten-year study time period allows for the robust assessment of algorithms for the detection of cancers as well as the potential for the earlier detection of cancer. Using this database, AI algorithms can be tested for numerous applications including stand-alone detection and normal case triage in a UK screening setting. Another advantage of the database is the inclusion of raw data at one site, allowing for the calculation of mammographic breast density, which is not routinely reported within the NHSBSP. This database is of similar size to recently developed databases in the UK, USA and Sweden, and overcomes the limitation of early mammographic databases, which were small in size and only contained screen film mammography.

4.7.2 Limitations

However, this database is limited to East Anglia and thus not representative of the entire UK population in terms of demographics. Furthermore, the ethnicity information available at both sites is too limited to provide sufficient data for subgroup analysis to evaluate AI algorithm performance and detect bias. In addition, this database does not have any image level / pixel level annotations at present, and so it is not possible to evaluate the precision of AI algorithm prompt locations, which are provided alongside continuous case score outputs. Lastly, the overlap with OMI-DB must be taken into account when selecting cases for algorithm testing, by removing cases identified as being extracted into OMI-DB, as these cases may have been used for model training.

4.8 Conclusion

The CC-MEDIA database is a large (127,000 case) mammographic medical imaging database that is representative of the NHSBSP in case distribution. The clinical metadata available provides a robust method to identify the ground truth for different case cohorts when testing various applications of AI algorithms in breast cancer screening.
The governance of the database by the DAC ensures the security of the data and that robust protocols are followed when sharing data. Collecting data from a ten-year period provides sequential screening information, which is vital for testing numerous applications of AI algorithms for breast cancer screening. However, this data is limited to one region of the UK only and thus is not completely representative of the diverse UK population in terms of ethnicity and socio-economic factors.

Chapter 5 – Performance of artificial intelligence algorithms for interval cancer detection

5.1 Aims

In this chapter the performance of three commercial AI algorithms is investigated for the detection of interval cancers, using an enriched dataset from two UK screening sites. This study evaluated the potential benefit from AI algorithms for the earlier detection of breast cancer. Interval cancer literature and screening programme standards were used to pre-define thresholds for the AI algorithms' operating points, and all algorithms were tested independently of the commercial vendor. The results from this study will help the planning of both retrospective and prospective studies for the use of AI algorithms as stand-alone readers. Contents of this chapter have been presented at the Radiological Society of North America conference 2021 [abstract ID - #2021-SP-12762-RSNA] and accepted for presentation at the European Congress of Radiology 2022 [abstract number - #12040].

5.2 Introduction

Breast cancer screening programmes aim to detect breast cancer at an earlier stage when the cancer is asymptomatic, which has been shown to improve both morbidity and mortality outcomes [266]. Interval cancers (ICs) occur in the time period between screening rounds. In the UK, which operates a triennial programme, the acceptable IC rate is set at 3.7/1000 women screened [101,103]. Overall the survival outcomes of ICs are worse than those of screen detected cancers [102].
It is estimated that ~77% of ICs could not be seen at screening (normal / benign), ~16% have minimal signs (uncertain) and ~7% were visible (suspicious) [103]. Duty of candour is defined as a healthcare professional's responsibility to be “honest with patients and people in their care when something that goes wrong with their treatment or care causes, or has the potential to cause, harm or distress”; thus all ICs classified as false negative (suspicious) in the UK programme at the IC audit are required to be disclosed to patients [103,267]. There are numerous reasons ICs are not detected at screening. These include the cancer not being present at the time of screening and developing in the interval, low sensitivity of mammography (especially in dense breasts due to masking), cancer radiological appearance (this can be either a stable appearance or the signs can be minimal on mammography), and perception or interpretation error (either not seen or seen and dismissed) [268,269].

Artificial intelligence (AI) algorithms for detection and diagnosis tasks (CADe+x) have demonstrated good performance for screen detected cancers (SDCs) [133]. However, as highlighted in the 2021 UK National Screening Committee (NSC) report, evidence for the use of AI systems for IC detection is scarce, especially for UK data [136]. Lång et al. tested one AI algorithm (Transpara v1.5.0) using a dataset of 429 ICs from five years of Swedish screening data, and found the AI algorithm could detect 11.2% of potentially visible cancers at the previous screen at a 4.0% recall rate [270]. In addition, 28.4% of minimal sign or false negative cancers were correctly located by the AI prompts. Of the ICs detected at a risk score of 10 (the highest category score), 23.0% of patients died or developed stage IV disease, and thus these were clinically significant cancers [270]. Larsen et al. tested an updated version (v1.7.0) of the AI algorithm used in Lång et al., and applied it to a large Norwegian dataset of more than 47,000 women, containing 205 ICs [271].
Larsen et al. found 44.9% of ICs were detected at a risk score of 10 (a 10.0% recall rate), and 30.7% at a 5.8% recall rate. Hinton et al. applied a ResNet50 architecture algorithm to a dataset of 182 ICs diagnosed within 12 months of screening, with an age and race matched screen detected cancer dataset of 173 cancers, from nine years of US screening. They found an accurate classification of 74.0% for ICs and 77.0% for SDCs [272]. Dembrower et al. tested an AI algorithm (Lunit v5.5.0) using a dataset of seven years of Swedish screening data with 7364 women, which included 200 ICs. They found 12.0-27.0% of ICs had the highest 1.0-5.0% of scores, and the AI score was shown to be a better predictor than automated breast density (LIBRA) for the detection of ICs, OR 2.01 [95% CI 1.98-2.18] and 1.59 [95% CI 1.50-1.68] respectively [134]. Other studies have also included ICs within their datasets but have either not reported the separate performance for ICs or the dataset was small in size [138,273].

This study aimed to provide evidence for the use of AI algorithms for IC detection with UK screening data. In addition, this study aimed to evaluate three commercial AI algorithms using the same large unseen dataset to carry out independent performance benchmarking.

5.3 Methods

5.3.1 Sample size

The required sample size for this study was calculated using the method described in Arkin et al., to determine the minimum number of cases required to estimate the true performance of an algorithm for benchmarking [274]. As described in the literature, it is estimated that 23.0% of ICs were visible at the previous screening (false negatives – suspicious / uncertain), and therefore reference proportions of 20.0% and 30.0% were used [103]. Applying these reference proportions and a 95.0% confidence interval, between 246 and 323 cancers were required for this study.
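As a worked check of these figures, the sketch below uses the standard normal-approximation formula for estimating a single proportion, n = z² p (1 − p) / d². The margin of error d is not stated in the text; a value of ±5 percentage points is assumed here because it reproduces the reported range, and whether this corresponds exactly to the cited method is also an assumption. The function name is hypothetical.

```python
import math
from statistics import NormalDist


def sample_size_for_proportion(p, margin=0.05, confidence=0.95):
    """Minimum n to estimate a proportion p within +/- margin.

    Uses n = z^2 * p * (1 - p) / margin^2, where z is the two-sided
    standard normal quantile for the requested confidence level.
    The 5% default margin is an assumption, not taken from the thesis.
    """
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # ~1.96 for 95%
    return math.ceil(z ** 2 * p * (1 - p) / margin ** 2)


if __name__ == "__main__":
    for reference in (0.20, 0.30):
        print(reference, sample_size_for_proportion(reference))
```

With the assumed ±5% margin, the reference proportions of 20.0% and 30.0% give 246 and 323 cancers respectively, matching the range quoted above.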
5.3.2 Data

Patient data was obtained from the existing CC-MEDIA database described in Chapter 4, where data was collected from two National Health Service Breast Screening Programme (NHSBSP) sites (Cambridge and Norwich) under existing ethical approval (HRA REC 20/LO/0104, HRA CAG 20/CAG/0009, PHE RAC BSPRAC_090). Women aged 47 years or older who attended screening at either site were included. IC cases were identified using the existing cancer registry (CREGX) query on the National Breast Screening System (NBSS) from January 2011 to December 2020 at Cambridge, and January 2011 to May 2021 at Norwich. A Python script (Python Software Foundation, http://www.python.org, version 3.8) [260] was used to query a database of all women screened at each site, to randomly select three age and screening year matched controls for every IC case. The two-view screening Full Field Digital Mammography (FFDM) images for each case were used. Cases were excluded where they did not include the full four views, and, as per each company's manufacturer protocol, images containing an implant, pacemaker or other device were excluded. Cases were also excluded, following a discussion with Public Health England (PHE), if the IC was not a primary breast cancer (e.g. mesothelioma, melanoma, colorectal cancer metastasis). IC radiological classifications were taken from the original screen reader IC audit. Where histopathological data was missing, this was hand searched for using Electronic Health Records (EHR) at each site; for further detail please see Chapter 4 Section 4.4.6. The case selection process is shown in the Standards for Reporting of Diagnostic Accuracy Studies (STARD) diagram in Figure 5-1 [275].

Figure 5-1 – Standards for Reporting of Diagnostic Accuracy Studies (STARD) flow diagram of cases included and excluded in this study.
FFDM: Full Field Digital Mammogram, IC: Interval cancer, NHS: National Health Service, OMI-DB: The Optimam Mammography Image Database, PHE: Public Health England, PACS: Picture Archiving and Communication System.

5.3.3 Ground truth

The ground truth for an IC case was a confirmed histopathological diagnosis within 40 months of screening, as per the NHSBSP definition [101]. ICs were classified by radiologists as part of the routine IC audit using the NHSBSP definitions, which were updated in August 2017 [101]:
• Satisfactory – Normal – “normal or benign mammographic features” and “readers found no reason to recall”.
• Satisfactory with learning points – Uncertain – “seen with hindsight, difficult to perceive, not obviously malignant” and “not all readers would recall. Case may provide learning”.
• Unsatisfactory – Suspicious – “appearance is obviously malignant” with “all readers reviewing the images agree that they would recall. Woman should have been recalled”.

A case was classified as ‘normal’ if there was a routine recall from screening more than 912 days after the initial screen and no breast cancer was detected in this time period. Figure 5-2 provides an overview of the cases included and examples of different IC cases included.

Figure 5-2 – Example of cases included in the study. a) Interval cancer cases selected from the CC-MEDIA cohort were matched at a ratio of 1:3 with normal cases based on year of screen and age at screen. Only cases from 2011-2019 were included because of the required follow-up time period of 912 days, as screening data was available until the end of 2020 at Cambridge and mid 2021 at Norwich, b) an example of a normal / benign classified interval cancer case, c) an example of an uncertain interval cancer case, and d) an example of a suspicious interval cancer case.

5.3.4 AI tools

Three commercial AI algorithms were independently tested.
Each tool was installed within the University of Cambridge research environment, and companies did not have access to their tools during testing or to the results from the study. Details of each algorithm's training, required input and output, as well as operating system, are outlined in Table 5-1.

| Tool | DL-1 | DL-2 | DL-3 |
|---|---|---|---|
| Training Screening Programme Readers - Frequency (%UK) | Double – triennial, Double – biennial, Single – annual (10.0%) | Double – biennial (0.0%) | Double – triennial, Single – biennial, Single – annual (4.3%) |
| OMI-DB | Yes | No | Yes |
| Training Cases n | >200000 | >200000 | >150000 |
| Training Cases Age Range | 40-74 | 50-70 | 18-90 |
| Training Cancers | SDC / IC / NRC | SDC | NA |
| Training Cases Vendors (%) | Hologic (80.0%), GE (10.0%), Siemens (10.0%) | Hologic (41.0%), GE (5.0%), Siemens (36.0%), Philips (< 1.0%), Fuji (7.0%), Agfa (6.0%), Kodak (4.0%) | Hologic (32.3%), GE (65.6%), Siemens (1.7%), Philips (0.3%) |
| Data | Processed FFDM | Processed FFDM | Processed FFDM |
| OS | Linux | Linux | Linux |
| Output | Case level continuous score (0-10)* | Case level continuous score (0-10)* | Case level continuous score (0-10)* |

Table 5-1 – Artificial intelligence (AI) algorithm characteristics. FFDM: Full field digital mammogram, GE: General Electric, IC: Interval cancer, NA: Not available, NRC: Next round cancer, OS: Operating System, OMI-DB: The Optimam Mammography Image Database, SDC: Screen detected cancer. *Output scores were adjusted to the same 0-10 scale.

5.3.5 Thresholds

Three different methods for identifying thresholds were used in this study. All were based on using the AI algorithms at either 96.0% specificity (NHSBSP consensus specificity) or 30.0% sensitivity (estimated visible IC rate), for use as a stand-alone system for IC detection (Figure 5-3b). The first threshold is the ‘pre-specified specificity / sensitivity’ (threshold 1), where the tools are operated at the pre-defined operating points of 96.0% specificity or 30.0% sensitivity within the study data for each site.
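As an illustration of how an operating point at a pre-specified specificity or sensitivity can be read off a set of case scores, the sketch below scans the observed scores for the cut-off that meets the target. It is a minimal example, not the analysis code used in this study (which was carried out in R): the function names, the convention that a case is flagged when its score is greater than or equal to the threshold, and the tie-breaking rules are all assumptions.

```python
import math


def sens_spec(scores, labels, threshold):
    """Return (sensitivity, specificity) when cases with score >= threshold are flagged.

    labels: 1 for cancer, 0 for normal.
    """
    tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= threshold)
    fn = sum(1 for s, y in zip(scores, labels) if y == 1 and s < threshold)
    tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < threshold)
    fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)


def threshold_for_specificity(scores, labels, target=0.96):
    """Lowest cut-off reaching the target specificity (keeps sensitivity as high as possible)."""
    for t in sorted(set(scores)):
        if sens_spec(scores, labels, t)[1] >= target:
            return t
    return math.inf  # flag nothing if no observed score meets the target


def threshold_for_sensitivity(scores, labels, target=0.30):
    """Highest cut-off reaching the target sensitivity (keeps specificity as high as possible)."""
    for t in sorted(set(scores), reverse=True):
        if sens_spec(scores, labels, t)[0] >= target:
            return t
    return min(scores)  # not normally reached: the lowest score flags every case
```

The same functions cover the case where an operating point is identified on one dataset and then applied unchanged to another: the cut-off returned for the first dataset is passed to `sens_spec` together with the scores and labels of the second.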
The second threshold is the ‘identified year operating points’ (threshold 2) for each algorithm, which were found using 10,206 cases (229 cancers (150 SDCs and 79 ICs)) of Cambridge 2017 data from the main CC-MEDIA database. Both the ‘pre-specified specificity / sensitivity’ (threshold 1) and ‘identified year operating points’ (threshold 2) thresholds were then applied to the Cambridge and Norwich data in this study. The last threshold was the ‘identified Cambridge operating points’ (threshold 3) for each algorithm, where the operating point was identified on the study Cambridge data and then applied to the study Norwich data.

Figure 5-3 – Proposed workflow image for testing the artificial intelligence (AI) systems as stand-alone readers for interval cancer (IC) detection. a) Routine UK double reading workflow, b) stand-alone artificial intelligence algorithm reading at 96.0% specificity and 30.0% sensitivity thresholds workflow.

5.3.6 Statistical analysis

All statistical analysis took place in R (R Foundation for Statistical Computing, Vienna, Austria, version 4.0.4)225, using packages: ggplot2, dplyr, tidyr, lme4, pROC, precrec, lubridate, epiR, data.table and VennDiagram276–284. The overall predictive performance of each AI algorithm was evaluated by calculating the area under the receiver operating characteristic curve (AUROC), the proportions of true positives (TP), true negatives (TN), false positives (FP) and false negatives (FN), and the sensitivity and specificity.

$$\text{Sensitivity} = \frac{TP}{TP + FN}, \qquad \text{Specificity} = \frac{TN}{TN + FP}$$

To investigate the variability between sites and mammography machine vendors, the results from each site are reported separately. Data is presented as integer number and percentage (n (%)), or median and interquartile range [IQR 25th – 75th centile range] as appropriate. A multivariable model was created for a combination of all three individual algorithms using a generalised linear mixed effects model and Cambridge data.
This model was then applied to Norwich data to check for overfitting. DeLong’s test was used to assess for a statistically significant difference between the AUROC of individual AI algorithms using 2000 bootstrap samples. Subgroup analysis of IC detection was performed for each algorithm, using both Cambridge and Norwich data, across categories of: age, radiological classification, time interval to diagnosis (months), mammographic machine vendor, invasive status of cancer, invasive tumour size, invasive tumour grade and mammographic breast density. The true integer values and sensitivity were reported, and the chi-squared (χ²) test was used to investigate whether there was a statistically significant difference between categories285. In all analyses, 95.0% confidence intervals are used and p-values < 0.05 were considered statistically significant.

5.3.7 Reporting

Each AI algorithm was assigned a Deep Learning (DL) Identifier (ID) for the purposes of this study. The de-identified results of all algorithms were reported back to the companies prior to publication. Each company's own results were re-identified and presented back to that company; the companies could not alter any reporting or methods used.

5.4 Results

5.4.1 Data

In total 8,452 images from Cambridge and 8,012 images from Norwich were included in the study dataset. The 2,113 cases from Cambridge contained 523 IC cases (24.8%), and the 2,003 cases from Norwich contained 506 IC cases (25.3%). Study case cohort characteristics are provided in Table 5-2.
| | Cambridge Normal Cases n (%) | Cambridge Interval Cancers n (%) | Norwich Normal Cases n (%) | Norwich Interval Cancers n (%) |
|---|---|---|---|---|
| Total Cases n (per site) | 2113 | | 2003 | |
| Age at Screening: Median (per site) | 57.0 [51.0-64.0] | | 60.0 [54.0-67.0] | |
| 47-49 | 187 (8.8%) | 62 (2.9%) | 145 (7.2%) | 52 (2.6%) |
| 50-54 | 430 (20.4%) | 142 (6.7%) | 258 (12.9%) | 89 (4.4%) |
| 55-59 | 261 (12.4%) | 89 (4.2%) | 308 (15.4%) | 105 (5.2%) |
| 60-64 | 318 (15.1%) | 107 (5.1%) | 222 (11.1%) | 77 (3.8%) |
| 65-69 | 298 (14.1%) | 91 (4.3%) | 390 (19.5%) | 129 (6.4%) |
| 70+ | 96 (4.5%) | 32 (1.5%) | 174 (8.7%) | 54 (2.7%) |
| Year of Screen | | | | |
| 2011 | 8 (0.4%) | 7 (0.3%) | 0 (0.0%) | 0 (0.0%) |
| 2012 | 218 (10.3%) | 70 (3.3%) | 10 (0.5%) | 13 (0.7%) |
| 2013 | 231 (10.9%) | 72 (3.4%) | 166 (8.3%) | 66 (3.3%) |
| 2014 | 313 (14.8%) | 105 (5.0%) | 353 (17.6%) | 115 (5.7%) |
| 2015 | 264 (12.5%) | 87 (4.1%) | 293 (14.6%) | 97 (4.8%) |
| 2016 | 198 (9.4%) | 64 (3.0%) | 283 (14.1%) | 91 (4.5%) |
| 2017 | 238 (11.3%) | 76 (3.6%) | 237 (11.8%) | 77 (3.8%) |
| 2018 | 120 (5.7%) | 42 (2.0%) | 153 (7.6%) | 46 (2.3%) |
| 2019 | 0 (0.0%) | 0 (0.0%) | 2 (0.1%) | 1 (0.05%) |
| FFDM Vendor | | | | |
| GE | 26 (1.2%) | 16 (0.8%) | 1490 (74.4%) | 502 (25.1%) |
| Philips | 1501 (71.0%) | 473 (22.4%) | 0 (0.0%) | 0 (0.0%) |
| Hologic | 26 (1.2%) | 15 (0.7%) | 7 (0.3%) | 4 (0.2%) |
| Sectra | 37 (1.8%) | 19 (0.9%) | 0 (0.0%) | 0 (0.0%) |
| Density BI-RADS^a | | | | |
| a | 342 (16.2%) | 46 (2.2%) | 224 (11.2%) | 20 (1.0%) |
| b | 875 (41.4%) | 231 (10.9%) | 894 (44.6%) | 238 (11.9%) |
| c | 368 (17.4%) | 236 (11.2%) | 361 (18.0%) | 232 (11.6%) |
| d | 5 (0.2%) | 10 (0.5%) | 18 (0.9%) | 16 (0.8%) |

Table 5-2 – Summary of testing dataset characteristics. Integer values with percentages in brackets (%) and median with interquartile range in square brackets [IQR] are provided. BI-RADS: Breast imaging-reporting and data system, FFDM: Full Field Digital Mammography, GE: General Electric. ^a DL-3 5th edition BI-RADS density scores on processed Full Field Digital Mammograms.

Of the FFDM images, 48.0% were from Philips, 49.4% from GE, 1.3% from Hologic, and 1.4% from Sectra mammography machines.
The median age of the entire cohort was 59.0 [IQR 53.0–65.3] years and the median time interval between screening and follow-up normal recall was 1071.0 [IQR 1041.0–1105.0] days. IC cases had a median time interval from screening to diagnosis of 690.0 [IQR 465.0–911.0] days at Cambridge and 670.5 [IQR 434.2–880.8] days at Norwich. The majority of IC cases (78.4%) were classified as normal / benign, with 16.6% assigned an uncertain and 3.3% a suspicious classification at the routine IC audit. IC characteristics are provided in Table 5-3 and Table 5-4.

| | Cambridge Interval Cancers n (%) | Norwich Interval Cancers n (%) |
|---|---|---|
| Total Cases n | 523 | 506 |
| Interval (months) | | |
| 0-12 | 85 (16.3%) | 83 (16.4%) |
| 12-24 | 205 (39.2%) | 201 (39.7%) |
| 24-36 | 233 (44.5%) | 222 (43.9%) |
| 36-40 | 0 (0.0%) | 0 (0.0%) |
| Radiological Audit Classification | | |
| Normal / Benign | 429 (82.0%) | 382 (75.5%) |
| Uncertain | 80 (15.2%) | 92 (18.2%) |
| Suspicious | 8 (1.5%) | 27 (5.3%) |
| Unclassifiable | 5 (1.0%) | 5 (1.0%) |
| Missing | 1 (0.2%) | 0 (0.0%) |
| Density BI-RADS^b | | |
| a | 20 (3.8%) | - |
| b | 110 (21.0%) | - |
| c | 114 (21.8%) | - |
| d | 76 (14.5%) | - |
| Missing | 203 (38.8%) | - |

Table 5-3 – Interval cancer (IC) characteristics by case. Integer values with percentages in brackets (%) are provided. BI-RADS: Breast imaging-reporting and data system. ^b Volpara 5th edition BI-RADS mammographic breast density from raw full field digital mammograms, Cambridge data.

One AI algorithm (DL-3) provided a density score based on processed data for the entire cohort, which was used in the study analysis (Table 5-2). Volpara mammographic breast density (research version – VolparaResearch32_L30Enabled_v2, Wellington, New Zealand) was only available for Cambridge cases where the raw mammographic data was available (67.3% of Cambridge cases) (Table 5-3).
| | Cambridge Interval Cancers n (%) | Norwich Interval Cancers n (%) |
|---|---|---|
| Total Lesions n | 535 | 519 |
| Invasive Status | | |
| Invasive | 474 (88.6%) | 490 (94.4%) |
| Non-invasive | 55 (10.3%) | 29 (5.6%) |
| Missing | 6 (1.1%) | 0 (0.0%) |
| Invasive Tumour Size^d (mm) | | |
| < 15 | 114 (24.1%) | 146 (29.8%) |
| >= 15 | 278 (58.6%) | 288 (58.8%) |
| Missing | 82 (17.3%) | 56 (11.4%) |
| Invasive Tumour Grade^d | | |
| 1 | 49 (10.3%) | 65 (13.3%) |
| 2 | 223 (47.0%) | 231 (47.1%) |
| 3 | 183 (38.6%) | 180 (36.7%) |
| Missing | 19 (4.0%) | 14 (2.9%) |

Table 5-4 – Interval cancer (IC) characteristics by lesions. Integer values with percentages in brackets (%) are provided. ^d Invasive lesions only.

5.4.2 Algorithm results

The area under the receiver operating characteristic curve (AUROC) was 0.710 [95% CI 0.691–0.730], 0.713 [95% CI 0.695–0.732], 0.732 [95% CI 0.715–0.750] for DL-1, DL-2, and DL-3 respectively when testing on the entire cohort. When tested on Cambridge data the AUROC was 0.719 [95% CI 0.692–0.746], 0.723 [95% CI 0.698–0.748], 0.726 [95% CI 0.701–0.752], and on Norwich data was 0.713 [95% CI 0.686–0.740], 0.704 [95% CI 0.677–0.730], 0.760 [95% CI 0.736–0.784] for DL-1, DL-2, and DL-3 respectively. ROC curve plots for comparison between sites are shown in Figure 5-4.a. and between AI algorithms in Figure 5-4.b. All algorithms performed similarly on Cambridge and Norwich data. However, the AUROC of DL-3 was statistically significantly greater than that of DL-1 and DL-2 when tested on all data and on Norwich data (p < 0.05).

Figure 5-4 – Receiver operating characteristic (ROC) curves for all three artificial intelligence (AI) algorithms at each site. a) For each artificial intelligence algorithm with the overall results in grey, Cambridge in orange and Norwich in pink, b) for each site with the results for DL-1 in blue, DL-2 in purple, and DL-3 in green.
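As a rough illustration of how an AUROC such as those reported above can be computed from case-level scores, the following Python sketch uses the rank-based (Mann-Whitney) formulation together with a stratified percentile bootstrap for the interval. The analysis in this chapter was carried out in R; the function names, the toy scores and the choice of bootstrap interval are assumptions made purely for illustration and are not the study code or data.

```python
import random


def auroc(cancer_scores, normal_scores):
    """AUROC as the probability that a random cancer case scores higher
    than a random normal case (ties count as one half)."""
    wins = 0.0
    for c in cancer_scores:
        for n in normal_scores:
            if c > n:
                wins += 1.0
            elif c == n:
                wins += 0.5
    return wins / (len(cancer_scores) * len(normal_scores))


def bootstrap_ci(cancer_scores, normal_scores, n_boot=2000, level=0.95, seed=0):
    """Percentile bootstrap interval, resampling within each class.
    Assumed method: the interval method used in the thesis is not stated."""
    rng = random.Random(seed)
    stats = []
    for _ in range(n_boot):
        c = rng.choices(cancer_scores, k=len(cancer_scores))
        n = rng.choices(normal_scores, k=len(normal_scores))
        stats.append(auroc(c, n))
    stats.sort()
    lo_idx = int((1 - level) / 2 * n_boot)
    hi_idx = int((1 + level) / 2 * n_boot) - 1
    return stats[lo_idx], stats[hi_idx]


# Hypothetical scores on the 0-10 scale (not study data).
toy_cancers = [9.0, 8.0, 4.0]
toy_normals = [7.0, 3.0, 2.0, 1.0]
toy_auc = auroc(toy_cancers, toy_normals)  # 11 of the 12 pairs are ranked correctly
toy_ci = bootstrap_ci(toy_cancers, toy_normals)
print(round(toy_auc, 3), toy_ci)
```

Resampling within each class keeps the cancer to normal ratio fixed in every replicate, which suits an enriched case-control style cohort such as the one described here.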
Testing using the ‘pre-specified specificity / sensitivity’ (threshold 1) thresholds on Cambridge data, at 96.0% specificity the sensitivity was 23.7%, 21.6%, 23.1%, and at 30.0% sensitivity the specificity was 93.8%, 93.1%, 93.0% for DL-1, DL-2, and DL-3 respectively. Results are shown in Table 5-5.

| Threshold | a) Sensitivity | a) Specificity | b) Sensitivity | b) Specificity |
|---|---|---|---|---|
| 96.0% specificity (DL-1) | 23.7% [19.0-28.7] | 96.0% | 21.6% [17.0-26.4] | 96.7% [95.3-97.9] |
| 96.0% specificity (DL-2) | 21.6% [17.2-26.4] | 96.0% | 21.8% [17.0-26.4] | 96.0% [94.1-97.3] |
| 96.0% specificity (DL-3) | 23.1% [17.8-27.3] | 96.0% | 20.8% [16.3-26.0] | 96.5% [95.1-97.7] |
| 30.0% sensitivity (DL-1) | 30.0% | 93.8% [91.8-95.6] | 2.9% [1.9-9.6] | 99.9% [99.7-100] |
| 30.0% sensitivity (DL-2) | 30.0% | 93.1% [90.5-94.8] | 2.1% [1.5-5.2] | 100% [99.9-100] |
| 30.0% sensitivity (DL-3) | 30.0% | 93.0% [90.9-95.0] | 0.4% [0.2-5.6] | 100% [99.9-100] |

Table 5-5 – Cambridge data testing of three artificial intelligence (AI) algorithms. a) At the ‘pre-specified specificity / sensitivity’ (threshold 1) for 96.0% specificity, and 30.0% sensitivity, b) at the ‘identified year operating points’ (threshold 2) from Cambridge external year cohort testing. 95.0% confidence intervals are in square brackets [95.0% CI].

Applying the ‘identified year operating points’ (threshold 2), identified on Cambridge 2017 data, to the Cambridge study data at 96.0% specificity gave a specificity of 96.7%, 96.0% and 96.5%, and a sensitivity of 21.6%, 21.8%, 20.8% for DL-1, DL-2, and DL-3 respectively. At 30.0% sensitivity, DL-1, DL-2, and DL-3 specificity was 99.9%, 100.0%, 100.0% and sensitivity was 2.9%, 2.1%, 0.4% respectively. Figure 5-5 shows the distribution of IC cases and normal cases from Cambridge data by the assigned continuous score for each AI algorithm, with the four different operating points used in this study.

Figure 5-5 – Cambridge data testing density plots for each artificial intelligence (AI) algorithm. Interval cancer case distribution is shown in red and normal case distribution is in blue.
The green line represents the ‘pre-specified specificity / sensitivity’ (threshold 1) 96.0% specificity operating point for each algorithm and the orange line the 30.0% sensitivity operating point on Cambridge study data. The purple line represents the ‘identified year operating points’ (threshold 2) 96.0% specificity operating point for each algorithm and the pink line the 30.0% sensitivity operating point.

Applying the ‘pre-specified specificity / sensitivity’ (threshold 1) thresholds on Norwich data, at 96.0% specificity the sensitivity was 23.3%, 16.4%, 27.9%, and at 30.0% sensitivity the specificity was 94.1%, 91.2%, 95.4% for DL-1, DL-2, and DL-3 respectively. The results are shown in Table 5-6.

| Threshold | a) Sensitivity | a) Specificity | b) Sensitivity | b) Specificity | c) Sensitivity | c) Specificity |
|---|---|---|---|---|---|---|
| 96.0% Specificity (DL-1) | 23.3% [18.4-29.1] | 96.0% | 36.8% [31.8-42.3] | 90.0% [87.2-92.5] | 39.3% [34.2-44.7] | 88.6% [86.0-91.3] |
| 96.0% Specificity (DL-2) | 16.4% [13.0-21.0] | 96.0% | 16.2% [12.9-20.1] | 96.3% [94.5-97.9] | 16.2% [12.9-20.1] | 96.3% [94.5-97.9] |
| 96.0% Specificity (DL-3) | 27.9% [22.7-32.8] | 96.0% | 13.8% [9.9-18.2] | 99.0% [98.2-99.7] | 14.8% [11.1-20.0] | 98.8% [98.1-99.6] |
| 30.0% Sensitivity (DL-1) | 30.0% | 94.1% [91.0-95.7] | 6.5% [2.8-11.7] | 99.5% [99.1-99.9] | 47.4% [42.7-52.6] | 83.4% [79.8-87.2] |
| 30.0% Sensitivity (DL-2) | 30.0% | 91.2% [88.6-93.4] | 2.6% [0.0-5.9] | 99.9% [99.7-100] | 24.1% [18.8-28.7] | 93.7% [91.4-95.3] |
| 30.0% Sensitivity (DL-3) | 30.0% | 95.4% [92.9-97.0] | 1.0% [0.2-2.0] | 100% [100-100] | 20.6% [12.6-25.3] | 98.0% [96.7-98.7] |

Table 5-6 – Norwich data testing of three artificial intelligence (AI) algorithms. a) At the ‘pre-specified specificity / sensitivity’ (threshold 1) for 96.0% specificity, and 30.0% sensitivity, b) at the ‘identified year operating points’ (threshold 2) from Cambridge external year cohort testing, c) at the ‘identified Cambridge operating points’ (threshold 3) from Cambridge data in this study. 95.0% confidence intervals are in square brackets [95.0% CI].
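The difference between an operating point identified within a dataset (threshold 1) and one transferred from another dataset (thresholds 2 and 3) can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: the function names and all scores are hypothetical, a case is assumed to be flagged when its score is at or above the cut-off, and the second site is simply given normal scores shifted upwards to mimic a change in score distribution.

```python
def threshold_at_specificity(normal_scores, target_specificity):
    """Smallest score cut-off at which at least the target fraction of
    normal cases falls below the cut-off (flagged if score >= cut-off)."""
    n = len(normal_scores)
    candidates = sorted(set(normal_scores)) + [float("inf")]
    for cutoff in candidates:
        specificity = sum(s < cutoff for s in normal_scores) / n
        if specificity >= target_specificity:
            return cutoff


def sensitivity_specificity(cancer_scores, normal_scores, cutoff):
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    tp = sum(s >= cutoff for s in cancer_scores)
    tn = sum(s < cutoff for s in normal_scores)
    return tp / len(cancer_scores), tn / len(normal_scores)


# Hypothetical scores on a 0-10 scale (not study data).
site_a_normals = [0.25 * i for i in range(25)]        # 0.0 ... 6.0
site_b_normals = [0.25 * i + 0.5 for i in range(25)]  # same shape, shifted upwards
site_b_cancers = [2.0, 5.0, 6.0, 7.5, 9.0]

# Identify the 96% specificity cut-off at site A, then apply it unchanged at site B.
cutoff_a = threshold_at_specificity(site_a_normals, 0.96)
sens_b, spec_b = sensitivity_specificity(site_b_cancers, site_b_normals, cutoff_a)
print(cutoff_a, sens_b, spec_b)
```

In this toy setting the cut-off that gives 96% specificity at the first site gives a lower specificity at the second, which is the kind of shift that columns b) and c) of Table 5-6 are designed to expose.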
Testing using the ‘identified year operating points’ (threshold 2) on Norwich data at 96.0% specificity, the specificity of each AI algorithm was 90.0%, 96.3% and 99.0%, and the sensitivity was 36.8%, 16.2%, 13.8% for DL-1, DL-2, and DL-3 respectively. At 30.0% sensitivity, DL-1, DL-2, and DL-3 specificity was 99.5%, 99.9%, 100% and sensitivity was 6.5%, 2.6%, 1.0% respectively. Applying the ‘identified Cambridge operating points’ (threshold 3) on Norwich data at 96.0% specificity, the specificity was 88.6%, 96.3% and 98.8%, and the sensitivity was 39.3%, 16.2%, 14.8% respectively for DL-1, DL-2, and DL-3. At 30.0% sensitivity, DL-1, DL-2, and DL-3 specificity was 83.4%, 93.7%, 98.0% and sensitivity was 47.4%, 24.1%, 20.6% respectively. Figure 5-6 shows the distribution of IC cases and normal cases from Norwich data by the assigned score for each AI algorithm, with the six different operating points used in this study.

Figure 5-6 – Norwich data testing density plots for each artificial intelligence (AI) algorithm. Interval cancer case distribution is shown in red and normal case distribution is in blue. The green line represents the ‘pre-specified specificity / sensitivity’ (threshold 1) 96.0% specificity operating point for each algorithm and the orange line the 30.0% sensitivity operating point on Norwich study data. The purple line represents the ‘identified year operating points’ (threshold 2) 96.0% specificity operating point for each algorithm and the pink line the 30.0% sensitivity operating point. The red line represents the ‘identified Cambridge operating points’ (threshold 3) 96.0% specificity operating point for each algorithm and the grey line the 30.0% sensitivity operating point.
5.4.3 Combined algorithm results

Combining the performance of all three DL algorithms (DL-1, DL-2, DL-3) using Cambridge data resulted in an AUROC of 0.738 [95% CI 0.713–0.764], which was not statistically significantly different to the performance of the individual AI algorithms (p = 0.302–0.508). At the threshold of 96.0% specificity, the sensitivity of the combined model was 25.4% [95% CI 21.4–30.0]. The contribution to the combined model was similar from DL-1, DL-2 and DL-3. The ROC plot for each model on Cambridge data is shown in Figure 5-7.

Figure 5-7 – Combined model receiver operating characteristic (ROC) curve on Cambridge data compared to individual artificial intelligence (AI) algorithms (DL-1, DL-2, DL-3) performance. Results for DL-1 are in blue, DL-2 in purple, DL-3 in green, and the Combined model in red, with area under the receiver operating characteristic curve values provided for each algorithm.

Applying the combined model to Norwich data also resulted in an AUROC of 0.738, which was statistically significantly different to DL-1, DL-2 and DL-3 (p < 0.05). At the 96.0% specificity operating point the combined model sensitivity on Norwich data was 25.7% [95% CI 21.0–31.6]. Applying the 96.0% operating point from Cambridge testing of the combined model, the sensitivity was 12.8% [95% CI 10.1–15.6] and specificity was 99.1% [95% CI 98.5–99.5]. The ROC plots for each model on Norwich data are shown in Figure 5-8.

Figure 5-8 – Combined model receiver operating characteristic (ROC) curve on Norwich data compared to individual artificial intelligence (AI) algorithms (DL-1, DL-2, DL-3) performance. Results for DL-1 are in blue, DL-2 in purple, DL-3 in green, and the Combined model in red, with area under the receiver operating characteristic curve values provided for each algorithm.
5.4.4 Sub-group analysis

Subgroup analysis on the entire cohort, evaluating each AI algorithm's performance across key IC characteristic parameters at the 96.0% specificity ‘identified year operating points’ (threshold 2), is detailed in Table 5-7 and Table 5-8. Threshold 2 was used in this subgroup analysis because it was found on a separate dataset, reducing the bias of the threshold, and because the same threshold was used for Cambridge and Norwich data for each algorithm. When interpreting these results please refer to Table 5-5 and Table 5-6, which detail the sensitivity and specificity at this threshold. When applying this threshold to Norwich there was a decrease in specificity with an increase in sensitivity for DL-1 and the opposite for DL-3, with the performance of DL-2 remaining stable. Therefore the number of cancers detected by DL-1 at this threshold is greater than that for DL-2 and DL-3, but with the trade-off of decreased specificity. Overall, detection was greater for ICs occurring in the first year and for cancers that were classified as suspicious at the IC audit, for all three AI tools. On the other hand, detection was lower for grade 3 invasive tumours and for invasive tumours less than 15 mm in size; however, due to missing data this analysis is not definitive.
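The subgroup comparisons rely on the chi-squared test of detected against not detected counts across categories. A minimal Python sketch for the two-category case is given below, using the DL-3 invasive tumour size counts reported in Table 5-8. The function name and the option for a continuity correction are assumptions for illustration; because the handling of missing data and any correction applied in the original R analysis are not specified here, the p-value from this sketch need not match the value reported in the table.

```python
import math


def chi_squared_2x2(a, b, c, d, yates=False):
    """Pearson chi-squared test for the 2x2 table [[a, b], [c, d]].
    Returns (statistic, p_value) with one degree of freedom."""
    n = a + b + c + d
    diff = abs(a * d - b * c)
    if yates:
        # Optional continuity correction (assumed, not confirmed by the source).
        diff = max(0.0, diff - n / 2)
    stat = n * diff ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # Survival function of the chi-squared distribution with 1 df.
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


# DL-3 counts by invasive tumour size, as reported in Table 5-8:
# < 15 mm: 34 detected of 260; >= 15 mm: 115 detected of 566.
detected_small, total_small = 34, 260
detected_large, total_large = 115, 566

stat, p_value = chi_squared_2x2(
    detected_small, total_small - detected_small,
    detected_large, total_large - detected_large,
)
print(round(stat, 2), p_value < 0.05)
```

Subgroups with more than two categories (for example vendor or age band) need the general r x c form of the test, which this sketch does not cover.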
| Interval cancer parameter | Total | DL-1 n (%) | p | DL-2 n (%) | p | DL-3 n (%) | p |
|---|---|---|---|---|---|---|---|
| Total Cases n | 1029 | 299 | | 196 | | 179 | |
| Age at Screening | | | | | | | |
| 47-49 | 114 | 25 (21.9%) | 0.328 | 19 (16.7%) | 0.754 | 16 (14.0%) | 0.826 |
| 50-54 | 231 | 60 (26.0%) | | 37 (16.0%) | | 42 (18.2%) | |
| 55-59 | 194 | 62 (32.0%) | | 39 (20.1%) | | 36 (18.6%) | |
| 60-64 | 184 | 49 (26.6%) | | 37 (20.1%) | | 28 (15.2%) | |
| 65-70+ | 306 | 103 (34.5%) | | 64 (22.7%) | | 57 (17.7%) | |
| FFDM Vendor | | | | | | | |
| GE | 518 | 194 (37.5%) | < 0.01 | 85 (16.4%) | < 0.01 | 72 (13.9%) | < 0.01 |
| Philips | 473 | 96 (20.3%) | | 101 (21.4%) | | 98 (20.7%) | |
| Hologic | 19 | 3 (15.8%) | | 4 (21.1%) | | 3 (15.8%) | |
| Sectra | 19 | 6 (31.6%) | | 6 (31.6%) | | 6 (31.6%) | |
| Interval (months) | | | | | | | |
| 0-12 | 168 | 56 (33.3%) | 0.577 | 38 (22.6%) | 0.563 | 39 (23.2%) | 0.199 |
| 13-24 | 406 | 118 (29.0%) | | 76 (18.7%) | | 65 (16.0%) | |
| 25-36 | 455 | 125 (27.5%) | | 82 (18.0%) | | 75 (16.5%) | |
| Radiological Audit Classification | | | | | | | |
| Normal / Benign | 811 | 180 (22.3%) | < 0.01 | 119 (14.7%) | < 0.01 | 103 (12.8%) | < 0.01 |
| Uncertain | 172 | 90 (52.3%) | | 58 (33.9%) | | 58 (33.9%) | |
| Suspicious | 35 | 26 (74.3%) | | 16 (45.7%) | | 14 (41.2%) | |
| Unclassifiable | 10 | 3 (30.0%) | | 3 (30.0%) | | 4 (40.0%) | |
| Density BI-RADS^b | | | | | | | |
| a | 20 | 1 (5.0%) | 0.187 | 3 (15.0%) | 0.752 | 4 (20.0%) | 0.213 |
| b | 110 | 20 (18.2%) | | 20 (18.2%) | | 14 (12.7%) | |
| c | 114 | 32 (28.1%) | | 28 (24.6%) | | 25 (21.9%) | |
| d | 76 | 15 (19.7%) | | 16 (21.1%) | | 21 (27.6%) | |
| Density BI-RADS^a | | | | | | | |
| a | 66 | 13 (19.7%) | 0.093 | 10 (15.2%) | 0.212 | 8 (12.1%) | 0.357 |
| b | 469 | 155 (33.0%) | | 86 (18.3%) | | 75 (16.0%) | |
| c | 468 | 128 (27.4%) | | 99 (21.2%) | | 93 (19.9%) | |
| d | 26 | 3 (11.5%) | | 1 (3.8%) | | 3 (11.5%) | |
| Invasive Status | | | | | | | |
| Invasive | 964 | 283 (29.4%) | 0.578 | 186 (19.3%) | 0.963 | 170 (17.6%) | 0.966 |
| Non-invasive | 84 | 28 (33.3%) | | 16 (19.0%) | | 15 (17.9%) | |

Table 5-7 – Subgroup analysis of cases using all interval cancer (IC) data from both Cambridge and Norwich sites. The total number of interval cancer cases detected at the 96.0% specificity ‘identified year operating points’ (threshold 2) for each artificial intelligence algorithm is reported. Sensitivity is reported in round brackets. BI-RADS: Breast imaging-reporting and data system, FFDM: Full Field Digital Mammography.
^b Volpara 5th edition BI-RADS mammographic breast density from raw full field digital mammogram Cambridge data, ^a DL-3 5th edition BI-RADS scores from processed full field digital mammogram data at both sites. p values were determined using the chi-squared (χ²) test to compare the proportion of interval cancer cases / lesions detected by each artificial intelligence algorithm across the categories of each interval cancer characteristic. p-values < 0.05 were considered statistically significant.

At this threshold there was no statistically significant difference in the cancers detected by each AI tool for patient age, interval to diagnosis, BI-RADS mammographic breast density, invasive status, and grade (p > 0.05). There was, however, a statistically significant difference between radiological classification groups and between mammographic machine vendors for all three AI algorithms (p < 0.05). In addition, there was a statistically significant difference for DL-3 by invasive tumour size (p < 0.05).

| Interval cancer parameter | Total | DL-1 n (%) | p | DL-2 n (%) | p | DL-3 n (%) | p |
|---|---|---|---|---|---|---|---|
| Total Invasive Lesions n | 964 | 283 | | 186 | | 170 | |
| Invasive Tumour Grade^d | | | | | | | |
| 1 | 114 | 37 (32.5%) | 0.171 | 23 (20.2%) | 0.350 | 23 (20.2%) | 0.085 |
| 2 | 454 | 146 (32.2%) | | 97 (21.4%) | | 92 (20.3%) | |
| 3 | 363 | 89 (24.5%) | | 60 (16.5%) | | 49 (13.5%) | |
| Invasive Tumour Size^d (mm) | | | | | | | |
| < 15 | 260 | 71 (27.3%) | 0.337 | 39 (15.0%) | 0.055 | 34 (13.1%) | 0.034 |
| >= 15 | 566 | 180 (31.8%) | | 124 (21.9%) | | 115 (20.3%) | |

Table 5-8 – Subgroup analysis of lesions using all interval cancer (IC) data from both Cambridge and Norwich sites. The total number of interval cancer lesions detected at the 96.0% specificity ‘identified year operating points’ (threshold 2) for each artificial intelligence algorithm is reported. Sensitivity is reported in round brackets. ^d Reported by invasive lesions for size and grade. p values were determined using the chi-squared (χ²) test to compare the proportion of interval cancer lesions detected by each artificial intelligence algorithm across the categories of each interval cancer characteristic.
p-values < 0.05 were considered statistically significant.

The AI algorithms did overlap in the ICs detected. However, the AI algorithms did not identify identical IC cases, as shown in Figure 5-9 for thresholds 1 and 2 at 96.0% specificity.

Figure 5-9 – Proportional Euler diagram of each artificial intelligence (AI) algorithm's interval cancer (IC) detection. a) At threshold 2 (96.0% specificity), using all interval cancer data from both Cambridge and Norwich sites, b) at threshold 1 (96.0% specificity), using all interval cancer data from Cambridge, c) at threshold 1 (96.0% specificity), using all interval cancer data from Norwich.

5.4.5 Failure analysis

A case classified as a suspicious IC that was missed by all methods (human readers and AI algorithms) at the 96.0% specificity threshold 2 is shown in Figure 5-10. This was a case of a 59-year-old patient, diagnosed with a left sided grade 2, 140 mm invasive cancer, 987 days after screening.

Figure 5-10 – False negative case, which was not detected by any of the three commercial artificial intelligence (AI) algorithms. The screen and diagnostic images were annotated by a breast radiologist to show the true location of the cancer.

A case classified as a normal / benign IC that was detected by all AI algorithms, at the 96.0% specificity threshold 2, is shown in Figure 5-11. This was a case of a 52-year-old patient, diagnosed with a left sided grade 3, 10 mm invasive cancer, 789 days after screening.

Figure 5-11 – True positive case, which was detected by all three commercial artificial intelligence (AI) algorithms. The screen and diagnostic images were annotated by a breast radiologist to show the true location of the cancer.

5.5 Discussion

The three commercial AI algorithms performed similarly and maintained acceptable performance at the ‘pre-specified specificity / sensitivity’ (threshold 1) for stand-alone IC detection.
Thus, AI algorithms could play a role in the earlier detection of cancers. When using the algorithms at the same specificity as the screening programme double reader performance (96.0%), 21.6%-23.7% of ICs at Cambridge and 16.4%-27.9% of ICs at Norwich were detected. This is similar to the reported percentage of visible cancers that could have been detected at the previous screen, ~20.0-30.0%103. However, this result was found using the ‘pre-specified specificity / sensitivity’ (threshold 1), which is not the threshold that would be used in routine practice, as this cut-off is drawn from a population without SDCs and from a dataset enriched with IC cases. When transferring operating points identified at one site (Cambridge) using a one-year cohort (2017) with 2.2% cancers (SDC and IC) to both sites, performance was maintained for the Cambridge site, whilst there was a shift in performance for two out of the three algorithms (DL-1 and DL-3) when applied at the Norwich site. A significant shift was seen for all AI algorithms at the 30.0% sensitivity thresholds at both sites, with a very high specificity (99.5%-100.0%) and a very low sensitivity (0.4%-6.5%). This is expected due to the change in cancer proportions between the two datasets. The consistency / reliability of transferring operating points between sites should be monitored and is a key metric of performance. In addition, the dataset used to identify the threshold should be clearly documented in order to allow for monitoring where there is variation between sites, e.g. in mammography machine manufacturer. Based on this analysis, DL-2 demonstrated good generalizability and reliability to other sites, with stable performance at the 96.0% specificity threshold 2 at both Cambridge and Norwich.

Sensitivity and specificity should be stated when using AUROC to report model performance, as demonstrated in this study, where one model (DL-3) achieved the highest AUROC on Cambridge site data. However, as model performance was evaluated at one extreme of the ROC curve (96.0% specificity), in order to avoid an increase in recall rates and thus in the costs of assessment clinics, the sensitivity of the model with the highest AUROC (DL-3) was lower than that of another model (DL-1) at this threshold. Thus, AUROC should not be the only metric reported and should not be the deciding factor of an AI algorithm's performance when undertaking evaluation in a breast screening programme task.

As there is an overlap of the Cambridge database with The Optimam Mammography Image Database (OMI-DB) (2012-2016), detailed in Chapter 4, these cases were identified and removed from this study to ensure that the same cases were not used in training and testing286. Two out of the three companies used OMI-DB in the training of their algorithm, which may explain the good performance on Cambridge Philips data, despite Philips data making up only a small percentage of training. To account for this, the algorithms were tested on the completely independent Norwich dataset, which has never been used for the training of any AI algorithm. DL-3 performance improved when tested on Norwich data compared to Cambridge data; this is likely due to the significant proportion of GE images used in the DL-3 algorithm training.

Combining the three AI algorithms into one model did not significantly increase performance compared to a single algorithm when tested on Cambridge data; however, there was a statistically significant difference between the combined model and all three algorithms when tested on Norwich data. Thus there was no overfitting displayed, and further work is needed to determine if there is an advantage of using different AI systems together for screen reading tasks.
The AI algorithms did not preferentially detect specific IC characteristics, other than for the radiological classification of cases (uncertain and suspicious) and mammographic machine vendor. Importantly, there was no difference found between invasive size and grade categories of ICs detected by the three AI systems, except for one system and invasive size. The algorithms did detect different ICs to each other at thresholds 1 and 2; it may therefore be possible to use these systems in tandem, with all three systems operating independently, to increase IC detection.

This is the first study to compare three separate algorithms for use in IC detection on UK data, as well as using the largest set of IC cases reported, addressing the gap in evidence identified in the NSC report 2021. The results found in this study are similar to the results in Lång et al, where 11.2% of visible ICs were detected and correctly located using Transpara v1.5.0 at a 4.0% recall rate, Larsen et al, where 30.7% of ICs were detected at a 5.8% recall rate using Transpara v1.7.0, and Dembrower et al, where 27.0% of ICs were detected using Lunit v5.5.0 at a 5.0% recall rate134,270,271. McKinney et al reported slightly lower values, with 2.7-9.4% of ICs detected when using an in-house algorithm from Google138. There is a potential role for these systems in guiding supplemental imaging in breast screening, as shown in Wanders et al, where 50.9% of women who developed an IC were identified at 90.0% specificity by combining Transpara v1.6.0 with LIBRA density in an enriched cohort287. In addition, Dembrower et al showed that rule-in triage identified 12.0-27.0% of ICs in the highest scoring 1.0-5.0% of cases, suggesting that a hybrid tailored screening approach could be made feasible by using AI algorithms134.

There are limitations to this study, such as the overall small study cohort without the representative class-imbalance of routine screening.
In addition, the annotations provided by the AI tools were not used to confirm that each tool had correctly located the cancer. This is critical for ICs with no radiological signs, in order to guide further assessment, and contrasts with Lång et al, who confirmed the location for each IC case270. Thus, it is not possible to conclude whether the cancers identified on the basis of the threshold score alone would be detected at assessment without additional correct location prompting by the AI system. In addition, this was a retrospective study and so it is not known how a human reader would behave with a prompt on a cancer that two readers have previously dismissed. Further prospective studies are required to confirm these results. Furthermore, there was missing cancer information at both sites, due both to data not being recorded and to patients not undergoing further investigations or operations when their IC was diagnosed. Lastly, it should also be considered that the invasive size used in the analysis may be subject to change due to the effect of neo-adjuvant chemotherapy, which is not commonly available through the automated extraction of data from NBSS.

5.6 Conclusion

The three AI algorithms were able to detect ICs at the preceding screening mammogram, detecting between 16.0-27.0% of ICs across two UK screening sites at a 96.0% specificity threshold. However, when translating identified operating points from a year cohort from one study site to the other there was a significant variation in performance for two out of the three algorithms, and thus stability must be monitored across sites when translating operating points. It is unknown how readers would react to such cases being flagged where no location information is provided, and what the best deployment route is for algorithms to maintain such performance for IC detection in the real-time screening workflow.
Thus, future prospective studies using the identified operating points across UK screening sites are required, as well as sufficient follow-up to monitor the impact of AI algorithms on IC rates.

Chapter 6 – Performance of stand-alone deep learning algorithms in a UK screening cohort for detection and diagnosis

6.1 Aims

In this chapter three commercial artificial intelligence (AI) algorithms for stand-alone screen reading are investigated using a representative cohort from two UK screening centres. Performance was compared against that of a human reader for three stand-alone reading approaches, and non-inferiority was demonstrated for the AI algorithms at various benchmarks. The inclusion of interval and next round cancers allowed for the robust assessment of AI algorithm performance for the earlier detection of breast cancer. The results from this chapter provided data to plan future prospective studies.

Contents of this chapter have been accepted for presentation at the European Congress of Radiology 2022 [abstract number - #12040] and submitted to the European Society of Breast Imaging conference 2022 [abstract ID - #A-165].

6.2 Introduction

Traditional computer aided detection (CAD) algorithms have been used as clinical decision support systems, predominantly in the USA screening programme. However, CAD systems have been shown to increase the recall rate with little improvement in reader performance, especially for experienced readers125,288. With the increasing improvement in performance of deep learning (DL) methods, it has been proposed that these AI tools could be deployed as computer aided detection and diagnosis (CADe+x) stand-alone systems, either entirely independently or alongside existing readers133. Many stand-alone systems have been tested and shown to be non-inferior to first reader / single reader performance, and even superior in some cases133,289,290.
However, no algorithm has been shown to be superior to the standard double reading performance whilst maintaining acceptable recall rates, suggesting that DL will not replace human reading entirely in these programmes at present133,289. Few algorithms have been compared on the same independent dataset for benchmarking against acceptable performance thresholds137,149. The UK National Screening Committee (NSC) report in 2021 concluded that further evidence, retrospective and prospective, using UK data, is required before these systems are implemented into the National Health Service Breast Screening Programme (NHSBSP)136.

This study aimed to evaluate the performance of three different AI algorithms for stand-alone detection and diagnosis (CADe+x) of breast cancer, using a representative UK screening cohort from two sites, and comparing against UK reader performance thresholds. This will provide evidence for three AI algorithm deployment approaches as well as identifying AI algorithm thresholds for prospective studies.

6.3 Methods

6.3.1 Sample size

The sample size for this study was calculated to determine the minimum number of cases required to reliably detect a meaningful difference between the AI algorithm performance and reader performance. The method was derived from Arkin et al for ‘comparing a variable and a fixed proportion’274. The fixed proportion was determined by the screen readers' sensitivity in the study reported cohorts (2017), in order to determine non-inferiority of the AI algorithm in comparison to the average UK reader performance. The key metrics involved in the calculation are:

• 𝑎 - Reference proportion – which is the average three-year single first reader and double reader sensitivity at the two screening sites, 62.9% and 67.4% respectively
• Effect size – which is the size of difference required to be shown between the groups. This was set at 10%
• 𝑏 - (Reference proportion - Effect size)
• Power - (1-P(Type 2 error)) – which was set to 95% (β = 0.05)
• Significance level - P(Type 1 error) – which was set at 0.025

$$\text{Sample size } n = \left( \frac{Z_{\alpha}\sqrt{\pi_a (1 - \pi_a)} + Z_{\beta}\sqrt{\pi_b (1 - \pi_b)}}{\pi_a - \pi_b} \right)^{2}$$

where $\pi_a$ and $\pi_b$ are the proportions 𝑎 and 𝑏 defined above, and $Z_{\alpha}$ and $Z_{\beta}$ are the standard normal deviates corresponding to the significance level and the power.

The first calculation, to demonstrate that the AI algorithm is as good as the average single first reader (independent) performance over three years (screen detected cancers (SDCs) plus interval cancers (ICs)) when used as a stand-alone system, found that 313 cancers were required for this study. The second calculation, to show the AI algorithm plus a single first reader and arbitration is as good as the average double reading three yearly performance (SDCs plus ICs), found that 300 cancers were required for this study. Therefore a sample size of between 300 and 313 cancers was required for this study.

6.3.2 Data

Patient data was obtained from the CC-MEDIA database described in Chapter 4, where data was collected from two NHSBSP sites (Cambridge and Norwich) under existing ethical approval (HRA REC 20/LO/0104, HRA CAG 20/CAG/0009, PHE RAC BSPRAC_090). All study data was de-identified prior to use in this research. Processed Full Field Digital Mammogram (FFDM) image data (right and left craniocaudal (CC) and mediolateral oblique (MLO) views) and corresponding clinical metadata were retrospectively collected for all women who attended routine three yearly screening between January 1 2017 and December 31 2017, in order to obtain a sufficient sample size. Cases were excluded if they had an incomplete mammogram (fewer than two views of each breast or images not available on the Picture Archiving and Communication System (PACS)), if no ground truth was available, if the case was part of high-risk screening or if the screen was documented as a technical recall.
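The sample size calculation of Section 6.3.1 can be sketched in a few lines of Python. This is an illustration only, not the code used in the study: the function name `cancers_required` is hypothetical, the significance level is treated as one-sided, and the result is rounded to the nearest whole cancer, which is an assumption chosen because it agrees with the figures reported above.

```python
from math import sqrt
from statistics import NormalDist


def cancers_required(reference: float, effect_size: float,
                     alpha: float = 0.025, power: float = 0.95) -> float:
    """Cancers needed to compare a variable proportion with a fixed one.

    Hypothetical helper: `reference` is proportion a and
    `reference - effect_size` is proportion b from Section 6.3.1.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha)  # deviate for the significance level
    z_beta = NormalDist().inv_cdf(power)       # deviate for the power
    p_a = reference
    p_b = reference - effect_size
    numerator = z_alpha * sqrt(p_a * (1 - p_a)) + z_beta * sqrt(p_b * (1 - p_b))
    return (numerator / effect_size) ** 2


# Reference sensitivities from Section 6.3.1, with a 10% effect size.
first_reader_n = cancers_required(0.629, 0.10)   # single first reader
double_reader_n = cancers_required(0.674, 0.10)  # double reading
print(round(first_reader_n), round(double_reader_n))
```

Under these assumptions the two calls give values that round to 313 and 300 cancers, in line with the two calculations described in Section 6.3.1.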
Cancer cases were also removed where they did not meet the specified definition, such as secondary melanoma metastasis recorded as an IC and confirmed following discussions with Public Health England (PHE). IC cases were removed if the interval from screening was recorded as longer than 40 months. As per the AI algorithm manufacturers' documentation, cases with breast implants or pacemakers (including loop recorders) were excluded, as well as any cases where only raw data was available or a pixel error occurred. Examples of artefacts excluded from the study cohort are shown in Figure 6-1.

Figure 6-1 – Mediolateral oblique (MLO) views of mammogram artefacts removed from the study. a) Pacemaker, b) breast implant, c) loop recorder device, d) pixel error.

The study case selection process is shown in a Standards for Reporting of Diagnostic Accuracy Studies (STARD) diagram in Figure 6-2275.

Figure 6-2 – Standards for Reporting of Diagnostic Accuracy Studies (STARD) flow diagram of cases included and excluded in this study. FFDM: Full Field Digital Mammogram, FHx: Family history, IC: Interval cancer, NHS: National Health Service, OMI-DB: The Optimam Mammography Image Database, PHE: Public Health England, PACS: Picture Archiving and Communication System.

One exam was included per patient. None of the exams had previously been seen by any AI algorithm. All images were stored in JPEG Lossless DICOM format and no additional pre-processing occurred other than that performed by the mammography vendor and that performed by the AI algorithm. Corresponding clinical metadata was available for each case, including each reader's decision at screening. Trainee readers were removed from this analysis and replaced with the first and second trained reader decision.
The invasive status (ICD-10 code), histological grade (assigned using the Nottingham grading system) and histological size were obtained using an automated National Breast Screening System (NBSS) query; for further detail please see Chapter 4 Section 4.4.5.

6.3.3 Ground truth

The NHSBSP is a triennial screening programme, thus women are screened every 34-36 months using FFDM. The ground truth for normal cases was defined as a final reader action of routine recall (RR) more than 912 days (30 months) after their previous screen, to account for early recall of women to three-year screening, and a confirmed ‘no cancer diagnosis’ within three years. The study follow-up time period overlaps with the pause in screening during the Covid-19 pandemic, therefore cases were excluded if sufficient follow-up information was not available. We used this definition of a ‘normal’ case to provide a robust ground truth for these cases.

Cancer cases were identified using the existing NBSS queries. All cancers received a confirmed histopathological diagnosis and were classed as either an SDC, next round cancer (NRC), future round cancer (FRC), IC, or next round interval cancer (NRIC), such that:

• SDCs were recalled and diagnosed at the screening episode included in the study, within 90 days (3 months).
• NRCs were recalled at the next screening episode, after the screening episode included in this study.
• FRCs were recalled at the second screening episode, after the screening episode included in this study.
• ICs occurred in the interval following a negative screen, within 1216 days (40 months) of the screening episode, and received a confirmed histopathological diagnosis.
• NRICs occurred less than 1216 days (40 months) after the next round screening episode, and received a confirmed histopathological diagnosis.

6.3.4 AI tools

Three commercial AI algorithms were installed at the University of Cambridge.
Two AI algorithms were hosted in a local environment using a virtual machine connection and one AI algorithm was run using hardware supplied by the AI company. The AI companies did not have access to their algorithms following successful installation and at no time had access to the study data. Details regarding the training data used by each AI algorithm, as well as the technical setup and algorithm output, are outlined in Chapter 5 Table 5-1. Density was calculated using Volpara (research version - VolparaResearch32_L30Enabled_v2, Wellington, New Zealand) and DL-3. The Breast Imaging-Reporting and Data System (BI-RADS) 5th edition density score from both Volpara and DL-3 is reported in this study.

6.3.5 Thresholds

SDCs and ICs, occurring within the three-year screening interval, were classified as cancer cases both in the study and when identifying any study thresholds. Three thresholds were used in this study. The first was set at the single first reader three yearly specificity for the entire study cohort (96.6%) (threshold 1). The second threshold was identified using one year of Cambridge study data (2018) to identify the operating point for each AI algorithm at the first reader specificity performance (96.6%) (threshold 2). The 2018 Cambridge cohort used to identify this threshold consisted of 12,455 cases, of which 239 were cancer cases (183 SDCs (1.5%) and 56 ICs (0.5%)). The third threshold (threshold 3), of 99.0% specificity, was also identified using the Cambridge 2018 data cohort.

Each AI algorithm's performance was then assessed using these three thresholds. Adapted screening reading workflows are outlined in Figure 6-3. Figure 6-3.b shows how the AI algorithm performance alone was compared to the single first independent reader using threshold 1 and threshold 2. Figure 6-3.c shows the combined AI and human reader approach, where the AI algorithm was set at threshold 2 and combined with the single human first reader.
If there was discordance the final action decision was used (either second reader or arbitration) and the overall performance was compared to double reading performance as shown in Figure 6-3.a.

Figure 6-3 – Proposed workflow deployment of a stand-alone computer aided detection and diagnosis (CADe+x) artificial intelligence (AI) algorithm. a) Routine UK double reading workflow, b) stand-alone artificial intelligence algorithm reader, c) single human and artificial intelligence algorithm reader, with arbitration where there is discordance, d) auto recall of cases, not recalled by single human and artificial intelligence algorithm workflow, for cases that score above the artificial intelligence algorithm threshold of 99.0% specificity.

Figure 6-3.d demonstrates the use of the auto recall threshold, where all cases above the 99.0% specificity threshold of the AI algorithm were automatically recalled. Any cases below this threshold but above the 96.6% specificity threshold of the AI algorithm, and those recalled by the first reader, were recalled. Where there was discordance between the AI algorithm and the first reader (not including cases above the 99.0% specificity threshold) cases were referred to arbitration, where the final action decision was taken. The results from this workflow were also compared to double reading performance as shown in Figure 6-3.a.

6.3.6 Statistical analysis

All statistical analysis took place in R version 4.0.4 (R Foundation for Statistical Computing, Vienna, Austria)225, using the packages detailed in Chapter 5 Section 5.3.6. The overall predictive performance of each AI algorithm was evaluated using the area under the receiver operating characteristic (AUROC) curve, the partial AUROC (pAUROC) at 96.0-100% specificity, and the area under the precision recall curve (AUPRC). Due to the imbalanced nature of the data (3% cancers to 97% normal cases) precision and sensitivity were the primary outcome measures for this study.
$$\text{Sensitivity} = \frac{TP}{TP + FN} \qquad \text{Precision} = \frac{TP}{TP + FP}$$

Performance of each AI algorithm was compared to reader performance using a one-sample, one-tailed z-test to determine if the algorithm was non-inferior. The percentage of cancers detected from each category (SDC, IC, NRC, FRC and NRIC) was calculated for the AI algorithm at each threshold. Perturbation analysis took place to test the robustness of each AI algorithm against changes in performance thresholds. A multivariable model was created through the combination of all three individual algorithms using a generalised linear mixed effects model on Cambridge site data. This combined model was then tested using Norwich site data to check for overfitting. DeLong’s test was used to assess for a statistically significant difference between the AUROC curves of individual AI algorithms, using 2000 bootstrap samples.

Finally, sub group analysis took place to evaluate each AI algorithm's performance for SDC and IC detection in the following categories: age at screening, breast density, invasive status, invasive grade and size of cancers, as well as mammographic machine vendor. Further sub group analysis took place for ICs using the interval between screening and diagnosis, as well as the radiological audit classifications assigned to each case. A chi-squared (χ²) test was used to investigate whether there was a statistically significant difference between categories285.

In all analyses, p-values < 0.05 were considered statistically significant and 95% confidence intervals were calculated using bootstrapping with 2000 samples or through an approximation method from Simel et al using the epiR package291.

6.3.7 Reporting

Each AI algorithm was assigned a DL-ID for the purposes of this study. For additional details please refer to section 5.3.7 in Chapter 5. This study is reported in accordance with The Checklist for Artificial Intelligence in Medical Imaging (CLAIM) criteria167.
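The threshold identification of Section 6.3.5 and the combined workflows of Figure 6-3.c and 6-3.d can be illustrated with a short Python sketch. It is not the study's R code: the names `operating_point` and `final_action` are hypothetical, the scores are synthetic, and the rule that a case is flagged when its score is strictly above the cut-off is an assumption.

```python
from math import ceil
from typing import Optional, Sequence


def operating_point(normal_scores: Sequence[float], target_specificity: float) -> float:
    """Smallest cut-off at which at least the target share of normal cases score at or below it."""
    ranked = sorted(normal_scores)
    # round() guards against floating point noise before taking the ceiling
    needed = ceil(round(target_specificity * len(ranked), 9))
    return ranked[max(needed, 1) - 1]


def final_action(reader_recall: bool, ai_score: float, threshold_2: float,
                 arbitration_recall: bool, threshold_3: Optional[float] = None) -> bool:
    """True if the case is recalled under workflow c, or workflow d when threshold_3 is given."""
    if threshold_3 is not None and ai_score > threshold_3:
        return True                   # workflow d: auto recall above the 99.0% specificity point
    ai_recall = ai_score > threshold_2
    if ai_recall == reader_recall:
        return reader_recall          # concordant decisions stand
    return arbitration_recall         # discordance: the final action decision is used


# Synthetic scores for 1000 normal cases, standing in for one year of tuning data.
normals = [i / 100 for i in range(1000)]
threshold_2 = operating_point(normals, 0.966)
threshold_3 = operating_point(normals, 0.990)
specificity = sum(s <= threshold_2 for s in normals) / len(normals)

# A case the reader did not recall, scored very highly by the algorithm.
recalled = final_action(False, 9.95, threshold_2, arbitration_recall=False,
                        threshold_3=threshold_3)
```

Fixing the cut-offs on one cohort and then applying them unchanged to another mirrors how threshold 2 and threshold 3 were identified on Cambridge 2018 data before being applied to the study cohort.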
6.4 Results

6.4.1 Data

In total 26,722 cases were included in this study: 11,924 cases (44.6%) were from Cambridge and 14,798 cases (55.4%) were from Norwich. Patient characteristics of the study cohort are shown in Table 6-1. The median age for the entire cohort was 59.0 [IQR 54.0–63.0].

 | Cambridge n (%) | Norwich n (%)
Total Cases n | 11924 | 14798
FFDM Vendor: GE | 121 (1.0%) | 14798 (100%)
FFDM Vendor: Philips | 11803 (99.0%) | 0 (0.0%)
Age at Screening: Median [IQR] | 57.0 [54.0-63.0] | 59.0 [55.0-64.0]
Age at Screening: 47-49 | 13 (0.1%) | 70 (0.5%)
Age at Screening: 50-54 | 4002 (33.6%) | 2958 (20.0%)
Age at Screening: 55-59 | 2802 (23.5%) | 5290 (35.8%)
Age at Screening: 60-64 | 2928 (24.6%) | 2915 (19.7%)
Age at Screening: 65-69 | 1826 (15.3%) | 2787 (18.8%)
Age at Screening: 70+ | 353 (3.0%) | 778 (5.3%)

Density BI-RADS | Cambridge Volpara (b) | Cambridge DL-3 (a) | Norwich DL-3 (a)
a | 1968 (16.5%) | 2755 (23.1%) | 2353 (15.9%)
b | 5474 (45.9%) | 6568 (55.1%) | 8660 (58.5%)
c | 3247 (27.2%) | 2548 (21.4%) | 3614 (24.4%)
d | 1217 (10.2%) | 53 (0.4%) | 171 (1.2%)
Missing | 18 (0.2%) | 0 (0.0%) | 0 (0.0%)

Cancers | Cambridge n | Cambridge rate per 1000 screens | Norwich n | Norwich rate per 1000 screens
SDC | 152 | 8.1/1000 | 180 | 7.9/1000
IC | 84 | 4.5/1000 | 90 | 3.9/1000
NRC | 99 | 7.5/1000 | 155 | 9.6/1000
FRC | 0 | - | 1* | -
NRIC | 13* | - | 15* | -

Table 6-1 – Summary of testing dataset characteristics. Integer values with percentages in brackets (%) and median with interquartile range in square brackets [IQR] are provided. BI-RADS: Breast imaging-reporting and data system, FRC: Future round cancer, FFDM: Full Field Digital Mammography, GE: General Electric, IC: Interval cancer, NRC: Next round cancer, NRIC: Next round interval cancer, SDC: Screen detected cancer. *Rate was calculated by the total number of women screened that year; there was incomplete follow-up time period information from which to calculate an accurate rate for these groups. (a) DL-3 5th edition BI-RADS density scores on processed full field digital mammograms. (b) Volpara 5th edition BI-RADS mammographic breast density from raw full field digital mammograms for Cambridge data.
The majority of Cambridge cases were classed as b / c BI-RADS density, which was consistent with the expected distribution across the reported screening population75. In total 506 three-year cancer cases (SDCs and ICs) were included in this study, 236 from Cambridge and 270 from Norwich. A total of 254 NRC cases were also included. The characteristics of the SDC and NRC cases in the study cohort are shown in Table 6-2.

 | Cambridge SDC n (%) | Cambridge NRC n (%) | Norwich SDC n (%) | Norwich NRC n (%)
Total Cases n | 152 | 99 | 180 | 155
Total Lesions n | 159 | 104 | 184 | 161
Round length* (λ) [IQR] | 35.6 [35.1-36.1] | 41.7 [36.7-45.2] | 35.2 [35.1-36.1] | 39.0 [35.5-39.8]
Age at Screening (λ): Median [IQR] | 62.0 [56.0-67.0] | 59.0 [54.0-65.0] | 64.0 [59.0-68.0] | 60.0 [56.0-65.0]
Age at Screening (λ): 47-49 | 0 (0.0%) | 1 (1.0%) | 0 (0.0%) | 0 (0.0%)
Age at Screening (λ): 50-54 | 37 (24.3%) | 29 (29.3%) | 15 (8.3%) | 20 (12.9%)
Age at Screening (λ): 55-59 | 25 (16.5%) | 22 (22.2%) | 36 (20.0%) | 50 (32.3%)
Age at Screening (λ): 60-64 | 29 (19.1%) | 18 (18.2%) | 40 (22.2%) | 34 (21.9%)
Age at Screening (λ): 65-69 | 44 (29.0%) | 22 (22.2%) | 56 (31.1%) | 35 (22.6%)
Age at Screening (λ): 70+ | 17 (11.2%) | 7 (7.1%) | 33 (18.3%) | 16 (10.3%)
Invasive Status: Invasive | 134 (84.3%) | 86 (82.7%) | 152 (82.6%) | 136 (84.5%)
Invasive Status: Non-invasive | 24 (15.1%) | 18 (17.3%) | 30 (16.3%) | 25 (15.5%)
Invasive Status: Missing | 1 (0.6%) | 0 (0.0%) | 2 (1.1%) | 0 (0.0%)
Invasive Tumour Size (δ): < 15 mm | 73 (54.5%) | 40 (46.5%) | 87 (57.2%) | 71 (52.2%)
Invasive Tumour Size (δ): >= 15 mm | 59 (44.0%) | 34 (39.5%) | 61 (40.1%) | 55 (40.4%)
Invasive Tumour Size (δ): Missing | 2 (1.5%) | 12 (14.0%) | 4 (2.6%) | 10 (7.4%)
Invasive Tumour Grade (δ): 1 | 23 (17.2%) | 10 (11.6%) | 44 (29.0%) | 37 (27.2%)
Invasive Tumour Grade (δ): 2 | 76 (56.7%) | 60 (69.8%) | 81 (53.3%) | 65 (47.8%)
Invasive Tumour Grade (δ): 3 | 30 (22.4%) | 11 (12.8%) | 26 (17.1%) | 27 (19.9%)
Invasive Tumour Grade (δ): Missing | 5 (3.7%) | 5 (5.8%) | 1 (0.7%) | 7 (5.1%)

Density BI-RADS (λ) | Cambridge SDC Volpara (b) | Cambridge SDC DL-3 (a) | Cambridge NRC Volpara (b) | Cambridge NRC DL-3 (a) | Norwich SDC DL-3 (a) | Norwich NRC DL-3 (a)
a | 18 (11.8%) | 30 (19.7%) | 13 (13.1%) | 22 (22.2%) | 15 (8.3%) | 19 (12.3%)
b | 79 (52.0%) | 92 (60.5%) | 52 (52.5%) | 50 (50.5%) | 116 (64.4%) | 89 (57.4%)
c | 43 (28.3%) | 29 (19.1%) | 24 (24.2%) | 27 (27.3%) | 49 (27.2%) | 47 (30.3%)
d | 12 (7.9%) | 1 (0.7%) | 10 (10.1%) | 0 (0.0%) | 0 (0.0%) | 0 (0.0%)

Table 6-2 – Cancer characteristics by lesions and cases. Integer values with percentages in brackets (%) are provided. (δ) Invasive lesions only. (λ) Cases only. BI-RADS: Breast imaging-reporting and data system, NRC: Next round cancer, SDC: Screen detected cancer. (a) DL-3 5th edition BI-RADS density scores on processed full field digital mammograms. (b) Volpara 5th edition BI-RADS mammographic breast density from raw full field digital mammograms for Cambridge data. *Round length is shown in months from the previous screen; cases without a previous screen or screened more than six years previously are removed from this analysis. The round length was increased for next round cancers due to the pause in screening during the Covid-19 pandemic, which is described in Chapter 4.

In total 174 ICs, 84 from Cambridge and 90 from Norwich, were included in the study cohort. The characteristics of the IC cases in the study cohort are shown in Table 6-3. The median time to diagnosis was 825.5 [IQR 531.0–1002.5] days for all ICs at Cambridge and 725.5 [IQR 486.8–964.0] days at Norwich.
 | Cambridge IC n (%) | Norwich IC n (%)
Total Cases n | 84 | 90
Total Lesions n | 86 | 100
Age at Screening (λ): Median [IQR] | 58.0 [54.0-65.3] | 62.0 [55.0-68.0]
Age at Screening (λ): 47-49 | 0 (0.0%) | 0 (0.0%)
Age at Screening (λ): 50-54 | 27 (32.1%) | 17 (18.9%)
Age at Screening (λ): 55-59 | 19 (22.6%) | 23 (25.6%)
Age at Screening (λ): 60-64 | 14 (16.7%) | 13 (14.4%)
Age at Screening (λ): 65-69 | 14 (16.7%) | 21 (23.3%)
Age at Screening (λ): 70+ | 10 (11.9%) | 16 (17.8%)
Invasive Status: Invasive | 74 (86.0%) | 93 (93.0%)
Invasive Status: Non-invasive | 9 (10.5%) | 6 (6.0%)
Invasive Status: Missing | 3 (3.5%) | 1 (1.0%)
Invasive Tumour Size (δ): < 15 mm | 16 (21.6%) | 36 (38.7%)
Invasive Tumour Size (δ): >= 15 mm | 48 (64.9%) | 48 (51.6%)
Invasive Tumour Size (δ): Missing | 10 (13.5%) | 9 (9.7%)
Invasive Tumour Grade (δ): 1 | 10 (13.5%) | 18 (19.4%)
Invasive Tumour Grade (δ): 2 | 33 (44.6%) | 40 (43.0%)
Invasive Tumour Grade (δ): 3 | 31 (41.9%) | 32 (34.4%)
Invasive Tumour Grade (δ): Missing | 0 (0.0%) | 3 (3.2%)
Interval (months) (λ): 0-12 | 13 (15.5%) | 18 (20.0%)
Interval (months) (λ): 12-24 | 23 (27.4%) | 28 (31.1%)
Interval (months) (λ): 24-36 | 48 (57.1%) | 44 (48.9%)
Interval (months) (λ): 36-40 | 0 (0.0%) | 0 (0.0%)
Radiological Audit Classification (λ): Normal/ Benign | 69 (82.1%) | 60 (66.7%)
Radiological Audit Classification (λ): Uncertain | 11 (13.1%) | 28 (31.1%)
Radiological Audit Classification (λ): Suspicious | 0 (0.0%) | 0 (0.0%)
Radiological Audit Classification (λ): Unclassifiable | 0 (0.0%) | 1 (1.1%)
Radiological Audit Classification (λ): Missing | 4 (4.8%) | 1 (1.1%)

Density BI-RADS (λ) | Cambridge Volpara (b) | Cambridge DL-3 (a) | Norwich DL-3 (a)
a | 8 (9.5%) | 11 (13.1%) | 2 (2.2%)
b | 34 (40.5%) | 43 (51.2%) | 51 (56.7%)
c | 29 (34.5%) | 29 (34.5%) | 36 (40.0%)
d | 12 (14.3%) | 1 (1.2%) | 1 (1.1%)
Missing | 1 (1.2%) | 0 (0.0%) | 0 (0.0%)

Table 6-3 – Interval cancer (IC) characteristics by lesions and cases. Integer values with percentages in brackets (%) are provided. Invasive Tumour Size in millimetres (mm). BI-RADS: Breast imaging-reporting and data system, IC: Interval cancer. (δ) Invasive lesions only. (λ) Cases only. (a) DL-3 5th edition BI-RADS density scores on processed full field digital mammograms. (b) Volpara 5th edition BI-RADS mammographic breast density from raw full field digital mammograms for Cambridge data.

The majority of IC cases were classified as normal / benign, in keeping with the reported UK distribution103. No cases were classified as suspicious, which was likely to be due to the change in national reporting of interval cancers in 2017101.
6.4.2 Algorithm results

The overall AUROC for DL-1, DL-2 and DL-3 was 0.868 [95% CI 0.849–0.887], 0.885 [95% CI 0.869–0.902] and 0.894 [95% CI 0.878–0.910] respectively. ROC curves for each AI algorithm are shown in Figure 6-4. All algorithms maintained a similar AUROC performance for both sites (p > 0.05). The AUROC of DL-3 was statistically significantly different to DL-1 on all and Norwich data, and DL-2 on Norwich data (p < 0.05). DL-1 and DL-2 were also statistically significantly different to each other on all and Norwich data (p < 0.05). The comparator ROC curves for each site are shown in Figure 6-5.

Figure 6-4 – Receiver operating characteristic (ROC) curves per artificial intelligence (AI) algorithm. The overall results are in grey, Cambridge in orange and Norwich in pink. Area under the receiver operating characteristic curve values are provided for each site.

The pAUROC, from 96.0% to 100% specificity, for DL-1, DL-2 and DL-3 was 0.744 [95% CI 0.723–0.764], 0.739 [95% CI 0.720–0.760] and 0.774 [95% CI 0.754–0.794] respectively on all data.

Figure 6-5 – Receiver operating characteristic (ROC) curves per site. a) receiver operating characteristic curves per site, with the area under the receiver operating characteristic curve values provided for each algorithm, b) partial receiver operating characteristic curves to show the performance of each artificial intelligence algorithm between 95.0% and 100% specificity at each site. The results for DL-1 are in blue, DL-2 in purple, and DL-3 in green. A pink triangle represents the first reader performance, and a red diamond represents the overall double reader performance at each site.

The pAUROC when tested on Cambridge data was lower for all algorithms: 0.739 [95% CI 0.709–0.769], 0.737 [95% CI 0.709–0.767] and 0.759 [95% CI 0.730–0.787] for DL-1, DL-2 and DL-3 respectively.
On Norwich data all algorithms achieved a higher pAUROC than on all data and on Cambridge data, with DL-1 achieving a pAUROC of 0.761 [95% CI 0.732–0.789], DL-2 0.741 [95% CI 0.714–0.769], and DL-3 0.791 [95% CI 0.762–0.818]. The pAUROC of DL-3 was statistically significantly different (p < 0.05) when compared to DL-1 and DL-2 on all, Cambridge and Norwich data. The pAUROC of DL-1 was also statistically significantly different (p < 0.05) compared to DL-2 on Norwich data.

The overall AUPRC for DL-1, DL-2 and DL-3 was 0.440, 0.407 and 0.513 respectively, Figure 6-6. The drop in DL-2 and DL-3 precision, shown in the precision recall curves (PRC), was due to either missing a true positive case or including more false positives at a high recall threshold. Although both curves recover, the curve for DL-2 remains consistently lower than DL-3.

Figure 6-6 – Precision recall curves (PRC). For DL-1 in blue, DL-2 in purple and DL-3 in green.

When the AI algorithm threshold was set at the first screen reader specificity (96.6%) (threshold 1), DL-1, DL-2 and DL-3 were non-inferior relative to the single first reader, as shown in Table 6-4. DL-3 was also non-inferior to the double reader sensitivity. The AI algorithms detected more NRCs (Δ +4.5% to +9.9%) and ICs (Δ +5.2% to +8.0%) compared to the first reader, when these systems were used as stand-alone CADe+x readers. However, the number of SDCs found by all AI algorithms was less than the first reader (Δ -4.9% to -10.9%).

At the identified threshold 2, DL-2 maintained performance and was non-inferior to the single first human reader. However, the sensitivity of DL-1 improved with the trade-off of reduced specificity, and the opposite was found for DL-3, whilst the sensitivity of both algorithms remained non-inferior to the first reader performance, Table 6-5. DL-1 sensitivity was also non-inferior to the double reader performance.
All three AI algorithms were able to detect a greater proportion of ICs and NRCs at both threshold 1 (96.6% specificity) and threshold 2 (Cambridge 2018 first reader 96.6% specificity performance) for the earlier detection of cancer compared to the human reader workflows, offsetting the reduced rate of SDCs.

 | Double reader | First reader | DL-1 | DL-2 | DL-3
AUROC | - | - | 0.868 | 0.885 | 0.894
pAUROC | - | - | 0.744 | 0.739 | 0.774
AUPRC | - | - | 0.440 | 0.407 | 0.513
Sensitivity | 67.4% [63.1-71.5] | 62.9% [58.5-67.1] | 57.7% [53.4-62.1], p = 0.016, non-inferior | 57.5% [53.2-61.9], p = 0.02, non-inferior | 62.5% [58.3-66.8], p < 0.01, non-inferior
Specificity | 97.1% | 96.6% | 96.6% | 96.6% | 96.6%
Precision | 31.3% | 26.0% | 24.7% | 24.6% | 26.2%
Recall Rate | 4.1% | 4.6% | 4.4% | 4.4% | 4.5%
Cancers: SDC n (%) | 332 (100%) | 302 (91.0%) | 266 (80.1%) | 266 (80.1%) | 286 (86.1%)
Cancers: IC n (%) | 9 (5.2%) | 16 (9.2%) | 26 (14.9%) | 25 (14.4%) | 30 (17.2%)
Cancers: NRC n (%) | 10 (3.9%) | 13 (5.1%) | 31 (12.2%) | 32 (12.6%) | 31 (12.2%)
Cancers: FRC n (%) | 0 (0.0%) | 0 (0.0%) | 1 (100%) | 1 (100%) | 1 (100%)
Cancers: NRIC n (%) | 2 (7.1%) | 3 (10.7%) | 2 (7.1%) | 2 (7.1%) | 1 (3.6%)

Table 6-4 – Stand-alone artificial intelligence (AI) algorithm application compared to the single first reader – threshold 1. All algorithm thresholds set at the first reader, 96.6% specificity (threshold 1). AUROC: Area under the receiver operating characteristic curve, AUPRC: Area under the precision recall curve, FRC: Future round cancer, IC: Interval cancer, NRC: Next round cancer, NRIC: Next round interval cancer, pAUROC: Partial area under the receiver operating characteristic curve, SDC: Screen detected cancer. 95.0% confidence intervals are shown in square brackets [95.0% CI]. p values are calculated using a one-sided z-test.
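A one-sample, one-tailed z-test of the kind used for the non-inferiority comparisons in Table 6-4 can be sketched as follows. This is a minimal illustration rather than the study's implementation: the function name `non_inferiority_p` is hypothetical, the 10% margin is assumed from the effect size in Section 6.3.1, and the standard error is taken under the null proportion, so the p value need not match the reported figures exactly.

```python
from math import sqrt
from statistics import NormalDist


def non_inferiority_p(detected: int, cancers: int,
                      reference: float, margin: float) -> float:
    """One-sided p value for H0: sensitivity <= reference - margin."""
    observed = detected / cancers
    null_proportion = reference - margin
    # Standard error of a proportion under the null hypothesis (an assumption)
    standard_error = sqrt(null_proportion * (1 - null_proportion) / cancers)
    z = (observed - null_proportion) / standard_error
    return 1 - NormalDist().cdf(z)


# DL-1 at threshold 1: 266 SDCs + 26 ICs detected out of 506 three-year cancers,
# compared with the single first reader sensitivity of 62.9%.
p_value = non_inferiority_p(detected=266 + 26, cancers=506,
                            reference=0.629, margin=0.10)
non_inferior = p_value < 0.05
```

A small p value rejects the hypothesis that the algorithm's sensitivity falls more than the margin below the reader's, which is the sense in which an algorithm is described as non-inferior.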
 | DL-1 | DL-2 | DL-3
Sensitivity | 64.8% [61.3-68.2], p < 0.01, non-inferior | 56.7% [53.0-60.5], p = 0.045, non-inferior | 58.9% [55.3-62.5], p < 0.01, non-inferior
Specificity | 92.8% [92.5-93.1], p < 0.01 | 96.8% [96.7-97.0], p < 0.01 | 97.9% [97.8-98.0], p < 0.01
Precision | 14.8% | 25.6% | 35.2%
Recall Rate | 8.3% | 4.2% | 3.2%
SDC n (%) | 287 (86.5%) | 264 (79.5%) | 275 (82.8%)
IC n (%) | 41 (23.6%) | 23 (13.2%) | 23 (13.2%)
NRC n (%) | 59 (23.2%) | 32 (12.6%) | 18 (7.1%)
FRC n (%) | 1 (100%) | 1 (100%) | 0 (0.0%)
NRIC n (%) | 6 (21.4%) | 2 (7.1%) | 1 (3.6%)

Table 6-5 – Stand-alone artificial intelligence (AI) algorithm application compared to the single first reader – threshold 2. All algorithm thresholds set using the operating point identified using Cambridge 2018 data, threshold 2. FRC: Future round cancer, IC: Interval cancer, NRC: Next round cancer, NRIC: Next round interval cancer, SDC: Screen detected cancer. 95.0% confidence intervals are shown in square brackets [95.0% CI]. p values are calculated using a one-sided z-test.

The distribution of scores for each category of cases, along with the cut off points of each threshold, are shown in Figure 6-7.

Figure 6-7 – Individual artificial intelligence (AI) algorithm score distributions normalised from 0-10. a) Density plots, where normal cases are in red and cancer cases (screen detected and interval cancers) are in blue, and b) violin plots where the blue dot in the violin plot is the mean score and the red is the median score. The green line indicates the 96.6% specificity threshold (threshold 1) and the pink line is the operating point identified from the Cambridge 2018 data (threshold 2). FRC: Future round cancer, IC: Interval cancer, NRC: Next round cancer, NRIC: Next round interval cancer, SDC: Screen detected cancer.
 | Double reader | First reader + DL-1 | First reader + DL-2 | First reader + DL-3
Sensitivity | 67.4% [63.1-71.5] | 67.0% [62.7-71.1], p < 0.01, non-inferior | 65.6% [61.3-69.7], p < 0.01, non-inferior | 65.4% [61.1-69.6], p < 0.01, non-inferior
Specificity | 97.1% [96.7-97.3] | 97.4% [97.2-97.6], p < 0.01 | 97.6% [97.4-97.7], p < 0.01 | 97.6% [97.4-97.8], p < 0.01
Precision | 31.3% | 33.4% | 34.2% | 34.4%
Arbitration | 2.7% | 9.5% | 6.3% | 5.2%
Recall Rate | 4.1% | 3.8% | 3.6% | 3.6%
SDC n (%) | 332 (100%) | 326 (98.2%) | 323 (97.3%) | 321 (96.7%)
IC n (%) | 9 (5.2%) | 13 (7.5%) | 9 (5.2%) | 10 (5.8%)
NRC n (%) | 10 (3.9%) | 0 (0.0%) | 0 (0.0%) | 0 (0.0%)
FRC n (%) | 0 (0.0%) | 0 (0.0%) | 0 (0.0%) | 0 (0.0%)
NRIC n (%) | 2 (7.1%) | 0 (0.0%) | 0 (0.0%) | 0 (0.0%)

Table 6-6 – Artificial intelligence (AI) algorithm (at threshold 2) combined with the single first reader (+/- arbitration where discordance) compared to double reading performance. FRC: Future round cancer, IC: Interval cancer, NRC: Next round cancer, NRIC: Next round interval cancer, SDC: Screen detected cancer. 95.0% confidence intervals are shown in square brackets [95.0% CI]. p values are calculated using a one-sided z-test.

Combining the AI algorithm with the human reader decision, using the identified threshold 2 for each AI algorithm, resulted in non-inferior sensitivity and specificity performance. The overall recall rate was lower, however the arbitration rate was higher (Δ +2.5% to +6.8%), Table 6-6. A reduction in SDCs (Δ -1.8% to -3.3%) and a reduction in NRCs (Δ -3.9%), with only a modest increase in IC detection (Δ +0.0% to +2.3%), was noted for all algorithms compared to double reader performance, Table 6-6.

Perturbation analysis demonstrated that all the algorithms performed similarly and are robust to changes in specificity.
When the AI algorithm specificity was adjusted to ~90.5% and the algorithm was combined with the first reader, with arbitration for discordance, the performance for all AI algorithms was close to double reader sensitivity without increasing the overall recall rate, but with an increase in the arbitration rate, Table 6-7.

 | AI-Specificity | Sensitivity | Specificity | Precision | Arbitration | Recall
DL-1 + readers | 97.5% | 65.4% | 97.6% | 34.5% | 5.82% | 3.60%
DL-1 + readers | 96.5% | 66.2% | 97.6% | 34.5% | 6.58% | 3.63%
DL-1 + readers | 95.5% | 66.4% | 97.5% | 34.1% | 7.34% | 3.69%
DL-1 + readers | 94.5% | 66.4% | 97.5% | 33.8% | 8.14% | 3.72%
DL-1 + readers | 93.5% | 66.8% | 97.5% | 33.7% | 8.94% | 3.76%
DL-1 + readers | 92.5% | 67.0% | 97.4% | 33.3% | 9.75% | 3.81%
DL-1 + readers | 91.5% | 67.0% | 97.4% | 33.1% | 10.60% | 3.83%
DL-1 + readers | 90.5% | 67.0% | 97.4% | 33.0% | 11.47% | 3.84%
DL-2 + readers | 97.5% | 65.4% | 97.6% | 34.4% | 5.70% | 3.60%
DL-2 + readers | 96.5% | 65.6% | 97.5% | 34.0% | 6.48% | 3.65%
DL-2 + readers | 95.5% | 65.8% | 97.5% | 33.7% | 7.31% | 3.70%
DL-2 + readers | 94.5% | 66.0% | 97.5% | 33.4% | 8.13% | 3.75%
DL-2 + readers | 93.5% | 66.4% | 97.4% | 33.1% | 8.97% | 3.80%
DL-2 + readers | 92.5% | 66.8% | 97.4% | 33.0% | 9.79% | 3.84%
DL-2 + readers | 91.5% | 66.8% | 97.4% | 32.8% | 10.6% | 3.86%
DL-2 + readers | 90.5% | 67.0% | 97.3% | 32.7% | 11.5% | 3.88%
DL-3 + readers | 97.5% | 66.0% | 97.6% | 34.4% | 5.48% | 3.64%
DL-3 + readers | 96.5% | 66.4% | 97.5% | 34.0% | 6.23% | 3.70%
DL-3 + readers | 95.5% | 66.4% | 97.5% | 33.6% | 7.02% | 3.75%
DL-3 + readers | 94.5% | 66.4% | 97.4% | 33.3% | 7.84% | 3.78%
DL-3 + readers | 93.5% | 66.4% | 97.4% | 32.8% | 8.66% | 3.84%
DL-3 + readers | 92.5% | 66.6% | 97.3% | 32.6% | 9.55% | 3.87%
DL-3 + readers | 91.5% | 66.6% | 97.3% | 32.3% | 10.40% | 3.90%
DL-3 + readers | 90.5% | 67.0% | 97.3% | 32.2% | 11.27% | 3.94%

Table 6-7 – Perturbation analysis when adjusting the specificity threshold for the artificial intelligence (AI) algorithm, then combining with the first reader and final action arbitration decision if there is discordance.

6.4.3 Scenario D 99.0% specificity auto recall threshold

Including the auto recalled cases above the identified 99.0% specificity threshold of each AI algorithm resulted in an overall increase in sensitivity and decrease in specificity, Table 6-8. On average sensitivity increased by +0.8% to +3.4% and specificity decreased by -0.8% to -2.3%, compared to the results in Table 6-6 where the 99.0% specificity AI threshold was not implemented.
This is also reflected in the increased recall rate (D +0.8~+2.3%) and decreased arbitration rate (D -0.9~-2.4%). There was an overall increase in the NRCs (D +1.8%~+9.9%), and ICs (D +2.3%~+9.7%) detected due to the auto recall implementation, thus Scenario D facilitates the earlier detection of cancer at the expense of an increased recall rate. Double reader First reader + DL-1 First reader + DL-2 First reader + DL-3 Sensitivity 67.4% [63.1-71.5] 70.4% [66.2-74.3] 66.4% [62.1-70.5] 67.4% [63.1-71.5] Specificity 97.1% [96.7-97.3] 95.1% [94.9-95.4] 96.6% [96.3-96.8] 96.8% [96.6-970] Precision 31.3% 21.9% 27.2% 28.9% Arbitration 2.7% 7.1% 5.2% 4.3% Recall Rate 4.1% 6.1% 4.6% 4.4% Cases Flagged Total - 947 (3.5%) 530 (2.0%) 533 (2.0%) True Positive - 274 (1.0%) 235 (0.9%) 272 (1.0%) False Positive - 673 (2.5%) 295 (1.1%) 261 (1.0%) Cancers Flagged SDC n (%) - 254 (76.5%) 229 (69.0%) 258 (77.7%) IC n (%) - 20 (11.5%) 6 (3.5%) 14 (8.1%) NRC n (%) - 24 (9.4%) 10 (3.9%) 8 (3.1%) FRC n (%) - 1 (100%) 0 (0.0%) 0 (0.0%) NRIC n (%) - 1 (3.6%) 1 (3.6%) 0 (0.0%) Auto Recalled SDC n (%) - 20 (6.0%) 15 (4.5%) 18 (5.4%) IC n (%) - 17 (9.8%) 4 (2.3%) 10 (5.8%) NRC n (%) - 21 (8.3%) 9 (3.5%) 7 (2.8%) FRC n (%) - 1 (100%) 0 (0.0%) 0 (0.0%) NRIC n (%) - 1 (3.6%) 1 (3.6%) 0 (0.0%) Final Total Detected SDC n (%) 332 (100%) 326 (98.2%) 323 (97.3%) 321 (96.7%) IC n (%) 9 (5.2%) 30 (17.2%) 13 (7.5%) 20 (11.5%) NRC n (%) 4 (3.6%) 21 (8.3%) 9 (3.5%) 7 (2.8%) FRC n (%) 6 (4.2%) 1 (100%) 0 (0.0%) 5 (3.5%) NRIC n (%) 2 (7.1%) 1 (3.6%) 1 (3.6%) 0 (0.0%) Table 6-8 – Artificial intelligence (AI) algorithm (at threshold 2) combined with the single first reader (+/- arbitration where discordance below 99.0% specificity for the algorithm and above 96.6% specificity) with cases auto recalled above the 99.0% specificity threshold (threshold 3) compared to double reading performance. 
FRC: Future round cancer, IC: Interval cancer, NRC: Next round cancer, NRIC: Next round interval cancer, SDC: Screen detected cancer.

6.4.4 Combined algorithm results

The Combined model was created by combining the DL-1, DL-2 and DL-3 AI algorithms. The Combined model achieved an AUROC of 0.886 [95% CI 0.860–0.912], with an improvement in AUROC of Δ +0.002 to +0.014, and a pAUROC of 0.761 [95% CI 0.733–0.793]. At the first reader specificity threshold of 96.6%, the Combined model achieved a sensitivity of 61.4% [95% CI 54.7–67.4]. Figure 6-8 compares the ROC curve of the Combined model with those of DL-1, DL-2 and DL-3 on the Cambridge data.

Figure 6-8 – Combined model receiver operating characteristic (ROC) curves on Cambridge data. For DL-1 in blue, DL-2 in purple, DL-3 in green and the Combined algorithm performance in red, with area under the receiver operating characteristic curve values provided for each algorithm.

The Combined model performance was not statistically significantly different from the performance of each individual AI algorithm (p > 0.05), as demonstrated in Table 6-9.

                         DL-1     DL-2     DL-3
DeLong's test p value    0.4642   0.9093   0.806

Table 6-9 – DeLong’s test comparison results for DL-1, DL-2, DL-3 compared to the Combined model performance on Cambridge data.

Applying the Combined model to the Norwich data showed no evidence of overfitting and indicated that the model was generalisable to a different site using a different machine vendor (GE), achieving an AUROC of 0.902 [95% CI 0.880–0.925] and a pAUROC of 0.783 [95% CI 0.756–0.810]. Figure 6-9 compares the ROC curve of the Combined model with those of DL-1, DL-2 and DL-3 on the Norwich data. When the threshold corresponding to 96.6% specificity on the Cambridge data was applied to the Norwich data, the Combined model achieved a specificity of 99.8% [95% CI 99.8–99.9] and a sensitivity of 37.8% [95% CI 33.0–43.0].
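The threshold transfer described above, where an operating point is fixed at a target specificity on one site and then applied unchanged to a second site, can be illustrated with a minimal sketch. It is written in Python for illustration only (the thesis analysis was run in R), the scores are invented toy values rather than study data, and the helper names `sens_spec` and `threshold_at_specificity` are hypothetical. A case is assumed to be flagged when its score is at or above the threshold.

```python
import math


def sens_spec(scores, labels, threshold):
    """Sensitivity and specificity when cases with score >= threshold are flagged."""
    tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= threshold)
    fn = sum(1 for s, y in zip(scores, labels) if y == 1 and s < threshold)
    tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < threshold)
    fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)


def threshold_at_specificity(scores, labels, target_spec):
    """Lowest observed score that reaches the target specificity."""
    for t in sorted(set(scores)):
        if sens_spec(scores, labels, t)[1] >= target_spec:
            return t
    return math.inf  # no observed score reaches the target: flag nothing


# Toy development site (label 1 = cancer, 0 = normal); NOT study data.
site_a_scores = [0.05, 0.10, 0.15, 0.20, 0.30, 0.35, 0.40, 0.50, 0.60, 0.70,
                 0.45, 0.65, 0.80, 0.90]
site_a_labels = [0] * 10 + [1] * 4

# Toy external site whose scores sit lower overall (e.g. another vendor).
site_b_scores = [0.02, 0.04, 0.05, 0.08, 0.10, 0.12, 0.15, 0.20, 0.25, 0.30,
                 0.30, 0.50, 0.70, 0.85]
site_b_labels = [0] * 10 + [1] * 4

# Fix the operating point on site A, then apply it unchanged to site B.
threshold = threshold_at_specificity(site_a_scores, site_a_labels, 0.90)
sens_a, spec_a = sens_spec(site_a_scores, site_a_labels, threshold)
sens_b, spec_b = sens_spec(site_b_scores, site_b_labels, threshold)
print(threshold, (sens_a, spec_a), (sens_b, spec_b))
```

In this toy example the transferred threshold gives a higher specificity and a lower sensitivity at the second site, which is the same direction of shift as reported for the Norwich data. The sketch says nothing about the cause of the shift in the study itself.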
Figure 6-9 – Combined model receiver operating characteristic (ROC) curves on Norwich data. For DL-1 in blue, DL-2 in purple, DL-3 in green and the Combined algorithm performance in red, with area under the receiver operating characteristic curve values provided for each algorithm.

The Combined model performance was not statistically significantly different from DL-2 and DL-3 performance (p > 0.05). However, it was statistically significantly different from DL-1, as demonstrated in Table 6-10.

                         DL-1     DL-2     DL-3
DeLong's test p value    < 0.01   0.2434   0.6288

Table 6-10 – DeLong’s test comparison results for DL-1, DL-2, DL-3 compared to the Combined model performance on Norwich data.

6.4.5 Sub-group analysis

Performance of the AI algorithms was further assessed at the 96.6% specificity threshold (threshold 1) for sensitivity in the following subgroups: age at screening, mammographic machine vendor, invasive status, invasive size of cancer, invasive grade of cancer, and mammographic breast density category, for SDCs (Table 6-11) and ICs (Table 6-12), at all sites.

                              n              First reader   DL-1           DL-2           DL-3
Total SDC Cases               332            302            266            266            286
Total Lesions (d)             343            315            278            278            297
Total Invasive Lesions (d)    286            266            238            240            253
Age at Screening
  < 60                        113 (34.0%)    105 (92.9%)    88 (77.9%)     89 (78.8%)     96 (85.0%)
  >= 60                       219 (66.0%)    197 (90.0%)    178 (81.3%)    177 (80.8%)    190 (86.8%)
  p value                     -              0.846319       0.806241       0.88204        0.902054
FFDM Vendor
  GE                          184 (55.4%)    169 (91.8%)    159 (86.4%)    148 (80.4%)    160 (87.0%)
  Philips                     148 (44.6%)    133 (89.9%)    107 (72.3%)    118 (79.7%)    126 (85.1%)
  p value                     -              0.891552       0.284815       0.9576         0.896299
Invasive status (d)
  Invasive                    286 (83.3%)    266 (93.0%)    238 (83.2%)    240 (83.9%)    253 (88.5%)
  Non-invasive                54 (15.7%)     46 (85.2%)     39 (72.2%)     37 (68.5%)     43 (79.6%)
  Missing                     3 (0.9%)       3 (100%)       1 (33.3%)      1 (33.3%)      1 (33.3%)
  p value                     -              0.916924       0.599303       0.493973       0.616753
Invasive Tumour Size (d)
  < 15 mm                     160 (55.9%)    144 (90.0%)    126 (78.8%)    133 (83.1%)    135 (84.4%)
  >= 15 mm                    120 (42.0%)    116 (96.7%)    106 (88.3%)    101 (84.2%)    112 (93.3%)
  Missing                     6 (2.1%)       6 (100%)       6 (100%)       6 (100%)       6 (100%)
  p value                     -              0.911418       0.772362       0.951474       0.82882
Invasive Tumour Grade (d)
  1                           67 (23.4%)     60 (89.6%)     55 (82.1%)     55 (82.1%)     57 (85.1%)
  2                           157 (54.9%)    147 (93.6%)    133 (84.7%)    134 (85.4%)    141 (89.8%)
  3                           56 (19.6%)     54 (96.4%)     45 (80.4%)     46 (82.1%)     50 (89.3%)
  Missing                     6 (2.1%)       5 (83.3%)      5 (83.3%)      5 (83.3%)      5 (83.3%)
  p value                     -              0.989649       0.996254       0.997324       0.994561
Density BI-RADS (b)
  a                           18 (5.4%)      17 (94.4%)     11 (61.1%)     13 (72.2%)     15 (83.3%)
  b                           79 (23.8%)     71 (89.9%)     59 (74.7%)     66 (83.5%)     71 (89.9%)
  c                           43 (13.0%)     39 (90.7%)     34 (79.1%)     36 (83.7%)     37 (86.1%)
  d                           12 (3.6%)      10 (83.3%)     6 (50.0%)      7 (58.3%)      7 (58.3%)
  p value                     -              0.996736       0.817878       0.889323       0.860564
Density BI-RADS (a)
  a                           45 (13.6%)     42 (93.3%)     31 (68.9%)     35 (77.8%)     38 (84.4%)
  b                           208 (62.7%)    190 (91.3%)    174 (83.7%)    172 (82.7%)    185 (88.9%)
  c                           78 (23.5%)     69 (88.5%)     61 (78.2%)     59 (75.6%)     63 (80.8%)
  d                           1 (0.3%)       1 (100%)       0 (0.0%)       0 (0.0%)       0 (0.0%)
  p value                     -              0.997149       -              -              -

Table 6-11 – Sub group analysis of DL-1, DL-2, DL-3 set at the first reader specificity threshold of 96.6% (threshold 1) for screen detected cancers (SDCs). BI-RADS: Breast imaging-reporting and data system, FFDM: Full field digital mammography, SDC: Screen detected cancer. (d) Lesions reported. (a) DL-3 5th edition BI-RADS density scores on processed full field digital mammograms. (b) Volpara 5th edition BI-RADS mammographic breast density from raw full field digital mammograms for Cambridge data. p values were determined by using the Chi-squared (χ²) test to compare the detected proportion of cancer cases against the true distribution for each cancer characteristic category. p values < 0.05 were considered statistically significant.

                              n             Double reader   First reader   DL-1          DL-2          DL-3
Total IC Cases                174           9               16             26            25            30
Total Lesions (d)             186           10              17             31            25            32
Total Invasive Lesions (d)    167           7               16             30            19            30
Age at Screening
  < 60                        86 (49.4%)    3 (3.5%)        4 (4.7%)       13 (15.1%)    12 (14.0%)    13 (15.1%)
  >= 60                       88 (50.6%)    6 (6.8%)        12 (13.6%)     13 (14.8%)    13 (14.8%)    17 (19.3%)
  p value                     -             0.346281        0.061133       0.956401      0.893963      0.537597
FFDM Vendor
  GE                          91 (52.3%)    3 (3.3%)        7 (7.7%)       18 (19.8%)    9 (9.9%)      9 (9.9%)
  Philips                     83 (47.7%)    6 (7.2%)        9 (10.8%)      8 (9.6%)      16 (19.3%)    21 (25.3%)
  p value                     -             0.266994        0.512593       0.105847      0.127486      0.024046
Invasive status (d)
  Invasive                    167 (89.8%)   7 (4.2%)        16 (9.6%)      30 (18.0%)    19 (11.4%)    30 (18.0%)
  Non-invasive                15 (8.1%)     2 (13.3%)       1 (6.7%)       1 (6.7%)      5 (33.3%)     2 (13.3%)
  Missing                     4 (2.2%)      1 (25.0%)       0 (0.0%)       0 (0.0%)      1 (25.0%)     0 (0.0%)
  p value                     -             0.118297        0.732246       0.327375      0.128396      0.700804
Invasive Tumour Size (d)
  < 15 mm                     52 (31.1%)    4 (7.7%)        5 (9.6%)       10 (19.2%)    4 (7.7%)      6 (11.5%)
  >= 15 mm                    96 (57.5%)    3 (3.1%)        11 (11.5%)     19 (19.8%)    14 (14.6%)    24 (25.0%)
  Missing                     19 (11.4%)    0 (0.0%)        0 (0.0%)       1 (5.3%)      1 (5.3%)      0 (0.0%)
  p value                     -             0.236243        0.756546       0.401431      0.394622      0.106786
Invasive Tumour Grade (d)
  1                           28 (16.8%)    1 (3.6%)        2 (7.1%)       4 (14.3%)     3 (10.7%)     4 (14.3%)
  2                           73 (43.7%)    5 (6.9%)        9 (12.3%)      21 (28.8%)    13 (17.8%)    18 (24.7%)
  3                           63 (37.7%)    1 (1.6%)        5 (7.9%)       5 (7.9%)      3 (4.8%)      8 (12.7%)
  Missing                     3 (1.8%)      0 (0.0%)        0 (0.0%)       0 (0.0%)      0 (0.0%)      0 (0.0%)
  p value                     -             0.34277         0.663023       0.029639      0.105161      0.291008
Density BI-RADS (b)
  a                           8 (4.6%)      0 (0.0%)        1 (12.5%)      0 (0.0%)      1 (12.5%)     3 (37.5%)
  b                           34 (19.4%)    3 (8.8%)        3 (8.8%)       0 (0.0%)      4 (11.8%)     6 (17.6%)
  c                           29 (16.6%)    3 (10.3%)       4 (13.8%)      5 (17.2%)     7 (24.1%)     7 (24.1%)
  d                           12 (6.9%)     0 (0.0%)        1 (8.3%)       3 (25.0%)     4 (33.3%)     5 (41.7%)
  Missing                     1 (0.6%)      0 (0.0%)        0 (0.0%)       0 (0.0%)      0 (0.0%)      0 (0.0%)
  p value                     -             -               0.939331       -             0.518521      0.58906
Density BI-RADS (a)
  a                           13 (7.4%)     0 (0.0%)        1 (7.7%)       0 (0.0%)      1 (7.7%)      4 (30.8%)
  b                           94 (53.7%)    5 (5.3%)        9 (9.6%)       17 (19.1%)    14 (14.9%)    14 (14.9%)
  c                           65 (37.7%)    4 (6.1%)        6 (9.1%)       9 (13.6%)     10 (15.2%)    12 (18.2%)
  d                           2 (1.2%)      0 (0.0%)        0 (0.0%)       0 (0.0%)      0 (0.0%)      0 (0.0%)
  p value                     -             -               -              -             -             -

Table 6-12 – Sub group analysis of DL-1, DL-2, DL-3 set at the first reader specificity threshold of 96.6% (threshold 1) for interval cancers (IC). BI-RADS: Breast imaging-reporting and data system, FFDM: Full field digital mammography, IC: Interval cancer. (d) Lesions reported. (a) DL-3 5th edition BI-RADS density scores on processed full field digital mammograms. (b) Volpara 5th edition BI-RADS mammographic breast density from raw full field digital mammograms for Cambridge data. p values were determined by using the Chi-squared (χ²) test to compare the detected proportion of cancer cases against the true distribution for each cancer characteristic category. p values < 0.05 were considered statistically significant.

For most sub groups, the distribution of cancers detected by the AI algorithms was similar to the true distribution of cancer cases (p > 0.05). A statistically significant difference was found for the distribution of IC invasive grade for DL-1, and of IC mammographic machine vendor for DL-3. Otherwise, no statistically significant difference was found between the distribution of each sub category and the types of cancers detected by human readers or AI algorithms. The AI algorithm performance is similar to human reader performance, with sensitivity reducing as density increases.

Performance of the AI algorithms was further assessed at the 96.6% specificity threshold (threshold 1) for sensitivity in the following IC subgroups: interval time in months and radiological audit classification (Table 6-13).
The AI algorithms followed a similar distribution to the true distribution of ICs over the interval time period (months) and detected more year three cancers than human readers. In addition, the algorithms, like human readers, detected a higher proportion of uncertain cases, where there was potentially a visible sign “seen with hindsight”, than of normal / benign cases, where there was no visible sign on case review.

                                    n             Double reader   First reader   DL-1          DL-2          DL-3
Total IC Cases                      174           9               16             26            25            30
Interval (Months)
  0-12                              31 (17.7%)    4 (12.9%)       6 (19.4%)      3 (9.7%)      4 (12.9%)     5 (16.1%)
  12-24                             51 (29.1%)    5 (9.8%)        10 (19.6%)     7 (13.7%)     5 (9.8%)      8 (15.7%)
  24-36                             92 (52.6%)    0 (0.0%)        0 (0.0%)       16 (17.4%)    16 (17.4%)    17 (18.5%)
  36-40                             0 (0.0%)      0 (0.0%)        0 (0.0%)       0 (0.0%)      0 (0.0%)      0 (0.0%)
  p value                           -             -               -              0.642967      0.545267      0.927792
Radiological Audit Classification
  Normal/Benign                     129 (74.1%)   5 (3.9%)        9 (7.0%)       14 (10.9%)    16 (12.4%)    21 (16.3%)
  Uncertain                         39 (22.3%)    3 (7.7%)        7 (18.0%)      12 (30.8%)    8 (20.5%)     9 (23.1%)
  Suspicious                        0 (0.0%)      0 (0.0%)        0 (0.0%)       0 (0.0%)      0 (0.0%)      0 (0.0%)
  Unclassifiable                    1 (0.6%)      0 (0.0%)        0 (0.0%)       0 (0.0%)      0 (0.0%)      0 (0.0%)
  Missing                           5 (2.9%)      1 (20.0%)       0 (0.0%)       0 (0.0%)      1 (20.0%)     0 (0.0%)
  p value                           -             -               -              -             -             -

Table 6-13 – Sub group analysis of DL-1, DL-2, DL-3 set at the first reader specificity threshold of 96.6% (threshold 1) for interval cancer (IC) specific categories. IC: Interval cancer. p values were determined by using the Chi-squared (χ²) test to compare the detected proportion of cancer cases against the true distribution for each cancer characteristic category. p values < 0.05 were considered statistically significant.

Furthermore, the overlap of cases detected by each algorithm (DL-1, DL-2, DL-3) and the human first reader, at threshold 1, is shown in Figure 6-10. The Venn diagram demonstrates how the majority of SDC cases overlap for both the human reader and AI algorithms.
In contrast, the IC and NRC cases detected differ between the AI algorithms, as well as between the human first reader and the AI algorithms.

Figure 6-10 – Venn diagram – not proportional. a) Screen detected cancer cases, b) interval cancer cases, c) next round cancer cases. For DL-1 in blue, DL-2 in purple, DL-3 in green and the first human reader in red.

6.4.6 Failure analysis

Examples of cases missed by the human readers, the AI algorithms or both are shown below. A case classified as an uncertain IC that was not detected by any method (human readers or AI algorithms) is shown in Figure 6-11. This was a case of a 57-year-old patient, diagnosed with a left sided grade 2, 15 mm invasive cancer, 691 days after screening.

Figure 6-11 – Missing case analysis, case missed by both artificial intelligence (AI) and human readers. a) Screening image b) diagnostic image, with a blue bounding box to show the location of the cancer. The screen and diagnostic images were annotated by a breast radiologist to show the true location of the cancer.

A case classified as a normal / benign IC that was not detected by any human reader, but was recalled by all AI algorithms, is shown in Figure 6-12. This was a case of a 57-year-old patient, diagnosed with a right sided grade 2, 21 mm invasive cancer, 806 days after screening.

Figure 6-12 – Missing case analysis, case missed by all human readers and detected by all artificial intelligence (AI) algorithms. a) Screening image b) diagnostic image, with a blue bounding box to show the location of the cancer. The screen and diagnostic images were annotated by a breast radiologist to show the true location of the cancer.

An SDC case recalled at routine screening by all readers that was not detected by any AI algorithm is shown in Figure 6-13. This was a case of a 51-year-old patient, diagnosed with a left sided grade 3, 3 mm invasive cancer, with a 107 mm non-invasive component.
Figure 6-13 – Missing case analysis, case missed by all artificial intelligence (AI) algorithms. a) Screening image, with a blue bounding box to show the location of the cancer. The screen images were annotated by a breast radiologist to show the true location of the cancer.

6.5 Discussion

6.5.1 Overall performance

This study aimed to evaluate the performance of three commercial AI algorithms as stand-alone systems for the task of detection and diagnosis (CADe+x) in routine UK breast screening, using a large unenriched multi-vendor retrospective dataset from two UK NHSBSP sites. It provides an independent external validation on UK data of multiple algorithms simultaneously, which has not previously been performed [133,138,235,236,292,293]. Overall, all three AI algorithms achieved good performance (AUROC 0.868–0.910, pAUROC 0.737–0.791 and AUPRC 0.407–0.513) when using cancers diagnosed within 3 years of screening as cases, demonstrating that these algorithms are generalisable to the UK screening population across different sites and mammographic machine vendors. The AUROC and pAUROC of DL-3 were statistically significantly different from those of DL-1 and DL-2 when tested on Norwich data, which is likely because the predominant manufacturer used for training DL-3 (GE) is the same as the manufacturer in the Norwich test set. Interestingly, the AUROC was not statistically significantly different between sites for the same AI algorithm, despite the algorithms being trained on either no Philips data or < 1% Philips data. Generalisability is further demonstrated as all the AI algorithms were trained on less than 10% of UK data (triennial screening programme). This study highlights the importance of reporting AUROC alongside pAUROC, AUPRC, sensitivity and precision, as the groups are unbalanced in screening, with a large proportion of normal cases relative to cancer cases.
The additional metrics of AUPRC, precision and sensitivity provide information regarding the cost trade-off between missing a cancer case, which carries a high cost, and high rates of false positive recalls. This is demonstrated in Figure 6-6 for DL-3 and DL-2, where the precision is markedly reduced at a high recall. The pAUROC allows evaluation of an AI algorithm's performance at the high specificity end of the curve, where an algorithm operates for screening tasks in order to maintain recall rates, and so provides a more accurate assessment of clinical performance than the overall AUROC.

Compared to the first reader, the sensitivity of all three algorithms was shown to be non-inferior at both threshold 1 and threshold 2 (first reader specificity 96.6%). The AI algorithms detected between 13.2-23.6% of ICs and 4.5-28.8% of NRCs, which may offset the reduction in SDCs seen when using these systems as stand-alone readers. This is in keeping with previous studies where AI algorithms have been shown to be non-inferior, and in certain cases superior, to the first reader in double reading biennial and triennial screening programmes as well as in single reader annual programmes [137,138,149,290]. Rates of ICs detected were lower than in recent studies: 30.7% (63/205) of ICs were detected in the biennial screening programme of Norway, using a cohort of > 47,000 women, although this was at a higher recall rate of 5.8% [271]. In a study using a ten year UK and Hungarian screening cohort, 29.8% (111/373) of ICs were detected, although again the AI system was operating at a lower specificity (91.2%) [290].

When each algorithm was set at threshold 2 and combined with the first reader decision, with the final action decision used where there was discordance, all three AI algorithms were non-inferior to double reading performance.
However, there was an increased arbitration rate, with a decrease in SDC rate and maintenance of IC detection. The decrease in recall rate and overall reduction in workload if this approach were implemented could potentially offset the effects on cancer detection and arbitration rate. Sharma et al reported a similar non-inferior sensitivity for an AI algorithm in a cohort of UK and Hungarian screening data, as well as a similar increase in the arbitration rate [290]. Deployment of AI algorithms as the second reader is seen as a favourable initial deployment approach, with a ‘reader in the loop’ for oversight of the algorithm’s decisions. However, the trade-off of reducing the workload of one reader whilst significantly increasing arbitration needs to be addressed: what is an acceptable national level of increase in arbitration, who takes part in this arbitration, and what information needs to be provided by the AI algorithm to arbitration readers. It is also important that recall rates remain in line with existing screening standards so as not to increase the workload of assessment clinics, which are already a workload intensive and costly part of any screening programme.

6.5.2 Further analysis

The additional scenario of implementing an auto recall threshold (99.0% specificity) aims to overcome the bias caused by using the original arbitration decision of human readers, as a case can otherwise only be recalled if the overall human reader decision was to recall the case. At this threshold there was an overall increase in earlier detection of cancer (ICs and NRCs) at the expense of an increased recall rate. However, it is unknown whether the ICs and NRCs recalled would be detected at an assessment clinic or with supplemental imaging.

Combining all three AI algorithms did not result in a statistically significant improvement in performance.
Salim et al also found that using a voting system of three different commercial AI algorithms did not improve performance compared to the best performing algorithm [149]. However, Schaffter et al implemented an ensemble method of the top performing eight algorithms as part of the DREAM challenge, and did show superior performance compared to the single best performing algorithm [137]. It was also suggested in the UK National Screening Committee report that using algorithms together could potentially improve overall performance [136,137]. Interestingly, as shown in Figure 6-10.b, the ICs and NRCs detected by each AI algorithm and by the human readers are different, and thus a potential benefit for the early detection of cancers could be found by using these systems together.

Investigating the consistency of performance across different categories showed that the algorithms detected cancers with a similar distribution to the true distribution across all sub groups. In addition, the AI algorithms demonstrated similar behaviour to human readers, with a decrease in performance at the highest breast density category, which has previously been reported [138,149,290].

6.5.3 Limitations

There are limitations to this study. Firstly, comparing performance over the three-year screening interval disadvantages the human readers, as it provides the AI algorithms with the opportunity to detect cancers that were not detected by the human readers. Secondly, in practice human readers have access to both prior images and clinical information, which could disadvantage the AI algorithms. Recent developments have seen algorithms starting to use prior images within their decision-making process, but this information was not available in this study. In addition, as all the algorithms in this study are commercial, they are reported under a pseudonym (DL-ID). Whilst this limits the transparency of reporting certain parameters (e.g. model weights and layers) for reproducibility, it does provide an overview of the current performance of commercial AI algorithms for programme level decisions, and thus evidence for the implementation of this technology as well as the planning of prospective studies. Part of this study uses simulation to estimate the performance of the AI as the second reader. As noted in the recently updated 2021 UK NSC report, simulation studies are unable to “measure the impact of AI on readers and their decisions”. Ethnicity data was missing for a proportion of cases when searching NBSS and Electronic Health Record (EHR) systems, and so the assessment of AI algorithms for consistent non-biased performance based on case ethnicity was not possible. Histopathological size can be influenced by the use of neo-adjuvant chemotherapy, and this information was not commonly available alongside size information for analysis. Finally, two out of the three algorithms had access to data from the UK Optimam Mammography Image Database (OMI-DB), which includes a small proportion of data from Cambridge. Whilst all time points for these cases were identified and removed from this study's testing set, the Cambridge data is not wholly temporally independent from the training sets used by each AI algorithm and there is the potential for bias.

6.5.4 Future work

Further work should include evaluating the lesion level prompts provided by each algorithm to investigate explainability, as well as the possible use of the algorithms as interactive clinical decision support systems. In addition, the development of the database over a ten-year period will allow for the inclusion of prior mammogram information for AI algorithms, which could result in an improvement in performance.
6.6 Conclusion

In conclusion, all three commercial AI algorithms met the required benchmark of non-inferiority for the detection and diagnosis of breast cancer as a stand-alone single screen reader and in conjunction with a human reader in a double reading system. Thus, all three algorithms are suitable to proceed to prospective assessment. Further work is however required to confirm that bias does not occur for certain patient groups, through the evaluation of AI algorithm performance for different ethnicities and in different socio-economic regions of the country as part of prospective studies.

Chapter 7 - Performance of stand-alone artificial intelligence algorithms in a UK screening cohort for high sensitivity and high specificity triage

7.1 Aims

In this chapter, the performance of three commercial artificial intelligence (AI) algorithms is investigated for high sensitivity and high specificity triage applications. A large representative screening cohort from two UK screening sites is used for this study in order to assess the tools' performance at a high sensitivity for normal case rule out triage. In addition, each AI algorithm was evaluated for a high suspicion, high specificity rule in triage application, for the detection of interval and next round cancers. A combined approach for both rule in and rule out triage was then applied using the thresholds identified in the earlier studies. The results from this chapter provide data for planning prospective trials and add to the UK evidence for investigating the use of AI algorithms for triage applications in breast cancer screening.

Contents of this chapter have been submitted to Radiological Society of North America conference 2022 [abstract ID - #2022-SP-2966-RSNA] and European Society of Breast Imaging conference 2022 [abstract ID - #A-165].
7.2 Introduction

Each year more than 2.5 million women are screened using mammography as part of the National Health Service (NHS) Breast Screening Programme (BSP), and an estimated 15,000 cancers are diagnosed, such that an estimated ~99.0% of women screened will not have a cancer at the time of screening [54]. Thus the vast majority of the screening workload is from ‘normal’ screens. Screening programmes like the NHSBSP employ a double reading system, where each case is read by two radiologists and, if there is discordance between readers, the case is arbitrated. Mammographic screen reading is therefore a repetitive, high volume task, which is prone to reader fatigue [294]. Many countries have also reported a scarcity of radiologists, especially in breast imaging [58]. Therefore solutions to improve the efficiency of screening are of interest for programmes.

One solution would be to reduce the readers’ workload, so that they do not have to read mammograms with a very low likelihood of a cancer, by using an AI algorithm for automated ‘normal’ case triage. Mammograms below a threshold could be automatically assigned a ‘normal’ outcome and not read by a human reader, or only read by one reader in a double reading programme [134,135,229]. In our systematic review and meta-analysis reported in Chapter 3 we found that, when applying this computer aided triage (CADt) approach, the number of exams could be reduced by 17.0-91.0% whilst missing 0.0-7.0% of cancer cases [133]. A recent study by Lång et al found that 19.1% of cases could be removed without missing a cancer. If 53.0% of cases were classified as normal, 10.3% of screen detected cancers (SDCs) would not be flagged, of which 85.7% (6/7) were clearly visible, with a 27.8% reduction in false positive recalls [295]. What has not been fully quantified is the acceptable miss rate of these systems when used for this specific application, such as what sensitivity threshold should be used when setting the operating point of these systems [133].
Alternative triage reading approaches have been suggested, such that cases with the lowest scores are single read and the rest are double read. Using this approach, Balta et al demonstrated a 32.6% reduction in workload for the second reader whilst being estimated to miss no cancers [230]. Balta et al also found a reduction in recall rate (5.35% to 4.79% (p < 0.01)), a reduction in arbitration rate by 20.8% and an increase in positive predictive value (11.9% to 13.3% (p < 0.01)) [230]. However, it is unknown whether this would be replicated in the real-time clinical workflow and whether reader performance would improve or, at a minimum, stay the same. Concerns have been raised about whether reading a smaller volume of exams would have adverse effects, and about the impact on reader training needed to maintain the high standards of breast screening through exposure to different cases, both cancer and non-cancer.

An alternative method for stand-alone AI triaging is to triage highly suspicious cases with a high score for either automatic referral for assessment or supplemental imaging. This auto CADt rule in approach could improve the detection of interval (IC) and next round (NRC) cancers, thus potentially improving the survival outcomes of women through earlier detection. One approach suggested by Dembrower et al is enhanced screening of those cases with the highest 1.0-5.0% of scores using supplemental imaging (digital breast tomosynthesis (DBT), magnetic resonance imaging (MRI)), which was estimated to increase the detection of ICs by 12.0-27.0% and NRCs by 14.0-35.0% [134]. Dembrower et al also incorporated a rule out triage approach, which identified that 60.0% of cases could be triaged out from human reading without missing a cancer [134]. Lauritzen et al implemented both a normal rule out triage and a high suspicion auto recall to assessment triage. For the auto recall high suspicion triage only 0.08% of cases were recalled, of which 8.8% were SDCs, 0.0% ICs and 0.14% NRCs.
However, Lauritzen et al found an overall 63.0% workload reduction and 25.1% false positive reduction, whilst missing 12 (1.5%) SDCs when implementing the auto recall out threshold [296]. Overall, Lauritzen et al reported a non-inferior sensitivity (69.7% vs 70.8%) and specificity (98.6% vs 98.1%) when comparing the AI system workflow to the routine double reading workflow [296].

Such improvements in the efficiency of reading could also benefit patients by potentially providing faster results. The anxiety of waiting for screening results is often reported by patients attending screening. Furthermore, the overall cost effectiveness of the screening programme could be improved by this strategic screen reading approach, through a reduction in the number of radiologist hours required for mammographic screen reading, with radiologists' time instead being used for complex biopsies and time-consuming MRI reading.

This study addresses the gap in evidence highlighted in the UK National Screening Committee report, using a large external UK multi-vendor multi-site cohort for testing multiple AI algorithms in different triage approaches within an independent environment [136].

7.3 Methods

7.3.1 Data

All study data was obtained from the CC-MEDIA database described in Chapter 4, where data was collected from two NHSBSP sites (Cambridge and Norwich) under existing ethical approval (HRA REC 20/LO/0104, HRA CAG 20/CAG/0009, PHE RAC BSPRAC_090). All study data was de-identified prior to use in this research. Cases were included if the following eligibility criteria were met: age more than 47 years old, complete two-view FFDM, and participation in routine NHSBSP screening between January 1 2015 and December 31 2017 at Norwich or between January 1 2017 and December 31 2018 at Cambridge.
Cases were excluded if they were recorded as a technical recall, were part of high-risk screening, did not meet the specified case definition for ground truth, or had an incomplete mammogram (less than four views, more than four views, images not available on the Picture Archiving and Communication System (PACS), or only raw data available). One time point per case was included, such that if a case appeared twice due to repeat screening within the study time frame, the earliest time point was used for this study. Cancer cases were also removed where they did not meet the specified definition following discussion with Public Health England (PHE), and interval cancer cases were removed if the interval was recorded as longer than 40 months. Cases were not excluded if they had prior surgery, prior cancer or an artefact included in the image (e.g. pacemakers). However, cases with implants were excluded. All cases were checked to ensure there was no overlap with any existing databases the tools had used for training, and therefore the AI algorithms had not previously seen any data included in this study dataset.

The processed screening Full Field Digital Mammogram (FFDM) images were used by all AI algorithms for this study. The images were stored in Joint Photographic Experts Group (JPEG) Lossless Digital Imaging and Communications in Medicine (DICOM) format, and no additional pre-processing occurred other than that performed by the mammography vendor and that performed by the AI algorithm. No prior images or clinical data were available for the algorithms to use.

Clinical metadata was collected for all cases included in this study from the National Breast Screening System (NBSS). The invasive status, histological grade and histological size were obtained using an automated NBSS query; for further detail please see Chapter 4 Section 4.4.5.
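The inclusion and exclusion rules of this section can be read as a simple case filter, sketched below in Python for illustration (the thesis analysis was run in R). This is an assumption-laden sketch, not the study pipeline: the `ScreeningCase` fields, `is_eligible` and `select_cohort` are hypothetical names rather than the CC-MEDIA schema, the ground truth and 40-month interval checks are omitted, and the earliest time point is taken among eligible screens, which is one possible reading of the text.

```python
from dataclasses import dataclass
from datetime import date

# Screening windows per site, as stated in the eligibility criteria.
SITE_WINDOWS = {
    "Norwich": (date(2015, 1, 1), date(2017, 12, 31)),
    "Cambridge": (date(2017, 1, 1), date(2018, 12, 31)),
}


@dataclass
class ScreeningCase:
    # Hypothetical fields for illustration only.
    client_id: str
    site: str
    screen_date: date
    age: int
    n_views: int               # a complete mammogram has exactly four views
    processed_available: bool  # processed FFDM images present on PACS
    technical_recall: bool = False
    high_risk: bool = False
    has_implant: bool = False


def is_eligible(case):
    """Apply the age, date window, completeness and exclusion rules."""
    window = SITE_WINDOWS.get(case.site)
    if window is None or not (window[0] <= case.screen_date <= window[1]):
        return False
    if case.age <= 47:  # "age more than 47 years old"
        return False
    if case.n_views != 4 or not case.processed_available:
        return False
    if case.technical_recall or case.high_risk or case.has_implant:
        return False
    return True


def select_cohort(cases):
    """Keep one time point per client: the earliest eligible screen."""
    earliest = {}
    for case in filter(is_eligible, cases):
        kept = earliest.get(case.client_id)
        if kept is None or case.screen_date < kept.screen_date:
            earliest[case.client_id] = case
    return list(earliest.values())


cases = [
    ScreeningCase("A", "Cambridge", date(2017, 3, 1), 52, 4, True),
    ScreeningCase("A", "Cambridge", date(2018, 11, 1), 53, 4, True),  # repeat screen
    ScreeningCase("B", "Norwich", date(2016, 5, 1), 47, 4, True),     # too young
    ScreeningCase("C", "Norwich", date(2016, 5, 1), 60, 3, True),     # incomplete
    ScreeningCase("D", "Norwich", date(2015, 6, 1), 65, 4, True, has_implant=True),
    ScreeningCase("E", "Cambridge", date(2016, 6, 1), 55, 4, True),   # outside window
    ScreeningCase("F", "Norwich", date(2017, 12, 31), 50, 4, True),
]
cohort = select_cohort(cases)
print(sorted(c.client_id for c in cohort))
```

Keeping the rules in a single predicate makes each exclusion count easy to tally for a STARD style flow diagram, although how the study itself produced those counts is not described here.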
The case selection process is shown in the Standards for Reporting of Diagnostic Accuracy Studies (STARD) diagram in Figure 7-1 [275]. Cases were excluded if a sufficient ground truth follow-up was not available. In this study 34,889 cases were excluded as a second time point normal screen outcome was not available in the NBSS output. This is due to cases not returning to screening, either because of non-attendance or because they had completed routine screening (aged 50-70) and did not self-refer. The study took place during the Covid-19 pandemic, which also affected women being recalled to screening, as detailed in Chapter 4.

Figure 7-1 – Standards for Reporting of Diagnostic Accuracy Studies (STARD) flow diagram of cases included and excluded in this study. FFDM: Full Field Digital Mammogram, FHx: Family history, IC: Interval cancer, NHS: National Health Service, OMI-DB: The Optimam Mammography Image Database, PHE: Public Health England, PACS: Picture Archiving and Communication System.

7.3.2 Ground truth

The ground truth was determined for each case using definitions available from the NHSBSP as well as normal follow-up standards within the field. The study time period overlaps with the pause in screening during the Covid-19 pandemic, which led to an increase in round length. For further details regarding the definition of the ground truth of each case please refer to Chapter 6 Section 6.3.3.

Figure 7-2 shows the sequence of different cancer outcomes. Previous round cancers and previous round interval cancers were only included if they had a further round outcome of either ‘normal’ or ‘cancer’.

Figure 7-2 – Cancer outcomes for study cohort. FRC: Future round cancer, IC: Interval cancer, NRC: Next round cancer, NRIC: Next round interval cancer, PRC: Previous round cancer, PRIC: Previous round interval cancer, SDC: Screen detected cancer.
The human readers were trained breast radiologists or breast radiographers who read as part of the NHSBSP, meeting the NHSBSP standards of reading 5000 mammograms a year and undertaking Personal Performance in Mammographic Screening (PERFORMS) testing each year [56]. Trainee readers were removed from this analysis.

7.3.3 AI tools

Three commercial AI algorithms, which use a deep learning (DL) convolutional neural network architecture, were installed at the University of Cambridge. Details regarding the training data used by each AI algorithm, as well as the technical setup and algorithm output, are outlined in Chapter 5 Table 5-1. Breast Imaging-Reporting and Data System (BI-RADS) 5th edition density scores were provided from two systems: Volpara (research version – VolparaResearch32_L30Enabled_v2, Wellington, New Zealand) using raw full field digital mammography (FFDM) data, and DL-3 using FFDM processed data.

7.3.4 Thresholds

Four thresholds were used for the normal triage aspect of this study. The first threshold was set at 99.0% sensitivity for the AI algorithm (threshold 1) and the second threshold was set at 99.9% sensitivity (threshold 2), with SDCs classified as cases. The third threshold (threshold 3) was set at 85.0% sensitivity, and the fourth threshold (threshold 4) was set at 70.0% specificity, for cancers occurring within 3 yearly screening (SDCs, ICs, NRCs). At these thresholds the AI algorithm's performance was assessed for an adapted reader workflow, where an initial AI algorithm read takes place and, if the case falls below the threshold, it is included in the alternative screening workflow (Figure 7-3.b or Figure 7-3.c). Scenario B, as shown in Figure 7-3.b, results in any case below the threshold not being read by a human reader, whereas in Scenario C (Figure 7-3.c) the case is read by one reader (single first reader) only. Cases that are not below the threshold proceed to routine double reading, creating a simulated workflow.
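The rule-out thresholds described above can be illustrated with a minimal Python sketch. It is not the code used in this study (the analysis was run in R); it assumes that higher AI scores mean higher suspicion, that cases scoring below the cut-off are the ones triaged, and it uses made-up toy scores. The function names `rule_out_threshold` and `triage_summary` are hypothetical.

```python
import math


def rule_out_threshold(cancer_scores, target_sensitivity):
    """Largest cut-off that keeps at least target_sensitivity of cancers at or above it."""
    ranked = sorted(cancer_scores)
    # Number of cancers allowed to fall below the cut-off; the small epsilon
    # guards against floating-point error in 1 - target_sensitivity.
    allowed_missed = math.floor(len(ranked) * (1.0 - target_sensitivity) + 1e-9)
    return ranked[allowed_missed]


def triage_summary(scores, is_cancer, threshold):
    """Summarise a rule-out triage where cases scoring below threshold are ruled out."""
    ruled_out = [s < threshold for s in scores]
    n_cases = len(scores)
    n_cancers = sum(is_cancer)
    missed = sum(1 for r, c in zip(ruled_out, is_cancer) if r and c)
    return {
        "pct_ruled_out": 100.0 * sum(ruled_out) / n_cases,
        "pct_to_double_read": 100.0 * (n_cases - sum(ruled_out)) / n_cases,
        "cancers_missed": missed,
        "sensitivity": (n_cancers - missed) / n_cancers,
    }


# Toy data only (not study data): 10 cancers and 20 normal cases.
cancer_scores = [0.15, 0.55, 0.60, 0.70, 0.75, 0.80, 0.85, 0.90, 0.95, 0.99]
normal_scores = [0.05, 0.10, 0.12, 0.20, 0.22, 0.25, 0.30, 0.33, 0.35, 0.40,
                 0.42, 0.45, 0.50, 0.52, 0.58, 0.62, 0.65, 0.72, 0.78, 0.88]
scores = cancer_scores + normal_scores
is_cancer = [True] * len(cancer_scores) + [False] * len(normal_scores)

threshold = rule_out_threshold(cancer_scores, target_sensitivity=0.90)
summary = triage_summary(scores, is_cancer, threshold)
print(threshold, summary)
```

In Scenario B the ruled-out share would not be read by a human reader, whereas in Scenario C the same share would go to a single first reader; the remaining cases continue to routine double reading in both.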
For the high-suspicion rule in triage part of this study, the performance threshold was set at 94.0-99.0% specificity, for cancers occurring within 3 yearly screening (SDCs, ICs, NRCs). Two approaches are reported in this study, Scenario D (Figure 7-3.d) and Scenario E (Figure 7-3.e). In Scenario D any case above the AI algorithm threshold of 94.0-99.0% specificity is automatically recalled for supplemental imaging or assessment. Alternatively, in Scenario E any case above the threshold (94.0-99.0% specificity) that was not recalled by routine human reading would be referred for further supplemental imaging or assessment. SDCs and ICs, occurring within the three-year screening interval, were classified as cancer cases for the calculation of overall sensitivity and specificity.

Figure 7-3 – Proposed workflow deployment approaches for stand-alone artificial intelligence (AI) systems as triage tools. a) Routine UK double reading workflow, b) rule out normal triage of all cases below a set threshold, c) rule out normal triage of cases below a set threshold to single first reader reading, d) rule in high suspicion triage of all cases above a set threshold to supplemental imaging or assessment, e) rule in high suspicion triage of cases above a set threshold that were not recalled by routine double reading to supplemental imaging or assessment.

7.3.5 Statistical analysis

All statistical analysis took place in R version 4.0.4 (R Foundation for Statistical Computing, Vienna, Austria) [225], using the packages detailed in Chapter 5 Section 5.3.6.

The overall predictive performance of each AI algorithm was evaluated using the area under the receiver operating characteristic (AUROC) curve, and the partial AUROC (pAUROC) at 99.0-100% or 85.0-100% sensitivity and 94.0-99.0% specificity. The primary performance metrics for the normal triage study were: specificity, sensitivity, % cases triaged, % cancers missed due to normal triage and % false positive cases triaged.
The primary performance metrics for the high-suspicion triage study were: sensitivity, specificity, % interval cancers detected, and % next round cancers detected. The effect on the overall recall rate and arbitration rate was also assessed for all triage approaches.

$$\text{Sensitivity} = \frac{TP}{TP + FN} \qquad \text{Specificity} = \frac{TN}{TN + FP}$$

where TP, FN, TN and FP are the numbers of true positive, false negative, true negative and false positive cases respectively.

Performance of each AI algorithm was compared to reader performance using a one-sample, one-tailed z-test to determine if the algorithm was non-inferior.

Subgroup analysis of the SDCs missed by the AI algorithms as part of normal triaging at threshold 1 took place for the following categories: age at screening, mammographic machine vendor, invasive tumour size, invasive tumour grade, and mammographic breast density, using both Cambridge and Norwich data. Further subgroup analysis in the same categories was calculated for the AI algorithms at the 94.0% specificity rule in triage threshold for ICs and NRCs. The true integer values and percentages were reported, and a Chi-squared (χ²) test was used to investigate whether there was a statistically significant difference between categories [285].

Data is presented as integer number and percentage (n (%)), or median and interquartile range (IQR) [25th – 75th centile range] as appropriate. DeLong’s test was used to assess for a statistically significant difference between the AUROC curves of the AI algorithms using 2000 bootstrapping examples. In all analyses, p-values < 0.05 were considered statistically significant and 95% confidence intervals were calculated, using bootstrapping with 2000 samples or through an approximation method from Simel et al. using the epiR package [291].

7.3.6 Reporting

Each AI algorithm was assigned a DL Identifier (ID) for the purposes of this study. For additional details please refer to Section 5.3.7 in Chapter 5.

7.4 Results

7.4.1 Data

In total 78,849 cases were included. 24,563 (31.2%) cases were from Cambridge and 54,286 (68.8%) cases were from Norwich.
The median age of the cohort was 59.0 years old [IQR 54.0–63.0]. The characteristics of the study cases are detailed in Table 7-1.

| | Cambridge | Norwich |
|---|---|---|
| Total Cases n | 24563 | 54286 |
| Year of Screen | | |
| 2015 | - | 21017 (38.7%) |
| 2016 | - | 19219 (35.4%) |
| 2017 | 11956 (48.7%) | 14050 (25.9%) |
| 2018 | 12607 (51.3%) | - |
| FFDM Manufacturer | | |
| GE | 235 (1.0%) | 54286 (100%) |
| Philips | 24328 (99.0%) | - |
| Age at Screening | | |
| Median [IQR] | 58.0 [54.0-63.0] | 59.0 [52.0-64.0] |
| 47-49 | 76 (0.3%) | 4030 (7.4%) |
| 50-59 | 13871 (56.5%) | 26752 (49.3%) |
| 60-69 | 9607 (39.1%) | 20785 (38.3%) |
| 70+ | 1009 (4.1%) | 2719 (5.0%) |
| Cancers, n (rate per 1000 screens) | | |
| SDC | 342 (8.5/1000) | 545 (7.0/1000) |
| IC | 167 (4.2/1000) | 272 (3.5/1000) |
| NRC | 184* (6.5/1000) | 504 (8.1/1000) |
| FRC | - | 149* |
| Next round interval cancers | 18* | 181* |

| Density BI-RADS | Cambridge Volpara^b | Cambridge DL-3^a | Norwich DL-3^a |
|---|---|---|---|
| a | 3965 (16.1%) | 5682 (23.1%) | 8109 (14.9%) |
| b | 11123 (45.3%) | 13527 (55.1%) | 31260 (57.6%) |
| c | 6607 (26.9%) | 5252 (21.4%) | 14180 (26.1%) |
| d | 2516 (10.2%) | 102 (0.4%) | 737 (1.4%) |
| Missing | 48 (0.2%) | 0 (0.0%) | 0 (0.0%) |

Table 7-1 – Summary of testing dataset characteristics. Integer values and percentages in brackets (%) and interquartile range in square brackets [IQR]. BI-RADS: Breast imaging-reporting and data system, FRC: Future round cancer, FFDM: Full Field Digital Mammography, GE: General Electric, IC: Interval cancer, NRC: Next round cancer, NRIC: Next round interval cancer, SDC: Screen detected cancer. ^a DL-3 5th edition BI-RADS density scores on processed full field digital mammograms. ^b Volpara 5th edition BI-RADS mammographic breast density from raw full field digital mammograms for Cambridge data. *Incomplete follow-up time period information from which to calculate an accurate rate.

In total 69.0% of the study mammograms were from GE machines, whereas 31.0% were from Philips machines, with predominantly Philips machines at Cambridge and GE machines at Norwich.
Approximately 1/6th of the cohort of women aged 67-69 were not included, as they would not have self-referred to have a repeat screen and thus would not have met the threshold for the ‘normal’ ground truth used in this study. This has a potential knock-on effect for the overall density percentages reported, as women aged 67-69 have a higher proportion of BI-RADS category a and b cases.

The cohort contains 887 (1.1%) SDCs, 439 (0.6%) ICs, and 688 (0.9%) NRCs. The characteristics of the SDC and NRC cases in the study cohort are shown in Table 7-2.

| | Cambridge SDC n (%) | Cambridge NRC n (%) | Norwich SDC n (%) | Norwich NRC n (%) |
|---|---|---|---|---|
| Total Cases n | 342 | 184 | 545 | 504 |
| Total Lesions n | 359 | 191 | 562 | 527 |
| Round Length* (days) | 1088 [1066-1105] | 1221 [1080-1332] | 1063 [1036-1085] | 1078 [1064-1128] |
| Round Length* (months) | 35.8 [35.0-36.3] | 40.1 [35.5-43.8] | 35.0 [34.1-35.7] | 35.4 [35.0-37.1] |
| Age at Screening | | | | |
| Median [IQR] | 62.0 [57.0-68.0] | 60.0 [54.0-65.0] | 62.0 [56.0-68.0] | 61.0 [55.0-65.0] |
| 47-49 | 0 (0.0%) | 1 (0.5%) | 33 (6.1%) | 25 (5.0%) |
| 50-59 | 123 (36.0%) | 84 (45.7%) | 171 (31.3%) | 199 (39.5%) |
| 60-69 | 164 (47.9%) | 86 (46.7%) | 259 (47.5%) | 237 (47.0%) |
| 70+ | 55 (16.1%) | 13 (7.1%) | 82 (15.0%) | 43 (8.5%) |
| Invasive Status | | | | |
| Invasive | 292 (81.3%) | 160 (83.8%) | 478 (85.1%) | 432 (82.0%) |
| Non-invasive | 66 (18.4%) | 30 (15.7%) | 82 (14.6%) | 92 (17.5%) |
| Missing | 1 (0.3%) | 1 (0.5%) | 2 (0.4%) | 3 (0.6%) |
| Invasive Tumour Size^d | | | | |
| < 15 | 154 (52.7%) | 72 (45.0%) | 261 (54.6%) | 244 (56.5%) |
| >= 15 | 128 (43.8%) | 60 (37.5%) | 198 (41.4%) | 154 (35.6%) |
| Missing | 10 (3.4%) | 28 (17.5%) | 19 (4.0%) | 34 (7.9%) |
| Invasive Tumour Grade^d | | | | |
| 1 | 52 (17.8%) | 24 (15.0%) | 132 (27.6%) | 133 (30.8%) |
| 2 | 170 (58.2%) | 98 (61.2%) | 233 (48.7%) | 187 (43.3%) |
| 3 | 53 (18.2%) | 24 (15.0%) | 104 (21.8%) | 86 (19.9%) |
| Missing | 17 (5.8%) | 14 (8.8%) | 9 (1.9%) | 26 (6.0%) |

| Density BI-RADS^l | Cambridge SDC Volpara^b | Cambridge SDC DL-3^a | Cambridge NRC Volpara^b | Cambridge NRC DL-3^a | Norwich SDC DL-3^a | Norwich NRC DL-3^a |
|---|---|---|---|---|---|---|
| a | 47 (13.7%) | 72 (21.1%) | 27 (14.7%) | 41 (22.3%) | 52 (9.5%) | 59 (11.7%) |
| b | 169 (49.4%) | 197 (57.6%) | 95 (51.6%) | 98 (53.3%) | 334 (61.3%) | 303 (60.1%) |
| c | 87 (25.4%) | 72 (21.1%) | 47 (25.5%) | 45 (24.5%) | 153 (28.1%) | 141 (28.0%) |
| d | 35 (10.2%) | 1 (0.29%) | 15 (8.2%) | 0 (0.0%) | 6 (1.1%) | 1 (0.2%) |
| Missing | 4 (1.2%) | 0 (0.0%) | 0 (0.0%) | 0 (0.0%) | 0 (0.0%) | 0 (0.0%) |

Table 7-2 – Screen detected (SDC) and next round cancer (NRC) characteristics by lesions and cases. Integer values with percentages in brackets (%). Invasive Tumour Size in millimetres (mm). BI-RADS: Breast imaging-reporting and data system, NRC: Next round cancer, SDC: Screen detected cancer. ^d Invasive lesions only. ^l Cases only. ^a DL-3 5th edition BI-RADS density scores on processed full field digital mammograms. ^b Volpara 5th edition BI-RADS mammographic breast density from raw full field digital mammograms for Cambridge data. *Cases not screened before or screened more than 6 years ago were not included in this calculation.

Results for screen detected cancers are comparable between sites. However, the results for next round cancers are not comparable between sites. This is because cases from 2017 onwards were affected by the pause in screening during the Covid-19 pandemic, which is described in Chapter 4. Cambridge data includes 2017-2018 cases (~84.8% of cases affected) whereas Norwich includes 2015-2017 cases (~21.2% of cases affected), so this effect is seen in the Cambridge next round cancer results, where the round length is increased. It is expected that the size and grade of cancers would increase as a consequence; however, due to the increased use of hormone therapy in the pandemic, this impact may not be seen in the histopathological size and grade.

Of the 439 ICs, 167 (38.0%) were from Cambridge and 272 (62.0%) were from Norwich. The characteristics of the IC cases in the study cohort are shown in Table 7-3.
| | Cambridge IC n (%) | Norwich IC n (%) |
|---|---|---|
| Total Cases n | 167 | 272 |
| Total Lesions n | 170 | 275 |
| Age at Screening | | |
| Median [IQR] | 59.0 [54.0-66.0] | 62.0 [55.0-68.0] |
| 47-49 | 1 (0.6%) | 18 (6.6%) |
| 50-59 | 85 (50.9%) | 99 (36.4%) |
| 60-69 | 58 (34.7%) | 114 (41.9%) |
| 70+ | 23 (13.8%) | 41 (15.1%) |
| Invasive Status | | |
| Invasive | 145 (85.3%) | 260 (94.5%) |
| Non-invasive | 14 (8.2%) | 14 (5.1%) |
| Missing | 11 (6.5%) | 1 (0.4%) |
| Invasive Tumour Size^d | | |
| < 15 | 39 (26.9%) | 87 (33.5%) |
| >= 15 | 86 (59.3%) | 142 (54.6%) |
| Missing | 20 (13.8%) | 31 (11.9%) |
| Invasive Tumour Grade^d | | |
| 1 | 24 (16.6%) | 31 (11.9%) |
| 2 | 62 (42.8%) | 127 (48.8%) |
| 3 | 57 (39.3%) | 95 (36.5%) |
| Missing | 2 (1.4%) | 7 (2.7%) |
| Interval (months)^l | | |
| 0-12 | 25 (15.0%) | 44 (16.1%) |
| 12-24 | 56 (33.5%) | 100 (36.8%) |
| 24-36 | 86 (51.5%) | 128 (47.1%) |
| 36-40 | 0 (0.0%) | 0 (0.0%) |
| Missing | 0 (0.0%) | 0 (0.0%) |
| Radiological Audit Classification^l | | |
| Normal/Benign | 138 (82.6%) | 199 (73.2%) |
| Uncertain | 19 (11.4%) | 59 (21.7%) |
| Suspicious | 0 (0.0%) | 9 (3.3%) |
| Unclassifiable | 1 (0.6%) | 5 (1.8%) |
| Missing | 9 (5.4%) | 0 (0.0%) |

| Density BI-RADS^l | Cambridge Volpara^b | Cambridge DL-3^a | Norwich DL-3^a |
|---|---|---|---|
| a | 12 (7.2%) | 15 (9.0%) | 10 (3.7%) |
| b | 65 (38.9%) | 85 (50.9%) | 131 (48.2%) |
| c | 58 (34.7%) | 63 (37.7%) | 122 (44.9%) |
| d | 31 (18.6%) | 4 (2.4%) | 9 (3.3%) |
| Missing | 1 (0.6%) | 0 (0.0%) | 0 (0.0%) |

Table 7-3 – Interval cancer (IC) characteristics by lesions and cases. Integer values with percentages in brackets (%). BI-RADS: Breast imaging-reporting and data system, IC: Interval cancer. ^d Invasive lesions only. ^l Cases only. Invasive Tumour Size in millimetres (mm). ^a DL-3 5th edition BI-RADS density scores on processed full field digital mammograms. ^b Volpara 5th edition BI-RADS mammographic breast density from raw full field digital mammograms for Cambridge data.

The median time to diagnosis was 741.0 [IQR 502.5–963.0] days for all ICs at Cambridge and 703.5 [IQR 450.0–944.2] days at Norwich.

The double reader performance and single first reader performance, when combining outputs from both sites, is shown below in Table 7-4.
| | Double reading | First reader |
|---|---|---|
| Sensitivity | 68.9% | 63.6% |
| Specificity | 97.6% | 97.0% |
| Precision | 32.8% | 26.4% |
| Arbitration | 2.5% | - |
| Recall rate | 3.5% | 4.1% |
| n detected (%) | | |
| SDC | 887 (100%) | 807 (91.0%) |
| IC | 27 (6.2%) | 36 (8.2%) |
| NRC | 20 (2.9%) | 29 (4.2%) |
| FRC | 4 (2.7%) | 7 (4.7%) |
| NRIC | 7 (3.5%) | 10 (5.0%) |
| FP | 1840 | 2310 |

Table 7-4 – Double and single first reader performance at both Cambridge and Norwich. FRC: Future round cancer, FP: False positive, IC: Interval cancer, NRC: Next round cancer, NRIC: Next round interval cancer, SDC: Screen detected cancer.

7.4.2 Rule-out triage – Threshold 1 and 2

The overall AUROC for DL-1, DL-2 and DL-3 was 0.962 [95% CI 0.955–0.969], 0.966 [95% CI 0.961–0.972] and 0.975 [95% CI 0.970–0.980] respectively, when classifying cancers as SDCs only (Figure 7-4). The AUROC of DL-3 was statistically significantly different (p < 0.05) to that of DL-1 and DL-2 when tested on all data and on Norwich data. The AUROC of DL-3 was also statistically significantly different (p < 0.05) between Norwich and Cambridge site data.

Figure 7-4 – Receiver operating characteristic (ROC) curves for screen detected cancers (SDCs) as cases. a) results per site, b) results per algorithm. The overall results are in grey, Cambridge in orange and Norwich in pink. The results for DL-1 are in blue, DL-2 in purple, and DL-3 in green. Area under the receiver operating characteristic curve values are provided for each site and each algorithm's performance.

Applying the Scenario B workflow at the 99.0% sensitivity threshold (threshold 1), where SDCs were classed as cases, left 65.0%, 46.8% and 44.4% of cases to be read by a double reading workflow for DL-1, DL-2 and DL-3 respectively (Table 7-5). At this threshold all algorithms ruled out 9 (1.0%) SDCs, and between 100-222 (14.5-32.3%) NRCs and 74-114 (16.9-26.0%) ICs.
DL-3 ruled out the highest number of false positive (FP) cases (n = 465), whereas DL-1 and DL-2 ruled out a similar volume (DL-1 n = 318 and DL-2 n = 369).

| | Sensitivity Threshold | Specificity | Missed SDC | Missed IC | Missed NRC | FP out | % to read |
|---|---|---|---|---|---|---|---|
| Cases | - | - | 887 | 439 | 688 | 1840 | - |
| DL-1 | 1) 99.0% | 35.3% [30.0-57.0] | 9 (1.0%) | 74 (16.9%) | 100 (14.5%) | 318 (14.4%) | 65.0% |
| DL-1 | 2) 99.9% | 10.8% [10.6-28.0] | 1 (0.1%) | 14 (3.2%) | 18 (2.6%) | 68 (3.7%) | 89.4% |
| DL-2 | 1) 99.0% | 53.8% [35.9-66.4] | 9 (1.0%) | 107 (24.4%) | 214 (31.1%) | 369 (20.1%) | 46.8% |
| DL-2 | 2) 99.9% | 12.1% [11.9-29.2] | 1 (0.1%) | 12 (2.7%) | 28 (4.1%) | 25 (1.4%) | 88.0% |
| DL-3 | 1) 99.0% | 56.3% [38.1-66.7] | 9 (1.0%) | 114 (26.0%) | 222 (32.3%) | 465 (25.3%) | 44.4% |
| DL-3 | 2) 99.9% | 21.9% [21.4-38.1] | 1 (0.1%) | 21 (4.8%) | 55 (8.0%) | 131 (7.1%) | 78.3% |

Table 7-5 – Results at 1) 99.0% sensitivity threshold 1 and 2) 99.9% sensitivity threshold 2. Missed cases are shown for screen detected cancers, next round cancers and interval cancers as well as the proportion of false positives ruled out. Only screen detected cancers were classed as cases for the threshold identification. FP: False positive, IC: Interval cancer, NRC: Next round cancer, SDC: Screen detected cancer.

The sensitivity (Δ -0.7%) and specificity (Δ +0.4%~+0.6%) were non-inferior. A lower arbitration rate (Δ -0.4%~-0.7%) and recall rate (Δ -0.4%~-0.6%) were also observed at threshold 1, Table 7-6.

Applying the Scenario B workflow at the 99.9% sensitivity threshold (threshold 2) left 89.4%, 88.0% and 78.3% of cases to be read by the double reading workflow for DL-1, DL-2 and DL-3 respectively. At this threshold, 1 (0.1%) SDC was missed, and between 18-55 (2.6-8.0%) NRCs and 12-21 (2.7-4.8%) ICs were missed from rule out triage, Table 7-5. The specificity and sensitivity were both non-inferior. The arbitration rate (Δ 0.0%~-0.2%) and recall rate (Δ 0.0%~-0.1%) showed little or no change at this threshold compared to double reading, Table 7-7.
| | DL-1 + readers | DL-2 + readers | DL-3 + readers |
|---|---|---|---|
| Sensitivity threshold | 99.0% | 99.0% | 99.0% |
| Sensitivity | 68.2% [65.6-70.7] p < 0.01 | 68.2% [65.6-70.7] p < 0.01 | 68.2% [65.6-70.7] p < 0.01 |
| Specificity | 98.0% [97.9-98.1] p < 0.01 | 98.1% [98.0-98.2] p < 0.01 | 98.2% [98.1-98.3] p < 0.01 |
| Precision | 36.8% | 37.6% | 39.2% |
| Arbitration | 2.1% | 1.9% | 1.8% |
| Recall rate | 3.1% | 3.1% | 2.9% |
| n (%) Missed | | | |
| SDC | 9 (1.0%) | 9 (1.0%) | 9 (1.0%) |
| IC | 74 (16.9%) | 107 (24.4%) | 114 (26.0%) |
| NRC | 100 (14.5%) | 214 (31.1%) | 222 (32.3%) |
| FRC | 18 (12.1%) | 58 (38.9%) | 74 (49.7%) |
| NRIC | 32 (16.1%) | 82 (41.2%) | 87 (43.7%) |
| Rule out | | | |
| Normal cases n (%) | 27332 (34.7%) | 41467 (52.6%) | 43363 (55.0%) |
| Reader FP n (%) | 318 (14.4%) | 369 (20.1%) | 465 (25.3%) |

Table 7-6 – Results for DL-1, DL-2 and DL-3 at the 99.0% sensitivity (threshold 1) Scenario B. FP: False positive, FRC: Future round cancer, IC: Interval cancer, NRC: Next round cancer, NRIC: Next round interval cancer, SDC: Screen detected cancer. p values are calculated using a one-sided z-test.

| | DL-1 + readers | DL-2 + readers | DL-3 + readers |
|---|---|---|---|
| Sensitivity threshold | 99.9% | 99.9% | 99.9% |
| Sensitivity | 68.9% [66.3-71.3] p < 0.01 | 68.9% [66.3-71.3] p < 0.01 | 68.9% [66.3-71.3] p < 0.01 |
| Specificity | 97.7% [97.6-97.8] p < 0.01 | 97.6% [97.5-97.7] p < 0.01 | 97.8% [97.6-97.9] p < 0.01 |
| Precision | 33.6% | 33.1% | 34.4% |
| Arbitration | 2.4% | 2.5% | 2.3% |
| Recall rate | 3.4% | 3.5% | 3.4% |
| n (%) Missed | | | |
| SDC | 1 (0.1%) | 1 (0.1%) | 1 (0.1%) |
| IC | 14 (3.2%) | 12 (2.7%) | 21 (4.8%) |
| NRC | 18 (2.6%) | 28 (4.1%) | 55 (8.0%) |
| FRC | 2 (0.6%) | 7 (4.7%) | 28 (18.8%) |
| NRIC | 7 (3.5%) | 14 (7.0%) | 21 (10.6%) |
| Rule out | | | |
| Normal cases n (%) | 8342 (10.6%) | 9371 (11.9%) | 17017 (21.6%) |
| Reader FP n (%) | 68 (3.7%) | 25 (1.4%) | 131 (7.1%) |

Table 7-7 – Results for DL-1, DL-2 and DL-3 at the 99.9% sensitivity (threshold 2) Scenario B. FP: False positive, FRC: Future round cancer, IC: Interval cancer, NRC: Next round cancer, NRIC: Next round interval cancer, SDC: Screen detected cancer. p values are calculated using a one-sided z-test.
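As a rough illustration of the one-sided z-test referred to in these table captions, the Python sketch below tests an observed proportion against a reference proportion. It is an assumption-laden sketch rather than the study's R code: the standard error is computed under the null value, the function name `one_sided_z_test` is hypothetical, and the non-inferiority margin of 5 percentage points is a placeholder because the margin is not stated in this section. The denominator uses the 887 SDCs and 439 ICs that were classed as cancers for overall sensitivity.

```python
import math


def one_sided_z_test(p_hat, p_ref, n, margin=0.0):
    """One-sample, one-tailed z-test for a proportion.

    H0: p <= p_ref - margin  versus  H1: p > p_ref - margin.
    margin = 0 gives a superiority test; margin > 0 a non-inferiority test.
    """
    p_null = p_ref - margin
    se = math.sqrt(p_null * (1.0 - p_null) / n)  # standard error under H0 (assumption)
    z = (p_hat - p_null) / se
    p_value = 0.5 * math.erfc(z / math.sqrt(2.0))  # upper-tail normal probability
    return z, p_value


n_cancers = 887 + 439          # SDCs + ICs classed as cancer cases
HYPOTHETICAL_MARGIN = 0.05     # placeholder margin, not taken from the thesis

# Reported sensitivities: double reading 68.9%, AI + readers 68.2% (Table 7-6).
z_ni, p_ni = one_sided_z_test(0.682, 0.689, n_cancers, margin=HYPOTHETICAL_MARGIN)
z_sup, p_sup = one_sided_z_test(0.682, 0.689, n_cancers)  # superiority, margin = 0
print(round(z_ni, 2), p_ni < 0.01, p_sup > 0.05)
```

With this placeholder margin the non-inferiority test is significant while the superiority test is not, which is the pattern expected when sensitivity is slightly lower than double reading but well within the margin.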
Applying Scenario C at the 99.0% sensitivity threshold (threshold 1) resulted in 35.0-55.6% of cases being classified to be read by one reader only, whilst maintaining SDC detection (99.9-100%) and lowering the arbitration rate (Δ -0.4%~-0.7%). However, the recall rate was observed to be higher (Δ +0.2%~+0.3%), Table 7-8. Specificity and sensitivity performance was non-inferior (p < 0.01).

| | DL-1 + readers | DL-2 + readers | DL-3 + readers |
|---|---|---|---|
| Sensitivity | 68.9% [66.3-71.3] p < 0.01 | 69.0% [66.4-71.5] p < 0.01 | 68.9% [66.4-71.4] p < 0.01 |
| Specificity | 97.4% [97.3-97.5] p < 0.01 | 97.4% [97.2-97.5] p < 0.01 | 97.4% [97.3-97.5] p < 0.01 |
| Precision | 31.5% | 30.8% | 30.9% |
| Arbitration | 2.1% | 1.9% | 1.8% |
| Recall rate | 3.7% | 3.8% | 3.7% |
| n (%) Detected | | | |
| SDC | 886 (99.9%) | 886 (99.9%) | 887 (100%) |
| IC | 27 (6.2%) | 29 (6.6%) | 27 (6.2%) |
| NRC | 20 (2.9%) | 21 (3.1%) | 22 (3.2%) |
| FRC | 4 (2.7%) | 4 (2.7%) | 5 (3.4%) |
| NRIC | 8 (4.0%) | 9 (4.5%) | 7 (3.5%) |
| Rule out | | | |
| Single reading (%) | 27565 (35.0%) | 41937 (53.2%) | 43869 (55.6%) |
| False positives n | 1956 | 2019 | 2008 |

Table 7-8 – Results for DL-1, DL-2 and DL-3 at the 99.0% sensitivity (threshold 1) Scenario C. FRC: Future round cancer, IC: Interval cancer, NRC: Next round cancer, NRIC: Next round interval cancer, SDC: Screen detected cancer. p values are calculated using a one-sided z-test.

7.4.3 Rule-out triage – Threshold 3 and 4

The AUROC for DL-1, DL-2 and DL-3 was 0.813 [95% CI 0.802–0.824], 0.814 [95% CI 0.803–0.825], and 0.821 [95% CI 0.886–0.906] respectively, when classifying cancers as SDCs, ICs and NRCs (Figure 7-5). There was a statistically significant difference (p < 0.05) between the AUROC of DL-3 compared to DL-1 and DL-2 when tested on Norwich data.

Applying Scenario B at the 85.0% sensitivity threshold (threshold 3), where SDCs, NRCs and ICs were classed as cases, the percentage of cases requiring double reading after applying the AI algorithm threshold was 49.8% for DL-1, 49.5% for DL-2, and 48.4% for DL-3.
Applying Scenario B at the 70.0% specificity threshold (threshold 4), where SDCs, NRCs and ICs were classed as cases, the percentage of cases requiring double reading after applying the AI algorithm threshold was 31.2% for DL-1, 31.1% for DL-2, and 31.2% for DL-3. The results of this analysis for each AI algorithm are shown in Table 7-9.

Figure 7-5 – Receiver operating characteristic (ROC) curves for screen detected cancers (SDCs), next round cancers (NRCs) and interval cancers (ICs) as cases. a) for each site, b) for each algorithm. The overall results are in grey, Cambridge in orange and Norwich in pink. The results for DL-1 are in blue, DL-2 in purple, and DL-3 in green. Area under the receiver operating characteristic curve values are provided for each site and each algorithm's performance.

| | Sensitivity Threshold | Specificity | Missed SDC | Missed IC | Missed NRC | FP out | % to read |
|---|---|---|---|---|---|---|---|
| Cases | - | - | 887 | 439 | 688 | 1840 | - |
| DL-1 | 1) 85.0% | 51.3% [47.5-56.9] | 15 (1.7%) | 118 (26.9%) | 169 (24.6%) | 507 (27.6%) | 49.8% |
| DL-2 | 1) 85.0% | 51.6% [48.4-54.7] | 8 (0.9%) | 95 (21.6%) | 199 (28.9%) | 338 (18.4%) | 49.5% |
| DL-3 | 1) 85.0% | 52.6% [50.0-55.6] | 7 (0.8%) | 97 (22.1%) | 198 (28.8%) | 420 (22.8%) | 48.4% |
| | Specificity Threshold | | | | | | |
| DL-1 | 2) 70.0% | 75.3% [73.4-77.3] | 34 (3.8%) | 182 (41.5%) | 281 (40.8%) | 827 (45.0%) | 31.2% |
| DL-2 | 2) 70.0% | 74.8% [72.9-76.7] | 20 (2.3%) | 172 (39.2%) | 316 (45.9%) | 661 (35.9%) | 31.1% |
| DL-3 | 2) 70.0% | 75.8% [73.8-77.7] | 19 (2.1%) | 156 (35.5%) | 313 (45.5%) | 696 (37.8%) | 31.2% |

Table 7-9 – Results at 1) 85.0% sensitivity (threshold 3) and 2) 70.0% specificity (threshold 4). Missed cases are shown for screen detected cancers, next round cancers and interval cancers as well as the proportion of false positives ruled out. Screen detected cancers, next round cancers and interval cancers were classed as cases for the threshold identification. FP: False positive, IC: Interval cancer, NRC: Next round cancer, SDC: Screen detected cancer.
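For readers less familiar with the AUROC values quoted in this section, the following Python sketch shows the rank-based interpretation of the statistic: the probability that a randomly chosen cancer case receives a higher AI score than a randomly chosen normal case, with ties counted as one half. The scores are toy values, the function name `auroc` is hypothetical, and the pairwise loop is for illustration only and is not how the study computed its results.

```python
def auroc(cancer_scores, normal_scores):
    """AUROC as the share of (cancer, normal) pairs ranked correctly; ties count 0.5."""
    wins = 0.0
    for c in cancer_scores:
        for n in normal_scores:
            if c > n:
                wins += 1.0
            elif c == n:
                wins += 0.5
    return wins / (len(cancer_scores) * len(normal_scores))


# Toy scores only (not study data).
toy_cancers = [0.9, 0.8, 0.4]
toy_normals = [0.1, 0.3, 0.5, 0.7]
toy_auc = auroc(toy_cancers, toy_normals)  # 10 of 12 pairs are ranked correctly
print(toy_auc)
```

Under this reading, the lower AUROC obtained when ICs and NRCs are added to the case definition reflects that many of those cancers score no higher than normal cases at the time of screening.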
| | DL-1 + readers | DL-2 + readers | DL-3 + readers |
|---|---|---|---|
| Sensitivity | 69.0% [66.4-71.5] p < 0.01 | 69.0% [66.4-71.5] p < 0.01 | 68.9% [66.4-71.4] p < 0.01 |
| Specificity | 97.4% [97.2-97.5] p < 0.01 | 97.4% [97.3-97.5] p < 0.01 | 97.4% [97.3-97.5] p < 0.01 |
| Precision | 30.8% | 31.0% | 31.1% |
| Arbitration | 1.8% | 2.0% | 1.9% |
| Recall rate | 3.8% | 3.8% | 3.7% |
| n (%) Detected | | | |
| SDC | 886 (99.9%) | 886 (99.9%) | 887 (100%) |
| IC | 29 (6.6%) | 29 (6.6%) | 27 (6.2%) |
| NRC | 19 (2.8%) | 21 (3.1%) | 21 (3.1%) |
| FRC | 4 (2.7%) | 4 (2.7%) | 5 (3.4%) |
| NRIC | 9 (4.5%) | 9 (4.5%) | 7 (3.5%) |
| Rule out | | | |
| Single reading (%) | 39625 (50.3%) | 39923 (50.6%) | 40719 (51.6%) |
| False positive n | 2023 | 2005 | 1985 |

Table 7-10 – Results for DL-1, DL-2 and DL-3 at the 85.0% sensitivity (threshold 3) Scenario C. FRC: Future round cancer, IC: Interval cancer, NRC: Next round cancer, NRIC: Next round interval cancer, SDC: Screen detected cancer. p values are calculated using a one-sided z-test.

| | DL-1 + readers | DL-2 + readers | DL-3 + readers |
|---|---|---|---|
| Sensitivity | 68.8% [66.2-71.3] p < 0.01 | 68.8% [66.2-71.3] p < 0.01 | 68.8% [66.2-71.3] p < 0.01 |
| Specificity | 97.2% [97.1-97.4] p < 0.01 | 97.2% [97.1-97.4] p < 0.01 | 97.3% [97.2-97.4] p < 0.01 |
| Precision | 29.9% | 29.9% | 30.3% |
| Arbitration | 1.3% | 1.5% | 1.5% |
| Recall rate | 3.9% | 3.9% | 3.8% |
| n (%) Detected | | | |
| SDC | 883 (99.5%) | 882 (99.4%) | 885 (99.8%) |
| IC | 29 (6.6%) | 30 (6.8%) | 27 (6.2%) |
| NRC | 18 (2.6%) | 20 (2.9%) | 21 (3.1%) |
| FRC | 4 (2.7%) | 5 (3.6%) | 6 (4.0%) |
| NRIC | 9 (4.5%) | 9 (4.5%) | 7 (3.5%) |
| Rule out | | | |
| Single reading (%) | 54281 (68.8%) | 54292 (68.9%) | 54266 (68.8%) |
| False positive n | 2110 | 2102 | 2063 |

Table 7-11 – Results for DL-1, DL-2 and DL-3 at the 70.0% specificity (threshold 4) Scenario C. FRC: Future round cancer, IC: Interval cancer, NRC: Next round cancer, NRIC: Next round interval cancer, SDC: Screen detected cancer. p values are calculated using a one-sided z-test.

Implementing the alternative Scenario C and threshold 3, specificity and sensitivity performance was found to be non-inferior (p < 0.01), Table 7-10.
A lower arbitration rate (Δ -0.5%~-0.7%) was observed, whilst the recall rate (Δ +0.2%~+0.3%) was higher. Implementing the alternative Scenario C and threshold 4, specificity and sensitivity performance was found to be non-inferior (p < 0.01), Table 7-11. The arbitration rate (Δ -1.0%~-1.2%) was again lower, whilst the recall rate (Δ +0.3%~+0.4%) was higher.

The density and violin plots in Figure 7-6 show the distribution of cases and the assigned thresholds 1, 2, 3 and 4 for each AI algorithm.

Figure 7-6 – Plots for rule out triage thresholds. a) Density plot for screen detected cancers as cases, b) density plot for screen detected, next round and interval cancers as cases where the cancers are in blue and normal cases in red, c) violin plot for all cancer case types, where the blue dot in the violin plot is the mean score and the red is the median score. The pink line represents threshold 1 (99.0% sensitivity), the green line represents threshold 2 (99.9% sensitivity), the purple line represents threshold 3 (85.0% sensitivity) and the orange line represents threshold 4 (70.0% specificity). FRC: Future round cancer, IC: Interval cancer, NRC: Next round cancer, NIC: Next round interval cancer, SDC: Screen detected cancer.

7.4.4 Rule-in triage

Scenario D applied a 94.0-99.0% specificity cut-off to determine the percentage of ICs and NRCs with a high suspicion that should be referred for additional supplemental imaging or assessment, Table 7-12. At the lower 94.0% specificity threshold 101-115 (23.0-26.2%) ICs and 142-157 (20.6-22.8%) NRCs were ruled in for further assessment. At the 99.0% specificity threshold 26-46 (5.9-10.5%) ICs and 40-44 (5.8-6.4%) NRCs were recalled for further assessment.
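The rule-in cut-offs can be sketched in the same spirit. The Python snippet below, which uses toy scores and hypothetical function names, finds the lowest score cut-off at which at least a target share of normal cases falls below it (the target specificity) and then counts how many cases of a given type score at or above that cut-off. It assumes higher scores mean higher suspicion and is not the study's implementation.

```python
import bisect
import math


def rule_in_threshold(normal_scores, target_specificity):
    """Lowest cut-off t with at least target_specificity of normal cases scoring below t."""
    ranked = sorted(normal_scores)
    needed_below = math.ceil(len(ranked) * target_specificity - 1e-9)
    for t in sorted(set(ranked)):
        # bisect_left gives the number of normal scores strictly below t.
        if bisect.bisect_left(ranked, t) >= needed_below:
            return t
    return math.inf  # no observed score reaches the target specificity


def count_ruled_in(scores, threshold):
    """Number of cases scoring at or above the rule-in cut-off."""
    return sum(1 for s in scores if s >= threshold)


# Toy data only (not study data).
toy_normal_scores = [0.05, 0.10, 0.12, 0.20, 0.22, 0.25, 0.30, 0.33, 0.35, 0.40,
                     0.42, 0.45, 0.50, 0.52, 0.58, 0.62, 0.65, 0.72, 0.78, 0.88]
toy_interval_cancer_scores = [0.30, 0.45, 0.60, 0.81, 0.92]

rule_in_cutoff = rule_in_threshold(toy_normal_scores, target_specificity=0.90)
ics_ruled_in = count_ruled_in(toy_interval_cancer_scores, rule_in_cutoff)
false_positives = count_ruled_in(toy_normal_scores, rule_in_cutoff)
print(rule_in_cutoff, ics_ruled_in, false_positives)
```

Lowering the target specificity moves the cut-off down, so more interval and next round cancers are ruled in at the cost of more false positive recalls, which is the trade-off examined across the 94.0-99.0% range.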
SDC, IC, NRC as cases ruled in:

| Spec | DL-1 %TRR | DL-1 IC | DL-1 NRC | DL-2 %TRR | DL-2 IC | DL-2 NRC | DL-3 %TRR | DL-3 IC | DL-3 NRC |
|---|---|---|---|---|---|---|---|---|---|
| 99.0% | 4.5% | 40 (9.1%) | 44 (6.4%) | 4.5% | 26 (5.9%) | 40 (5.8%) | 4.4% | 46 (10.5%) | 41 (6.0%) |
| 98.0% | 5.5% | 56 (12.8%) | 69 (10.0%) | 5.4% | 48 (10.9%) | 82 (11.9%) | 5.4% | 62 (14.1%) | 77 (11.2%) |
| 97.0% | 6.4% | 68 (15.5%) | 94 (13.7%) | 6.4% | 68 (15.5%) | 96 (14.0%) | 6.3% | 74 (16.9%) | 96 (14.0%) |
| 96.0% | 7.3% | 82 (18.7%) | 116 (16.9%) | 7.3% | 75 (17.1%) | 115 (16.7%) | 7.2% | 88 (20.0%) | 114 (16.6%) |
| 95.0% | 8.3% | 94 (21.4%) | 137 (19.9%) | 8.2% | 87 (19.8%) | 129 (18.8%) | 8.2% | 101 (23.0%) | 130 (18.9%) |
| 94.0% | 9.2% | 101 (23.0%) | 157 (22.8%) | 9.2% | 103 (23.5%) | 149 (21.7%) | 9.1% | 115 (26.2%) | 142 (20.6%) |

Table 7-12 – Scenario D perturbations of specificity with screen detected cancers (SDCs), interval cancers (ICs) and next round cancers (NRCs) as cases. IC: Interval cancer, NRC: Next round cancer, Spec: Specificity, %TRR: Percentage total recall rate.

Table 7-13 shows the proportion of FRCs and NRICs that could have been detected at these thresholds (94.0-99.0% specificity), further increasing the proportion of cancers which could potentially be detected earlier. In addition, the number of additional false positive recalls, which would ultimately lead to an increase in the recall rate from this scenario, is included in Table 7-13 and increases as the specificity threshold is reduced.
FP, FRC, NRIC ruled in:

| Spec | DL-1 FP | DL-1 FRC | DL-1 NRIC | DL-2 FP | DL-2 FRC | DL-2 NRIC | DL-3 FP | DL-3 FRC | DL-3 NRIC |
|---|---|---|---|---|---|---|---|---|---|
| 99% | 753 | 9 (6.0%) | 9 (4.5%) | 757 | 3 (2.0%) | 8 (4.0%) | 760 | 4 (2.7%) | 5 (2.5%) |
| 98% | 1508 | 12 (8.1%) | 15 (7.5%) | 1521 | 5 (3.4%) | 11 (5.5%) | 1519 | 8 (5.4%) | 10 (5.0%) |
| 97% | 2269 | 14 (9.4%) | 21 (10.6%) | 2284 | 7 (4.7%) | 14 (7.0%) | 2281 | 10 (6.7%) | 14 (7.0%) |
| 96% | 3026 | 18 (12.1%) | 26 (13.1%) | 3051 | 8 (5.4%) | 14 (7.0%) | 3048 | 10 (6.7%) | 16 (8.0%) |
| 95% | 3787 | 24 (16.1%) | 30 (15.1%) | 3815 | 12 (8.1%) | 15 (7.5%) | 3805 | 12 (8.1%) | 25 (12.6%) |
| 94% | 4550 | 28 (18.8%) | 32 (16.1%) | 4573 | 18 (12.1%) | 19 (9.6%) | 4568 | 15 (10.1%) | 28 (14.1%) |

Table 7-13 – Scenario D perturbations of specificity with screen detected cancers (SDCs), interval cancers (ICs) and next round cancers (NRCs) as cases – additional cancers detected. FP: False positive, FRC: Future round cancer, NRIC: Next round interval cancer, Spec: Specificity.

The results of applying the 94.0-99.0% specificity thresholds for Scenario E, where cases are automatically recalled if they are above the threshold and were not recalled by human readers, are shown in Table 7-14.

SDC, IC, NRC as cases ruled in:

| Spec | DL-1 %ARR | DL-1 IC | DL-1 NRC | DL-2 %ARR | DL-2 IC | DL-2 NRC | DL-3 %ARR | DL-3 IC | DL-3 NRC |
|---|---|---|---|---|---|---|---|---|---|
| 99.0% | 1.0% | 32 (7.3%) | 42 (6.1%) | 0.9% | 20 (4.6%) | 36 (5.2%) | 0.9% | 36 (8.2%) | 37 (5.4%) |
| 98.0% | 1.9% | 47 (10.7%) | 67 (9.7%) | 1.9% | 38 (8.7%) | 75 (10.9%) | 1.8% | 49 (11.2%) | 71 (10.3%) |
| 97.0% | 2.9% | 59 (13.4%) | 92 (13.4%) | 2.8% | 58 (13.2%) | 88 (12.8%) | 2.8% | 60 (13.7%) | 90 (13.1%) |
| 96.0% | 3.8% | 71 (16.2%) | 111 (16.1%) | 3.8% | 65 (14.8%) | 107 (15.6%) | 3.7% | 73 (16.6%) | 106 (16.6%) |
| 95.0% | 4.8% | 82 (18.7%) | 132 (19.2%) | 4.7% | 76 (17.3%) | 120 (17.4%) | 4.6% | 85 (19.4%) | 121 (17.6%) |
| 94.0% | 5.7% | 88 (20.1%) | 152 (22.1%) | 5.7% | 90 (20.5%) | 139 (20.2%) | 5.6% | 99 (22.6%) | 132 (19.2%) |

Table 7-14 – Scenario E perturbations of specificity with screen detected cancers (SDCs), interval cancers (ICs) and next round cancers (NRCs) as cases. %ARR: Additional Recall rate, IC: Interval cancer, NRC: Next round cancer, Spec: Specificity.
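The relationship between the total recall rate reported for Scenario D and the additional recall rate reported for Scenario E can be sketched as follows. This Python example uses toy data and a hypothetical function name, and it assumes that the total recall rate counts every case recalled by either the readers or the AI rule-in, whereas the additional recall rate counts only AI-flagged cases that the readers did not recall. That reading is an interpretation of the two table headings rather than a definition stated in the text.

```python
def recall_rates(ai_scores, reader_recalled, threshold):
    """Reader, total (Scenario D) and additional (Scenario E) recall rates in percent."""
    n = len(ai_scores)
    ai_flagged = [s >= threshold for s in ai_scores]
    reader_recalls = sum(reader_recalled)
    # Recalled by readers or flagged by the AI rule-in.
    total_recalls = sum(1 for a, r in zip(ai_flagged, reader_recalled) if a or r)
    # Flagged by the AI rule-in but not recalled by readers.
    additional_recalls = sum(1 for a, r in zip(ai_flagged, reader_recalled) if a and not r)
    return {
        "reader_pct": 100.0 * reader_recalls / n,
        "total_pct": 100.0 * total_recalls / n,
        "additional_pct": 100.0 * additional_recalls / n,
    }


# Toy data only (not study data): ten cases with AI scores and reader decisions.
toy_ai_scores = [0.10, 0.95, 0.40, 0.85, 0.20, 0.90, 0.30, 0.05, 0.60, 0.99]
toy_reader_recalled = [False, True, False, False, False,
                       False, True, False, False, True]

rates = recall_rates(toy_ai_scores, toy_reader_recalled, threshold=0.80)
print(rates)
```

Under this assumption the total recall rate equals the reader recall rate plus the additional recall rate, which matches the pattern in the reported figures (for example, a 3.5% double reading recall rate plus roughly 1.0% additional recalls at 99.0% specificity gives the 4.5% total recall rate in Table 7-12).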
Applying this scenario at the 94.0% threshold results in an increase in the number of ICs (20.1-22.6%) and NRCs (19.2-22.1%) detected. Fewer cases overall are detected at the higher specificity threshold of 99.0% (ICs 4.6-8.2%, NRCs 5.2-6.1%). However, the recall rate also increased at all thresholds. This is potentially offset further by the increase in FRCs and NRICs detected, shown in Table 7-15.

FP, FRC, NRIC ruled in:

| Spec | DL-1 FP | DL-1 FRC | DL-1 NRIC | DL-2 FP | DL-2 FRC | DL-2 NRIC | DL-3 FP | DL-3 FRC | DL-3 NRIC |
|---|---|---|---|---|---|---|---|---|---|
| 99% | 675 | 9 (6.0%) | 8 (4.0%) | 668 | 3 (2.0%) | 6 (3.0%) | 630 | 4 (2.7%) | 4 (2.0%) |
| 98% | 1369 | 12 (8.1%) | 13 (6.5%) | 1358 | 5 (3.4%) | 9 (4.5%) | 1299 | 8 (5.4%) | 9 (4.5%) |
| 97% | 2067 | 14 (9.4%) | 18 (9.1%) | 2061 | 7 (4.7%) | 12 (6.0%) | 1992 | 10 (6.7%) | 13 (6.5%) |
| 96% | 2774 | 17 (11.4%) | 23 (11.6%) | 2772 | 8 (5.4%) | 12 (6.0%) | 2701 | 10 (6.7%) | 15 (7.5%) |
| 95% | 3487 | 23 (15.4%) | 27 (13.6%) | 3484 | 12 (8.1%) | 13 (6.5%) | 3405 | 12 (8.1%) | 23 (11.6%) |
| 94% | 4204 | 27 (18.1%) | 29 (14.6%) | 4196 | 17 (11.4%) | 16 (8.0%) | 4130 | 15 (10.1%) | 24 (12.1%) |

Table 7-15 – Scenario E perturbations of specificity with screen detected cancers (SDCs), interval cancers (ICs) and next round cancers (NRCs) as cases – additional cancers detected. FP: False positive, FRC: Future round cancer, NRIC: Next round interval cancer, Spec: Specificity.

The density and violin plots in Figure 7-7 show the distribution of cases and the assigned 94.0% and 99.0% specificity threshold cut-offs for each algorithm in the rule in triage approaches.

Figure 7-7 – Plots for rule in triage thresholds – Screen detected cancers (SDCs), next round cancers (NRCs) and interval cancers (ICs). a) Density plot for screen detected, next round and interval cancers as cases in blue and normal cases in red, b) violin plot for all cancer case types, where the blue dot in the violin plot is the mean score and the red is the median score. The green line represents 94.0% specificity, the pink line represents 99.0% specificity.
FRC: Future round cancer, IC: Interval cancer, NRC: Next round cancer, NIC: Next round interval cancer, SDC: Screen detected cancer.

7.4.5 Combined approach

The combined approach entailed combining Scenario C at 99.0% sensitivity (threshold 1) and Scenario E at the 99.0% specificity threshold, as shown in Figure 7-8.

Figure 7-8 – Violin plots for the combined approach of Scenario C and E for both rule in and rule out triage by an artificial intelligence (AI) algorithm. The normal triage threshold was set at threshold 1 (99.0% sensitivity), as shown by the green line on the violin plot. The high suspicion rule in threshold was set at 99.0% specificity, as shown by the purple line on the violin plot. The blue dot in the violin plot is the mean score and the red is the median score. FRC: Future round cancer, IC: Interval cancer, NRC: Next round cancer, NIC: Next round interval cancer, SDC: Screen detected cancer.

| | DL-1 + readers | DL-2 + readers | DL-3 + readers |
|---|---|---|---|
| Sensitivity | 71.3% [68.7-73.7] p = 0.03516* | 70.5% [68.0-73.0] p > 0.05* | 71.6% [69.1-74.1] p = 0.01758* |
| Specificity | 96.5% [96.4-96.6] p < 0.01 | 96.4% [96.3-96.6] p < 0.01 | 96.5% [96.4-96.6] p < 0.01 |
| Precision | 25.8% | 25.3% | 25.9% |
| Arbitration | 2.1% | 1.9% | 1.8% |
| Recall rate | 4.7% | 4.7% | 4.7% |
| n (%) Detected | | | |
| SDC | 886 (99.9%) | 886 (99.9%) | 887 (100%) |
| IC | 59 (13.4%) | 49 (11.2%) | 63 (14.4%) |
| NRC | 62 (9.0%) | 57 (8.3%) | 59 (8.6%) |
| FRC | 13 (8.7%) | 7 (4.7%) | 9 (6.0%) |
| NRIC | 16 (8.0%) | 15 (7.5%) | 11 (5.5%) |
| Rule out | | | |
| Single reading (%) | 27565 (35.0%) | 41937 (53.2%) | 43869 (55.6%) |
| Rule in | | | |
| % Additional RR | 0.97% | 0.93% | 0.90% |

Table 7-16 – Combined approach of Scenario C and E for both rule in and rule out triage by an artificial intelligence (AI) algorithm. The normal triage threshold was set at 99.0% sensitivity (threshold 1). The high suspicion rule in threshold was set at 99.0% specificity. FRC: Future round cancer, IC: Interval cancer, NRC: Next round cancer, NRIC: Next round interval cancer, RR: Recall rate, SDC: Screen detected cancer.
p values are calculated using a one-sided z-test. *Tested for superiority.

Using this approach the sensitivity was superior (p < 0.05) for DL-1 and DL-3. However, this would result in a trade off in specificity. Overall the proportion of ICs (Δ +5.0% to +8.2%) and NRCs (Δ +5.4% to +6.1%) detected was higher, Table 7-16. The recall rate was observed to be higher (Δ +1.2%) and the arbitration rate was lower (Δ -0.4% to -0.7%), Table 7-16.

Two settings were applied to calculate the pAUROC for DL-1, DL-2 and DL-3: first at 94.0-99.0% specificity, as shown in blue in Figure 7-9, and then at either 99.0-100% sensitivity (Figure 7-9.a) or 85.0-100% sensitivity (Figure 7-9.b). Overall the AI algorithms achieved a good pAUROC at all thresholds, with DL-3 achieving the highest pAUROC at all settings, Table 7-17.

| Cases | Range | DL-1 | DL-2 | DL-3 |
|---|---|---|---|---|
| SDC | 94.0-99.0% Specificity | 89.6% [88.3-90.8] | 89.2% [87.9-90.4] | 93.0% [91.9-94.0] |
| SDC | 99.0-100% Sensitivity | 62.7% [58.1-68.1] | 66.8% [61.2-75.3] | 70.1% [65.1-77.4] |
| SDC / NRC / IC | 94.0-99.0% Specificity | 71.2% [70.1-72.2] | 70.7% [69.6-71.8] | 72.7% [71.6-73.8] |
| SDC / NRC / IC | 85.0-100% Sensitivity | 62.1% [60.7-63.8] | 63.1% [61.7-64.6] | 63.6% [62.2-65.0] |

Table 7-17 – Partial area under the receiver operating characteristic (pAUROC) curve results. 95.0% CI in square brackets. IC: Interval cancer, NRC: Next round cancer, pAUROC: Partial area under the receiver operating characteristic curve, SDC: Screen detected cancer.

Figure 7-9 – Partial receiver operating characteristic (pROC) curves. a) Screen detected cancers as cases applying a 99.0-100% sensitivity to reflect rule out threshold 1 and 2 in the study as shown in green, and a 94.0-99.0% specificity to reflect the rule in thresholds used in this study as shown in blue.
b) Screen detected cancers, next round cancers and interval cancers as cases applying 85.0-100% sensitivity to reflect rule out threshold 1, 2 and 3 in the study as shown in green, and a 94.0-99.0% specificity to reflect the rule in thresholds used in this study as shown in blue.

7.4.6 Sub-group analysis

Performance of the AI algorithms was further assessed at threshold 1 (99.0% sensitivity) for the SDCs that were missed by each AI algorithm at this threshold [n = 9 (1.0%)] using Scenario B, Table 7-18. There was no statistically significant difference in the types of cancers missed relative to the true distribution of cancer cases using the Chi squared (χ²) test (p > 0.05).

| Characteristic | Category | n | DL-1 | DL-2 | DL-3 |
|---|---|---|---|---|---|
| Total Cases n | | 887 | 9 | 9 | 9 |
| Total Lesions n | | 921 | 9 | 9 | 9 |
| Total invasive lesions | | 770 | 8 | 5 | 7 |
| Age at Screening | < 60 | 327 | 3 (0.9%) | 4 (1.2%) | 4 (1.2%) |
| | >= 60 | 560 | 6 (1.1%) | 5 (0.9%) | 5 (0.9%) |
| | p value | - | 0.82696 | 0.639289 | 0.639289 |
| FFDM Vendor | GE | 554 | 5 (0.9%) | 7 (1.3%) | 7 (1.3%) |
| | Philips | 333 | 4 (1.2%) | 2 (0.6%) | 2 (0.6%) |
| | p value | - | 0.670613 | 0.34459 | 0.34459 |
| Invasive Tumour Size (d) | < 15 mm | 415 | 3 (0.7%) | 1 (0.2%) | 5 (1.2%) |
| | >= 15 mm | 326 | 5 (1.5%) | 4 (1.2%) | 2 (0.6%) |
| | Missing | 29 | 0 (0.0%) | 0 (0.0%) | 0 (0.0%) |
| | p value | - | 0.294494 | 0.106186 | 0.413069 |
| Invasive Tumour Grade (d) | 1 | 184 | 0 (0.0%) | 0 (0.0%) | 0 (0.0%) |
| | 2 | 403 | 7 (1.7%) | 4 (1.0%) | 5 (1.2%) |
| | 3 | 157 | 1 (0.6%) | 1 (0.6%) | 2 (1.3%) |
| | Missing | 8 | 0 (0.0%) | 0 (0.0%) | 0 (0.0%) |
| | p value | - | - | - | - |
| Density BI-RADS (a) | a | 124 | 2 (1.6%) | 0 (0.0%) | 0 (0.0%) |
| | b | 531 | 4 (0.8%) | 7 (1.3%) | 5 (0.9%) |
| | c | 225 | 3 (1.3%) | 2 (0.9%) | 4 (1.8%) |
| | d | 7 | 0 (0.0%) | 0 (0.0%) | 0 (0.0%) |
| | p value | - | - | - | - |
| Density BI-RADS (b) | a | 47 | 0 (0.0%) | 0 (0.0%) | 0 (0.0%) |
| | b | 169 | 2 (1.2%) | 1 (0.6%) | 1 (0.6%) |
| | c | 87 | 0 (0.0%) | 0 (0.0%) | 0 (0.0%) |
| | d | 35 | 2 (5.7%) | 0 (0.0%) | 1 (2.9%) |
| | Missing | 549 | 5 (0.9%) | 8 (1.5%) | 7 (1.3%) |
| | p value | - | - | - | - |

Table 7-18 – Sub group analysis of DL-1, DL-2, DL-3 set at the threshold of 99.0% sensitivity (threshold 1) using Scenario B for the screen detected cancers (SDCs) missed.
BI-RADS: Breast imaging-reporting and data system, FFDM: Full Field Digital Mammography, GE: General Electric. (d) Lesions reported. (a) DL-3 BI-RADS density scores on processed full field digital mammograms. (b) Volpara 5th edition BI-RADS mammographic breast density from raw full field digital mammograms for Cambridge data. p values were determined using the Chi squared (χ²) test to compare the detected proportion of cancer cases against the true distribution for each cancer characteristic category. p values < 0.05 were considered statistically significant.

For Scenario E, the auto recall of cases with a high suspicion not recalled by double reading, the types of cases detected at the 94.0% specificity threshold are outlined in Table 7-19 for ICs and Table 7-20 for NRCs.

| Characteristic | Category | n | DL-1 | DL-2 | DL-3 |
|---|---|---|---|---|---|
| Total Cases n | | 439 | 88 | 90 | 99 |
| Total Lesions n | | 445 | 91 | 93 | 101 |
| Total invasive lesions | | 405 | 87 | 85 | 94 |
| Age at Screening | < 60 | 203 | 30 (14.8%) | 28 (13.8%) | 38 (18.7%) |
| | >= 60 | 236 | 58 (24.6%) | 62 (26.3%) | 61 (25.9%) |
| | p value | - | 0.036197 | 0.008378 | 0.155554 |
| FFDM Vendor | GE | 275 | 72 (26.2%) | 56 (20.4%) | 55 (20.0%) |
| | Philips | 164 | 16 (9.8%) | 34 (20.7%) | 44 (26.8%) |
| | p value | - | 0.000536 | 0.940191 | 0.190878 |
| Invasive Tumour Size (d) | < 15 mm | 126 | 25 (19.8%) | 23 (18.3%) | 16 (12.7%) |
| | >= 15 mm | 228 | 51 (22.4%) | 52 (22.8%) | 66 (29.0%) |
| | Missing | 51 | 11 (21.6%) | 10 (19.6%) | 12 (23.5%) |
| | p value | - | 0.904813 | 0.700864 | 0.019921 |
| Invasive Tumour Grade (d) | 1 | 55 | 11 (20.0%) | 11 (20.0%) | 14 (25.5%) |
| | 2 | 189 | 48 (25.4%) | 49 (25.9%) | 46 (24.3%) |
| | 3 | 152 | 26 (17.1%) | 24 (15.8%) | 32 (21.1%) |
| | Missing | 9 | 2 (22.2%) | 1 (11.1%) | 2 (22.2%) |
| | p value | - | 0.516076 | 0.280206 | 0.933244 |
| Time interval (months) | 0-12 | 69 | 13 (18.8%) | 12 (17.4%) | 19 (27.5%) |
| | 13-24 | 156 | 31 (19.9%) | 25 (16.0%) | 38 (24.4%) |
| | 25-36 | 214 | 44 (20.6%) | 53 (24.8%) | 42 (19.6%) |
| | p value | - | 0.966804 | 0.21088 | 0.482711 |
| Radiological classification | Normal / Benign | 337 | 44 (13.1%) | 52 (15.4%) | 56 (16.6%) |
| | Uncertain | 78 | 37 (47.4%) | 31 (39.7%) | 36 (46.2%) |
| | Suspicious | 9 | 5 (55.6%) | 5 (55.6%) | 4 (44.4%) |
| | Unclassifiable | 6 | 2 (33.3%) | 0 (0.0%) | 2 (33.3%) |
| | Missing | 9 | 0 (0.0%) | 2 (22.2%) | 1 (11.1%) |
| | p value | - | < 0.01 | - | 0.000567 |

| Density BI-RADS | n (b) | n (a) | DL-1 (b) | DL-1 (a) | DL-2 (b) | DL-2 (a) | DL-3 (b) | DL-3 (a) |
|---|---|---|---|---|---|---|---|---|
| a | 12 | 25 | 0 (0.0%) | 3 (12.0%) | 1 (8.3%) | 5 (20.0%) | 5 (41.7%) | 6 (24.0%) |
| b | 65 | 216 | 5 (7.7%) | 54 (25.0%) | 14 (21.5%) | 49 (22.7%) | 17 (26.2%) | 43 (19.9%) |
| c | 58 | 185 | 8 (13.8%) | 29 (15.7%) | 12 (20.7%) | 34 (18.4%) | 13 (22.4%) | 47 (25.4%) |
| d | 31 | 13 | 3 (9.7%) | 2 (15.4%) | 7 (22.6%) | 2 (15.4%) | 9 (29.0%) | 3 (23.1%) |
| Missing | 273 | 0 | 72 (26.4%) | 0 (0.0%) | 56 (20.5%) | 0 (0.0%) | 55 (20.1%) | 0 (0.0%) |
| p value | - | - | - | 0.2139 | 0.9271 | 0.8256 | 0.6093 | 0.7744 |

Table 7-19 – Sub group analysis of DL-1, DL-2, DL-3 set at the 94.0% specificity threshold, using Scenario E for the interval cancers (ICs) detected. BI-RADS: Breast imaging-reporting and data system, FFDM: Full Field Digital Mammography, GE: General Electric. (d) Lesions reported. (a) DL-3 BI-RADS density scores on processed full field digital mammograms. (b) Volpara 5th edition BI-RADS mammographic breast density from raw full field digital mammograms for Cambridge data. p values were determined using the Chi squared (χ²) test to compare the detected proportion of cancer cases against the true distribution for each cancer characteristic category. p values < 0.05 were considered statistically significant.
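The comparison described in the table footnote, detected cases against the true distribution of a characteristic, can be sketched as a Chi squared goodness-of-fit test. The following is a minimal illustration only: it assumes a two-category characteristic (one degree of freedom) and no continuity correction, and the function name is hypothetical. The exact test options used for the tables are not stated, so this sketch is not expected to reproduce the reported p values exactly.

```python
import math


def chi_squared_gof_two_groups(detected, totals):
    """Chi squared goodness-of-fit test for a characteristic with two categories.

    detected: cancers flagged by the algorithm in each category
    totals:   all cancers in each category (the 'true distribution')
    Assumption: no continuity correction is applied.
    """
    if len(detected) != 2 or len(totals) != 2:
        raise ValueError("this sketch only handles two categories (1 degree of freedom)")
    n_detected = sum(detected)
    n_total = sum(totals)
    # Expected detections if the algorithm flagged cases in proportion to the true distribution.
    expected = [n_detected * t / n_total for t in totals]
    statistic = sum((o - e) ** 2 / e for o, e in zip(detected, expected))
    # For 1 degree of freedom the chi squared survival function is erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value


# FFDM vendor row for DL-1 in Table 7-19: GE 72 of 275, Philips 16 of 164.
statistic, p_value = chi_squared_gof_two_groups(detected=[72, 16], totals=[275, 164])
print(f"chi2 = {statistic:.2f}, p = {p_value:.6f}")
```

Characteristics with more than two categories (for example tumour grade) would need the general Chi squared distribution, which a statistics library provides.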
| Characteristic | Category | n | DL-1 | DL-2 | DL-3 |
|---|---|---|---|---|---|
| Total Cases n | | 688 | 152 | 139 | 132 |
| Total Lesions n | | 718 | 159 | 144 | 133 |
| Total invasive lesions | | 592 | 134 | 122 | 111 |
| Age at Screening | < 60 | 309 | 61 (19.7%) | 44 (14.2%) | 49 (15.9%) |
| | >= 60 | 379 | 91 (24.0%) | 95 (25.1%) | 83 (21.9%) |
| | p value | - | 0.3208 | 0.0047 | 0.104 |
| FFDM Vendor | GE | 504 | 132 (26.2%) | 106 (21.0%) | 75 (14.9%) |
| | Philips | 184 | 20 (10.9%) | 33 (17.9%) | 57 (31.0%) |
| | p value | - | 0.0003 | 0.5262 | 0.0002 |
| Invasive Tumour Size (d) | < 15 mm | 316 | 67 (21.2%) | 54 (17.1%) | 39 (12.3%) |
| | >= 15 mm | 214 | 54 (25.2%) | 58 (27.1%) | 63 (29.4%) |
| | Missing | 62 | 13 (21.0%) | 10 (16.1%) | 9 (14.5%) |
| | p value | - | 0.667276 | 0.61173 | 0.00023 |
| Invasive Tumour Grade (d) | 1 | 157 | 42 (26.8%) | 33 (21.0%) | 26 (16.6%) |
| | 2 | 285 | 58 (20.4%) | 65 (22.8%) | 63 (22.1%) |
| | 3 | 110 | 29 (26.4%) | 19 (17.3%) | 19 (17.3%) |
| | Missing | 40 | 5 (12.5%) | 5 (12.5%) | 3 (7.5%) |
| | p value | - | 0.305361 | 0.53279 | 0.224504 |
| Time Interval (months) | Median [IQR] | 35.7 [35.0-39.2] | 35.4 [35.0-37.3] | 35.8 [35.0-39.3] | 35.6 [34.9-40.4] |

| Density BI-RADS | n (b) | n (a) | DL-1 (b) | DL-1 (a) | DL-2 (b) | DL-2 (a) | DL-3 (b) | DL-3 (a) |
|---|---|---|---|---|---|---|---|---|
| a | 27 | 100 | 5 (18.5%) | 16 (16.0%) | 3 (11.1%) | 17 (17.0%) | 7 (0.0%) | 18 (18.0%) |
| b | 95 | 401 | 6 (6.3%) | 92 (22.9%) | 20 (21.1%) | 82 (20.5%) | 29 (0.9%) | 72 (18.0%) |
| c | 47 | 186 | 6 (12.8%) | 44 (23.7%) | 7 (14.9%) | 40 (21.5%) | 14 (1.8%) | 42 (22.6%) |
| d | 15 | 1 | 3 (20.0%) | 0 (0.0%) | 3 (20.0%) | 0 (0.0%) | 7 (0.0%) | 0 (0.0%) |
| Missing | 504 | 0 | 132 (26.2%) | 0 (0.0%) | 106 (21.0%) | 0 (0.0%) | 95 (18.9%) | 0 (0.0%) |
| p value | - | - | 0.004912 | - | 0.78437 | - | 0.0821 | - |

Table 7-20 – Sub group analysis of DL-1, DL-2, DL-3 set at the 94.0% specificity threshold, using Scenario E for the next round cancers (NRCs) detected. BI-RADS: Breast imaging-reporting and data system, FFDM: Full Field Digital Mammography, GE: General Electric. (d) Lesions reported. (a) DL-3 BI-RADS density scores on processed full field digital mammograms. (b) Volpara 5th edition BI-RADS mammographic breast density from raw full field digital mammograms for Cambridge data. p values were determined using the Chi squared (χ²) test to compare the detected proportion of cancer cases against the true distribution for each cancer characteristic category.
p values < 0.05 were considered statistically significant.

DL-1 detected 88 ICs and 152 NRCs, DL-2 detected 90 ICs and 139 NRCs, and DL-3 detected 99 ICs and 132 NRCs. It is proposed that at this threshold, using the Scenario E approach, these cases could be recalled for supplemental imaging using a modality such as abbreviated MRI for earlier detection. There was a statistically significant difference for the ICs detected at the 94.0% specificity threshold for the age at screening (DL-1 and DL-2), FFDM vendor (DL-1), invasive tumour size (DL-3) and radiological classification (DL-1 and DL-3), Table 7-19. In addition, there was a statistically significant difference (p < 0.05) for the NRCs detected for age at screening (DL-2), FFDM vendor (DL-1 and DL-3) as well as invasive tumour size (DL-3) and Volpara mammographic breast density (DL-1), Table 7-20.

7.4.7 Failure analysis

Of the 9 (1.0%) SDC cases missed by each AI algorithm at threshold 1 (99.0% sensitivity), only one missed case was common to all algorithms, Figure 7-10.

Figure 7-10 – Venn diagram – not proportional, for screen detected cancers (SDCs) missed at threshold 1, Scenario B. For DL-1 in blue, DL-2 in purple, and DL-3 in green.

The SDC case that was missed by all AI algorithms was from a 63-year-old patient, diagnosed with a left sided grade 2 16 mm invasive cancer from Cambridge screening, Figure 7-11.

Figure 7-11 – Missing case analysis, case missed by artificial intelligence (AI). Screening image, with a blue bounding box to show the location of the cancer. The images were annotated by a breast radiologist to show the true location of the cancer.

Of the IC cases detected by each AI algorithm at the 94.0% specificity threshold, 34 cases were common to all algorithms, and for NRCs 57 cases were common to all algorithms, Figure 7-12.

Figure 7-12 – Venn diagram – not proportional, for a) interval cancers (ICs) and b) next round cancers (NRCs) detected at the 94.0% specificity threshold Scenario E. For DL-1 in blue, DL-2 in purple, and DL-3 in green.
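The overlap counts behind Venn diagrams of this kind can be derived with simple set operations on the case identifiers flagged by each algorithm. The sketch below is illustrative only: the function name and the case identifiers are hypothetical and are not taken from the study data.

```python
def overlap_summary(flagged_by_algorithm):
    """Return the cases flagged by every algorithm and the cases unique to each one."""
    sets = {name: set(cases) for name, cases in flagged_by_algorithm.items()}
    common = set.intersection(*sets.values())
    unique = {}
    for name, cases in sets.items():
        # Union of everything flagged by the other algorithms.
        others = set().union(*(s for other, s in sets.items() if other != name))
        unique[name] = cases - others
    return common, unique


# Hypothetical case identifiers, for illustration only.
flagged = {
    "DL-1": {"c01", "c02", "c03", "c07"},
    "DL-2": {"c02", "c03", "c04"},
    "DL-3": {"c02", "c03", "c05", "c07"},
}
common, unique = overlap_summary(flagged)
print(sorted(common))
print({name: sorted(cases) for name, cases in unique.items()})
```

The same function applies to missed cases (Scenario B) or detected cases (Scenario E), depending on which identifiers are passed in.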
7.5 Discussion

7.5.1 Overall performance

Implementing a CADt rule out workflow found that a large proportion of cases could be read by either no readers (Scenario B) or one reader (Scenario C) whilst missing between 0 and 34 (0.0-3.8%) SDCs. Simulating the effect on overall screening performance found that the specificity and sensitivity remained non-inferior to double reading performance at all thresholds. Similar results for rule out CADt were found in previously published papers, reporting that between 17.0% and 91.0% of cases could go unread by human readers whilst missing an estimated 0-7.0% of cancer cases [133]. However, many of these previous studies used enriched and small datasets [135,231]. The implementation of DL for CADt could have a positive impact on the efficiency of screening and help in places where there is a shortage of trained expert human readers. Furthermore, this reduction in workload could also help offset the increase in workload from the rule in triage approach used to improve earlier detection of cancers. However, the question still remains as to where an acceptable threshold should be set for rule out CADt applications.

At the highest specificity threshold (99.0%) for Scenario E, the auto recall of cases with a high suspicion not recalled by human readers, a small proportion of ICs (4.6-8.2%) and NRCs (5.2-6.1%) were recalled for supplemental imaging / assessment and could potentially be detected. However, this is a smaller proportion than that reported in Dembrower et al, where the cases with the highest 1.0% of scores included 12.0% of ICs and 14.0% of NRCs [134]. This is possibly due to the method used for threshold identification in our study, where we set each AI algorithm at a 94.0-99.0% specificity as opposed to taking the cases with the highest 1.0-5.0% of scores.
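The difference between the two threshold identification methods contrasted above can be sketched as follows. This is a minimal illustration under stated assumptions: a case is flagged when its score is strictly above the cut-off, the function names are hypothetical, and the scores are synthetic integers rather than study data.

```python
import math


def threshold_at_specificity(normal_scores, specificity):
    """Cut-off such that at least `specificity` of normal cases score at or below it."""
    ordered = sorted(normal_scores)
    # round() guards against floating point noise before taking the ceiling.
    index = math.ceil(round(specificity * len(ordered), 9)) - 1
    index = min(max(index, 0), len(ordered) - 1)
    return ordered[index]


def threshold_at_top_fraction(all_scores, fraction):
    """Cut-off such that at most `fraction` of all cases score above it."""
    ordered = sorted(all_scores, reverse=True)
    n_flagged = math.floor(round(fraction * len(ordered), 9))
    if n_flagged >= len(ordered):
        return float("-inf")
    return ordered[n_flagged]


# Synthetic scores, for illustration only.
normal_scores = list(range(100))      # 100 normal cases scoring 0..99
cancer_scores = [50, 80, 95, 97, 99]  # 5 cancer cases

spec_cutoff = threshold_at_specificity(normal_scores, 0.90)
top_cutoff = threshold_at_top_fraction(normal_scores + cancer_scores, 0.10)
print(spec_cutoff, top_cutoff)
```

Because the specificity-based cut-off is fixed by the normal cases alone, while the top-fraction cut-off depends on all scores including cancers, the two methods generally flag different numbers of cases, which may contribute to differences between studies.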
Further guidance is also required for this CADt approach, as to what modality or form of assessment would be best suited to detect these ‘occult’ cancers recalled by AI systems only, as well as what location prompting should be provided by the AI algorithms to the radiologists carrying out the additional review. MRI, with its increased sensitivity compared to mammography, could be offered for these mammographically ‘occult’ cancers.

During the running of this study two tools were updated. All the data reported in the study is from the same version of the updated algorithms. It is important in future work to monitor for changes in performance with these algorithm updates, and evaluation processes must be time efficient to account for these frequent changes.

Alternative CADt workflows are also possible, such as sending any suspicious cases back to the second reader only with the AI prompts for review, or even just using the AI algorithms to generate a smart worklist with cases prioritised in order of suspicion, so the most suspicious cases are read first when the readers are potentially most alert.

7.5.2 Further analysis

When using the combined rule out (Scenario C) and rule in (Scenario E) approach the sensitivity was found to be superior, with a trade off in specificity, for two out of the three algorithms. In Lauritzen et al, the sensitivity was non-inferior (p = 0.02) and the specificity was higher (p < 0.001) [296], although in their study it was proposed that normal cases were not read by human readers if the case reached the auto rule out threshold [296]. Keeping a reader in the loop in the first instance when deploying such an automated AI workflow would be beneficial for two reasons: 1) to provide human oversight of AI algorithm decisions, acting as a safety net, and 2) to build trust and knowledge regarding these systems among radiologists who have not trained with these systems.
Interestingly, only one case was missed by all AI algorithms at the set auto rule out threshold 1, demonstrating that these systems both detect and miss different cancers. Further work into the use of these systems together should be carried out to see if there is an added benefit.

7.5.3 Limitations

There are several limitations to this study. The data was from one region in the UK. The study was retrospective and so the impact on the reader can only be simulated and the true effect of reading a smaller proportion of cases cannot be evaluated. In addition, it is not definitive that the cancers flagged for supplemental imaging or assessment will be detected. Thresholds for this study were found on the study dataset and not an independent dataset, thus causing bias. All available data was used in the study test set to provide a sufficient sample size, thus there was no independent dataset, without overlap with the study cohort, from which to identify thresholds. A proportion of women aged 67-69 were excluded from this study as they did not have sufficient follow-up for the required ground truth. We used a strict ground truth definition for this study. However, in recent studies a sufficient follow-up time with a no cancer outcome for the case has been used, and is an alternative way of defining a case that would limit the loss of cases from the normal case ground truth.

7.6 Conclusion

CAD triage applications of the latest DL algorithms provide multiple workflow solutions. A large proportion of cases can be triaged out of double reading to either an automated decision of no recall or single reading, whilst missing an estimated 0.0-3.8% of SDCs. The potential efficiency benefit from automated rule out triage could offset the increase in recall rate from an automated rule in approach, which provides the opportunity to improve IC and NRC detection and thus the earlier detection of some cancers.
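The combined rule out and rule in triage summarised here can be sketched as a simple routing rule on the AI score. The sketch is illustrative only: the route labels, the function names, the handling of scores exactly at a threshold and the example scores and thresholds are all assumptions rather than the study implementation.

```python
from collections import Counter

SINGLE = "single reading"
DOUBLE = "double reading"
DOUBLE_AUTO_RECALL = "double reading, auto recall if not recalled by readers"


def route_case(score, rule_out_threshold, rule_in_threshold):
    """Assign a case to a reading pathway from its AI score (boundary handling is an assumption)."""
    if rule_out_threshold >= rule_in_threshold:
        raise ValueError("rule out threshold must be below rule in threshold")
    if score < rule_out_threshold:
        return SINGLE  # Scenario C style rule out to one reader
    if score >= rule_in_threshold:
        return DOUBLE_AUTO_RECALL  # Scenario E style rule in
    return DOUBLE


def workload_summary(scores, rule_out_threshold, rule_in_threshold):
    """Proportion of cases sent down each pathway."""
    counts = Counter(route_case(s, rule_out_threshold, rule_in_threshold) for s in scores)
    return {route: counts[route] / len(scores) for route in (SINGLE, DOUBLE, DOUBLE_AUTO_RECALL)}


# Hypothetical scores and thresholds, for illustration only.
summary = workload_summary([0.05, 0.10, 0.40, 0.60, 0.97], rule_out_threshold=0.20, rule_in_threshold=0.95)
print(summary)
```

A summary of this kind makes the trade-off explicit: the share of cases moved to single reading indicates the workload saved, while the share in the auto recall pathway bounds the additional recalls.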
Prospective studies implementing one or more of these workflows are required to further investigate performance and the effect on reader performance. It is important to evaluate the readers' acceptability of these thresholds as well as reader interaction with AI systems, as this is not possible to evaluate in simulated studies.

Chapter 8 – Contributions, Future Work and Conclusions

8.1 Contributions to knowledge

This thesis evaluated the use of artificial intelligence (AI) in breast cancer screening. The major contributions to knowledge from this thesis include a systematic review and meta-analysis of the stand-alone use of AI in breast cancer screening, the creation of a large curated mammographic imaging database (The Cambridge Cohort – Mammography East Anglia Digital Imaging Archive (CC-MEDIA)) which provides multiple representative years of data from two National Health Service Breast Screening Programme (NHSBSP) sites, a comparative analysis of three different AI algorithms for the early detection of interval cancers, a study investigating the use of three different AI algorithms as stand-alone screen readers, and an evaluation of three different AI algorithms for normal rule out and high suspicion rule in triage approaches in breast cancer screening.

The systematic review and meta-analysis presented in Chapter 3 highlighted the rapid increase in published literature over the past six years investigating the performance of the latest deep learning algorithms in breast cancer screening. Two key workflow applications were the focus of this review: stand-alone screen reading and triage. The performance of AI systems was comparable to that of human readers. Furthermore, a large proportion of cases could be triaged whilst missing a small proportion of cancers. However, this review also established there is a high level of bias due to the use of internal datasets and no pre-setting of the algorithm threshold.
In addition, the evidence was from a limited number of studies using small and enriched datasets. The gaps in evidence identified through this review were then addressed, and standard methodology within the field, such as ground truth classification, was applied in Chapters 5, 6 and 7.

The creation of the CC-MEDIA database, outlined in Chapter 4, highlights the governance and technical processes required to create a large medical imaging database. The patient and public involvement (PPI) work carried out during the database creation helped ensure transparent communication with patients about the use of their data in this research. This database has also been used in three separate research projects arising from this thesis by researchers at the University of Cambridge Radiology Department. In addition, the image extraction pipeline created, and lessons learned from deploying this method at both sites, have contributed to the ongoing development of an image extraction protocol from Cambridge University Hospitals NHS Foundation Trust PACS to the University of Cambridge Radiology Department.

Chapter 5 demonstrated that AI algorithms are able to detect breast cancer at an earlier time point using the screening mammogram. By comparing the three different algorithms on the same dataset this work identified that the interval cancers detected by each AI algorithm do differ. Fluctuations in performance were identified when translating pre-identified thresholds from other sites, thus the stability of algorithm performance when transferring between sites is an important measure and should be considered when evaluating algorithm performance. The feasibility of installing and running multiple AI systems, processing of data from the CC-MEDIA database, and methods for comparative analysis were established in this chapter and provided the basis from which to carry out the analysis in Chapters 6 and 7.
The comparative study for stand-alone screen reading presented in Chapter 6 adds to the growing body of literature for the use of AI as a stand-alone reader either entirely independent or in combination with a human reader in a double reading system. All three algorithms demonstrated non-inferiority at clinically relevant thresholds, thus reaching the required benchmark for prospective testing. All algorithms were generalisable to the NHSBSP even though less than 10% of training data was from the UK. Furthermore all algorithms were generalisable to both mammographic machine vendors (Philips and GE) included in the data, with no statistically significant difference in performance when comparing the two sites using different machines, despite all algorithms training on less than 1% Philips data. The importance of reporting metrics such as sensitivity, specificity and partial area under the receiver operating characteristic (pAUROC) curve alongside area under the receiver operating characteristic (AUROC) curve was also detailed to account for the class imbalance in screening as well as that the algorithms are operating at high specificity in screening, so as not to increase recall rates. Chapter 7 presents the first comparative study of AI algorithms for triage applications using the same external dataset. The rule out triage approach demonstrated that all algorithms could class a large proportion of cases as ‘normal’ whilst missing a very small proportion of screen detected cancers. This loss of screen detected cancers could be offset by the rule in triage approach for the earlier detection of cancers. The acceptable trade-off between these two approaches requires clarification in future work through discussion between breast radiologists and the national screening programme. 
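The pAUROC metric referred to above restricts the area under the ROC curve to a clinically relevant operating range, such as the 94.0-99.0% specificity range used for the rule in thresholds. The following sketch shows one way to compute it from scores and labels. It is an illustration under stated assumptions: the function names are hypothetical, the data are synthetic, and the McClish standardisation shown is a common convention that may not match how the pAUROC values in this thesis were scaled.

```python
def roc_points(scores, labels):
    """ROC curve as (false positive rate, true positive rate) points, handling tied scores."""
    pairs = sorted(zip(scores, labels), key=lambda p: -p[0])
    n_pos = sum(1 for _, label in pairs if label)
    n_neg = len(pairs) - n_pos
    points, tp, fp, i = [(0.0, 0.0)], 0, 0, 0
    while i < len(pairs):
        j = i
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            if pairs[j][1]:
                tp += 1
            else:
                fp += 1
            j += 1
        points.append((fp / n_neg, tp / n_pos))
        i = j
    return points


def partial_auc(points, fpr_low, fpr_high):
    """Trapezoidal area under the ROC curve between two false positive rates."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        lo, hi = max(x0, fpr_low), min(x1, fpr_high)
        if hi <= lo:
            continue  # segment outside the range, or a vertical segment
        slope = (y1 - y0) / (x1 - x0)
        area += 0.5 * ((y0 + slope * (lo - x0)) + (y0 + slope * (hi - x0))) * (hi - lo)
    return area


def standardised_partial_auc(area, fpr_low, fpr_high):
    """McClish standardisation: 0.5 for a chance classifier, 1.0 for a perfect one."""
    min_area = 0.5 * (fpr_high ** 2 - fpr_low ** 2)
    max_area = fpr_high - fpr_low
    return 0.5 * (1.0 + (area - min_area) / (max_area - min_area))


# Synthetic, perfectly separated scores: 94.0-99.0% specificity is FPR 0.01-0.06.
points = roc_points([0.9, 0.8, 0.4, 0.3, 0.2, 0.1], [1, 1, 0, 0, 0, 0])
pauc = standardised_partial_auc(partial_auc(points, 0.01, 0.06), 0.01, 0.06)
print(round(pauc, 3))
```

Restricting the range in this way means an algorithm is only credited for discrimination at the specificities at which a screening programme would actually operate.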
8.2 Future work

8.2.1 AI in the NHS

Whilst there have been reports, briefings, and proposals for the adoption of AI into the NHS, there is no established pathway for the approval of AI algorithms to be used in the NHS [297–299]. The National Screening Committee (NSC) report published in 2021 concluded that “the current evidence is a long way from the quality and quantity required for implementation into clinical practice” and so does not support the implementation of AI into the NHSBSP [136]. Thus no algorithms are currently being used in the programme. The report detailed that further evidence is required from both retrospective and prospective studies.

Overall the aim of the studies in Chapters 5, 6 and 7 was to address the gaps in evidence highlighted in the 2021 NSC report. The results in these three chapters demonstrate the generalisability and acceptable performance of all three AI algorithms using NHSBSP data. However, as these studies were retrospective and required simulation, the performance can only be estimated. This emphasises the need for prospective studies deploying the various workflow approaches applied in these chapters, with sufficient follow-up time to account for the earlier cancer detection benefit that could be obtained from these systems.

8.2.2 Retrospective studies

Retrospective studies allow for the faster review of multiple AI algorithms for multiple different workflow applications at clinically relevant thresholds. The speed of these studies is important because AI systems are continuously updated and there are now more than fourteen different algorithms approved by the Food and Drug Administration (FDA) for mammographic screening applications [300].
In order to improve the generalisability of results in this thesis the 127,000 case CC-MEDIA database could be used in collaboration with other databases, such as The Optimam Mammography Image Database (OMI-DB), or additional NHSBSP sites could be added to the database in order to provide a national test set that is more representative of the seventy-five NHSBSP sites. Furthermore, testing across continents using the large medical imaging databases that have been established over the past ten years, detailed in Chapter 4 Table 4-1, could allow for broader generalisability testing. This geographical expansion is also important to include a more diverse screening population to investigate for bias in algorithms. The recording of ethnicity and socioeconomic information is an important aspect of this work, however as shown in Chapter 4 Section 4.4.4 this information is often not available. The inclusion of additional sites and databases also provides wider coverage of mammographic manufacturers (e.g. Hologic and Siemens) and screening programmes (e.g. single reader or biennial round length) not evaluated in this thesis.

Retrospective testing could play a role in the future benchmarking of AI algorithms to a set programme performance standard for an already approved algorithm application that has proceeded through prospective testing, as it is not feasible for all algorithms to be tested prospectively for all applications.

8.2.3 Prospective studies

Prospective studies allow for the assessment of the impact AI systems have on reader performance, the overall effect on programme performance, and the acceptability of an AI adapted workflow approach to readers. Prospective studies have been funded in the UK to evaluate both the Kheiron system at up to fifteen NHSBSP sites and the Google system at three NHS sites, although for the Google study the “AI system would not be used in patient care during the study” [301–303].
Other prospective studies around the world are taking place in Spain, South Korea, Sweden, Norway, China and Russia, testing various deployment approaches and algorithms [238,239,304–308]. Ideally these prospective studies should be randomised to provide the highest level of evidence. In Denmark, Transpara has already been incorporated into routine screen reading to help with the Covid-19 pandemic screening backlog, where the system will be used for a triage application with low scoring cases read by one reader and high scoring cases continuing to double reading, like the approach shown in Chapter 7 Scenario C of this thesis [309].

The feasibility of implementing AI into the NHSBSP is also important to consider. In this thesis all algorithms were hosted via on premises hardware in a bespoke research environment specifically designed to carry out this work. Installing and maintaining such systems within the NHS is an important point to consider as technical expertise and technical infrastructure vary between sites. It has been acknowledged by the NHS that AI systems could be hosted in one of the two approved cloud providers (Microsoft Azure or Amazon Web Services), which could facilitate a more centralised oversight of these systems at each site [310]. Furthermore, the recording of AI outputs in NBSS has not yet been tested. Extracting data from NBSS to create the CC-MEDIA database required the development of unique Crystal Report queries and was a complex process which depended on expertise in this field. Lastly, the NHS Trust information governance procedures that have to be satisfied in order to deploy a new system within the NHS firewall can be extensive and take a long time for approval, which should be factored into the planning of any study.
As part of our work we have gone through the local Trust governance approvals for one out of the three algorithms included in this thesis, both to evaluate the feasibility of this sign off process and for the initial setup of prospective work.

As outlined in Chapter 2, the factors to consider when implementing AI into the clinical workflow extend beyond the technical requirements, as the ethical and legal implications should also be clarified. The Royal College of Radiologists have incorporated AI training into the updated curriculum for trainee radiologists [311]. But it is important to establish the type of training that existing radiologists should undertake before using these systems. Alongside this thesis Professor Gilbert and I have developed an online teaching module for the National Breast Imaging Academy titled “Computer-Aided Detection (CAD) and Artificial Intelligence (AI)” to provide an overview of AI in breast cancer screening for healthcare professionals.

Questions that should be addressed in future prospective studies include:
• What is an acceptable performance for an AI system to achieve for the workflow approach?
• What should be done when there is a disagreement between an AI algorithm and human reader in each type of workflow deployment?
  o It is likely that these cases should proceed to arbitration for further review using the prompts provided by the AI system.
• What percentage of additional cases is it both feasible and acceptable to triage to supplemental imaging for the earlier detection of cancers?
• Does each algorithm have to be tested prospectively before deployment for each workflow application?
• What is the cost effectiveness of using the AI systems for a specific workflow application?
• Should consent be obtained from all women whose mammograms are read by AI systems?
  o This is unclear as the aim of the systems is to provide standard of care if not improve detection. However, as shown by the PPI work undertaken in this thesis, clear communication with the women participating in screening is required and centralised clear communication at point of invite would be most appropriate.
• Where does the legal responsibility lie when using systems as stand-alone readers?

8.2.4 Future work - AI research questions

The CC-MEDIA database described in this thesis is approved for the collection of data from 2011 until 2020 at two NHSBSP sites. Thus, the database will continue to be expanded in order to include prior screening rounds for patients, which will allow for the evaluation of AI systems that incorporate the prior image, potentially improving AI performance. Other areas of interest for future work include the use of AI systems together, either via an ensemble method, as explored in this thesis and in Schaffter et al, or in tandem to obtain the benefit from the difference in cancers detected by each system [137].

Whilst this thesis explores the main applications of AI in breast cancer screening there are remaining gaps not addressed in this thesis which were highlighted in the NSC report [136]. These include evaluating the performance for subgroup populations, including cases with breast implants or a previous cancer. Similarly, the impact on performance of artefacts, more than four views, and non-standard views should be investigated.

Future work should also incorporate cost effectiveness analysis for the various AI screening approaches explored in this thesis. As screening is a balance between early detection and feasibility, cost effectiveness is an important aspect in the evaluation pipeline.

The studies in this thesis focused on the performance of stand-alone algorithms and so did not investigate the accuracy of prompt locations provided by the AI algorithms in addition to the continuous output score.
As discussed in Chapter 5, studies such as Lång et al have further investigated the accuracy of these prompts to establish the likelihood that an occult cancer could be found if the AI system provided this additional guidance to the radiologists [270]. A researcher at the University of Cambridge is currently using the outputs from the studies in this thesis to establish the accuracy of the prompts provided by the AI algorithms.

The Breast Screening – Risk Adaptive Imaging for Density (BRAID) study is a large prospective multi-centre trial, investigating the use of supplemental imaging for women with Breast Imaging-Reporting and Data System (BI-RADS) classified C and D mammographic breast density [94]. The CC-MEDIA database is being used by researchers as part of this study to investigate the use of risk prediction and mammographic breast density algorithms. These studies aim to establish the best threshold to guide the use of supplemental imaging. In the BRAID study patients also complete the BOADICEA risk questionnaire, designed by researchers at the University of Cambridge [97]. The risk information from the BOADICEA risk questionnaire will, where possible, also be extracted from the CC-MEDIA database to be used in combination with risk prediction and mammographic breast density algorithms. The inclusion of prior screening rounds in the database allows for the assessment of change in mammographic breast density over time. In addition, the long-term follow-up information also included allows for the calculation of five-year risk.

It has been proposed that the CC-MEDIA database will then be used to build new risk prediction tools. An application has been submitted to the CC-MEDIA Database Access Committee for a team at the University of Cambridge Maths Department to receive secure access to the database in order to facilitate the development of new algorithms.

Digital Breast Tomosynthesis (DBT) is already used in screening in the USA.
In the UK a large multi-centre prospective trial, the Prospective Trial of DBT in Breast Cancer Screening (PROSPECTS), is currently underway to establish the added benefit of DBT in the NHSBSP312. Numerous publications have shown improved reading times and good AI algorithm performance with DBT313–315. The studies performed as part of this thesis should be replicated using DBT data from the UK screening programme if DBT is implemented in the NHSBSP in the future.

8.3 Conclusions

1. Stand-alone AI algorithms achieve performance similar to that of human readers, although the evidence comes from a small number of studies, many of which used small, enriched retrospective cohorts, leading to high rates of bias in the reported studies.
2. Development of a medical imaging database requires extensive ethical approvals, patient and public involvement and governance procedures, as well as technical expertise.
3. AI algorithms are able to detect interval cancers at the previous screening timepoint.
4. In this UK dataset, the three AI algorithms tested achieved non-inferior performance compared to the single first human reader at both screening sites when used as a stand-alone reader. In addition, when combined with the single first human reader, all AI algorithms achieved non-inferior performance compared to double reading.
5. Each AI algorithm detected different interval and next round cancers.
6. A high proportion of cases (35.0%-68.9%) can be ruled out as ‘normal’ or assigned for single reading only by the AI systems whilst missing a small proportion (0.0%-3.8%) of screen detected cancers.
7. Up to 20% of interval and next round cancers can be detected at a high specificity threshold; these cases could be recalled for assessment and supplemental imaging.
8. A combined approach using both rule in and rule out triage led to superior sensitivity with a trade-off in specificity. A lower arbitration rate and a higher recall rate were observed.
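The rule out and rule in triage summarised in conclusions 6 to 8 can be illustrated with a minimal sketch. It assumes each case carries a single continuous AI score in the range 0 to 1 and one cancer flag; the thresholds, field names and toy cases below are hypothetical and are not the values or data used in this thesis. The sketch also does not separate screen detected cancers from interval and next round cancers, as the studies do.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Case:
    ai_score: float   # continuous AI output; assumed here to lie in [0, 1]
    has_cancer: bool  # outcome known from follow-up (illustrative field)


def triage(ai_score: float, rule_out: float, rule_in: float) -> str:
    """Assign a case to one of three pathways from its AI score."""
    if ai_score < rule_out:
        return "rule_out"    # treated as normal or sent for single reading
    if ai_score >= rule_in:
        return "rule_in"     # flagged for recall or supplemental imaging
    return "human_read"      # standard reading pathway


def _fraction(numerator: int, denominator: int) -> float:
    """Safe proportion that returns 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def summarise(cases: List[Case], rule_out: float, rule_in: float) -> Dict[str, float]:
    """Report workload and cancer proportions for one pair of thresholds."""
    if rule_out > rule_in:
        raise ValueError("rule_out threshold must not exceed rule_in threshold")
    pathways = [triage(c.ai_score, rule_out, rule_in) for c in cases]
    pairs = list(zip(cases, pathways))
    n_cancers = sum(c.has_cancer for c in cases)
    return {
        "ruled_out_fraction": _fraction(
            sum(p == "rule_out" for p in pathways), len(cases)),
        "ruled_in_fraction": _fraction(
            sum(p == "rule_in" for p in pathways), len(cases)),
        "cancers_missed_fraction": _fraction(
            sum(c.has_cancer and p == "rule_out" for c, p in pairs), n_cancers),
        "cancers_ruled_in_fraction": _fraction(
            sum(c.has_cancer and p == "rule_in" for c, p in pairs), n_cancers),
    }


# Toy data only: ten invented cases, three of which are cancers.
toy_cases = [Case(score, cancer) for score, cancer in [
    (0.05, False), (0.10, True), (0.15, False), (0.20, False), (0.30, False),
    (0.40, False), (0.60, False), (0.70, True), (0.92, False), (0.97, True),
]]

if __name__ == "__main__":
    # Hypothetical thresholds chosen purely for illustration.
    print(summarise(toy_cases, rule_out=0.25, rule_in=0.90))
```

With these invented thresholds, four of the ten toy cases fall below the rule out threshold, including one of the three cancers. Raising the rule out threshold removes more cases from the reading workload at the cost of missing more cancers, which is the trade-off that conclusions 6 and 8 describe.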
My proposal for the future of AI in clinical practice is that AI will not replace the vital role of radiologists; rather, it will enhance the early detection of cancer.

References

1. World Health Organization. Breast Cancer. /web/20220321085732/https://www.who.int/news-room/fact-sheets/detail/breast-cancer (2021).
2. Office for National Statistics. Cancer registration statistics, England: 2016. /web/20220321091917/https://www.ons.gov.uk/peoplepopulationandcommunity/healthandsocialcare/conditionsanddiseases/bulletins/cancerregistrationstatisticsengland/final2016 (2018).
3. Harbeck, N. et al. Breast cancer. Nature Reviews Disease Primers vol. 5 (2019).
4. Hu, K. et al. Global patterns and trends in the breast cancer incidence and mortality according to sociodemographic indices: An observational study based on the global burden of diseases. BMJ Open 9, 1–8 (2019).
5. World Health Organization. Cancer. /web/20220321094701/https://www.who.int/news-room/fact-sheets/detail/cancer (2022).
6. Feng, Y. et al. Breast cancer development and progression: Risk factors, cancer stem cells, signaling pathways, genomics, and molecular pathogenesis. Genes Dis. 5, 77–106 (2018).
7. Loibl, S., Poortmans, P., Morrow, M., Denkert, C. & Curigliano, G. Breast cancer. Lancet 397, 1750–1769 (2021).
8. Viale, G. The current state of breast cancer classification. Ann. Oncol. 23, (2012).
9. Berry, D. A. et al. Effect of screening and adjuvant therapy on mortality from breast cancer. NEJM 353, 1784–92 (2005).
10. Cancer Research UK. Breast cancer survival statistics. https://web.archive.org/web/20220207071809/https://www.cancerresearchuk.org/health-professional/cancer-statistics/statistics-by-cancer-type/breast-cancer/survival.
11. Office for National Statistics. Cancer survival in England - adults diagnosed.
/web/20220321100459/https://www.ons.gov.uk/peoplepopulationandcommunity/healthandsocialcare/conditionsanddiseases/datasets/cancersurvivalratescancersurvivalinenglandadultsdiagnosed (2019).
12. Berry, D. A., Cronin, K. A. & Plevritis, S. K. Influence of tumour stage at breast cancer detection on survival in modern times: Population based study in 173 797 patients. BMJ 351, (2015).
13. Kalager, M. et al. Improved breast cancer survival following introduction of an organized mammography screening program among both screened and unscreened women: A population-based cohort study. Breast Cancer Res. 11, 1–9 (2009).
14. Tsang, J. Y. S. & Tse, G. M. Molecular Classification of Breast Cancer. Adv. Anat. Pathol. 27, 27–35 (2020).
15. McCart Reed, A. E., Kalinowski, L., Simpson, P. T. & Lakhani, S. R. Invasive lobular carcinoma of the breast: the increasing importance of this special subtype. Breast Cancer Res. 23, 1–16 (2021).
16. Wilson, N., Ironside, A., Diana, A. & Oikonomidou, O. Lobular Breast Cancer: A Review. Front. Oncol. 10, 1–13 (2021).
17. Rakha, E. A. et al. Breast cancer prognostic classification in the molecular era: The role of histological grade. Breast Cancer Res. 12, (2010).
18. Tan, P. H. et al. The 2019 World Health Organization classification of tumours of the breast. Histopathology 77, 181–185 (2020).
19. World Health Organization. WHO Classification of Tumours, 5th Edition, Volume 2: Breast Tumours. (2019).
20. Sinn, H. P. & Kreipe, H. A brief overview of the WHO classification of breast tumors, 4th edition, focusing on issues and updates from the 3rd edition. Breast Care 8, 149–154 (2013).
21. Elston, C. W. & Ellis, I. O. Pathological prognostic factors in breast cancer. I. The value of histological grade in breast cancer: Experience from a large study with long-term follow-up. Histopathology 19, 403–410 (1991).
22. IO Ellis et al.
Pathology reporting of breast disease in surgical excision specimens incorporating the dataset for histological reporting of breast cancer. The Royal College of Pathologists https://www.rcpath.org/uploads/assets/7763be1c-d330-40e8-95d08f955752792a/G148_BreastDataset-hires-Jun16.pdf (2016).
23. Dai, X. et al. Breast cancer intrinsic subtype classification, clinical use and future trends. Am. J. Cancer Res. 5, 2929–2943 (2015).
24. Senkus, E. et al. Primary breast cancer: ESMO Clinical Practice Guidelines for diagnosis, treatment and follow-up. Ann. Oncol. 26, 8–30 (2015).
25. Harbeck, N., Thomssen, C. & Gnant, M. St. Gallen 2013: Brief preliminary summary of the consensus discussion. Breast Care 8, 102–109 (2013).
26. Falck, A. K., Fernö, M., Bendahl, P. O. & Rydén, L. St Gallen molecular subtypes in primary breast cancer and matched lymph node metastases - aspects on distribution and prognosis for patients with luminal A tumours: Results from a prospective randomised trial. BMC Cancer 13, 1–10 (2013).
27. Kalli, S. et al. American joint committee on cancer’s staging system for breast cancer, eighth edition: What the radiologist needs to know. Radiographics 38, 1921–1933 (2018).
28. Koh, J. & Kim, M. J. Introduction of a new staging system of breast cancer for radiologists: An emphasis on the prognostic stage. Korean J. Radiol. 20, 69–82 (2019).
29. Giuliano, A. E. et al. Breast Cancer-Major changes in the American Joint Committee on Cancer eighth edition cancer staging manual. CA. Cancer J. Clin. 67, 290–303 (2017).
30. Curtis, C. et al. The genomic and transcriptomic architecture of 2,000 breast tumours reveals novel subgroups. Nature 486, 346–352 (2012).
31. Wilson, J. M. G., Jungner, G. & World Health Organization. Principles and practice of screening for disease. /web/20220321153328/https://apps.who.int/iris/handle/10665/37650 (1968).
32. World Health Organization. Screening Programmes: A short guide. WHO Press vol.
1 https://apps.who.int/iris/bitstream/handle/10665/330829/9789289054782-eng.pdf (2020).
33. Public Health England. Consolidated Standards for NHS Breast Screening Programme. https://www.gov.uk/government/publications/breast-screening-consolidated-programme-standards/nhs-breast-screening-programme-screening-standards-valid-for-data-collected-from-1-april-2017 (2017).
34. Birnbaum, J. K., Duggan, C., Anderson, B. O. & Etzioni, R. Early detection and treatment strategies for breast cancer in low-income and upper middle-income countries: a modelling study. Lancet Glob. Heal. 6, 885–893 (2018).
35. Duffy, S. W., Chen, T. H.-H., Smith, R. A., Yen, A. M.-F. & Tabar, L. Real and artificial controversies in breast cancer screening. Breast Cancer Manag. 2, 519–528 (2013).
36. Marmot, M. G. et al. The benefits and harms of breast cancer screening: An independent review. Br. J. Cancer 108, 2205–2240 (2013).
37. Duffy, S. W. et al. Mammography screening reduces rates of advanced and fatal breast cancers: Results in 549,091 women. Cancer 126, 2971–2979 (2020).
38. Gilbert, F. J. et al. Opportunities in cancer imaging: risk-adapted breast imaging in screening. Clin. Radiol. 76, 763–773 (2021).
39. Clift, A. K. et al. The current status of risk-stratified breast screening. Br. J. Cancer 126, 533–550 (2022).
40. Forrest P. Breast cancer screening. Report to the Health Ministers of England Wales Scotland and N Ireland by a working group chaired by Professor Sir Patrick Forrest. HMSO. https://webarchive.nationalarchives.gov.uk/ukgwa/20150506221529/http://www.cancerscreening.nhs.uk//breastscreen/publications/forrest-report.html (1986).
41. Advisory Committee on Breast Cancer Screening. Screening for breast cancer in England: Past and future. J. Med. Screen. 13, 59–61 (2006).
42. Raftery, J. & Chorozoglou, M. Possible net harms of breast cancer screening: Updated modelling of Forrest report. BMJ 344, 1–8 (2012).
43. Public Health England.
Achieving and maintaining the 36 month round length. /web/20220321160434/https://www.gov.uk/government/publications/breast-screening-set-and-maintain-round-length/achieving-and-maintaining-the-36-month-round-length-aug19 (2019).
44. European Commission. Screening ages and frequencies. https://healthcare-quality.jrc.ec.europa.eu/european-breast-cancer-guidelines/screening-ages-and-frequencies (2022).
45. Schünemann, H. J. et al. Breast cancer screening and diagnosis: A synopsis of the European breast guidelines. Ann. Intern. Med. 172, 46–56 (2020).
46. Monticciolo, D. L. et al. Breast Cancer Screening Recommendations Inclusive of All Women at Average Risk: Update from the ACR and Society of Breast Imaging. J. Am. Coll. Radiol. 18, 1280–1288 (2021).
47. Lagerlund, M., Åkesson, A. & Zackrisson, S. Population-based mammography screening attendance in Sweden 2017–2018: A cross-sectional register study to assess the impact of sociodemographic factors. Breast 59, 16–26 (2021).
48. National Institute for Public Health and the Environment. Breast Cancer Screening Programme. https://www.rivm.nl/en/breast-cancer-screening-programme (2022).
49. Cancer Registry of Norway. BreastScreen Norway. https://www.kreftregisteret.no/en/screening/BreastScreen_Norway/breastscreen-norway/ (2021).
50. Australian Government Department of Health. BreastScreen Australia Program. https://www.health.gov.au/initiatives-and-programs/breastscreen-australia-program (2022).
51. National Health Commission of the People’s Republic of China. Chinese guidelines for diagnosis and treatment of breast cancer 2018 (English version). Chinese J. Cancer Res. 31, 259–277 (2019).
52. US Preventive Services Task Force. Breast Cancer: Screening. /web/20220321172647/https://uspreventiveservicestaskforce.org/uspstf/recommendation/breast-cancer-screening (2016).
53. Canadian Task Force on Preventive Health Care. Breast Cancer Update (2018).
https://canadiantaskforce.ca/guidelines/published-guidelines/breast-cancer-update/ (2018).
54. NHS Digital. Breast screening programme. England 2018-19. /web/20220321174209/https://digital.nhs.uk/data-and-information/publications/statistical/breast-screening-programme/england---2018-19 (2020).
55. Taylor-Phillips, S. & Stinton, C. Double reading in breast cancer screening: Considerations for policy-making. Br. J. Radiol. 93, (2020).
56. Gale, A. G. PERFORMS - a self assessment scheme for radiologists in breast screening. (2019).
57. Taylor-Phillips, S. et al. Double reading in breast cancer screening: Cohort evaluation in the CO-OPS trial. Radiology 287, 749–757 (2018).
58. The Royal College of Radiologists. Clinical Radiology UK Workforce Census 2020 Report. (2021).
59. National Institute for Health and Care Excellence. Familial breast cancer: classification, care and managing breast cancer and related risks in people with a family history of breast cancer. https://www.nice.org.uk/guidance/cg164 (2019).
60. Radiology Café. Production of X-rays. https://www.radiologycafe.com/frcr-physics-notes/x-ray-imaging/production-of-x-rays/ (2021).
61. Radiopaedia. Bremsstrahlung radiation. https://radiopaedia.org/articles/bremsstrahlung-radiation?lang=gb (2022).
62. Public Health England. National Diagnostic Reference Levels (NDRLs) from 19 August 2019. https://www.gov.uk/government/publications/diagnostic-radiology-national-diagnostic-reference-levels-ndrls/ndrl (2019).
63. Public Health England. NHS Breast Screening Programme Guidance for breast screening mammographers Third edition. (2017).
64. The Royal College of Radiologists. Guidance on screening and symptomatic breast imaging: Fourth edition. (2019).
65. Winkler, N. S., Raza, S., Mackesy, M. & Birdwell, R. L. Breast density: Clinical implications and assessment methods. Radiographics 35, 316–324 (2015).
66. Harvey, J. A. & Bovbjerg, V. E.
Quantitative Assessment of Mammographic Breast Density: Relationship with Breast Cancer Risk. Radiology 230, 29–41 (2004).
67. Lian, J. & Li, K. A Review of Breast Density Implications and Breast Cancer Screening. Clin. Breast Cancer 20, 283–290 (2020).
68. Yaffe, M. J. Mammographic density. Measurement of mammographic density. Breast Cancer Res. 10, 1–10 (2008).
69. Sprague, B. L. et al. Variation in Mammographic Breast Density Assessments among Radiologists in Clinical Practice: A Multicenter Observational Study. Ann. Intern. Med. 165, 457–464 (2016).
70. D’Orsi C, Sickles EA, M. E. M. Breast Imaging Reporting and Data System: ACR BI-RADS breast imaging atlas. 5th ed. Reston, Va: American College of Radiology. (2013).
71. Fowler, E. E., Sellers, T. A., Lu, B. & Heine, J. J. Breast Imaging Reporting and Data System (BI-RADS) breast composition descriptors: Automated measurement development for full field digital mammography. Med. Phys. 40, 1–9 (2013).
72. Alomaim, W. et al. Variability of Breast Density Classification Between US and UK Radiologists. J. Med. Imaging Radiat. Sci. 50, 53–61 (2019).
73. Ciatto, S. et al. Categorizing breast mammographic density: Intra- and interobserver reproducibility of BI-RADS density categories. Breast 14, 269–275 (2005).
74. Lehman, C. D. et al. National performance benchmarks for modern screening digital mammography: Update from the Breast Cancer Surveillance Consortium. Radiology 283, 49–58 (2017).
75. Sprague, B. L. et al. Prevalence of mammographically dense breasts in the United States. J. Natl. Cancer Inst. 106, (2014).
76. Alonzo-Proulx, O., Mawdsley, G. E., Patrie, J. T., Yaffe, M. J. & Harvey, J. A. Reliability of automated breast density measurements. Radiology 275, 366–376 (2015).
77. Vinnicombe, S. J. Breast density: why all the fuss? Clin. Radiol. 73, 334–357 (2018).
78. Brandt, K. R. et al. Measurements: Implications for risk prediction and supplemental screening. Radiology 279, 710–719 (2016).
79. Astley, S.
M. et al. A comparison of five methods of measuring mammographic density: A case-control study. Breast Cancer Res. 20, 1–13 (2018).
80. Matthews, T. P. et al. A multisite study of a breast density deep learning model for full-field digital mammography and synthetic mammography. Radiol. Artif. Intell. 3, (2021).
81. Wu, N. et al. Breast density classification with deep convolutional neural networks. Arxiv [Preprint] (2017) doi:10.1109/ICASSP.2018.8462671.
82. Lehman, C. D. et al. Mammographic breast density assessment using deep learning: Clinical implementation. Radiology 290, 52–58 (2019).
83. Yala, A., Lehman, C., Schuster, T., Portnoi, T. & Barzilay, R. A deep learning mammography-based model for improved breast cancer risk prediction. Radiology 292, 60–66 (2019).
84. Dontchos, B. N., Yala, A., Barzilay, R., Xiang, J. & Lehman, C. D. External Validation of a Deep Learning Model for Predicting Mammographic Breast Density in Routine Clinical Practice. Acad. Radiol. 28, 475–480 (2021).
85. Destounis, S. et al. Using volumetric breast density to quantify the potential masking risk of mammographic density. Am. J. Roentgenol. 208, 222–227 (2017).
86. Destounis, S., Arieno, A., Morgan, R., Roberts, C. & Chan, A. Qualitative Versus Quantitative Mammographic Breast Density Assessment: Applications for the US and Abroad. Diagnostics 7, 30 (2017).
87. McCormack, V. A. & Dos Santos Silva, I. Breast density and parenchymal patterns as markers of breast cancer risk: A meta-analysis. Cancer Epidemiol. Biomarkers Prev. 15, 1159–1169 (2006).
88. Boyd, N. F., Martin, L. J., Yaffe, M. J. & Minkin, S. Mammographic density and breast cancer risk: Current understanding and future prospects. Breast Cancer Res. 13, 1–12 (2011).
89. Boyd, N. F. et al. Mammographic signs as risk factors for breast cancer. Br. J. Cancer 185 (1982).
90. Melnikow, J. et al. Supplemental screening for breast cancer in women with dense breasts: A systematic review for the U.S.
Preventive services task force. Ann. Intern. Med. 164, 268–278 (2016).
91. Miles, R. C., Lehman, C., Warner, E., Tuttle, A. & Saksena, M. Patient-Reported Breast Density Awareness and Knowledge after Breast Density Legislation Passage. Acad. Radiol. 26, 726–731 (2019).
92. Mann, R. M. et al. Breast cancer screening in women with extremely dense breasts recommendations of the European Society of Breast Imaging (EUSOBI). Eur. Radiol. 4036–4045 (2022) doi:10.1007/s00330-022-08617-6.
93. Yala, A. et al. Multi-Institutional Validation of a Mammography-Based Breast Cancer Risk Model. J. Clin. Oncol. 8–10 (2021) doi:10.1200/jco.21.01337.
94. ClinicalTrials.gov. Breast Screening - Risk Adaptive Imaging for Density (BRAID). /web/20220322123321/https://clinicaltrials.gov/ct2/show/NCT04097366 (2020).
95. Destounis, S. V., Santacroce, A. & Arieno, A. Update on breast density, risk estimation, and supplemental screening. Am. J. Roentgenol. 214, 296–305 (2020).
96. Brentnall, A. R. et al. A Case-Control Study to Add Volumetric or Clinical Mammographic Density into the Tyrer-Cuzick Breast Cancer Risk Model. J. Breast Imaging 1, 99–106 (2019).
97. Lee, A. et al. BOADICEA: a comprehensive breast cancer risk prediction model incorporating genetic and nongenetic risk factors. Genet. Med. 21, 1708–1718 (2019).
98. Pal Choudhury, P. et al. Comparative validation of the BOADICEA and Tyrer-Cuzick breast cancer risk models incorporating classical risk factors and polygenic risk in a population-based prospective cohort of women of European ancestry. Breast Cancer Res. 23, 1–5 (2021).
99. Van Veen, E. M. et al. Use of single-nucleotide polymorphisms and mammographic density plus classic risk factors for breast cancer risk prediction. JAMA Oncol. 4, 476–482 (2018).
100. Bennett, R. L., Sellars, S. J. & Moss, S. M. Interval cancers in the NHS breast cancer screening programme in England, Wales and Northern Ireland. Br. J. Cancer 104, 571–577 (2011).
101. Public Health England.
NHS Breast Screening Programme Reporting, classification and monitoring of interval cancers and cancers following previous assessment. https://assets.publishing.service.gov.uk/government/uploads/system/uploads/attachment_data/file/801400/Guidance_on_Interval_cancers_Final.pdf (2017).
102. MacInnes, E. G. et al. Radiological audit of interval breast cancers: Estimation of tumour growth rates. Breast 51, 114–119 (2020).
103. Cornford, E. & Sharma, N. Interval Cancers and Duty of Candour, a UK Perspective. Curr. Breast Cancer Rep. 11, 89–93 (2019).
104. Kerlikowske, K. et al. Identifying women with dense breasts at high risk for interval cancer a cohort study. Ann. Intern. Med. 162, 673–681 (2015).
105. Wanders, J. O. P. et al. Volumetric breast density affects performance of digital screening mammography. Breast Cancer Res. Treat. 162, 95–103 (2017).
106. Wanders, J. O. P. et al. The effect of volumetric breast density on the risk of screen-detected and interval breast cancers: A cohort study. Breast Cancer Res. 19, 1–13 (2017).
107. Turing, A. M. Computing machinery and intelligence. MIND LIX, 433–460 (1950).
108. Moor, J. The Dartmouth College Artificial Intelligence Conference: The next fifty years. AI Mag. 27, 87–91 (2006).
109. van Leeuwen, K. G., Schalekamp, S., Rutten, M. J. C. M., van Ginneken, B. & de Rooij, M. Artificial intelligence in radiology: 100 commercially available products and their scientific evidence. Eur. Radiol. 31, 3797–3804 (2021).
110. International Organization for Standardization [ISO]. ISO/IEC TR 24028:2020(en) Information technology — Artificial intelligence — Overview of trustworthiness in artificial intelligence. /web/20220323154646/https://www.iso.org/obp/ui/ (2020).
111. Chartrand, G. et al. Deep Learning: A Primer for Radiologists. Radiographics 37, 2113–2131 (2017).
112. Le, E. P. V., Wang, Y., Huang, Y., Hickman, S. & Gilbert, F. J. Artificial intelligence in breast imaging. Clin. Radiol. 74, 357–366 (2019).
113. Russakovsky, O.
et al. ImageNet Large Scale Visual Recognition Challenge. Int. J. Comput. Vis. 115, 211–252 (2015).
114. Cheng, P. M. et al. Deep learning: An update for radiologists. Radiographics 41, 1427–1445 (2021).
115. NHS England and NHS Improvement. Diagnostic Imaging Dataset Statistical Release. NHS England vol. 1 https://www.england.nhs.uk/statistics/wp-content/uploads/sites/2/2020/01/Provisional-Monthly-Diagnostic-Imaging-Dataset-Statistics-2020-01-23.pdf (2020).
116. Geras, K. J. et al. High-Resolution Breast Cancer Screening with Multi-View Deep Convolutional Neural Networks. Arxiv [Preprint] 1–9 (2017).
117. Shen, L. et al. Deep Learning to Improve Breast Cancer Detection on Screening Mammography. Sci. Rep. 9, 1–12 (2019).
118. Krizhevsky, A., Sutskever, I. & Hinton, G. E. ImageNet classification with deep convolutional neural networks. NeurIPS Proc 1–9 (2012).
119. Kim, D. W., Jang, H. Y., Kim, K. W., Shin, Y. & Park, S. H. Design characteristics of studies reporting the performance of artificial intelligence algorithms for diagnostic analysis of medical images: Results from recently published papers. Korean J. Radiol. 20, 405–410 (2019).
120. Giger, M. L., Chan, H. P. & Boone, J. Anniversary paper: History and status of CAD and quantitative image analysis: The role of Medical Physics and AAPM. Med. Phys. 35, 5799–5820 (2008).
121. Boyer, B., Balleyguier, C., Granat, O. & Pharaboz, C. CAD in questions / answers Review of the literature. 69, 24–33 (2009).
122. Seung, J. K. et al. Computer-aided detection in full-field digital mammography: Sensitivity and reproducibility in serial examinations. Radiology 246, 71–80 (2008).
123. Gilbert, F. J. & Lemke, H. Computer-aided diagnosis. Br. J. Radiol. 78, 1–2 (2005).
124. Skaane, P., Kshirsagar, A., Hofvind, S., Jahr, G. & Castellino, R. A. Mammography screening using independent double reading with consensus: Is there a potential benefit for computer-aided detection? Acta radiol. 53, 241–248 (2012).
125. Lehman, C.
D. et al. Diagnostic accuracy of digital screening mammography with and without computer-aided detection. JAMA Intern. Med. 175, 1828–1837 (2015).
126. Keen, J. D., Keen, J. M. & Keen, J. E. Utilization of Computer-Aided Detection for Digital Screening Mammography in the United States, 2008 to 2016. J. Am. Coll. Radiol. 15, 44–48 (2018).
127. Rao, V. M. et al. How widely is computer-aided detection used in screening and diagnostic mammography? J. Am. Coll. Radiol. 7, 802–805 (2010).
128. Gilbert, F. J. et al. Single Reading with Computer-Aided Detection for Screening Mammography. N. Engl. J. Med. 359, 1675–1684 (2008).
129. Taylor, P. & Potts, H. W. W. Computer aids and human second reading as interventions in screening mammography: Two systematic reviews to compare effects on cancer detection and recall rate. Eur. J. Cancer 44, 798–807 (2008).
130. Khoo, L. A. L., Taylor, P. & Given-Wilson, R. M. Computer-aided detection in the United Kingdom National Breast Screening Programme: Prospective study. Radiology 237, 444–449 (2005).
131. Tchou, P. M. et al. Interpretation time of computer-aided detection at screening mammography. Radiology 257, 40–46 (2010).
132. Hupse, R., Samulski, M., Lobbes, M. B. & Ritse M. Mann. Computer-aided detection of masses at mammography: Interactive decision support versus prompts. Radiology 266, 123–9 (2013).
133. Hickman, S. E. et al. Machine Learning for Workflow Applications in Screening Mammography: Systematic Review and Meta-Analysis. Radiology 302, 88–104 (2022).
134. Dembrower, K. et al. Effect of artificial intelligence-based triaging of breast cancer screening mammograms on cancer detection and radiologist workload: a retrospective simulation study. Lancet Digit. Heal. 2, e468–e474 (2020).
135. Rodriguez-Ruiz, A. et al. Can we reduce the workload of mammographic screening by automatic identification of normal exams with artificial intelligence? A feasibility study. Eur. Radiol. 29, 4825–4832 (2019).
136. Freeman, K. et al.
Use of artificial intelligence for image analysis in breast cancer screening - Rapid review and evidence map (UK NSC). (2021).
137. Schaffter, T. et al. Evaluation of Combined Artificial Intelligence and Radiologist Assessment to Interpret Screening Mammograms. JAMA Netw. Open 3, e200265 (2020).
138. McKinney, S. M. et al. International evaluation of an AI system for breast cancer screening. Nature 577, 89–94 (2020).
139. Hickman, S. E., Baxter, G. C. & Gilbert, F. J. Adoption of artificial intelligence in breast imaging: evaluation, ethical constraints and limitations. Br. J. Cancer 125, 15–22 (2021).
140. The Royal College of Radiologists. Clinical Radiology UK Workforce Census 2019 Report. (2020).
141. Lecun, Y., Bengio, Y. & Hinton, G. Deep learning. Nature 521, 436–444 (2015).
142. Liu, X. et al. A comparison of deep learning performance against health-care professionals in detecting diseases from medical imaging: a systematic review and meta-analysis. Lancet Digit. Heal. 1, e271–e297 (2019).
143. House of Lords Select Committee on Artificial Intelligence. AI in the UK: ready, willing and able? (2018).
144. NHSX. Artificial Intelligence: how to get it right Holistic guidance for the development and deployment of AI in health and care. (2019).
145. Geis, J. R. et al. Ethics of artificial intelligence in radiology: summary of the joint European and North American multisociety statement. Radiology 293, 436–440 (2019).
146. Park, S. H. & Han, K. Methodologic guide for evaluating clinical performance and effect of artificial intelligence technology for medical diagnosis and prediction. Radiology 286, 800–809 (2018).
147. NHSX. A Buyer’s Guide to AI in Health and Care. (2020).
148. Willemink, M. J. et al. Preparing medical imaging data for machine learning. Radiology 295, 4–15 (2020).
149. Salim, M. et al. External Evaluation of 3 Commercial Artificial Intelligence Algorithms for Independent Assessment of Screening Mammograms. JAMA Oncol. 6, 1581–1588 (2020).
150. OPTIMAM. OMI-DB Database Information (tabular view). https://medphys.royalsurrey.nhs.uk/omidb/stats_table/ (2020).
151. Health Data Research UK. Health Data Research Innovation Gateway: About. https://www.hdruk.ac.uk/about-us/ (2020).
152. Wilkinson, M. D. et al. The FAIR Guiding Principles for scientific data management and stewardship. Sci. Data 3, 160018 (2016).
153. Suckling, J. et al. The Mammographic Image Analysis Society Digital Mammogram Database. Expert. Medica, Int. Congr. Ser. 1069, 375–378 (1994).
154. Lee, R.S., Gimenez, F.L., Hoogi, A., Rubin, D. Curated Breast Imaging Subset of DDSM [Dataset]. The Cancer Imaging Archive. https://doi.org/10.7937/K9/TCIA.2016.7O02S9CY (2020).
155. Newitt, D., Hylton, N. on behalf of the I-SPY 1 Network and ACRIN 6657 Trial Team. Multi-center breast DCE-MRI data and segmentations from patients in the I-SPY 1/ACRIN 6657 trials. The Cancer Imaging Archive. https://doi.org/10.7937/K9/TCIA.2016.HdHpgJLK (2020).
156. Moreira, I. C. et al. INbreast: Toward a Full-field Digital Mammographic Database. Acad. Radiol. 19, 236–248 (2012).
157. Dembrower, K., Lindholm, P. & Strand, F. A Multi-million Mammography Image Dataset and Population-Based Screening Cohort for the Training and Evaluation of Deep Neural Networks—the Cohort of Screen-Aged Women (CSAW). J. Digit. Imaging 33, 408–413 (2020).
158. Wu, N., Phang, J., Park, J., Shen, Y., Kim, S.G., Heacock, L. et al. The NYU Breast Cancer Screening Dataset v1.0. https://cs.nyu.edu/~kgeras/reports/datav1.0.pdf (2019).
159. Breast Cancer Digital Repository. More about BCDR. https://bcdr.eu/information/about (2020).
160. Lingle W Erickson BJ Zuley ML Jarosz R Bonaccio E Filippini J et al. Radiology Data from The Cancer Genome Atlas Breast Invasive Carcinoma [TCGA-BRCA] collection [Data set]. The Cancer Imaging Archive. https://doi.org/10.7937/K9/TCIA.2016.AB2NAZRP (2020).
161. UK National Screening Committee.
Interim guidance for those wishing to incorporate artificial intelligence into the National Breast Screening Programme. (2019).
162. Hosny, A., Parmar, C., Quackenbush, J., Schwartz, L. H. & Aerts, H. J. W. L. Artificial intelligence in radiology. Nat. Rev. Cancer 18, 500–510 (2018).
163. Nagendran, M. et al. Artificial intelligence versus clinicians: Systematic review of design, reporting standards, and claims of deep learning studies in medical imaging. BMJ 368, 1–12 (2020).
164. NHSX. AI in Health and Care Award winners. https://www.nhsx.nhs.uk/ai-lab/ai-lab-programmes/ai-health-and-care-award/ai-health-and-care-award-winners/ (2020).
165. Liu X Cruz Rivera S Moher D Calvert MJ Denniston AK and The SPIRIT-AI and CONSORT-AI Working Group. Reporting guidelines for clinical trial reports for interventions involving artificial intelligence: the CONSORT-AI extension. Nat. Med. 26, 1364–1374 (2020).
166. Cruz Rivera S Liu X Chan A Denniston AK Calvert MJ The SPIRIT-AI and CONSORT-AI Working Group SPIRIT-AI and Group CONSORT-AI Steering Group and SPIRIT-AI and CONSORT-AI Consensus Group. Guidelines for clinical trial protocols for interventions involving artificial intelligence: the SPIRIT-AI extension. Nat. Med. 26, 1351–1363 (2020).
167. Mongan, J., Moy, L. & Kahn, C. E. Checklist for Artificial Intelligence in Medical Imaging (CLAIM): A Guide for Authors and Reviewers. Radiol. Artif. Intell. 2, e200029 (2020).
168. Collins, G. S. & Moons, K. G. M. Reporting of artificial intelligence prediction models. Lancet 393, 1577–1579 (2019).
169. Sounderajah, V. et al. Developing specific reporting guidelines for diagnostic accuracy studies assessing AI interventions: The STARD-AI Steering Group. Nat. Med. 26, 807–808 (2020).
170. Halligan, S., Altman, D. G. & Mallett, S. Disadvantages of using the area under the receiver operating characteristic curve to assess imaging tests: A discussion and proposal for an alternative approach. Eur. Radiol. 25, 932–939 (2015).
171.
Recht, M. P. et al. Integrating artificial intelligence into the clinical practice of radiology: challenges and recommendations. Eur. Radiol. 30, 3576–3584 (2020).
172. Pianykh, O. S. et al. Continuous learning AI in radiology: Implementation principles and early applications. Radiology 297, 6–14 (2020).
173. Ghafur, S., Fontana, G., Halligan, J., Shaughnessy, J. O. & Darzi, A. NHS data: Maximising its impact on the health and wealth of the United Kingdom. (2020).
174. Gilbert, F. J., Smye, S. W. & Schönlieb, C. B. Artificial intelligence in clinical imaging: a health system approach. Clin. Radiol. 75, 3–6 (2020).
175. Salim, M., Dembrower, K., Eklund, M., Lindholm, P. & Strand, F. Range of Radiologist Performance in a Population-based Screening Cohort of 1 Million Digital Mammography Examinations. Radiology 297, 33–39 (2020).
176. Conant, E. F. et al. Improving Accuracy and Efficiency with Concurrent Use of Artificial Intelligence for Digital Breast Tomosynthesis. Radiol. Artif. Intell. 1, e180096 (2019).
177. Rasti, R., Teshnehlab, M. & Phung, S. L. Breast cancer diagnosis in DCE-MRI using mixture ensemble of convolutional neural networks. Pattern Recognit. 72, 381–390 (2017).
178. Dalmış, M. U. et al. Fully automated detection of breast cancer in screening MRI using convolutional neural networks. J. Med. Imaging 5, 014502 (2018).
179. Zhou, J. et al. Weakly supervised 3D deep learning for breast cancer classification and localization of the lesions in MR images. J. Magn. Reson. Imaging 50, 1144–1151 (2019).
180. Reeves, G. K. et al. Comparison of the effects of genetic and environmental risk factors on in situ and invasive ductal breast cancer. Int. J. Cancer Comp. 131, 930–7 (2011).
181. Green, J. et al. Cohort Profile: the Million Women Study. 48, 28–29 (2019).
182. Vilmun, B. M. et al. Impact of adding breast density to breast cancer risk models: A systematic review. Eur. J. Radiol. 127, (2020).
183. Dench, E. et al.
Measurement challenge: protocol for international case–control comparison of mammographic measures that predict breast cancer risk. BMJ Open 9, e031041 (2019). 184. Qu, Y. H. et al. Prediction of pathological complete response to neoadjuvant chemotherapy in breast cancer using a deep learning (DL) method. Thorac. Cancer 11, 651–658 (2020). 185. Ravichandran, K., Braman, N., Janowczyk, A. & Madabhushi, A. A deep learning classifier for prediction of pathological complete response to neoadjuvant chemotherapy from baseline breast DCE-MRI. in Proc. SPIE 10575, Medical Imaging 2018: Computer-Aided Diagnosis vol. 10575 105750C (2018). 186. Huynh, B. Q., Antropova, N. & Giger, M. L. Comparison of breast DCE-MRI contrast time points for predicting response to neoadjuvant chemotherapy using deep convolutional neural network features with transfer learning. Med. Imaging 2017 Comput. Diagnosis 10134, 101340U (2017). 187. Braman, N. et al. Deep learning-based prediction of response to HER2-targeted neoadjuvant chemotherapy from pre-treatment dynamic breast MRI: A multi-institutional validation study. pre print arXiv2001.08570 (2020). 188. Ha, R. et al. Convolutional Neural Network Using a Breast MRI Tumor Dataset Can Predict 197 Oncotype Dx Recurrence Score. J. Magn. Reson. Imaging 49, 518–524 (2019). 189. Department of Health and Social Care. Code of conduct for data-driven health and care technology. https://www.gov.uk/government/publications/code-of-conduct-for-data-driven- health-and-care-technology (2020). 190. Office for Artifical Intelligence. Department for Digital Culture, Media and Sport. Joint statement from founding members of the Global Partnership on Artificial Intelligence. https://www.gov.uk/government/publications/joint-statement-from-founding-members-of- the-global-partnership-on-artificial-intelligence/joint-statement-from-founding-members-of- the-global-partnership-on-artificial-intelligence (2020). 191. Mudgal, K. S. & Das, N. 
The ethical adoption of artificial intelligence in radiology. BJR|Open 2, 20190020 (2020). 192. Ledford, H. Google health-data scandal spooks researchers. https://www.nature.com/articles/d41586-019-03574-5 (2020). 193. DeCamp, M. & Lindvall, C. Latent bias and the implementation of artificial intelligence in medicine. J. Am. Med. Informatics Assoc. 27, 2020–2023 (2020). 194. Chen, I. Y. et al. Ethical Machine Learning in Health. pre print arXiv2009.10576 (2020). 195. Kahn, C. E. Combatting Bias in Medical AI Systems. https://pubs.rsna.org/page/ai/blog/2020/7/ryai_editorsblog0715 (2020). 196. Department of Health and Social Care. The NHS Constitution for England. https://www.gov.uk/government/publications/the-nhs-constitution-for-england/the-nhs- constitution-for-england (2020). 197. Department of Health and Social Care. Creating the right framework to realise the benefits for patients and the NHS where data underpins innovation. https://www.gov.uk/government/publications/creating-the-right-framework-to-realise-the- benefits-of-health-data/creating-the-right-framework-to-realise-the-benefits-for-patients- and-the-nhs-where-data-underpins-innovation (2020). 198. Legislation.go.uk. Data Protection Act 2018. http://www.legislation.gov.uk/ukpga/2018/12/contents (2020). 199. Intersoft Consulting. General Data Protection Regulation (GDPR). (2020). 200. Information Comissioners Office. What are the rules on special category data? https://ico.org.uk/for-organisations/guide-to-data-protection/guide-to-the-general-data- protection-regulation-gdpr/special-category-data/what-are-the-rules-on-special-category- data/#scd1 (2020). 201. Information Comissioners Office. Anonymisation: managing data protection risk code of 198 practice. https://ico.org.uk/media/1061/anonymisation-code.pdf (2012). 202. HRA. Confidentiality Advisory Group. https://www.hra.nhs.uk/approvals-amendments/what- approvals-do-i-need/confidentiality-advisory-group/ (2020). 203. The Wellcome Trust. 
The One-Way Mirror: Public attitudes to commercial access to health data. (2016). 204. NHS Digital. National data opt-out. https://digital.nhs.uk/services/national-data-opt-out (2020). 205. National Data Guardian. Caldicott Principles: a consultation about revising, expanding and upholding the principles. https://www.gov.uk/government/consultations/caldicott- principles-a-consultation-about-revising-expanding-and-upholding-the-principles (2020). 206. Ming, C. et al. Machine learning-based lifetime breast cancer risk reclassification compared with the BOADICEA model: impact on screening recommendations. Br. J. Cancer 123, 860– 867 (2020). 207. Wachter, R. M. Making it work: harnessing the power of health information technology to improve care in England. (2016). 208. Department For Digital Culture Media and Sport. National Data Strategy. https://www.gov.uk/government/publications/uk-national-data-strategy/national-data- strategy#about-the-national-data-strategy (2020). 209. Hern, A. NHS could have avoided WannaCry hack with ‘basic IT security’, says report. https://www.theguardian.com/technology/2017/oct/27/nhs-could-have-avoided-wannacry- hack-basic-it-security-national-audit-office (2017). 210. Moore, S. M. et al. De-identification of medical images with retention of scientific research value. Radiographics 35, 727–735 (2015). 211. NHS. NHS Digital Academy. https://www.england.nhs.uk/digitaltechnology/nhs-digital- academy/ (2020). 212. The Topol Review. Preparing the healthcare workforce to deliver the digital future. (2019). 213. The Royal College of Radiologists. Clinical Radiology Specialty Training Curriculum. (2020). 214. American College of Radiology Data Science Institute®. FDA Cleared AI Algorithms. https://www.acrdsi.org/DSI-Services/FDA-Cleared-AI-Algorithms. (2020). 215. Sechopoulos, I. & Mann, R. M. Stand-alone artificial intelligence - The future of breast cancer screening? Breast 49, 254–260 (2020). 216. Watanabe, L. 
The Power of Triage (CADt) in Breast Imaging. Applied Radiology https://www.appliedradiology.com/articles/the-power-of-triage-cadt-in-breast-imaging. (2020). 199 217. DeAngelis, C. D. & Fontanarosa, P. B. US Preventive Services Task Force and Breast Cancer Screening. JAMA 303, 172–173 (2010). 218. Pharoah, P. D. P., Sewell, B., Fitzsimmons, D., Bennett, H. S. & Pashayan, N. Cost effectiveness of the NHS breast screening programme: life table model. BMJ 346, 1–8 (2013). 219. Kohli, A. & Jha, S. Why CAD Failed in Mammography. J. Am. Coll. Radiol. 15, 12–14 (2017). 220. Philpotts, L. E. Can computer-aided detection be detrimental to mammographic interpretation? Radiology 253, 17–22 (2009). 221. McInnes, M. D. F. et al. Preferred Reporting Items for a Systematic Review and Meta-analysis of Diagnostic Test Accuracy Studies The PRISMA-DTA Statement. JAMA - J. Am. Med. Assoc. 319, 388–396 (2018). 222. Whiting, P. F., Rutjes, A. W. ., Westwood, M. . & Al, E. QUADAS-2: A Revised Tool for the Quality Assessment of Diagnostic Accuracy Studies. Ann. Intern. Med. 18, 529–536 (2011). 223. UK National Screening Committe. Use of artificial intelligence for image analysis in breast cancer screening. Rapid review and evidence map. (2021). 224. Wolff, R. F. et al. PROBAST: A tool to assess the risk of bias and applicability of prediction model studies. Ann. Intern. Med. 170, 51–58 (2019). 225. R Core Team (2020). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. https://www.r-project.org/. 226. Philipp Doebler (2020). mada: Meta-Analysis of Diagnostic Accuracy. R package version 0.5.10. https://CRAN.R-project.org/package=mada. 227. Angelo Canty and Brian Ripley (2020). boot: Bootstrap R (S-Plus) Functions. R package version 1.3-25. https://cran.r-project.org/web/packages/boot/boot.pdf. 228. Reitsma, J. B. et al. Bivariate analysis of sensitivity and specificity produces informative summary measures in diagnostic reviews. 
J. Clin. Epidemiol. 58, 982–990 (2005). 229. Yala, A., Schuster, T., Miles, R., Barzilay, R. & Lehman, C. A deep learning model to triage screening mammograms: A simulation study. Radiology 293, 38–46 (2019). 230. Balta, C., Rodriguez-Ruiz, A., Mieskes, C., Karssemeijer, N. & Heywang-Köbrunner, S. H. Going from double to single reading for screening exams labeled as likely normal by AI: what is the impact? Proc. SPIE 11513, 15th Int. Work. Breast Imaging 66 (2020) doi:10.1117/12.2564179. 231. Kyono, T., Gilbert, F. J. & van der Schaar, M. MAMMO: A Deep Learning Solution for Facilitating Radiologist-Machine Collaboration in Breast Cancer Diagnosis. Arxiv [Preprint] 1– 18 (2018). 232. Kyono, T., Gilbert, F. J. & van der Schaar, M. Improving Workflow Efficiency for Mammography Using Machine Learning. J. Am. Coll. Radiol. 17, 56–63 (2020). 200 233. Lotter, W. et al. Robust breast cancer detection in mammography and digital breast tomosynthesis using annotation-efficient deep learning approach. Arxiv [Preprint] 1–16 (2019). 234. Rodríguez-Ruiz, A. et al. Detection of breast cancer with mammography: Effect of an artificial intelligence support system. Radiology 290, 1–10 (2019). 235. Rodriguez-Ruiz, A. et al. Stand-Alone Artificial Intelligence for Breast Cancer Detection in Mammography: Comparison With 101 Radiologists. J. Natl. Cancer Inst. 111, 916–922 (2019). 236. Kim, H. E. et al. Changes in cancer detection and false-positive recall in mammography using artificial intelligence: a retrospective, multireader study. Lancet Digit. Heal. 2, e138–e148 (2020). 237. Kheiron. Press release: Kheiron wins UK Government award to help solve critical challenges in UK breast screening with Mia. https://www.kheironmed.com/news/https/www.nhsx.nhs.uk/news/nhs-ai-lab-speed-cancer- and-heart-care/ (2020). 238. ClinicalTrials.gov. Development of Artificial Intelligence System for Detection and Diagnosis of Breast Lesion Using Mammography. 
https://clinicaltrials.gov/ct2/show/NCT03708978 (2021). 239. ClinicalTrials.gov. Experiment on the Use of Innovative Computer Vision Technologies for Analysis of Medical Images in the Moscow Healthcare System. https://clinicaltrials.gov/ct2/show/NCT04489992 (2021). 240. IBM Research Editorial Staff. DREAM Challenge results: Can machine learning help improve accuracy in breast cancer screening? https://www.ibm.com/blogs/research/2017/06/dream- challenge-results/ (2020). 241. Heaven, W. D. AI is wrestling with a replication crisis. MIT Technology Review https://www.technologyreview.com/2020/11/12/1011944/artificial-intelligence-replication- crisis-science-big-tech-google-deepmind-facebook-openai/?utm_source=pocket-newtab- global-en-GB. (2020). 242. Haibe-Kains, B. et al. The importance of transparency and reproducibility in artificial intelligence research. Nature 586, E14-18 (2020). 243. Lowes, S. & Paul, S. British Society of Breast Radiology Virtual Annual Scientific Meeting 2021. Breast Cancer Res. 23, 1–9 (2021). 244. Goldacre, B. & Morley, J. Better, Broader, Safer: Using Health Data for Research and Analysis. https://assets.publishing.service.gov.uk/government/uploads/system/uploads/attachment_d ata/file/1067053/goldacre-review-using-health-data-for-research-and-analysis.pdf (2022). 245. Prior, F. W. et al. TCIA: An information resource to enable open science. Proc. Annu. Int. Conf. 201 IEEE Eng. Med. Biol. Soc. EMBS 1282–1285 (2013) doi:10.1109/EMBC.2013.6609742. 246. Johnson, A. E. W. et al. MIMIC-CXR, a de-identified publicly available database of chest radiographs with free-text reports. Sci. Data 6, 1–8 (2019). 247. Bien, N. et al. Deep-learning-assisted diagnosis for knee magnetic resonance imaging: Development and retrospective validation of MRNet. PLoS Med. 15, 1–19 (2018). 248. Suckling J, Parker J, Dance D, Astley S, Hutt I, Boggis C, Ricketts I, et al. Mammographic Image Analysis Society (MIAS) database v1.21 [Dataset]. 
https://www.repository.cam.ac.uk/handle/1810/250394 (2015). 249. Heath M, Bowyer K, Kopans d, M. R. and K. J. P. The Digital Database for Screening Mammography. http://www.eng.usf.edu/cvprg/mammography/database.html#:~:text=The Digital Database for Screening,Medical Research and Materiel Command. (1998). 250. Halling-Brown, M. D. et al. OPTIMAM mammography image database: A large-scale resource of mammography images and clinical data. Radiol. Artif. Intell. 3, (2021). 251. Lee, R. S. et al. Data Descriptor: A curated mammography data set for use in computer-aided detection and diagnosis research. Sci. Data 4, 1–9 (2017). 252. Lee, R.S., Gimenez, F.L., Hoogi, A., Rubin, D. Curated Breast Imaging Subset of DDSM [Dataset]. The Cancer Imaging Archive. https://doi.org/10.7937/K9/TCIA.2016.7O02S9CY (2020). 253. Biokeanos. INbreast Database. https://biokeanos.com/source/INBreast (2022). 254. BCDR. Breast Cancer Digital Repository. https://www.bcdr.eu/information/about (2022). 255. Raul Ramon Pollan. Improving multilayer perceptron classifiers AUC performance. (2011). 256. Jeong, J. J. et al. The EMory BrEast imaging Dataset (EMBED): A Racially Diverse, Granular Dataset of 3.5M Screening and Diagnostic Mammograms. arXiv 2013–2015 (2022). 257. Moser, K. et al. Extending the age range for breast screening in England: Pilot study to assess the feasibility and acceptability of randomization. J. Med. Screen. 18, 96–102 (2011). 258. OFFIS DICOM Toolkit. DCMTK. https://support.dcmtk.org/docs/index.html (2022). 259. DICOM Standards Committee. DICOM PS3.15 2022b - Security and System Management Profiles. https://dicom.nema.org/medical/dicom/current/output/chtml/part15/ps3.15.html (2022). 260. G. van Rossum. Python tutorial, Technical Report CS-R9526, Centrum voor Wiskunde en Informatica (CWI), Amsterdam. (1995). 261. NHS. NHS Breast Screening Programme Central Return Data Set (KC62). 
https://www.datadictionary.nhs.uk/data_sets/central_return_data_sets/nhs_breast_screeni ng_programme_central_return_data_set__kc62_.html (2022). 202 262. NHS Digital. Breast Screening Programme. https://digital.nhs.uk/data-and- information/publications/statistical/breast-screening-programme (2022). 263. Public Health England. Interval cancers explained in the NHS Breast Screening Programme. https://www.gov.uk/government/publications/nhs-screening-programmes-duty-of- candour/interval-cancers-explained-in-the-nhs-breast-screening-programme-notes-for- professionals-and-patients (2020). 264. Gov UK. Regional ethnic diversity. https://www.ethnicity-facts-figures.service.gov.uk/uk- population-by-ethnicity/national-and-regional-populations/regional-ethnic-diversity/latest (2020). 265. Heller, S. L., Hudson, S. & Wilkinson, L. S. Breast density across a regional screening population: Effects of age, ethnicity and deprivation. Br. J. Radiol. 88, (2015). 266. Maroni, R. et al. A case-control study to evaluate the impact of the breast screening programme on mortality in England. Br. J. Cancer 124, 736–743 (2021). 267. General Medical Council. The professional duty of candour. https://www.gmc-uk.org/ethical- guidance/ethical-guidance-for-doctors/candour---openness-and-honesty-when-things-go- wrong/the-professional-duty-of-candour (2022). 268. Mainprize, J. G. et al. Prediction of Cancer Masking in Screening Mammography Using Density and Textural Features. Acad. Radiol. 26, 608–619 (2019). 269. Sheth, M. M. & McElligott, S. E. Case-based Review of Subtle Signs of Breast Cancer at Mammography. Radiographics 39, 630–631 (2019). 270. Lång, K., Hofvind, S., Rodríguez-Ruiz, A. & Andersson, I. Can artificial intelligence reduce the interval cancer rate in mammography screening? Eur. Radiol. 31, 5940–5947 (2021). 271. Larsen, M., Aglen, C. F., Lee, M. A. C. I. & Hoff, M. S. S. R. 
Artificial Intelligence Evaluation of 122 969 Mammography Examinations from a Population-based Screening Program. Radiology 000, 1–9 (2022). 272. Hinton, B. et al. Deep learning networks find unique mammographic differences in previous negative mammograms between interval and screen-detected cancers: A case-case study. Cancer Imaging 19, 1–9 (2019). 273. Graewingholt, A. & Rossi, P. G. Retrospective analysis of the effect on interval cancer rate of adding an artificial intelligence algorithm to the reading process for two-dimensional full-field digital mammography. J. Med. Screen. 28, 369–371 (2021). 274. Arkin, C. F., Mitchell, S. & Wachtel, M. How Many Patients Are Necessary to Assess Test Performance? JAMA J. Am. Med. Assoc. 264, 2074–2075 (1990). 275. Cohen, J. F. et al. STARD 2015 guidelines for reporting diagnostic accuracy studies: 203 Explanation and elaboration. BMJ Open 6, 1–17 (2016). 276. Wickham H, François R, Henry L, Müller K (2022). dplyr: A Grammar of Data Manipulation. https://dplyr.tidyverse.org, https://github.com/tidyverse/dplyr. 277. Wickham H, Girlich M (2022). tidyr: Tidy Messy Data. https://tidyr.tidyverse.org, https://github.com/tidyverse/tidyr. 278. Bates D, Mächler M, Bolker B, W. S. Fitting Linear Mixed-Effects Models Using lme. J. Stat. Softw. 4, 1–48 (2015). 279. Robin X, Turck N, Hainard A, Tiberti N, Lisacek F, Sanchez J, M. M. pROC: an open-source package for R and S+ to analyze and compare ROC curves. BMC Bioinformatics 12, (2011). 280. Saito T, R. M. Precrec: fast and accurate precision-recall and ROC curve calculations in R. Bioinformatics 33, 145-147. (2017). 281. Grolemund G, W. H. Dates and Times Made Easy with lubridate. J. Stat. Softw. 40, 1–25 (2011). 282. Carstensen B, Plummer M, Laara E, Hills M (2022). Epi: A Package for Statistical Analysis in Epidemiology. R package version 2.46. (2022). 283. Matt Dowle (2021). data.table. R package version 1.14.2. https://cran.r- project.org/web/packages/data.table/index.html. 284. 
Chen, H. & Boutros, P. C. VennDiagram: A package for the generation of highly-customizable Venn and Euler diagrams in R. BMC Bioinformatics 12, (2011). 285. Social Science Statistics. Easy Fisher Exact Test Calculator. https://www.socscistatistics.com/tests/fisher/default2.aspx (2022). 286. Patel, M. N., Looney, P., Young, K. & Halling-Brown, M. D. Automated collection of medical images for research from heterogeneous systems: trials and tribulations. Med. Imaging 2014 PACS Imaging Informatics Next Gener. Innov. 9039, 90390C (2014). 287. Wanders, A. J. T. et al. Interval Cancer Detection Using a Neural Network and Breast Density in Women with Negative Screening Mammograms. Radiology (2022) doi:10.1148/radiol.210832. 288. Gilbert, F. J. et al. Single reading with computer-aided detection and double reading of screening mammograms in the United Kingdom national breast screening program. Radiology 241, 47–53 (2006). 289. Freeman, K. et al. Use of artificial intelligence for image analysis in breast cancer screening programmes: Systematic review of test accuracy. BMJ 374, (2021). 290. Sharma, N. et al. Large-scale evaluation of an AI system as an independent reader for double reading in breast cancer screening. medRxiv 2021.02.26.21252537 (2021). 204 291. Simel, D. L., Samsa, G. P. & Matchar, D. B. Likelihood ratios with confidence: Sample size estimation for diagnostic test studies. J. Clin. Epidemiol. 44, 763–770 (1991). 292. Wu, N. et al. Deep Neural Networks Improve Radiologists’ Performance in Breast Cancer Screening. IEEE Trans. Med. Imaging 39, 1184–1194 (2020). 293. Lotter, W. et al. Robust breast cancer detection in mammography and digital breast tomosynthesis using an annotation-efficient deep learning approach. Nat. Med. 27, 244–249 (2021). 294. Taylor-Phillips, S., Clarke, A., Wheaton, M., Kearins, O. & Wallis, M. Fatigue and performance in interpreting breast screening mammograms. Breast Cancer Res. 14, P24–P24 (2012). 295. Lång, K. et al. 
Identifying normal mammograms in a large screening population using artificial intelligence. Eur. Radiol. 31, 1687–1692 (2021). 296. Lauritzen, A. D. & Lynge, E. An Artificial Intelligence – based Mammography Screening Protocol for Breast Cancer : Outcome and. Radiology 1–9 (2022). 297. National Institue for Health and Care Excellence. Artificial intelligence in mammography - medtech innovation briefing (MIB242). https://www.nice.org.uk/advice/mib242 (2021). 298. NHS. The National Strategy for AI in Health and Social Care. https://www.nhsx.nhs.uk/ai- lab/ai-lab-programmes/the-national-strategy-for-ai-in-health-and-social-care/. 299. Harwich, E. & Laycock, K. Thinking on its own: AI in the NHS. Reform vol. Jan https://reform.uk/research/thinking-its-own-ai-nhs (2018). 300. American college of radiology. ACR Data Science Institute AI Central. https://aicentral.acrdsi.org (2022). 301. Research, N. I. for H. and C. AI in Health and Care Awards - funded projects 2020. https://www.nihr.ac.uk/documents/ai-in-health-and-care-awards-funded-projects- 2020/25625#Mia_Mammography_Intelligent_Assessment_-_Kheiron_Medical_Technologies (2020). 302. Kheiron Medical Technologies. The AI in Health and Care Award and Kheiron Medical. https://www.kheironmed.com/nhsx-and-kheiron-medical/ (2022). 303. Imperial College London. AI breast cancer screening project wins government funding for NHS trial. https://www.imperial.ac.uk/news/222653/ai-breast-cancer-screening-project- wins/ (2021). 304. ClinicalTrials.gov. Artificial Intelligence in Large-scale Breast Cancer Screening (ScreenTrustCAD). https://clinicaltrials.gov/ct2/show/NCT04778670?term=screening+artificial+intelligence&con d=breast+cancer&draw=2&rank=3 (2021). 205 305. ClinicalTrials.gov. Artificial Intelligence in Breast Cancer Screening Programs in Córdoba (AITIC) (AITIC). https://clinicaltrials.gov/ct2/show/NCT04949776?term=screening+artificial+intelligence&con d=breast+cancer&draw=2&rank=2 (20221). 306. ClinicalTrials.gov. 
Artificial Intelligence for breaST canceR scrEening in mAMmography (AI- STREAM). https://clinicaltrials.gov/ct2/show/NCT05024591?term=screening+artificial+intelligence&con d=breast+cancer&draw=2&rank=4 (2021). 307. ClinicalTrials.gov. Artificial Intelligence in Breast Cancer Screening in Region Östergötland Linkoping (AI-ROL). https://clinicaltrials.gov/ct2/show/NCT05048095?term=AI&cond=breast+cancer+screening& draw=2&rank=1 (2022). 308. ClinicalTrials.gov. Mammography Screening With Artificial Intelligence (MASAI) (MASAI). https://clinicaltrials.gov/ct2/show/NCT04838756?term=AI&cond=breast+cancer+screening& draw=2&rank=10 (2022). 309. ScreenPoint Medical. Transpara® Breast Care AI to help radiologists in Denmark reduce Covid screening backlog. https://www.prnewswire.co.uk/news-releases/transpara-r-breast-care-ai- to-help-radiologists-in-denmark-reduce-covid-screening-backlog-819864219.html (2021). 310. NHS Digital. Cloud products, tools and assets. https://digital.nhs.uk/services/cloud-centre-of- excellence/cloud-products-tools-and-assets (2022). 311. The Royal College of Radiologists. Clinical Radiology Specialty Training Curriculum. www.rcr.ac.uk (2021). 312. ClinicalTrials.gov. Prospective Trial of Digital Breast Tomosynthesis (DBT) in Breast Cancer Screening. (PROSPECTS). (2019). 313. van Winkel, S. L. et al. Impact of artificial intelligence support on accuracy and reading time in breast tomosynthesis image interpretation: a multi-reader multi-case study. Eur. Radiol. 31, 8682–8691 (2021). 314. Geras, K. J., Mann, R. M. & Moy, L. Artificial intelligence for mammography and digital breast tomosynthesis: Current concepts and future perspectives. Radiology 293, 246–259 (2019). 315. Conant, E. F. et al. Improving Accuracy and Efficiency with Concurrent Use of Artificial Intelligence for Digital Breast Tomosynthesis. Radiol. Artif. Intell. 1, e180096 (2019). 
Appendix 1

Definitions of commonly used terms in this review:

• Computer-aided detection (CADe) = A system which locates an abnormality within an image and provides a prompt or marker to assist a human reader.
• Computer-aided diagnosis (CADx) = A system which provides a classification for the type of abnormality found in an image, for example at the level of cancer or no cancer for a case.
• Computer-aided triage (CADt) = A system which automatically assigns cases to a normal or abnormal category, providing a possible final case decision for normal cases and highlighting abnormal cases for further human reader review.
• Stand-alone = An algorithm that interprets the whole mammogram case / exam and provides an outcome independent of human interaction or interpretation.
• Reader = A breast clinician, radiologist or reporting radiographer who reports mammographic images.
• 2D standard-view mammography = An x-ray image of breast tissue which includes two views (mediolateral oblique and craniocaudal) for each breast (right and left).
• Adapted screening = The adjustment of the radiological screening workflow by changing reading protocols, such as using a CADt algorithm for machine-only reading of normal cases and presenting a proportion of suspicious cases to a single or double reader system. Other adjustments include the possibility of using a CADe and CADx algorithm as a stand-alone system to substitute one of the readers in a double reading system.
• Testing = The evaluation of an algorithm’s performance.
• Development = The training, tuning and validation of an algorithm.
• Pre-assigned thresholds = ML algorithm test performance levels (e.g., sensitivity and specificity) which are determined in the protocol and specified according to current evidence or national performance. This is in contrast to thresholds that are altered to find the optimum performance following the completion of the test.
• Clinically relevant thresholds = The current screening programme targets (sensitivity and specificity), as well as current screening reader performance, which the ML algorithm is required to reach, or for which it must provide a workflow solution in which these standards are met. For example, in a double reading system, if ML is to be used as a stand-alone reader alongside another human reader, then the thresholds for the ML algorithm could be set at current single reader performance.
• *Open database = “Neither login nor registration are required for these data collections”. We have defined this also as a public database.
• *Safeguarded database = “The safeguards include knowing who is using the data and for what purpose. The EUL outlines the restrictions on use for a particular data collection”.
• *Controlled database = “These data are only available to users who have been accredited and their data usage has been approved by the relevant Data Access Committee”.
• Private database = A controlled or safeguarded database as outlined above.
• External testing = When an algorithm is tested by an independent third party who has not been involved in the development of the algorithm.
• Internal testing = When an algorithm is tested by the company / academic institution that developed it.
• External dataset = A dataset that comes from a different source to the dataset used for development (training and validation). The difference can be geographical (a different site or country), temporal (a different time period) or both.
• Internal dataset = A dataset drawn from the same source as the dataset used for development (training and validation), which is used for testing.
• **Gray literature = “evidence not published in commercial publications”.

*UK Data Service. Data access policy. https://www.ukdataservice.ac.uk/get-data/data-access-policy. Accessed 6 January 2021.
**Paez A. Gray literature: An important resource in systematic reviews. J Evid Based Med. 2017;10(3):233–240.

Appendix 2

Protocol registration
PROSPERO (CRD42019156016)
https://www.crd.york.ac.uk/prospero/display_record.php?RecordID=156016

Link to protocol
https://www.crd.york.ac.uk/PROSPEROFILES/156016_PROTOCOL_20200909.pdf

Registered amendments
1. 29/10/2019 submitted initial application following completion of preliminary searches
2. 12/05/2020 updated time period for the review and fields for extraction, submitted prior to final search execution
3. 12/05/2020 updated authors, submitted prior to final search execution
4. 09/09/2020 submitted an update to the final search execution in protocol

Deviations from the protocol
5. Data collection – additional items were collected which were not included in the protocol. Although this may introduce bias in these fields (e.g. processed mammography adjusted from processed / raw), it was felt that these fields added significant information to the review.
6. Data collected – certain data collected as part of this review are not reported in the paper; however, the originally extracted raw data are available from the authors on request.
7. Data collected – study authors were not contacted for further information, as it was felt that this could bias the results of reporting as well as confuse the metrics used to evaluate quality of reporting (CLAIM, QUADAS, PROBAST). Therefore, we have reported based on what was available in the original manuscript and supplemental material only. To ensure data extraction was robust, it was checked by a third reviewer with a computer science background.
8. Meta-analysis – this was conducted only for external studies, as this allowed for consistency in reporting and a large enough number of studies to be compared. The methods from Liu et al were used to direct this analysis.
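As a hedged illustration of the restriction described in deviation 8, the Python sketch below filters a set of hypothetical study records to those tested on an external dataset and computes each study's sensitivity and specificity from its 2×2 counts. Per-study pairs of this kind are the usual input to a diagnostic accuracy meta-analysis; the sketch does not fit any pooled model. The record layout, the field names and all counts are invented for illustration and are not taken from the review.

```python
# Minimal sketch: keep only externally tested studies and compute
# per-study sensitivity and specificity from 2x2 counts.
# All records below are hypothetical and purely illustrative.

STUDIES = [
    {"name": "A", "external": True, "tp": 80, "fn": 20, "tn": 900, "fp": 100},
    {"name": "B", "external": False, "tp": 45, "fn": 5, "tn": 450, "fp": 50},
    {"name": "C", "external": True, "tp": 30, "fn": 10, "tn": 540, "fp": 60},
]


def sensitivity(study):
    """Proportion of cancers correctly flagged: TP / (TP + FN)."""
    return study["tp"] / (study["tp"] + study["fn"])


def specificity(study):
    """Proportion of normals correctly cleared: TN / (TN + FP)."""
    return study["tn"] / (study["tn"] + study["fp"])


def external_only(studies):
    """Return the studies whose test set was an external dataset."""
    return [s for s in studies if s["external"]]


if __name__ == "__main__":
    for s in external_only(STUDIES):
        print(f"{s['name']}: sensitivity={sensitivity(s):.2f}, "
              f"specificity={specificity(s):.2f}")
```

Keeping the external/internal flag on each record, rather than maintaining separate lists, makes it straightforward to repeat the calculation for internal studies if a sensitivity analysis is wanted.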
Conflicts of interest FJG undertakes consulting for technology companies, and both FJG and SEH have research collaborations with technology companies as detailed in the conflicts of interest statement. None of these organizations had any role in the funding, conduct, or publication of the study. 209 Appendix 3 Digital Literature Database Search: EMBASE (EXCERPTA MEDICA DATABASE) Database: Embase <1996 to 2020 Week 35> Search Strategy: -------------------------------------------------------------------------------- 1 (breast* adj2 (cancer* or carcino* or tumour* or tumor* or malignan*)).ti,ab. 2 (breast* adj2 (lump* or lesion* or mass*)).ti,ab. 3 exp breast cancer/ 4 (Breast adj2 (screen* or imag*)).ti,ab. 5 mammogra*.ti,ab. 6 (mammo-graph* or mastograph*).ti,ab. 7 exp mammography/ 8 ((convolutional or transfer or ensemble or deep or machine*) adj2 learning).ti,ab. 9 ((deep or artificial or convolutional or neural) adj2 net*).ti,ab. 10 "artificial intelligence".ti,ab. 11 ("computer assisted diagnosis" or "computer assisted detection" or "computer aided detection" or "computer aided diagnosis").ti,ab. 12 (CNN or CAD).ti,ab. 13 exp machine learning/ 14 exp artificial intelligence/ 15 (Radiolo* or radiographer* or reader* or expert* or expertise or specialist* or clinician* or physician* or practitioner* or human* or doctor* or person*).ti,ab. 16 (workflow* or "clinical practice" or standalone or stand-alone or independent* or automat* or "screening tool" or "triage tool" or comput*).ti,ab. 
17 1 or 2 or 3 18 4 or 5 or 6 or 7 19 8 or 9 or 10 or 11 or 12 or 13 or 14 20 15 or 16 21 17 and 18 and 19 and 20 22 limit 21 to yr="2012 - 2020" MEDLINE (MEDICAL LITERATURE ANALYSIS AND RETRIEVAL SYSTEM ONLINE) Database: Ovid MEDLINE(R) and Epub Ahead of Print, In-Process & Other Non-Indexed Citations, Daily and Versions(R) <1946 to September 02, 2020> Search Strategy: -------------------------------------------------------------------------------- 1 (breast* adj2 (cancer* or carcino* or tumour* or tumor* or malignan*)).ti,ab. 2 (breast* adj2 (lump* or lesion* or mass*)).ti,ab. 3 exp breast cancer/ 4 (Breast adj2 (screen* or imag*)).ti,ab. 5 mammogra*.ti,ab. 6 (mammo-graph* or mastograph*).ti,ab. 7 exp mammography/ 8 ((convolutional or transfer or ensemble or deep or machine*) adj2 learning).ti,ab. 9 ((deep or artificial or convolutional or neural) adj2 net*).ti,ab. 210 10 "artificial intelligence".ti,ab. 11 ("computer assisted diagnosis" or "computer assisted detection" or "computer aided detection" or "computer aided diagnosis").ti,ab. 12 (CNN or CAD).ti,ab. 13 exp machine learning/ 14 exp artificial intelligence/ 15 (Radiolo* or radiographer* or reader* or expert* or expertise or specialist* or clinician* or physician* or practitioner* or human* or doctor* or person*).ti,ab. 16 (workflow* or "clinical practice" or standalone or stand-alone or independent* or automat* or "screening tool" or "triage tool" or comput*).ti,ab. 
17 1 or 2 or 3
18 4 or 5 or 6 or 7
19 8 or 9 or 10 or 11 or 12 or 13 or 14
20 15 or 16 (6353106)
21 17 and 18 and 19 and 20
22 limit 21 to yr="2012 - 2020"

SCOPUS
( ( TITLE-ABS-KEY ( breast* W/2 ( cancer* OR carcino* OR tumour* OR tumor* OR malignan* ) ) ) OR ( TITLE-ABS-KEY ( breast* W/2 ( lump* OR lesion* OR mass* ) ) ) )
AND ( ( TITLE-ABS-KEY ( breast* W/2 ( screen* OR imag* ) ) ) OR ( TITLE-ABS-KEY ( mammogra* ) ) OR ( TITLE-ABS-KEY ( mammo-graph* OR mastograph* ) ) )
AND ( ( TITLE-ABS-KEY ( ( convolutional OR transfer OR ensemble OR deep OR machine* ) W/2 learning ) ) OR ( TITLE-ABS-KEY ( ( deep OR artificial OR convolutional OR neural ) W/2 net* ) ) OR ( TITLE-ABS-KEY ( "artificial intelligence" ) ) OR ( TITLE-ABS-KEY ( "computer assisted diagnosis" OR "computer assisted detection" OR "computer aided detection" OR "computer aided diagnosis" ) ) OR ( TITLE-ABS-KEY ( cnn OR cad ) ) )
AND ( ( TITLE-ABS-KEY ( radiolo* OR radiographer* OR reader* OR expert* OR expertise OR specialist* OR clinician* OR physician* OR practitioner* OR human* OR doctor* OR person* ) ) OR ( TITLE-ABS-KEY ( workflow* OR "clinical practice" OR standalone OR stand-alone OR independent* OR automat* OR "screening tool" OR "triage tool" OR comput* ) ) )
AND ( LIMIT-TO ( PUBYEAR , 2020 ) OR LIMIT-TO ( PUBYEAR , 2019 ) OR LIMIT-TO ( PUBYEAR , 2018 ) OR LIMIT-TO ( PUBYEAR , 2017 ) OR LIMIT-TO ( PUBYEAR , 2016 ) OR LIMIT-TO ( PUBYEAR , 2015 ) OR LIMIT-TO ( PUBYEAR , 2014 ) OR LIMIT-TO ( PUBYEAR , 2013 ) OR LIMIT-TO ( PUBYEAR , 2012 ) )

WEB OF SCIENCE (CORE COLLECTION)

# 18  1,998  #16 AND #15 AND #14 AND #13
Refined by: PUBLICATION YEARS: ( 2020 OR 2012 OR 2019 OR 2018 OR 2017 OR 2016 OR 2015 OR 2014 OR 2013 )
Indexes=SCI-EXPANDED, SSCI, A&HCI, CPCI-S, CPCI-SSH, BKCI-S, BKCI-SSH, ESCI, CCR-EXPANDED, IC Timespan=All years

# 17  3,395  #16 AND #15 AND #14 AND #13
Indexes=SCI-EXPANDED, SSCI, A&HCI, CPCI-S, CPCI-SSH, BKCI-S, BKCI-SSH, ESCI, CCR-EXPANDED, IC Timespan=All years

# 16  11,039,655  #12 OR #11
Indexes=SCI-EXPANDED, SSCI, A&HCI, CPCI-S, CPCI-SSH, BKCI-S, BKCI-SSH, ESCI, CCR-EXPANDED, IC Timespan=All years

# 15  645,943  #10 OR #9 OR #8 OR #7 OR #6
Indexes=SCI-EXPANDED, SSCI, A&HCI, CPCI-S, CPCI-SSH, BKCI-S, BKCI-SSH, ESCI, CCR-EXPANDED, IC Timespan=All years

# 14  59,602  #5 OR #4 OR #3
Indexes=SCI-EXPANDED, SSCI, A&HCI, CPCI-S, CPCI-SSH, BKCI-S, BKCI-SSH, ESCI, CCR-EXPANDED, IC Timespan=All years

# 13  561,062  #2 OR #1
Indexes=SCI-EXPANDED, SSCI, A&HCI, CPCI-S, CPCI-SSH, BKCI-S, BKCI-SSH, ESCI, CCR-EXPANDED, IC Timespan=All years

# 12  5,354,666  TS = (workflow* or "clinical practice" or standalone or stand-alone or independent* or automat* or "screening tool" or "triage tool" or comput*)
Indexes=SCI-EXPANDED, SSCI, A&HCI, CPCI-S, CPCI-SSH, BKCI-S, BKCI-SSH, ESCI, CCR-EXPANDED, IC Timespan=All years

# 11  6,491,265  TS = (Radiolo* or radiographer* or reader* or expert* or expertise or specialist* or clinician* or physician* or practitioner* or human* or doctor* or person*)
Indexes=SCI-EXPANDED, SSCI, A&HCI, CPCI-S, CPCI-SSH, BKCI-S, BKCI-SSH, ESCI, CCR-EXPANDED, IC Timespan=All years

# 10  107,204  TS = (CNN or CAD)
Indexes=SCI-EXPANDED, SSCI, A&HCI, CPCI-S, CPCI-SSH, BKCI-S, BKCI-SSH, ESCI, CCR-EXPANDED, IC Timespan=All years

# 9  13,595  TS = ("computer assisted diagnosis" or "computer assisted detection" or "computer aided detection" or "computer aided diagnosis")
Indexes=SCI-EXPANDED, SSCI, A&HCI, CPCI-S, CPCI-SSH, BKCI-S, BKCI-SSH, ESCI, CCR-EXPANDED, IC Timespan=All years

# 8  53,218  TS = ("artificial intelligence")
Indexes=SCI-EXPANDED, SSCI, A&HCI, CPCI-S, CPCI-SSH, BKCI-S, BKCI-SSH, ESCI, CCR-EXPANDED, IC Timespan=All years

# 7  388,724  TS = ((deep or artificial or convolutional or neural) NEAR/2 net*)
Indexes=SCI-EXPANDED, SSCI, A&HCI, CPCI-S, CPCI-SSH, BKCI-S, BKCI-SSH, ESCI, CCR-EXPANDED, IC Timespan=All years

# 6  202,363  TS = ((convolutional or transfer or ensemble or deep or machine*) NEAR/2 learning)
Indexes=SCI-EXPANDED, SSCI, A&HCI, CPCI-S, CPCI-SSH, BKCI-S, BKCI-SSH, ESCI, CCR-EXPANDED, IC Timespan=All years

# 5  30  TS = (mammo-graph* or mastograph*)
Indexes=SCI-EXPANDED, SSCI, A&HCI, CPCI-S, CPCI-SSH, BKCI-S, BKCI-SSH, ESCI, CCR-EXPANDED, IC Timespan=All years

# 4  46,265  TS = (mammogra*)
Indexes=SCI-EXPANDED, SSCI, A&HCI, CPCI-S, CPCI-SSH, BKCI-S, BKCI-SSH, ESCI, CCR-EXPANDED, IC Timespan=All years

# 3  25,523  TS = (Breast NEAR/2 (screen* or imag*) )
Indexes=SCI-EXPANDED, SSCI, A&HCI, CPCI-S, CPCI-SSH, BKCI-S, BKCI-SSH, ESCI, CCR-EXPANDED, IC Timespan=All years

# 2  17,412  TS = (breast* NEAR/2 (lump* or lesion* or mass*) )
Indexes=SCI-EXPANDED, SSCI, A&HCI, CPCI-S, CPCI-SSH, BKCI-S, BKCI-SSH, ESCI, CCR-EXPANDED, IC Timespan=All years

# 1  553,266  TS = (breast* NEAR/2 (cancer* or carcino* or tumour* or tumor* or malignan*) )
Indexes=SCI-EXPANDED, SSCI, A&HCI, CPCI-S, CPCI-SSH, BKCI-S, BKCI-SSH, ESCI, CCR-EXPANDED, IC Timespan=All years

CENTRAL (COCHRANE CENTRAL REGISTER OF CONTROLLED TRIALS)
ID Search
#1 ((breast* near/2 (cancer* or carcino* or tumour* or tumor* or malignan*))):ti,ab,kw (Word variations have been searched)
#2 (breast* near/2 (lump* or lesion* or mass*))
#3 (Breast near/2 (screen* or imag*))
#4 (mammogra*)
#5 (mammo-graph* or mastograph*)
#6 ((convolutional or transfer or ensemble or deep or machine*) near/2 learning)
#7 ((deep or artificial or convolutional or neural) near/2 net*)
#8 ("artificial intelligence")
#9 ("computer assisted diagnosis" or "computer assisted detection" or "computer aided detection" or "computer aided diagnosis")
#10 (CNN or CAD)
#11 (Radiolo* or radiographer* or reader* or expert* or expertise or specialist* or clinician* or physician* or practitioner* or human* or doctor* or person*) 1090834
#12 (workflow* or "clinical practice" or standalone or stand-alone or independent* or automat* or "screening tool" or "triage tool" or comput*) 167454
#13 #1 OR #2
#14 #3 OR #4 OR #5
#15 #6 OR #7 OR #8 OR #9 #10
#16 #11 OR #12
#17 #13 AND #14 AND #15 AND #16
with Publication Year from 2012 to 2020, in Trials

Grey Database Search:

DBLP (DATABASE SYSTEMS AND LOGIC PROGRAMMING)
Machine learning Breast cancer Mammography
Then separate search for:
Deep Learning Breast cancer Mammography

ACM (ASSOCIATION FOR COMPUTING MACHINERY, FULL TEXT COLLECTION)
[[All: machine learning] AND [All: breast cancer] AND [All: mammography]] OR [[All: deep learning] AND [All: breast cancer] AND [All: mammography]] AND [Publication Date: (01/01/2012 TO 31/12/2020)]

IEEE
(("All Metadata":Machine learning AND Breast cancer AND Mammography) OR "All Metadata":Deep Learning AND Breast cancer AND Mammography)

arXiv
Query: order: -announced_date_first; size: 200; date_range: from 2012-01-01 to 2020-12-31; include_cross_list: True; terms: AND all=Machine learning AND Breast cancer AND Mammography; OR all=Deep Learning AND Breast cancer AND Mammography

Appendix 4

Fields included in the data extraction:

Table 1

Study details
1. Journal
2. Year
3. Author
4. Title

Study design
5. Design (Retrospective / prospective)
6. Algorithm name
7. Traditional ML / Deep ML
8. Workflow application
9. Decision level

Study population (train + validation dataset)
10. Total number of cases
11. Total number of images
12. Number of normal cases (*not reported in main tables, please contact authors for the extraction tables)
13. Number of cancer cases (*not reported in main tables, please contact authors for the extraction tables)
14. Number of benign cases (*not reported in main tables, please contact authors for the extraction tables)
15. Vendor (*not reported in main tables, please contact authors for the extraction tables)
16. Country (*not reported in main tables, please contact authors for the extraction tables)

Human readers
17. Readers (number + experience)
18. Single / double / multi-reader
19. Clinical information available to readers (*not reported in main tables, please contact authors for the extraction tables)
20. Prior mammogram available to readers (*not reported in main tables, please contact authors for the extraction tables)
21. Reader reading as part of real time workflow / reader study
22. Ground truth

Algorithm performance
23. Internal / external
24. Algorithm threshold set
25. Randomised / non-randomised data split (*not reported in main tables, please contact authors for the extraction tables)
26. Bootstrapping / cross validation (resampling) (*updated to include other types of study format)
27. %normals (CI)
28. Negative Predictive Value (NPV)
29. False Negatives (FN)
30. Area Under the Curve (AUC) (CI)
31. Sensitivity (CI)
32. Specificity (CI)

Other
33. Data augmentation (flip / rotate / synthetic images) (*not reported in main tables, please contact authors for the extraction tables)
34. Handling missing data (*not reported in main tables, please contact authors for the extraction tables)
35. Compute time (*not reported in main tables, please contact authors for the main extraction tables)
36. Interpretability - e.g. heatmap / locator (*not reported in main tables, please contact authors for the extraction tables)
37. Algorithm code available
38. Funding Source (*not reported in main tables, please contact authors for the extraction tables)
39. Additional information relevant to testing (*not reported in main tables, please contact authors for the extraction tables)

Table 2

Study details
1. Journal
2. Year
3. Author
4. Title

Study population (test dataset)
5. Dataset name
6. Country where mammograms were taken
7. No. Centres
8. Year of studies
9. Vendor
10. Screen / Diagnostic
40. Digital / Film
41. Raw / Processed (*adjusted field to algorithm processing, raw and processed reported in main tables, please contact authors for the extraction tables)
11. Public / Private
12. Internal / External test set
13. Dataset Size cases
14. Dataset Size images
15. Proportion of cancers
16. Proportion of cancers that are (screen detected + subsequent round + interval) (*not reported in main tables, please contact authors for the extraction tables)

Training, validation and testing
17. Used for testing (*not reported in main tables, please contact authors for the extraction tables)
18. Dataset for testing same as train + validation (*not reported in main tables, please contact authors for the extraction tables)
19. Train / validation / test split (*not reported in main tables, please contact authors for the extraction tables)
20. Density measure
21. Average lesion size (*not reported in main tables, please contact authors for the extraction tables)
22. Age

*For clarity, a refined selection of fields was included in the main extraction tables (Tables 1, 2, 3 and 4). For details of the additional fields extracted, please contact the authors for these extraction tables. Varying terminology in reported studies made the identification of data for extraction challenging. Studies included in this review were allowed to focus on ML development, validation, or both.

Appendix 5

Further description of methods for primary meta-analysis

Studies were included in the primary study-level meta-analysis if they were conducted with an external dataset, the ground truth was similar to the set standard of histopathology plus follow-up of more than one year, and enough information was provided to produce contingency tables for both the algorithm and the reader (tested on the same dataset). If a study reported exams only, then this was used as the case number for analysis. When a simulated case cohort (e.g. using bootstrapping) was reported, this was used for the total and cancer case size. If the same algorithm was reported in different articles for the same workflow application, then the most recent version of the algorithm was included. If a study reported multiple algorithms, then the highest performing algorithm (at the test stage) defined by AUROC was used.
If multiple results for the same algorithm or reader were available in the same article, only the highest reported result from the test stage was used, ranked by AUROC or, if AUROC was not available, by positive prediction (total number of true positives and true negatives).

Appendix 6

Included articles references: 116, 134, 135, 137, 138, 149, 229–236

1. *+**Rodriguez-Ruiz A, Lång K, Gubern-Merida A, et al. Stand-alone artificial intelligence for breast cancer detection in mammography: comparison with 101 radiologists. J Natl Cancer Inst. 2019;111(9):916–922.
2. Rodriguez-Ruiz A, Lång K, Gubern-Merida A, et al. Can we reduce the workload of mammographic screening by automatic identification of normal exams with artificial intelligence? A feasibility study. Eur Radiol. 2019;29(9):4825–4832.
3. Yala A, Schuster T, Miles R, Barzilay R, Lehman C. A Deep learning model to triage screening mammograms: a simulation study. Radiology. 2019;293(1):38–46.
4. Kyono T, Gilbert FJ, van der Schaar M. Improving workflow efficiency for mammography using machine learning. J Am Coll Radiol. 2020;17(1):56–63.
5. *+**McKinney SM, Sieniek M, Godbole V, et al. International evaluation of an AI system for breast cancer screening. Nature. 2020;577(7788):89–94.
6. **Kim HE, Kim HH, Han BK, et al. Changes in cancer detection and false-positive recall in mammography using artificial intelligence: a retrospective, multireader study. Lancet Digit Health. 2020;2(3):e138–e148.
7. *+**Schaffter T, Buist DSM, Lee CI, et al. Evaluation of combined artificial intelligence and radiologist assessment to interpret screening mammograms. JAMA Netw Open. 2020;3(3):e200265.
8. Rodríguez-Ruiz A, Krupinski E, Mordang JJ, et al. Detection of breast cancer with mammography: effect of an artificial intelligence support system. Radiology. 2019;290(3):1–10.
9. *+**Lotter W, Diab AR, Haslam B, et al. Robust breast cancer detection in mammography and digital breast tomosynthesis using annotation-efficient deep learning approach. arXiv [Preprint]. 2019;1–16. http://arxiv.org/abs/1912.11027.
10. Kyono T, Gilbert FJ, van der Schaar M. MAMMO: a deep learning solution for facilitating radiologist-machine collaboration in breast cancer diagnosis. arXiv [Preprint]. 2018;1–18. http://arxiv.org/abs/1811.02661.
11. Geras KJ, Wolfson S, Shen Y, et al. High-resolution breast cancer screening with multi-view deep convolutional neural networks. arXiv [Preprint]. 2017;1–9. http://arxiv.org/abs/1703.07047.
12. *+**Salim M, Wåhlin E, Dembrower K, et al. External evaluation of 3 commercial artificial intelligence algorithms for independent assessment of screening mammograms. JAMA Oncol. 2020;6(10):1581–1588.
13. Dembrower K, Wåhlin E, Liu Y, et al. Effect of artificial intelligence-based triaging of breast cancer screening mammograms on cancer detection and radiologist workload: a retrospective simulation study. Lancet Digit Health. 2020;2(9):e468–e474.
14. Balta C, Rodriguez-Ruiz A, Mieskes C, Karssemeijer N, Heywang-Köbrunner SH. Going from double to single reading for screening exams labeled as likely normal by AI: what is the impact? Proc SPIE 11513, 15th International Workshop on Breast Imaging (IWBI2020), 115130D (22 May 2020).

*Studies included in primary meta-analysis
**Studies included in secondary meta-analysis

Appendix 7

Tabular presentation for Prediction model Risk Of Bias ASsessment Tool (PROBAST) results

| Study | Risk of bias: Participants | Risk of bias: Outcome | Risk of bias: Analysis | Applicability: Participants | Applicability: Outcome | Overall ROB |
|---|---|---|---|---|---|---|
| McKinney (2020) | L | J | J | ? | J | L |
| Kim (2020) | L | ? | J | L | ? | L |
| Rodriguez-Ruiz [1] (2019) | ? | J | J | L | J | ? |
| Rodriguez-Ruiz [2] (2019) | ? | J | L | L | J | L |
| Yala (2019) | J | J | L | J | J | L |
| Kyono [1] (2019) | ? | L | L | L | L | L |
| Schaffter (2020) | J | J | J | J | J | J |
| Kyono [2] (2018) | ? | L | L | L | L | L |
| Rodriguez-Ruiz [3] (2019) | L | J | L | L | J | L |
| Geras (2017) | L | ? | L | L | ? | L |
| Lotter (2019) | L | J | J | L | J | L |
| Dembrower (2020) | L | J | L | J | J | L |
| Salim (2020) | L | J | J | J | J | L |
| Balta (2020) | J | L | L | J | L | L |

Note.- J = Low Risk, L = High Risk, ? = Unclear Risk

Tabular presentation for Quality Assessment of Diagnostic Accuracy Studies 2 (QUADAS-2) results

| Study | Risk of bias: Patient selection | Risk of bias: Index test | Risk of bias: Reference standard | Risk of bias: Flow and timing | Applicability concerns: Patient selection | Applicability concerns: Index test | Applicability concerns: Reference standard |
|---|---|---|---|---|---|---|---|
| McKinney (2020) | L | J | J | J | ? | J | J |
| Kim (2020) | L | L | ? | ? | L | L | ? |
| Rodriguez-Ruiz [1] (2019) | ? | L | ? | J | L | L | J |
| Rodriguez-Ruiz [2] (2019) | ? | L | ? | J | L | L | J |
| Yala (2019) | L | J | J | J | J | J | J |
| Kyono [1] (2019) | L | J | L | L | L | J | L |
| Schaffter (2020) | J | J | J | J | J | J | J |
| Kyono [2] (2018) | L | J | L | L | L | J | L |
| Rodriguez-Ruiz [3] (2019) | L | L | ? | J | L | L | J |
| Geras (2017) | L | L | ? | ? | L | L | ? |
| Lotter (2019) | L | L | ? | J | L | L | J |
| Dembrower (2020) | L | L | J | J | J | J | J |
| Salim (2020) | L | J | J | J | J | J | J |
| Balta (2020) | J | L | L | ? | J | J | L |

Note.- J = Low Risk, L = High Risk, ? = Unclear Risk

Appendix 8

Meta-analysis results

Appendix 8 – Table 8.1 - Primary analysis – contingency table

| Study | Cases | Cancer | ML Sens | ML Spec | ML TP | ML FN | ML FP | ML TN | Reader Sens | Reader Spec | Reader TP | Reader FN | Reader FP | Reader TN |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Rodriguez-Ruiz [1] (2019) | 199 | 79 | 0.800 | 0.790 | 63 | 16 | 25 | 95 | 0.770 | 0.790 | 61 | 18 | 25 | 95 |
| Schaffter (2020) | 68 008 | 780 | 0.771 | 0.925 | 601 | 179 | 5042 | 62186 | 0.771 | 0.967 | 601 | 179 | 2219 | 65009 |
| Lotter (2019) | 285 | 131 | 0.820 | 0.909 | 107 | 24 | 14 | 140 | 0.820 | 0.669 | 107 | 24 | 51 | 103 |
| Salim (2020) | 113 663 | 739 | 0.819 | 0.966 | 605 | 134 | 3839 | 109085 | 0.774 | 0.966 | 572 | 167 | 3839 | 109085 |
| McKinney (2020) | 3 097 | 686 | 0.562 | 0.843 | 386 | 300 | 379 | 2032 | 0.481 | 0.808 | 330 | 356 | 463 | 1948 |

Note.- FN = False Negative, FP = False Positive, ML = Machine Learning, Sens = Sensitivity, Spec = Specificity, TP = True Positive, TN = True Negative

Appendix 8 – Table 8.2 - Secondary analysis – contingency table

| Study | Cases | Cancer | ML Sens | ML Spec | ML TP | ML FN | ML FP | ML TN | Reader Sens | Reader Spec | Reader TP | Reader FN | Reader FP | Reader TN |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Rodriguez-Ruiz [1] (2019) | 199 | 79 | 0.800 | 0.790 | 63 | 16 | 25 | 95 | 0.770 | 0.790 | 61 | 18 | 25 | 95 |
| Rodriguez-Ruiz [1] (2019) | 129 | 40 | 0.850 | 0.490 | 34 | 6 | 45 | 44 | 0.840 | 0.490 | 34 | 6 | 45 | 44 |
| Rodriguez-Ruiz [1] (2019) | 469 | 68 | 0.850 | 0.670 | 58 | 10 | 132 | 269 | 0.770 | 0.670 | 52 | 16 | 132 | 269 |
| Rodriguez-Ruiz [1] (2019) | 298 | 49 | 0.860 | 0.540 | 42 | 7 | 115 | 134 | 0.820 | 0.540 | 40 | 9 | 115 | 134 |
| Rodriguez-Ruiz [1] (2019) | 326 | 104 | 0.810 | 0.510 | 84 | 20 | 109 | 113 | 0.830 | 0.510 | 86 | 18 | 109 | 113 |
| Rodriguez-Ruiz [1] (2019) | 585 | 113 | 0.860 | 0.680 | 97 | 16 | 151 | 321 | 0.840 | 0.680 | 95 | 18 | 151 | 321 |
| Rodriguez-Ruiz [1] (2019) | 179 | 75 | 0.750 | 0.750 | 56 | 19 | 26 | 78 | 0.760 | 0.750 | 57 | 18 | 26 | 78 |
| Rodriguez-Ruiz [1] (2019) | 204 | 82 | 0.810 | 0.730 | 66 | 16 | 33 | 89 | 0.830 | 0.730 | 68 | 14 | 33 | 89 |
| Kim (2020) | 320 | 160 | 0.888 | 0.819 | 142 | 18 | 29 | 131 | 0.753 | 0.720 | 120 | 40 | 45 | 115 |
| Schaffter (2020) | 68 008 | 780 | 0.771 | 0.925 | 601 | 179 | 5042 | 62186 | 0.771 | 0.967 | 601 | 179 | 2219 | 65009 |
| Schaffter (2020) | 68 008 | 780 | 0.771 | 0.880 | 601 | 179 | 8067 | 59161 | 0.839 | 0.985 | 654 | 126 | 1008 | 66220 |
| Lotter (2019) | 285 | 131 | 0.962 | 0.669 | 126 | 5 | 51 | 103 | 0.820 | 0.669 | 107 | 24 | 51 | 103 |
| Lotter (2019) | 285 | 131 | 0.820 | 0.909 | 107 | 24 | 14 | 140 | | | | | | |
| Salim (2020) | 113 663 | 739 | 0.819 | 0.966 | 605 | 134 | 3839 | 109085 | 0.774 | 0.966 | 572 | 167 | 3839 | 109085 |
| Salim (2020) | 113 663 | 739 | 0.670 | 0.966 | 495 | 244 | 3839 | 109085 | 0.850 | 0.985 | 628 | 111 | 1694 | 111230 |
| Salim (2020) | 113 663 | 739 | 0.674 | 0.967 | 498 | 241 | 3726 | 109198 | | | | | | |
| McKinney (2020) | 3 097 | 686 | 0.562 | 0.843 | 386 | 300 | 379 | 2032 | 0.481 | 0.808 | 330 | 356 | 463 | 1948 |

Note.- FN = False Negative, FP = False Positive, ML = Machine Learning, Sens = Sensitivity, Spec = Specificity, TP = True Positive, TN = True Negative

Appendix 8 – Table 8.3 - Heterogeneity

| Study | N studies | Cases | Cancer | Heterogeneity: I² | Heterogeneity: Cochran's Q – p value | Sens (95% CI) | Spec (95% CI) | AUROC (95% CI) |
|---|---|---|---|---|---|---|---|---|
| Primary - Algorithm | 5 | 185 252 | 2 415 | 0.000% | 0.621 | 0.754 (0.656-0.832) | 0.906 (0.829-0.950) | 0.892 (0.838-0.982) |
| Primary - Reader | 5 | 185 252 | 2 415 | 0.000% | 0.609 | 0.730 (0.607-0.826) | 0.886 (0.724-0.958) | 0.849 (0.779-0.971) |
| Secondary – Algorithm | 17 | 185 572 | 2 575 | 0.625% | 0.446 | 0.804 (0.755-0.846) | 0.821 (0.727-0.888) | 0.864 (0.841-0.901) |
| Secondary - Reader | 15 | 185 572 | 2 575 | 0.000% | 0.783 | 0.785 (0.738-0.825) | 0.826 (0.692-0.909) | 0.836 (0.814-0.876) |

Note.- AUROC = Area Under the receiver operating characteristic curve, N = Number, Sens = Sensitivity, Spec = Specificity

Appendix 9 – z-test results

A z-test was applied to the pooled AUROC results to compare the ML algorithms with the readers in both the primary and secondary meta-analyses, with a p-value <.05 indicating a statistically significant result.

Primary meta-analysis, pooled AUROC of ML algorithm compared to pooled AUROC of readers: p-value = .53
Secondary meta-analysis, pooled AUROC of ML algorithm compared to pooled AUROC of readers: p-value = .84

Appendix 10

Forest plots

Supplemental Figure 1 - Primary analysis – Forest plot

Supplemental Figure 2 - Secondary analysis – Forest plot

Appendix 11

Funnel plot

Supplemental Figure 3 - Primary analysis – Funnel plots

Each algorithm and reader study result that is included in the primary meta-analysis is represented by a diamond shape. The log of the diagnostic odds ratio (DOR) is plotted against standard error, with a vertical line for the median and dashed lines for the 95% confidence intervals. For the primary analysis there is an insufficient number of studies to assess for funnel asymmetry.

Supplemental Figure 4 - Secondary analysis – Funnel plots

Each algorithm and reader study result that is included in the secondary meta-analysis is represented by a diamond shape. The log of the diagnostic odds ratio (DOR) is plotted against standard error, with a vertical line for the median and dashed lines for the 95% confidence intervals. Visual assessment of the secondary analysis funnel plots did not show asymmetry and thus does not suggest publication bias.

Appendix 12

All private DICOM tags were removed as part of the anonymisation process. The table below details the de-identification process for each DICOM tag.
| Tag | Attribute | Action |
|---|---|---|
| (0002,0000) | FileMetaInformationGroupLength | Keep |
| (0002,0001) | FileMetaInformationVersion | Keep |
| (0002,0002) | MediaStorageSOPClassUID | Keep |
| (0002,0003) | MediaStorageSOPInstanceUID | Hash |
| (0002,0010) | TransferSyntaxUID | Keep |
| (0002,0012) | ImplementationClassUID | Keep |
| (0002,0013) | ImplementationVersionName | MATLAB |
| (0002,0016) | SourceApplicationEntityTitle | Keep |
| (0008,0005) | SpecificCharacterSet | Keep |
| (0008,0008) | ImageType | Keep |
| (0008,0016) | SOPClassUID | Keep |
| (0008,0018) | SOPInstanceUID | Hash |
| (0008,0020) | StudyDate | 01/MM/YYYY |
| (0008,0023) | ContentDate | Blank |
| (0008,0030) | StudyTime | Blank |
| (0008,0033) | ContentTime | Blank |
| (0008,0050) | AccessionNumber | Exam ID |
| (0008,0060) | Modality | Keep |
| (0008,0068) | PresentationIntentType | Keep |
| (0008,0070) | Manufacturer | Keep |
| (0008,0080) | InstitutionName | Blank |
| (0008,0090) | ReferringPhysicianName | Blank |
| (0008,1030) | StudyDescription | Keep |
| (0008,1032) | ProcedureCodeSequence | Blank |
| (0008,0100) | CodeValue | Blank |
| (0008,0102) | CodingSchemeDesignator | Keep |
| (0008,0103) | CodingSchemeVersion | Keep |
| (0008,0104) | CodeMeaning | Keep |
| (0008,103E) | SeriesDescription | Keep |
| (0008,1090) | ManufacturerModelName | Keep |
| (0008,2218) | AnatomicRegionSequence | Blank |
| (0008,0100) | CodeValue | Keep |
| (0008,0102) | CodingSchemeDesignator | Keep |
| (0008,0104) | CodeMeaning | Keep |
| (0010,0010) | PatientName | Trial ID |
| (0010,0020) | PatientID | Trial ID |
| (0010,0030) | PatientBirthDate | 01/01/YYYY |
| (0010,0040) | PatientSex | Blank |
| (0010,1010) | PatientAge | Keep |
| (0012,0062) | PatientIdentityRemoved | Yes |
| (0018,0015) | BodyPartExamined | Keep |
| (0018,0060) | KVP | Keep |
| (0018,1020) | SoftwareVersions | Keep |
| (0018,1110) | DistanceSourceToDetector | Keep |
| (0018,1111) | DistanceSourceToPatient | Keep |
| (0018,1114) | EstimatedRadiographicMagnificationFactor | Keep |
| (0018,1130) | TableHeight | Keep |
| (0018,1150) | ExposureTime | Keep |
| (0018,1151) | XRayTubeCurrent | Keep |
| (0018,1152) | Exposure | Keep |
| (0018,1153) | ExposureInuAs | Keep |
| (0018,1164) | ImagerPixelSpacing | Keep |
| (0018,1191) | AnodeTargetMaterial | Keep |
| (0018,11A0) | BodyPartThickness | Keep |
| (0018,11A2) | CompressionForce | Keep |
| (0018,1405) | RelativeXRayExposure | Keep |
| (0018,1508) | PositionerType | Keep |
| (0018,1510) | PositionerPrimaryAngle | Keep |
| (0018,5101) | ViewPosition | Keep |
| (0018,7004) | DetectorType | Keep |
| (0018,7005) | DetectorConfiguration | Keep |
| (0018,700C) | DateOfLastDetectorCalibration | Blank |
| (0018,700E) | TimeOfLastDetectorCalibration | Blank |
| (0018,7020) | DetectorElementPhysicalSize | Keep |
| (0018,7022) | DetectorElementSpacing | Keep |
| (0018,7050) | FilterMaterial | Keep |
| (0020,000D) | StudyInstanceUID | Hash |
| (0020,000E) | SeriesInstanceUID | Hash |
| (0020,0010) | StudyID | Exam ID |
| (0020,0011) | SeriesNumber | Keep |
| (0020,0013) | InstanceNumber | Keep |
| (0020,0020) | PatientOrientation | Keep |
| (0020,0052) | FrameOfReferenceUID | Keep |
| (0020,0062) | ImageLaterality | Keep |
| (0020,1040) | PositionReferenceIndicator | Blank |
| (0028,0002) | SamplesPerPixel | Keep |
| (0028,0004) | PhotometricInterpretation | Keep |
| (0028,0006) | PlanarConfiguration | Keep |
| (0028,0010) | Rows | Keep |
| (0028,0011) | Columns | Keep |
| (0028,0100) | BitsAllocated | Keep |
| (0028,0101) | BitsStored | Keep |
| (0028,0102) | HighBit | Keep |
| (0028,0103) | PixelRepresentation | Keep |
| (0028,0106) | SmallestImagePixelValue | Keep |
| (0028,0107) | LargestImagePixelValue | Keep |
| (0028,0301) | BurnedInAnnotation | Keep |
| (0028,1040) | PixelIntensityRelationship | Keep |
| (0028,1041) | PixelIntensityRelationshipSign | Keep |
| (0028,1052) | RescaleIntercept | Keep |
| (0028,1053) | RescaleSlope | Keep |
| (0028,1054) | RescaleType | Keep |
| (0028,1300) | BreastImplantPresent | Keep |
| (0028,1350) | PartialView | Keep |
| (0028,2110) | LossyImageCompression | Keep |
| (0040,0316) | OrganDose | Keep |
| (0040,0318) | OrganExposed | Keep |
| (0040,8302) | EntranceDoseInmGy | Keep |
| (0054,0220) | ViewCodeSequence | Blank |
| (0008,0100) | CodeValue | Keep |
| (0008,0102) | CodingSchemeDesignator | Keep |
| (0008,0104) | CodeMeaning | Keep |
| (0054,0222) | ViewModifierCodeSequence | Blank |
| (2050,0020) | PresentationLUTShape | Keep |
| (7FE0,0010) | PixelData | Blank |
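
The per-tag actions tabulated above (Keep, Hash, Blank, or replacement with a Trial ID, Exam ID or truncated date) can be illustrated with a minimal Python sketch. It is not the pipeline used for this work. The names `RULES`, `hash_value` and `deidentify`, the use of a plain dictionary in place of a real DICOM header, the salted SHA-256 hash, and the assumption that dates arrive as `YYYYMMDD` strings are all hypothetical choices made for illustration, and only a small subset of the tabulated attributes is listed.

```python
import hashlib

# Action per attribute, following the third column of the table above.
# Illustrative subset only; the action names are hypothetical labels.
RULES = {
    "SOPInstanceUID": "hash",
    "StudyInstanceUID": "hash",
    "StudyDate": "first_of_month",      # table entry: 01/MM/YYYY
    "StudyTime": "blank",
    "AccessionNumber": "exam_id",
    "Manufacturer": "keep",
    "InstitutionName": "blank",
    "PatientName": "trial_id",
    "PatientID": "trial_id",
    "PatientBirthDate": "first_of_year",  # table entry: 01/01/YYYY
    "PatientAge": "keep",
    "ViewPosition": "keep",
}


def hash_value(value: str, salt: str) -> str:
    """One-way hash of an identifier (salted SHA-256 is an assumption)."""
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()


def deidentify(header: dict, trial_id: str, exam_id: str, salt: str) -> dict:
    """Return a new header with each attribute kept, hashed, blanked or replaced.

    Attributes without a rule (for example private tags) are dropped.
    Dates are assumed to be strings of the form YYYYMMDD.
    """
    out = {}
    for name, value in header.items():
        action = RULES.get(name)
        if action is None:
            continue  # not listed in the table: remove
        if action == "keep":
            out[name] = value
        elif action == "blank":
            out[name] = ""
        elif action == "hash":
            out[name] = hash_value(value, salt)
        elif action == "trial_id":
            out[name] = trial_id
        elif action == "exam_id":
            out[name] = exam_id
        elif action == "first_of_month":
            out[name] = value[:6] + "01"    # keep year and month, set day to 01
        elif action == "first_of_year":
            out[name] = value[:4] + "0101"  # keep year only
    out["PatientIdentityRemoved"] = "YES"   # table entry: Yes
    return out


# Made-up header values, for demonstration only.
example = {
    "SOPInstanceUID": "1.2.3.4.5",
    "StudyDate": "20170315",
    "PatientName": "DOE^JANE",
    "PatientBirthDate": "19600721",
    "InstitutionName": "Example Hospital",
    "Manufacturer": "ExampleVendor",
    "PrivateVendorField": "internal",
}
clean = deidentify(example, trial_id="T0001", exam_id="E0001", salt="demo-salt")
```

In this sketch the hashed identifiers are hexadecimal strings, which are not valid DICOM UIDs; a production tool would need to map the hash back into UID form. A dedicated DICOM library would also be needed to handle nested sequence items such as the CodeValue entries listed under each sequence.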