Accepted Manuscript

Title: Critical Care Health Informatics Collaborative (CCHIC): data, tools and methods for reproducible research: a multi-centre UK intensive care database

Authors: Steve Harris, Sinan Shi, David Brealey, Niall S. MacCallum, Spiros Denaxas, David Perez-Suarez, Ari Ercole, Peter Watkinson, Andrew Jones, Simon Ashworth, Richard Beale, Duncan Young, Stephen Brett, Mervyn Singer

PII: S1386-5056(18)30007-8
DOI: https://doi.org/10.1016/j.ijmedinf.2018.01.006
Reference: IJB 3637
To appear in: International Journal of Medical Informatics
Received date: 2-10-2017
Revised date: 6-12-2017
Accepted date: 8-1-2018

This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.
Version 2018-01-10t13:23:28

1 Title page

1.1 Working title

Critical Care Health Informatics Collaborative (CCHIC): data, tools and methods for reproducible research: a multi-centre UK intensive care database

1.2 Author list

• Steve Harris (a, h) *
• Sinan Shi (b) *
• David Brealey (a, h)
• Niall S MacCallum (a, h)
• Spiros Denaxas (g)
• David Perez-Suarez (b)
• Ari Ercole (c)
• Peter Watkinson (d)
• Andrew Jones (e)
• Simon Ashworth (f)
• Richard Beale (e, i)
• Duncan Young (d)
• Stephen Brett (f)
• Mervyn Singer (a)

* Joint first authorship (based on contribution to research and manuscript preparation)

1.3 Affiliations

a. Bloomsbury Institute of Intensive Care Medicine, University College Hospital, London, UK
b. Research Software Engineering, University College London, London, United Kingdom
c. Division of Anaesthesia, Department of Medicine, Cambridge University, UK
d. Critical Care Research Group (Kadoorie Centre), Nuffield Department of Clinical Neurosciences, Medical Sciences Division, Oxford University
e. Critical Care, Guy’s and St. Thomas’ NHS Foundation Trust, London, UK
f. Critical Care, St. Mary’s Hospital, Imperial College Healthcare NHS Trust, London, UK; Critical Care, Hammersmith Hospital, Imperial College Healthcare NHS Trust, London, UK
g. Institute of Health Informatics, University College London, Gower Street, London, WC1E 6BT, United Kingdom
h. Critical Care, University College London Hospitals NHS Foundation Trust, London, UK
i. Division of Asthma, Allergy and Lung Biology, King’s College, London, UK

[Graphical abstract]

2 Structured abstract

Objective

1. To build and curate a linkable multi-centre database of high resolution longitudinal electronic health records (EHR) from adult Intensive Care Units (ICU)
2. To develop a set of open-source tools to make these data ‘research ready’ while protecting patients’ privacy, with a particular focus on anonymisation

Materials and Methods

We developed a scalable EHR processing pipeline for extracting, linking, normalising, curating and anonymising EHR data. Patient and public involvement was sought from the outset, and approval to hold these data was granted by the NHS Health Research Authority’s Confidentiality Advisory Group (CAG). The data are held in a certified Data Safe Haven. We followed sustainable software development principles throughout, and defined and populated a common data model that links to other clinical areas.

Results

Longitudinal EHR data were loaded into the CCHIC database from eleven adult ICUs at five UK teaching hospitals. From January 2014 to January 2017, this amounted to 21,930 admissions (18,074 unique patients). Typical admissions have 70 data items pertaining to admission and discharge, and a median of 1030 (IQR 481 to 2335) time-varying measures. Training datasets were made available through virtual machine images emulating the data processing environment. An open source R package, cleanEHR, was developed and released that transforms the data into a square table readily analysable by most statistical packages. A simple, language-agnostic configuration file allows the user to select and clean variables, and impute missing data. An audit trail makes clear the provenance of the data at all times.

Discussion

Making health care data available for research is problematic. CCHIC is a unique multi-centre longitudinal and linkable resource that prioritises patient privacy through the highest standards of data security, but also provides tools to clean, organise, and anonymise the data. We believe the development of such tools is essential if we are to meet the twin requirements of respecting patient privacy and working for patient benefit.
Conclusion

The CCHIC database is now in use by health care researchers from academia and industry. The ‘research ready’ suite of data preparation tools has facilitated access, and linkage to national databases of secondary care is underway.

Keywords

electronic health records; database; clinical decision support; critical care; reproducibility

3 Introduction

Empirical observation, or measurement, was the foundation of the Scientific Revolution, but was historically expensive. [1] Digitalisation and the computer age have changed this, and the electronic health record (EHR) is health care’s version of ‘big data’. Critical care will inevitably be at the forefront of the big data revolution because there is no other environment where patients are monitored more closely, or with such a broad range of measures.

However, making such data available for research is problematic for three reasons. Firstly, health data is sensitive, and the protection of patient privacy must trump all other issues. Secondly, such data is frequently unusable in its raw format. The pace of research must not be mired by the need to repeatedly prepare and clean the data. Thirdly, the data should not exist in isolation. A critical care admission is just one part of an illness pathway. There are antecedents and consequences, and those consequences will impact the patient, their family, and the health service.

Underlying these issues, there is also the thornier problem of data ownership. If the default position is that organisations are temporary guardians of personal data, then there is an expectation that the data should be used in the best interests of patients.

In response to this we have developed the Critical Care Health Informatics Collaborative (CCHIC), a partnership between the UK’s National Institute for Health Research (NIHR) and five leading NHS hospital trusts.
CCHIC attempts to deliver critical care ‘big data’ to researchers, thereby facilitating research for patient benefit. Demographic, diagnostic, physiological and treatment data are abstracted from critical care admission to discharge, creating a high-resolution, longitudinal EHR of unprecedented depth and breadth.

Uniquely, the resource is designed to be explicitly linkable. This means that other clinical specialties can understand the disease process in their most vulnerable and unwell patients. It means that we can begin to share with patients and families a true picture of survivorship following critical care. We can report on long term outcomes, subsequent disease profiles, and use of health resources. We can in theory understand whether people return to work, and the impact of the illness on the wider family.

CCHIC has a specific focus on open-access, reproducible research that is done with patient and public involvement from the outset. Making the data research ready yet robustly anonymised for as wide a community of academic and clinical collaborators as possible fulfils our ethical responsibility to the patients who provide these data.

In this paper we describe the database, the pipeline (extracting, cleaning, curating, and distributing), and the tools built to enable reproducible research.

3.1 Objectives

The objectives of our research were threefold:

1. To build and curate a linkable multi-centre database of high-resolution, longitudinal and multi-modal EHR data from adult Intensive Care Units (ICU)
2. To create a scalable pipeline (‘Extract Transform Load’, ETL) for extracting, linking, cleaning, encoding and anonymising ICU data across multiple secondary healthcare providers
3. To develop a set of open source tools and methods for undertaking reproducible research using the database

4 Materials and Methods

In 2014, CCHIC started to recruit consecutive admissions to the general adult medical and surgical critical care units at the five founding National Institute for Health Research (NIHR) Biomedical Research Centres (BRCs) at Cambridge, Guy’s, King’s and St Thomas’, Imperial, Oxford and University College London (UCL). The current dataset (version 1.0) includes 264 fields comprising 108 hospital, unit, patient and episode descriptors (recorded once per admission), and 154 time-varying physiology and therapeutic fields (recorded hourly, daily etc.).* Data are currently exported on a quarterly basis with the ambition to move to near real-time collection.

| Biomedical Research Centre | Hospital | Unit |
| Cambridge | Addenbrooke’s Hospital | ICU/HDU |
| Cambridge | Addenbrooke’s Hospital | Neuro |
| GSTT | Guy’s Hospital | ICU |
| GSTT | St Thomas’ Hospital | ICU/HDU |
| GSTT | St Thomas’ Hospital | OIR |
| GSTT | St Thomas’ Hospital | HDU |
| Imperial | Hammersmith Hospital | ICU/HDU |
| Imperial | St Mary’s Hospital London | ICU |
| Oxford | John Radcliffe | ICU |
| UCLH | University College Hospital | ICU/HDU |
| UCLH | Westmoreland Street | ICU/HDU |

Table 1: Participating hospitals and critical care units (ICU: Intensive Care Unit, HDU: High Dependency Unit, OIR: Overnight Intensive Recovery)

4.1 Regulatory Approval

To be of benefit to researchers the database must allow access to data that is reflective of the entire critical care cohort for their full critical illness. A direct consent model would face two challenges: the practicability of consenting thousands of patients per year and, more importantly, the lack of capacity to consent for many critically ill patients. This lack of capacity is due to the severity of the illness, the use of sedation during mechanical ventilation, or a high (circa 15%) early mortality rate. A consent based model would under-represent the most unwell patients.
The project therefore approached the NHS Health Research Authority’s Confidentiality Advisory Group (CAG) who provided a legal basis for data sharing for essential medical research, and granted an exemption to the common law duty of confidentiality for the project under Section 251 of the NHS Act 2006 (14/CAG/1001). A favourable opinion was provided by the National Research Ethics Service (14/LO/103). Data sharing agreements were signed between the participating NHS Trusts and UCL, which hosts the Data Safe Haven (DSH) where the data are stored. The DSH is certified to the ISO/IEC 27001:2013 information security standard and conforms to NHS Digital’s Information Governance Toolkit. [2] All patients are provided with information regarding the project and an option by which to opt out. Public and patient involvement is actively sought through notifications at each participating unit, and other media.†

* The data set is available via http://www.hdf.nihr.ac.uk/catalogue/#/catalogue/dataModel/13
† Videos explaining the programme are available on the internet (https://www.youtube.com/watch?v=NjE9VQo-nP4&t=11s, and https://www.youtube.com/watch?v=aQJmV6i58H4)

4.2 CCHIC design principles

The design of CCHIC has been based on the following principles:

1. to protect the privacy of the patients
2. to support research for patient benefit (specifically excluding commercial exploitation)
3. to facilitate that research by building a scalable pipeline for extracting, processing, and sharing the data

4.2.1 Principle 1: patient privacy

Being able to protect patients’ privacy with confidence is the first and foremost consideration for this data resource.
Extensive patient and public engagement work has been performed to ensure that this resource is seen as a public good by a broad cross-section of constituents. The particular problem with critical care research is that the patients themselves are either temporarily or permanently incapacitated and therefore unable to offer explicit permission. In the UK, this triggers the need for an application to the Secretary of State for Health to hold these data without consent (as per Section 251 of the NHS Act 2006). Permission is only granted when the physical security of the data can be guaranteed, and when the justification for holding the data is in the public interest (hence principle 2). The data itself is encrypted before leaving each hospital, and then moved to the data safe haven at University College London. Access to the identifiable data is strictly controlled, but an anonymisation step in the data pipeline makes an extract of the data ready for the end-researcher (principle 3).

4.2.2 Principle 2: research for patient benefit

Even after privacy is protected, there is a widely reported distinction in the public perception of rights to use data. The furore in 2016 over the partnership between the Royal Free NHS Foundation Trust and Google DeepMind was driven by suspicion of the motives of commercial organisations, especially those with the pervasive reach of Google.[3] In the DeepMind case, the purported use of the data was simply to develop an alerting system for patients with acute kidney injury (AKI). However, calculating the AKI class from a laboratory creatinine is so simple that it is hard to believe this was Google’s end game. In fact the Information Sharing Agreement that was signed in 2015 placed no restrictions on the data to be analysed, or the technologies that might be used.
[4] For CCHIC, in contrast, the data cannot be used for profit, the research question must be explicitly for patient benefit, and even anonymised data releases must be proportional to the researcher’s need.

4.2.3 Principle 3: research ready

Principle (1) protects the patient, and Principle (2) justifies the risks, however small, of making health care data available. Principle (3) enables the researcher to deliver on the promise of their research. Most data analysis requires a huge amount of preparation. We therefore developed an automated data processing pipeline to process, curate, and make available the data.

Figure 1: Data processing pipeline: Data moves from the hospital EHR to the UCL data safe haven as an XML file, and is validated before appending to the central database. A data quality report is then returned to the submitting site. Preliminary cleaning removes out of range and invalid entries. The database can then be queried in its identifiable form by authorised users within the safe haven, or a separate anonymiser can produce extracts for external collaborators.

4.2.3.1 Data specification

We developed an XML-based format for individual ICUs to store and transmit the extracted EHR data. The common data model was developed in collaboration with clinicians, clinical information systems architects and researchers. A description of the XML data model used is provided via the NIHR’s Health Data Finder.[5]*

We extract EHR data from each ICU using a combination of manual, semi-automatic or entirely automatic methods adapted to local ICU clinical information systems. Currently, this includes systems from Philips Healthcare and Epic Systems, but there is no barrier to extraction from other EHR providers.
Data items are extracted as frequently as they were reported (typically hourly) from ICU admission to discharge.† This includes bedside physiology, near patient testing, laboratory testing, and drug administration. In addition, diagnostic coding, patient co-morbidities, admission and discharge pathways, demographics and other information typically used for risk adjustment are extracted on a per admission basis.

Uniquely, patient identifiers (NHS number, name, and date of birth) are retained with the record to enable linkage to other health and social care resources. This includes but is not limited to data curated by NHS Digital (e.g. Hospital Episode Statistics, and mortality data from the Office for National Statistics), primary care, and clinical trial data sets.

* Of note, this XML schema is common to other clinical schemes under the umbrella Health Informatics Collaborative programme including acute coronary syndromes, ovarian cancer, renal transplant and viral hepatitis (http://www.hdf.nihr.ac.uk/catalogue/#/catalogue/dataModel/13).
† Some data items such as waveform data are often recorded at microsecond intervals, but are only reported to the local EHR solution at hourly or similar intervals.

4.2.4 Data quality

Our approach to data quality is based on the philosophy of reproducing accurately the local EHR rather than curating data for audit, benchmarking or quality control. For example, aberrant invasive blood pressure readings of 300 mmHg occur when the transducer system is flushed, and exposed to the attached pressure bag instead of the patient. For benchmarking, it is important to identify and exclude these values before using them to adjust for patient outcomes. However, it is exactly this sort of artefact that must be handled by the designer of a clinical monitoring system. Such use cases are very much part of the justification for CCHIC.
Similarly, some projects will automatically impute missing data or discard incomplete records whereas others use the pattern of missingness for clinical diagnostics. [6] Hence data extracts were accepted if the provenance (submitting unit, file name and timestamp) and the indexing information (critical care unit, episode identifier, data item label and timestamp) were complete.

A data quality report summarised the completeness of each time-invariant field, and the sampling frequency of the time-varying fields. Field level characteristics of new data ingests were compared to existing data within and across institutions in order to identify failure of local extraction procedures to accurately capture the local EHR. Fresh extracts were requested where reporting did not meet the schema standards (e.g. reporting PaO2 in mmHg rather than kPa), or where entire fields were missing because of a problem with local exporting.

4.2.5 Data anonymisation

Researchers may apply to work with the primary identifiable data where necessary. However, limiting this access is clearly desirable with respect to data security. Moreover, working directly within the data safe haven (DSH) means the data storage environment also becomes the development environment. The pace of change of modern machine learning, statistical, and software tools means that the development environment needs continuous updating. This is a burden, and a security risk. Each update requires an external ingest of code, and as the number of researchers grows then so will the number of tools, and the risk of external exposure. We therefore minimise this risk by undertaking to make available anonymised data extracts to approved researchers.
Here we follow guidance from the Information Commissioner’s Office (ICO) [7] which is in turn based on the UK Data Protection Act (DPA) 1998 and Recital 26 of the European Data Protection Directive (95/46/EC).* The key principle is that “information or a combination of information, that does not relate to and identify an individual, is not personal data”. [7] Moreover, there is clear legal authority for the view that where an organisation converts personal data into an anonymised form and discloses it, this will not amount to a disclosure of personal data.

* On 25 May 2018, this will be superseded by the General Data Protection Regulation (GDPR) (Regulation (EU) 2016/679).

The anonymisation focussed on three areas:

1. Minimising the likelihood of re-identification
2. Minimising incentives for re-identification
3. Maximising the quality of data post-anonymisation

4.2.5.1 Minimising the likelihood of re-identification

We first delete all direct identifiers (e.g. NHS numbers, which uniquely identify an individual). However, other key variables can be combined by a motivated intruder, particularly one with access to external data sources, to re-identify individuals by the intersection of specific rare values. K-anonymity counts the number of individuals identified at this intersection, and we set k so that this smallest group still provides anonymity for its members.† In practice, we use a heuristic algorithm within the sdcMicro R package [8] developed by the International Household Survey Network to suppress quasi-identifiers from the dataset until the target k-anonymity is reached.[9] Quasi-identifiers are aggregated to reduce their granularity before the k-anonymity suppression.‡ Additionally, for the public release, the remaining quasi-identifiers are perturbed with noise.

† For example, if we release individual data describing ‘species’, and ‘favourite sandwich filling’, then the intersection of ‘bears’ and ‘marmalade’ would uniquely identify Paddington Bear. If we generalise ‘favourite sandwich filling’ to ‘prefers sweet sandwiches’ then, because Pooh Bear likes honey and Paddington likes marmalade, the k-anonymity would rise to two.
‡ For example, if the two most elderly patients were 101 and 109 years old, there is a risk of re-identification. These extreme values might be replaced (perhaps with the local median of 105 years). K-anonymity could then be (re)evaluated, and is likely to increase.

4.2.5.2 Minimising incentives for re-identification

While a cliche, there is anonymity in obscurity. For this reason, records of publicly prominent individuals* are removed prior to a data release (just as individual opt-outs are removed prior to data storage). However, because of the sensitivity of medical data, this risk remains to others. In addition, we prospectively identify sensitive data items such as those recording (alcoholic) cirrhosis, or HIV status. Within each group, these are suppressed if homogeneous, or released if heterogeneous. In this way the disease status of the members of even the smallest (k) group remains uncertain.†

* The team managing the MIMIC-III database at MIT report that there were several attempts at identifying the victims of the Boston marathon bombing in 2013. Although their database is open source, they have removed these individuals from the publicly released version.
† This is known as l-diversity and guarantees that even if an individual can be identified as belonging to a small group (cell) there is sufficient variability of these sensitive items within that group that uncertainty remains as to a specific individual’s status.

4.2.5.3 Maximising the quality of data post-anonymisation

There is a trade off between information loss and disclosure risk: as the risk of disclosure decreases then so does the utility of the data. To define this we need to measure the information content, and quantify the disclosure risk. For non-identifying variables (e.g. heart rate), there is no information loss. For key variables and sensitive fields, a balance must be reached. For example, a project examining the weekend effect on critical care outcomes might have to sacrifice granularity in other key variables (e.g. age) in order to extract the data. Such a compromise is not normally an impediment. Where information loss is not acceptable, then the research team will have to go through a vetting process to work with the original data, and be prepared to work with the more limited set of tools available in the Data Safe Haven.

Figure 2: Data anonymisation: algorithm implementation: A summary of the anonymisation process applied before any data release.

1. Removal of direct identifiers: All unique identifiers including NHS number and hospital number will be removed from the data before release.
2. Remove high risk individuals and specific opt-outs
3. Date and time metadata: All timestamps are converted to date and time differences from the instant of critical care admission.
4. Aggregate continuous and date-time key variables: Because we cannot group patients by a continuous measure, the concept of k-anonymity only applies to categorical variables (e.g. we can group patients by eye colour, but not hair length). Where a key variable is continuous then we will run an initial conversion to a categorical version by aggregating. The unit of aggregation will be the natural unit of the measurement (e.g. years for age, or kilograms for weight), and the initial aggregation will be some multiple of that unit (e.g. 2 years, and 5 kg respectively). These multiples will be initially small in order to minimise information loss, but will be increased during the iterative specific anonymisation step until the necessary k-anonymity and l-diversity are reached.
5. Remove living subjects (where possible): The Data Protection Act only applies to living individuals so where possible data will only be released for non-survivors.

4.2.5.4 Data anonymisation: tiered data access model

The algorithm above provides a mechanistic level of security that is supplemented by additional administrative safeguards. For example, in contrast to a member of the general public, a medically qualified researcher is expected to follow a code of professional ethics with associated sanctions for breach of this code. Releases to the general public are more strictly anonymised than releases to medical researchers. We have two standard tiers of data release based on the likelihood of re-identification being attempted: general public, or quasi-public.

The general public extract is a small subset of the original dataset, where direct identifiers are removed, and quasi-identifiable variables are heavily aggregated and perturbed. It thus has the lowest disclosure risk but also the lowest data usability. Although the physiology fields are unaltered, the analysis results cannot be directly used for publication. The purpose of this dataset is for users to familiarise themselves with the data structure and to develop hypotheses that could be tested on the full data. To gain access to this dataset, researchers must sign a data sharing agreement, identifying themselves and their institution, confirming that they will only be using the data for clinical research (in line with our research ethics permissions), and undertaking to be respectful of the data (specifically not to pass it on, nor to attempt to re-identify individuals).

A quasi-public data extract is distributed to researchers who have submitted a data request that has been vetted by the CCHIC governance structure. Researchers are recommended to request the minimum set of fields necessary for their planned analysis.
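One way to see why a smaller field request is easier to release is to compute k-anonymity directly: every additional key variable splits patients into smaller groups. The sketch below is illustrative only. It uses made-up records, hypothetical function names (`k_anonymity`, `band`, `widen_until`), and an assumed rule of doubling the aggregation width each round; it is not the sdcMicro suppression heuristic used in the CCHIC pipeline.

```python
from collections import Counter


def k_anonymity(records, keys):
    """Size of the smallest group of records sharing the same key values."""
    groups = Counter(tuple(r[k] for k in keys) for r in records)
    return min(groups.values())


def band(value, width):
    """Aggregate a number into a band labelled by the band's lower bound."""
    return (value // width) * width


def widen_until(records, keys, target_k, age_width=2, weight_width=5, max_rounds=10):
    """Coarsen age and weight until the target k is met.

    Starts from small multiples of the natural unit (2 years, 5 kg) and
    doubles both widths each round. Doubling is an assumption of this sketch.
    """
    for _ in range(max_rounds):
        coarse = [
            {**r,
             "age": band(r["age"], age_width),
             "weight_kg": band(r["weight_kg"], weight_width)}
            for r in records
        ]
        if k_anonymity(coarse, keys) >= target_k:
            return coarse, age_width, weight_width
        age_width *= 2
        weight_width *= 2
    raise ValueError("target k-anonymity not reached")


# Hypothetical toy records; these are not CCHIC data.
records = [
    {"age": 34, "sex": "F", "weight_kg": 61},
    {"age": 35, "sex": "F", "weight_kg": 64},
    {"age": 71, "sex": "M", "weight_kg": 80},
    {"age": 73, "sex": "M", "weight_kg": 83},
    {"age": 78, "sex": "M", "weight_kg": 91},
    {"age": 79, "sex": "M", "weight_kg": 94},
]

print(k_anonymity(records, ["sex"]))         # one key variable: smallest group has 2 members
print(k_anonymity(records, ["age", "sex"]))  # adding exact age leaves groups of 1

coarse, age_w, weight_w = widen_until(records, ["age", "sex"], target_k=2)
print(age_w, weight_w, k_anonymity(coarse, ["age", "sex"]))
```

In this toy example, requesting sex alone already gives groups of at least two, whereas adding exact age makes every patient unique until age is coarsened considerably. The same trade-off between fields requested and resolution retained applies to a real extract.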
The data may be suitable for a complete analysis but this will depend on the balance between the fields requested, and resolution required. Where this balance cannot be achieved with a public release, then the analysis may initially proceed using the anonymised data. The analysis script is then tested on a virtual machine that simulates the development environment inside the data safe haven. Finally, the tested script is deployed within the safe haven, and the outputs are released to the investigator after inspection to ensure that these too pose no re-identification risk.

4.2.5.5 Research ready: the cleanEHR toolkit

As described above, the data that is released is a ‘warts and all’ version of the electronic health care record integrated across the sites. Although being faithful to the original record is a design principle, it leaves most researchers with the huge task of cleaning the data. We therefore provide alongside the data a set of tools covering the most common data pre-processing and post-processing operations. These are provided as an open source package, cleanEHR, for the R statistical programming language.

The most important of these is a function that converts the various asynchronous lists of time-dependent measurements into a table of measurements with a customisable cadence. For example, if the researcher wishes to sample the data every hour then a skeleton table is built with one row per critical care admission per hour from the time of admission to the time of discharge. For time-invariant data, the data items are repeated across all rows. For time-varying items, a value is inserted if a value has been recorded in that hour.* The end result is a data frame that is ready for analysis in applications from Microsoft Excel to SPSS, from R to Python.

A second function is used to stitch together separate but sequential critical care admissions into a unified illness spell.
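The fixed-cadence conversion described above can be sketched in a few lines. This is a minimal illustration in Python rather than the cleanEHR implementation: the function name `to_hourly_table`, the input layout, and the row-count convention (one row per started cadence period, including the one containing discharge) are all assumptions made for the example. Times are expressed as hours since admission.

```python
import math


def to_hourly_table(episode, cadence_hours=1.0):
    """Convert one admission into a fixed-cadence table (list of row dicts).

    episode: {"static": {...time-invariant items...},
              "los_hours": length of stay in hours,
              "series": {item: [(hours_since_admission, value), ...]}}
    """
    n_rows = math.floor(episode["los_hours"] / cadence_hours) + 1

    # Skeleton: one row per period, time-invariant items repeated on every row.
    rows = [{"time": i * cadence_hours, **episode["static"]} for i in range(n_rows)]

    for item, readings in episode["series"].items():
        for row in rows:
            row[item] = None  # missing unless a reading falls in this period
        # Process in time order so the most recent reading in a period wins.
        for t, value in sorted(readings, key=lambda r: r[0]):
            idx = int(t // cadence_hours)
            if 0 <= idx < n_rows:
                rows[idx][item] = value
    return rows


# Hypothetical admission; these are not CCHIC data.
episode = {
    "static": {"episode_id": "E1", "sex": "F"},
    "los_hours": 3.5,
    "series": {
        "heart_rate": [(0.2, 88), (0.8, 92), (2.1, 101)],
        "lactate": [(1.5, 2.4)],
    },
}

for row in to_hourly_table(episode):
    print(row)
```

Here the two heart rate readings in the first hour collapse to the later one (92), hours without a reading stay empty, and the time-invariant items appear on every row. Leaving gaps as missing, rather than carrying values forward, keeps imputation as a separate and explicit choice.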
Regardless of whether care for that spell of illness is provided in a single facility, or across multiple facilities, the longitudinal data is appropriately concatenated. This is a particular problem in the UK where similar patients may step down from an ICU to a High Dependency Unit (HDU) in one institution, but may have all their care delivered in a single critical care unit in another institution.

* Where more than one item is available in that time period, the most recent measurement is used by default although other selection algorithms are possible.

Additional functionality includes the ability to relabel the data fields at will, to perform range and consistency checks, and to either impute missing values or to remove episodes with excess missingness. All of this is performed by providing a simple text file with the configuration requests so that even users not familiar with the R programming language can configure the data processing and cleaning pipeline to match their requirements.* The entire package is provided with tutorials and documentation. The cleanEHR toolkit is freely available from the Comprehensive R Archive Network (CRAN; https://cran.r-project.org/web/packages/cleanEHR/index.html) and GitHub (https://github.com/CC-HIC/cleanEHR).

* The text file is specified in YAML, a human readable and writeable alternative to XML. Learning the formatting rules for this should take no more than ten minutes. [10]

5 Current data

The initial data set specification (version 1.0) was released to contributing sites in 2013. Data collection started in 3 ICUs from 3 hospitals in February 2014, and expanded with regular quarterly updates to 11 ICUs from 5 hospitals by July 2017, by which time the database contained 21,930 critical care admissions.
The data set contains 258 variables describing each admission plus additional unit and hospital level metadata. 165 variables are time-dependent (e.g. drugs, physiology etc.), and the remaining 93 are captured on admission or discharge to the ICU, or discharge from the hospital. We used the ICNARC coding method to capture admission diagnosis as per the UK’s national audit.[11] (https://www.icnarc.org/Our-Audit/Audits/Cmp/Resources/Icm-Icnarc-Coding-Method) The data specification permits multiple levels of metadata to be associated with each measurement (i.e. site and units of measurement, route and method of drug administration etc.). We hope to expand the data set to include additional structured data items, narrative text, and waveform data in the near future.

A typical admission would have 70 time-invariant measures, and a median of 1030 (IQR 481 to 2335) time-varying measures. The database therefore contained more than 60 million data items plus associated metadata. A user may therefore recreate, in detail, the longitudinal profile of an individual patient, or examine the distribution of variables across all patients.

Figure 3: Number of physiology observations in the database by day relative to admission
Figure 4: Selected physiology measures and drug administration from an admission with inhalation pneumonitis

CCHIC population across each individual ICU and in total, with descriptive statistics

Time dependence and category        Count by type   All
On admission
  Dates                             1               1
  Admission descriptors             6               6
  Patient characteristics           12              12
  Pre-admission descriptors         10              10
  Prognostic scoring information    24              24
On discharge
  Dates                             2               2
  End of life                       12              12
  Episode descriptors               6               6
  Organ dysfunction summary         1               1
  Post-admission descriptors        1               1
  Prognostic scoring information    4               4
Late follow-up
  Dates                             3               3
  Episode descriptors               2               2
  Post-admission descriptors        9               9
Daily
  Organ dysfunction                 9               9
  Fluid balance                     1               1
Within 30 minutes of input
  Anti-microbials                   45              45
  Cardiovascular                    14              14
  Chemistry                         15              15
  CNS                               9               9
  CVSvasoactive                     15              15
  Dates                             2               2
  Haematology                       5               5
  Microbiology                      3               3
  Neurology                         6               6
  Position                          1               1
  Renal                             8               8
  Respiratory                       4 and 17        21
  Temperature                       2               2
Column totals: Demographics 104, Drugs 69, Laboratory 26, Other 3, Physiology 55, All 258

Table 2: Count of data fields (variables) classified by type and time dependence

6 Discussion

The widespread adoption of EHR platforms, coupled with technical advancements in clinical information systems and biomedical information standards, has enabled the collection and re-use of clinical data for research. Historically, however, researchers have typically seen only the tip of the iceberg: coded administrative data relating to healthcare claims, which mainly record billable diagnoses and procedures. The rich data generated across the clinical pathway remain submerged and inaccessible. It is to this challenge that CCHIC is responding.
6.1 Comparison with other databases

6.1.1 MIMIC

Notable resources already exist in the United States, such as the Medical Information Mart for Intensive Care III (MIMIC-III) database, but these are single centre initiatives.[12] MIMIC has nonetheless set the precedent for open access health data, and has been enormously successful in this regard. The full MIMIC database is available to researchers who complete a human research ethics training programme, and sign a data use agreement.

In contrast, CCHIC makes available a restricted, fully anonymised exemplar data set for exploration and code development. Access to the full data set currently requires approval by the CCHIC data advisory group. Because the source data is fully identifiable, and until we have tested our anonymisation process more widely, we feel this is an appropriate balance.

The only external data that is routinely linked to MIMIC is mortality via social security records. CCHIC, in contrast, was designed from the outset to link regularly to a wide range of health and social care databases. The aim is to eventually collate a cradle to grave perspective of health for patients who experience critical illness. Permissions are already in place to link to hospital episode data, thereby defining secondary care use following discharge, and comorbidities prior to admission. Permissions will next be sought for long term survival, and primary care episodes.

6.1.2 ICNARC

The other major UK critical care database belongs to the Intensive Care National Audit and Research Centre’s Case Mix Programme (ICNARC CMP). This is now, with the exception of Scotland, a national audit programme with more than twenty years of data. However, the CMP is designed for benchmarking, not research.[13] As such, it only contains selected data from the first 24 hours of each admission to critical care, with summarised outcome measures. Linkage is possible but not explicitly part of the remit of the design.
We see ICNARC and CCHIC as two synergistic programmes: one with a wide-angled historical view, and one with a detailed, longitudinal view enriched with secondary sources.

6.2 Limitations

The future of CCHIC depends on our meeting the obligations to patient privacy, and research for patient benefit. The initial technical hurdle has been in transforming the database into a research ready resource. In this we believe we have made significant progress.

The next major technical challenge is to extend the data set, and expand the group of participating hospitals beyond the founding academic centres. Currently, both of these endeavours would require individual sites to write further local ETL (extract, transform, load) schemes. This is a significant burden that is multiplied by each data request and each participating site. Our experience is that even where sites share similar EHR systems, each has been so extensively modified that the ETL scripts are not transferable.

One solution is to shift the burden of data transformation centrally.[14] Each site is then only required to write a smaller data extraction routine. This routine identifies data items associated with critical care admissions (e.g. by filtering HL7 messages), and then transfers them centrally. Since all messages are archived, the data set can be expanded variable by variable as transformation and loading routines are developed. Moreover, these could be applied retrospectively to the existing data archive. The barrier to new sites joining would also be much lower.

6.3 Conclusion

Making health care data available to researchers is a huge challenge. The data is sensitive, and the research needs are many. Resources such as MIMIC and ICNARC already have their own answers to this, but CCHIC brings several advantages. It is an explicitly linkable, multi-centre collaboration with a focus on making the data research ready.
This is more than the technical challenge of protecting personal information, and effectively anonymising data. It is also about creating a culture that promotes collaboration and the best quality reproducible science, and we therefore look forward to meeting our future collaborators.

7 Authors’ contributions

Manuscript preparation: SH, SS, SD
Concept and design of database and model catalogue: NM & DB
Concept and design of data pipeline: SH, SS, DPS, NM
Design, data sharing and critical review: AE, PW, AJ, SA, RB, DY, SB, MS

8 Acknowledgements

This research was funded by the National Institute for Health Research Health Informatics Collaborative and supported by the National Institute for Health Research University College London Hospitals Biomedical Research Centre.

9 Statement on conflicts of interest

We wish to confirm that there are no known conflicts of interest associated with this publication and there has been no significant financial support for this work that could have influenced its outcome. We confirm that the manuscript has been read and approved by all named authors and that there are no other persons who satisfied the criteria for authorship but are not listed. We further confirm that the order of authors listed in the manuscript has been approved by all of us.
10 Summary table

• Electronic health record (EHR) research is being led by critical care (for example, the MIMIC database at MIT); however, all these projects face a common set of conflicting challenges: patient privacy, and usability for the researcher
• CCHIC is a new multi-centre critical care database from the UK that holds data from five hospitals, eleven ICUs and more than 20,000 admissions
• Uniquely, CCHIC is explicitly linkable, with patient identifiers retained to allow mapping of health from the cradle to the grave
• CCHIC is provided with a set of research ready open source software tools in order to facilitate the final part of the contract with patients in using their data: that we can show patient benefit. These tools include:
  o Anonymisation
  o Data cleaning
  o Data extraction in a language agnostic manner

11 References

[1] S.M. Stigler, The History of Statistics, Harvard University Press, 1986.
[2] Standard ISB 0086: Information Governance Toolkit, (2017). http://webarchive.nationalarchives.gov.uk/+/http://www.isb.nhs.uk/library/standard/151 (accessed September 28, 2017).
[3] H. Shah, The DeepMind debacle demands dialogue on data, Nature. 547 (2017) 259–259. doi:10.1038/547259a.
[4] J. Powles, H. Hodson, Google DeepMind and healthcare in an age of algorithms, Health Technol. 29 (2017) 1–17. doi:10.1007/s12553-017-0179-1.
[5] NIHR HIC Locality: Critical Care, (2015). http://www.hdf.nihr.ac.uk/catalogue/#/catalogue/dataModel/13 (accessed September 28, 2017).
[6] [1611.05146] A Semi-Markov Switching Linear Gaussian Model for Censored Physiological Data, (n.d.). https://arxiv.org/abs/1611.05146 (accessed September 28, 2017).
[7] Information Commissioner’s Office, Anonymisation: managing data protection risk code of practice, 2014.
[8] M. Templ, A. Kowarik, B. Meindl, Statistical Disclosure Control for Micro-Data Using the R Package sdcMicro, Journal of Statistical Software. 67 (2015). doi:10.18637/jss.v067.i04.
[9] M. Templ, B. Meindl, A. Kowarik, S. Chen, Introduction to Statistical Disclosure Control (SDC), 2014.
[10] YAML Ain’t Markup Language (YAML™) Version 1.2, (n.d.). http://www.yaml.org/spec/1.2/spec.html (accessed September 29, 2017).
[11] J.D. Young, C. Goldfrad, K. Rowan, Development and testing of a hierarchical method to code the reason for admission to intensive care units: the ICNARC Coding Method, Br J Anaesth. 87 (2001) 543–548. doi:10.1093/bja/87.4.543.
[12] A.E.W. Johnson, T.J. Pollard, L. Shen, L.-w.H. Lehman, M. Feng, M. Ghassemi, B. Moody, P. Szolovits, L.A. Celi, R.G. Mark, MIMIC-III, a freely accessible critical care database, Scientific Data. 3 (2016) 160035. doi:10.1038/sdata.2016.35.
[13] D.A. Harrison, A.R. Brady, K. Rowan, Case mix, outcome and length of stay for admissions to adult, general critical care units in England, Wales and Northern Ireland: the Intensive Care National Audit & Research Centre Case Mix Programme Database, Crit Care. 8 (2004) R99–111. doi:10.1186/cc2834.
[14] C.B. Turley, Leveraging a Statewide Clinical Data Warehouse to Expand Boundaries of the Learning Health System, eGEMs (Generating Evidence & Methods to Improve Patient Outcomes). 4 (2016). doi:10.13063/2327-9214.1245.
12 Appendices

12.1 Data specification