SAFEST: A Safeguarding Analytical Framework for Decentralised Sensitive Data

Patricia Ryser-Welch1*, Leire Abarrantegui2, Soumya Banerjee3

1Department of Mathematics and Engineering, Newcastle University, Newcastle, United Kingdom; 2Department of Bioinformatics Research Group in Epidemiology, Newcastle University, Newcastle, United Kingdom; 3Department of Medical Research Council Epidemiology, University of Cambridge School of Clinical Medicine, Cambridge, United Kingdom

ABSTRACT
The growing demand for, and dependence on, data analysis has been driven by “big data” and the “Internet of Things (IoT)”. Findable, accessible, interoperable, and reusable data sharing has improved scientific reproducibility and robustness and reduced the cost of capturing new data. Ethical and legal restrictions require privacy-preservation and protection measures for any disclosive or sensitive information. We therefore present a possible model to support multi-disciplinary research teams in protecting against disclosure of individual-level data; the model could also apply to large datasets used in other disciplines. We argue that reliance on technology alone is not enough, and that continuous collaboration that adapts to new cyber-security and data-inference threats is needed. We conclude that some standards could lead to closer collaboration to support research and innovation in the long term.

Keywords: Privacy preservation; Privacy information; Federated systems; Secure integration

INTRODUCTION
The fields of “big data” and the “Internet of Things (IoT)” have responded positively to the increasing demand for, and dependence on, data analysis for research purposes. Technological solutions have empowered scientific research with improved reproducibility and robustness, alongside a reduction in the cost of capturing new data [1,2]. To some extent, such endeavours have also addressed some of the needs raised by the FAIR movement for data sharing, i.e., Findable, Accessible, Interoperable, Reusable [3].
Notwithstanding such positive outcomes, physical access to individual-level data may not always be possible; individuals may cede control over their privacy [4], and ethical, legal, and regulatory restrictions prevent the sharing of, and access to, disclosive or sensitive information. Intellectual property or licensing issues associated with research can also impose additional barriers [5]. Health-care, biomedical and social sciences research depends on ethical and legal access to large individual-level datasets. Other disciplines, e.g., astrophysics and biology, may also increasingly capture and store extremely large datasets. Transferring such datasets becomes an impractical and challenging task that a decentralised approach relying on researchers’ collaboration can overcome. A federation allows existing autonomous and heterogeneous computer systems to cooperate [6]. The idea that data computations can be called remotely behind firewalls becomes an attractive solution [5]. Data owners can open their data to a research community without losing control over its use. However, suitable data preservation and protection measures should be put in place for sensitive and individual-level data. A balance between empowering data governance and preventing both the inferential reconstruction of datasets and the identification of individuals should be considered [7]. This paper therefore has a threefold purpose. We propose a privacy-preserving and protecting decentralised architecture for the purpose of ethical and legal analysis. The architecture should enable data analysts and researchers to analyse data in an ethical and legal manner, without lengthy transfers of datasets. Data governors and custodians could share highly sensitive or excessively large data in a controlled manner, with a reduced data governance burden.
International Journal of Advancements in Technology
Research Article

Correspondence to: Patricia Ryser-Welch, Department of Mathematics and Engineering, Newcastle University, Newcastle, United Kingdom, Tel/Fax: 07941305681; E-mail: pat.ryser-welch@open.ac.uk
Received: 31-Oct-2022, Manuscript No. IJAOT-22-19914; Editor assigned: 04-Nov-2022, Pre Qc No. IJOAT-22-19914 (PQ); Reviewed: 18-Nov-2022; Qc No. IJOAT-22-19914; Revised: 25-Nov-2022, Manuscript No. IJOAT-22-19914 (R); Published: 02-Dec-2022, DOI: 10.35248/0976-4860.22.13.213.
Citation: Ryser-Welch P, Abarrantegui L, Banerjee S (2022) SAFEST: A Safeguarding Analytical Framework for Decentralised Sensitive Data. Int J Adv Technol. 13:213.
Copyright: © 2022 Ryser-Welch P, et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Int J Adv Technol, Vol.13 Iss.10 No:1000213

• We suggest that a collaborative effort to adapt to societal as well as research needs should be considered, alongside the creation of new standards.

have been oblivious of such issues [31,32]. Consequently, multidisciplinary research communities are increasingly discussing the ethical considerations, privacy preservation, and protection of contact-tracing apps, social media usage, and other methods used to capture data [33-35]. A recent movement to decentralise the World Wide Web is attempting to empower individuals with the governance of their own data. An approach referred to as “Solid” should offer individuals “pods” to keep and manage the data generated from social media, entertainment, shopping, and other web activities. Websites, web apps, and other web services should request data from these pods, rather than storing the personal data centrally. Individuals should decide which organisations are allowed to use the data [36-38].
Another approach has explored the sharing of large files containing sensitive information with the InterPlanetary File System (IPFS) and blockchain technologies. Individuals can register files, and grant and revoke access to them [39-41]. At the time of writing, these ideas remain at an early stage of development. It is likely to take some time and effort to mature them into an alternative to the current implementation of the World Wide Web. Previous decentralisation efforts benefited from the development of network interconnection technologies and service-oriented software architectures [42-45]. These highly heterogeneous technologies led to federations of autonomous and distributed systems, capable of evolving and enabling data sharing [5,6,46]. Direct access remains unsuitable for highly sensitive data without any data agreements. An alternative approach brings some hope; it distributes the computations to the data, using remote procedure calls to privacy-preserving computations. Referred to as the DataSHIELD approach, it denies users direct access to individual-level data, but offers the opportunity to analyse harmonised individual-level data remotely, without the data leaving their host site. The outcome of these calculations should lower the risk of inferential data reconstruction through carefully designed disclosure controls [7,47,48]. We propose a theoretical framework that decentralises secure multiparty computations in the context of federated analytical systems, to prevent any direct access to the data. In secure multiparty computation, trusted parties respond to requests by completing computations; they often employ “need-to-know” standards that accomplish allowable tasks and disclose only the computations’ results [49,50]. Referred to as SAFEST, the proposed framework is inspired by network interconnection technologies, service-oriented software architectures, and the DataSHIELD approach.
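As an illustration of taking the computation to the data, the following minimal Python sketch shows a host site releasing only aggregate results, and only when a disclosure control is satisfied. It is not the DataSHIELD implementation: the function names `site_summary` and `pooled_mean` and the threshold `MIN_COUNT = 5` are assumptions introduced here for illustration.

```python
from typing import Iterable, Optional, Sequence, Tuple

# Hypothetical disclosure threshold: results computed on fewer records
# than this are never released by a host site.
MIN_COUNT = 5

Summary = Tuple[int, float]  # (number of records, sum of values)


def site_summary(values: Sequence[float],
                 min_count: int = MIN_COUNT) -> Optional[Summary]:
    """Runs at the host site, behind its firewall.

    Only the count and the sum leave the site; the individual-level
    values never do. Returns None when the result would be disclosive.
    """
    n = len(values)
    if n < min_count:
        return None  # blocked: too few records to release safely
    return n, float(sum(values))


def pooled_mean(summaries: Iterable[Optional[Summary]]) -> float:
    """Runs on the analyst's side, combining the released aggregates."""
    released = [s for s in summaries if s is not None]
    if not released:
        raise ValueError("no site released a non-disclosive result")
    total_n = sum(n for n, _ in released)
    total_sum = sum(s for _, s in released)
    return total_sum / total_n
```

The analyst obtains the same mean as if the released sites' data had been pooled centrally, while a site holding too few records contributes nothing rather than exposing its values.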
Each server-side component (shown in blue in Figure 1) decentralises user management, access rights and privacy-preserving parametrisation, allowing restricted calls to, and execution of, server-side computations. Client-side software (shown in green in Figure 1) should enable the analysis and comparison of data as if the data were outsourced to a central computer and jointly analysed. Communications between data analysts and analytical servers can only occur using secured networking protocols and server-side connections.

Privacy keeps information about individuals from being available to others [8]. This nontrivial task is becoming even more challenging; as a by-product of human, Internet, WWW and device activity, intangible personal data about individuals is produced at an alarming and increasing rate. Behaviour, preferences, and other sensitive details have become even more traceable, making it possible to predict future trends and inform product development [9,10]. Some platforms provide a dynamic approach for capturing data from sizeable social media and geospatial sources [11-13]. Vehicle sensors and other spatio-temporal data have helped predict traffic flow in a city, support driverless vehicles, and quantify air pollution and traffic [14-16]. On the other hand, contact-tracing apps for fighting COVID-19 can collect and retain individuals’ data [17]. The emerging “Internet of Medical Things” (IoMT) aims to integrate medical devices to deliver better-connected health-care provision. Bioinformatics and population-health medical research have also generated large quantities of data stored in various digital forms [18].

MATERIALS AND METHODS
A growing concern is that this data is waiting to be stored and analysed by the immense analytical power offered by Internet-based services, and to become an exploitable business asset for improving business activities, product development, and medical research [9,10,19-23].
For example, the application of machine learning algorithms extends advanced statistical methodologies to the analysis of unstructured data (e.g., images, documents, social media entries). Among other things, previously unknown patterns about individuals can potentially become identifiable without their knowledge and benefit others. Stolen details of an individual (i.e., a physical person or an organisation) can be used to commit a crime or damage a reputation. Data analytics can also have positive and negative impacts at a societal level; digital totalitarianism is arising in some countries [24]. Data mining techniques and language models could also assist in predicting electoral outcomes; the latter can empower campaigning and political advertising [25-29]. Nevertheless, language models can also help us understand the propagation of fake news [30]. Despite these newly created phenomena, the COVID-19 pandemic, and the sudden adoption of tracing technology to overcome the crisis, has prompted global discussions about (1) data collection, (2) potential future breaches of individuals’ privacy, (3) confidentiality, and (4) the possible identification of individuals without their knowledge. Before, individuals may

• We review various approaches and applications of technologies used to prevent disclosure of individual-level data.
• We discuss trends and solutions currently considered by the wider research community.

analysing large datasets can become a problem if access to high-computation facilities is limited. The privacy-preservation and protection features introduced in the previous section (see Figure 1) may be needed for health-care, biomedical and social sciences research. Phenotypic and other highly sensitive data are likely to be analysed. Federation trends diverge in their approaches, adopting emerging technologies and security standards.
For example, some cloud-based solutions rely on a specific platform to develop advanced analytical tools, such as machine learning algorithms. This trend follows cloud-technology market leaders, who have adapted software, data storage and analytical tools as a service to the privacy-protection needs of medical health-care providers. Data analysts depend on a web interface to query the data and its metadata directly. Cloud technologies also transfer heavily encrypted data across trusted consortia. Such an approach has been adopted by MedCo [54]. There is little evidence that privacy-preserving parameterisations and computations have yet been implemented. Data governance may not agree to the data being transferred outside their organisations, and large transfers of data may be impractical. Specialised server-side warehouse software may offer “lighter-weight” solutions that prevent any transfer of raw data; instead, data is kept behind the firewall of an organisation. In such solutions, data can either be uploaded into the specialised software or linked to existing systems. Renku [55] offers liberal data sharing through data repositories, with some privacy-protection features. The latter depend on the privacy protection offered by the chosen technology, and hence their suitability should be verified for the type of data shared prior to use. Personal Health Train and MOLGENIS primarily offer the wider community the opportunity to discover research datasets [56,57]. An organisation’s firewall appears to bring data protection. However, there is very little evidence of requests being encrypted using Secure Sockets Layer protocols or other mechanisms, which may be a concern, especially as batch data download appears to be feasible. Cafe Variome [58] functionalities include a specialised web server that obfuscates some data using privacy-preserving computations. Privacy levels can be set to manage access to datasets, i.e., linked, public or private.
The aim of Cafe Variome is data discovery rather than analysis, at the time of writing. Most of the aforementioned systems offer either a well-designed web interface or scripts using programming languages designed for data analysis, e.g., R or Python. Strict data agreements may need arranging before participating in any collaboration between research partners. Remote Procedure Calls (RPC) can bring the computations to the data, where researchers remotely analyse harmonised individual-level data. RPC can usefully implement a ‘hub-and-spoke’ design, bringing an implementation a step closer to Figure 1. Various data-protection and data-preservation techniques may need to be considered and implemented to prevent disclosure.

Some client-side components (visualisation, user interaction, client-side computations, and networking tools) should bring some transparency, hiding the remote-calculation calls from data analysts. Remote analysts should undoubtedly be informed of any server-side privacy-preserving and protective features. A gateway (client-side) should manage connections and process responses from any participating servers. A gatekeeper (server-side) should authorise requests, validate users, and stop any disclosive responses. Invalid or unauthorised calls should be blocked. Data governance should remain responsible for granting or revoking remote access for data analysts. User management and access rights should empower data governors to authorise data analysts to execute specific server-side computations on specific datasets. Privacy-preserving parameterisation controls the behaviour and outcome of the server-side computations. Consequently, only encrypted content or the results of server-side computations can be transmitted back to the client-side components.
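The gatekeeper role described above can be sketched in a few lines of Python. This is a hypothetical illustration under stated assumptions, not part of any existing system: the class name `Gatekeeper`, its methods, and the `min_records` parameter (standing in for a privacy-preserving parameter set by data governors) are all invented here.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Sequence, Set, Tuple


@dataclass
class Gatekeeper:
    """Hypothetical server-side gatekeeper for one participating server."""
    datasets: Dict[str, List[float]]
    allowed_functions: Dict[str, Callable[[Sequence[float]], float]]
    # Access rights as (user, dataset, function) triples set by data governors.
    access_rights: Set[Tuple[str, str, str]] = field(default_factory=set)
    # Assumed privacy-preserving parameter: minimum records per result.
    min_records: int = 5

    def grant(self, user: str, dataset: str, function: str) -> None:
        self.access_rights.add((user, dataset, function))

    def revoke(self, user: str, dataset: str, function: str) -> None:
        self.access_rights.discard((user, dataset, function))

    def call(self, user: str, dataset: str, function: str) -> float:
        """Authorise a remote call and return only a computation result."""
        if function not in self.allowed_functions or dataset not in self.datasets:
            raise PermissionError("invalid call blocked")
        if (user, dataset, function) not in self.access_rights:
            raise PermissionError("unauthorised call blocked")
        records = self.datasets[dataset]
        if len(records) < self.min_records:
            raise PermissionError("disclosive response blocked")
        return self.allowed_functions[function](records)
```

A data governor would register the permitted computations (for example `statistics.mean`), grant a given analyst the right to run one of them on one dataset, and revoke that right later; every other request raises `PermissionError` before any data is touched.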
These privacy-preserving and protective features directly inform the gatekeeper, which only enables the authorised execution of computations on its own participating server and prevents outcomes from being returned under certain criteria.

RESULTS AND DISCUSSION
The identification of individuals may only occur with in-depth domain knowledge and access to certain types of data. Time-series data related to pandemics and genomic data are unlikely to be considered highly sensitive, reducing some of the need for user management, access rights, privacy-preserving parametrisation, and computations. The UCSC Genome Browser [51] and the COVID-19 Data Hub [52] are significant examples of such fully distributed systems, in which data needs to be outsourced to some storage and processing unit. Analysts can either (1) download and analyse data on their own computer using a command-line interface provided by functional languages, or (2) use a web interface providing analytical and upload tools. Bioconductor enriches similar functionalities with analytical software libraries; any analytical computation relies on researchers’ computing facilities for its execution [53]. No computation has yet been brought to the data. Therefore,

Figure 1: Illustration of SAFEST.

Ongoing collaboration can act as gatekeepers (see Figure 1) to reduce disclosure. Verification of analytical results has yet to be automated; instead, experts assess the results, so the analytical pipeline integrates computing and human processes. Such endeavours appear to successfully support COVID-19 research and other causes. The number of publications using OpenSAFELY has increased rapidly, as an impressive amount of raw NHS data can now be analysed [67-69]. Such a process should suitably lower the level of disclosure. Notwithstanding the positive contributions of this model, the human validation process may slow down the analysis pipeline.
Adding disclosure controls alongside analytical tools may partially overcome such issues, while limiting disclosure. Automated and more traditional gatekeeping techniques may complement each other and preserve a low level of disclosure. Goldacre, et al. [70] have acknowledged that both OpenSAFELY and DataSHIELD have successfully made it possible to complete analytic outputs on NHS data.

CONCLUSION
A real momentum exists in bringing computer scientists, medical practitioners and other disciplines together to federate systems and support researchers. Advanced statistical and machine learning methodologies have been distributed to address specific research questions. It would be beneficial to attempt to generalise the distribution techniques and algorithms for more than one purpose. With time, we anticipate that some effort in this direction may naturally arise. Collaboration is successfully integrating combinations of software to build distributed architectures. Little doubt exists that organisations’ IT services and researchers should work together to implement and maintain consortia close to the proposed framework. Additional research may need to explore the best approaches to bring together IT professionals, researchers, and their own perceptions of solving problems. Such collaboration may consider adaptability, flexibility and the ability to integrate with existing systems. Lightweight approaches or hub-and-spoke solutions appear to offer better opportunities for adapting to existing systems. It is possible that lightweight approaches could become more resilient to the fast-advancing technological ecosystem we are experiencing. Without the creation of standards and white papers, the distribution of computer systems over the Internet would not have been so successful. This model should be an inspiration for health-care, biomedical and other industries to drive innovation and resilience to future technological advancement.
We have presented a possible model to support multi-disciplinary research teams in protecting against disclosure of individual-level data. The model could also apply to large datasets used in other disciplines, such as astronomy. We conclude that not all solutions should rely on technology alone, but also on continuous collaboration to adapt to new cyber-security and data-inference threats. We have also concluded that some standards could lead to closer collaboration to support research and innovation in the long term.

ViPAR [59] empowers data governors to authorise access to the data; no privacy-preservation settings are yet included. Encrypted anonymised data may be transferred but not saved permanently on a central server’s hard disk. Researchers may not access the data directly, but the hosting organisation is likely to view data with appropriate tools. Unlike Figure 1, some federation of the data occurs at the server side, rather than letting the researchers decide to connect to specific data sources. Pooled data can therefore be analysed using server-level or virtually-pooled analysis, using an implemented analytical pipeline. An analytical web interface and a command-line interface are available to conduct remote analysis. To maintain a low disclosure level, encryption and anonymisation methods need to adapt continuously to security threats and counter-measures. The use of ViPAR within the same organisation may also improve the level of protection. Another approach to ‘hub-and-spoke’ design brings more freedom to data analysts. Specialised client-side libraries, server-side libraries, and warehouse software bring multi-platform components that act as an integration layer between existing systems and data analysts. The Vantage6 [60] solution has yet to offer any gatekeepers, user management, access rights, or privacy-preserving parameterisation.
It is as yet unclear whether any form of encryption, such as homomorphic data encryption at the server level or Secure Sockets Layer at the transmission level, is required. Vantage6 makes a “lighter-weight” use of technology to adapt to various needs dictated by the data, and also supports multiple programming languages. DataSHIELD has developed from a prototype into a development platform for privacy-preserving federated analysis [47,48,61]; it depends on the functional programming language R. Visualisation, client-side computations, and secured connections to the servers are implemented in R and Java. Data analysts do not yet have a web interface. Instead, they can use R scripts, notebooks, vignettes, or create their own web interface to analyse data and share their results. Data governors can edit authorised requests, privacy-preserving parameterisation, user management, and access rights using specialised warehouse software. The latter can resource data kept outside the server, in the same organisation. Existing organisational security features, the implementation of privacy-preserving computations, and the settings of privacy-preserving parameters should lower disclosure. Data governance tools and their application should maintain resilient data protection and data preservation. Hub-and-spoke architecture has also been adopted in exploring the decentralisation of machine learning, such as deep learning, to multiple devices and cloud platforms. Machine learning algorithms are brought to training and testing datasets distributed across nodes; some of them are distributed within consortia or across organisations [62-64]. Privacy preservation using encryption techniques, such as blockchain or homomorphic encryption, is considered state-of-the-art at the time of writing [65,66].
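A minimal sketch of hub-and-spoke model training is given below, assuming federated averaging of a one-parameter linear model y = w·x. The model, the function names (`local_update`, `federated_average`, `train`) and the learning rate are assumptions chosen for illustration; no encryption is shown, and the sketch does not reproduce any of the cited systems. Each node computes an update on its own data, and only the updated parameter and a record count travel to the hub.

```python
from typing import List, Sequence, Tuple

# One node's local data: (xs, ys). It never leaves the node.
NodeData = Tuple[Sequence[float], Sequence[float]]


def local_update(w: float, xs: Sequence[float], ys: Sequence[float],
                 lr: float) -> Tuple[float, int]:
    """Spoke: one gradient step on the local mean squared error."""
    n = len(xs)
    grad = sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n
    return w - lr * grad, n


def federated_average(updates: List[Tuple[float, int]]) -> float:
    """Hub: average the node parameters, weighted by record counts."""
    total = sum(n for _, n in updates)
    return sum(w * n for w, n in updates) / total


def train(nodes: Sequence[NodeData], rounds: int = 100,
          lr: float = 0.01) -> float:
    """Alternate local updates at the spokes and averaging at the hub."""
    w = 0.0
    for _ in range(rounds):
        updates = [local_update(w, xs, ys, lr) for xs, ys in nodes]
        w = federated_average(updates)
    return w
```

Weighting by record count keeps a small node from dominating the shared parameter; in a deployment following the text, the exchanged parameters would additionally be protected in transit or by the encryption techniques mentioned above.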
However, the use of a gatekeeper, user management and access rights, as well as privacy-preserving parametrisation, has yet to be considered or acknowledged by the community.

REFERENCES
1. Breden F, Luning Prak ET, Peters B, Rubelt F, Schramm CA, Busse CE, et al. Reproducibility and reuse of adaptive immune receptor repertoire data. Front Immunol. 2017;8:1418. 2. Editorial: Data sharing and the future of science. Nat Commun. 2018. 3. Dunning A, de Smaele M, Böhmer J. Are the FAIR data principles fair? Int J Digit Curation. 2017;12(2):177-195. 4. Waldman AE. Cognitive biases, dark patterns, and the ‘privacy paradox’. Curr Opin Psychol. 2020;31:105-109. 5. Kowalczyk S, Shankar K. Data sharing in the sciences. Annu Rev Inf Sci Technol. 2011;45(1):247-294. 6. Heimbigner D, McLeod D. A federated architecture for information management. ACM Trans Inf Syst. 1985;3(3):253-278. 7. Murtagh MJ, Turner A, Minion JT, Fay M, Burton PR. International data sharing in practice: New technologies meet old governance. Biopreserv Biobank. 2016;14(3):231-240. 8. Clifton C, Kantarcioglu M, Vaidya J. Defining privacy for data mining. In National science foundation workshop on next generation data mining 2002;12(6):1. 9. Redi M, Aiello LM, Schifanella R, Quercia D. The spirit of the city: Using social media to capture neighborhood ambiance. Proc ACM Hum-Comput Interact. 2018;2(CSCW):1-8. 10. Meier LM, Manzerolle VR. Rising tides? Data capture, platform accumulation, and new monopolies in the digital music economy. New Media Soc. 2019;21(3):543-561. 11. Gisselbrecht T, Denoyer L, Gallinari P, Lamprier S. Whichstreams: A dynamic approach for focused data capture from large social media. In Proceedings of the International AAAI Conference on Web and Social Media. 2015;9(1):130-139. 12. Bechini A, Gazze D, Marchetti A, Tesconi M. Towards a general architecture for social media data capture from a multi-domain perspective.
In 2016 IEEE 30th International Conference on Advanced Information Networking and Applications (AINA). 2016:1093-1100. 13. Larkin A, Hystad P. Integrating geospatial data and social media in bidirectional long-short term memory models to capture human nature interactions. Comput J. 2022;65(3):667-678. 14. Koppanyi Z, Toth CK. Experiences with acquiring highly redundant spatial data to support driverless vehicle technologies. ISPRS Ann Photogramm. Remote Sens Spat Inf Sci. 2018;4(2). 15. Gately CK, Hutyra LR, Peterson S, Wing IS. Urban emissions hotspots: Quantifying vehicle congestion and air pollution using mobile phone GPS data. Environmental pollution. 2017 Oct 1;229:496-504. 16. Chen C, Li K, Teo SG, Chen G, Zou X, Yang X, Vijay RC, Feng J, Zeng Z. Exploiting spatio-temporal correlations with multiple 3d convolutional neural networks for citywide vehicle flow prediction. In 2018 IEEE international conference on data mining (ICDM) 2018;893-898. 17. Azad MA, Arshad J, Akmal SM, Riaz F, Abdullah S, Imran M, Ahmad F. A first look at privacy analysis of COVID-19 contact-tracing mobile applications. IEEE Internet Things J. 2020;8(21): 15796-15806. 18. Butte AJ. Challenges in bioinformatics: Infrastructure, models and analytics. Trends Biotechnol. 2001;19(5):159-160. 19. Hobart M. The ‘dark data’conundrum. Comput Fraud Secur. 2020;2020(7):13-16. 20. Gimpel G. Bringing dark data into the light: Illuminating existing IoT data lost within your organization. Bus Horiz. 2020;63(4): 519-530. 21. Abid S, Keshavjee K, Karim A, Guergachi A. What we can learn from amazon for clinical decision support systems. Stud Health Technol Inform. 2017.234:1-5. 22. Khandelwal V, Chaturvedi AK, Gupta CP. Amazon EC2 spot price prediction using regression random forests. IEEE Trans Cloud Comput. 2017;8(1):59-72. 23. Law KS, Chung FL. Knowledge-driven decision analytics for commercial banking. J Manag Anal. 2020;7(2):209-230. 24. Qiang X. 
The road to digital unfreedom: President Xi's surveillance state. J Democr. 2019;30(1):53-67 25. Fulgoni GM, Lipsman A, Davidsen C. The power of political advertising: Lessons for practitioners: How data analytics, social media, and creative strategies shape US presidential election campaigns. J Advert Res. 2016;56(3):239-244. 26. Rathi R. Effect of Cambridge analytica’s Facebook ads on the 2016 US presidential election. Towards Data Science. 2019. 27. Mahmood T, Iqbal T, Amin F, Lohanna W, Mustafa A. Mining Twitter big data to predict 2013 Pakistan election winner. In INMIC. 2013;49-54. 28. Fuller S. Brexit as the unlikely leading edge of the anti-expert revolution. Eur Manag J. 2017;35(5):575-580. 29. Livne A, Simmons M, Adar E, Adamic L. The party is over here: Structure and content in the 2010 election. In Proceedings of the International AAAI Conference on Web and Social Media. 2011;5(1):201-208. 30. Apuke OD, Omar B. Fake news and COVID-19: modelling the predictors of fake news sharing among social media users. Telemat Inform. 2021;56:101475. 31. Herschel R, Miori VM. Ethics and big data. Technology in Society. 2017;49:31-16. 32. Zhang X, Bi H. Research on privacy preserving classification data mining based on random perturbation. In 2010 International Conference on Information, Networking and Automation. 2010;1:V1-173. 33. Parker MJ, Fraser C, Abeler-Dörner L, Bonsall D. Ethics of instantaneous contact tracing using mobile phone apps in the control of the COVID-19 pandemic. J Med Ethics.2020;46(7):427-31. 34. Vaudenay S. Centralized or decentralized. The contact tracing dilemma. 2020. 35. Sun R, Wang W, Xue M, Tyson G, Camtepe S, Ranasinghe D. Vetting security and privacy of global COVID-19 contact tracing applications. arXiv preprint. 2020. 36. Mansour E, Sambra AV, Hawke S, Zereba M, Capadisli S, Ghanem A. A demonstration of the solid platform for social web applications. In Proceedings of the 25th international conference companion on world wide web. 
2016;223-226. 37. Ramachandran M, Chowdhury N, Third A, Domingue J, Quick K, Bachler M. Towards complete decentralised verification of data with confidentiality: Different ways to connect solid pods and blockchain. In Companion Proceedings of the Web Conference. 2020;645-649. 38. Buyle R, Taelman R, Mostaert K, Joris G, Mannens E, Verborgh R, Berners-Lee T. Streamlining governmental processes by putting citizens in control of their personal data. In International Conference on Electronic Governance and Open Society: Challenges in Eurasia 2019;346-359. 39. Kumar R, Tripathi R. Large-scale data storage scheme in blockchain ledger using ipfs and nosql. In Large-Scale Data Streaming, Processing, and Blockchain Security 2021;91-116. 40. Benet J. Ipfs-content addressed, versioned, p2p file system. arXiv. 2014.
41. Steichen M, Fiz B, Norvill R, Shbair W, State R. Blockchain-based, decentralized access control for IPFS. Sensors (Basel). 2021;21(7):1499-1506.
42. Cömert C. Web services and national spatial data infrastructure (NSDI). In Proceedings of geo-imagery bridging continents, XXth ISPRS congress. 2004.
43. Ferris C, Farrell J. What are web services? Communications of the ACM. 2003;46(6):31.
44. Berners-Lee T, Cailliau R, Groff JF, Pollermann B. World-Wide Web: The information universe. Internet Research. 1992;2(1):52-58.
45. Perry DG, Blumenthal SH, Hinden RM. The ARPANET and the DARPA Internet. Library Hi Tech. 1988;6(2):51-62.
46. Busse S, Kutsche RD, Leser U, Weber H. Federated information systems: Concepts, terminology and architectures. Forschungsberichte des Fachbereichs Informatik. 1999;99(9):1-38.
47. Gaye A, Marcon Y, Isaeva J, LaFlamme P, Turner A, Jones EM, et al. DataSHIELD: Taking the analysis to the data, not the data to the analysis. Int J Epidemiol. 2014;43(6):1929-44.
48. Marcon Y, Bishop T, Avraam D, Escriba-Montagut X, Ryser-Welch P, Wheater S, et al. Orchestrating privacy-protected big data analyses of data from different resources with R and DataSHIELD. PLoS Comput Biol. 2021;17(3):e1008880.
49. Yao AC. How to generate and exchange secrets. In 27th Annual Symposium on Foundations of Computer Science (SFCS 1986). 1986;162-167.
50. Goldreich O, Micali S, Wigderson A. How to play any mental game, or a completeness theorem for protocols with honest majority. In Providing Sound Foundations for Cryptography: On the Work of Shafi Goldwasser and Silvio Micali. 2019;307-328.
51. Genome Browser Gateway. 2020.
52. Guidotti E, Ardia D. COVID-19 data hub. J Open Source Softw. 2020;5(51):2376.
53. Gentleman RC, Carey VJ, Bates DM, Bolstad B, Dettling M, Dudoit S, et al. Bioconductor: Open software development for computational biology and bioinformatics. Genome Biol. 2004;5(10):1-6.
54. Raisaro JL, Troncoso-Pastoriza JR, Misbach M, Sousa JS, Pradervand S, Missiaglia E, et al. MedCo: Enabling secure and privacy-preserving exploration of distributed clinical and genomic data. IEEE/ACM Transactions on Computational Biology and Bioinformatics. 2018;16(4):1328-1341.
55. Ziehmer MM, Rees R. Data management, open access and the ETH research collection. In Group Meeting Computer Engineering Group, ETH Zurich. 2019.
56. Beyan O, Choudhury A, van Soest J, Kohlbacher O, Zimmermann L, Stenzhorn H, et al. Distributed analytics on sensitive medical data: The personal health train. Data Intelligence. 2020;2(2):96-107.
57. van der Velde KJ, Imhann F, Charbon B, Pang C, van Enckevort D, Slofstra M, et al. MOLGENIS research: Advanced bioinformatics data software for non-bioinformaticians. Bioinformatics. 2019;35(6):1076-1078.
58. Lancaster O, Beck T, Atlan D, Swertz M, Thangavelu D, Veal C, et al. Cafe Variome: General-purpose software for making genotype-phenotype data discoverable in restricted or open access contexts. Hum Mutat. 2015;36(10):957-964.
59. Carter KW, Francis RW, Carter KW, Francis RW, Bresnahan M, Gissler M, et al. ViPAR: A software platform for the virtual pooling and analysis of research data. Int J Epidemiol. 2016;45(2):408-416.
60. Hulsen T. Sharing is caring - data sharing initiatives in healthcare. Int J Environ Res Public Health. 2020;17(9):3046.
61. Wolfson M, Wallace SE, Masca N, Rowe G, Sheehan NA, Ferretti V, et al. DataSHIELD: Resolving a conflict in contemporary bioscience - performing a pooled analysis of individual-level data without sharing the data. Int J Epidemiol. 2010;39(5):1372-82.
62. Galtier MN, Marini C. Substra: A framework for privacy-preserving, traceable and collaborative machine learning. arXiv. 2019.
63. Dou Q, So TY, Jiang M, Liu Q, Vardhanabhuti V, Kaissis G, et al. Federated deep learning for detecting COVID-19 lung abnormalities in CT: A privacy-preserving multinational validation study. NPJ Digit Med. 2021;4(1):1-1.
64. Nasirigerdeh R, Torkzadehmahani R, Matschinske J, Frisch T, List M, Späth J, et al. sPLINK: A federated, privacy-preserving tool as a robust alternative to meta-analysis in genome-wide association studies. bioRxiv. 2020.
65. Yang Q, Liu Y, Chen T, Tong Y. Federated machine learning: Concept and applications. ACM Transactions on Intelligent Systems and Technology (TIST). 2019;10(2):1-9.
66. Mothukuri V, Parizi RM, Pouriyeh S, Huang Y, Dehghantanha A, Srivastava G. A survey on security and privacy of federated learning. Future Gener Comput Syst. 2021;115:619-640.
67. Williamson E, Walker AJ, Bhaskaran K, Bacon S, Bates C, Morton CE, et al. OpenSAFELY: Factors associated with COVID-19 related hospital death in the linked electronic health records of 17 million adult NHS patients. medRxiv. 2020.
68. Schultze A, Walker AJ, MacKenna B, Morton CE, Bhaskaran K, Brown JP, et al. Risk of COVID-19-related death among patients with chronic obstructive pulmonary disease or asthma prescribed inhaled corticosteroids: An observational cohort study using the OpenSAFELY platform. Lancet Respir Med. 2020;8(11):1106-20.
69. Bhaskaran K, Bacon S, Evans SJ, Bates CJ, Rentsch CT, MacKenna B, et al. Factors associated with deaths due to COVID-19 versus other causes: Population-based cohort analysis of UK primary care data and linked national death registrations within the OpenSAFELY platform. Lancet Reg Health Eur. 2021;6:100109.
70. Goldacre B, Morley J. Better, broader, safer: Using health data for research and analysis. A review commissioned by the Secretary of State for Health and Social Care. 2022.