MRC Biostatistics Unit
https://www.repository.cam.ac.uk/handle/1810/261990
2023-12-07T09:53:56Z

https://www.repository.cam.ac.uk/handle/1810/357504
2023-09-26T00:43:39Z
dc.title: Dynamic risk prediction of cardiovascular disease using primary care data from New Zealand
dc.contributor.author: Barrott, Isobel
dc.description.abstract: Cardiovascular disease (CVD) progresses gradually over time, and can lead to a cardiovascular disease event (“CVD event”) such as stroke or heart attack. There are several widely researched risk factors for CVD, such as smoking, diet, exercise, and stress (Perk et al., 2012). These risk factors can impact biomarkers like blood pressure and lipid levels, which can be measured by a primary care practitioner and are themselves risk factors (World Health Organization, 2021). The PREDICT cohort study (Wells et al., 2017; Pylypchuk et al., 2018) comprises the electronic health records (EHRs) of such CVD risk factor measurements, which were collected to assess the 5-year CVD risk of a patient in primary care. A risk prediction model previously developed for this population by Pylypchuk et al. (2018) uses only the most recent observations of these biomarkers.
Dynamic prediction is an alternative to this approach: it updates risk predictions as measurements are collected, and therefore uses the entire history of these measurements. There are two main statistical frameworks for performing dynamic prediction: the joint model and the landmark model. This thesis explores the use of dynamic prediction, and in particular the landmark model, to improve CVD risk prediction. Two types of landmark model for 5-year CVD risk are presented in this thesis, both developed using the PREDICT cohort study dataset: one models the longitudinal data using a linear mixed effects (LME) model, and the other uses the last observation carried forward (LOCF) approach. These dynamic prediction models were found to show some improvement in model performance over a “static” model similar to that developed by Pylypchuk et al. (2018). This thesis also presents the results of a simulation study exploring the difference between these two types of landmark model as the number of repeated measurements of the biomarkers increases, in particular finding that there is little difference in terms of model performance. Finally, this thesis presents an R package ‘Landmarking’ which allows the user to perform various analyses relating to the landmark model.
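The LOCF idea mentioned above can be illustrated with a minimal Python sketch. It is not taken from the thesis or from the ‘Landmarking’ R package: the patient identifiers, measurement values, landmark time and the function name `locf_at_landmark` are all hypothetical, and the sketch only shows how one covariate value per patient is fixed at a landmark time.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical toy data: patient id -> list of (time in years, biomarker value).
measurements: Dict[str, List[Tuple[float, float]]] = {
    "A": [(0.5, 130.0), (2.0, 142.0), (4.5, 150.0)],
    "B": [(1.0, 118.0), (3.5, 121.0)],
    "C": [(3.2, 135.0)],
}


def locf_at_landmark(history: List[Tuple[float, float]],
                     landmark_time: float) -> Optional[float]:
    """Return the latest value measured at or before landmark_time, or None."""
    eligible = [(t, v) for t, v in history if t <= landmark_time]
    if not eligible:
        return None  # no measurement is available before the landmark
    # Pick the observation with the largest measurement time.
    return max(eligible, key=lambda tv: tv[0])[1]


landmark = 3.0  # hypothetical landmark time in years
locf_values = {pid: locf_at_landmark(h, landmark)
               for pid, h in measurements.items()}
print(locf_values)
```

Measurements taken after the landmark (for example patient A at 4.5 years) are deliberately ignored, since a prediction made at the landmark time can only use information available by then.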

https://www.repository.cam.ac.uk/handle/1810/355831
2023-09-01T00:40:53Z
dc.title: Using generative modelling in healthcare
dc.contributor.author: Skoularidou, Maria
dc.description.abstract: In the present thesis a broad spectrum of high-dimensional problems with application to healthcare will be explored. We shall review the state-of-the-art methods that are employed when trying to detect genetic factors that affect gene expression, which is a core problem in genetics. We shall also present two popular classes of generative models, namely Variational Autoencoders (VAEs) and Generative Adversarial Networks (GANs), and their variants. Subsequently, we shall review some newly developed imputation methods which are based on GANs and VAEs. We shall assess their performance under various missingness scenarios via accordingly designed experiments and simulation studies. We shall then introduce our method for GAN inversion and evaluate its performance in a newly suggested manner. Finally, we shall conclude this thesis with our main findings and future work.

https://www.repository.cam.ac.uk/handle/1810/350508
2023-12-05T20:49:25Z
dc.title: Bayesian methodology for integrating multiple data sources and specifying priors from predictive information
dc.contributor.author: Manderson, Andrew
dc.description.abstract: The joint model for observable quantities and latent parameters is the starting point for Bayesian inference. It is challenging to specify such a model so that it both accurately describes the phenomena being studied and is compatible with the available data. In this thesis we address challenges to model specification when we have multiple data sources, expert knowledge about the observable quantities in our model, or both.
We often collect many distinct data sets that capture different, but partially overlapping, aspects of a complex phenomenon. Instead of immediately specifying a single joint model for all these data, it may be easier to specify distinct submodels for each source of data and then join the submodels together. We specifically consider chains of submodels, where submodels directly relate to their neighbours via common quantities which may be parameters or deterministic functions thereof. We propose chained Markov melding, an extension of Markov melding, a generic method to combine chains of submodels into a joint model.
When using any form of Markov melding, one challenge is that the prior for the common quantities can be implicit, and their marginal prior densities must be estimated. Specifically, when we have just two submodels, we show that error in this density estimate makes the two-stage Markov chain Monte Carlo sampler employed by Markov melding unstable and unreliable. We propose a robust two-stage algorithm that estimates the required prior marginal self-density ratios using weighted samples, dramatically improving accuracy in the tails of the distribution.
Expert information often pertains to the observable quantities in our model, or can be more easily elicited for these quantities. However, the appropriate informative prior that matches this information is not always obvious, particularly for complex models. Prior predictive checks and the Bayesian workflow are often undertaken to iteratively specify a prior that agrees with the elicited expert information, but for complex models it is difficult to manually adjust the prior to better align the prior predictive distribution and the elicited information. We propose a multi-objective global optimisation approach that aligns these quantities by adjusting the hyperparameters of the prior, thus "translating" the elicited information into an informative prior.

https://www.repository.cam.ac.uk/handle/1810/349554
2023-05-05T15:46:18Z
dc.title: Bayesian model-based clustering of multi-source data
dc.contributor.author: Coleman, Stephen
dc.description.abstract: Inferring a partition of a dataset can help in downstream analyses and decision making. However, there often exist many feasible partitions, which makes the problem of inferring clusters challenging. A demanding problem is analysis of data generated across multiple sources. Bayesian mixture models and their extensions are effective tools for partition inference in this setting as we can use these to describe and infer the relationship between different sources. I consider applying such methods to two cases of multi-source data: multi-view, where the same items have data generated across different contexts, and multi-batch, where the same measurements are taken on sets of items.
I develop and explore a consensus clustering approach to navigate the problem of poor mixing, which refers to a failure of Markov chain Monte Carlo methods wherein the sampler becomes trapped in local high posterior density modes. This problem is commonly encountered when seeking to infer latent structure in high-dimensional data. I propose running many short Markov chains in parallel and using the final sample from each chain. My results suggest that performing inference this way frequently better describes model uncertainty than individual long chains. I use the method in a multi-omics analysis of the cell cycle of Saccharomyces cerevisiae and identify biologically meaningful structure.
I subsequently implement Multiple Dataset Integration (MDI), a Bayesian integrative clustering method, in C++ with a wrapper in R, correcting an error that was present in previous implementations, and extending MDI to be semi-supervised. My implementation allows a range of models for a variety of different data types, such as t-augmented mixtures of Gaussians and Gaussian processes. I then consider a semi-supervised multi-omics analysis of the model apicomplexan, Toxoplasma gondii.
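One way to summarise the final samples from many short chains is a co-clustering (consensus) matrix. The Python sketch below is illustrative only: the abstract does not say how the final samples are combined, so the use of a consensus matrix, the function name `consensus_matrix` and the toy partitions are assumptions rather than the thesis implementation.

```python
from typing import List


def consensus_matrix(partitions: List[List[int]]) -> List[List[float]]:
    """Proportion of chains in which each pair of items shares a cluster label."""
    n_items = len(partitions[0])
    counts = [[0] * n_items for _ in range(n_items)]
    for labels in partitions:
        for i in range(n_items):
            for j in range(n_items):
                # Only co-membership matters, so label switching is harmless.
                if labels[i] == labels[j]:
                    counts[i][j] += 1
    n_chains = len(partitions)
    return [[c / n_chains for c in row] for row in counts]


# Hypothetical final samples (one partition of four items per short chain).
final_samples = [
    [0, 0, 1, 1],
    [1, 1, 0, 0],  # same partition as the first chain, with labels switched
    [0, 0, 0, 1],
    [0, 0, 1, 1],
]
cm = consensus_matrix(final_samples)
for row in cm:
    print(row)
```

Entries close to 0 or 1 indicate pairs on which the chains agree, while intermediate values reflect uncertainty about whether two items belong together.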
In my final content chapter I consider the problem of analysing data generated across multiple batches. Such data can have structural differences which should be accounted for when inferring a partition. I propose a mixture model that includes both cluster/class and batch parameters to simultaneously model batch effects upon location and scale with the partition. I validate my method in a simulation study and using held-out seroprevalence data, and compare it to existing methods.
Finally, I discuss the state of the field of Bayesian mixture models and some potential future research directions.

https://www.repository.cam.ac.uk/handle/1810/344825
2023-01-05T01:40:55Z
dc.title: Adaptive Designs and Methods for More Efficient Drug Development
dc.contributor.author: Serra, Alessandra
dc.description.abstract: The development of a novel drug is a time-consuming and expensive process. Innovative trial designs and optimal sequences of clinical trials aim to increase the efficiency of this process by improving flexibility and maximising the use of accumulated information throughout the trials while minimizing the number of patients that are exposed to unsafe or ineffective regimens.
In Chapter 2 of this thesis, we focus on confirmatory trials, which are among the largest contributors to cost and time in the later stages of the drug development process. We consider a clinical trial setting where multiple treatment arms are studied concurrently and an ‘order’ (i.e. a monotonic relationship) among the treatment effects can be assumed. We propose a novel design which incorporates the information about the order into the decision-making, without assuming any parametric arm-response model, while controlling error rates. We compare the performance of this novel approach with currently used trial designs, and in Chapter 3 we describe its application to the design of an actual trial in tuberculosis. In Chapter 4, we propose a Bayesian extension of the design described in Chapter 2 that relaxes the order assumption and incorporates historical information. This is needed for settings where, for example, the increase in side effects or compliance with the treatment leads to a reduced efficacy of the treatment and, hence, violation of the order assumption. We compare this design with other competing approaches that do not consider uncertainty in the order.
In Chapter 5, we focus on the whole drug development process. We consider an oncology trial setting and we compare two sequences of clinical trials, the first targeting the whole patient population and the second a molecularly defined subgroup within the population. We propose a metric to quantify the expected clinical benefit of these two strategies. In addition, for each strategy we measure the cost of development as the expected proportion of patients enrolled over the total common sample size. We illustrate a performance evaluation of the proposed metric in an actual trial.

https://www.repository.cam.ac.uk/handle/1810/343014
2022-11-09T01:41:38Z
dc.title: Modularized Bayesian Inference: Methodology, Algorithm, Theory And Application.
dc.contributor.author: Liu, Yang
dc.description.abstract: Bayesian inference has had a powerful impact on understanding and explaining data and their generating mechanisms, but misspecification of the model is a major threat to the validity of the inference. Although methods that deal with misspecification have been developed and their properties have been studied, these methods are mainly established on the premise that the whole model is misspecified. Since the real mechanism of the data generating process is often complex and many factors can affect the observation and collection of data, the reliability of the model may vary widely across its components and lead to partial misspecification. Dealing with such partial misspecification for robust inference remains challenging and requires comprehensive studies of its methodology, algorithm, theory and potential application.
Modularized Bayesian inference has been developed as a robust alternative to standard Bayesian inference for partial misspecification. As a particular form, cut inference completely removes the influence from misspecified components and involves a cut distribution which differs from the standard posterior distribution. Existing algorithms which sample from this cut distribution suffer from unclear convergence properties or slow computations. A novel algorithm named the stochastic approximation cut algorithm (SACut) is proposed in this thesis. The theoretical and computational properties of the SACut algorithm are studied.
A general framework of cut inference beyond a generic two-module case, where one component is assumed to be misspecified, is not clear. In particular, the definition of what a "module" is remains vague in the literature. Furthermore, implementing cut inference for an arbitrary multiple-module case remains an open question. Solving these basic questions is appealing and necessary. This thesis formulates rules that one should follow to implement cut inference within an arbitrary model structure, covering the definition of modules, the determination of relationships between modules, and the construction of the cut distribution.
Semi-Modular inference bridges the gap between standard Bayesian inference and cut inference through the use of a likelihood with a power term. Interestingly, this feature corresponds to the geographically weighted regression (GWR) model, which was developed to handle spatial non-stationarity but has hitherto not been extended to Bayesian inference except in the case of Gaussian regression. This thesis proposes the Bayesian GWR model as a certain multiple-module case of Semi-Modular inference. The theory of Semi-Modular inference is extended to the multiple-module case to justify the Bayesian GWR model.
Modularized Bayesian inference remains a young and emerging topic. As one of the many pioneering works that extend modularized Bayesian inference to a broader range of statistical models, this thesis will, it is hoped, inform future developments of methodology and algorithms, and stimulate applications of modularized Bayesian inference.

https://www.repository.cam.ac.uk/handle/1810/332884
2022-01-25T03:56:41Z
dc.title: Statistical methods to improve understanding of the genetic basis of complex diseases
dc.contributor.author: Hutchinson, Anna
dc.description.abstract: Robust statistical methods, utilising the vast amounts of genetic data that is now available, are required to resolve the genetic aetiology of complex human diseases including immune-mediated diseases. Essential to this process is firstly the use of genome-wide association studies (GWAS) to identify regions of the genome that determine the susceptibility to a given complex disease. Following this, identified regions can be fine-mapped with the aim of deducing the specific sequence variants that are causal for the disease of interest.
Functional genomic data is now routinely generated from high-throughput experiments. This data can reveal clues relating to disease biology, for example elucidating the functional genomic annotations that are enriched for disease-associated variants. In this thesis I describe a novel methodology based on the conditional false discovery rate (cFDR) that leverages functional genomic data with genetic association data to increase statistical power for GWAS discovery whilst controlling the FDR. I demonstrate the practical potential of my method through applications to asthma and type 1 diabetes (T1D) and validate my results using the larger, independent, UK Biobank data resource.
Fine-mapping is used to derive credible sets of putative causal variants in associated regions from GWAS. I show that these sets are generally over-conservative due to the fact that fine-mapping data sets are not randomly sampled, but are instead sampled from a subset of those with the largest effect sizes. I develop a method to derive credible sets that contain fewer variants whilst still containing the true causal variant with high probability. I use my method to improve the resolution of fine-mapping studies for T1D and ankylosing spondylitis. This enables a more efficient allocation of resources in the expensive functional follow-up studies that are used to elucidate the true causal variants from the prioritised sets of variants.
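For orientation, the standard way of building a credible set from per-variant posterior probabilities can be sketched in Python as follows. This shows only the usual construction (rank variants, then accumulate probability until a target coverage is reached); it is not the adjusted method developed in the thesis, and the variant names, probabilities and function name `credible_set` are hypothetical.

```python
from typing import Dict, List, Tuple


def credible_set(posterior_probs: Dict[str, float],
                 coverage: float = 0.95) -> Tuple[List[str], float]:
    """Smallest set of top-ranked variants whose summed probability >= coverage."""
    ranked = sorted(posterior_probs.items(), key=lambda kv: kv[1], reverse=True)
    chosen: List[str] = []
    total = 0.0
    for variant, prob in ranked:
        chosen.append(variant)
        total += prob
        if total >= coverage:
            break  # target coverage reached, stop adding variants
    return chosen, total


# Hypothetical posterior probabilities of causality within one region.
pp = {"rs_a": 0.60, "rs_b": 0.25, "rs_c": 0.08, "rs_d": 0.05, "rs_e": 0.02}
variants_95, mass_95 = credible_set(pp, coverage=0.95)
variants_90, mass_90 = credible_set(pp, coverage=0.90)
print(variants_95, round(mass_95, 2))
print(variants_90, round(mass_90, 2))
```

A smaller set at the same coverage means fewer variants to carry forward into functional follow-up, which is the practical motivation given in the paragraph above.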
Whilst GWAS investigate genome-wide patterns of association, it is likely that studying a specific biological factor using a variety of data sources will give a more detailed perspective on disease pathogenesis. Taking a more holistic approach, I utilise a variety of genetic and functional genomic data in a range of statistical genetics techniques to try to decipher the role of the Ikaros family of transcription factors in T1D pathogenesis. I find that T1D-associated variants are enriched in Ikaros binding sites in immune-relevant cell types, but that there is no evidence of epistatic effects between causal variants residing in the Ikaros gene region and variants residing in genome-wide binding sites of Ikaros, thus suggesting that these sets of variants are not acting synergistically to influence T1D risk.
Together, in this thesis I develop and examine a range of statistical methods to aid understanding of the genetic basis of complex human diseases, with application specifically to immune-mediated diseases.

https://www.repository.cam.ac.uk/handle/1810/329148
2021-12-06T13:40:43Z
dc.title: Approaches to developing clinically useful Bayesian risk prediction models
dc.contributor.author: Karapanagiotis, Solon
dc.description.abstract: Prediction of the presence of disease (diagnosis) or an event in the future course of disease (prognosis) is becoming increasingly important in the current era of personalised medicine. Both tasks (diagnosis and prognosis) are supported using (risk) prediction models. Such models usually combine multiple variables by using different statistical and/or machine learning approaches. Recent advances in prediction models have improved diagnostic and prognostic accuracy, in some cases surpassing the performance of clinicians. However, evidence is lacking that deployment of these models has improved care and patient outcomes. That is, their clinical usefulness is debatable. One barrier to demonstrating such improvement is the basis used to evaluate their performance. In this thesis, we explore methods for developing (building and evaluating) risk prediction models, in an attempt to create clinically useful models.
We start by introducing a few commonly used metrics to evaluate the predictive performance of prediction models. We then show that a model with good predictive performance is not enough to guarantee clinical usefulness. A well performing model can be clinically useless, and a poor model valuable. Following a recent line of work, we adopt a decision theoretic approach for model evaluation that allows us to determine whether the model would change medical decisions and, if so, whether the outcome of interest would improve as a result.
We then apply this approach to investigate the clinical usefulness of including information about circulating tumour DNA (ctDNA) when predicting response to treatment in metastatic breast cancer. ctDNA has been proposed as a promising approach to assess response to treatment. We show that incorporating trajectories of circulating tumour DNA results in a clinically useful model and can improve clinical decisions.
However, an inherent limitation of the decision theoretic approach (and related ones) is that model building and evaluation are done independently. During training, the prediction model is agnostic of the clinical consequences of its use. That is, the prediction model is agnostic of its (clinical) purpose, e.g., which type of classification error is more costly (i.e., undesirable).
We address this shortcoming by introducing Tailored Bayes (TB), a novel Bayesian inference framework which “tailors” model fitting to optimise predictive performance with respect to unbalanced misclassification costs. In both simulated and real-world applications, we find our approach to perform favourably in comparison to standard Bayesian methods.
We then move to extend the framework to situations where a large number of (potentially irrelevant) variables are measured. Such high-dimensional settings represent a ubiquitous challenge in modern scientific research. We introduce a sparse TB framework for variable selection and find that TB favours smaller models (with fewer variables) compared to standard Bayesian methods, whilst performing better or no worse. This pattern was seen both in simulated and real data. In addition, we show the relative importance of the variables changes when we consider unbalanced misclassification costs.

https://www.repository.cam.ac.uk/handle/1810/329024
2023-09-17T00:50:21Z
2021-10-23T00:00:00Z
dc.title: Weighting and moment conditions in Bayesian inference
dc.contributor.author: Yiu, Andrew
dc.description.abstract: The work presented in this thesis was motivated by the goal of developing Bayesian methods for "weighted" biomedical data. To be more specific, we are referring to probability weights, which are used to adjust for distributional differences between the sample and the population. Sometimes, these differences occur by design; data collectors can choose to implement an unequal probability sampling frame to optimize efficiency subject to constraints. If so, the probability weights are known and are traditionally equal to the inverse of the unit sampling probabilities. It is often the case, however, that the sampling mechanism is unknown. Methods that use estimated weights include so-called doubly robust estimators, which have become popular in causal inference.
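The role of known probability weights can be illustrated with a small Python sketch of an inverse-probability-weighted mean. It is a generic normalised (Hájek-type) weighted estimator shown under the assumption that the sampling probabilities are known; the data, the probabilities and the function name `ipw_mean` are hypothetical and are not drawn from the thesis.

```python
from typing import Sequence


def ipw_mean(values: Sequence[float], sampling_probs: Sequence[float]) -> float:
    """Weighted mean with weights equal to the inverse sampling probabilities."""
    weights = [1.0 / p for p in sampling_probs]
    weighted_sum = sum(w * y for w, y in zip(weights, values))
    return weighted_sum / sum(weights)  # normalise by the total weight


# Hypothetical sample: the third unit was sampled with lower probability,
# so it stands in for more units of the population.
values = [10.0, 10.0, 20.0]
probs = [0.5, 0.5, 0.25]

unweighted = sum(values) / len(values)
weighted = ipw_mean(values, probs)
print(unweighted, weighted)
```

Here the under-sampled unit receives weight 4 against weight 2 for the others, which pulls the estimate towards its value relative to the unweighted sample mean.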
There is a lack of consensus regarding the role of probability weights in Bayesian inference. In some settings, it is reasonable to believe that conditioning on certain observed variables is sufficient to adjust for selection; the sampling mechanism is then deemed "ignorable" in a Bayesian analysis. In Chapter 2, we develop a Bayesian approach for case-cohort data that ignores the sampling mechanism and outperforms existing methods, including those that involve inverse probability weighting. Our approach showcases some key strengths of the Bayesian paradigm, namely the marginalization of nuisance parameters and the availability of sophisticated computational techniques from the MCMC literature. We analyse data from the EPIC-Norfolk cohort study to investigate the associations between saturated fatty acids and incident type-2 diabetes.
However, ignoring the sampling is not always beneficial. For a variety of popular problems, weighting offers the potential for increased robustness, efficiency and bias-correction. It is also of interest to consider settings where sampling is nonignorable, but weights are available (only) for the selected units. This is tricky to handle in a conventional Bayesian framework; one must either make ad-hoc adjustments, or attempt to model the distribution of the weights. The latter is infeasible without additional untestable assumptions if the weights are not exact probability weights (e.g. due to trimming or calibration). By contrast, weighting methods are usually simple to implement in this context and are virtually model-free.
Chapters 3 and 4 develop approaches that are capable of combining weighting with Bayesian modelling. A key ingredient is to define target quantities as the solutions to moment conditions, as opposed to "true" components of parametric models. By doing so, the quantities coincide with the usual definitions if working model assumptions hold, but retain the interpretation of being projections if the assumptions are violated. This allows us to nonparametrically model the data-generating distribution and obtain the posterior of the target quantity implicitly. Crucially, our approaches still enable the user to directly specify their prior for the target quantity, in contrast to common nonparametric Bayesian models like Dirichlet processes.
The scope of our methodology extends beyond our original motivations. In particular, we can tackle a whole class of problems that would ordinarily be handled using estimating equations and robust variance estimation. Such problems are often called semiparametric because we are interested in estimating a finite-dimensional parameter in the presence of an infinite-dimensional nuisance parameter. Chapter 4 studies examples such as linear regression with heteroscedastic errors, and quantile regression.

https://www.repository.cam.ac.uk/handle/1810/324765
2021-12-06T07:53:39Z
2021-11-27T00:00:00Z
dc.title: Curtailed phase II binary outcome trials and adaptive multi-outcome trials
dc.contributor.author: Law, Martin
dc.description.abstract: Phase II clinical trials are a critical aspect of the drug development process. With drug development costs ever increasing, novel designs that can improve the efficiency of phase II trials are extremely valuable.
Phase II clinical trials for cancer treatments often measure a binary outcome. The final trial decision is generally to continue or cease development. When this decision is based solely on the result of a hypothesis test, the result may be known with certainty before the planned end of the trial. Unfortunately though, there is often no opportunity for early stopping when this occurs.
Some existing designs do permit early stopping in this case, accordingly reducing the required sample size and potentially speeding up drug development. However, more improvements can be achieved by stopping early when the final trial decision is very likely, rather than certain, known as stochastic curtailment. While some authors have proposed approaches of this form, these approaches have limitations, such as relying on simulation, considering relatively few possible designs and not permitting early stopping when a treatment is promising.
In this thesis we address these limitations by proposing design approaches for single-arm and two-arm phase II binary outcome trials. We use exact distributions, avoiding simulation, consider a wider range of possible designs and permit early stopping for promising treatments. As a result, we are able to obtain trial designs that have considerably reduced sample sizes on average.
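The curtailment idea described above can be sketched in Python for a single-arm trial with a binary outcome. This is a simplified illustration using an exact binomial calculation, not the design-search procedure of the thesis: the maximum sample size `n`, the success threshold `r`, the assumed response rate `p` and the function name `prob_final_success` are all hypothetical.

```python
from math import comb


def prob_final_success(n: int, r: int, m: int, s: int, p: float) -> float:
    """P(at least r responses among n patients | s responses after m patients).

    The responses of the remaining n - m patients are assumed to be
    independent Bernoulli(p), so their total is Binomial(n - m, p).
    """
    remaining = n - m
    needed = r - s
    if needed <= 0:
        return 1.0  # threshold already reached: success is certain
    if needed > remaining:
        return 0.0  # threshold can no longer be reached: failure is certain
    return sum(comb(remaining, k) * p**k * (1 - p) ** (remaining - k)
               for k in range(needed, remaining + 1))


# Hypothetical design: 20 patients, success declared at 6 or more responses.
certain_fail = prob_final_success(n=20, r=6, m=17, s=2, p=0.2)
certain_pass = prob_final_success(n=20, r=6, m=10, s=6, p=0.2)
likely_fail = prob_final_success(n=20, r=6, m=15, s=3, p=0.2)
print(certain_fail, certain_pass, likely_fail)
```

The first two calls correspond to stopping when the final decision is already certain, and the third to stochastic curtailment, where a trial might stop because the conditional probability of success is small. Both the cut-off applied to that probability and the value of `p` used to compute it are design choices that this sketch leaves open.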
Following this, we switch attention to the fact that clinical trials often measure multiple outcomes of interest. Existing multi-outcome designs focus almost entirely on evaluating whether all outcomes show evidence of efficacy or whether at least one outcome shows evidence of efficacy. While a small number of authors have provided multi-outcome designs that evaluate when a general number of outcomes show promise, these designs have been single-stage in nature only. We therefore propose two designs, of group-sequential and drop-the-loser form, that provide this design characteristic in a multi-stage setting. Previous such multi-outcome multi-stage designs have allowed only for a maximum of two outcomes; our designs thus also extend previous related proposals by permitting any number of outcomes.