We employed a Gradient Boosting Machine (GBM; Friedman 2002) to develop a non-parametric time-to-event model with many predictor variables. GBMs are well suited to heterogeneous data with differing distributional and numerical characteristics, as in our patient data. Moreover, GBMs are typically sparse: they select only a subset of variables for model building, i.e. they have embedded feature selection. Missing values are implicitly treated as a separate “category” of each variable. We performed multi-modal data integration by running GBM-based feature selection for clinical data, knowledge-based SNPs, pathway impact scores and population structure. All selected features were then joined and a final GBM model was built. To evaluate the GBM model we performed 10-times repeated 10-fold cross-validation.
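The repeated cross-validation scheme can be illustrated with a minimal sketch. It only generates the train/test index splits (10 repeats of 10 folds, 100 model fits in total); the function name, the seed and the sample size of 900 are illustrative assumptions, and no GBM is fitted here.

```python
import random

def repeated_kfold(n_samples, n_splits=10, n_repeats=10, seed=0):
    """Yield (repeat, fold, train_idx, test_idx) for repeated k-fold CV."""
    rng = random.Random(seed)
    for repeat in range(n_repeats):
        indices = list(range(n_samples))
        rng.shuffle(indices)  # a fresh random partition for every repeat
        # Deal the shuffled indices round-robin so fold sizes differ by at most one.
        folds = [indices[k::n_splits] for k in range(n_splits)]
        for fold, test_idx in enumerate(folds):
            held_out = set(test_idx)
            train_idx = [i for i in indices if i not in held_out]
            yield repeat, fold, sorted(train_idx), sorted(test_idx)

# Assumed sample size (~900 patients); each split would train and score one model.
splits = list(repeated_kfold(n_samples=900))
```

Within one repeat every patient is held out exactly once, so averaging a performance measure over all 100 splits reduces the dependence on any single random partition.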
We also recorded the prediction error (Brier score) on held-out test data as a function of time (bottom right), showing a clear advantage of the GBM model over a Kaplan-Meier estimator (a non-parametric model without predictor variables). The GBM model also yields well-separated predicted event curves for patients who were diagnosed with AD within 12 months versus those who were healthy at the end of the study.
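For reference, the Kaplan-Meier baseline can be sketched in a few lines. This is a generic textbook implementation, not the code used in our analysis, and the toy follow-up times below are invented purely for illustration.

```python
def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct time with at least one observed event.

    times  : follow-up time per patient
    events : 1 if the diagnosis was observed at that time, 0 if right-censored
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        n_events = 0
        n_removed = 0
        # Collect all patients sharing the same follow-up time.
        while i < len(data) and data[i][0] == t:
            n_events += data[i][1]
            n_removed += 1
            i += 1
        if n_events > 0:
            survival *= 1.0 - n_events / n_at_risk
            curve.append((t, survival))
        # Both events and censored patients leave the risk set after time t.
        n_at_risk -= n_removed
    return curve

# Toy data (invented): months to AD diagnosis or to censoring.
toy_times = [6, 12, 12, 18, 24, 30]
toy_events = [1, 1, 0, 1, 0, 0]
km_curve = kaplan_meier(toy_times, toy_events)
```

Because the estimator uses no predictor variables, it assigns every patient the same event-free probability at a given time, which is why it serves as the reference when judging the Brier score of the GBM model.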
After assessing the prediction performance of the GBM model, we trained it on all available data and determined the selected features. For the next step (Bayesian Network learning), missing values were imputed via missForest. For Bayesian Network structure learning, six different algorithms were employed and compared via cross-validation. Structure learning was constrained by biological knowledge: for example, all genomic features (SNPs, pathways, principal components) could only have other genomic features as input, and the time-to-diagnosis was forced to be the endpoint, i.e. it could not have any outgoing edges. The best algorithm (tabu search) was then applied in a non-parametric bootstrap procedure to identify statistically stable edges. These were then compared against the literature.
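One common way to encode such constraints is a blacklist of forbidden directed edges that is passed to the structure learning algorithm. The sketch below builds such a list from the two rules named above; the node names and the type labels ("genomic", "clinical", "endpoint") are hypothetical and chosen only for illustration.

```python
from itertools import permutations

def build_blacklist(node_types, endpoint):
    """Return the set of forbidden directed edges as (parent, child) pairs."""
    forbidden = set()
    for parent, child in permutations(node_types, 2):
        # Rule 1: genomic nodes may only receive edges from other genomic nodes.
        if node_types[child] == "genomic" and node_types[parent] != "genomic":
            forbidden.add((parent, child))
        # Rule 2: the clinical endpoint may not have outgoing edges.
        if parent == endpoint:
            forbidden.add((parent, child))
    return forbidden

# Hypothetical nodes; the real network contains many more variables.
node_types = {
    "snp_a": "genomic",
    "pathway_b": "genomic",
    "memory_score": "clinical",
    "time_to_dx": "endpoint",
}
blacklist = build_blacklist(node_types, endpoint="time_to_dx")
```

Every edge not on the blacklist remains a candidate, so the constraints shrink the search space without fixing any part of the structure in advance.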
Bayesian Networks describe (conditional) probabilistic dependencies between variables. They constitute generative models of complex multivariate statistical distributions. Bayesian Networks can be used to model and (partially) learn causal relationships from multi-modal data. Structure learning of Bayesian Networks is a computationally hard task due to the huge search space and requires specific algorithms. Prior knowledge (e.g. that SNPs cannot be influenced by clinical variables) thus plays a vital role in constraining the space of possible network structures. Bayesian Networks are a highly flexible framework and can be used for longitudinal and predictive modeling as well as for simulations of “what-if” scenarios.
Altogether, our data matrix now comprised 70 clinical baseline features (neuropsychological assessments, neuroimaging features, …), ~300 literature-derived SNPs, ~300 pathway scores and 32 top principal components. These data were available for ~900 unrelated patients (kinship and inbreeding coefficients < 0.1) with a normal or MCI diagnosis at baseline. The clinical endpoint was the (right-censored) time to first AD diagnosis in the trial.
Statistically stable edges in the Bayesian Network likely represent causal relationships, provided there are no confounding external or unmeasured factors. Statistical stability is assessed by the frequency with which an edge appears in 1000 network structures learned from bootstrap samples repeatedly drawn from the original data. The Bayesian Network captures interactions between brain regions, neuropsychological assessments, pathway impact scores, SNPs and the clinical endpoint.
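The bootstrap stability assessment can be sketched as follows. The structure learner is passed in as a function, and `toy_learner` is a deliberately trivial stand-in (not tabu search) so that the example runs on its own; the 0.5 cut-off for calling an edge stable is likewise an assumption made for illustration.

```python
import random
from collections import Counter

def edge_stability(rows, learn_structure, n_boot=1000, seed=0):
    """Fraction of bootstrap replicates in which each directed edge appears."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(n_boot):
        # Draw rows with replacement, same size as the original data set.
        sample = [rng.choice(rows) for _ in rows]
        # Count each edge at most once per learned structure.
        counts.update(set(learn_structure(sample)))
    return {edge: count / n_boot for edge, count in counts.items()}

def toy_learner(sample):
    """Hypothetical stand-in for a real structure learning algorithm."""
    edges = [("a", "b")]  # always recovered
    if sum(row["c"] for row in sample) / len(sample) > 0.7:
        edges.append(("a", "c"))  # recovered only in unusual resamples
    return edges

toy_rows = [{"c": i % 2} for i in range(20)]
stability = edge_stability(toy_rows, toy_learner, n_boot=200)
stable_edges = {edge for edge, freq in stability.items() if freq >= 0.5}
```

Edges that survive resampling in most replicates are less likely to be artifacts of a particular sample, which is the motivation for reporting only the stable ones.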
In summary, we used multilevel clinical data from around 900 normal and MCI patients, which include:
- Neuropsychological assessment scores,
- CSF Biomarkers,
- Genomic features,
- Neuro-imaging features.
Based on these data, we developed a model predicting the AD risk for each individual patient. Our preliminary results indicate high prediction performance, with a concordance index of ~85%. The model can be used to stratify patients based on their disease risk. Dependencies between relevant features were estimated via a Bayesian Network approach. Ongoing work focuses on investigating the overlap with the existing literature-derived OpenBEL AD cause-effect relationships. The rationale is to identify mechanisms that are found in both the data and the literature.
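For readers unfamiliar with the concordance index, a minimal sketch of its simplest (Harrell) form for right-censored data is given here. We assume this variant only for illustration; the toy values are invented and unrelated to our results.

```python
def concordance_index(times, events, risk_scores):
    """Harrell's C: share of comparable pairs ranked correctly by predicted risk."""
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # A pair is comparable if patient i has an observed event
            # strictly before patient j's event or censoring time.
            if events[i] == 1 and times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5  # ties in predicted risk count half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable

# Toy data (invented): months of follow-up, event indicator, predicted risk.
toy_c = concordance_index(
    times=[6, 12, 18, 24],
    events=[1, 1, 0, 0],
    risk_scores=[0.9, 0.4, 0.5, 0.1],
)
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect ordering of patients by risk, so a value of ~85% means that in roughly 85 of 100 comparable pairs the patient diagnosed earlier received the higher predicted risk.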