PPMI Analyses

Recent efforts to understand Parkinson’s Disease (PD) have identified a number of factors that affect PD progression. However, to fully understand the aetiology of this complex multi-factorial disorder, it is important to understand how clinical (demographics, cognitive tests and functional tests), molecular (omics), genomic (SNPs/copy number variants) and neuroimaging (brain volumes) factors affect each other over time. With this in mind, a longitudinal Bayesian model of PD progression was built using data from the Parkinson’s Progression Markers Initiative (PPMI) to assess dependencies between multivariate factors over the course of the disease. This analysis also demonstrates the use of a Bayesian network (BN) to describe the entire course of a multivariate clinical trial while accounting for patient drop-out.

This study uses data from 362 de novo PD patients in PPMI. These untreated subjects had been diagnosed with PD for two years or less. Recruited patients were required to have shown signs of resting tremor, bradykinesia, and rigidity within the last two years. The PPMI data comprises eight cohorts and encompasses 831 clinical variables. The clinical variables can be categorized into six groups and further into 52 sub-groups. The six groups are as follows:

  • Patient: e.g. demographic, medical history and socio-economic data.

  • Imaging: e.g. DaTscan and MRI scans.

  • Medical history: e.g. physical exam, clinical diagnosis, diagnosis features, PD medication, vital signs.

  • Non-motor: e.g. Benton Judgment of Line Orientation Test (BJLOT), Epworth Sleepiness Scale (ESS).

  • Biological: e.g. DNA, RNA, serum, plasma, cerebrospinal fluid.

  • UPDRS: Unified Parkinson’s Disease Rating Scale (UPDRS) scores.

A total of 1038 SNP risk factors were retrieved from SCAIView; the 58 SNPs that mapped to the de novo PD subjects were also included in the analysis.

Baseline clinical variables with less than 50% missing data were selected, and the values of these variables were also extracted at months 3, 6, 9, 12, 18, 24, 30, 36, 42, 48 and 54. The resulting longitudinal data had a considerable number of missing observations due to patient drop-out. Auxiliary variables were used to systematically mitigate (i) the loss of data due to patient drop-out over the course of the study and (ii) potential model bias resulting from correlation of drop-out with other variables.
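The selection rule and the idea of auxiliary variables can be illustrated with a minimal sketch. The text does not specify how the auxiliary variables were constructed, so the per-visit "missed visit" indicator below is an assumption, as are all variable names and the toy values.

```python
from typing import Dict, List, Optional

# Visit months named in the text; month 0 stands for baseline (assumption).
VISIT_MONTHS = [0, 3, 6, 9, 12, 18, 24, 30, 36, 42, 48, 54]


def missing_fraction(column: List[Optional[float]]) -> float:
    """Share of entries in a column that are missing (None)."""
    return sum(value is None for value in column) / len(column)


def select_variables(baseline: Dict[str, List[Optional[float]]],
                     threshold: float = 0.5) -> List[str]:
    """Keep variables with strictly less than `threshold` missing at baseline."""
    return [name for name, column in baseline.items()
            if missing_fraction(column) < threshold]


def dropout_indicators(attended_months: List[int]) -> Dict[str, int]:
    """Hypothetical auxiliary variables: 1 if a visit was missed, else 0."""
    attended = set(attended_months)
    return {f"aux_missed_m{month}": int(month not in attended)
            for month in VISIT_MONTHS}


# Toy baseline table (made-up values), one list entry per patient.
baseline = {
    "age": [61.0, 70.0, None, 55.0],          # 25% missing, kept
    "csf_marker": [None, None, None, 410.0],  # 75% missing, dropped
    "updrs_total": [20.0, None, 31.0, 18.0],  # 25% missing, kept
}
selected = select_variables(baseline)

# A patient who dropped out after the month-12 visit.
aux = dropout_indicators([0, 3, 6, 9, 12])
```

Including such indicators as extra nodes lets a downstream model condition on the drop-out pattern instead of silently ignoring it.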

In the next step, this mixed-type data was imputed using missForest, a non-parametric imputation method based on random forests. The method predicts the missing values of each variable from the values of the other variables. The auxiliary variables ensured that parameter estimation was conditioned on the pattern of missingness in the data.

The data is relatively high-dimensional. To increase the chance of identifying the correct (causal) network during later BN structure learning, we first reduced the search space substantially, both by reducing the dimensionality of the data and by adding constraints on the possible edges. The dimensionality was reduced with autoencoders, each of which provided one condensed representation for the multiple input variables belonging to one feature group. As a result, one meta feature per group was obtained at every clinical visit.
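To make the "one meta feature per group" idea concrete, the following is a minimal sketch of an autoencoder with a single bottleneck unit. It is deliberately simplified: it is linear, has no bias terms, and is trained with plain full-batch gradient descent, none of which is claimed to match the architecture used in the study. The toy data, learning rate and function names are assumptions.

```python
import random


def train_linear_autoencoder(rows, lr=0.01, epochs=500):
    """Fit x_hat = v * (w . x) by full-batch gradient descent on squared error."""
    d, n = len(rows[0]), len(rows)
    w = [0.1] * d  # encoder weights (d inputs -> 1 code)
    v = [0.1] * d  # decoder weights (1 code -> d outputs)
    for _ in range(epochs):
        grad_w, grad_v = [0.0] * d, [0.0] * d
        for x in rows:
            z = sum(wi * xi for wi, xi in zip(w, x))        # bottleneck code
            err = [vi * z - xi for vi, xi in zip(v, x)]      # reconstruction error
            back = sum(ei * vi for ei, vi in zip(err, v))    # error propagated to z
            for j in range(d):
                grad_v[j] += 2 * err[j] * z / n
                grad_w[j] += 2 * back * x[j] / n
        w = [wi - lr * g for wi, g in zip(w, grad_w)]
        v = [vi - lr * g for vi, g in zip(v, grad_v)]
    return w, v


def encode(w, x):
    """Meta feature: the single bottleneck value for one patient visit."""
    return sum(wi * xi for wi, xi in zip(w, x))


def reconstruction_loss(w, v, rows):
    """Mean squared reconstruction error over all rows."""
    total = 0.0
    for x in rows:
        z = encode(w, x)
        total += sum((vi * z - xi) ** 2 for vi, xi in zip(v, x))
    return total / len(rows)


# Toy feature group: three correlated variables driven by one latent factor.
rng = random.Random(0)
rows = []
for _ in range(50):
    t = rng.gauss(0, 1)
    rows.append([t + rng.gauss(0, 0.1), 2 * t + rng.gauss(0, 0.1), -t + rng.gauss(0, 0.1)])

initial_loss = reconstruction_loss([0.1] * 3, [0.1] * 3, rows)
w, v = train_linear_autoencoder(rows)
final_loss = reconstruction_loss(w, v, rows)
meta_feature = [encode(w, x) for x in rows]  # one value per patient visit
```

In this sketch one such model would be trained per feature group, so that each group contributes a single node per visit to the network.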

Before learning the topology of the BN, the temporal ordering of visits was encoded using blacklist and whitelist arguments, such that edges from visit t can point to visit t + 1 but not in the reverse direction. Prior information on possible edges among feature groups was also fed into the learning algorithm using blacklist and whitelist arguments (Figure 1). For example, edges from the UPDRS score to the biological score were blacklisted.


Figure 1: Edges allowed in the Bayesian network (BN). The graph illustrates the allowed dependencies between the six groups of features (Biological, Non-motor, Imaging, UPDRS, Patient, Medical History). The BN was hence restricted to pick edges only from the set of depicted dependencies.
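The edge constraints described above can be sketched as a blacklist of (from, to) node pairs. This is an illustration only: the node naming scheme is hypothetical, edges within the same visit are assumed to be allowed, and the only group-level rule included is the UPDRS to Biological example given in the text.

```python
from itertools import product

GROUPS = ["Patient", "Imaging", "MedicalHistory", "NonMotor", "Biological", "UPDRS"]

# Group-level edges ruled out by prior knowledge; only the example from the text.
FORBIDDEN_GROUP_EDGES = {("UPDRS", "Biological")}


def node(group, visit):
    """Hypothetical node name for a group's meta feature at a given visit index."""
    return f"{group}_V{visit}"


def build_blacklist(groups, n_visits, forbidden_group_edges):
    """Return the set of (from_node, to_node) edges the learner must not use."""
    nodes = [(g, t) for t in range(n_visits) for g in groups]
    blacklist = set()
    for (g_from, t_from), (g_to, t_to) in product(nodes, repeat=2):
        if (g_from, t_from) == (g_to, t_to):
            continue  # self-loops are not edges in a BN anyway
        backward_in_time = t_from > t_to
        forbidden_by_prior = (g_from, g_to) in forbidden_group_edges
        if backward_in_time or forbidden_by_prior:
            blacklist.add((node(g_from, t_from), node(g_to, t_to)))
    return blacklist


blacklist = build_blacklist(GROUPS, n_visits=2, forbidden_group_edges=FORBIDDEN_GROUP_EDGES)
```

A list of this shape is what structure learning routines that accept a blacklist argument typically expect: one row per forbidden directed edge.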

As a first step towards choosing a network learning algorithm, constraint-based structure learning algorithms (Max-Min Parents and Children, HITON Parents and Children), score-based structure learning algorithms (hill-climbing, tabu search) and hybrid structure learning algorithms (restricted maximisation, Max-Min Hill-Climbing) were compared on the average negative log-likelihood loss obtained in 10-fold cross-validation [101]. The algorithm with the lowest average negative log-likelihood loss was used to learn the graphical structure of the BN. The progression model built using group scores is a restrictive but representative model of the evolution of the disease in early-stage PD patients (Figure 2).


Figure 2: Subgraph of the learned network. The nodes represent meta features and the edges represent conditional dependencies between nodes. Each node is defined by a weighted combination of variables, wherein the influence of a variable on the node is given by its relative weight.
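The algorithm comparison by cross-validated negative log-likelihood follows a generic model-selection loop, sketched below. The two candidate "learners" are simple one-dimensional Gaussian density estimators that merely stand in for the structure learning algorithms named above; they, the toy data and all function names are assumptions made for illustration.

```python
import math
import random
from statistics import mean, pvariance


def k_fold_indices(n, k, seed=0):
    """Shuffle 0..n-1 and split into k roughly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def cv_negative_log_likelihood(fit, data, k=10, seed=0):
    """Average held-out negative log-likelihood over k folds."""
    fold_losses = []
    for held_out in k_fold_indices(len(data), k, seed):
        held = set(held_out)
        train = [x for i, x in enumerate(data) if i not in held]
        test = [data[i] for i in held_out]
        nll = fit(train)  # returns a function: observation -> negative log-likelihood
        fold_losses.append(mean(nll(x) for x in test))
    return mean(fold_losses)


def fit_gaussian(train):
    """Stand-in learner A: Gaussian with fitted mean and variance."""
    mu, var = mean(train), pvariance(train)
    return lambda x: 0.5 * math.log(2 * math.pi * var) + (x - mu) ** 2 / (2 * var)


def fit_unit_gaussian(train):
    """Stand-in learner B: Gaussian with fitted mean but variance fixed to 1."""
    mu = mean(train)
    return lambda x: 0.5 * math.log(2 * math.pi) + (x - mu) ** 2 / 2


rng = random.Random(1)
data = [rng.gauss(5.0, 3.0) for _ in range(200)]  # toy data, true sd = 3

candidates = {"fitted_variance": fit_gaussian, "unit_variance": fit_unit_gaussian}
scores = {name: cv_negative_log_likelihood(fit, data) for name, fit in candidates.items()}
best = min(scores, key=scores.get)  # lowest average loss wins
```

The same loop applies unchanged when each candidate is a structure learning algorithm followed by parameter fitting, as long as it returns a function that scores held-out observations.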

Because a BN is a generative model, it allows sampling of virtual PD patients. The idea of virtual PD patients proposed in this work holds promise both to improve data privacy and to increase the statistical power of trials by potentially increasing the sample size of a cohort. Because virtual patient data is not actual patient data but an abstraction of it, the framework can be used for privacy-preserving data sharing and distribution.
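Sampling virtual patients from a BN amounts to ancestral sampling: each node is drawn in topological order from its conditional distribution given the already sampled parents. The sketch below uses a tiny discrete network whose structure, states and probabilities are entirely hypothetical and are not taken from the learned model.

```python
import random

# Hypothetical toy network, listed in topological order.
# Each CPT maps a tuple of parent states to a distribution over the node's states.
NETWORK = {
    "Imaging_V0": {
        "parents": [],
        "cpt": {(): {"low": 0.4, "high": 0.6}},
    },
    "UPDRS_V0": {
        "parents": ["Imaging_V0"],
        "cpt": {
            ("low",): {"mild": 0.3, "severe": 0.7},
            ("high",): {"mild": 0.8, "severe": 0.2},
        },
    },
    "UPDRS_V1": {
        "parents": ["UPDRS_V0"],
        "cpt": {
            ("mild",): {"mild": 0.7, "severe": 0.3},
            ("severe",): {"mild": 0.1, "severe": 0.9},
        },
    },
}


def sample_patient(network, rng):
    """Draw one virtual patient by sampling nodes parent-first."""
    patient = {}
    for name, spec in network.items():  # dicts keep insertion (topological) order
        parent_states = tuple(patient[p] for p in spec["parents"])
        distribution = spec["cpt"][parent_states]
        states, probs = zip(*distribution.items())
        patient[name] = rng.choices(states, weights=probs, k=1)[0]
    return patient


def sample_cohort(network, n, seed=0):
    """Draw n independent virtual patients."""
    rng = random.Random(seed)
    return [sample_patient(network, rng) for _ in range(n)]


cohort = sample_cohort(NETWORK, n=2000)
share_high = sum(p["Imaging_V0"] == "high" for p in cohort) / len(cohort)
```

Because the cohort size `n` is a free parameter, a virtual cohort can be made larger than the original one, which is the basis of the sample-size argument above.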

The generative and predictive nature of the model made it possible to use it for prediction, in silico testing of treatment effects on PD patients and what-if analyses. The model was used for virtual clinical trials and simulation of drug interventions without making any assumptions about the mechanism of drug action.
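One common way to phrase such a what-if analysis in a BN is as an intervention: a node is clamped to a chosen value, its incoming edges are ignored, and the effect on downstream nodes is computed. The text does not describe how interventions were implemented, so the following is only a sketch of that general idea on a hypothetical three-node chain with made-up probabilities.

```python
from itertools import product

STATES = ("mild", "severe")

# Hypothetical chain: NonMotor_V0 -> UPDRS_V0 -> UPDRS_V1
P_NONMOTOR = {"mild": 0.5, "severe": 0.5}
P_UPDRS0 = {  # P(UPDRS_V0 | NonMotor_V0)
    "mild": {"mild": 0.8, "severe": 0.2},
    "severe": {"mild": 0.3, "severe": 0.7},
}
P_UPDRS1 = {  # P(UPDRS_V1 | UPDRS_V0)
    "mild": {"mild": 0.7, "severe": 0.3},
    "severe": {"mild": 0.1, "severe": 0.9},
}


def updrs1_distribution(do_updrs0=None):
    """P(UPDRS_V1), optionally under the intervention do(UPDRS_V0 = value)."""
    result = {state: 0.0 for state in STATES}
    for nm, u0, u1 in product(STATES, repeat=3):
        if do_updrs0 is None:
            p_u0 = P_UPDRS0[nm][u0]                 # natural mechanism
        else:
            p_u0 = 1.0 if u0 == do_updrs0 else 0.0  # clamped by the intervention
        result[u1] += P_NONMOTOR[nm] * p_u0 * P_UPDRS1[u0][u1]
    return result


observational = updrs1_distribution()
treated = updrs1_distribution(do_updrs0="mild")  # what if the baseline score were mild?
```

Here the intervention acts only on the value of a node, so nothing has to be assumed about how a drug would bring that value about.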