PPMI Analyses
Recent efforts to understand Parkinson’s Disease (PD) have identified a number of factors that affect PD progression. However, to fully understand the aetiology of this complex multifactorial disorder, it is important to understand how clinical (demographic, cognitive and functional tests), molecular (omics), genomic (SNPs, copy number variants) and neuroimaging (brain volumes) factors affect each other over time. With this in mind, a longitudinal Bayesian model of PD progression was built using data from the Parkinson’s Progression Markers Initiative (PPMI) to assess dependencies between multivariate factors over the course of the disease. This analysis also demonstrates the use of a Bayesian network (BN) to describe the entire course of a multivariate clinical trial while accounting for patient dropout.
This study uses data from 362 de novo PD patients in PPMI. These untreated subjects had had PD for two years or less. Recruited patients were required to have shown signs of resting tremor, bradykinesia, and rigidity within the last two years. PPMI data comprise eight cohorts and encompass 831 clinical variables. The clinical variables fall into six groups, which are further divided into 52 subgroups. These groups include:

Patient: e.g. demographic, medical history and socioeconomic data.

Imaging: e.g. DaTscan and MRI scans.

Medical history: e.g. physical exam, clinical diagnosis, diagnosis features, PD medication, vital signs.

Nonmotor: e.g. Benton Judgment of Line Orientation Test (BJLOT), Epworth Sleepiness Scale (ESS).

Biological: e.g. DNA, RNA, serum, plasma, cerebrospinal fluid.

In addition, 1038 SNP risk factors were retrieved from SCAIView; the 58 SNPs that mapped to de novo PD subjects were also included in the analysis.
Baseline clinical variables with less than 50% missing data were selected, and the values of these variables were also extracted at months 3, 6, 9, 12, 18, 24, 30, 36, 42, 48 and 54. The resulting longitudinal data had considerable missing observations due to patient dropout. Auxiliary variables were used to systematically mitigate (i) the loss of data due to patient dropout over the course of the study and (ii) potential model bias resulting from correlation of dropout with other variables.
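One way to picture such auxiliary variables is as per-visit missingness indicators. The sketch below is only an illustration under that assumption: the text does not state how the auxiliary variables were constructed, and the record layout, the example values and the helper `dropout_indicators` are hypothetical.

```python
from typing import Dict, Optional

# Visit months as listed in the text, with baseline taken as month 0.
VISIT_MONTHS = [0, 3, 6, 9, 12, 18, 24, 30, 36, 42, 48, 54]

def dropout_indicators(record: Dict[int, Optional[float]]) -> Dict[int, int]:
    """Return 1 for each visit month with no observed value, else 0."""
    return {m: int(record.get(m) is None) for m in VISIT_MONTHS}

# Hypothetical patient: one skipped visit (month 9), lost to follow-up after month 12.
patient = {0: 21.0, 3: 22.5, 6: 23.0, 9: None, 12: 25.0}
aux = dropout_indicators(patient)
```

Indicators of this kind can then enter the model as ordinary variables, so that other variables can be conditioned on whether a visit was observed.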
Next, this mixed-type data was imputed using missForest, a nonparametric imputation method based on random forests. The method predicts the missing values of each variable from the values of the other variables. The auxiliary variables ensured that parameter estimation was conditioned on the pattern of missingness in the data.
The data are relatively high dimensional. To increase the chance of identifying the correct (causal) network during later BN structure learning, the search space was first reduced substantially, both by reducing the dimensionality of the data and by adding constraints on possible edges. Dimensionality was reduced using autoencoders, which provide one condensed representation for the multiple input variables belonging to one feature group. One meta-feature per group was thus obtained at every clinical visit.
Before learning the topology of the BN, the temporal ordering of visits was encoded using blacklist and whitelist arguments, so that edges can point from a variable at visit t to a variable at visit t + 1 but not in the opposite direction. Prior information on possible edges among feature groups was also passed to the learning algorithm through the blacklist and whitelist arguments (Figure 1). For example, edges from the UPDRS score to the biological score were blacklisted.
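The edge constraints described above can be sketched as the construction of a blacklist of directed edges. This is a minimal illustration, assuming that "not otherwise" means edges may not point back in time; the node naming scheme, the group list and the helper `temporal_blacklist` are hypothetical and not taken from the study.

```python
from itertools import product
from typing import List, Set, Tuple

Edge = Tuple[str, str]  # (from_node, to_node)

def temporal_blacklist(groups: List[str], n_visits: int,
                       forbidden_pairs: Set[Tuple[str, str]]) -> Set[Edge]:
    """Collect directed edges that structure learning should never add."""
    nodes = [(g, t) for g in groups for t in range(n_visits)]
    blacklist: Set[Edge] = set()
    for (g_from, t_from), (g_to, t_to) in product(nodes, nodes):
        if (g_from, t_from) == (g_to, t_to):
            continue  # no self-loops to consider
        backwards = t_to < t_from                      # later visit -> earlier visit
        forbidden = (g_from, g_to) in forbidden_pairs  # prior knowledge on groups
        if backwards or forbidden:
            blacklist.add((f"{g_from}_v{t_from}", f"{g_to}_v{t_to}"))
    return blacklist

# Hypothetical setup: three feature groups, two visits, UPDRS -> Biological forbidden.
groups = ["UPDRS", "Biological", "Imaging"]
bl = temporal_blacklist(groups, n_visits=2,
                        forbidden_pairs={("UPDRS", "Biological")})
```

A set of edge pairs like this could be converted to whatever blacklist format the chosen structure learning library expects.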
As a first step in choosing a network learning algorithm, constraint-based structure learning algorithms (Max-Min Parents and Children, HITON Parents and Children), score-based structure learning algorithms (hill climbing, tabu search) and hybrid structure learning algorithms (restricted maximisation, Max-Min Hill Climbing) were compared on the average negative log-likelihood loss obtained in 10-fold cross-validation [101]. The algorithm with the lowest average negative log-likelihood loss was used to learn the graphical structure of the BN. The progression model built using group scores is a restrictive but representative model of the evolution of the disease in early-stage PD patients (Figure 2).
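The selection criterion used for comparing algorithms, the lowest average negative log-likelihood over 10 folds, can be illustrated on a toy problem. In this sketch two hand-written candidate structures over two binary variables stand in for the networks that the structure learning algorithms would return; the data, the candidates and all function names are hypothetical.

```python
import math
from typing import Callable, Dict, List, Tuple

Row = Tuple[int, int]           # toy binary variables (a, b)
Model = Callable[[Row], float]  # returns the log-probability of a row

def fit_independent(train: List[Row]) -> Model:
    """Empty graph: a and b are modelled independently."""
    n = len(train)
    pa = (sum(a for a, _ in train) + 1) / (n + 2)  # Laplace smoothing
    pb = (sum(b for _, b in train) + 1) / (n + 2)
    def logp(row: Row) -> float:
        a, b = row
        return math.log(pa if a else 1 - pa) + math.log(pb if b else 1 - pb)
    return logp

def fit_edge(train: List[Row]) -> Model:
    """Graph with one edge a -> b: b has a conditional table given a."""
    n = len(train)
    pa = (sum(a for a, _ in train) + 1) / (n + 2)
    pb_given = {}
    for v in (0, 1):
        bs = [b for a, b in train if a == v]
        pb_given[v] = (sum(bs) + 1) / (len(bs) + 2)
    def logp(row: Row) -> float:
        a, b = row
        pb = pb_given[a]
        return math.log(pa if a else 1 - pa) + math.log(pb if b else 1 - pb)
    return logp

def cv_loss(data: List[Row], fit: Callable[[List[Row]], Model], k: int = 10) -> float:
    """Average negative log-likelihood per held-out row over k folds."""
    fold_losses = []
    for i in range(k):
        test = data[i::k]
        train = [r for j, r in enumerate(data) if j % k != i]
        model = fit(train)
        fold_losses.append(-sum(model(r) for r in test) / len(test))
    return sum(fold_losses) / k

# Toy data: b mostly copies a, with 10 of 100 rows flipped.
data = [(i % 2, i % 2) for i in range(90)] + [(i % 2, 1 - i % 2) for i in range(10)]
candidates: Dict[str, Callable[[List[Row]], Model]] = {
    "independent": fit_independent,
    "edge": fit_edge,
}
losses = {name: cv_loss(data, fit) for name, fit in candidates.items()}
best = min(losses, key=losses.get)  # candidate with the lowest average loss
```

Because b depends strongly on a in the toy data, the candidate with the edge attains the lower held-out loss and is selected.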
Because a BN is a generative model, it allows sampling of virtual PD patients. The idea of virtual PD patients proposed in this work holds promise both for better data privacy and for increasing the statistical power of trials by potentially increasing cohort sample size. Because virtual patient data are not actual patient data but an abstraction of them, the framework can support privacy-preserving data distribution.
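Sampling virtual patients from a BN amounts to ancestral sampling: each node is drawn after its parents, from its conditional distribution given the sampled parent values. The sketch below uses a hypothetical two-node network with made-up probabilities, purely to show the mechanics; it is not the learned PPMI model.

```python
import random
from typing import Dict, List

# Hypothetical network: severity at visit 1 -> severity at visit 2.
# All probabilities are illustrative placeholders.
P_V1 = {"mild": 0.7, "severe": 0.3}
P_V2_GIVEN_V1 = {
    "mild":   {"mild": 0.8, "severe": 0.2},
    "severe": {"mild": 0.1, "severe": 0.9},
}

def draw(dist: Dict[str, float], rng: random.Random) -> str:
    """Draw one category from a discrete distribution."""
    states = list(dist)
    return rng.choices(states, weights=[dist[s] for s in states], k=1)[0]

def sample_virtual_patients(n: int, seed: int = 0) -> List[Dict[str, str]]:
    """Ancestral sampling: parents first, then children given their parents."""
    rng = random.Random(seed)
    patients = []
    for _ in range(n):
        v1 = draw(P_V1, rng)
        v2 = draw(P_V2_GIVEN_V1[v1], rng)
        patients.append({"visit1": v1, "visit2": v2})
    return patients

virtual = sample_virtual_patients(1000)
```

Each sampled record follows the joint distribution encoded by the network without corresponding to any real patient.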
The generative and predictive nature of the model made it possible to use it for prediction, in silico testing of treatment effects on PD patients, and what-if analyses. The model was used for a virtual clinical trial and for simulation of drug intervention without making any assumptions about the mechanism of drug action.
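A what-if analysis of this kind can be sketched as fixing one node to a chosen value and recomputing the distribution of downstream nodes. The example below reuses a hypothetical two-node network with made-up probabilities; how interventions were actually encoded in the study is not stated here, so this shows only the general idea.

```python
from typing import Dict

# Hypothetical network: severity at visit 1 -> severity at visit 2.
# All probabilities are illustrative placeholders.
P_V1 = {"mild": 0.7, "severe": 0.3}
P_V2_GIVEN_V1 = {
    "mild":   {"mild": 0.8, "severe": 0.2},
    "severe": {"mild": 0.1, "severe": 0.9},
}

def marginal_v2(p_v1: Dict[str, float]) -> Dict[str, float]:
    """Marginal of visit 2, summing over the given distribution of visit 1."""
    out = {"mild": 0.0, "severe": 0.0}
    for s1, p1 in p_v1.items():
        for s2, p2 in P_V2_GIVEN_V1[s1].items():
            out[s2] += p1 * p2
    return out

# No intervention: visit 1 follows its own distribution.
observational = marginal_v2(P_V1)

# Simulated intervention: visit 1 is forced to "mild" for every virtual patient.
intervened = marginal_v2({"mild": 1.0, "severe": 0.0})
```

In a larger network, forcing a node to a value also means ignoring the edges that point into it, so that only its descendants respond to the simulated intervention.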