Trajectories for PPMI

The z-transformed and binarized UPDRS score was used as the target variable, and the other clinical data were used as predictor variables, to compare the performance of xgboost, random forest and elastic net for binary classification. Performance was evaluated as the average of the test AUC values obtained in each testing phase of cross-validation. The solid horizontal line in each box plot in Figure 19 shows the median AUC value for that method.
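The evaluation described above can be sketched in a few lines of standard-library Python. This is only an illustration under stated assumptions, not the pipeline used in this work: the binarization cut-off of 0 on the z-score, the toy UPDRS values and the per-fold predicted scores are all hypothetical, and the AUC is computed directly from its pairwise definition rather than with a library routine.

```python
from statistics import mean, median, pstdev


def z_transform(values):
    """Standardize values to zero mean and unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]


def binarize(z_scores, threshold=0.0):
    """Label 1 when the z-score exceeds the threshold, else 0.

    The threshold of 0 is an assumption made for this sketch.
    """
    return [1 if z > threshold else 0 for z in z_scores]


def auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a correct pair.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    correct = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return correct / (len(pos) * len(neg))


# Hypothetical UPDRS scores turned into a binary target.
updrs = [10, 20, 30, 40]
target = binarize(z_transform(updrs))

# Hypothetical test-fold labels and predicted scores from two CV folds.
folds = [
    ([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]),
    ([0, 1, 0, 1], [0.2, 0.9, 0.3, 0.6]),
]
fold_aucs = [auc(y, s) for y, s in folds]
mean_auc = mean(fold_aucs)      # average test AUC across folds
median_auc = median(fold_aucs)  # value drawn as the line in a box plot
```

Repeating this for each learning algorithm gives one list of fold AUCs per method, which is the quantity summarized by a box plot.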

The xgboost model showed the best performance, with a median AUC value of 0.778, and was therefore used for feature selection.

An xgboost model trained on the entire data set was used to extract the most relevant and least redundant features. The importance of each feature was obtained using functions from the xgboost package. Of the three components of importance (gain, cover and frequency), gain was selected to interpret the relative importance of each feature; it quantifies the information gained by splitting a decision tree on that particular feature. In total, 160 clinical variables showed positive gain and were selected for Bayesian modelling. The top twenty features are shown in Figure 20. Table 2 gives the description of each variable and the group to which it belongs.
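The selection rule described above (keep every feature with positive gain, then rank by gain) can be illustrated with a minimal standard-library Python sketch. It does not call the xgboost package; the split records, the gain values and the `UNUSED` feature name are hypothetical, and the per-feature normalization to a fraction of total gain is an assumption about how the relative importance is reported.

```python
from collections import defaultdict


def relative_gain(splits):
    """Sum the gain of all splits per feature and normalize to fractions of the total."""
    totals = defaultdict(float)
    for feature, gain in splits:
        totals[feature] += gain
    grand_total = sum(totals.values())
    if grand_total <= 0:
        raise ValueError("total gain must be positive")
    return {feature: gain / grand_total for feature, gain in totals.items()}


def select_features(importance, candidates, top_k=None):
    """Rank candidates by gain (descending) and keep those with positive gain.

    Features that were never used in a split have gain 0 and are dropped.
    """
    ranked = sorted(
        ((f, importance.get(f, 0.0)) for f in candidates),
        key=lambda item: item[1],
        reverse=True,
    )
    kept = [(f, g) for f, g in ranked if g > 0]
    return kept if top_k is None else kept[:top_k]


# Hypothetical (feature, split gain) records collected from the trees of a model.
splits = [("UPDRS1", 6.0), ("QUIP", 1.0), ("UPDRS1", 2.0), ("HRSUP", 1.0)]
candidates = ["UPDRS1", "QUIP", "HRSUP", "UNUSED"]

importance = relative_gain(splits)
selected = select_features(importance, candidates)       # all positive-gain features
top_two = select_features(importance, candidates, top_k=2)
```

In this toy example `UNUSED` never appears in a split, so it is excluded, in the same way that variables without positive gain are left out of the Bayesian modelling step.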

Table: Variable: top twenty clinical features ranked by information gain. Description: explanation of the variable. Group: data type to which the variable belongs.

| Variable | Description | Group |
|---|---|---|
| UPDRS1 | UPDRS I: evaluation of mentation, behavior, and mood | UPDRS |
| NP3RIGU_CL | 3.3c Rigidity - UE - Contralateral | UPDRS |
| NP3RIGL_CL | 3.3e Rigidity - LE - Contralateral | UPDRS |
| QUIP | Questionnaire for Impulsive-Compulsive Disorders in PD | Non-motor |
| ALDH1A1..rep.2. | ALDH1A1 (rep 2) (Ct) | Biological |
| HRSUP | Supine heart rate | Medical history |
| Abeta.42 | Abeta 42 (pg/ml) | Biological |
| SYSSUP | Supine BP - systolic | Medical history |
| RBD.pos | RBD Positive: RBD >= 5 | Non-motor |
| STAI.Trait | STAI - Trait Subscore | Non-motor |
| UPDRS2 | UPDRS II: self-evaluation of the activities of daily life (ADLs) | UPDRS |
| HSPA8..rep.1 | HSPA8 (rep 1) (Ct) | Biological |
| NP3FTAPL | 3.4b Finger Tapping Left Hand | UPDRS |
| tTau.Abeta | t-tau/Abeta 1-42 | Biological |
| NP3TTAP_IL | 3.7 Toe tapping - foot - Ipsilateral | UPDRS |
| GAPDH..rep.2. | GAPDH (rep 2) (Ct) | Biological |
| DIASTND | Standing BP - diastolic | Medical history |
| UPSIT | University of Pennsylvania Smell ID Test (UPSIT) | Non-motor |

Figure 1: Boxplot of AUC values obtained during repeated cross-validation for the extreme gradient boosting (xgboost), random forest and elastic net learning algorithms. The higher the AUC, the better the model performance.

Figure 2: Relative importance of the top twenty clinical variables for prediction of the clinical endpoint. The higher the gain score, the greater the predictive contribution of the clinical feature in stratifying patients into slow-progressing PD and fast-progressing PD.

The traditional univariate approach to modelling a disease has limited power to assess disease state and progression. By combining a panel of biomarkers, we seek to provide a method that discriminates between slow progressors and fast progressors. The xgboost model uses a combination of data from different groups of biomarkers to divide the heterogeneous patient population into two more homogeneous groups.

The different data modalities are reflected in the selected features. At baseline, we obtain a combination of biological, imaging, non-motor and patient history features, which can be used to stratify patients and to indicate the rate of disease progression.