Links between ADNI-derived BN in AD
The analysis of the ADNI dataset has a two-fold goal: first, to develop a prediction model able to detect a patient's transition from the normal or mildly cognitively impaired stage to AD, and second, to understand the dependencies between the features identified by the model that best predict AD progression.
The first step in reconstructing the dependencies was to assign the nodes present in the BN to mechanistic subgraphs of NeuroMMSig. Once these mappings were generated, the edges (dependencies) whose two nodes were both successfully assigned to NeuroMMSig subgraphs could be used to reconstruct mechanistic networks using the OpenBEL model. For that purpose, we developed an algorithmic approach that uses path-finding algorithms for the reconstruction.
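The path-finding step can be illustrated with a minimal sketch. It assumes that the knowledge network is available as a directed adjacency mapping and that a BN edge has already been mapped to two nodes of that network; the function name, the graph layout, and the toy node labels are hypothetical and are not taken from the actual implementation. A breadth-first search is used here simply as one common path-finding technique.

```python
from collections import deque


def shortest_path(graph, source, target):
    """Return one shortest directed path from source to target, or None.

    graph is a dict mapping each node to an iterable of its successors
    (a hypothetical, simplified stand-in for a mechanistic network).
    """
    if source == target:
        return [source]
    parents = {source: None}  # also serves as the visited set
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbour in graph.get(node, ()):
            if neighbour in parents:
                continue
            parents[neighbour] = node
            if neighbour == target:
                # Walk back through the parents to rebuild the path.
                path = [target]
                while parents[path[-1]] is not None:
                    path.append(parents[path[-1]])
                return path[::-1]
            queue.append(neighbour)
    return None


# Toy network with placeholder labels (not real NeuroMMSig content).
toy_network = {
    "gene_1": ["protein_1"],
    "protein_1": ["process_1", "process_2"],
    "process_1": [],
    "process_2": ["process_1"],
}

# A BN dependency between two mapped nodes is explained by a path, if any.
print(shortest_path(toy_network, "gene_1", "process_1"))
```

When no path exists between the two mapped nodes, the function returns None, which in this sketch would mean that the dependency cannot be explained by the available network.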
Since the BN integrates several scales of biology, such as demographic features, genetic variants, and pathways from KEGG (Kanehisa et al., 2008) and Reactome (Fabregat et al., 2016), a manual mapping of each node was necessary. An example of this mapping is illustrated in Figure 7, where the SNP rs405509 is assigned to its corresponding gene in NeuroMMSig and pathways such as the TGF-beta signalling pathway from KEGG are mapped to the TGF-beta subgraph in NeuroMMSig. For the pathways in particular, their gene sets were used to assess the statistical significance of the mapping via a hypergeometric test corrected for multiple testing using the Benjamini-Yekutieli method, which remains valid under dependency (more detail in D.3.3.4). This provides not only a manual mapping but also statistical evidence that a common core of genes is shared between each NeuroMMSig network and its corresponding canonical pathway.
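As a rough illustration of the statistical step, the following sketch computes a hypergeometric tail probability for the overlap of two gene sets and applies the Benjamini-Yekutieli adjustment to a list of p-values. It uses only the Python standard library. The function names, the size of the gene universe, and the toy gene sets are assumptions made for illustration; the procedure actually used is the one described in D.3.3.4. In practice, library routines such as scipy.stats.hypergeom and the "fdr_by" method of statsmodels' multipletests could be used instead.

```python
from math import comb


def hypergeom_pvalue(universe_size, set_a, set_b):
    """Probability of sharing at least |A & B| genes by chance.

    universe_size is the assumed number of genes that could be drawn.
    """
    shared = len(set_a & set_b)
    n_a, n_b = len(set_a), len(set_b)
    tail = sum(
        comb(n_a, i) * comb(universe_size - n_a, n_b - i)
        for i in range(shared, min(n_a, n_b) + 1)
    )
    return tail / comb(universe_size, n_b)


def benjamini_yekutieli(pvalues):
    """Return Benjamini-Yekutieli adjusted p-values in the input order."""
    m = len(pvalues)
    c_m = sum(1.0 / i for i in range(1, m + 1))  # harmonic correction
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Go from the largest to the smallest p-value to keep monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m * c_m / rank)
        adjusted[idx] = running_min
    return adjusted


# Toy gene sets with placeholder identifiers (hypothetical data).
canonical_pathway = {"g1", "g2", "g3", "g4", "g5", "g6"}
candidate_subgraphs = {
    "subgraph_x": {"g1", "g2", "g3", "g9"},
    "subgraph_y": {"g7", "g8", "g9"},
}
universe = 1000  # assumed size of the gene universe

names = list(candidate_subgraphs)
raw = [hypergeom_pvalue(universe, canonical_pathway, candidate_subgraphs[n])
       for n in names]
for name, p, q in zip(names, raw, benjamini_yekutieli(raw)):
    print(name, p, q)
```

The choice of gene universe strongly affects the resulting p-values, so it should be fixed before any comparison is made.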
While the statistical method used to assess the overlap between a pair of gene sets already provides a valid rationale for assuming that they share functional impact, we went one step further and examined how the method performs when calculating KEGG versus KEGG pathway overlaps, in other words, the overlaps among the KEGG pathways themselves. As Figure 8 shows, there is extensive overlap across KEGG pathways, for a variety of reasons. For example, one could argue that pathway boundaries are subjective and vary from expert to expert. Furthermore, there exist groups of genes that are involved in multiple pathways, not to mention that pathways can affect one another. Therefore, the results shown in the heat map are not surprising.
Analogous to statistical methods, overlap estimations can be made by calculating the proportion of shared genes for each pathway pair (pair of gene sets). Figure 9 presents this approach over KEGG using different thresholds (% of shared components relative to the smallest pathway). As an illustration, consider two pathways with gene sets of size 10 and 20 that share 5 genes: the overlap for this pair is 50%, that is, 5 (the number of shared genes) divided by 10 (the size of the smallest gene set). Figure 9 shows how the number of overlapping pathways varies as this threshold changes, which makes it possible to identify the pathways with the highest overlap. Not surprisingly, cancer and metabolic pathways had the highest overlap with the rest. It is worth noting that KEGG does not follow a hierarchical structure like Reactome. This is important because, if we conducted the same exercise using Reactome gene sets, the pathways at the top of the hierarchy (major and minor pathways) would always overlap with their children, since the gene set of a minor pathway is fully contained within that of its major pathway.
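The overlap measure described above can be sketched as follows. The function names and the toy gene identifiers are hypothetical; the example reproduces the 10-gene versus 20-gene case with 5 shared genes and then filters pathway pairs by a threshold, in the spirit of Figure 9.

```python
from itertools import combinations


def overlap_fraction(set_a, set_b):
    """Shared genes divided by the size of the smallest gene set."""
    smallest = min(len(set_a), len(set_b))
    if smallest == 0:
        return 0.0  # an empty gene set cannot overlap with anything
    return len(set_a & set_b) / smallest


def pairs_at_or_above(gene_sets, threshold):
    """List the pathway pairs whose overlap reaches the threshold."""
    return [
        (name_a, name_b)
        for (name_a, genes_a), (name_b, genes_b)
        in combinations(gene_sets.items(), 2)
        if overlap_fraction(genes_a, genes_b) >= threshold
    ]


# Toy gene sets (placeholder identifiers, not real KEGG content).
pathway_small = {f"g{i}" for i in range(10)}      # g0 .. g9, 10 genes
pathway_large = {f"g{i}" for i in range(5, 25)}   # g5 .. g24, 20 genes
pathway_other = {"g100", "g101", "g102"}

print(overlap_fraction(pathway_small, pathway_large))  # 5 / 10 = 0.5

toy_sets = {
    "small": pathway_small,
    "large": pathway_large,
    "other": pathway_other,
}
print(pairs_at_or_above(toy_sets, 0.5))
```

Normalising by the smallest set means that a gene set fully contained in another always scores 100%, which is why a hierarchical resource such as Reactome would produce trivially high overlaps between parent and child pathways.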
The KEGG overlap calculations served as a control for the next exercise, which applies the same approach to compare KEGG with NeuroMMSig AD instead. The purpose of this comparison is to illustrate that the machine learning model was not only able to accurately predict the time of diagnosis but also identified the majority of the pathways related to well-known AD mechanisms (NeuroMMSig subgraphs). In other words, the model learns from pathways that are related to AD rather than from the other pathways.
The prediction model made use of 110 KEGG pathways (future nodes in the BN) out of the total of 323. Those 110 contained 39 of the 52 KEGG pathways that were successfully mapped to NeuroMMSig gene sets (well-known AD mechanisms). This accounts for 75% of the mapped NeuroMMSig mechanisms, whereas the proportion expected when selecting 110 KEGG pathways at random would be roughly one third (110/323, about 34%). In future work, we plan to focus on the set of pathways related to AD in order to reduce the dimensionality.
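The arithmetic behind this comparison can be checked with a few lines. The variable names are illustrative; the four counts are the ones quoted above, and the random baseline assumes that every KEGG pathway is equally likely to be selected.

```python
total_kegg = 323        # KEGG pathways available to the model
selected = 110          # KEGG pathways used by the prediction model
mapped = 52             # KEGG pathways mapped to NeuroMMSig gene sets
mapped_selected = 39    # mapped pathways among the selected ones

# Observed share of mapped pathways that the model picked up.
observed_fraction = mapped_selected / mapped              # 0.75

# Under uniform random selection, each pathway is chosen with the same
# probability, so the expected share equals selected / total_kegg.
expected_fraction = selected / total_kegg                 # about 0.34
expected_count = mapped * expected_fraction               # about 17.7

# Ratio of observed to expected mapped pathways.
fold_enrichment = mapped_selected / expected_count        # about 2.2

print(f"observed: {observed_fraction:.2f}")
print(f"expected: {expected_fraction:.2f} ({expected_count:.1f} pathways)")
print(f"fold enrichment: {fold_enrichment:.1f}")
```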
Here, we only present the comparison with KEGG pathways and not with Reactome, because KEGG pathways account for the majority of the mappable nodes in the BN. As we can observe in Figure 10, the number of significantly overlapping pathway pairs is much smaller when comparing KEGG against NeuroMMSig than when comparing KEGG against KEGG. The rationale behind this is that NeuroMMSig gene sets are smaller and disease specific, meaning that each gene set contains genes/proteins that have been associated with AD or its mechanisms in the literature. KEGG and Reactome gene sets, on the other hand, comprise all the knowledge around a particular pathway and are thus inherently not context specific. Therefore, when the BN suggests a dependency between a pair of mappable KEGG/Reactome pathways, we can use NeuroMMSig to hypothesize what might underlie this link according to the literature knowledge in the context of AD.