Mapping Data- and Knowledge-derived Information
As the previous algorithms have shown, mapping clinical data to knowledge-derived graphs is not a straightforward step. There are multiple limitations: molecular features are not always measured in the cohorts, and the cohorts use different, non-standard terminologies to name the measured variables (columns) that are mapped to the knowledge graphs. Even when these limitations do not apply, the problem of data dimensionality usually comes into play.
In order to tackle the high dimensionality of ADNI, a predictive model was created using a set of clinical baseline variables, genomic variables, and pathways; the last two were extracted using a dimensionality reduction technique, in this case principal component analysis (PCA). This methodology is described in detail in D.3.3.4. The set of features that the predictive model used for classification was then used to generate a Bayesian network (BN). This machine learning technique represents dependencies among variables in the form of a directed acyclic graph. Thus, a BN is by nature a graph whose nodes represent the variables and whose edges represent the conditional dependencies.
Although the graphs in NeuroMMSig were derived from knowledge and the BN was derived from data (in this case ADNI), nodes that stand for the same variable, for example clinical readouts, SNPs, or pathways, can be found in both graphs. This allows for graph-to-graph matching as a way of validating against the literature the relationships that machine learning established from the data.
However, the graphs contain nodes at multiple levels of abstraction: for instance, a node may represent the presence or absence of an allele (APOE4), a cognitive score, or a given pathway. It is therefore hard to bring these together and map them to the molecular level, where the literature describes relationships as mechanistic rather than correlational. If this distinction is ignored, we might imply causation when there is nothing but a correlation between a biomarker and a readout. Thus, other types of mapping across different scales need to be implemented.
Due to the lack of nodes that map directly from one graph to the other (BN to NeuroMMSig or vice versa), we developed an algorithm that uses shortest paths between the genesets corresponding to pathways in a reference database (REACTOME or KEGG). Its output is a mechanistic network containing not only the players and the relationships among them but also new candidate players that might explain the link observed in the BN between two pathways (see figure below).
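The shortest-path idea can be illustrated with a minimal Python sketch. It assumes an undirected, unweighted interaction network stored as adjacency sets and takes a single shortest path per gene pair; the function names, the toy network, and the gene labels are placeholders chosen for illustration and are not the implementation used in this work.

```python
from collections import deque
from itertools import product


def shortest_path(graph, source, target):
    """Return one shortest path from source to target (BFS), or None."""
    if source not in graph or target not in graph:
        return None
    previous = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        for neighbor in graph[node]:
            if neighbor not in previous:
                previous[neighbor] = node
                queue.append(neighbor)
    return None


def connect_genesets(graph, geneset_a, geneset_b):
    """Collect the edges on shortest paths between two genesets and flag
    intermediate nodes that belong to neither set as candidate players."""
    edges, candidates = set(), set()
    for a, b in product(sorted(geneset_a), sorted(geneset_b)):
        if a == b:
            continue
        path = shortest_path(graph, a, b)
        if path is None:
            continue
        edges.update(frozenset(pair) for pair in zip(path, path[1:]))
        candidates.update(
            n for n in path if n not in geneset_a and n not in geneset_b
        )
    return edges, candidates


# Toy network (assumption): A1, A2 belong to one pathway, B1, B2 to another,
# and X is a gene annotated to neither pathway.
toy_graph = {
    "A1": {"A2", "X"},
    "A2": {"A1"},
    "X": {"A1", "B1"},
    "B1": {"X", "B2"},
    "B2": {"B1"},
}
edges, candidates = connect_genesets(toy_graph, {"A1", "A2"}, {"B1", "B2"})
print(sorted(candidates))
```

In this toy case the only node lying on the connecting paths without belonging to either geneset is X, which plays the role of a new candidate player. A directed or weighted network, or the enumeration of all shortest paths rather than one per pair, would require a different traversal.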
One of the AETIONOMY approaches used Bayesian networks to identify the dependencies between the variables that best predict conversion from normal/MCI to AD. However, one of the challenges after identifying an interesting list of variables, and in this case also their dependencies, is to interpret their biological meaning. Therefore, to compare the findings of this machine learning approach with existing knowledge, we used NeuroMMSig as a knowledge graph to check whether the literature supports the links between mappable variables in the Bayesian network.
While some variables, such as SNPs, can be mapped directly to the NeuroMMSig AD nodes, other nodes, such as pathways or phenotypic read-outs, have to be mapped manually. Therefore, in order to investigate the dependencies in the Bayesian network, we manually mapped its nodes either to nodes in NeuroMMSig (e.g., SNP to SNP) or to complete mechanisms in NeuroMMSig that are equivalent to KEGG/Reactome pathways (e.g., “Notch signaling pathway” to the NeuroMMSig “Notch signaling in Alzheimer’s disease”). This exercise validated some links, such as those between Adherens Junction and Autophagy, and between Insulin Signaling and Natural Killer (NK) Cell Mediated Cytotoxicity. Furthermore, we developed a tool that allows users to navigate the Bayesian network and explore the corresponding knowledge-derived links in NeuroMMSig (https://neurommsig.scai.fraunhofer.de/bayesian_explorer).
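The validation step can be pictured with a small Python sketch, under the assumption that the manual mapping is kept as a dictionary and that the knowledge-derived links are available as a set of unordered pairs. All node and mechanism labels below are hypothetical placeholders; they are not actual BN nodes or NeuroMMSig entries.

```python
def classify_bn_edges(bn_edges, mapping, knowledge_links):
    """Split BN edges into supported, unsupported, and unmapped.

    An edge is 'unmapped' if either endpoint has no manual mapping,
    'supported' if the mapped endpoints are linked in the knowledge
    graph, and 'unsupported' otherwise.
    """
    result = {"supported": [], "unsupported": [], "unmapped": []}
    for source, target in bn_edges:
        if source not in mapping or target not in mapping:
            result["unmapped"].append((source, target))
            continue
        link = frozenset({mapping[source], mapping[target]})
        key = "supported" if link in knowledge_links else "unsupported"
        result[key].append((source, target))
    return result


# Placeholder inputs (assumptions for illustration only).
bn_edges = [
    ("pathway_p", "pathway_q"),
    ("pathway_p", "snp_r"),
    ("pathway_q", "score_s"),
]
mapping = {
    "pathway_p": "mechanism_P",  # pathway mapped to a complete mechanism
    "pathway_q": "mechanism_Q",
    "snp_r": "snp_r",            # SNP mapped directly to the same SNP
}
knowledge_links = {frozenset({"mechanism_P", "mechanism_Q"})}

report = classify_bn_edges(bn_edges, mapping, knowledge_links)
for category, found in report.items():
    print(category, found)
```

Keeping unmapped edges in their own category separates "no literature support found" from "could not be compared", which are different outcomes when interpreting the BN.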
The first step in the algorithm is to manually map the nodes in the BN to NeuroMMSig genesets (genes that are annotated, specifically in the context of Alzheimer’s, to one of the candidate subgraphs). Once the mapping is in place, the overlap between the canonical pathways in REACTOME or KEGG and their corresponding NeuroMMSig mappings can be assessed with hypergeometric tests, corrected for multiple testing using the Benjamini-Yekutieli method, which remains valid under dependency.
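As a minimal sketch of this overlap statistic, the Python code below computes a one-sided hypergeometric p-value for each pathway/geneset pair and then applies the Benjamini-Yekutieli adjustment. The size of the background gene universe and all counts are assumptions made for illustration, and the function names are hypothetical; this is not the code used to produce the reported statistics.

```python
from math import comb


def hypergeom_pvalue(overlap, pathway_size, geneset_size, universe_size):
    """P(X >= overlap) when geneset_size genes are drawn without
    replacement from a universe containing pathway_size pathway genes."""
    upper = min(pathway_size, geneset_size)
    total = comb(universe_size, geneset_size)
    tail = sum(
        comb(pathway_size, k)
        * comb(universe_size - pathway_size, geneset_size - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


def benjamini_yekutieli(pvalues):
    """Return Benjamini-Yekutieli adjusted p-values in the input order."""
    m = len(pvalues)
    c_m = sum(1.0 / i for i in range(1, m + 1))  # harmonic correction term
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down to keep the result monotone.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, pvalues[index] * m * c_m / rank)
        adjusted[index] = running_min
    return adjusted


# Placeholder counts: (overlap, pathway size, geneset size) per mapped pair,
# with an assumed background universe of 20000 genes.
UNIVERSE_SIZE = 20000
pairs = [(12, 150, 60), (3, 80, 45), (1, 200, 30)]
raw = [hypergeom_pvalue(k, n, s, UNIVERSE_SIZE) for k, n, s in pairs]
for pair, p_adj in zip(pairs, benjamini_yekutieli(raw)):
    print(pair, p_adj)
```

The harmonic term c_m is what distinguishes Benjamini-Yekutieli from Benjamini-Hochberg: it makes the adjustment more conservative in exchange for validity under arbitrary dependency between tests, which is relevant here because pathways and genesets share genes.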