tranSMART - clinical data repository


Figure 1: Integrated storage

In the example, the data were collected in different formats, spread over different files and written in different languages. Though valuable, they are disparate and lack a common structure. They have to be transformed before they can be analysed further or even processed together as a single dataset. After data curation and harmonisation, we load them into tranSMART, a translational medicine platform that enables data integration. This gives the dataset a structure, allowing it to be explored and shared easily.


Figure 2: Structured data from semi-structured and unstructured data in heterogeneous formats

Integration of heterogeneous datasets requires extraction, transformation, and loading (ETL) processes to harmonise their representation. Data are added to the tranSMART database by mapping the variables to a data schema via standard templates or mapping files. Mapping files are generated for the curated data files, following the tranSMART standard file formats expected by the ETL scripts. Additional data can be associated with the subject-level data and linked via these mapping files. For instance, datasets that include expression data need additional files describing the platforms used in the experiment. These platform mapping files map each probe ID on the platform to its corresponding GeneID and Gene Symbol.
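The probe-to-gene mapping described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the column names (`probe_id`, `gene_id`, `gene_symbol`), the probe IDs and the helper functions are hypothetical and do not reproduce the actual tranSMART platform template or its ETL scripts.

```python
import csv
import io

# Hypothetical platform annotation file: one row per probe.
# Column names and probe IDs are assumptions for illustration only;
# they are not the tranSMART standard template.
PLATFORM_TSV = """probe_id\tgene_id\tgene_symbol
probe_001\t7124\tTNF
probe_002\t3569\tIL6
"""


def load_platform_map(text):
    """Build a lookup from probe ID to (GeneID, Gene Symbol)."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return {row["probe_id"]: (row["gene_id"], row["gene_symbol"]) for row in reader}


def annotate(expression_rows, platform_map):
    """Attach GeneID and Gene Symbol to each expression row.

    Rows whose probe is missing from the platform file are not dropped
    silently; their probe IDs are returned separately for curation.
    """
    annotated, unmapped = [], []
    for row in expression_rows:
        hit = platform_map.get(row["probe_id"])
        if hit is None:
            unmapped.append(row["probe_id"])
            continue
        gene_id, gene_symbol = hit
        annotated.append({**row, "gene_id": gene_id, "gene_symbol": gene_symbol})
    return annotated, unmapped
```

Returning the unmapped probe IDs separately makes gaps in the platform file visible during curation, before anything is loaded.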

The ETL process for loading data into tranSMART ensures that all integrated data use unique identifiers and share a uniform structure. Beyond the benefits of integrating heterogeneous data, this also makes the data easy to share in the future. The standardised structure fosters data exchange in the scientific community, which is a prerequisite for many translational medicine projects and for expert teams spanning multiple subject areas.

Linking heterogeneous data and hypothesis generation
Platforms like tranSMART help to integrate disparate datasets in a common environment. This allows the datasets to be explored together and analysed in support of research hypotheses. By integrating data from heterogeneous sources, tranSMART also serves as a collaboration platform. It enables code-free data exploration and interactive visual analytics, and thus brings together researchers from different areas of expertise (biologists or clinicians and bioinformaticians or statisticians). The data can also be easily exported for further in-depth analysis.
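As a rough sketch of what such downstream analysis of exported data might look like, the following Python snippet computes a Pearson correlation between two numeric columns of a tab-separated export. The export layout and the column names passed in are assumptions for illustration, not the actual tranSMART export format, and the function names are hypothetical.

```python
import csv
import io
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant column")
    return sxy / math.sqrt(sxx * syy)


def correlate_columns(tsv_text, col_x, col_y):
    """Correlate two columns of a tab-separated export.

    Subjects with a missing or non-numeric value in either column
    are skipped (pairwise deletion).
    """
    xs, ys = [], []
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        try:
            x, y = float(row[col_x]), float(row[col_y])
        except (TypeError, ValueError):
            continue
        xs.append(x)
        ys.append(y)
    return pearson(xs, ys)
```

A call such as `correlate_columns(export_text, "marker", "score")` would return a value between -1 and 1, where `marker` and `score` stand in for whatever column names the export actually contains.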


Figure 3: Summary statistics for a study


Figure 4: Correlation of cognitive functions with cytokine levels in AD subjects


Figure 5: TNF-Alpha in MCI subjects was reported to be negatively correlated with cognitive scores


Figure 6: Correlation of cytokine levels with cognitive tests in AD


Figure 7: Data visualisation using PD map from differentially expressed genes in expression data