Data curation, normalization and imputation

As research on diseases expands, the data being generated grow in both volume and variety. These data are valuable, but most are unstructured or only partially structured. They therefore require cleaning and structuring, which is both expensive and time consuming and adds to the cost of every step of data processing, from analysis to decision making. Data collected from disparate sources must be harmonised to provide a single “view”; otherwise the datasets remain separate pieces and do not provide the whole picture.

The harmonisation process requires several transformation steps. The first is identifying the relevant sources of data and converting them to a suitable machine-readable format. The next step is data cleaning and quality checking, which includes translating the language where needed and correcting spelling errors, value errors, etc., so that the data become uniform and unambiguous. The data are then transformed into the format accepted by the subsequent steps or the data model, and missing data points are imputed, for instance by deriving them from existing values. Finally, to standardise the data vocabulary, ontologies are applied to both the variables and the values, and values are converted to standard units.
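The cleaning, unit conversion and imputation steps described above can be illustrated with a minimal Python sketch. It is not the pipeline used in the project: the field names (subject, sex, weight, unit), the lookup tables and the choice of median imputation are all assumptions made for illustration, and in practice the controlled terms would come from an ontology rather than a hand-written dictionary.

```python
from statistics import median

# Hypothetical lookup tables (assumptions for this sketch, not project vocabularies).
SEX_TERMS = {"m": "male", "male": "male", "f": "female", "female": "female"}
KG_PER_UNIT = {"kg": 1.0, "g": 0.001, "lb": 0.45359237}


def clean_record(raw):
    """Normalise one raw record: standard sex term, weight converted to kilograms."""
    # Unknown or missing sex values become None instead of being guessed.
    sex = SEX_TERMS.get(str(raw.get("sex", "")).strip().lower())
    weight = raw.get("weight")
    unit = str(raw.get("unit", "kg")).strip().lower()
    weight_kg = None
    if weight not in (None, ""):
        # An unrecognised unit raises KeyError, so value errors surface early.
        weight_kg = float(weight) * KG_PER_UNIT[unit]
    return {"subject": raw["subject"], "sex": sex, "weight_kg": weight_kg}


def impute_median(records, field):
    """Fill missing values of `field` with the median of the observed values."""
    observed = [r[field] for r in records if r[field] is not None]
    fill = median(observed)
    return [dict(r, **{field: fill if r[field] is None else r[field]}) for r in records]


# Made-up example records from two imaginary sources with different conventions.
raw_records = [
    {"subject": "S1", "sex": "M", "weight": "70", "unit": "kg"},
    {"subject": "S2", "sex": "female", "weight": "154", "unit": "lb"},
    {"subject": "S3", "sex": " f ", "weight": "", "unit": "kg"},
]

harmonised = impute_median([clean_record(r) for r in raw_records], "weight_kg")
for row in harmonised:
    print(row)
```

Median imputation is only one way of deriving a missing value from existing ones; whether it is appropriate depends on the variable and on why the value is missing.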

The mapping file generated to load the datasets into tranSMART also generates the i2b2 tree structure for the study. The category of each variable and the naming of the features are therefore assigned at this stage, and the mapping files are responsible for structuring the different studies in the AETIONOMY knowledge base. Each feature collected across studies should eventually be assigned to the same category and leaf node in every study loaded. Features specific to a single study will have a new leaf node; nevertheless, the structure of the tree (in terms of category and branching) can be harmonised to the extent possible.
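How a mapping determines the tree can be sketched in a few lines of Python. The row layout below (study, source column, category path, leaf label), the "+" separator between category levels and all study and variable names are simplified assumptions for illustration; they do not reproduce the actual mapping files of the project.

```python
# Hypothetical, simplified mapping rows: (study, source column, category path, leaf label).
MAPPING = [
    ("StudyA", "AGE", "Demographics", "Age"),
    ("StudyA", "MMSE_TOT", "Clinical+Cognitive Tests", "MMSE"),
    ("StudyB", "age_years", "Demographics", "Age"),
    ("StudyB", "csf_abeta", "Biomarkers+CSF", "Abeta 42"),
]


def build_tree(mapping):
    """Build a nested dict per study: category levels as branches, labels as leaves."""
    tree = {}
    for study, column, category, label in mapping:
        node = tree.setdefault(study, {})
        for level in category.split("+"):
            node = node.setdefault(level, {})
        node[label] = column  # the leaf records which source column feeds it
    return tree


def shared_leaves(mapping):
    """Map each study-independent path (category/leaf) to the studies that use it."""
    usage = {}
    for study, _column, category, label in mapping:
        path = "/".join(category.split("+") + [label])
        usage.setdefault(path, set()).add(study)
    return usage


tree = build_tree(MAPPING)
usage = shared_leaves(MAPPING)

for path, studies in sorted(usage.items()):
    print(path, sorted(studies))
```

In this sketch the differently named source columns AGE and age_years end up under the same category and leaf in both studies, while the study-specific features each receive their own leaf within a common set of categories.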

Figure 1: Data curation and harmonisation

Figure 2: Structuring datasets

Figure 3: ETL process for harmonisation and integration of datasets in tranSMART