Author ORCID Identifier

Year of Publication


Degree Name

Doctor of Philosophy (PhD)

Document Type

Doctoral Dissertation




Computer Science

First Advisor

Dr. Guo-Qiang Zhang

Second Advisor

Dr. Jin Chen


Clinical data have grown continuously with the wide adoption of electronic health records (EHR). These data provide the foundation for state-of-the-art research such as artificial intelligence in medicine. At the same time, it has become a challenge to integrate, access, and explore study-level patient data drawn from large volumes of data in heterogeneous databases. Effective, fine-grained, cross-cohort data exploration and semantically enabled approaches and systems are needed. To build semantically enabled systems, we need to leverage existing terminology systems and ontologies. Numerous ontologies have been developed recently, and they play an important role in semantically enabled applications. Because they contain valuable codified knowledge, the management of these ontologies, as metadata, also requires systematic approaches. Moreover, in most clinical settings, patient data are collected with the help of a data dictionary. Knowledge of the relationships between an ontology and a related data dictionary is important for semantic interoperability. Such relationships are represented and maintained by mappings. Mappings store how data source elements and domain ontology concepts are linked, as well as how domain ontology concepts are linked between different ontologies. While mappings are crucial to the maintenance of relationships between an ontology and a related data dictionary, they are commonly captured in CSV files with limited capabilities for sharing, tracking, and visualization. The management of mappings requires an innovative, interactive, and collaborative approach.

Metadata management serves to organize data that describe other data. In computer science and information science, an ontology is metadata consisting of the representation, naming, and definition of the hierarchies, properties, and relations between concepts. A structured, scalable, and computer-understandable approach to metadata management is critical to developing systems with fine-grained data exploration capabilities.

This dissertation presents MetaSphere, a systematic approach that uses metadata and ontologies to support the management and integration of clinical research data through an ontology-based metadata management system for multiple domains. MetaSphere is a general framework that aims to manage domain-specific metadata, provide a fine-grained data exploration interface, and store patient data in data warehouses. Moreover, MetaSphere provides a dedicated mapping interface, the Interactive Mapping Interface (IMI), to map a data dictionary to well-recognized, standardized ontologies. MetaSphere has been applied successfully to three domains: sleep (X-search), pressure ulcer injuries and deep tissue pressure (SCIPUDSphere), and cancer. Specifically, MetaSphere stores each domain ontology in structured form in databases. Patient data in the corresponding domains are also stored in databases as data warehouses. MetaSphere provides a powerful query interface that lets researchers interact with actual patient data by composing complex queries to pinpoint specific cohorts within a large amount of patient data.
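As a rough illustration of what composing a cohort query can mean, the following is a minimal sketch, not MetaSphere's implementation. The `Criterion` class, the `select_cohort` function, the data element names (`age`, `bmi`), and the sample records are all hypothetical and were invented for this example. It assumes each subject is a flat record keyed by data element name and that query terms are combined with logical AND.

```python
from dataclasses import dataclass
from typing import Dict, List

# Hypothetical subject record: data element name -> numeric value.
Record = Dict[str, float]


@dataclass(frozen=True)
class Criterion:
    """One query term: a data element, a comparison operator, and a value."""
    element: str
    op: str
    value: float

    def matches(self, record: Record) -> bool:
        # A subject missing the data element cannot satisfy the term.
        if self.element not in record:
            return False
        observed = record[self.element]
        if self.op == ">=":
            return observed >= self.value
        if self.op == "<=":
            return observed <= self.value
        if self.op == "==":
            return observed == self.value
        raise ValueError(f"unsupported operator: {self.op}")


def select_cohort(records: Dict[str, Record], criteria: List[Criterion]) -> List[str]:
    """Return ids of subjects that satisfy every criterion (logical AND)."""
    return [
        subject_id
        for subject_id, record in records.items()
        if all(criterion.matches(record) for criterion in criteria)
    ]


# Made-up sample data, for illustration only.
records = {
    "s1": {"age": 67, "bmi": 31.0},
    "s2": {"age": 45, "bmi": 33.5},
    "s3": {"age": 70, "bmi": 24.0},
}
criteria = [Criterion("age", ">=", 60), Criterion("bmi", ">=", 30)]
cohort = select_cohort(records, criteria)
```

Here `cohort` holds only `s1`, the one sample subject meeting both terms. A system of the kind described would evaluate such terms against a data warehouse rather than an in-memory dictionary.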

The MetaSphere framework has been instantiated in three domains, with the following results. X-search is publicly available, with nine sleep domain datasets consisting of over 26,000 unique subjects. The canonical data dictionary contains over 900 common data elements across the datasets. X-search has received over 1800 cross-cohort queries from users in 16 countries. SCIPUDSphere has integrated a total of 268,562 records containing 282 ICD9 codes related to pressure ulcer injuries among 36,626 individuals with spinal cord injuries. IMI is publicly available. Using IMI, we have successfully mapped the North American Association of Central Cancer Registries (NAACCR) data dictionary to the National Cancer Institute Thesaurus (NCIt) concepts.

Digital Object Identifier (DOI)

Available for download on Wednesday, November 11, 2020