Author ORCID Identifier
Year of Publication
Doctor of Philosophy (PhD)
Dr. Miroslaw Truszczynski
Dr. Licong Cui
An ontology provides a formalized representation of knowledge within a domain. In biomedicine, ontologies have been widely used in modern biomedical applications to enable semantic interoperability and facilitate data exchange. Given the important roles that biomedical ontologies play, quality issues such as incompleteness, if not addressed, can affect the quality of downstream ontology-driven applications. However, biomedical ontologies often have large sizes and complex structures. Thus, it is infeasible to uncover potential quality issues through manual effort. In this dissertation, we introduce automated and scalable approaches for auditing the completeness of biomedical ontologies. We mainly focus on two incompleteness issues -- missing hierarchical relations and missing concepts. To identify missing hierarchical relations, we develop three approaches: a lexical-based approach, a hybrid approach utilizing both lexical features and logical definitions, and an approach based on concept name transformation. To identify missing concepts, a lexical-based Formal Concept Analysis (FCA) method is proposed for concept enrichment. We also predict proper concept names for the missing concepts using deep learning techniques. Manual review by domain experts is performed to evaluate these approaches. In addition, we leverage extrinsic knowledge (i.e., external ontologies) to help validate the detected incompleteness issues. The auditing approaches have been applied to a variety of biomedical ontologies, including the SNOMED CT, National Cancer Institute (NCI) Thesaurus and Gene Ontology.
In the first lexical-based approach to identify missing hierarchical relations, each concept is modeled with an enriched set of lexical features, leveraging words and noun phrases in the name of the concept itself and the concept's ancestors. Given a pair of concepts that are not linked by a hierarchical relation, if the enriched lexical attributes of one concept is a superset of the other's, a potentially missing hierarchical relation will be suggested. Applying this approach to the September 2017 release of SNOMED CT (US edition) suggested 38,615 potentially missing hierarchical relations. A domain expert reviewed a random sample of 100 potentially missing ones, and confirmed 90 are valid (a precision of 90%).
In the second work, a hybrid approach is proposed to detect missing hierarchical relations in non-lattice subgraphs. For each concept, its lexical features are harmonized with role definitions to provide a more comprehensive semantic model. Then a two-step subsumption testing is performed to automatically suggest potentially missing hierarchical relations. This approach identified 55 potentially missing hierarchical relations in the 19.08d version of the NCI Thesaurus. 29 out of 55 were confirmed as valid by the curators from the NCI Enterprise Vocabulary Service (EVS) and have been incorporated in the newer versions of the NCI Thesaurus. 7 out of 55 further revealed incorrect existing hierarchical relations in the NCI Thesaurus.
In the third work, we introduce a transformation-based method that leverages the Unified Medical Language System (UMLS) knowledge to identify missing hierarchical relations in its source ontologies. Given a concept name, noun chunks within it are identified and replaced by their more general counterparts to generate new concept names that are supposed to be more general than the original one. Applying this method to the UMLS (2019AB release), a total of 39,359 potentially missing hierarchical relations were detected in 13 source ontologies. Domain experts evaluated a random sample of 200 potentially missing hierarchical relations identified in the SNOMED CT (US edition), and 100 in the Gene Ontology. 173 out of 200 and 63 out of 100 potentially missing hierarchical relations were confirmed by domain experts, indicating our method achieved a precision of 86.5% and 63% for the SNOMED CT and Gene Ontology, respectively.
In the work of concept enrichment, we introduce a lexical method based on FCA to identify potentially missing concepts. Lexical features (i.e., words appearing in the concept names) are considered as FCA attributes while generating formal context. Applying multistage intersection on FCA attributes results in newly formalized concepts along with bags of words that can be utilized to name the concepts. This method was applied to the Disease or Disorder sub-hierarchy in the 19.08d version of the NCI Thesaurus and identified 8,983 potentially missing concepts. We performed a preliminary evaluation and validated that 592 out of 8,983 potentially missing concepts were included in external ontologies in the UMLS.
After obtaining new concepts and their relevant bags of words, we further developed deep learning-based approaches to automatically predict concept names that comply with the naming convention of a specific ontology. We explored simple neural network, Long Short-Term Memory (LSTM), and Convolutional Neural Network (CNN) combined with LSTM. Our experiments showed that the LSTM-based approach achieved the best performance with an F1 score of 63.41% for predicting names for newly added concepts in the March 2018 release of SNOMED CT (US Edition) and an F1 score of 73.95% for naming missing concepts revealed by our previous work.
In the last part of this dissertation, extrinsic knowledge is leveraged to collect supporting evidence for the detected incompleteness issues. We present a work in which cross-ontology evaluation based on extrinsic knowledge from the UMLS is utilized to help validate potentially missing hierarchical relations, aiming at relieving the heavy workload of manual review.
Digital Object Identifier (DOI)
Zheng, Fengbo, "Scalable Approaches for Auditing the Completeness of Biomedical Ontologies" (2021). Theses and Dissertations--Computer Science. 105.