Author ORCID Identifier

https://orcid.org/0000-0001-8045-7083

Date Available

8-8-2019

Year of Publication

2019

Degree Name

Doctor of Philosophy (PhD)

Document Type

Doctoral Dissertation

College

Medicine

Department/School/Program

Molecular and Cellular Biochemistry

First Advisor

Dr. Hunter N. B. Moseley

Abstract

Ontologies provide an organization of language, in the form of a network or graph, that is amenable to computational analysis while remaining human-readable. Although they are used in a variety of disciplines, ontologies in the biomedical field, such as the Gene Ontology, are of particular interest for their role in organizing the terminology used to describe, among other concepts, the functions, locations, and processes of genes and gene products. Because ontologies bring consistency and a level of automation to such annotations, methods for finding enriched biological terminology in a set of differentially identified genes from a tissue or cell sample have been developed to aid in the elucidation of disease pathology and unknown biochemical pathways. However, despite their immense utility, biomedical ontologies have significant limitations and caveats. One major issue is that gene annotation enrichment analyses often return many redundant, individually enriched ontological terms that are highly specific and only weakly supported by statistical significance. These large sets of weakly enriched terms are difficult to interpret without manually sorting them into appropriate functional or descriptive categories. In addition, the relationships that organize the terminology within these ontologies do not describe semantic scoping or scaling among terms. The resulting ambiguity complicates the automated categorization of terms to improve interpretability.

We emphasize that existing methods risk producing incorrect mappings to categories as a result of these ambiguities unless they use simplified, incomplete versions of these ontologies that omit problematic relations. Such ambiguities could have a significant impact on term categorization: we have calculated upper-bound estimates of potential false categorizations as high as 121,579 for the misinterpretation of a single scoping relation, has_part, which accounts for approximately 18% of the total possible mappings between terms in the Gene Ontology. Omitting problematic relations, however, results in a significant loss of retrievable information. In the Gene Ontology, omitting a single relation accounts for a 6% reduction, and this percentage would be expected to increase drastically when all relations in an ontology are considered. To address these issues, we have developed methods that categorize individual ontology terms into broad, biologically related concepts to improve the interpretability and statistical significance of gene annotation enrichment studies. These methods also address the lack of semantic scoping and scaling descriptions among ontological relationships, so that annotation enrichment analyses can be performed across a more complete representation of the ontological graph.
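
The has_part misinterpretation described above can be illustrated with a minimal sketch. It assumes a toy graph whose term names (subunit, whole_complex, and the two categories) are hypothetical and not real Gene Ontology terms, and it is not the dissertation's implementation. It only shows that following has_part in the same direction as is_a places a whole inside the category of its part, while inverting that one relation, in the spirit of a part_of_some reading, avoids the false mapping without discarding the edge.

```python
from collections import defaultdict

# Hypothetical toy edges (source, relation, target); not real GO terms.
EDGES = [
    ("subunit", "is_a", "subunit_category"),
    ("whole_complex", "is_a", "complex_category"),
    ("whole_complex", "has_part", "subunit"),  # whole -> part
]


def broader_terms(edges, invert=frozenset()):
    """Map each term to the terms treated as one step broader in scope.

    Relations listed in `invert` are flipped before being stored, so that
    their edges point from the narrower term to the broader one.
    """
    up = defaultdict(set)
    for source, relation, target in edges:
        if relation in invert:
            source, target = target, source
        up[source].add(target)
    return up


def ancestors(term, up):
    """Collect every term reachable by repeatedly stepping to broader terms."""
    seen, stack = set(), [term]
    while stack:
        for parent in up.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


# Naive reading: every edge is assumed to point from narrower to broader.
naive = broader_terms(EDGES)
# Scope-aware reading: has_part is inverted so the part points to the whole.
scoped = broader_terms(EDGES, invert={"has_part"})

false_mappings = ancestors("whole_complex", naive) - ancestors("whole_complex", scoped)
print(sorted(false_mappings))
```

In this toy graph the naive reading categorizes whole_complex under subunit_category, whereas the scope-aware reading keeps the edge and instead lets subunit reach whole_complex and its category.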

We show that, when compared to similar term categorization methods, our method produces categorizations that match hand-curated ones with similar or better accuracy, while not requiring the user to compile lists of individual ontology term IDs. Furthermore, our handling of problematic relations produces a more complete representation of ontological information from a scoping perspective, and we demonstrate instances where medically relevant terms (and, by extension, putative gene targets) are identified in our annotation enrichment results that would otherwise be missed when using traditional methods. Additionally, we observed a marginal yet consistent improvement in statistical power in enrichment results when our methods were used, compared to traditional enrichment analyses that utilize ontological ancestors. Finally, using scalable and reproducible data workflow pipelines, we have applied our methods to several genomic, transcriptomic, and proteomic collaborative projects.

Digital Object Identifier (DOI)

https://doi.org/10.13023/etd.2019.362

Funding Information

Support for this research was provided by the National Science Foundation grant NSF 1419282 (Hunter N.B. Moseley).

S3.1a-v - Visualizing the degree of overlap between the category subgraphs created by GOcats, Map2Slim, and the UniProt CV.zip (706 kB)
Supplemental Figures 3.1 A-V Visualizing the degree of overlap between the category subgraphs created by GOcats, Map2Slim, and the UniProt CV

S3.2 List of GO terms mapped by Map2Slim to the term plasma membrane that were not mapped to this location by GOcats.xlsx (12 kB)
Supplemental Table 3.2 List of GO terms mapped by Map2Slim to the plasma membrane that were not mapped to this location by GOcats

S4.1 - Adjusted p-values between omitted has_part and GOcats part_of_some edges for terms enriched from breast cancer data.xlsx (23 kB)
Supplemental Table 4.1 Adjusted p-values between omitted has_part and GOcats part_of_some edges for terms enriched in breast cancer data
