Gene Ontology is used extensively in scientific knowledgebases and repositories to organize a wealth of biological information. However, interpreting annotations derived from differential gene lists is often difficult without manually sorting them into higher-order categories. To address this difficulty, we present GOcats, a novel tool that organizes the Gene Ontology (GO) into subgraphs representing user-defined concepts, while ensuring that all appropriate relations are congruent with respect to scoping semantics. We tested GOcats’ performance by using subcellular location categories to mine annotations from GO-utilizing knowledgebases and evaluating the accuracy of those annotations against immunohistochemistry datasets in the Human Protein Atlas (HPA). GOcats outperformed term categorizations generated from UniProt’s controlled vocabulary and from GO slims via OWLTools’ Map2Slim in its ability to mimic human-categorized GO term sets. Unlike the other methods, GOcats relies only on an input of basic keywords from the user (e.g., a biologist), not on a manually compiled or static set of top-level GO terms. Additionally, by identifying and properly defining relations with respect to semantic scope, GOcats can utilize the traditionally problematic relation, has_part, without encountering erroneous term mapping. We applied GOcats to compare HPA-sourced knowledgebase annotations with experimentally derived annotations provided by HPA directly. In this comparison, GOcats improved correspondence between the annotation sources by adjusting semantic granularity. GOcats enables the creation of custom, GO slim-like filters that map fine-grained gene annotations from gene annotation files to general subcellular compartments without the need to hand-select a set of GO terms for categorization. Moreover, GOcats can customize the level of semantic specificity for annotation categories.
Furthermore, GOcats enables safe and more comprehensive use of semantic scoping in go-core, making fuller use of the information available in GO. Together, these improvements can impact a variety of GO knowledgebase data-mining use cases, as well as knowledgebase curation and quality control.
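To make the idea of keyword-defined category subgraphs concrete, the following is a minimal sketch on a toy graph. It is not the GOcats implementation or its API. The term names, the edge list, and the functions `build_children`, `descendants`, and `categorize` are all hypothetical. The handling of has_part (reversing the edge so that it points from the narrower term to the broader one before traversal) is an assumption made here for illustration, as is the simple substring match used to pick seed terms.

```python
from collections import defaultdict

# Toy (source, relation, target) edges. Illustrative only, not real GO records.
EDGES = [
    ("nuclear envelope", "part_of", "nucleus"),
    ("nucleolus", "part_of", "nucleus"),
    ("nuclear pore", "part_of", "nuclear envelope"),
    ("plasma membrane", "is_a", "membrane"),
    ("nucleus", "has_part", "nuclear chromatin"),
]

# Assumption: for these relations the source is narrower in scope than the target.
NARROW_TO_BROAD = {"is_a", "part_of"}
# Assumption: for these relations the source is broader, so the edge is flipped.
BROAD_TO_NARROW = {"has_part"}


def build_children(edges):
    """Map each broader term to the set of terms directly narrower than it."""
    children = defaultdict(set)
    for source, relation, target in edges:
        if relation in NARROW_TO_BROAD:
            children[target].add(source)
        elif relation in BROAD_TO_NARROW:
            children[source].add(target)
    return children


def descendants(children, root):
    """Collect every term reachable from root by following narrower-term links."""
    seen = set()
    stack = [root]
    while stack:
        node = stack.pop()
        for child in children.get(node, ()):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


def categorize(edges, keywords):
    """Return {keyword: terms whose name matches it, plus all their descendants}."""
    children = build_children(edges)
    terms = {s for s, _, _ in edges} | {t for _, _, t in edges}
    categories = {}
    for keyword in keywords:
        seeds = {term for term in terms if keyword in term}
        members = set(seeds)
        for seed in seeds:
            members |= descendants(children, seed)
        categories[keyword] = members
    return categories


if __name__ == "__main__":
    for keyword, members in categorize(EDGES, ["nucleus", "membrane"]).items():
        print(keyword, sorted(members))
```

In this sketch, a fine-grained term such as "nuclear pore" falls into the "nucleus" category through two part_of steps, and "nuclear chromatin" joins the same category only because the has_part edge is traversed in the scoping direction rather than the direction in which it is written.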
Digital Object Identifier (DOI)
This work was supported in part by grants NSF 1419282 (Moseley), NIH 1U24DK097215-01A1 (Higashi, Fan, Lane, Moseley), and NIH UL1TR001998-01 (Kern).
GOcats is an open-source Python software package under a BSD-3 License, available on GitHub at https://github.com/MoseleyBioinformaticsLab/GOcats and on the Python Package Index (PyPI) at https://pypi.python.org/pypi/GOcats. Documentation can be found at http://gocats.readthedocs.io/en/latest/. The exact version of GOcats used in this study, along with all scripts used to generate results, can be found in the Figshare repository at https://doi.org/10.6084/m9.figshare.7064516 and at https://doi.org/10.6084/m9.figshare.7064549. The version of GO used to generate these results is go-core (go.obo) data-version: releases/2016-01-12. The UniProt Controlled Vocabulary file can be found at https://www.uniprot.org/docs/subcell.txt. Associated GO terms are indicated by the GO identifier in each stanza. Map2Slim is available on GitHub (https://github.com/owlcollab/owltools/wiki/Map2Slim) and requires OWLTools, also available via GitHub (https://github.com/owlcollab/owltools/wiki/Install-OWLTools#building-from-source). Subcellular location data were obtained from version 15 of the Human Protein Atlas and can be downloaded at http://v15.proteinatlas.org/download/subcellular_location.csv.zip.
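When reproducing results against a specific GO release, it can help to confirm the data-version recorded in the header of the go.obo file before running anything. The following is a minimal sketch, assuming the standard OBO flat-file layout in which header tags precede the first bracketed stanza. The function name `read_obo_data_version` and the sample header lines are hypothetical and are not part of GOcats.

```python
EXPECTED_VERSION = "releases/2016-01-12"  # release stated in the notes above

# Hypothetical header excerpt standing in for the first lines of a go.obo file.
SAMPLE_HEADER = [
    "format-version: 1.2",
    "data-version: releases/2016-01-12",
    "ontology: go",
    "",
    "[Term]",
    "id: GO:0000001",
]


def read_obo_data_version(lines):
    """Return the data-version value from an OBO header, or None if absent."""
    for line in lines:
        line = line.strip()
        if line.startswith("["):
            break  # first stanza reached, so the header has ended
        if line.startswith("data-version:"):
            return line.split(":", 1)[1].strip()
    return None


if __name__ == "__main__":
    version = read_obo_data_version(SAMPLE_HEADER)
    print("match" if version == EXPECTED_VERSION else "mismatch")
```

To check a downloaded file, the same function can be given an open file handle, for example `with open("go.obo") as handle: read_obo_data_version(handle)`, where the local path is an assumption.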
Hinderer, Eugene Waverly III and Moseley, Hunter N. B., "GOcats: A Tool for Categorizing Gene Ontology into Subgraphs of User-Defined Concepts" (2020). Molecular and Cellular Biochemistry Faculty Publications. 173.
S1 Data. Visualizing the degree of overlap between the category subgraphs created by GOcats, Map2Slim, and the UniProt CV (additional categories). https://doi.org/10.1371/journal.pone.0233311.s001
pone.0233311.s002.docx (23 kB)
S2 Data. List of GO terms mapped by Map2Slim to the term plasma membrane that were not mapped to this location by GOcats. https://doi.org/10.1371/journal.pone.0233311.s002
pone.0233311.s003.docx (12 kB)
S1 File. https://doi.org/10.1371/journal.pone.0233311.s003