Year of Publication


Degree Name

Doctor of Philosophy (PhD)

Document Type

Doctoral Dissertation




Computer Science

First Advisor

Dr. Jinze Liu

Second Advisor

Dr. Jin Chen


Machine learning algorithms are becoming among the most effective methods for knowledge discovery from high-dimensional datasets. Machine learning seeks to construct predictive models through the analysis of large-scale heterogeneous data. While machine learning has been widely used in many domains, including computer vision, natural language processing, and product recommendation, its application in biomedical science for clinical diagnosis and treatment is only emerging. The wealth of data in the biomedical domain nevertheless offers machine learning not only challenges but also opportunities. In this dissertation, we focus on three biomedical applications from vastly different domains to understand the opportunities and challenges faced by machine learning algorithms, and we report the lessons we have learned.

Next-generation sequencing has revolutionized biomedical research over recent decades. It has significantly advanced our understanding of how individual differences in our genomic fingerprints affect clinical phenotypes, and it is gaining particular importance in the diagnostic assessment of cancer. We propose a computational pipeline for phenotype prediction with “genomic words”. In this work, we directly extract words in “k-mer” format from raw sequencing omics data as features for prediction. Our algorithm sidesteps conventional gene expression profiling, which is often time-consuming and may be incomplete. We evaluate the performance of k-mer level features for predicting cell types and tumor subtypes. Our experiments demonstrate that, in the majority of cases, k-mer level features achieve better classification accuracy than traditional gene expression features.
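To make the idea of “genomic words” concrete, the following is a minimal sketch of k-mer feature extraction, not the pipeline used in the dissertation. The function names `count_kmers` and `kmer_feature_vector`, the toy reads, the choice of k = 3, and the rule of skipping k-mers with ambiguous bases are all assumptions made for illustration.

```python
from collections import Counter

VALID_BASES = set("ACGT")

def count_kmers(reads, k):
    """Count every length-k substring (k-mer) across a list of reads.

    Assumption: k-mers containing anything other than A, C, G, T
    (for example the ambiguous base N) are skipped.
    """
    counts = Counter()
    for read in reads:
        seq = read.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if set(kmer) <= VALID_BASES:
                counts[kmer] += 1
    return counts

def kmer_feature_vector(counts, vocabulary):
    """Turn raw counts into relative frequencies over a fixed vocabulary."""
    total = sum(counts[kmer] for kmer in vocabulary) or 1
    return [counts[kmer] / total for kmer in vocabulary]

# Toy reads, invented for this example only.
reads = ["ACGTAC", "CGTACN"]
counts = count_kmers(reads, k=3)
vocabulary = sorted(counts)
features = kmer_feature_vector(counts, vocabulary)
```

A vector like `features`, computed per sample over a shared vocabulary, could then be passed to any standard classifier. Real sequencing data would need a streaming reader and a far larger k than this toy example uses.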

Glycomics, another type of omics study, concerns glycans, which serve as building blocks for complex carbohydrate molecules and provide proteins and lipids with a dynamic structural diversity that is responsive to cellular phenotype. A Matrix-Assisted Laser Desorption/Ionization (MALDI) imaging mass spectrometry workflow for profiling glycome distributions in tissues has recently been developed. The complexity of MALDI imaging data hinders biomedical scientists from gaining a general understanding of the molecular patterns in tissues. We propose a framework incorporating unsupervised and supervised machine learning algorithms for tissue histology segmentation and prediction, biomarker detection, and phenotype classification. We also treat the cores in MALDI imaging tissue microarrays (TMAs) as general images so that classical image processing techniques can be applied to them. This pilot study combines pixel spatial information with molecular profiles for phenotype classification.

An ontology integrates and organizes a common set of terms that represent the basic concepts in a domain and the relationships between them. Along with the boom in omics studies, biomedical ontologies are becoming increasingly popular for integrating the vast number of clinical findings from functional research. As the number of domain-specific ontologies grows, curated ontologies may suffer from consistency problems and may even contain errors. We propose a set of conceptual relation prediction methods based on neural networks to audit ontology quality. In this work, concepts are represented by concept embeddings learned directly from the existing ontology. Disagreements between the predicted relationships of concept pairs and the existing relationships may indicate potential ontology errors or may point to areas for improvement in the classification algorithms. Experiments on SNOMED CT demonstrate the efficacy of our methods in conceptual relation prediction and ontology auditing.
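The auditing step, comparing predicted relationships against those recorded in the ontology, can be sketched as follows. This is an illustration under stated assumptions rather than the dissertation's method: a sigmoid over a dot product stands in for the neural network, the two-dimensional embeddings are hand-made rather than learned, and the names `predict_related`, `audit`, and the toy concepts are hypothetical.

```python
import math

def predict_related(emb_a, emb_b, threshold=0.5):
    """Stand-in predictor: sigmoid of the dot product of two embeddings.

    Assumption: a trained neural network would replace this function.
    """
    score = sum(x * y for x, y in zip(emb_a, emb_b))
    probability = 1.0 / (1.0 + math.exp(-score))
    return probability >= threshold

def audit(embeddings, candidate_pairs, asserted):
    """Return the pairs where the prediction disagrees with the ontology."""
    flagged = []
    for a, b in candidate_pairs:
        predicted = predict_related(embeddings[a], embeddings[b])
        recorded = (a, b) in asserted
        if predicted != recorded:
            flagged.append((a, b, predicted, recorded))
    return flagged

# Toy data, invented for this example only.
embeddings = {
    "concept_a": [1.0, 0.0],
    "concept_b": [0.9, 0.1],
    "concept_c": [-1.0, 0.2],
}
asserted = {("concept_a", "concept_b"), ("concept_a", "concept_c")}
pairs = [
    ("concept_a", "concept_b"),
    ("concept_a", "concept_c"),
    ("concept_b", "concept_c"),
]
flagged = audit(embeddings, pairs, asserted)
```

Each flagged pair is only a candidate for review: it may reflect an error in the ontology or a mistake by the predictor.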

Our extensive work in these three research directions gives us a deeper understanding of how they relate to one another and contribute to the biomedical domain. In the future, we would like to integrate them into a single machine learning framework that aggregates clinical findings from multi-omics studies into domain-specific biomedical ontologies.

Digital Object Identifier (DOI)

Available for download on Saturday, January 20, 2024