Year of Publication
Doctor of Philosophy (PhD)
Epidemiology and Biostatistics
Dr. Chi Wang
Mass spectrometry (MS) is widely used for proteomic and metabolomic profiling of biological samples. Data obtained by MS are often zero-inflated, and the zero values are called point mass values (PMVs). PMVs can be further grouped into biological PMVs, which are caused by the absence of a component, and technical PMVs, which are caused by abundance falling below the detection limit. There is no simple way to separate the two types of PMVs. Mixture models have been developed to separate them and to perform differential abundance analysis. However, we noticed that the mixture model can be unstable when the number of non-zero values is small.
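To make the mixture idea concrete, the following is a minimal sketch of a point-mass mixture log-likelihood for a single component. It is an illustration under stated assumptions, not the dissertation's implementation: log abundances are assumed normal, the detection limit is assumed known, and the names `mixture_loglik`, `tau`, `mu`, `sigma`, and `log_limit` are hypothetical.

```python
import math


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def mixture_loglik(values, tau, mu, sigma, log_limit):
    """Log-likelihood of one component under a point-mass mixture.

    Assumptions (not taken from the source):
      values    -- raw abundances, with 0 marking a PMV
      tau       -- proportion of biological PMVs (component truly absent)
      mu, sigma -- mean and standard deviation of the log abundance
      log_limit -- log of the detection limit, treated as known
    """
    # A zero is either biological (probability tau) or technical: the
    # component is present but its log abundance falls below the limit.
    p_below = normal_cdf((log_limit - mu) / sigma)
    p_zero = tau + (1.0 - tau) * p_below

    total = 0.0
    for y in values:
        if y == 0:
            total += math.log(p_zero)
        else:
            # Normal density on the log scale; the Jacobian term 1/y is
            # dropped because it does not depend on the parameters.
            z = (math.log(y) - mu) / sigma
            total += (math.log(1.0 - tau) - math.log(sigma)
                      - 0.5 * math.log(2.0 * math.pi) - 0.5 * z * z)
    return total


# Hypothetical component with many zeros and only two non-zero values.
observed = [0, 0, 0, 0, 0, 0, 12.5, 30.1]
print(mixture_loglik(observed, tau=0.5, mu=3.0, sigma=0.6, log_limit=2.0))
```

With only two non-zero values, `sigma` is informed by very little data, which is one way to see why maximizing such a likelihood can become unstable.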
In this dissertation, we propose a new differential abundance (DA) analysis method, DASEV, which applies empirical Bayes shrinkage estimation to the variance. We hypothesized that more robust variance estimation would improve the accuracy of differential abundance analysis. Despite its instability, the mixture model offers a promising strategy for separating the two types of PMVs. We therefore adopted the mixture distribution proposed in the original mixture model and assumed that the variances of all components follow a common distribution. We estimate each variance by borrowing information from the other components through this assumed distribution, and then re-estimate the remaining parameters using the estimated variances. We obtained better and more stable estimates of the variances, mean abundances, and proportions of biological PMVs, especially when the proportion of zeros is large. As a result, the proposed method achieved clear improvements in DA analysis.
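The idea of borrowing information across components can be illustrated with a generic variance shrinkage estimator. The sketch below assumes a scaled inverse chi-square prior on the variances, which yields a weighted average of the prior variance and the sample variance; this is a common empirical Bayes form and is not necessarily the one DASEV uses. The prior parameters `prior_s2` and `prior_df` are fixed by hand here, whereas an empirical Bayes procedure would estimate them from all components. All function and parameter names are hypothetical.

```python
def sample_variance(xs):
    """Unbiased sample variance of a list with at least two values."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / (n - 1)


def shrink_variance(s2, df, prior_s2, prior_df):
    """Shrink a component's sample variance toward a prior variance.

    Weighted average with weights given by the degrees of freedom:
    components with few non-zero values (small df) are pulled strongly
    toward prior_s2, while well-observed components are barely changed.
    """
    return (prior_df * prior_s2 + df * s2) / (prior_df + df)


# Hypothetical log abundances (non-zero values only) for two components.
sparse_component = [2.1, 4.9]                      # 2 non-zero values
dense_component = [3.0, 3.4, 2.8, 3.1, 3.3, 2.9]   # 6 non-zero values

# Assumed prior parameters, chosen for illustration only.
prior_s2, prior_df = 0.25, 4.0

for name, xs in [("sparse", sparse_component), ("dense", dense_component)]:
    s2 = sample_variance(xs)
    shrunk = shrink_variance(s2, len(xs) - 1, prior_s2, prior_df)
    print(name, round(s2, 3), round(shrunk, 3))
```

The sparse component's variance moves much closer to the prior value than the dense component's, which is the stabilizing behavior the paragraph above describes.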
We also extend the method to clustering analysis. To our knowledge, K-means and hierarchical clustering are the only clustering methods commonly used for MS omics data. Both have limitations when applied to zero-inflated data. Model-based clustering methods are widely used for various data types, including zero-inflated data. We propose the extension, DASEV.C, as a model-based clustering method, and we compared its clustering performance with that of K-means and hierarchical clustering. Under certain scenarios, the proposed method returned more accurate clusters than the standard methods.
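As a rough illustration of how model-based clustering differs from distance-based methods on zero-inflated data, the sketch below computes posterior cluster membership probabilities for one sample, treating zeros through a point-mass mixture density rather than as ordinary values. It is not the DASEV.C algorithm: the density, the known detection limit, the fixed cluster parameters, and all names are assumptions made for this example.

```python
import math


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def log_density(y, tau, mu, sigma, log_limit):
    """Log density of one abundance under an assumed point-mass mixture."""
    if y == 0:
        # Zero: biological absence, or presence below the detection limit.
        p_below = normal_cdf((log_limit - mu) / sigma)
        return math.log(tau + (1.0 - tau) * p_below)
    z = (math.log(y) - mu) / sigma
    return (math.log(1.0 - tau) - math.log(sigma)
            - 0.5 * math.log(2.0 * math.pi) - 0.5 * z * z)


def cluster_posteriors(sample, clusters, weights, log_limit):
    """Posterior probability that a sample belongs to each cluster.

    sample   -- abundances of one sample across components (0 = PMV)
    clusters -- per cluster, a list of (tau, mu, sigma) per component
    weights  -- prior mixing proportion of each cluster
    """
    log_scores = []
    for params, weight in zip(clusters, weights):
        score = math.log(weight)
        for y, (tau, mu, sigma) in zip(sample, params):
            score += log_density(y, tau, mu, sigma, log_limit)
        log_scores.append(score)
    # Subtract the maximum before exponentiating for numerical stability.
    top = max(log_scores)
    exps = [math.exp(s - top) for s in log_scores]
    total = sum(exps)
    return [e / total for e in exps]


# Two hypothetical clusters over three components.
cluster_a = [(0.1, 4.0, 0.5), (0.1, 4.0, 0.5), (0.7, 3.0, 0.5)]
cluster_b = [(0.1, 2.5, 0.5), (0.6, 3.0, 0.5), (0.1, 4.0, 0.5)]
sample = [60.0, 48.0, 0]
print(cluster_posteriors(sample, [cluster_a, cluster_b], [0.5, 0.5], 2.0))
```

In a full model-based clustering procedure, this membership step would alternate with re-estimating each cluster's parameters; only the membership step is shown here.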
We also developed an R package, dasev, that implements the methods presented in this dissertation. Its major functions, DASEV.DA and DASEV.C, apply the empirical Bayes shrinkage estimation of variance and then conduct the differential abundance and clustering analyses, respectively. The functions are designed to give researchers the flexibility to specify certain input options.
Digital Object Identifier (DOI)
The project described was supported by the National Cancer Institute through Grant R03CA211835.
Huang, Zhengyan, "Differential Abundance and Clustering Analysis with Empirical Bayes Shrinkage Estimation of Variance (DASEV) for Proteomics and Metabolomics Data" (2019). Theses and Dissertations--Epidemiology and Biostatistics. 24.