Author ORCID Identifier

https://orcid.org/0000-0002-0281-3940

Date Available

8-5-2026

Year of Publication

2024

Degree Name

Doctor of Philosophy (PhD)

Document Type

Doctoral Dissertation

College

Arts and Sciences

Department/School/Program

Statistics

First Advisor

Chi Wang

Abstract

Single-cell RNA sequencing (scRNA-seq) has transformed our understanding of cellular heterogeneity and gene expression dynamics. Despite its potential, the inherent noise and sparsity of scRNA-seq data pose significant challenges in clustering cells into biologically meaningful groups. This dissertation addresses these challenges through two novel methodologies aimed at enhancing the accuracy and robustness of scRNA-seq data analysis.

First, we introduce the Differential Feature Selection (DIFS) framework, designed to improve the identification of differential features in scRNA-seq data. DIFS employs a two-stage marker identification process. In the first stage, a modified Dip Test is used to filter and identify genes with significant multimodal expression patterns, marking them as potential Stage I markers. The second stage focuses on identifying cluster-specific features or Stage II markers using clustering methods and Fisher's exact test. This combined approach provides a robust method for selecting highly informative genes, improving cell type classification and the understanding of cellular heterogeneity.
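As a rough illustration of the Stage II idea, the sketch below tests whether a gene is detected more often inside one cluster than in all remaining cells, using a one-sided Fisher's exact test. It is not the DIFS implementation. The binary "expressed" coding, the one-sided alternative, and the function names `fisher_exact_greater` and `cluster_marker_pvalue` are assumptions made for this example.

```python
from math import comb


def fisher_exact_greater(a, b, c, d):
    """One-sided Fisher's exact test on a 2x2 table.

    Assumed (hypothetical) table layout:
                      expressed   not expressed
        in cluster        a             b
        other cells       c             d

    Returns P(X >= a) under the hypergeometric null, where X is the
    number of expressing cells that fall inside the cluster.
    """
    n = a + b + c + d          # total number of cells
    cluster_size = a + b       # cells in the cluster of interest
    expressed_total = a + c    # cells expressing the gene
    denom = comb(n, cluster_size)
    upper = min(cluster_size, expressed_total)
    tail = sum(
        comb(expressed_total, k) * comb(n - expressed_total, cluster_size - k)
        for k in range(a, upper + 1)
    )
    return tail / denom


def cluster_marker_pvalue(expressed, labels, cluster):
    """Build the 2x2 table for one gene and one cluster, then test it.

    `expressed` is a per-cell 0/1 indicator (an assumed binarization),
    `labels` holds each cell's cluster label.
    """
    a = sum(1 for e, l in zip(expressed, labels) if e and l == cluster)
    b = sum(1 for e, l in zip(expressed, labels) if not e and l == cluster)
    c = sum(1 for e, l in zip(expressed, labels) if e and l != cluster)
    d = sum(1 for e, l in zip(expressed, labels) if not e and l != cluster)
    return fisher_exact_greater(a, b, c, d)


# Toy data: the gene is detected in 8 of 10 cells in cluster "A"
# and in 1 of 10 cells in cluster "B".
expressed = [1] * 8 + [0] * 2 + [1] * 1 + [0] * 9
labels = ["A"] * 10 + ["B"] * 10
p_value = cluster_marker_pvalue(expressed, labels, "A")
```

A small p-value for a given gene and cluster would flag the gene as a candidate cluster-specific marker. In practice, a multiple-testing correction across genes and clusters would also be needed.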

Second, we present a semi-supervised clustering approach that integrates SingleR, a supervised classification tool, with a modified Hierarchical Dirichlet Process (normHDP) model. SingleR provides preliminary cell type annotations based on a reference dataset, generating a matrix of similarity scores between each cell and the reference cell types. This matrix serves as prior information in the Bayesian framework of the normHDP model, which is adapted to focus on clustering while retaining essential features like batch effect removal. This integration allows for dynamic adjustment of cluster assignments, accounting for both prior knowledge and unique patterns in new scRNA-seq data. To address the challenges of high-dimensional MCMC in clustering, we adopt an ensemble approach, running multiple chains with fewer iterations each to improve posterior exploration and clustering accuracy.
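The abstract does not state how the outputs of the multiple chains are combined. One common option, shown here purely as an assumption, is to average a co-clustering (posterior similarity) matrix over the chains, which sidesteps the fact that cluster labels are arbitrary within each chain. The function names, the 0.5 threshold, and the connected-components step below are all hypothetical choices for this sketch, not the dissertation's procedure.

```python
def coclustering_matrix(chain_labels):
    """Fraction of chains in which each pair of cells shares a cluster.

    `chain_labels` is a list of label vectors, one per chain (for
    example, the final assignment of each chain). Labels only need to
    be consistent within a chain, not across chains.
    """
    n_chains = len(chain_labels)
    n_cells = len(chain_labels[0])
    return [
        [
            sum(labels[i] == labels[j] for labels in chain_labels) / n_chains
            for j in range(n_cells)
        ]
        for i in range(n_cells)
    ]


def consensus_clusters(chain_labels, threshold=0.5):
    """Link cells co-clustered in more than `threshold` of the chains,
    then return connected components as consensus labels (assumed rule)."""
    psm = coclustering_matrix(chain_labels)
    n_cells = len(psm)
    parent = list(range(n_cells))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n_cells):
        for j in range(i + 1, n_cells):
            if psm[i][j] > threshold:
                parent[find(j)] = find(i)

    # Relabel components 0, 1, 2, ... in order of first appearance.
    roots = {}
    return [roots.setdefault(find(i), len(roots)) for i in range(n_cells)]


# Three toy chains over four cells; chain 2 uses swapped label names.
chains = [
    [0, 0, 1, 1],
    [1, 1, 0, 0],
    [0, 0, 0, 1],
]
psm = coclustering_matrix(chains)
consensus = consensus_clusters(chains)
```

Because the matrix depends only on whether two cells share a label, chains that find the same partition under different label names contribute identically.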

We evaluate both methodologies using synthetic and real datasets, demonstrating their superior performance compared to traditional clustering techniques. Our findings highlight the effectiveness of these approaches in reducing the impact of data sparsity and noise, leading to more reliable identification of cell types and states. These methodologies offer powerful tools for the detailed analysis of cellular heterogeneity in complex biological systems, advancing our understanding of underlying biological processes and disease mechanisms.

Digital Object Identifier (DOI)

https://doi.org/10.13023/etd.2024.367

Available for download on Wednesday, August 05, 2026
