Author ORCID Identifier

https://orcid.org/0009-0000-0740-3986

Date Available

12-11-2025

Year of Publication

2025

Document Type

Doctoral Dissertation

Degree Name

Doctor of Philosophy (PhD)

College

Engineering

Department/School/Program

Computer Science

Faculty

Jerzy W. Jaromczyk

Faculty

Christopher L. Schardl

Faculty

Simone Silvestri

Abstract

This dissertation concerns a new application of RNA-seq data: computation of pairwise genetic distance matrices. RNA-seq captures the sequences of RNA molecules in cells or tissues of interest. RNA-seq data are well suited to studies of gene expression, and their use for this purpose is currently widespread. A pairwise genetic distance matrix, the main topic of this dissertation, quantifies differences in the genomes of every pair of samples (e.g., individuals) in a given set. Genetic distance matrices are versatile; they can be used for various kinds of downstream analyses, including genotyping, phylogenetics, and genetic diversity measurement. Although DNA sequence data are typically used for computing such matrices, reusing RNA-seq data to compute genetic distances could be an economical alternative to obtaining DNA sequence data for the same purpose. This application of RNA-seq appears promising, but repurposing RNA-seq data for genetic distance computation poses challenges and potential pitfalls, and few algorithms exist to handle them. This dissertation discusses these challenges and a solution in the form of a novel algorithm and software implementation, “RNA-clique.” The dissertation details thorough testing of the method, including comparison to an existing method, and investigates possible improvements, extensions, and variations on the algorithm and its implementation. The dissertation then discusses an interesting problem that arose in adding a new feature and that may eventually lead to an improvement of the algorithm’s filtering step. The dissertation also describes aspects of the design and redesign aimed at ensuring reproducibility and usability of the software.
Although RNA molecules are produced from DNA templates, RNA-seq data provide a biased view of the genome because different parts of the genome are expressed at different levels, and how parts of the genome are expressed can vary dramatically among different tissues. A method for obtaining genetic distances from RNA-seq data must take steps to avoid reflecting expression differences rather than genetic differences. Additionally, care must be taken to avoid comparing transcripts with their paralogs or homeologs since such comparisons can lead to overestimation of distances. This dissertation describes the filtering devised to address these potential problems and evaluates and compares RNA-clique with an existing method for both real and simulated datasets. This dissertation also evaluates the effect of the graph-based step of the filtering process on accuracy of results for the test datasets. Further, the dissertation discusses the problem of ensuring orthologs found by RNA-clique can be placed in consistent orientations, which originally arose in adding a new feature to RNA-clique. The problem is formulated in terms of assignments of orientations to vertices of a graph containing information about the relative orientations of orthologous transcripts. Properties of the graph are discussed, and algorithms for making optimal assignments are presented for both a simple idealized version of the problem and a more difficult but more realistic case. Finally, aspects of the design and development of the software are discussed, especially a flexible configuration system that promotes reproducibility and facilitates creation of new interfaces to the software while minimizing the need for additional code to support these interfaces.
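
As a rough illustration of the orientation problem described above, the following Python sketch handles one possible reading of the idealized case, in which every pairwise relative-orientation observation is assumed to be correct and mutually consistent. It is not taken from RNA-clique. The function name `assign_orientations`, the edge encoding `(u, v, same)`, and the use of breadth-first propagation are all assumptions made for this example, and the dissertation's own formulation and algorithms may differ.

```python
from collections import deque


def assign_orientations(n, edges):
    """Assign an orientation (+1 or -1) to each of n transcripts.

    Hypothetical encoding: each edge is (u, v, same), where `same` is True
    if transcripts u and v were observed in the same orientation and False
    if one appears reverse-complemented relative to the other.

    Returns a list of +1/-1 values, or None if the observations conflict
    (i.e. no assignment satisfies every edge).
    """
    adjacency = [[] for _ in range(n)]
    for u, v, same in edges:
        adjacency[u].append((v, same))
        adjacency[v].append((u, same))

    orientation = [0] * n  # 0 means "not yet assigned"
    for start in range(n):
        if orientation[start] != 0:
            continue
        # Each connected component gets an arbitrary reference orientation.
        orientation[start] = 1
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v, same in adjacency[u]:
                expected = orientation[u] if same else -orientation[u]
                if orientation[v] == 0:
                    orientation[v] = expected
                    queue.append(v)
                elif orientation[v] != expected:
                    return None  # inconsistent observations
    return orientation
```

When the observations conflict, this sketch simply reports failure. The more realistic case mentioned in the abstract, where an optimal assignment must be chosen despite conflicts, would need a different approach that is not attempted here.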

Digital Object Identifier (DOI)

https://doi.org/10.13023/etd.2025.626

Funding Information

This work was supported by United States National Science Foundation Division of Environmental Biology (DEB) grant 2030225, United States Department of Agriculture National Institute of Food and Agriculture Multi-state project 7003566, and the Harry E. Wheeler endowment to the University of Kentucky.
