Deriving the Distributions and Developing Methods of Inference for R2-type Measures, with Applications to Big Data Analysis
Year of Publication: 2022
Doctor of Philosophy (PhD)
Arts and Sciences
Dr. Katherine L. Thompson
As computing capabilities and cloud-enhanced data sharing have accelerated exponentially in the 21st century, our access to Big Data has revolutionized the way we see data around the world, from healthcare to investments to manufacturing to retail and supply chains. In many areas of research, however, the cost of obtaining each data point makes collecting more than a few observations infeasible. While machine learning and artificial intelligence (AI) are improving our ability to make predictions from datasets, we need better statistical methods to improve our ability to understand models and translate them into meaningful and actionable insights.
A central goal in statistics and data science is the construction of linear regression models for continuous variables of interest. Often, our objective is to examine the impact of one or more explanatory variables after adjusting for demographic variables or other known or relevant covariates. While the traditional methodology uses a combination of partial F-tests and individual t-tests to determine statistical significance, p-values obtained from such methods are heavily dependent on sample size. This is particularly problematic for large datasets or "overpowered" studies, where even the tiniest effects will appear highly significant, and for extremely small datasets, where real effects may not reach statistical significance. The coefficient of partial determination (also known as partial R2) is widely used in the applied sciences to supplement hypothesis testing, but little work has been done to understand its statistical properties. In this dissertation, the exact, complete distribution of partial R2 is derived, accompanied by simulation studies and real-world data examples that show the advantages of adding coefficients of determination to the analysis of quantitative data models, regardless of sample size. Additionally, two novel inference methods are proposed for both R2 and partial R2; these build on the distributional results to provide better coverage and more focused intervals for models built using small- and medium-sized datasets.
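To make the contrast between the partial F-test and partial R2 concrete, here is a minimal sketch in plain Python. It assumes the standard textbook definition of partial R2 as the proportional reduction in residual sum of squares when the explanatory variables are added to a model that already contains the covariates. The function names (`solve`, `sse`, `partial_r2_and_f`) and the simulated data are hypothetical illustrations, not the dissertation's methods or results.

```python
import random


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def sse(X, y):
    """Residual sum of squares from an ordinary least squares fit of y on X.

    X is a list of rows and must already include an intercept column.
    """
    p = len(X[0])
    xtx = [[sum(r[i] * r[j] for r in X) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(p)]
    beta = solve(xtx, xty)  # normal equations (assumes a full-rank design)
    return sum((yi - sum(b * v for b, v in zip(beta, r))) ** 2
               for r, yi in zip(X, y))


def partial_r2_and_f(X_reduced, X_full, y):
    """Return (partial R2, partial F statistic) for nested linear models."""
    n = len(y)
    p_full = len(X_full[0])
    q = p_full - len(X_reduced[0])  # number of added explanatory variables
    sse_r = sse(X_reduced, y)
    sse_f = sse(X_full, y)
    partial_r2 = (sse_r - sse_f) / sse_r
    f_stat = ((sse_r - sse_f) / q) / (sse_f / (n - p_full))
    return partial_r2, f_stat


if __name__ == "__main__":
    # Hypothetical simulation: the same tiny effect of x at two sample sizes.
    random.seed(1)
    for n in (500, 50000):
        z = [random.gauss(0, 1) for _ in range(n)]  # covariate to adjust for
        x = [random.gauss(0, 1) for _ in range(n)]  # explanatory variable
        y = [1 + 0.5 * zi + 0.05 * xi + random.gauss(0, 1)
             for zi, xi in zip(z, x)]
        X_red = [[1.0, zi] for zi in z]
        X_full = [[1.0, zi, xi] for zi, xi in zip(z, x)]
        pr2, f = partial_r2_and_f(X_red, X_full, y)
        print(f"n={n}: partial R2={pr2:.4f}, partial F={f:.2f}")
```

The two quantities are algebraically linked: F = (partial R2 / q) / ((1 - partial R2) / (n - p)), where q is the number of added variables and p is the number of columns in the full design. For a fixed partial R2, the F statistic therefore grows roughly in proportion to n, which is why a negligible effect can look highly significant in a very large dataset while the effect-size measure itself stays small.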
Hawk, Gregory S., "Deriving the Distributions and Developing Methods of Inference for R2-type Measures, with Applications to Big Data Analysis" (2022). Theses and Dissertations--Statistics. 63.
Available for download on Wednesday, May 24, 2023