Date Available
12-20-2024
Year of Publication
2024
Document Type
Master's Thesis
Degree Name
Master of Arts in Linguistic Theory and Typology (MALTT)
College
Arts and Sciences
Department/School/Program
Linguistics
Advisor
Dr. Mark Lauersdorf
Abstract
This thesis examines the predictability and regularity of English orthography through computational methods. The primary objective is to use n-gram models to predict missing letters in English words by exploiting contextual information from adjacent letters. The study evaluates the impact of dataset size, word length, letter position, and vowel presence on the predictive accuracy of these models, uncovering patterns and structures inherent to English spelling.
The research utilizes a range of datasets, including the Carnegie Mellon University Pronouncing Dictionary, the Brown Corpus, the Corpus of Late Modern English Texts, the Lampeter Corpus of Early Modern English Tracts, and the Open EDGeS Diachronic Bible Corpus. The analysis minimizes word frequency biases by focusing on distinct word types rather than tokens, providing insights into letter patterns and frequencies independent of token repetition.
The findings demonstrate that predictability in English orthography remains stable across genres and corpus sizes, with n-gram models exhibiting consistent performance even with limited training data. Word length, letter position within words, and vowel presence influence predictive accuracy. Notably, the second and penultimate letter positions yield the highest accuracy rates, while the first and last positions yield the lowest. The study demonstrates that English spelling adheres to systematic patterns and regularities that computational models can effectively capture.
This research has practical applications in improving error correction systems and deciphering historical manuscripts. It shows that n-gram models predict missing letters effectively even with limited data, a property that is valuable for analyzing under-resourced languages. The study's insights into English spelling patterns also contribute to cryptography and language processing technologies.
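As a rough illustration of the approach summarized in the abstract, the sketch below predicts a single missing letter from its two adjacent letters using counts gathered over distinct word types. It is not the thesis's implementation: the abstract does not state the n-gram order, smoothing, or padding scheme, so the trigram-style context, the `#` boundary symbol, the `_` gap marker, and the function names `train` and `predict` are all assumptions made for illustration.

```python
from collections import Counter, defaultdict

BOUNDARY = "#"  # assumed padding symbol marking word edges (not from the thesis)


def train(words):
    """Count which letter occurs between each (left, right) pair of neighbors."""
    counts = defaultdict(Counter)
    # Deduplicate so each word type contributes once, regardless of token frequency.
    for word in {w.lower() for w in words}:
        padded = BOUNDARY + word + BOUNDARY
        for i in range(1, len(padded) - 1):
            context = (padded[i - 1], padded[i + 1])
            counts[context][padded[i]] += 1
    return counts


def predict(counts, word, gap="_"):
    """Fill the single gap in `word` with the most frequent letter for its context."""
    padded = BOUNDARY + word.lower() + BOUNDARY
    i = padded.index(gap)  # raises ValueError if the word contains no gap marker
    candidates = counts.get((padded[i - 1], padded[i + 1]))
    if not candidates:
        return None  # this neighbor pair never appeared in the training words
    return candidates.most_common(1)[0][0]


# Toy training list; a real run would draw word types from a corpus or dictionary.
model = train(["the", "the", "she", "then", "them", "tho"])
print(predict(model, "t_e"))
```

Padding with a boundary symbol lets the first and last letters be predicted from a single real neighbor, which is one plausible way to compare accuracy across letter positions; the thesis may handle word edges differently.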
Digital Object Identifier (DOI)
https://doi.org/10.13023/etd.2024.508
Recommended Citation
Winstead, John, "A Computational Investigation of English Spelling" (2024). Theses and Dissertations--Linguistics. 64.
https://uknowledge.uky.edu/ltt_etds/64
Included in
Computational Linguistics Commons, Discourse and Text Linguistics Commons, Other Linguistics Commons