Date Available

12-20-2024

Year of Publication

2024

Document Type

Master's Thesis

Degree Name

Master of Arts in Linguistic Theory and Typology (MALTT)

College

Arts and Sciences

Department/School/Program

Linguistics

Advisor

Dr. Mark Lauersdorf

Abstract

This thesis examines the predictability and regularity of English orthography through computational methods. The primary objective is to use n-gram models to predict missing letters in English words by exploiting contextual information from adjacent letters. The study evaluates the impact of dataset size, word length, letter position, and vowel presence on the predictive accuracy of these models, uncovering patterns and structures inherent to English spelling.
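As a rough illustration of the kind of model described above, the following minimal Python sketch predicts a single missing letter from its immediate left and right neighbours. The abstract does not state the n-gram order, the smoothing, or how the two sides of the context are combined, so the one-letter-each-side context, the `#` boundary marker, the `_` gap symbol, and the function names `train` and `predict` are all assumptions made for this example rather than the thesis's actual implementation.

```python
from collections import Counter, defaultdict


def train(words):
    """Count which letter occurs between each (left, right) neighbour pair."""
    counts = defaultdict(Counter)
    for word in words:
        padded = "#" + word.lower() + "#"  # '#' marks word boundaries (assumed convention)
        for i in range(1, len(padded) - 1):
            context = (padded[i - 1], padded[i + 1])
            counts[context][padded[i]] += 1
    return counts


def predict(counts, word, gap="_"):
    """Return the most frequent letter for the single gap in `word`, or None if the context is unseen."""
    padded = "#" + word.lower() + "#"
    i = padded.index(gap)
    context = (padded[i - 1], padded[i + 1])
    if context not in counts:
        return None
    return counts[context].most_common(1)[0][0]


# Toy training list, invented for illustration only.
toy_words = ["that", "than", "this", "then", "them", "chat"]
model = train(toy_words)
print(predict(model, "t_en"))
```

In this toy list the only letter ever seen between `t` and `e` is `h`, so the gap in `t_en` is filled with `h`. A context that never occurred in training returns `None` instead of a guess.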

The research utilizes a range of datasets, including the Carnegie Mellon University Pronouncing Dictionary, the Brown Corpus, the Corpus of Late Modern English Texts, the Lampeter Corpus of Early Modern English Tracts, and the Open EDGeS Diachronic Bible Corpus. The analysis minimizes word frequency biases by focusing on distinct word types rather than tokens, providing insights into letter patterns and frequencies independent of token repetition.
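The difference between counting over word types and counting over tokens can be sketched as follows. The sample sentence and the helper name `letter_counts` are invented for illustration and are not drawn from any of the corpora listed above.

```python
from collections import Counter


def letter_counts(words):
    """Count letters across an iterable of words."""
    return Counter("".join(words))


# Hypothetical running text: every occurrence of a word is a token.
tokens = "the cat and the dog and the bird".split()

# Distinct word types: each word counted once, however often it repeats.
types = sorted(set(tokens))

token_letters = letter_counts(tokens)
type_letters = letter_counts(types)

print(len(tokens), len(types))
print(token_letters["h"], type_letters["h"])
```

Because "the" appears three times in the running text, the letter `h` is counted three times over tokens but only once over types. This is the frequency effect that a type-based analysis removes.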

The findings demonstrate that predictability in English orthography remains stable across genres and corpus sizes, with n-gram models exhibiting consistent performance even with limited training data. Word length, letter position within words, and vowel presence influence predictive accuracy. Notably, the second and penultimate letter positions yield the highest accuracy rates, while the first and last positions yield the lowest. The study demonstrates that English spelling adheres to systematic patterns and regularities that computational models can effectively capture.

This research has practical applications in improving error correction systems and deciphering historical manuscripts. It shows that n-gram models can predict missing letters effectively even with limited data, which is valuable for analyzing under-resourced languages. The study's insights into English spelling patterns also contribute to cryptography and language processing technologies.

Digital Object Identifier (DOI)

https://doi.org/10.13023/etd.2024.508
