Archived
This content is available here for research, reference, and/or recordkeeping.
Author ORCID Identifier
https://orcid.org/0009-0000-0891-7567
Date Available
4-27-2026
Year of Publication
2026
Document Type
Doctoral Dissertation
Degree Name
Doctor of Philosophy (PhD)
College
Engineering
Department/School/Program
Computer Science
Faculty
V.K. Cody Bumgardner
Faculty
Brent Seales
Faculty
Simone Silvestri
Abstract
Self-supervised learning (SSL) has emerged as a principled approach to visual representation learning that derives its supervisory signal directly from unlabeled data, enabling foundation models to be trained at scale without manual annotation. Deployments in medical imaging and biometric recognition have demonstrated the potential of this paradigm, yet the assumptions that make SSL effective on natural image benchmarks fail systematically in specialized domains. Generic SSL pipelines encode a tacit assumption that the most informative correspondence is spatial proximity within a single acquisition. In specialized domains, this assumption breaks at the level of the data-generating process: the signal that carries domain-specific information is cross-scale, cross-session, or spatially sparse, and a within-image training objective cannot access it.
Recovering this signal requires intervening at the level of view construction and correspondence policy rather than data scale or fine-tuning strategy. Yet the infrastructure needed to make such interventions reproducibly has not kept pace with algorithmic developments, and the gap between research-grade outputs and deployable systems remains unaddressed.
This dissertation presents a layered set of contributions that address these barriers. DINO-MX is introduced as a unified, configuration-driven training framework for SSL with Vision Transformers that standardizes distributed execution, view construction, and artifact management. Three domain-specific protocols are developed on top of this substrate: a label-guided view construction protocol instantiated as CARD-ViT for coronary artery calcium detection; a magnification-aware distillation protocol instantiated as MAD-NP for neuropathology whole-slide image analysis; and a temporal-aware distillation protocol constituting the first self-supervised pretraining framework for biometric representation learning. Vision Foundry integrates these contributions within a compliance-aware platform accessible to domain practitioners without distributed computing expertise.
Digital Object Identifier (DOI)
https://doi.org/10.13023/etd.2026.128
Archival?
Archival
Recommended Citation
Gokmen, Mahmut S., "A Specification-Driven Framework for Self-Supervised Learning in Specialized Vision Domains" (2026). Theses and Dissertations--Computer Science. 159.
https://uknowledge.uky.edu/cs_etds/159
Included in
Artificial Intelligence and Robotics Commons, Biomedical Informatics Commons, Computer Engineering Commons, Theory and Algorithms Commons
