Archived

This content is available here for research, reference, and/or recordkeeping.

Author ORCID Identifier

https://orcid.org/0009-0000-0891-7567

Date Available

4-27-2026

Year of Publication

2026

Document Type

Doctoral Dissertation

Degree Name

Doctor of Philosophy (PhD)

College

Engineering

Department/School/Program

Computer Science

Faculty

V.K. Cody Bumgardner

Faculty

Brent Seales

Faculty

Simone Silvestri

Abstract

Self-supervised learning (SSL) has emerged as a principled approach to visual representation learning that derives supervisory signal directly from unlabeled data, enabling foundation models to be trained at scale without manual annotation. Deployments in medical imaging and biometric recognition have demonstrated the potential of this paradigm, yet the assumptions that make SSL effective on natural image benchmarks fail systematically in specialized domains. Generic SSL pipelines encode a tacit assumption that the most informative correspondence is spatial proximity within a single acquisition. In specialized domains this assumption breaks at the level of the data-generating process: the signal that carries domain-specific information is cross-scale, cross-session, or spatially sparse, and a within-image training objective cannot access it.

Recovering this signal requires intervening at the level of view construction and correspondence policy rather than data scale or fine-tuning strategy. Yet the infrastructure needed to make such interventions reproducibly has not kept pace with algorithmic developments, and the gap between research-grade outputs and deployable systems remains unaddressed.

This dissertation presents a layered set of contributions that address these barriers. DINO-MX is introduced as a unified, configuration-driven training framework for SSL with Vision Transformers that standardizes distributed execution, view construction, and artifact management. Three domain-specific protocols are developed on top of this substrate: a label-guided view construction protocol instantiated as CARD-ViT for coronary artery calcium detection; a magnification-aware distillation protocol instantiated as MAD-NP for neuropathology whole-slide image analysis; and a temporal-aware distillation protocol constituting the first self-supervised pretraining framework for biometric representation learning. Vision Foundry integrates these contributions within a compliance-aware platform accessible to domain practitioners without distributed computing expertise.

Digital Object Identifier (DOI)

https://doi.org/10.13023/etd.2026.128

Archival?

Archival
