Author ORCID Identifier

Year of Publication


Degree Name

Doctor of Philosophy (PhD)

Document Type

Doctoral Dissertation




Electrical and Computer Engineering

First Advisor

Dr. Sen-ching Samson Cheung

Second Advisor

Dr. Nathan Jacobs


Machine learning-based approaches have been achieving state-of-the-art results on many computer vision tasks. While deep learning and convolutional networks have been incredibly popular, these approaches come at the expense of huge amounts of labeled data required for training. Manually annotating large amounts of data, often millions of images in a single dataset, is costly and time consuming. To deal with the problem of data annotation, the research community has been exploring approaches that require less amount of labelled data.

The central problem that we consider in this research is image synthesis without any manual labeling. Image synthesis is a classic computer vision task that requires understanding of image contents and their semantic and geometric properties. We propose that we can train image synthesis models by relying on sequences of videos and using weakly supervised learning. Large amounts of unlabeled data are freely available on the internet. We propose to set up the training in a multi-image setting so that we can use one of the images as the target - this allows us to rely only on images for training and removes the need for manual annotations. We demonstrate three main contributions in this work.

First, we present a method of fusing multiple noisy overhead images to make a single, artifact-free image. We present a weakly supervised method that relies on crowd-sourced labels from online maps and a completely unsupervised variant that only requires a series of satellite images as inputs. Second, we propose a single-image novel view synthesis method for complex, outdoor scenes. We propose a learning-based method that uses pairs of nearby images captured on urban roads and their respective GPS coordinates as supervision. We show that a model trained with this automatically captured data can render a new view of a scene that can be as far as 10 meters from the input image. Third, we consider the problem of synthesizing new images of a scene under different conditions, such as time of day and season, based on a single input image. As opposed to existing methods, we do not need manual annotations for transient attributes, such as fog or snow, for training. We train our model by using streams of images captured from outdoor webcams and time-lapse videos.

Through these applications, we show several settings where we can train state-of-the-art deep learning methods without manual annotations. This work focuses on three image synthesis tasks. We propose weakly supervised learning and remove requirements for manual annotations by relying on sequences of images. Our approach is in line with the research efforts that aim to minimize the labels required for training machine learning methods.

Digital Object Identifier (DOI)