Date Available

12-20-2023

Year of Publication

2023

Degree Name

Master of Computer Engineering (MCompE)

Document Type

Master's Thesis

College

Engineering

Department/School/Program

Electrical and Computer Engineering

First Advisor

Dr. Ishan G Thakkar

Abstract

As deep neural network (DNN) models grow significantly in complexity and size, it has become important to increase the computing capability of the specialized hardware architectures typically used for DNN processing. The major linear operations of DNNs, which comprise the fully connected and convolution layers, are commonly converted into general matrix-matrix multiplication (GEMM) operations for acceleration. Specialized GEMM accelerators are typically employed to implement these GEMM operations, where a GEMM operation is decomposed into multiple vector-dot-product operations that run in parallel. A common challenge in modern DNNs is the mismatch between the matrices used for GEMM operations and the hardware size of the GEMM accelerator. When the matrices are smaller than the hardware size, some hardware resources go idle but still consume static power, which diminishes energy efficiency. When the matrices are larger than the hardware size, the many vector-dot-product operations involved in a GEMM operation cannot be fully mapped onto the hardware structure. As a result, the vector-dot-product operations need to be folded over time into multiple temporal frames. Each temporal frame generates a partial sum (psum) of the final output value of the corresponding dot-product operation. Consequently, to produce the final output matrix, these psums need to be stored in memory and redistributed back into the accelerator, where they are accumulated using a network of accumulators called a reduction network (RN). To efficiently accelerate modern DNNs with heterogeneous matrix sizes, customized spatial GEMM accelerators have been introduced in prior work. These accelerators employ flexible RNs to implement spatial and temporal reduction of psums of heterogeneous sizes. They create unique mappings of matrices depending on their sizes to compute multiple vector-dot-products in parallel while minimizing the number of computing resources that remain idle.
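
The following minimal Python sketch illustrates the folding and psum-accumulation idea described above; the hardware size, vector lengths, and the names HW_SIZE and folded_dot_product are illustrative assumptions, not taken from the thesis.

```python
import numpy as np

# Illustrative sketch (not the thesis implementation): folding one output
# element of a GEMM onto a dot-product unit whose hardware size is smaller
# than the shared dimension K. Each temporal frame produces a partial sum
# (psum); a reduction network would accumulate these into the final value.

HW_SIZE = 4  # assumed fan-in of one hardware dot-product unit


def folded_dot_product(a_row, b_col, hw_size=HW_SIZE):
    """Compute a dot product by folding it into ceil(K / hw_size) temporal frames."""
    K = len(a_row)
    psums = []
    for start in range(0, K, hw_size):          # one temporal frame per slice
        frame_a = a_row[start:start + hw_size]
        frame_b = b_col[start:start + hw_size]
        psums.append(np.dot(frame_a, frame_b))  # psum produced by this frame
    return sum(psums), psums                    # reduction of all psums


rng = np.random.default_rng(0)
a_row, b_col = rng.standard_normal(10), rng.standard_normal(10)
total, psums = folded_dot_product(a_row, b_col)
assert np.isclose(total, np.dot(a_row, b_col))
print(f"{len(psums)} temporal frames, reduced result = {total:.4f}")
```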

Despite their advantages, these flexible RNs from prior work are still limited by their electronic design. A flexible RN typically comprises a network of accumulators that work together to collect and reduce psums. Every electronic accumulator has a limited fan-in, and therefore a large number of accumulators need to be connected together. This increases the number of hardware components and network links required to achieve the desired reduction of psums, degrading performance and energy efficiency. To address this shortcoming, photonic devices and interconnects have been demonstrated in recent work. In this thesis, I present an innovative use of state-of-the-art photonic devices and interconnects to build a novel photonic RN architecture. My photonic RN architecture substantially reduces the counts of accumulators and links required to achieve the spatial and temporal reduction of psums of heterogeneous sizes with massive parallelism. I evaluate my photonic RN against state-of-the-art electronic RN architectures from prior work for four modern DNN workloads. The evaluation results show a latency speed-up of up to 5.63× and an energy efficiency improvement of up to 1.97× on average across the considered DNN workloads.
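
To give a rough sense of why limited accumulator fan-in inflates hardware counts in an electronic RN, the back-of-the-envelope sketch below counts the accumulators and links in a fan-in-limited reduction tree; the 64-psum example, the fan-in values, and the name tree_reduction_cost are assumptions for illustration only and do not reproduce the thesis's models or results.

```python
import math

# Illustrative sketch (assumed numbers, not from the thesis): reducing N psums
# with accumulators of limited fan-in requires a multi-level tree, and hence
# many accumulators and links; a larger effective fan-in needs far fewer.


def tree_reduction_cost(n_psums, fan_in):
    """Count accumulators and links in a fan-in-limited reduction tree."""
    accumulators, links, remaining = 0, 0, n_psums
    while remaining > 1:
        nodes = math.ceil(remaining / fan_in)   # accumulators at this tree level
        accumulators += nodes
        links += remaining                      # one link per input being reduced
        remaining = nodes                       # their outputs feed the next level
    return accumulators, links


for fan_in in (2, 4, 64):
    accs, links = tree_reduction_cost(64, fan_in)
    print(f"fan-in {fan_in:>2}: {accs} accumulators, {links} links")
```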

Digital Object Identifier (DOI)

https://doi.org/10.13023/etd.2023/478

Funding Information

National Science Foundation (Computer and Network Systems grant no. 2139167), 2023.
