Date Available
12-20-2023
Year of Publication
2023
Degree Name
Master of Computer Engineering (MCompE)
Document Type
Master's Thesis
College
Engineering
Department/School/Program
Electrical and Computer Engineering
First Advisor
Dr. Ishan G. Thakkar
Abstract
As deep neural network (DNN) models increase significantly in complexity and size, it has become important to increase the computing capability of the specialized hardware architectures typically used for DNN processing. The major linear operations of DNNs, which comprise the fully connected and convolution layers, are commonly converted into general matrix-matrix multiplication (GEMM) operations for acceleration. Specialized GEMM accelerators are typically employed to implement these GEMM operations, where a GEMM operation is decomposed into multiple vector-dot-product operations that run in parallel. A common challenge in modern DNNs is the mismatch between the matrices used for GEMM operations and the hardware size of the GEMM accelerator. If the matrices are smaller than the hardware size, some hardware resources sit idle but still consume static power, which diminishes energy efficiency. Conversely, if the matrices are larger than the hardware size, the many vector-dot-product operations involved in a GEMM operation cannot be fully mapped onto the hardware structure. As a result, the vector-dot-product operations need to be folded over time into multiple temporal frames. Each temporal frame generates a partial sum (psum) of the final output value of the corresponding dot-product operation. Consequently, to produce the final output matrix, these psums need to be stored in memory and redistributed back into the accelerator to be accumulated using a network of accumulators called a reduction network (RN). To efficiently accelerate modern DNNs with heterogeneous matrix sizes, customized spatial GEMM accelerators have been introduced in prior work. These accelerators employ flexible RNs to implement spatial and temporal reduction of psums of heterogeneous sizes. They create unique mappings of matrices depending on their sizes to compute multiple vector-dot-products in parallel while minimizing the number of computing resources that remain idle.
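To make the temporal-folding idea concrete, the following is a minimal Python sketch of how a dot product wider than the accelerator's hardware vector width is folded into temporal frames, each producing a psum that the reduction network must later accumulate. The hardware width of 4 and the function name are illustrative assumptions, not the accelerator model from the thesis.

# Minimal sketch (assumed parameters, not the thesis's accelerator model):
# a dot product wider than the hardware vector width is folded over time
# into temporal frames, each producing a partial sum (psum) to be reduced.
import numpy as np

HW_VECTOR_WIDTH = 4  # assumed hardware dot-product width

def folded_dot_product(a, b, width=HW_VECTOR_WIDTH):
    """Compute a.b by folding it into ceil(len(a)/width) temporal frames."""
    assert len(a) == len(b)
    psums = []
    for start in range(0, len(a), width):
        # One temporal frame: a hardware-sized vector-dot-product
        psums.append(np.dot(a[start:start + width], b[start:start + width]))
    # The reduction network (RN) accumulates the psums into the final value
    return sum(psums), len(psums)

a = np.arange(10, dtype=float)
b = np.ones(10)
result, frames = folded_dot_product(a, b)
print(result, frames)  # 45.0 computed over 3 temporal frames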
Despite their advantages, these flexible RNs from prior work are still limited by their electronic design. A flexible RN typically comprises a network of accumulators that work together to collect and reduce psums. Every electronic accumulator has a limited fan-in, and therefore a large number of accumulators must be connected together. This increases the number of hardware components and network links required to achieve the desired reduction of psums, degrading performance and energy efficiency. To address this shortcoming, photonic devices and interconnects have been demonstrated. In this thesis, I present an innovative use of state-of-the-art photonic devices and interconnects to build a novel photonic RN architecture. My photonic RN architecture substantially reduces the counts of photonic accumulators and links required to achieve the spatial and temporal reduction of psums of heterogeneous sizes with massive parallelism. I evaluate my photonic RN against the state-of-the-art electronic RN architectures from prior work for four modern DNN workloads. The evaluation results show a latency speed-up of up to 5.63× and an energy-efficiency improvement of up to 1.97× on average across the considered DNN workloads.
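The following short Python sketch illustrates the fan-in argument only in rough, generic terms: reducing N psums with accumulators of fan-in k takes about ceil((N-1)/(k-1)) accumulators arranged in a tree of depth roughly log_k N, so a larger effective fan-in shrinks both the component count and the number of links. The psum count and fan-in values are assumed for illustration and are not figures from the thesis.

# Illustrative arithmetic only (assumed parameters, not the thesis's numbers):
# cost of a reduction tree that accumulates N psums with fan-in-k accumulators.
import math

def reduction_tree_cost(num_psums, fan_in):
    # Internal nodes of a k-ary reduction tree over num_psums leaves
    accumulators = math.ceil((num_psums - 1) / (fan_in - 1))
    # Tree depth: repeatedly merge groups of `fan_in` psums until one remains
    depth, remaining = 0, num_psums
    while remaining > 1:
        remaining = math.ceil(remaining / fan_in)
        depth += 1
    return accumulators, depth

for fan_in in (2, 4, 16):
    accs, depth = reduction_tree_cost(num_psums=64, fan_in=fan_in)
    print(f"fan-in {fan_in}: {accs} accumulators, tree depth {depth}")
# A larger effective fan-in (e.g., via photonic accumulation) reduces both
# the accumulator count and the number of network links in the RN.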
Digital Object Identifier (DOI)
https://doi.org/10.13023/etd.2023/478
Funding Information
National Science Foundation, Computer and Network Systems grant no. 2139167 (2023).
Recommended Citation
Bose, Bobby, "A Flexible Photonic Reduction Network Architecture for Spatial GEMM Accelerators for Deep Learning" (2023). Theses and Dissertations--Electrical and Computer Engineering. 195.
https://uknowledge.uky.edu/ece_etds/195
Included in
Computer and Systems Architecture Commons, Digital Communications and Networking Commons