
© 2022
Mohammadreza Soltaniyeh
ALL RIGHTS RESERVED

HARDWARE-SOFTWARE TECHNIQUES FOR ACCELERATING SPARSE COMPUTATION
By
MOHAMMADREZA SOLTANIYEH
A dissertation submitted to the
School of Graduate Studies
Rutgers, The State University of New Jersey
In partial fulfillment of the requirements
For the degree of
Doctor of Philosophy
Graduate Program in Computer Science
Written under the direction of
Santosh Nagarakatte
And approved by

Dissertation Director: Prof. Santosh Nagarakatte

Linear algebra kernels are widely used in various fields such as machine learning, data science, physical science, and graph analysis. Many of these applications work with sparse data (i.e., only a small fraction of the data is non-zero). Sparse data are often stored in a compressed format (i.e., a sparse format) that stores only the non-zero elements along with additional metadata to identify where the non-zero elements are located. Using compressed formats eliminates the need to store and process zeros, making the storage and computation of sparse kernels more efficient.


… sparse formats. Our results show that REAP achieves both high frequency and promising speedups compared to state-of-the-art FPGA accelerators for SpMV and SpGEMM while supporting various sparse formats and precision configurations. Second, we present a hardware accelerator for sparse CNN inference tasks. We formulate the convolution operation as general matrix-matrix multiplication (GEMM) using an image-to-column (IM2COL) transformation. With a dynamically reconfigurable GEMM unit and a novel IM2COL hardware unit, our design can support various layers in CNNs with high performance. In addition, our design exploits sparsity in both weights and feature maps. We use software to perform group-wise pruning, followed by a preprocessing step that puts the pruned weights into our hardware-friendly sparse format for efficient and high-performance computation. We evaluated our accelerator using a cycle-level simulator and an HLS implementation realized on an Alveo FPGA board. Our ASIC design is on average 2.16×, 1.74×, and 1.63× faster than Gemmini, Eyeriss, and Sparse-PE, respectively, which are prior hardware accelerators for dense and sparse CNNs. In addition, our hardware accelerator is 78× and 12× more energy-efficient than CPU and GPU implementations, respectively.

I would also like to thank the other members of my Ph.D. committee, Prof. Yipeng Huang and Prof. Joe Devietti, for their thoughtful feedback about my research that has helped shape this dissertation.

Additionally, I would like to thank the Memory Solution team at Samsung Semiconductor, where I did two internships during my Ph.D. I particularly want to thank my mentors and manager, Veronica Lagrange Moutinho dos Reis, Matt Bryson, and Xuebin Yao, from whom I learned a lot.

Last but not least, I would like to thank all my family for their unconditional love and support throughout the years. I would also like to thank Mahdi, Alireza, Kathryn, Shahab, Mahyar, Vahid, Shiva, and many other friends with whom I had the opportunity to have fun times.


Table of Contents

Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . v

List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xii

1.1 Dissertation Statement
1.1.1 Synergistic CPU-FPGA Acceleration for SpMV and SpGEMM
1.1.2 …
1.1.3 An End-to-End FPGA Prototype of SPOTS for Sparse CNNs
1.2 Papers Related to this Dissertation
1.3 Dissertation Organization

2.1 … Algebra Kernels
2.1.1 Contributions
2.2 Background on Sparse Formats and Sparse Linear Algebra Kernels
2.2.1 A Background on Sparse Formats
2.2.2 A Background on SpMV and SpGEMM Kernels
2.3 Synergistic CPU-FPGA Acceleration
2.3.1 Instantiation of REAP to Accelerate SpMV
2.3.2 Instantiation of REAP to Accelerate SpGEMM
2.4 Experimental Methodology of REAP
2.5 Experimental Evaluation of REAP for SpMV and SpGEMM
2.6 Related Work on Accelerators for SpMV and SpGEMM Kernels
2.7 Summary

3.1 Overview of SPOTS
3.1.1 Novelties of SPOTS
3.2 Background on CNNs, IM2COL, and Sparsity-Awareness in CNNs
3.2.1 Convolution Neural Networks
3.2.2 Transforming Convolution to General Matrix-Matrix Multiplication
3.2.3 Sparsity-Awareness in CNNs
3.3 Motivation for a GEMM-based Formulation of Convolution and Sparsity-Awareness Design in SPOTS
3.3.1 …ing Elements
3.3.2 Benefits and Challenges of Convolution with IM2COL
3.3.3 Drawbacks of Prior Sparsity-aware Designs
3.4.2 The GEMM Unit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
3.4.3 Handling Sparsity in CNNs . . . . . . . . . . . . . . . . . . . . . . . . 73

Chapter 4: An FPGA Accelerator for Sparse Convolutional Neural Networks . . . 85
4.1 Background on High Level Synthesis for FPGAs . . . . . . . . . . . . . . . 86
4.2.2 The GEMM Unit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
4.2.3 Sparsity-aware Design . . . . . . . . . . . . . . . . . . . . . . . . . . . 99

Chapter 5: … Sparse Convolutional Neural Networks . . . . . . . . . . . . . . . 107
5.1 Experimental Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . 107
Comparing the Energy Efficiency of SPOTS ASIC and FPGA Prototypes with CPUs and GPUs . . . 114
Performance Sensitivity to Different Layers' Shapes
Performance and Energy Characterization of IM2COL Unit . . . 121
Load Imbalance in SPOTS . . . 122

Chapter 6: Conclusion and Future Directions . . . . . . . . . . . . . . . . . . . 123
Exploring New Programming Languages for Sparse Kernels on FPGAs . . . 126

List of Tables

2.1 The CPU and FPGA configurations. . . . . . . . . . . . . . . . . . . . . . . 39
FPGA resource utilization for SpMV and SpGEMM designs. . . . . . . . . . . . 39
List of matrices from SparseSuite [27] used in our evaluation. . . . . . . . . . . 41
… for three sparse formats and different input precisions. . . . . . . . . . . . . 42
… accelerate sparse linear algebra kernels. . . . . . . . . . . . . . . . . . . . . 51
SPOTS ASIC design parameters and area. . . . . . . . . . . . . . . . . . . . . 108
The CPU, GPU, and FPGA configurations for SPOTS evaluation. . . . . . . . . 108
The FPGA configurations for SPOTS evaluation. . . . . . . . . . . . . . . . . 109
Network characteristics, the top-1 and top-5 accuracy, and the overall sparsity for the original (with no pruning), random pruning, and our structured pruning using the ImageNet dataset. . . . 110
Comparing the performance and efficiency of the SPOTS ASIC design with …

List of Figures

2.1 … CSR, CSC, DIA, ELL, and RLC. . . . . . . . . . . . . . . . . . . . . . . 19
2.2 An illustration of four different formulations to perform general matrix-matrix multiplication. . . . 21
2.3 The floorplan of XCU200 FPGA. . . . . . . . . . . . . . . . . . . . . . . . 25
2.4 An illustration of the overlapping of the tasks performed by the CPU and the FPGA in REAP. . . . 27
2.5 An illustration of RIR bundle generation for SpMV. . . . . . . . . . . . . . . 30
2.6 An illustration of the FPGA design for SpMV. . . . . . . . . . . . . . . . . . 31
2.7 An illustration of RIR bundle generation for SpGEMM. . . . . . . . . . . . . 34
2.8 Overall architecture of the FPGA design for SpGEMM. . . . . . . . . . . . . 34
2.9 An illustration of the multiply unit in action for SpGEMM design. . . . . . . 35
2.12 Comparing REAP's speedup with the recent FPGA accelerators for SpMV and SpGEMM. . . . 45
2.13 REAP execution breakdown. . . . . . . . . . . . . . . . . . . . . . . . . . 47
An illustration of a convolution layer along with its inputs. . . . . . . . . . . . . 58
An illustration of an FC layer and its matrix form. . . . . . . . . . . . . . . . . 59
An illustration of transforming a convolution layer to a GEMM computation. . . 60
A comparison of various pruning methods for CNNs. . . . . . . . . . . . . . . . 61
… VGGNet, and GoogleNet for a CPU system. . . . . . . . . . . . . . . . . . . 63
… systolic array-based GEMM unit. . . . . . . . . . . . . . . . . . . . . . . . . 67
An illustration of patch generation using the PUs in the IM2COL unit. . . . . . . 68
An illustration of the GEMM unit in our accelerator. . . . . . . . . . . . . . . . 71
… state-of-the-art sparse formats. . . . . . . . . . . . . . . . . . . . . . . . . . 75
… our sparsity-aware architecture. . . . . . . . . . . . . . . . . . . . . . . . . . 76
3.11 An illustration of different configurations of the systolic array GEMM unit. . 77
4.5 A high-level overview of IM2COL and patch units. . . . . . . . . . . . . . . 92
4.6 An illustration of the IM2COL unit in our FPGA design for sparse CNNs. . . 93
… a variety of filters. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
5.1 An illustration showing the sparsity ratio in the filters and input feature maps …
Evaluating the ASIC prototype of SPOTS in performance compared to Sparse-PE, Eyeriss, and Gemmini for four CNNs: AlexNet, VGGNet, ResNet, and GoogleNet …
5.6 The load imbalance percentage in the pruned weights for AlexNet, VGGNet, ResNet, and GoogleNet. . . . 120

First, a sparse format can reduce the total storage by orders of magnitude compared to a dense representation. In addition, operations on zeros can be skipped to reduce the overall computation.

Sparse formats store only the non-zero elements and use various techniques to encode the positions of the non-zero elements within the matrix [15, 26, 72, 91, 104, 131]. In most standard sparse formats, determining the position of the non-zero elements requires a series of indirect memory accesses that introduce irregular memory access patterns. For example, to access an element A[i, j] in the Compressed Sparse Row (CSR) [104] format, one needs to consult the row pointer to obtain the location where row i begins (i.e., row pointer[i]), search the column indices until the beginning of the next row to check whether the item is present (i.e., search from col[row pointer[i]] to col[row pointer[i + 1]]), and then access the data element at the matched index. In addition, skipping the operations involving zeros requires extra work to determine whether the non-zero elements match before performing the operation. The overhead of accessing the indices of the sparse structure, not including the matching cost, can exceed the work of the mathematical operations by 2×-5× [59, 61].
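To make the indirection concrete, the following is a minimal C++ sketch of the CSR lookup described above. It is an illustration only, not code from this dissertation; the array names row_ptr, col_idx, and vals are assumed names for the standard three-array CSR layout.

#include <cstddef>
#include <vector>

// Minimal CSR container (illustrative; the field names are assumptions).
struct CSRMatrix {
    std::vector<std::size_t> row_ptr;  // row_ptr[i] .. row_ptr[i+1] bound row i's non-zeros
    std::vector<std::size_t> col_idx;  // column index of each stored non-zero
    std::vector<double>      vals;     // value of each stored non-zero
};

// Return A[i][j], or 0.0 if that entry is not stored (i.e., it is an implicit zero).
// Note the two levels of indirection: row_ptr gives the search range,
// and col_idx must be scanned to locate the requested column.
double csr_get(const CSRMatrix& A, std::size_t i, std::size_t j) {
    for (std::size_t k = A.row_ptr[i]; k < A.row_ptr[i + 1]; ++k) {
        if (A.col_idx[k] == j) {
            return A.vals[k];          // found a stored non-zero at (i, j)
        }
    }
    return 0.0;                        // not stored, so the value is zero
}

The scan from row_ptr[i] to row_ptr[i + 1] is exactly the index-chasing overhead discussed above: two dependent loads happen before any useful arithmetic can start.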

GPUs use a Single Instruction Multiple Data (SIMD) model to perform more useful work per instruction. Additionally, they rely on caching techniques to maximize memory performance by exploiting temporal and spatial data locality. SIMD and caching techniques are effective for most dense computations, but they are not as effective for sparse problems. Using sparse formats introduces many indirect and irregular memory accesses, which limits the benefits of caching and the SIMD model. As a result, using the same formulations and optimizations for the sparse version as for the dense version results in poor performance.
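As an illustration of why these accesses limit caching and SIMD, consider a textbook CSR-based SpMV loop, sketched in C++ below. This is the standard formulation, not the design proposed in this dissertation, and the parameter names are assumptions. The gather x[col_idx[k]] depends on the data, so its locality cannot be predicted, and the per-row trip count varies, which complicates vectorization.

#include <cstddef>
#include <vector>

// y = A * x for a CSR matrix A with num_rows rows.
// row_ptr, col_idx, and vals follow the standard CSR layout (illustrative names).
void spmv_csr(std::size_t num_rows,
              const std::vector<std::size_t>& row_ptr,
              const std::vector<std::size_t>& col_idx,
              const std::vector<double>& vals,
              const std::vector<double>& x,
              std::vector<double>& y) {
    for (std::size_t i = 0; i < num_rows; ++i) {
        double sum = 0.0;
        // The inner trip count (row length) varies from row to row, and
        // x[col_idx[k]] is an indirect, data-dependent gather.
        for (std::size_t k = row_ptr[i]; k < row_ptr[i + 1]; ++k) {
            sum += vals[k] * x[col_idx[k]];
        }
        y[i] = sum;
    }
}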

For many years, Moore's Law [92] and Dennard scaling [32] helped general-purpose processors become faster and more energy efficient transparently. Dennard scaling is now obsolete, and Moore's Law has slowed down and is expected to end in the coming years. In response, there has been increased interest in designing specialized hardware, such as ASICs and FPGAs, for various applications, including sparse problems. Specialization can be applied at different levels for sparse computation. At the algorithmic level, new algorithms and methods can be developed to perform better in sparse scenarios [15, 46, 137]. For sparse data storage, the sparse formats can be customized based on the operation's memory access pattern and the sparsity pattern of the inputs [38, 61, 97, 120]. Lastly, custom hardware can be designed to perform operations such as extracting the non-zero elements or matching the non-zeros more efficiently [7, 42, 52, 61]. These customizations aim to improve the performance and energy efficiency of sparse computation by extracting parallelism at a finer level and by optimizing the memory hierarchy to accommodate the irregular memory accesses that arise in sparse problems. Sparse Matrix-Vector Multiplication (SpMV), Sparse General Matrix-Matrix Multiplication (SpGEMM), and sparse neural networks are among the most studied sparse problems for such specialized hardware.

This dissertation presents a number of hardware-software techniques to improve the performance and energy efficiency of sparse linear algebra kernels, including SpMV and SpGEMM, and sparse convolutional neural networks. Our general strategy is to use software methods to reformat the sparse data into a hardware-friendly format that allows the hardware to perform the computation with a high degree of parallelism. The software improves design flexibility to support multiple sparse formats, and the hardware improves performance and energy efficiency. We develop an intermediate representation that allows the software to communicate regularized data and scheduling decisions to the hardware. In addition, most of the software and hardware execution is overlapped. We applied these hardware-software techniques to three sparse problems: sparse matrix-vector multiplication, sparse general matrix-matrix multiplication, and sparse convolutional neural networks. The different characteristics of these three applications raise different questions, and we answer some of them in the following contributions (a brief illustrative sketch of the software-side data regularization follows the list):

1. A CPU-FPGA system to improve the performance of SpMV and SpGEMM kernels in comparison to CPU-only and FPGA-only designs, while supporting multiple sparse formats and data precisions.
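To give a flavor of the kind of software-side regularization this strategy relies on, the C++ sketch below packs variable-length CSR rows into fixed-size bundles that a hardware pipeline could consume at a fixed rate. The Bundle layout, the bundle width, and the padding scheme are illustrative assumptions; this is not the RIR format used by REAP, which is described later in the dissertation.

#include <cstddef>
#include <vector>

// Illustrative only: a fixed-width bundle of non-zeros with their coordinates.
// This is NOT REAP's RIR; it only shows the idea of regularizing
// variable-length sparse rows into units a hardware pipeline can stream.
constexpr std::size_t kBundleWidth = 4;    // assumed bundle width

struct Bundle {
    std::size_t row;                       // row these non-zeros belong to
    std::size_t col[kBundleWidth];         // column indices (padding repeats the last valid index)
    double      val[kBundleWidth];         // values (padding entries are 0.0, so they contribute nothing)
};

// Pack a CSR matrix (standard three-array layout, illustrative names)
// into a stream of fixed-size bundles.
std::vector<Bundle> pack_bundles(const std::vector<std::size_t>& row_ptr,
                                 const std::vector<std::size_t>& col_idx,
                                 const std::vector<double>& vals) {
    std::vector<Bundle> out;
    const std::size_t num_rows = row_ptr.size() - 1;
    for (std::size_t i = 0; i < num_rows; ++i) {
        for (std::size_t k = row_ptr[i]; k < row_ptr[i + 1]; k += kBundleWidth) {
            Bundle b{};
            b.row = i;
            for (std::size_t w = 0; w < kBundleWidth; ++w) {
                const std::size_t idx = k + w;
                if (idx < row_ptr[i + 1]) {
                    b.col[w] = col_idx[idx];
                    b.val[w] = vals[idx];
                } else {                   // pad the tail of the row
                    b.col[w] = col_idx[row_ptr[i + 1] - 1];
                    b.val[w] = 0.0;
                }
            }
            out.push_back(b);
        }
    }
    return out;
}

Because every bundle has the same shape, the hardware can process one bundle per cycle without decoding variable-length rows, which is the regularity the intermediate representation is meant to provide.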
