Benanza: Automatic Benchmark Generation to Characterize “Lower-Bound” Latency of ML Models and Inform Optimizations on GPUs
The past few years have seen a surge in efforts to benchmark Deep Learning (DL) models. These benchmarks characterize representative models and serve as bases for proposing software or hardware stack optimizations. Current efforts from benchmarking to optimization, however, are largely manual and lack the highly desired abilities to determine the gap between currently achieved and ideal performance, to identify potential inefficiencies in model execution, and to quantify the benefits of applying possible optimizations. This slow characterization/optimization cycle is further strained by the fast pace at which DL models are introduced. The ability to quickly generate benchmarks, characterize models, and pinpoint potential optimizations is therefore highly desirable.
We propose Benanza, a sustainable and extensible design to speed up the characterization/optimization cycle of DL models on GPUs. Two components form the basis of Benanza: a configurable benchmark generator that automatically generates micro-benchmarks given a set of models, and an analyzer that computes the “lower-bound” latency of DL models using the benchmarking data and informs optimizations of model execution (illustrative sketches of both components appear after this abstract). The “lower-bound” latency metric estimates the ideal execution of a model on a GPU system and serves as the baseline for identifying optimization opportunities in frameworks or system libraries. We use Benanza to evaluate the “lower-bound” latency of $30$ ONNX models and compare it against MXNet, ONNX Runtime, and PyTorch on $7$ GPUs ranging from Kepler to the latest Turing. We further use the analyzer to identify optimization opportunities in layer execution (up to $3.19\times$ speedup, with a geometric mean of $28.2\%$, on Tesla V100), in cuDNN algorithm selection ($8$–$30\%$ across GPUs), and in frameworks (pinpointing inefficiencies in MXNet and PyTorch), and to quantify the benefits of layer fusion and Tensor Cores (up to $8.5\%$ and $35\%$ improvement for ResNet50-v1, respectively).
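The abstract does not spell out how the benchmark generator derives micro-benchmarks from models. The following is a minimal, hypothetical sketch, not the paper's implementation: it assumes the onnx Python package and a locally available model file, and the function name unique_layer_configs is invented for illustration. It enumerates the distinct layer configurations that a generator could emit one micro-benchmark for; a real generator would also need input/output shapes (e.g., via ONNX shape inference) and would emit library benchmark code rather than print the configurations.

\begin{verbatim}
# Hypothetical sketch, not the paper's implementation: enumerate the distinct
# layer configurations in an ONNX model, each a candidate micro-benchmark.
# Assumes `pip install onnx` and a local model file; all names are illustrative.
import onnx

def unique_layer_configs(model_path):
    model = onnx.load(model_path)
    configs = set()
    for node in model.graph.node:
        # Approximate a layer "configuration" by its operator type plus its
        # attributes (kernel shape, strides, pads, ...). Input shapes, which a
        # real generator would also need, are omitted here for brevity.
        attrs = tuple(sorted(
            (a.name, str(onnx.helper.get_attribute_value(a)))
            for a in node.attribute))
        configs.add((node.op_type, attrs))
    return configs

if __name__ == "__main__":
    for op_type, attrs in sorted(unique_layer_configs("resnet50-v1.onnx")):
        print(op_type, dict(attrs))
\end{verbatim}

Deduplicating configurations across the input set of models would keep the number of generated micro-benchmarks manageable, since identical layer configurations shared by several models need to be benchmarked only once.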
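For the analyzer's “lower-bound” latency, one plausible formalization, assuming layers execute sequentially (the symbols $L$, $\ell$, $t_{\min}$, and $t_{\text{bench}}$ are introduced here for illustration and do not appear in the abstract), is to sum, over the $L$ layers of a model, the fastest micro-benchmarked latency of each layer:

\[
  T_{\text{lower-bound}} \;=\; \sum_{\ell=1}^{L} t_{\min}(\ell),
  \qquad
  t_{\min}(\ell) \;=\; \min_{k \,\in\, \text{kernels}(\ell)} t_{\text{bench}}(\ell, k),
\]

where $t_{\text{bench}}(\ell, k)$ is the micro-benchmarked latency of layer $\ell$ under library algorithm (kernel) choice $k$. The gap between a framework's measured end-to-end latency and $T_{\text{lower-bound}}$ would then indicate the optimization opportunity the abstract refers to.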