Benanza: Automatic Benchmark Generation to Characterize “Lower-Bound” Latency of ML Models and Inform Optimizations on GPUs
The past few years have seen a surge in efforts to benchmark Deep Learning (DL) models. These benchmarks characterize representative models and serve as bases for proposing software or hardware stack optimizations. Current efforts from benchmarking to optimization, however, are largely manual and lack the highly desired abilities to determine the gap between currently achieved and ideal performance, to identify potential inefficiencies in model execution, and to quantify the benefits of applying possible optimizations. This slow characterization/optimization cycle is further strained by the fast pace at which DL models are introduced. The ability to quickly generate benchmarks, characterize models, and pinpoint potential optimizations is therefore highly desirable.
We propose Benanza, a sustainable and extensible design to speed up the characterization/optimization cycle of DL models on GPUs. Two components form the basis of Benanza: a configurable benchmark generator that automatically generates micro-benchmarks given a set of models, and an analyzer that computes the “lower-bound” latency of DL models using the benchmarking data and informs optimizations of model execution (illustrative sketches of both components appear after this abstract). The “lower-bound” latency metric estimates the ideal execution of a model on a GPU system and serves as the baseline for identifying optimization opportunities in frameworks or system libraries. We use Benanza to evaluate the “lower-bound” latency of $30$ ONNX models and compare it against MXNet, ONNX Runtime, and PyTorch on $7$ GPUs ranging from Kepler to the latest Turing. We further use the analyzer to identify optimization opportunities in layer execution (up to $3.19\times$ speedup, with a geometric mean of $28.2\%$, on Tesla V100), in cuDNN algorithm selection ($8$–$30\%$ across GPUs), and in frameworks (pinpointing inefficiencies in MXNet and PyTorch), and to quantify the benefits of layer fusion and Tensor Cores (up to $8.5\%$ and $35\%$ improvement for ResNet50-v1, respectively).
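The abstract does not spell out how the benchmark generator derives micro-benchmarks from models. The following is a minimal, hypothetical sketch, not the paper's implementation: it assumes the onnx Python package and a locally available model file, and the function name unique_layer_configs is invented for illustration. It enumerates the distinct layer configurations that a generator could emit one micro-benchmark for; a real generator would also need input/output shapes (e.g., via ONNX shape inference) and would emit library benchmark code rather than print the configurations.

\begin{verbatim}
# Hypothetical sketch, not the paper's implementation: enumerate the distinct
# layer configurations in an ONNX model, each a candidate micro-benchmark.
# Assumes `pip install onnx` and a local model file; all names are illustrative.
import onnx

def unique_layer_configs(model_path):
    model = onnx.load(model_path)
    configs = set()
    for node in model.graph.node:
        # Approximate a layer "configuration" by its operator type plus its
        # attributes (kernel shape, strides, pads, ...). Input shapes, which a
        # real generator would also need, are omitted here for brevity.
        attrs = tuple(sorted(
            (a.name, str(onnx.helper.get_attribute_value(a)))
            for a in node.attribute))
        configs.add((node.op_type, attrs))
    return configs

if __name__ == "__main__":
    for op_type, attrs in sorted(unique_layer_configs("resnet50-v1.onnx")):
        print(op_type, dict(attrs))
\end{verbatim}

Deduplicating configurations across the input set of models would keep the number of generated micro-benchmarks manageable, since identical layer configurations shared by several models need to be benchmarked only once.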
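For the analyzer's “lower-bound” latency, one plausible formalization, assuming layers execute sequentially (the symbols $L$, $\ell$, $t_{\min}$, and $t_{\text{bench}}$ are introduced here for illustration and do not appear in the abstract), is to sum, over the $L$ layers of a model, the fastest micro-benchmarked latency of each layer:

\[
  T_{\text{lower-bound}} \;=\; \sum_{\ell=1}^{L} t_{\min}(\ell),
  \qquad
  t_{\min}(\ell) \;=\; \min_{k \,\in\, \text{kernels}(\ell)} t_{\text{bench}}(\ell, k),
\]

where $t_{\text{bench}}(\ell, k)$ is the micro-benchmarked latency of layer $\ell$ under library algorithm (kernel) choice $k$. The gap between a framework's measured end-to-end latency and $T_{\text{lower-bound}}$ would then indicate the optimization opportunity the abstract refers to.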