Research Poster: (RP21) Optimizing Deep Learning LSTM Topologies on Intel Xeon Architecture
Event Type: Research Poster
Tags: AI/Machine Learning/Deep Learning, Parallel Algorithms, Performance Analysis and Optimization
Time: Tuesday, June 18th, 8:30am - 10am
Location: Substanz 1, 2
Description: Long short-term memory (LSTM) is a type of recurrent neural network that is well suited for processing temporal data. In this work, we present an optimized implementation of the LSTM cell for Intel Xeon architecture. Typical implementations of the LSTM cell employ one or two large GEMM calls and then apply the element-wise operations (sigmoid/tanh) to the GEMM results. While this approach is easy to implement by exploiting vendor-optimized GEMM library calls, the data reuse depends on how the GEMMs are parallelized and is sub-optimal for the GEMM sizes stemming from small minibatches. Also, the element-wise operations are exposed as a bandwidth-bound kernel after the GEMM, which is typically a compute-bound kernel. To address this discrepancy, we implemented a parallel blocked-matrix GEMM in order to (a) achieve load balance, (b) maximize reuse of the weight matrices, and (c) fuse the element-wise operations after partial GEMM blocks are computed, while they are still hot in cache. Additionally, we bring the time step loop into our cell to further increase weight reuse and to amortize the overhead of transforming the weights into the blocked layout. The results show that our forward pass can be up to 1.4x faster than the MKL-DNN implementation, while the backward/update pass can be up to 1.3x faster. Furthermore, we modified the TensorFlow framework to use our LSTM cell for end-to-end training of Google's neural machine translation application and attained an identical BLEU score in as many iterations as the original TensorFlow implementation, while achieving a 1.9x speedup for the 8-layer German-to-English translation model.
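To make the blocked/fused scheme concrete, below is a minimal NumPy sketch of one LSTM forward time step that computes the gate pre-activations one column block at a time and applies the sigmoid/tanh non-linearities while each block is still hot in cache. This is only an illustration of the idea, not the poster's implementation: all names (lstm_cell_fused, W, R, blk) are hypothetical, and the actual Xeon code would use blocked weight layouts and optimized microkernels rather than NumPy slicing.

```python
# Minimal sketch of a fused, blocked LSTM forward step (illustrative only).
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_cell_fused(x_t, h_prev, c_prev, W, R, b, blk=64):
    """One LSTM time step; N = minibatch size, H = hidden size.

    W: (input_dim, 4H) and R: (H, 4H) hold the i, f, g, o gate weights
    side by side. Instead of one large GEMM followed by a separate
    element-wise pass over all 4H columns, we compute the gate
    pre-activations one column block at a time and fuse sigmoid/tanh
    onto each partial GEMM result while it is hot in cache.
    """
    N, H = h_prev.shape
    i = np.empty((N, H)); f = np.empty((N, H))
    g = np.empty((N, H)); o = np.empty((N, H))
    for j0 in range(0, H, blk):                  # block over hidden columns
        j1 = min(j0 + blk, H)
        for k, (dst, act) in enumerate([(i, sigmoid), (f, sigmoid),
                                        (g, np.tanh), (o, sigmoid)]):
            c0, c1 = k * H + j0, k * H + j1      # columns of this gate block
            # partial GEMM for one block, then fused element-wise op
            z = x_t @ W[:, c0:c1] + h_prev @ R[:, c0:c1] + b[c0:c1]
            dst[:, j0:j1] = act(z)
    c_t = f * c_prev + i * g                     # new cell state
    h_t = o * np.tanh(c_t)                       # new hidden state
    return h_t, c_t
```

In the optimized cell described in the abstract, the time step loop would also live inside this routine, so the weights are transformed into the blocked layout once and then reused across all time steps, which is what amortizes the layout-transformation overhead.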