Improving the Performance of Fully Connected Neural Networks by Out-of-Place Matrix Transpose


Abstract—Fully connected networks are widely used in deep learning, and their computational efficiency benefits greatly from the matrix multiplication routines provided by cuBLAS on GPUs. However, we found that cuBLAS has some drawbacks when computing the product of a matrix A and the transpose of a matrix B (i.e., the NT operation). To reduce the impact of the NT operation in cuBLAS, we exploit an out-of-place transpose of matrix B so that the NT operation is avoided, and we then apply our method to Caffe, a popular deep learning framework. Our contribution is two-fold. First, we propose a naive method (TNN) and a model-based method (MTNN) to accelerate the computation of A × B^T, which achieves about a 4.7x performance improvement in our tested cases on a GTX1080 card. Second, we integrate the MTNN method into Caffe to improve the efficiency of training fully connected networks, which achieves about a 70% speedup over the original Caffe on our configured fully connected networks on a GTX1080 card.
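As a concrete illustration, the sketch below computes C = A × B^T by first transposing B out of place and then issuing an ordinary NN GEMM, which is the general shape of the workaround described above. This is only a minimal sketch under our own assumptions: the function name, the column-major dimensions (A is m × k, B is n × k, C is m × n), and the use of cublasSgeam for the out-of-place transpose are ours, and the paper's actual TNN/MTNN kernels are not reproduced here. Error checking is omitted for brevity.

// Sketch (not the paper's TNN/MTNN code): replace one NT GEMM with an
// explicit out-of-place transpose of B followed by an NN GEMM.
// Column-major storage, as cuBLAS expects. A: m x k, B: n x k, C: m x n.
#include <cublas_v2.h>
#include <cuda_runtime.h>

void gemm_nt_via_transpose(cublasHandle_t handle,
                           const float *dA, const float *dB, float *dC,
                           int m, int n, int k)
{
    const float one = 1.0f, zero = 0.0f;

    // 1. Out-of-place transpose: Bt (k x n) = B^T. This temporary buffer
    //    is the extra memory cost discussed in the conclusion.
    float *dBt = nullptr;
    cudaMalloc(&dBt, sizeof(float) * k * n);
    cublasSgeam(handle, CUBLAS_OP_T, CUBLAS_OP_N, k, n,
                &one,  dB,  n,   // op(A) = B^T; B is n x k with ld = n
                &zero, dBt, k,   // second operand not read since beta = 0
                dBt, k);         // Bt is k x n with ld = k

    // 2. Plain NN GEMM on the transposed copy: C = A * Bt.
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k,
                &one, dA, m, dBt, k, &zero, dC, m);

    cudaFree(dBt);
}

In a fully connected layer, the transposed copy of the weight matrix could of course be cached across iterations rather than rebuilt on every call, at the cost of the additional memory noted below.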


Conclusion and Future Work
Our method for multiplying matrix A by the transpose of matrix B performs much better than the corresponding cuBLAS API. It achieves about a 4.7x speedup compared to using cuBLAS directly on a GTX1080 card. Furthermore, when the method is applied to Caffe, the optimized Caffe achieves about a 70% speedup on a GTX1080 card.
The transposition algorithm we use is an out-of-place method, which requires an extra copy of the matrix and cannot run if not enough memory is available. Therefore, we plan to exploit in-place matrix transposition algorithms and to find a good trade-off between memory overhead and throughput.
