An interesting paper appeared at ICML 2017: Memory-Efficient Convolution (MEC) proposes a new way to compute convolution with much less memory and noticeably faster performance.
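The core idea of MEC is a more compact lowering than the classical im2col: instead of materializing one row per output pixel (O(oh·ow·kh·kw) memory), it copies one slab of kw input columns per horizontal output position (O(ow·ih·kw) memory) and then runs a small GEMM per output row over shifted views of that lowered matrix. A minimal single-channel, stride-1, no-padding sketch in NumPy, with names of my own choosing:

```python
import numpy as np

def mec_conv2d(I, K):
    """MEC-style convolution (cross-correlation) of a 2D input I with kernel K.

    Single channel, stride 1, no padding -- a sketch of the lowering idea,
    not the full multi-channel algorithm from the paper.
    """
    ih, iw = I.shape
    kh, kw = K.shape
    oh, ow = ih - kh + 1, iw - kw + 1

    # Lowering: one slab of kw input columns per horizontal output position.
    # L needs ow * ih * kw floats, versus oh * ow * kh * kw for im2col.
    L = np.empty((ow, ih, kw), dtype=I.dtype)
    for w in range(ow):
        L[w] = I[:, w:w + kw]
    L = L.reshape(ow, ih * kw)

    # One small GEMM per output row, each over a shifted view of L:
    # columns [h*kw, h*kw + kh*kw) of row w hold the patch I[h:h+kh, w:w+kw].
    k = K.reshape(kh * kw)
    O = np.empty((oh, ow), dtype=I.dtype)
    for h in range(oh):
        O[h] = L[:, h * kw:(h + kh) * kw] @ k
    return O
```

The oh small GEMMs all read overlapping slices of the same lowered buffer, which is where the memory saving comes from; on a GPU they are the natural candidates for a batched GEMM call.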
Empirically, our code was most efficient for convolutions involving a large number of planes. But a couple of months later NVIDIA released cuDNN, and we did not think our code was any faster. In fact, a comparison with cuDNN is missing from the MEC paper, because cuDNN is not open source. This is unfortunate.
The MEC authors use cublasSgemmBatched and report that it helps a lot. We discussed it, but I do not remember whether we tried it. We did, on the other hand, try calling cublasSgemm on multiple streams, but that did not help on our combination of CUDA and GPU. Things may have improved…
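The appeal of a batched GEMM is that MEC's many small, independent matrix products (one per output row) can be dispatched in a single call, amortizing kernel-launch overhead, rather than looped one by one or spread across streams. The arithmetic is identical either way; a sketch of the equivalence using NumPy's stacked matmul, with illustrative shapes of my own choosing:

```python
import numpy as np

# Illustrative sizes: oh output rows, ow output columns, kh*kw kernel taps.
oh, ow, khkw = 4, 5, 9
rng = np.random.default_rng(0)
A = rng.standard_normal((oh, ow, khkw))  # one lowered slice per output row
k = rng.standard_normal((khkw, 1))       # flattened kernel

# Looped formulation: one small GEMM per output row
# (analogous to repeated cublasSgemm calls).
looped = np.stack([A[h] @ k for h in range(oh)])

# Batched formulation: a single matmul over the leading batch dimension
# (analogous to one cublasSgemmBatched call).
batched = A @ k  # shape (oh, ow, 1)

assert np.allclose(looped, batched)
```

On the CPU the two run the same FLOPs; the batched form only pays off on hardware where per-call dispatch overhead dominates small GEMMs, which is the situation the MEC paper describes.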