I am running two networks as per the generic GAN structure. However, calling g_loss.backward() on my generator's loss is as slow as calling loss.backward() on my discriminator's. The forward pass is very fast for both networks, but the backward calls are painfully slow (averaging an hour per epoch).
I believe the culprit is that some operations are bottlenecking on the CPU. For some reason, part of the backward pass is running on the CPU, and BCELoss appears to rely entirely on the CPU (in the profiler output below, binary_cross_entropy spends ~269 ms on the CPU versus ~0.36 ms on CUDA).
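In case it helps, the loss computation looks roughly like this (a simplified sketch; variable names here are illustrative, not taken from my gist — the real code is linked below):

```python
import torch
import torch.nn as nn

criterion = nn.BCELoss()

# Stand-in for the discriminator's output on a batch of 3x35x35 images:
# raw scores squashed to probabilities by a final sigmoid.
logits = torch.randn(64, 1, requires_grad=True)
d_out = torch.sigmoid(logits)

# Real/fake labels; in the real code both tensors are moved to the GPU.
real_labels = torch.ones(64, 1)

loss = criterion(d_out, real_labels)
loss.backward()  # this is the call that is unexpectedly slow for me
```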
The inputs to my discriminator are 3×35×35 images.
I ran the autograd profiler (with CUDA timing enabled) on my training code, and this was the output:
Name                             CPU time        CUDA time
binary_cross_entropy             269059.615us      356.323us
binary_cross_entropy_forward     269047.743us      353.271us
CudnnConvolutionBackward          88123.910us    50472.900us
cudnn_convolution_backward        88107.707us    50469.971us
ThnnConvDepthwise2DBackward          269.936us   56319.946us
thnn_conv_depthwise2d_backward        43.441us   55875.610us
ThnnConvDepthwise2DBackward          139.962us   41228.271us
thnn_conv_depthwise2d_backward        70.050us   40887.329us
ThnnConvDepthwise2DBackward          129.066us   27838.501us
thnn_conv_depthwise2d_backward        58.876us   27607.056us
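For reference, this is roughly how I collected the profile (a minimal sketch with a stand-in model rather than the actual training step from the gist):

```python
import torch
import torch.nn as nn
from torch.autograd import profiler

# Tiny stand-in model; in my case the profiled region wraps the
# real discriminator/generator training step.
model = nn.Linear(10, 1)
x = torch.randn(32, 10)
target = torch.rand(32, 1)
criterion = nn.BCELoss()

# use_cuda=True records CUDA kernel times alongside CPU times.
with profiler.profile(use_cuda=torch.cuda.is_available()) as prof:
    loss = criterion(torch.sigmoid(model(x)), target)
    loss.backward()

# Aggregated per-op CPU/CUDA times, as in the table above.
print(prof.key_averages().table(sort_by="cpu_time_total"))
```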
A link to a gist of my code: https://gist.github.com/Sieyk/d3737b12d38ec41e0792f6eae7a57ffc
The main loop starts: line 512
Discriminator definition starts: line 47
Generator definition starts: line 116
I am running PyTorch 0.4.0 with CUDA 8 and Python 3.6 (conda).