loss.backward() very slow when using two networks for a GAN

I am running two networks following the standard GAN structure. The forward pass is fast for both the discriminator and the generator, but the backward calls are painfully slow for both: calling g_loss.backward() on the generator loss is just as slow as calling loss.backward() on the discriminator, averaging about an hour per epoch.

I believe the culprit is that some operations are bottlenecking on the CPU. For some reason, part of the backprop appears to run on the CPU, and the BCELoss in particular seems to run entirely on the CPU.
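One thing I had to be careful about when measuring this: CUDA kernels launch asynchronously, so naive wall-clock timing can attribute GPU work to whichever later op forces a synchronization. A minimal sketch of a timing helper with explicit synchronization (illustrative only, not taken from my training script):

```python
import time
import torch

def timed_backward(loss):
    """Time loss.backward() accurately by syncing the GPU before and after.

    Without the synchronize() calls, pending asynchronous CUDA kernels from
    the forward pass can be billed to backward(), or vice versa.
    """
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    start = time.time()
    loss.backward()
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return time.time() - start
```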

The inputs to my discriminator are 3x35x35 images.

I ran the GPU profiler on my training code, and this was the relevant output:
Name                              CPU time         CUDA time

binary_cross_entropy              269059.615us     356.323us
binary_cross_entropy_forward      269047.743us     353.271us
CudnnConvolutionBackward          88123.910us      50472.900us
cudnn_convolution_backward        88107.707us      50469.971us
ThnnConvDepthwise2DBackward       269.936us        56319.946us
thnn_conv_depthwise2d_backward    43.441us         55875.610us
ThnnConvDepthwise2DBackward       139.962us        41228.271us
thnn_conv_depthwise2d_backward    70.050us         40887.329us
ThnnConvDepthwise2DBackward       129.066us        27838.501us
thnn_conv_depthwise2d_backward    58.876us         27607.056us
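For reference, a trace like the one above can be collected with torch.autograd.profiler. A minimal sketch of how I profiled one training step (the tiny model and shapes here are placeholders, not my actual networks):

```python
import torch
from torch.autograd import profiler

# Placeholder model and data standing in for the real discriminator step.
model = torch.nn.Sequential(torch.nn.Linear(8, 1), torch.nn.Sigmoid())
x = torch.randn(4, 8)
target = torch.rand(4, 1)
criterion = torch.nn.BCELoss()

# use_cuda=True records CUDA kernel times alongside CPU times,
# which is how the CPU-vs-CUDA split above becomes visible.
with profiler.profile(use_cuda=torch.cuda.is_available()) as prof:
    loss = criterion(model(x), target)
    loss.backward()

print(prof.key_averages().table(sort_by="cpu_time_total"))
```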

A link to a gist of my code: https://gist.github.com/Sieyk/d3737b12d38ec41e0792f6eae7a57ffc
The main loop starts at line 512.
The discriminator definition starts at line 47.
The generator definition starts at line 116.

I am running PyTorch 0.4.0 with CUDA 8 and Python 3.6 (conda).