Huge loss with a 1D DenseNet and 100x slowdown in loss.backward() with DataParallel

I am using the model from . It is similar to the torchvision DenseNet, but it includes bottlenecking, compression, and a demo script that runs training, including over multiple GPUs. I am training on a p2.xlarge with 8 K80s. CUDA visible devices are 0-7 and the model loads on all of the GPUs, but when I run a small profile myself the backward pass takes roughly 100x the time of a forward pass.
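For context on where the backward-pass cost can come from: nn.DataParallel replicates the model and scatters the batch across GPUs during the forward pass, but the per-replica gradients are reduced back onto device 0 during loss.backward(), so inter-GPU communication cost lands in the backward timing. A minimal sketch of the wrapping (nn.Linear is a stand-in for the DenseNet; the CPU fallback is just so the snippet runs anywhere):

```python
import torch
import torch.nn as nn

model = nn.Linear(16, 1)  # stand-in for the DenseNet model

if torch.cuda.device_count() > 1:
    # Forward: replicate model + scatter batch across GPUs.
    # Backward: gradients from every replica are reduced onto
    # device 0, so that communication shows up in loss.backward().
    model = nn.DataParallel(model).cuda()
```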

Here is a small timing profile of an iteration:

Epoch: [1/300] Iter: [2/162] Time 93.462 (110.256) Loss 25298.0527 (25176.3350)
Time to create batch on GPU: 0.0001
Forward pass through model: 0.4165
Loss Calculation: 0.0031
Accuracy Calculation: 0.0001
Loss update Calculation: 0.0000
Zero grads: 0.0016
Loss backward: 92.5070
Optim step: 0.0167
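One caveat about timings like the above: CUDA kernels launch asynchronously, so a naive wall clock can attribute nearly all of an iteration's GPU work to whichever call happens to synchronize (often loss.backward() or the first .item()). A sketch of synchronized timing, using a hypothetical stand-in model so it runs anywhere (synchronize calls are no-ops without a GPU):

```python
import time
import torch
import torch.nn as nn

def timed(fn, *args):
    """Run fn(*args), synchronizing the GPU before and after so the
    measured interval covers actual kernel execution, not just launch."""
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    start = time.perf_counter()
    result = fn(*args)
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return result, time.perf_counter() - start

# Hypothetical stand-in; replace with the DenseNet and a real batch.
model = nn.Linear(16, 1)
x = torch.randn(8, 16)
target = torch.randn(8, 1)

out, t_fwd = timed(model, x)
loss = nn.functional.l1_loss(out, target)
_, t_bwd = timed(loss.backward)
print(f"forward: {t_fwd:.4f}s  backward: {t_bwd:.4f}s")
```

If the synchronized numbers tell a different story, the "slow backward" was partly forward-pass work that had not finished yet.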

My only changes to the network have been to use my own data, moving it to 1D x 2 channels, and to change the dimensions of the last linear layer to work with l1_loss (because I am running PyTorch 0.4.0 and I understand that mse_loss is bugged). A quick view of the end of my network is below. Any ideas on where to start with this?
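On the huge loss value, one thing worth checking (an assumption on my part, since the data pipeline isn't shown) is that the model output and the target have exactly the same shape. With a final Linear(384, 1) the output is (N, 1); if the target is (N,), l1_loss broadcasts the difference to (N, N) and silently inflates the loss:

```python
import torch
import torch.nn.functional as F

out = torch.tensor([[1.0], [2.0], [3.0]])   # model output, shape (3, 1)
target = torch.tensor([1.0, 2.0, 3.0])      # labels, shape (3,)

# Shapes differ, so the subtraction broadcasts to (3, 3): wrong loss,
# nonzero even though the values match element for element.
bad = F.l1_loss(out, target)

# Matching the shapes gives the intended elementwise loss of 0.
good = F.l1_loss(out.squeeze(1), target)
print(bad.item(), good.item())
```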

(denselayer16): _DenseLayer(
        (norm1): BatchNorm1d(372, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (relu1): ReLU(inplace)
        (conv1): Conv1d(372, 48, kernel_size=(1,), stride=(1,), bias=False)
        (norm2): BatchNorm1d(48, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (relu2): ReLU(inplace)
        (conv2): Conv1d(48, 12, kernel_size=(3,), stride=(1,), padding=(1,), bias=False)
    (norm_final): BatchNorm1d(384, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (classifier): Linear(in_features=384, out_features=1, bias=True)

Here is also my forward pass function:

def forward(self, x):
    features = self.features(x)
    out = F.relu(features, inplace=True)
    out = F.avg_pool1d(out, kernel_size=out.size(2)).view(out.size(0), -1)
    out = self.classifier(out)
    return out
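As a sanity check on the 1D changes, the global-average-pooling step can be exercised on its own. A sketch with made-up shapes (batch of 4, the 384 feature channels are taken from the network printout above, sequence length 25 is arbitrary):

```python
import torch
import torch.nn.functional as F

# Pretend self.features produced a (batch=4, channels=384, length=25) map.
features = torch.randn(4, 384, 25)

out = F.relu(features)
# Pool over the full temporal length -> (4, 384, 1), then flatten to (4, 384).
out = F.avg_pool1d(out, kernel_size=out.size(2)).view(out.size(0), -1)
print(out.shape)   # torch.Size([4, 384])

classifier = torch.nn.Linear(384, 1)
print(classifier(out).shape)  # torch.Size([4, 1])
```

If the flattened shape here doesn't match the classifier's in_features, the forward pass (rather than the loss) is the place to dig.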