Huge loss with a 1D DenseNet and 100x slowdown in loss.backward() with DataParallel

I am using the model from . It is similar to the torchvision DenseNet, but it includes bottlenecking, compression, and a demo script that runs training, including over multiple GPUs. I am training on a p2.xlarge with 8 K80s. CUDA visible devices are 0-7 and the model loads on all of the GPUs, but when I run a small profile myself the backward pass takes roughly 100x the time of a forward pass.
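For context on where the backward-pass cost can come from: nn.DataParallel replicates the model and scatters the batch across GPUs during the forward pass, but the per-replica gradients are reduced back onto device 0 during loss.backward(), so inter-GPU communication cost lands in the backward timing. A minimal sketch of the wrapping (nn.Linear is a stand-in for the DenseNet; the CPU fallback is just so the snippet runs anywhere):

```python
import torch
import torch.nn as nn

model = nn.Linear(16, 1)  # stand-in for the DenseNet model

if torch.cuda.device_count() > 1:
    # Forward: replicate model + scatter batch across GPUs.
    # Backward: gradients from every replica are reduced onto
    # device 0, so that communication shows up in loss.backward().
    model = nn.DataParallel(model).cuda()
```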

Here is a small timing profile of an iteration:

Epoch: [1/300] Iter: [2/162] Time 93.462 (110.256) Loss 25298.0527 (25176.3350)
Time to create batch on GPU: 0.0001
Forward pass through model: 0.4165
Loss Calculation: 0.0031
Accuracy Calculation: 0.0001
Loss update Calculation: 0.0000
Zero grads: 0.0016
Loss backward: 92.5070
Optim step: 0.0167
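One caveat about timings like the above: CUDA kernels launch asynchronously, so a naive wall clock can attribute nearly all of an iteration's GPU work to whichever call happens to synchronize (often loss.backward() or the first .item()). A sketch of synchronized timing, using a hypothetical stand-in model so it runs anywhere (synchronize calls are no-ops without a GPU):

```python
import time
import torch
import torch.nn as nn

def timed(fn, *args):
    """Run fn(*args), synchronizing the GPU before and after so the
    measured interval covers actual kernel execution, not just launch."""
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    start = time.perf_counter()
    result = fn(*args)
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return result, time.perf_counter() - start

# Hypothetical stand-in; replace with the DenseNet and a real batch.
model = nn.Linear(16, 1)
x = torch.randn(8, 16)
target = torch.randn(8, 1)

out, t_fwd = timed(model, x)
loss = nn.functional.l1_loss(out, target)
_, t_bwd = timed(loss.backward)
print(f"forward: {t_fwd:.4f}s  backward: {t_bwd:.4f}s")
```

If the synchronized numbers tell a different story, the "slow backward" was partly forward-pass work that had not finished yet.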

My only changes to the network have been to use my own data, moving it to 1D x 2 channels, and to change the dimensions of the last linear layer to work with l1_loss (because I am running PyTorch 0.4.0 and I understand that mse_loss is bugged). A quick view of the end of my network is below. Any ideas on where to start with this?
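On the huge loss value, one thing worth checking (an assumption on my part, since the data pipeline isn't shown) is that the model output and the target have exactly the same shape. With a final Linear(384, 1) the output is (N, 1); if the target is (N,), l1_loss broadcasts the difference to (N, N) and silently inflates the loss:

```python
import torch
import torch.nn.functional as F

out = torch.tensor([[1.0], [2.0], [3.0]])   # model output, shape (3, 1)
target = torch.tensor([1.0, 2.0, 3.0])      # labels, shape (3,)

# Shapes differ, so the subtraction broadcasts to (3, 3): wrong loss,
# nonzero even though the values match element for element.
bad = F.l1_loss(out, target)

# Matching the shapes gives the intended elementwise loss of 0.
good = F.l1_loss(out.squeeze(1), target)
print(bad.item(), good.item())
```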

(denselayer16): _DenseLayer(
        (norm1): BatchNorm1d(372, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (relu1): ReLU(inplace)
        (conv1): Conv1d(372, 48, kernel_size=(1,), stride=(1,), bias=False)
        (norm2): BatchNorm1d(48, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (relu2): ReLU(inplace)
        (conv2): Conv1d(48, 12, kernel_size=(3,), stride=(1,), padding=(1,), bias=False)
    (norm_final): BatchNorm1d(384, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (classifier): Linear(in_features=384, out_features=1, bias=True)

Here is also my forward pass function:

def forward(self, x):
    features = self.features(x)
    out = F.relu(features, inplace=True)
    out = F.avg_pool1d(out, kernel_size=out.size(2)).view(out.size(0), -1)
    out = self.classifier(out)
    return out
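As a sanity check on the 1D changes, the global-average-pooling step can be exercised on its own. A sketch with made-up shapes (batch of 4, the 384 feature channels are taken from the network printout above, sequence length 25 is arbitrary):

```python
import torch
import torch.nn.functional as F

# Pretend self.features produced a (batch=4, channels=384, length=25) map.
features = torch.randn(4, 384, 25)

out = F.relu(features)
# Pool over the full temporal length -> (4, 384, 1), then flatten to (4, 384).
out = F.avg_pool1d(out, kernel_size=out.size(2)).view(out.size(0), -1)
print(out.shape)   # torch.Size([4, 384])

classifier = torch.nn.Linear(384, 1)
print(classifier(out).shape)  # torch.Size([4, 1])
```

If the flattened shape here doesn't match the classifier's in_features, the forward pass (rather than the loss) is the place to dig.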