Torch training is taking way too long

I have just built my first torch model, which was originally written in tensorflow/keras. However, training seems to take 4x longer in pytorch. Any suggestions would be a great help:

class Env2Acl(nn.Module):
    def __init__(self, input_length, n_class, sr):
        super(Env2Acl, self).__init__();
        self.input_length = input_length;

        stride1 = 2;
        stride2 = 2;
        channels = 8;
        k_size = (3, 3);
        n_frames = (sr/1000)*10; #No of frames per 10ms

        self.filter_bank_pool_size = int(n_frames/(stride1*stride2));
        self.pool_size = (2,2);

        self.conv1, self.bn1 = ConvLayer(1, channels, (1, 9), (1, stride1)).get();
        self.conv2, self.bn2 = ConvLayer(channels, channels*8, (1, 5), (1, stride2)).get();
        self.conv3, self.bn3 = ConvLayer(1, channels*4, k_size, padding=1).get();
        self.conv4, self.bn4 = ConvLayer(channels*4, channels*8, k_size, padding=1).get();
        self.conv5, self.bn5 = ConvLayer(channels*8, channels*8, k_size, padding=1).get();
        self.conv6, self.bn6 = ConvLayer(channels*8, channels*16, k_size, padding=1).get();
        self.conv7, self.bn7 = ConvLayer(channels*16, channels*16, k_size, padding=1).get();
        self.conv8, self.bn8 = ConvLayer(channels*16, channels*32, k_size, padding=1).get();
        self.conv9, self.bn9 = ConvLayer(channels*32, channels*32, k_size, padding=1).get();
        self.conv10, self.bn10 = ConvLayer(channels*32, channels*64, k_size, padding=1).get();
        self.conv11, self.bn11 = ConvLayer(channels*64, channels*64, k_size, padding=1).get();
        self.conv12, self.bn12 = ConvLayer(channels*64, n_class, (1, 1)).get();

        self.maxpool1 = nn.MaxPool2d(kernel_size=(1, self.filter_bank_pool_size));
        self.maxpool2 = nn.MaxPool2d(kernel_size=(2,2));
        self.avgpool = nn.AvgPool2d(kernel_size=(2,4));
        self.fcn = nn.Linear(n_class, n_class);
        nn.init.kaiming_normal_(self.fcn.weight, nonlinearity='relu')

    def forward(self, x):
        #Start: Filter bank
        x = F.relu(self.bn1(self.conv1(x)));
        x = F.relu(self.bn2(self.conv2(x)));
        x = self.maxpool1(x);
        #End: Filter bank

        x = x.permute((0, 2, 1, 3));

        x = self.maxpool2(F.relu(self.bn3(self.conv3(x))));

        x = F.relu(self.bn4(self.conv4(x)));
        x = self.maxpool2(F.relu(self.bn5(self.conv5(x))));

        x = F.relu(self.bn6(self.conv6(x)));
        x = self.maxpool2(F.relu(self.bn7(self.conv7(x))));

        x = F.relu(self.bn8(self.conv8(x)));
        x = self.maxpool2(F.relu(self.bn9(self.conv9(x))));

        x = F.relu(self.bn10(self.conv10(x)));
        x = self.maxpool2(F.relu(self.bn11(self.conv11(x))));

        #Functional dropout so that it is switched off under net.eval()
        x = F.dropout(x, p=0.2, training=self.training);
        x = self.avgpool(F.relu(self.bn12(self.conv12(x))));

        x = nn.Flatten()(x);

        y = F.softmax(self.fcn(x), dim=1);
        return y;

class ConvLayer:
    def __init__(self, in_channels, out_channels, kernel_size, stride=(1,1), padding=0, bias=False):
        self.conv = nn.Conv2d(in_channels=in_channels, out_channels=out_channels, kernel_size=kernel_size, stride=stride, padding=padding, bias=bias);
        nn.init.kaiming_normal_(self.conv.weight, nonlinearity='relu');
        self.bn = nn.BatchNorm2d(out_channels);

    def get(self):
        return self.conv, self.bn;

net = model.GetModel();

lossFunc = torch.nn.KLDivLoss();
optimizer = optim.SGD(net.parameters(), lr=self.opt.LR, weight_decay=self.opt.weightDecay, momentum=self.opt.momentum, nesterov=True)

testData = np.load(os.path.join(, self.opt.dataset, 'aug-data/sp-{}/{}.npz'.format(self.split, 'test/test4000')), allow_pickle=True);
testX = torch.tensor(np.moveaxis(testData['x'], 3, 1));
testY = torch.tensor(testData['y']);

for epochIdx in range(self.opt.nEpochs):
    epoch_start_time = time.time();
    optimizer.param_groups[0]['lr'] = self.__get_lr(epochIdx+1);

    data = np.load(os.path.join(, self.opt.dataset, 'aug-data/sp-{}/{}.npz'.format(self.split, 'train/train{}'.format(epoch))), allow_pickle=True);
    trainX = torch.tensor(np.moveaxis(data['x'], 3, 1));
    trainY = torch.tensor(data['y']);

    running_loss = 0.0;
    running_acc = 0.0;
    n_batches = math.ceil(len(trainX)/self.opt.batchSize);
    for batchIdx in range(n_batches):
        x = trainX[batchIdx*self.opt.batchSize : (batchIdx+1)*self.opt.batchSize];
        y = trainY[batchIdx*self.opt.batchSize : (batchIdx+1)*self.opt.batchSize];

        # zero the parameter gradients
        optimizer.zero_grad();

        # forward + backward + optimize
        outputs = net(x);

        running_acc += ((outputs.argmax(dim=1) == y.argmax(dim=1))*1).float().mean().item();
        loss = lossFunc(outputs, y);
        loss.backward();
        optimizer.step();

        running_loss += loss.item();

    tr_loss = running_loss / n_batches;

    #Epoch-wise validation
    with torch.no_grad():
        y_pred = None;
        batch_size = (self.opt.batchSize//self.opt.nCrops)*self.opt.nCrops;
        for idx in range(math.ceil(len(testX)/batch_size)):
            x = testX[idx*batch_size : (idx+1)*batch_size];
            scores = net(x);
            #Collect the scores of all validation batches
            y_pred = scores if y_pred is None else torch.cat((y_pred, scores));

        val_acc, val_loss = self.__compute_accuracy(y_pred, testY);
    print('Epoch: {}/{} | Train: Loss {:.3f} | Val: Acc(top1) {:.3f}%'.format(epochIdx+1, self.opt.nEpochs, tr_loss, val_acc));

    running_loss = 0;
    running_acc = 0;

I have tried to paste the simplified code here. Can anybody help me identify the bottleneck that makes execution take 4x longer? My understanding is that torch should be around 2x faster.

Kind Regards,

I would recommend comparing the number of parameters between both models and making sure they are equal (if not already done).
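
On the PyTorch side, one way to make this comparison is to count the parameters directly from the module. The helper below is only a minimal sketch: `count_parameters` and the two-layer stand-in model are illustrative names that do not appear in the posted code.

```python
import torch.nn as nn

def count_parameters(model: nn.Module):
    """Return (trainable, frozen) parameter counts for a module."""
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    frozen = sum(p.numel() for p in model.parameters() if not p.requires_grad)
    return trainable, frozen

# Small stand-in model (hypothetical, only used to exercise the helper).
demo = nn.Sequential(
    nn.Conv2d(1, 8, kernel_size=(1, 9), bias=False),  # 1 * 8 * 1 * 9 = 72 weights
    nn.BatchNorm2d(8),                                # 8 scales + 8 shifts = 16
)
trainable, frozen = count_parameters(demo)
print(trainable, frozen)
```

Calling `count_parameters(net)` on the real model should give a number that can be set against the "Trainable params" line of the Keras `model.summary()`.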

Why should this be the case?

@ptrblck I was eagerly awaiting your reply. I looked at the number of parameters. The tensorflow model.summary() gives me the following:
Total params: 4,739,526
Trainable params: 4,735,378
Non-trainable params: 4,148

For pytorch, I use torchsummary library and get the following:
Total params: 4,735,378
Trainable params: 4,735,378
Non-trainable params: 0

So, it seems that the trainable parameters are the same, but I am not sure how it calculates the non-trainable params.

For tensorflow, one epoch requires 1.5 minutes, whereas pytorch takes almost 4.5 minutes, which is a surprise to me.

I am happy to share both my torch and tensorflow/keras programs if you are willing to have a look and can point out any issue there. It would really be a great help.

Kind Regards,

I guess the non-trainable parameters should come from the buffers, e.g. running stats in batchnorm layers, but I’m not sure how model.summary() calculates them.
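
This guess can be checked with a little arithmetic on the posted model. The sketch below assumes `n_class = 50` (taken from the target shape `(1600, 50)` given elsewhere in the thread) and lists the output channels of the twelve BatchNorm layers. Two running statistics per channel give exactly the 4,148 non-trainable values that Keras reports. PyTorch additionally keeps one `num_batches_tracked` counter per layer as a buffer, so its raw buffer count is slightly higher.

```python
import torch.nn as nn

# Output channels of bn1..bn12 in the posted Env2Acl model,
# with channels = 8 and n_class = 50 (an assumption based on the target shape).
bn_channels = [8, 64, 32, 64, 64, 128, 128, 256, 256, 512, 512, 50]

# Each BatchNorm2d layer stores a running mean and a running variance per channel.
running_stats = 2 * sum(bn_channels)

# The same count taken from real modules, skipping the per-layer step counter.
layers = [nn.BatchNorm2d(c) for c in bn_channels]
from_buffers = sum(
    b.numel()
    for layer in layers
    for name, b in layer.named_buffers()
    if name != "num_batches_tracked"
)

# All buffers, including one num_batches_tracked scalar per layer.
all_buffers = sum(b.numel() for layer in layers for b in layer.buffers())

print(running_stats, from_buffers, all_buffers)  # 4148 4148 4160
```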

Anyway, the trainable parameters match, which is great.

Sure, please share the code and more importantly your setup, i.e. which PyTorch, TF, CUDA and cudnn versions are you using in both use cases.

@ptrblck I was wondering whether I should copy and paste my code here, or whether it would be more sensible to provide a github link?

Whatever works for you.
If you post or link the code, please also provide some input shapes, so that I could use random inputs to run the code.

Instead of loading training data every epoch, why not move it outside the training loop and load it once?

@kshitij the training dataset (augmented data) is different every time.

@ptrblck I have created two repositories for you. The links are below:

I am looking for ways to provide you with my dataset, so that you can just run it right away.
The code runs on CPU with tensorflow 2.0.0 and torch 1.4.0. The repositories contain a starter file and the files holding the model creation and training.

Thanks for the code.

No need for it. Random data is good enough for profiling the model.
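
A rough way to profile with random data is to time a few full training steps on tensors of the right shape. The sketch below is illustrative only: `time_training_step` and the tiny stand-in network are hypothetical names, and the input width is shortened so the demo finishes quickly.

```python
import time
import torch
import torch.nn as nn
import torch.nn.functional as F

def time_training_step(model, x, y, n_steps=3):
    """Average wall-clock seconds per forward/backward/update step."""
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    model.train()
    start = time.perf_counter()
    for _ in range(n_steps):
        optimizer.zero_grad()
        log_probs = F.log_softmax(model(x), dim=1)
        loss = F.kl_div(log_probs, y, reduction="batchmean")
        loss.backward()
        optimizer.step()
    return (time.perf_counter() - start) / n_steps

# Hypothetical stand-in network; substitute the real model to profile it.
stand_in = nn.Sequential(
    nn.Conv2d(1, 8, kernel_size=(1, 9), stride=(1, 2), bias=False),
    nn.BatchNorm2d(8),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(8, 50),
)

# Random inputs in (batch, channels, height, width) layout and random soft targets.
x = torch.randn(8, 1, 1, 4000)
y = torch.softmax(torch.randn(8, 50), dim=1)
seconds = time_training_step(stand_in, x, y)
print(f"{seconds:.4f} s per step")
```

When timing on a GPU, `torch.cuda.synchronize()` should be called before reading the clock, since CUDA kernels are launched asynchronously.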

@ptrblck Cool.
I think you asked me for the input shapes:
For the tensorflow model the shape is (batch_size, height, width, channels), which in my case is (64, 1, 66650, 1)

For torch the shape is (batch_size, channels, height, width): (64, 1, 1, 66650).
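
The conversion between the two layouts is what the `np.moveaxis(..., 3, 1)` call in the posted loading code does. A small check on dummy data (with a reduced batch size to keep the array small) could look like this:

```python
import numpy as np

# Dummy batch in the Keras layout: (batch, height, width, channels).
x_nhwc = np.zeros((4, 1, 66650, 1), dtype=np.float32)

# Move the channel axis (3) to position 1 to get (batch, channels, height, width).
x_nchw = np.moveaxis(x_nhwc, 3, 1)

print(x_nhwc.shape, "->", x_nchw.shape)  # (4, 1, 66650, 1) -> (4, 1, 1, 66650)
```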

Since it is my first model in pytorch, I guess I am missing something pretty basic. Anyway, I will wait to hear from you.

@ptrblck I was wondering if you see any glitch / major flaw in my torch implementation there, especially inside the training loop.

I don’t see anything obviously wrong.
I used the provided input shapes for the data and [64, 50] for the target.
However, the Keras code throws a shape mismatch error:

y_pred = y_pred.reshape(y_pred.shape[0]//self.opt.nCrops, self.opt.nCrops, y_pred.shape[1]);
ValueError: cannot reshape array of size 3200 into shape (6,10,50)

while this seems to work fine in your PyTorch implementation.

For the keras code, batch size for training is 64.
For validation, it should be (64//10)*10 = 60

My dataset has:

For Training:
X = (1600, 1, 66650, 1) #(samples, height, width, channels)
Y = (1600, 50)

For Testing:
X = (4000, 1, 66650, 1) # Every 10 samples are randomly cropped from the same sample
Y = (4000, 50)
The original dataset has 400 test samples.
Thus, I reshape the test output like this: (samples/crops, crops, nClass)
Hence, an output of (4000, 50) becomes (4000/10, 10, 50) = (400, 10, 50)
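
The reshape described above can be sketched with random scores. Averaging the scores over the 10 crops before taking the top-1 class is an assumption made for illustration; the thread does not show how `__compute_accuracy` combines the crops.

```python
import numpy as np

n_crops, n_class = 10, 50
rng = np.random.default_rng(0)

# Stand-in for the concatenated network outputs over the whole test set.
y_pred = rng.random((4000, n_class))

# Group the 10 consecutive crops of each original sample: (400, 10, 50).
per_sample = y_pred.reshape(y_pred.shape[0] // n_crops, n_crops, y_pred.shape[1])

# Assumption: crop scores are averaged before picking the top-1 class.
top1 = per_sample.mean(axis=1).argmax(axis=1)
print(per_sample.shape, top1.shape)

# A batch of 64 rows holds 64 * 50 = 3200 values, which cannot fill (6, 10, 50) = 3000.
# Rounding the validation batch size down to (64 // 10) * 10 = 60 avoids that mismatch.
val_batch_size = (64 // n_crops) * n_crops
print(val_batch_size)
```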

I hope it is clear now

@ptrblck here is a more detailed breakdown of what I am seeing for a training set of 1600 samples, each of length 66650, and a test set of 4000 samples of length 66650.
PyTorch: Epoch: 1/2000 | Time: 4m12s (Train 2m23s, Val 1m48s)

TensorFlow: Epoch: 1/2000 | Time: 1m52s (Train 1m07s, Val 0m44s)

I am using my MacBook, so there is no GPU support, just CPU. I seem to be running out of time looking for every possible glitch I may have. I also ran it on a cluster with GPUs. The scenario is the same there, whereas the tensorflow model takes only 7 seconds to complete one epoch. Thus, I can’t afford to have my torch model training for 5 days on the cluster.

Anyway, thanks a lot for your effort on it. Please let me know if you have any suggestion.

Hi, I think we can’t do much about the slowness of pytorch on CPU. I did have to make some fixes in my code: when using KLDivLoss as the loss function, you either use log_softmax as your output activation or compute loss = KLDivLoss(reduction='batchmean')(output.log(), target), so that the mathematics of KLDivLoss is followed.
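
The two options mentioned above give the same value, because `nn.KLDivLoss` expects log-probabilities as its input and probabilities as its target. The following is a small sketch with random stand-in tensors rather than the real model outputs:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(4, 50)                        # raw network outputs (stand-in)
target = torch.softmax(torch.randn(4, 50), dim=1)  # soft labels that sum to 1

loss_fn = nn.KLDivLoss(reduction="batchmean")

# Option 1: log_softmax as the output activation.
loss_a = loss_fn(F.log_softmax(logits, dim=1), target)

# Option 2: softmax output, then take the log before the loss.
loss_b = loss_fn(F.softmax(logits, dim=1).log(), target)

# Reference: sum_i t_i * (log t_i - log p_i) per sample, averaged over the batch.
manual = (target * (target.log() - F.log_softmax(logits, dim=1))).sum(dim=1).mean()

print(loss_a.item(), loss_b.item(), manual.item())
```

Passing plain softmax probabilities as the input, as the originally posted `forward` does, would not compute a KL divergence.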

Anyway, to speed up training and validation, I ended up implementing it for GPU, and now it is on par with tensorflow training on GPU.
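
A GPU port usually comes down to moving both the model and every batch to the same device. The following is a minimal sketch with a stand-in network, not the poster's actual change, and it falls back to the CPU when no GPU is present:

```python
import torch
import torch.nn as nn

# Use the GPU when one is available, otherwise stay on the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Stand-in network (hypothetical); the real model would be moved the same way.
net = nn.Sequential(
    nn.Conv2d(1, 8, kernel_size=(1, 9), bias=False),
    nn.BatchNorm2d(8),
    nn.ReLU(),
).to(device)

# Each batch has to live on the same device as the model's parameters.
x = torch.randn(4, 1, 1, 1000).to(device)
out = net(x)
print(out.device, tuple(out.shape))
```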

Thus, all good for now.
@ptrblck thanks for your effort in looking at my issue.
