Batch size matters

Hi, I am mystified by the effect of batch size on the speed per iteration. I am using to generate batches. If I change the batch size from 359 to 10770, the whole forward-backward pass for a batch will change from 60 seconds to 2 seconds. Is there a reason why a smaller batch size needs so much longer for one round of parameter update? I am only talking about 1 batch here, not the whole dataset.

This seems like a programming error.

The weird slowdown went away after I made the following update. There was something weird with cuda tensors. If I access a cuda tensor in a variable, the program gets really slow. A simple print( takes half a minute when it was very small. I couldn’t pin down what exactly was the problem, but the new version solved it.

pytorch: 0.1.12-py36_2cu80 soumith [cuda80] --> 0.1.12-py361_2cu80 soumith [cuda80]

I thought it was fixed, but I installed the latest dev version and it came back. The system would randomly hang for a few seconds. It seemed to happen after an optimizer step, when some CUDA operation is performed. I will try to make a minimal case to reproduce this behavior.


The following script is a minimal example that performs differently under different PyTorch releases. I attached the profile flowcharts to this thread. The two versions are the current master and the one from anaconda -c soumith. Both versions need a few seconds for the first forward call between ‘1’ and ‘2’ in the code, but the current dev version spends a good half a minute between ‘4’ and ‘5’ in every epoch. I observed this on two systems, both running Linux with CUDA 8.0.44 and Python 3.6.1.
The profile charts:
Anaconda version:
Master version:
The weird thing is that it seems the slowdown is caused by print but it isn’t. If the print statement is changed to something else, say a = == 3, then the slowdown will move to the next operation related to CUDA. In this script, it will move to x.cuda(). I also observe that the GPU is super busy after the backward call, which may have caused the slowdown, but I can’t say what is causing the GPU to be completely occupied for half a minute as no other program is running on it.

from torch import nn
from torch.autograd import Variable
import torch
import torch.nn.functional as F
import numpy as np

class CNN_Text(nn.Module):
    def __init__(self):
        super(CNN_Text, self).__init__()

        V = 2000
        D = 300
        Co = 300
        Ks = [3,4,5]

        C = 359
        Ci = 1
        self.embed = nn.Embedding(V, D, padding_idx=1)

        self.convs1 = nn.ModuleList([nn.Conv2d(Ci, Co, (K, D)) for K in Ks])

        self.dropout = nn.Dropout(0.5)
        self.fc1 = nn.Linear(len(Ks) * Co, C)

    def forward(self, x):
        x = self.confidence(x)
        logit = F.log_softmax(x)  # (N,C)
        return logit

    def confidence(self, x):
        x = self.embed(x)  # (N,W,D)

        x = x.unsqueeze(1)  # (N,Ci,W,D)
        x = [F.relu(conv(x)).squeeze(3) for conv in self.convs1]  # [(N,Co,W), ...]*len(Ks)

        x = [F.max_pool1d(i, i.size(2)).squeeze(2) for i in x]  # [(N,Co), ...]*len(Ks)
        x = torch.cat(x, 1)  # concatenate the pooled features along dim 1

        x = self.dropout(x)  # (N,len(Ks)*Co)
        linear_out = self.fc1(x)
        return linear_out

cnn = CNN_Text().cuda()  # the model must be on the GPU to match the CUDA inputs below

for i in range(10):
    x = np.random.randint(0, 1500, (359, 15))
    x[:, 0:5] = 1
    x[:, -3:] = 1
    y = np.random.randint(0, 359, (359,))
    x = torch.from_numpy(x)
    y = torch.from_numpy(y)
    x = Variable(x.cuda())
    y = Variable(y.cuda())
    x = F.log_softmax(cnn(x))
    loss = F.nll_loss(x, y)

Also, the anaconda version differs from the documentation in quite a few places (norm does not support keepdim, there is no matmul, etc.). Hopefully it will be updated soon.

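Whenever you benchmark CUDA code, you need to call torch.cuda.synchronize() right before collecting the time. CUDA calls are asynchronous, so without it the timer only measures how long it took to queue the work, and the real cost shows up at whatever later operation waits for the GPU.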

For example:

x = x.cuda()
y = x ** 2

In your case, add torch.cuda.synchronize() right before print('2'), print('3'), and print('4').
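To make the advice concrete, here is a minimal sketch of a timing helper that synchronizes before starting and before stopping the clock. The names `sync` and `timed` are hypothetical helpers, not part of PyTorch, and the code assumes a recent PyTorch release (it uses the `device=` keyword, which the 0.1.12 builds discussed here do not have). It falls back to the CPU when no GPU is present.

```python
import time
import torch

def sync():
    # Block until all queued GPU work has finished; no-op without CUDA.
    if torch.cuda.is_available():
        torch.cuda.synchronize()

def timed(fn):
    # Hypothetical helper: time fn() including the GPU work it launches.
    sync()                      # drain earlier kernels so they are not billed to fn
    start = time.perf_counter()
    result = fn()
    sync()                      # wait for fn's own kernels before reading the clock
    return result, time.perf_counter() - start

device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.randn(1000, 1000, device=device)

y, elapsed = timed(lambda: x ** 2)
print("x ** 2 took %.6f s on %s" % (elapsed, device))
```

Without the second `sync()`, the measured time on a GPU would mostly reflect the kernel launch, and the remaining cost would surface at the next call that has to wait for the device, such as a print of a CUDA tensor.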

Yes, you are right. I have added torch.cuda.synchronize() to the example and now the majority of time spent in this case is at the backward call. The torch_C._cuda_synchronize is the most time consuming operation, spending 469 seconds. So the weirdness of print taking a long time is gone, but why is backward so slow with the dev version?
The other version works just fine. The flowchart looks like the one posted. Nothing interesting.
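One way to narrow down which phase is slow is to time forward, backward, and the optimizer step separately, synchronizing around each. The sketch below is only an illustration under assumptions: it uses a small `nn.Linear` stand-in rather than the CNN from the script above, a hypothetical `phase` helper, and the API of a recent PyTorch release rather than 0.1.12.

```python
import time
import torch
from torch import nn

def sync():
    # Block until all queued GPU work has finished; no-op without CUDA.
    if torch.cuda.is_available():
        torch.cuda.synchronize()

device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Linear(300, 359).to(device)            # stand-in model (assumption)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
x = torch.randn(359, 300, device=device)
y = torch.randint(0, 359, (359,), device=device)

timings = {}

def phase(name, fn):
    # Hypothetical helper: record the wall-clock time of one training phase.
    sync()
    start = time.perf_counter()
    out = fn()
    sync()
    timings[name] = time.perf_counter() - start
    return out

optimizer.zero_grad()
loss = phase("forward", lambda: nn.functional.cross_entropy(model(x), y))
phase("backward", loss.backward)
phase("step", optimizer.step)

for name, seconds in timings.items():
    print("%s: %.6f s" % (name, seconds))
```

Running the same breakdown under both installed versions would show whether the extra time really belongs to backward or to work queued earlier.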