Batch size matters

Lifeng_Jin · June 26, 2017, 4:22pm

Hi, I am mystified by the effect of batch size on the speed per iteration. I am using torchtext.data.Iterator to generate batches. If I change the batch size from 359 to 10770, the whole forward-backward pass for a batch will change from 60 seconds to 2 seconds. Is there a reason why a smaller batch size needs so much longer for one round of parameter update? I am only talking about 1 batch here, not the whole dataset.

smth · July 2, 2017, 11:19pm

seems like a programming error.

Lifeng_Jin · July 3, 2017, 5:32am

The weird slowdown went away after I made the following update. There was something odd with CUDA tensors: whenever I accessed a CUDA tensor inside a Variable, the program got really slow. A simple print(x.data) took half a minute even when the tensor was very small. I couldn’t pin down what exactly the problem was, but the new version solved it.

pytorch: 0.1.12-py36_2cu80 soumith [cuda80] --> 0.1.12-py361_2cu80 soumith [cuda80]

Lifeng_Jin · July 3, 2017, 6:55am

I thought it was fixed, but I installed the latest dev version and it came back. The system would randomly hang for a few seconds. It seemed to happen after an optimizer step, when some CUDA operation was performed. I will try to make a minimal case to repro this behavior.

Lifeng_Jin · July 3, 2017, 7:20pm

The following script is a minimal example that gives different performance with different PyTorch releases. I attached the profile flowcharts to this thread. The two versions are the current master and the one from anaconda -c soumith. Both versions need a few seconds for the first forward call between '1' and '2' in the code, but the current dev version spends a good half a minute between '4' and '5' in every iteration of the loop. I observed this on two systems, both running Linux with CUDA 8.0.44 and Python 3.6.1.
The profile charts:
Anaconda version: https://osu.box.com/s/i3154wrjhplw7z9uzf4h9ojiwsftrbgm
Master version: https://osu.box.com/s/poa8b66o6w7c7bdq7aoi0yrvy8z95uvt
The weird thing is that it seems the slowdown is caused by print but it isn’t. If the print statement is changed to something else, say a = y.data == 3, then the slowdown will move to the next operation related to CUDA. In this script, it will move to x.cuda(). I also observe that the GPU is super busy after the backward call, which may have caused the slowdown, but I can’t say what is causing the GPU to be completely occupied for half a minute as no other program is running on it.

from torch import nn
from torch.autograd import Variable
import torch
import torch.nn.functional as F
import numpy as np

class CNN_Text(nn.Module):
    def __init__(self):
        super(CNN_Text, self).__init__()

        V = 2000
        D = 300
        Co = 300
        Ks = [3,4,5]

        C = 359
        Ci = 1
        self.embed = nn.Embedding(V, D, padding_idx=1)

        self.convs1 = nn.ModuleList([nn.Conv2d(Ci, Co, (K, D)) for K in Ks])

        self.dropout = nn.Dropout(0.5)
        self.fc1 = nn.Linear(len(Ks) * Co, C)

    def forward(self, x):
        x = self.confidence(x)
        logit = F.log_softmax(x)  # (N,C)
        return logit

    def confidence(self, x):
        x = self.embed(x)  # (N,W,D)

        x = x.unsqueeze(1)  # (N,Ci,W,D)
        x = [F.relu(conv(x)).squeeze(3) for conv in self.convs1]  # [(N,Co,W), ...]*len(Ks)

        x = [F.max_pool1d(i, i.size(2)).squeeze(2) for i in x]  # [(N,Co), ...]*len(Ks)
        x = torch.cat(x, 1)

        x = self.dropout(x)  # (N,len(Ks)*Co)
        linear_out = self.fc1(x)
        return linear_out

cnn = CNN_Text()
cnn.cuda()

for i in range(10):
    x = np.random.randint(0, 1500, (359, 15))
    x[:, 0:5] = 1
    x[:, -3:] = 1
    y = np.random.randint(0, 359, (359,))
    x = torch.from_numpy(x)
    y = torch.from_numpy(y)
    print('1')
    x = Variable(x.cuda())
    y = Variable(y.cuda())
    x = F.log_softmax(cnn(x))
    print('2')
    loss = F.nll_loss(x, y)
    print('3')
    loss.backward()
    print('4')
    print(y.data)
    print('5')

Lifeng_Jin · July 3, 2017, 7:38pm

Also, the anaconda version differs from the documentation in quite a few places (norm does not support keepdim, there is no matmul, etc.). Hopefully it will be updated soon.

smth · July 3, 2017, 8:41pm

Whenever you benchmark CUDA code, you need to make sure you call torch.cuda.synchronize() right before collecting the time, because CUDA calls return before the GPU has finished the work.

For example:

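import torch

x = torch.randn(1000, 1000)  # any tensor will do
x = x.cuda()
print('1')
y = x ** 2  # returns as soon as the kernel is queued
torch.cuda.synchronize()  # block until the GPU has finished
print('2')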
In your case, add torch.cuda.synchronize() right before print('2'), print('3') and print('4').
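
The same advice can be wrapped in a small helper so that every timed stage is bracketed by a synchronization point. This is only a sketch: the names timed and sync are hypothetical, the torch import is guarded so the snippet also runs without PyTorch or a GPU, and torch.randn(..., device=...) assumes a PyTorch release newer than the 0.1.12 discussed in this thread.

```python
import time


def timed(label, fn, sync=None):
    """Run fn() and return (result, seconds).

    `sync` is a hypothetical hook: pass torch.cuda.synchronize when timing GPU code.
    """
    if sync is not None:
        sync()  # finish earlier queued work so it is not charged to fn
    start = time.perf_counter()
    result = fn()
    if sync is not None:
        sync()  # wait for the work fn launched before stopping the clock
    elapsed = time.perf_counter() - start
    print(f"{label}: {elapsed:.4f} s")
    return result, elapsed


# Pick a synchronization function only if PyTorch with CUDA is present.
try:
    import torch
    sync = torch.cuda.synchronize if torch.cuda.is_available() else None
except ImportError:
    torch, sync = None, None

if torch is not None:
    device = "cuda" if sync is not None else "cpu"
    x = torch.randn(512, 512, device=device)
    y, seconds = timed("square", lambda: x ** 2, sync)
else:
    # Fallback workload so the sketch still runs without PyTorch.
    y, seconds = timed("square", lambda: [v ** 2 for v in range(1000)], sync)
```

In the script above, each of the forward, loss and backward steps could be passed to timed as a lambda, so that the reported time for a step covers the GPU work that step actually launched.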

Lifeng_Jin · July 3, 2017, 9:06pm

Yes, you are right. I have added torch.cuda.synchronize() to the example, and now most of the time is spent in the backward call. torch._C._cuda_synchronize is the most time-consuming operation, taking 469 seconds. So the weirdness of print taking a long time is gone, but why is backward so slow with the dev version?
The other version works just fine. The flowchart looks like the one posted. Nothing interesting.
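
The earlier symptom, where the delay followed whichever statement first touched a CUDA tensor, is what asynchronous execution would be expected to produce. The toy model below is only an analogy and involves no PyTorch at all: a single worker thread stands in for the GPU stream, and slow_kernel and stamp are hypothetical names made up for the illustration.

```python
import time
from concurrent.futures import ThreadPoolExecutor

# A single worker thread plays the role of the GPU stream (an assumption of this toy model).
stream = ThreadPoolExecutor(max_workers=1)


def slow_kernel():
    time.sleep(0.5)  # pretend this is an expensive backward pass
    return "done"


def stamp(fn):
    """Call fn() and return (result, wall-clock seconds)."""
    start = time.perf_counter()
    out = fn()
    return out, time.perf_counter() - start


# "Launching" returns almost immediately, like an asynchronous kernel launch.
future, launch_time = stamp(lambda: stream.submit(slow_kernel))

# The first statement that needs the result pays for the whole wait,
# much like print(y.data) appeared to in the script above.
result, read_time = stamp(future.result)

print(f"launch: {launch_time:.3f} s, first read: {read_time:.3f} s")
stream.shutdown()
```

Replacing the read with any other blocking operation would move the measured wait to that operation, which matches the behavior described for a = y.data == 3 and x.cuda().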