Hi, I am mystified by the effect of batch size on the speed per iteration. I am using torchtext.data.Iterator to generate batches. If I change the batch size from 359 to 10770, the whole forward-backward pass for one batch goes from 60 seconds to 2 seconds. Is there a reason why a smaller batch size takes so much longer for one round of parameter updates? I am only talking about 1 batch here, not the whole dataset.
seems like a programming error.
The weird slowdown went away after I made the following update. There was something weird with CUDA tensors: if I accessed a CUDA tensor in a Variable, the program got really slow. A simple print(x.data) took half a minute even when the tensor was very small. I couldn’t pin down what exactly the problem was, but the new version solved it.
pytorch: 0.1.12-py36_2cu80 soumith [cuda80] --> 0.1.12-py361_2cu80 soumith [cuda80]
I thought it was fixed, but I installed the latest dev version and it came back. The system randomly hangs for a few seconds. It seems to happen after an optimizer step and when some CUDA operation is done. I will try to make a minimal case to repro this behavior.
The following script is a minimal example that gives different performance with different PyTorch releases. I attached the profile flowcharts to this thread. The two versions are the current master and the one from anaconda -c soumith. Both versions need a few seconds for the first forward call between '1' and '2' in the code, but the current dev version spends a good half a minute between '4' and '5' in every epoch. I observed this on two systems, both running Linux with CUDA 8.0.44 and Python 3.6.1.
The profile charts:
Anaconda version: https://osu.box.com/s/i3154wrjhplw7z9uzf4h9ojiwsftrbgm
Master version: https://osu.box.com/s/poa8b66o6w7c7bdq7aoi0yrvy8z95uvt
The weird thing is that it seems the slowdown is caused by
a = y.data == 3, then the slowdown will move to the next operation related to CUDA. In this script, it will move to
x.cuda(). I also observe that the GPU is super busy after the backward call, which may have caused the slowdown, but I can’t say what is causing the GPU to be completely occupied for half a minute as no other program is running on it.
from torch import nn
from torch.autograd import Variable
import torch
import torch.nn.functional as F
import numpy as np


class CNN_Text(nn.Module):
    def __init__(self):
        super(CNN_Text, self).__init__()
        V = 2000
        D = 300
        Co = 300
        Ks = [3,4,5]
        C = 359
        Ci = 1
        self.embed = nn.Embedding(V, D, padding_idx=1)
        self.convs1 = nn.ModuleList([nn.Conv2d(Ci, Co, (K, D)) for K in Ks])
        self.dropout = nn.Dropout(0.5)
        self.fc1 = nn.Linear(len(Ks) * Co, C)

    def forward(self, x):
        x = self.confidence(x)
        logit = F.log_softmax(x)  # (N,C)
        return logit

    def confidence(self, x):
        x = self.embed(x)  # (N,W,D)
        x = x.unsqueeze(1)  # (N,Ci,W,D)
        x = [F.relu(conv(x)).squeeze(3) for conv in self.convs1]  # [(N,Co,W), ...]*len(Ks)
        x = [F.max_pool1d(i, i.size(2)).squeeze(2) for i in x]  # [(N,Co), ...]*len(Ks)
        x = torch.cat(x, 1)
        x = self.dropout(x)  # (N,len(Ks)*Co)
        linear_out = self.fc1(x)
        return linear_out


cnn = CNN_Text()
cnn.cuda()

for i in range(10):
    x = np.random.randint(0, 1500, (359, 15))
    x[:, 0:5] = 1
    x[:, -3:] = 1
    y = np.random.randint(0, 359, (359,))
    x = torch.from_numpy(x)
    y = torch.from_numpy(y)
    print('1')
    x = Variable(x.cuda())
    y = Variable(y.cuda())
    x = F.log_softmax(cnn(x))
    print('2')
    loss = F.nll_loss(x, y)
    print('3')
    loss.backward()
    print('4')
    print(y.data)
    print('5')
Also, the anaconda version differs from the documentation in quite a few places (norm does not support keepdim, there is no matmul, etc.). Hopefully it will be updated soon.
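Whenever you benchmark CUDA code, you need to make sure you call torch.cuda.synchronize() right before collecting the time. CUDA operations are launched asynchronously, so without the synchronize the Python call returns before the GPU has finished, and the cost shows up at whichever later operation has to wait for the result.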
x = x.cuda()
print('1')
y = x ** 2
torch.cuda.synchronize()
print('2')
In your case, add torch.cuda.synchronize() right before print('2'), print('3') and print('4').
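To make the synchronize-before-timing advice concrete, here is a minimal sketch of a timing helper. The name timed and its sync parameter are made up for illustration and are not part of PyTorch; the only PyTorch calls used are torch.cuda.is_available() and torch.cuda.synchronize(). The GPU part is skipped when torch or a CUDA device is not available.

```python
import time


def timed(fn, sync=None):
    """Run fn() and return (result, elapsed_seconds).

    sync is an optional zero-argument callable (for example
    torch.cuda.synchronize). It is called before the clock starts and
    again before it stops, so queued GPU work is charged to the right step.
    """
    if sync is not None:
        sync()  # drain work queued by earlier steps
    start = time.perf_counter()
    result = fn()
    if sync is not None:
        sync()  # wait for the work launched by fn to finish
    return result, time.perf_counter() - start


try:
    import torch
except ImportError:  # torch not installed; the helper still works for CPU code
    torch = None

if torch is not None and torch.cuda.is_available():
    x = torch.randn(1000, 1000).cuda()
    _, seconds = timed(lambda: x ** 2, sync=torch.cuda.synchronize)
    print('x ** 2 took %.6f s' % seconds)
```

Wrapping the forward, loss and backward steps of the script in such a helper would attribute the time to the step that actually does the work, rather than to the next operation that happens to block.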
Yes, you are right. I have added torch.cuda.synchronize() to the example, and now the majority of the time in this case is spent in the backward call. torch._C._cuda_synchronize is the most time-consuming operation, taking 469 seconds. So the weirdness of
The other version works just fine. The flowchart looks like the one posted. Nothing interesting.