Hello,
I found that the forward step of an LSTM is periodically much slower than in the normal case (about 100x to 1000x slower). Like this:
0 0.6872286860016175
1 0.03016717400169
2 0.12600803600798827
3 0.0197318460122915
4 0.01949280800181441
5 0.020448374008992687
6 0.018867813996621408
7 0.019007549999514595
8 0.018927561002783477
9 0.021016058992245235
10 0.019149257001117803
11 0.01897870100219734
12 0.021137274001375772
13 0.01889458800724242
14 0.018980594002641737
15 0.018866767000872642
16 0.021382328995969146
17 0.018798255012370646
18 0.018863944002077915
19 0.018828870001016185
20 0.023177682000095956
21 0.6736630689993035
22 1.2398238299938384
23 0.030545044006430544
24 0.11513781499525066
25 0.019403386002522893
26 0.019165677003911696
27 0.019054023010539822
28 0.018622093004523776
29 0.019038957005250268
30 0.018983667003340088
31 0.01874833799956832
32 0.02260433901392389
33 0.01899709399731364
34 0.01882772501267027
35 0.018957025997224264
36 0.0190426729968749
37 0.018901817995356396
38 0.018845165992388502
39 0.018859390998841263
40 0.018843275000108406
41 0.018885923011112027
42 0.018775675998767838
43 1.3015025630011223
And here is the code to reproduce this:
import torch
import torch.nn as nn
import torch.optim as optim
from torch.autograd import Variable
from time import perf_counter


class LanguageModel(nn.Module):
    def __init__(self, n_input, n_hidden):
        super().__init__()
        self.rnn = nn.LSTM(n_input, n_hidden, batch_first=True)
        self.n_hidden = n_hidden

    def forward(self, input, hidden, training=True):
        output, hidden = self.rnn(input, hidden)
        return output, hidden

    def init_hidden(self, batch_size):
        # Fresh zero-initialized (h, c) states on the same device/dtype as the weights.
        weight = next(self.parameters()).data
        return (Variable(weight.new(1, batch_size, self.n_hidden).zero_()),
                Variable(weight.new(1, batch_size, self.n_hidden).zero_()))


def repackage_hidden(h):
    # Detach hidden states from the previous graph so history does not accumulate.
    if type(h) == Variable:
        return Variable(h.data)
    else:
        return tuple(repackage_hidden(v) for v in h)


i_size = 16
o_size = 16

model = LanguageModel(i_size, o_size)
model.cuda()
criterion = nn.MSELoss().cuda()
optimizer = optim.SGD(model.parameters(), lr=30, momentum=0)


def train():
    hidden = model.init_hidden(128)
    X_tensor = torch.randn(128, 20, i_size).cuda()
    Y_tensor = torch.randn(128, 20, o_size).cuda()
    for i in range(5000):
        X_var, Y_var = Variable(X_tensor), Variable(Y_tensor.view(-1))
        hidden = repackage_hidden(hidden)
        model.zero_grad()
        # Time only the forward pass, with explicit synchronization around it.
        torch.cuda.synchronize()
        start = perf_counter()
        output, hidden = model(X_var, hidden)
        torch.cuda.synchronize()
        end = perf_counter()
        print(i, end - start)


if __name__ == '__main__':
    train()
I tested this on PyTorch 0.3 (precompiled) and 0.4 (master) with CUDA 9. With PyTorch 0.3 and CUDA 8, this phenomenon did not occur. I also see it with RNN and LSTM, but not with an LSTM written using autograd or with other regular neural network models. So I suspect it is somehow related to the cuDNN LSTM.
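For reference, one way to check whether the cuDNN LSTM path is really responsible is to rerun the same script with cuDNN disabled via torch.backends.cudnn.enabled, so that nn.LSTM falls back to PyTorch's non-cuDNN implementation. A minimal sketch of that check (assuming the fallback path is acceptable for a timing comparison):

import torch

# Disable cuDNN globally before constructing the model; nn.LSTM then uses
# the non-cuDNN implementation. If the periodic spikes disappear with this
# flag set, the delay is likely coming from the cuDNN LSTM path.
torch.backends.cudnn.enabled = False

# ... then run the same train() loop from the script above and compare timings.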
Also, the slowdown is similar across different LSTM sizes. For example, even with much larger or smaller models the slowdown is about the same (roughly 1.2 seconds), so I think there is some constant delay between the forward steps.
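This is roughly how I compared the spike magnitude across model sizes: a rough sketch that reuses the definitions from the script above, with max_forward_time being just a helper I made up for this comparison (it reports the worst single-step forward time over a shortened loop).

def max_forward_time(n_hidden, n_iters=200):
    # Build a model of the given hidden size and time each forward step,
    # returning the largest single-step time (the spike magnitude).
    model = LanguageModel(i_size, n_hidden).cuda()
    hidden = model.init_hidden(128)
    X_var = Variable(torch.randn(128, 20, i_size).cuda())
    worst = 0.0
    for _ in range(n_iters):
        hidden = repackage_hidden(hidden)
        torch.cuda.synchronize()
        start = perf_counter()
        _, hidden = model(X_var, hidden)
        torch.cuda.synchronize()
        worst = max(worst, perf_counter() - start)
    return worst

for n_hidden in (16, 256, 1024):
    print(n_hidden, max_forward_time(n_hidden))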
Is there any potential cause of this phenomenon? Can the CUDA & cuDNN environment affect PyTorch's performance like this?