Periodic slowdown of LSTM

Hello,

I found that the forward step of an LSTM is periodically much slower than in the normal case (about 100x to 1000x). Like this:

0 0.6872286860016175
1 0.03016717400169
2 0.12600803600798827
3 0.0197318460122915
4 0.01949280800181441
5 0.020448374008992687
6 0.018867813996621408
7 0.019007549999514595
8 0.018927561002783477
9 0.021016058992245235
10 0.019149257001117803
11 0.01897870100219734
12 0.021137274001375772
13 0.01889458800724242
14 0.018980594002641737
15 0.018866767000872642
16 0.021382328995969146
17 0.018798255012370646
18 0.018863944002077915
19 0.018828870001016185
20 0.023177682000095956
21 0.6736630689993035
22 1.2398238299938384
23 0.030545044006430544
24 0.11513781499525066
25 0.019403386002522893
26 0.019165677003911696
27 0.019054023010539822
28 0.018622093004523776
29 0.019038957005250268
30 0.018983667003340088
31 0.01874833799956832
32 0.02260433901392389
33 0.01899709399731364
34 0.01882772501267027
35 0.018957025997224264
36 0.0190426729968749
37 0.018901817995356396
38 0.018845165992388502
39 0.018859390998841263
40 0.018843275000108406
41 0.018885923011112027
42 0.018775675998767838
43 1.3015025630011223

And here is code to reproduce it:

import torch
import torch.nn as nn
import torch.optim as optim
from torch.autograd import Variable
from time import perf_counter


class LanguageModel(nn.Module):
    def __init__(self, n_input, n_hidden):
        super().__init__()

        self.rnn = nn.LSTM(n_input, n_hidden, batch_first=True)
        self.n_hidden = n_hidden

    def forward(self, input, hidden, training=True):
        output, hidden = self.rnn(input, hidden)

        return output, hidden

    def init_hidden(self, batch_size):
        weight = next(self.parameters()).data

        return (Variable(weight.new(1, batch_size, self.n_hidden).zero_()),
                Variable(weight.new(1, batch_size, self.n_hidden).zero_()))

def repackage_hidden(h):
    # Detach the hidden state from its history so the graph does not keep growing
    if type(h) == Variable:
        return Variable(h.data)

    else:
        return tuple(repackage_hidden(v) for v in h)


i_size = 16
o_size = 16

model = LanguageModel(i_size, o_size)
model.cuda()
criterion = nn.MSELoss().cuda()
optimizer = optim.SGD(model.parameters(), lr=30, momentum=0)

def train():
    hidden = model.init_hidden(128)
    X_tensor = torch.randn(128, 20, i_size).cuda()
    Y_tensor = torch.randn(128, 20, o_size).cuda()
    
    for i in range(5000):
        X_var, Y_var = Variable(X_tensor), Variable(Y_tensor.view(-1))  # Y_var is unused; only the forward pass is timed
        hidden = repackage_hidden(hidden)
        model.zero_grad()
        torch.cuda.synchronize()  # make sure earlier GPU work has finished before starting the timer
        start = perf_counter()
        output, hidden = model(X_var, hidden)
        torch.cuda.synchronize()  # wait for the forward kernels to complete
        end = perf_counter()
        print(i, end - start)


if __name__ == '__main__':
    train()

I tested this on PyTorch 0.3 (precompiled) and 0.4 (master) with CUDA 9. With PyTorch 0.3 and CUDA 8 this phenomenon did not occur. I also see it with nn.RNN and nn.LSTM, but not with an LSTM written directly with autograd operations, nor with other regular neural network models. So I suspect this is somehow related to the cuDNN LSTM.
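One quick way to check this suspicion would be to disable cuDNN and time the same forward pass again (a minimal sketch along the lines of the script above): if the periodic spikes disappear, the cuDNN RNN path is the likely source rather than PyTorch itself.

import torch
import torch.nn as nn
import torch.backends.cudnn as cudnn
from torch.autograd import Variable
from time import perf_counter

# Disabling cuDNN forces nn.LSTM onto the non-cuDNN fallback implementation
cudnn.enabled = False

rnn = nn.LSTM(16, 16, batch_first=True).cuda()
x = Variable(torch.randn(128, 20, 16).cuda())

for i in range(100):
    torch.cuda.synchronize()
    start = perf_counter()
    output, hidden = rnn(x)  # hidden state defaults to zeros when not passed
    torch.cuda.synchronize()
    print(i, perf_counter() - start)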

Also, the slowdown is similar across different LSTM sizes: even with much larger or smaller models, the slow iterations take about the same time (~1.2 seconds). So I think there is some roughly constant delay being inserted between forward steps.
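Timing with CUDA events in addition to perf_counter would also show whether that extra ~1.2 seconds is actually spent running kernels on the GPU, or whether the host is blocked around the launch (e.g. in some cuDNN plan or workspace setup). A minimal sketch, reusing the same shapes as above:

import torch
import torch.nn as nn
from torch.autograd import Variable
from time import perf_counter

rnn = nn.LSTM(16, 16, batch_first=True).cuda()
x = Variable(torch.randn(128, 20, 16).cuda())

for i in range(100):
    start_evt = torch.cuda.Event(enable_timing=True)
    end_evt = torch.cuda.Event(enable_timing=True)

    torch.cuda.synchronize()
    wall_start = perf_counter()
    start_evt.record()
    output, hidden = rnn(x)
    end_evt.record()
    torch.cuda.synchronize()
    wall = perf_counter() - wall_start

    # elapsed_time() is in milliseconds and only covers work executed on the GPU
    gpu = start_evt.elapsed_time(end_evt) / 1000.0
    print(i, 'wall %.4f s, gpu %.4f s' % (wall, gpu))

If the wall time spikes while the event time stays flat, the delay is on the host side rather than in the kernels themselves.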

Is there any potential cause for this phenomenon? Can the CUDA / cuDNN environment affect PyTorch performance like this?

cc: @ngimel is this a known problem?

No, but I have not observed this on CUDA 9 / master / P100 either:

0 0.002871701493859291
1 0.002220509573817253
2 0.00203584972769022
3 0.002132742665708065
4 0.0020711934193968773
5 0.0021552909165620804
6 0.0020852675661444664
7 0.0021129408851265907
8 0.0020989077165722847
9 0.0020763883367180824
10 0.002113018184900284
11 0.002088014967739582
12 0.0021434975787997246
13 0.002110414206981659
14 0.0021486449986696243
15 0.0020739007741212845
16 0.002066992223262787
17 0.002080370672047138
18 0.0020767981186509132
19 0.0020976243540644646
20 0.0021136710420250893
21 0.002109520137310028
22 0.0020855749025940895
23 0.002149587497115135
24 0.002090880647301674
25 0.0020865993574261665
26 0.002097831107676029
27 0.002075604163110256
28 0.002096814103424549
29 0.0020804982632398605
30 0.002074296586215496
31 0.002080225385725498
32 0.0021444177255034447
33 0.0021434472873806953
34 0.002110465429723263
35 0.0021207425743341446
36 0.002166486345231533
37 0.002124168910086155
38 0.0021085841581225395
39 0.0020610038191080093
40 0.0020665526390075684
41 0.002103119157254696
42 0.002092542126774788
43 0.002048852853477001
44 0.00206853449344635
45 0.0020881779491901398
46 0.002047020010650158
47 0.002078133635222912
48 0.0020792195573449135
49 0.002088909037411213
50 0.0021005412563681602
51 0.002128974534571171
52 0.0020538605749607086
53 0.0020755771547555923
54 0.002057150937616825
55 0.0020484374836087227
56 0.0020883483812212944
57 0.0020784419029951096
58 0.0020894045010209084
59 0.0021256282925605774
60 0.0021245935931801796
61 0.002068900503218174
62 0.0020725466310977936
63 0.0021116798743605614
64 0.0021283915266394615
65 0.002084537409245968
66 0.0020663877949118614