Getting modifications to LSTMCell to carry over to CUDA

I made some modifications to the LSTMCell function in nn_functions\rnn.py, which work as intended when I run the network on my CPU. But when I use CUDA, the network seems to skip the LSTMCell function.

The network still runs when I use CUDA, but when I throw a print('working') into the LSTMCell function, it doesn't print, implying that the LSTMCell function gets skipped when CUDA is used.

In which other files must I make modifications for my changes to carry over when the network is run on GPU?

Illustration:

def LSTMCell(input, hidden, w_ih, w_hh, b_ih=None, b_hh=None):
    if input.is_cuda:
        # CUDA fast path: hands the gate math to the fused kernel instead of the
        # Python code below (F and fusedBackend are imported at the top of rnn.py)
        igates = F.linear(input, w_ih)
        hgates = F.linear(hidden[0], w_hh)
        state = fusedBackend.LSTMFused.apply
        return state(igates, hgates, hidden[1]) if b_ih is None else state(igates, hgates, hidden[1], b_ih, b_hh)
        # I can comment the above section out entirely and the network still runs

    hx, cx = hidden
    gates = F.linear(input, w_ih, b_ih) + F.linear(hx, w_hh, b_hh)
    # modified: 5 gates instead of the standard 4 (extra "choose" gate)
    ingate, forgetgate, cellgate, outgate, choosegate = gates.chunk(5, 1)

    # the rounded sigmoid acts as a 0/1 switch between tanh and relu cell activations
    choosegate_1 = F.sigmoid(choosegate).round()
    choosegate_2 = 1 - choosegate_1
    cellgate_1 = F.tanh(cellgate)
    cellgate_2 = F.relu(cellgate)

    ingate = F.sigmoid(ingate)
    forgetgate = F.sigmoid(forgetgate)
    outgate = F.sigmoid(outgate)
    cellgate = (choosegate_1 * cellgate_1) + (choosegate_2 * cellgate_2)

    cy = (forgetgate * cx) + ingate*cellgate
    hy = outgate * F.tanh(cy)
    return hy, cy

My modifications to the LSTMCell work properly when run without CUDA, but it seems that CUDA may have its own construction of the LSTMCell that I cannot find.

If someone could point out where CUDA runs the LSTMCell layers through activation functions, or more generally where it constructs the entire LSTMCell, I would greatly appreciate it!

Thanks!!

You'd have to comment out the entire if input.is_cuda section for your modifications to take effect; otherwise, a hardcoded fused kernel is called that calculates the standard LSTM cell. If performance is inadequate after that, you might want to look at creating your own extension with the LSTM cell architecture you want. https://github.com/pytorch/tutorials/pull/214/files
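
For reference, a rough sketch of what the extension route can look like (everything below is illustrative, not taken from the linked tutorial: it just compiles a tiny C++ op for the "choose gate" blend at runtime with torch.utils.cpp_extension.load_inline; a real replacement for the fused kernel would also supply a CUDA source, as the tutorial does):

import torch
from torch.utils.cpp_extension import load_inline

# Illustrative C++ source: the "choose gate" blend from the modified cell,
# written as a compiled op instead of Python tensor math.
cpp_source = """
#include <torch/extension.h>

torch::Tensor choose_cellgate(torch::Tensor cellgate, torch::Tensor choosegate) {
    auto choose = torch::sigmoid(choosegate).round();
    return choose * torch::tanh(cellgate) + (1 - choose) * torch::relu(cellgate);
}
"""

ext = load_inline(
    name="choose_cellgate_ext",   # hypothetical extension name
    cpp_sources=cpp_source,
    functions=["choose_cellgate"],
    verbose=False,
)

cellgate = torch.randn(4, 8)
choosegate = torch.randn(4, 8)
print(ext.choose_cellgate(cellgate, choosegate).shape)  # torch.Size([4, 8])

Building this needs a working C++ compiler (and ninja) at runtime, and matching the fused kernel's speed would mean moving the full gate math into a .cu kernel rather than a plain C++ function.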

Hi ngimel,
Thank you so much for your reply!

I do comment out the if input.is_cuda section, and when I run the network normally (without CUDA), my modifications take effect.
But when I activate CUDA (e.g. USE_GPU = True), the code runs super fast, but it seems to skip the entire function (I can still comment out that section and it still runs with CUDA).

I have even put the print('working') line inside the if input.is_cuda block (when CUDA is activated) and it still does not print.

To clarify, my extension works perfectly fine if I do not activate CUDA. But it seems that converting my model and tensors to CUDA starts a process in which the LSTM cell architecture is generated by CUDA.

I have traced the information flow to this class: class CudnnRNN(NestedIOFunction), within the same nn_functions\rnn.py file, and from there it flows to torch\backends\cudnn\rnn.py, but nowhere can I find where CUDA builds the actual architecture for me to modify.

But I am very new to coding, so perhaps I misunderstood what you meant by creating my own extension.

That means you are not using LSTMCell, you are using nn.LSTM, and it directly calls the cudnn LSTM. If you want to bypass cudnn when using nn.LSTM, you have to set torch.backends.cudnn.enabled = False.
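
For reference, a minimal sketch of that switch (the layer sizes below are made up; in the PyTorch version discussed in this thread, disabling cudnn makes nn.LSTM fall back to the Python LSTMCell in rnn.py):

import torch
import torch.nn as nn

# Turn off the cudnn backend before the forward pass, so nn.LSTM does not
# call the fused cudnn kernel and the Python cell code runs instead.
torch.backends.cudnn.enabled = False

device = "cuda" if torch.cuda.is_available() else "cpu"
lstm = nn.LSTM(input_size=32, hidden_size=64, num_layers=1).to(device)
x = torch.randn(10, 4, 32, device=device)  # (seq_len, batch, input_size)

out, (h, c) = lstm(x)  # runs without the cudnn kernel
print(out.shape)       # torch.Size([10, 4, 64])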

Thanks again for your reply!

So I don't want to bypass cudnn, because I still want the full GPU speed boost.
But I want to figure out where the cudnn LSTM is built, so that I can apply my modifications there.
Or is there no way to change how cudnn builds the LSTM?

Interestingly, I tried your suggestion (torch.backends.cudnn.enabled = False) and it works!
My modifications take effect, and it's about 5 times faster than running without CUDA (1 min vs 5 min per epoch).
However, it's also about 3 times slower than using the cudnn backend (20 sec per epoch).

So if there's a way to modify the architecture of the LSTM in the cuda backend, that'd be the best option.
I'm going to be running this network for 100 epochs, 60 times, so I need every little GPU boost I can get.

Regardless, even if this is the best you can do for me, THANK YOU, you probably saved me hundreds of hours.

Sorry, but this is the best you can do without significant effort. Cudnn has a hard-coded LSTM cell, and you cannot modify it.

Alright, this will have to do. Thanks so much for your help!