Dropout Layers with Packed Sequences

(PyTorch 0.4)

How does one apply a manual dropout layer to a packed sequence (specifically in an LSTM on a GPU)? Passing the packed sequence (which comes from the lstm layer) directly does not work, as the dropout layer doesn’t know quite what to do with it and returns something not a packed sequence. Passing the data of the packed sequence seems like it should work, but results in the attribute error shown below the code sample.

Perversely, I can make this an inplace operation (again, on the data directly, not the full packed sequence) and it technically works (i.e., it runs) on the CPU, but gives a warning on the GPU that the inplace operation is modifying a needed gradient.


  1. Are the different behaviors between CPU and GPU expected?
  2. What is the overall correct way to do this on a GPU?
  3. What is the overall correct way to do this on a CPU?

    def __init__(self, ....):
        super(Model1, self).__init__()
        self.drop     = torch.nn.Dropout(p=0.5, inplace=False)

    def forward(self, inputs, lengths):
        pack1              = nn.utils.rnn.pack_padded_sequence(inputs, lengths, batch_first=True)
        out1, self.hidden1 = self.lstm1(pack1, (self.hidden1[0].detach(), self.hidden1[1].detach()))
        out1.data = self.drop(out1.data)

    AttributeError: can't set attribute

Does anyone use dropout with packed sequences?

I have a tentative workaround for this, but I am very curious to know what the standard PyTorch way of doing this is, and what is going on with the different behaviors on GPU and CPU. That’s the sort of thing that can really make you question the validity of your results.

I usually create a new packed sequence when I apply an op (reusing the old batch_sizes). The docs tell you not to, but it works just fine.
Note, however, that dropout for sequences is something where there are several options which, depending on whom you ask, work to varying degrees (e.g. “variational dropout”).
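
For illustration, the "new packed sequence" idea can be sketched as follows. This is only a sketch under assumptions: the helper name `dropout_packed` is made up here, and the code is written against a recent PyTorch release, where `PackedSequence` also carries `sorted_indices` and `unsorted_indices` (on 0.4 only `data` and `batch_sizes` exist).

```python
import torch
from torch import nn
from torch.nn.utils.rnn import PackedSequence, pack_sequence


def dropout_packed(packed, drop):
    """Apply `drop` to the data of a PackedSequence and rewrap the result.

    Hypothetical helper, not part of PyTorch.
    """
    # Dropout acts elementwise, so it can run on the flat packed data directly.
    new_data = drop(packed.data)
    # Reuse the original batch sizes (and, on recent PyTorch versions, the
    # sort indices) so the new object describes the same sequences.
    return PackedSequence(
        new_data,
        packed.batch_sizes,
        packed.sorted_indices,
        packed.unsorted_indices,
    )


# Three sequences of lengths 3, 2, 1 with 4 features each (longest first).
packed = pack_sequence([torch.randn(3, 4), torch.randn(2, 4), torch.randn(1, 4)])
dropped = dropout_packed(packed, nn.Dropout(p=0.5))
```

The result can be passed to the next LSTM layer like any other packed sequence. Because nothing here goes through an in-place write, the autograd graph is left untouched.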

Best regards



Thanks for the response, and your point about Dropout styles is well-taken.

Can I prevail on you for a code snippet? All I am trying to do is add dropout in the simplest possible way, between the layers after the activations, and I had already hit on the idea of packing, running the LSTM, padding, running the dropout, etc. This seems similar to your idea of creating new packed sequences.

However, I have a three-test benchmark, which aims to learn an identity function of complex multi-coefficient data:

  1. Run a single layer LSTM network (no dropout layer)
  2. Run a two-layer LSTM network (no dropout layer)
  3. Run a two-layer LSTM network (dropout layer between L1 and L2, dropout set to 0, i.e., deactivated)

What I see in cases 1 and 2 is the network quickly learning to output what it gets in, while in case 3 I get substantially degraded performance. It never learns to mimic the input data at all. What I would expect, though, is effectively identical performance between cases 2 and 3, up to the shuffling of minibatches in my standard implementations.

My best guess is that I’ve somehow broken the gradient flow, but I can’t see how, or where, or how to fix it.
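
A quick way to test the gradient-flow hypothesis is to check `requires_grad` on intermediate tensors and `.grad` on the inputs after a backward pass. The toy sketch below is not the model from this thread; it only shows that reading a tensor through `.data` drops its autograd history, while passing the tensor itself does not.

```python
import torch
from torch import nn

drop = nn.Dropout(p=0.0)          # deactivated dropout
x = torch.randn(2, 3, requires_grad=True)
h = x * 2.0                       # stand-in for an upstream layer's output

via_tensor = drop(h)              # stays attached to the autograd graph
via_data = drop(h.data)           # .data drops the autograd history

print(via_tensor.requires_grad)   # True
print(via_data.requires_grad)     # False

# Gradients reach x only through the attached path.
via_tensor.sum().backward()
print(x.grad is not None)         # True
```

If an intermediate tensor in a forward pass reports `requires_grad == False`, every layer upstream of it receives no gradient from the loss.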

I’m implementing as follows, where the h1_kernel, c1_kernel, etc are hidden and cell state kernels, for learning initial hidden layers. (I prefer to learn kernels so that I can easily change batch sizes later; the full hidden and cell states are just repetitions of the learned kernels.)

    def forward(self, inputs, lengths, batch_size):
        self.h1, self.c1   = self.init_hidden(batch_size, self.h1_kernel, self.c1_kernel)
        self.h2, self.c2   = self.init_hidden(batch_size, self.h2_kernel, self.c2_kernel)

        pack1              = nn.utils.rnn.pack_padded_sequence(inputs, lengths, batch_first=True)
        out1, _            = self.lstm1(pack1, (self.h1, self.c1))
        pad1               = nn.utils.rnn.pad_packed_sequence(out1, batch_first=True)[0]
        drop1              = self.drop(pad1.data) 

        pack2              = nn.utils.rnn.pack_padded_sequence(drop1, lengths, batch_first=True)
        out2, _            = self.lstm2(pack2, (self.h2, self.c2))
        pad2               = nn.utils.rnn.pad_packed_sequence(out2, batch_first=True)
        dense_out          = self.dense(pad2[0])
        pack_dense         = nn.utils.rnn.pack_padded_sequence(dense_out, lengths, batch_first=True)
        pad_dense          = nn.utils.rnn.pad_packed_sequence(pack_dense, batch_first=True)

In this case, the dropout is set elsewhere to be 0, i.e., present, but de-activated.
If I remove the dropout altogether (and adjust the rest of the forward accordingly) the observed behavior in training is significantly (and repeatably) different. Absent dropout works better than deactivated dropout.

Can anyone please shed some light on this? Why does this happen? What is the correct way to do this if I want actual non-zero dropout?
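
One possible reading of the degraded case 3, offered as an assumption rather than a confirmed diagnosis: `self.drop(pad1.data)` works on a tensor that is detached from autograd, so no gradient would reach `lstm1`. A minimal sketch of the pack, LSTM, pad, dropout, repack pattern that passes the padded tensor itself is shown below. The class name and sizes are made up, and zero initial states replace the learned kernels to keep it short.

```python
import torch
from torch import nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence


class TwoLayerLSTM(nn.Module):
    """Hypothetical two-layer LSTM with dropout between the layers."""

    def __init__(self, n_features, n_hidden, p=0.5):
        super().__init__()
        self.lstm1 = nn.LSTM(n_features, n_hidden, batch_first=True)
        self.lstm2 = nn.LSTM(n_hidden, n_hidden, batch_first=True)
        self.drop = nn.Dropout(p=p)
        self.dense = nn.Linear(n_hidden, n_features)

    def forward(self, inputs, lengths):
        pack1 = pack_padded_sequence(inputs, lengths, batch_first=True)
        out1, _ = self.lstm1(pack1)
        pad1, _ = pad_packed_sequence(out1, batch_first=True)
        drop1 = self.drop(pad1)  # the tensor itself, not pad1.data
        pack2 = pack_padded_sequence(drop1, lengths, batch_first=True)
        out2, _ = self.lstm2(pack2)
        pad2, _ = pad_packed_sequence(out2, batch_first=True)
        return self.dense(pad2)


model = TwoLayerLSTM(n_features=4, n_hidden=8, p=0.0)
inputs = torch.randn(3, 5, 4)        # (batch, time, features)
lengths = torch.tensor([5, 3, 2])    # sorted longest first
output = model(inputs, lengths)
output.sum().backward()
```

After the backward pass, `model.lstm1.weight_ih_l0.grad` is populated, which is the property the detached version would lack.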

The solution appears to be, “Define the dropout layer with inplace=True,” which I leave here for posterity.


In case of ambiguity for how to use inplace:

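    import torch

    a_packed_seq = torch.nn.utils.rnn.pack_sequence([torch.randn(3, 1), torch.randn(2, 1), torch.randn(1, 1)])
    dropout_layer = torch.nn.Dropout(p=0.999, inplace=True)
    # Assumed usage: apply the layer to the packed data, which is modified in place.
    dropout_layer(a_packed_seq.data)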