Dropout of whole tensor

I’m trying to port some TensorFlow code to PyTorch in which a whole tensor is occasionally dropped out. This is done by multiplying the whole tensor by zero.

If I do the same in PyTorch, I get an error from autograd about an invalid in-place operation:

RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [64, 32, 32, 32]], which is output 0 of LeakyReluBackward0, is at version 1; expected version 0 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).

Here is the original TensorFlow code (“context” is the tensor):

maybe_dropout = tf.cast(tf.math.greater(tf.random.uniform([]), self._drop_out_rate), tf.float32)
context *= maybe_dropout

Here is my PyTorch implementation:

maybe_dropout = (torch.rand([]) > self._drop_out_rate).type(torch.get_default_dtype())
context *= maybe_dropout

I also tried simply setting the content to zero with

context[:] = 0

but this resulted in NaN gradients.
The only thing that worked was detaching the tensor before setting it to zero:

context = context.detach()
context[:] = 0

but that doesn’t quite seem right. What would be the correct way to implement this kind of dropout behavior?

Hi,

The error says that you modified in-place a Tensor whose value is needed for the backward pass.
In particular here, the value of context seems to be needed: it is the output of a LeakyReLU (see LeakyReluBackward0 in the error), and autograd saved it to compute that layer’s gradient. So if you do context *= maybe_dropout (which modifies it in-place), the backward pass can no longer run.
You need to make sure not to modify the original context:

context = context.clone() # Get new memory
context *= maybe_dropout # Now we can safely write inplace into it

Or

context = context * maybe_dropout # Use the out-of-place version
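
To make that concrete, here is a minimal, self-contained sketch of how the out-of-place version could be wrapped into a small module. The module name WholeTensorDropout, the shapes, and the usage below are made up for illustration and are not taken from your code:

import torch
import torch.nn as nn
import torch.nn.functional as F

class WholeTensorDropout(nn.Module):
    # Hypothetical helper: zeroes the entire input tensor with probability p.
    def __init__(self, p=0.5):
        super().__init__()
        self.p = p

    def forward(self, context):
        if self.training:
            # Scalar 1.0 (keep) or 0.0 (drop), like the TensorFlow cast above.
            maybe_dropout = (torch.rand([], device=context.device) > self.p).to(context.dtype)
            # Out-of-place multiply: the original `context` (whose value autograd
            # saved for the LeakyReLU backward) is left untouched.
            context = context * maybe_dropout
        return context

# Usage with arbitrary shapes:
x = torch.randn(64, 32, 32, 32, requires_grad=True)
y = WholeTensorDropout(p=0.3)(F.leaky_relu(x))
y.sum().backward()  # no in-place error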

Thanks, that worked! Just one more question: why does

context = context * maybe_dropout

work but not

context = context * 0

?

How does the second “not work”? They should be doing very similar things.
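
For reference, here is a minimal check (toy shapes, made up for illustration) suggesting that, in isolation, both out-of-place forms backpropagate fine and just scale the gradient by 0 or 1:

import torch
import torch.nn.functional as F

x = torch.randn(4, 3, requires_grad=True)
context = F.leaky_relu(x)

# Multiplying by a literal zero out of place...
(context * 0).sum().backward(retain_graph=True)
print(x.grad)  # all zeros, no error and no NaNs

# ...behaves just like multiplying by a computed 0./1. scalar.
x.grad = None
maybe_dropout = (torch.rand([]) > 0.5).to(context.dtype)
(context * maybe_dropout).sum().backward()
print(x.grad)  # either all zeros or the usual leaky_relu gradient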