Autograd for relu over a part of a tensor

Please, advise how to make this work.

import torch.nn as nn


class HalfReLU(nn.Module):
	def __init__(self):
		super(HalfReLU, self).__init__()

	def forward(self, x):
		# apply relu only to the first half of the channels
		half = x.shape[1] // 2
		x[:, :half] = nn.functional.relu(x[:, :half], inplace=False)
		return x
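
Roughly how it gets called (simplified repro; in the real model the input is a CUDA tensor inside a larger network):

import torch

layer = HalfReLU()
x = torch.randn(128, 128, 8, 8, requires_grad=True)
y = x * 2              # non-leaf tensor feeding the module
out = layer(y)
out.sum().backward()   # the RuntimeError is raised during backward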

I’m getting

RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [128, 128, 8, 8]], which is output 0 of SliceBackward, is at version 1; expected version 0 instead. Hint: the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later. Good luck!

Whether relu is in-place or not doesn’t matter. It works (or at least doesn’t raise an error, I can’t say for sure) with things like

		x[:, :half] *= 2

and

		x[:, :half] = nn.functional.sigmoid(x[:, :half])

but not with relu. I’m guessing it’s some sort of bug or edge case? It doesn’t work with leaky_relu either.

I’d just use torch.cat of the pristine and relu’d parts.
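
Something along these lines (a rough sketch, not tested):

import torch
import torch.nn as nn


class HalfReLU(nn.Module):
	def forward(self, x):
		half = x.shape[1] // 2
		# relu the first half of the channels, keep the rest as-is,
		# then stitch the pieces back together along the channel dim
		return torch.cat((nn.functional.relu(x[:, :half]), x[:, half:]), dim=1)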


That works, though it’s very slow compared to plain ReLU. But even simply

x = nn.functional.relu(x[:, :], inplace=True)

is itself about 50% slower than

x = nn.functional.relu(x, inplace=True)

Numbers? (input sizes, cpu vs. gpu, …)

For me, with a decently sized tensor, it doesn’t matter:

x = torch.randn(1000,1000)
%timeit torch.nn.functional.relu(x[:, :], inplace=True)
71.8 µs ± 59.1 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

%timeit torch.nn.functional.relu(x, inplace=True)
69 µs ± 141 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

If relu is slowing down your network, you’re doing something wrong.

You can be a tiny bit faster than relu + cat if you clone and use an in-place relu (which works with autograd). But TBH I’d stay in the functional world; optimizing ReLU likely isn’t a huge win.
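
If you want to try the clone route anyway, roughly (untested sketch):

import torch

def half_relu(x):
	# clone so the original tensor isn't modified in place
	out = x.clone()
	# in-place relu on a slice of the clone; autograd handles this
	out[:, : out.shape[1] // 2].relu_()
	return out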

Interesting. I tried this change in a large network, so I can’t say at what specific sizes it was slow. Regarding slowdowns and autograd, what can you say about TensorFlow’s XLA performance? From the benchmarks I’ve seen it seems to deliver quite fantastic optimizations, and that’s at least partly due to optimizing things like relu (cutting down on memory movement in this case), right? Ops like relu seem to matter much more in the Tensor Core era than they did before Volta/Turing, especially on GPUs with unlocked throughput (i.e. not the 20xx/30xx series).