Minimize memory usage for masked convolutions

I am working on a feedforward autoencoder network that takes variable-sized images as input, and I am having trouble with memory usage. I assemble roughly size-matched minibatches by zero-padding the inputs and maintaining a 0-1 mask tensor. To keep the output of the network independent of the padding, I mask the output of each convolution. My basic layer looks something like

output = elu(conv2d(x)) * mask[:,None].expand_as(x) for 4D input x.
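
For concreteness, here is a minimal sketch of that layer as a module. The class name `MaskedConv`, the 3x3 kernel, and the "same" padding are illustrative assumptions, not details of my actual network. It relies on broadcasting `mask[:, None]` over the channel dimension rather than `expand_as(x)`, which should give the same result whenever the convolution preserves the channel count.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MaskedConv(nn.Module):
    """Hypothetical layer name: conv -> ELU -> multiply by the padding mask."""

    def __init__(self, in_channels, out_channels, kernel_size=3):
        super().__init__()
        # "Same" padding so H and W are preserved (assumes an odd kernel size).
        self.conv = nn.Conv2d(in_channels, out_channels, kernel_size,
                              padding=kernel_size // 2)

    def forward(self, x, mask):
        # x: (N, C, H, W); mask: (N, H, W), 1 on real pixels and 0 on padding.
        # mask[:, None] has shape (N, 1, H, W) and broadcasts over channels.
        return F.elu(self.conv(x)) * mask[:, None]


# Toy batch: two images zero-padded to a common 8x8 canvas.
x = torch.randn(2, 3, 8, 8)
mask = torch.zeros(2, 8, 8)
mask[0, :5, :6] = 1  # first image occupies a 5x6 region
mask[1] = 1          # second image fills the whole canvas

layer = MaskedConv(3, 4)
out = layer(x * mask[:, None], mask)
```

After the call, every position of `out` that lies in the padded region is zero, for any number of output channels.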

Memory is the bottleneck because my images can be large (up to 300x300), which limits both the size of my minibatches and the number of layers.

How can I reduce the memory usage of the line above? Are ReLUs more memory-efficient than ELUs in the backward pass, since the ReLU gradient only requires the sign of the activations? Is there a more memory- or computation-efficient way to apply a binary mask?

In the ideal case, if I switch from ELU to ReLU, I would expect the non-transient memory usage of the layer above to be N*C*H*W/8 bytes (one bit per activation), where x has size (N,C,H,W). Is there an easy way to check whether PyTorch achieves that bound?
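
One way I could imagine checking this is to count the tensors that autograd saves for the backward pass. The sketch below assumes PyTorch 1.10 or later, where `torch.autograd.graph.saved_tensors_hooks` is available. The helper name `saved_bytes` and the tensor sizes are made up for illustration. It only counts saved tensors (including the convolution's own saved input and weight), not transient workspace memory, so it is a rough estimate rather than a precise measurement.

```python
import torch
import torch.nn.functional as F


def saved_bytes(fn):
    """Run fn() and return (output, bytes of distinct tensors saved for backward)."""
    seen = {}

    def pack(t):
        # Key by data pointer so a tensor saved by two graph nodes counts once.
        seen[t.data_ptr()] = t.numel() * t.element_size()
        return t

    def unpack(t):
        return t

    with torch.autograd.graph.saved_tensors_hooks(pack, unpack):
        out = fn()
    return out, sum(seen.values())


# Small illustrative sizes; real inputs would be up to 300x300.
N, C, H, W = 2, 8, 64, 64
x = torch.randn(N, C, H, W)
mask = torch.ones(N, H, W)
conv = torch.nn.Conv2d(C, C, 3, padding=1)

_, elu_bytes = saved_bytes(lambda: F.elu(conv(x)) * mask[:, None])
_, relu_bytes = saved_bytes(lambda: F.relu(conv(x)) * mask[:, None])

# The hoped-for bound from above: one bit per activation.
ideal_bytes = N * C * H * W // 8

print("ELU saved bytes: ", elu_bytes)
print("ReLU saved bytes:", relu_bytes)
print("Ideal bound:     ", ideal_bytes)
```

On a GPU, comparing `torch.cuda.max_memory_allocated()` before and after the forward pass would be another option, although that figure also includes transient allocations.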