By default, ReLU allocates new memory for its output. You can make it modify the input directly by setting the inplace=True flag. I am not sure whether this is the only reason, but ReLU will certainly consume memory.
If the only change was introducing ReLU, I cannot really figure out the cause at the moment.
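A minimal sketch of the difference, assuming a small CPU tensor (the names `x`, `y`, `out_default` and `out_inplace` are only for illustration). It compares storage addresses to show when the output reuses the input buffer:

```python
import torch
import torch.nn as nn

x = torch.randn(4, 3)

# Default: the result is written to a newly allocated buffer.
out_default = nn.ReLU()(x)
print(out_default.data_ptr() == x.data_ptr())  # separate storage

# inplace=True: the input tensor is overwritten and returned.
y = x.clone()
out_inplace = nn.ReLU(inplace=True)(y)
print(out_inplace.data_ptr() == y.data_ptr())  # shared storage
```

Note that the in-place version destroys the original values of `y`, so it is only safe when nothing else needs them.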
A proper way to debug such issues is to use the profiler. It helps you find bottlenecks even when no error or warning is raised, so you can optimize your code much further.
Sorry for the lack of knowledge. https://pytorch.org/tutorials/recipes/recipes/profiler.html
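For what it is worth, a rough sketch of memory profiling along the lines of that recipe could look like the following. The model and input sizes are made up for illustration and are not from the original question:

```python
import torch
import torch.nn as nn
from torch.profiler import profile, ProfilerActivity

# Hypothetical toy model, standing in for the real network.
model = nn.Sequential(nn.Linear(256, 256), nn.ReLU(), nn.Linear(256, 10))
inputs = torch.randn(32, 256)

# profile_memory=True records tensor allocations per operator.
with profile(activities=[ProfilerActivity.CPU],
             profile_memory=True,
             record_shapes=True) as prof:
    loss = model(inputs).sum()
    loss.backward()

# Sort operators by the memory they allocated themselves.
table = prof.key_averages().table(sort_by="self_cpu_memory_usage", row_limit=10)
print(table)
```

Running this once with and once without the ReLU layer, and comparing the tables, should show which operators account for the extra allocations.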
Based on my own understanding of graph mode, ReLU will use extra memory during training even when the inplace flag is set.
The inplace flag only means that the output reuses the input's memory in the forward pass. However, some extra information still has to be kept for the later backward propagation: either the input tensor X or the positions where X > 0. I do not know where this happens, but I believe that saving something for backward is indeed required.
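A small sketch that shows backward still has the sign information after an in-place ReLU. The tensor values are arbitrary, and the `clone()` is there only because in-place ops are not allowed on a leaf tensor that requires grad:

```python
import torch
import torch.nn.functional as F

x = torch.tensor([-1.0, 0.5, 2.0], requires_grad=True)
h = x.clone()                   # non-leaf copy that may be modified in place
out = F.relu(h, inplace=True)   # overwrites h; the negative entry is now 0

out.sum().backward()
print(x.grad)                   # tensor([0., 1., 1.])
```

The gradient is zero exactly where the input was not positive, so the mask survived the overwrite. The output alone is enough to recover it here, since `out > 0` holds at the same positions as `x > 0`.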
In inference mode only, I think ReLU(True) (that is, inplace=True) will not introduce extra memory.
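A quick way to check this for the inference case, assuming gradients are disabled with `torch.no_grad()` (the layer sizes are placeholders):

```python
import torch
import torch.nn as nn

linear = nn.Linear(8, 8)
relu = nn.ReLU(True)  # positional form of inplace=True

with torch.no_grad():
    hidden = linear(torch.randn(2, 8))
    out = relu(hidden)

print(out.data_ptr() == hidden.data_ptr())  # output reuses the hidden buffer
print(out.grad_fn is None)                  # no autograd node was recorded
```

With no autograd graph being built, nothing is held for a backward pass, and the activation lives in the same buffer as the linear layer's output.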