PyTorch memory consumption

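You might be forgetting the intermediate activations: the output of each layer has to be kept in memory during the forward pass so the backward pass can compute gradients. Their size grows with batch size and input resolution, so they can easily take more memory than the parameters themselves. This post describes a similar use case for a ResNet.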
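To get a feel for the numbers, here is a rough back-of-the-envelope sketch in plain Python. The layer list, batch size, and input resolution are made up for illustration (they are not from the question), and it assumes float32, stride 1, "same" padding, and exactly one stored output tensor per conv layer. What autograd actually saves depends on the ops involved, so treat this as an order-of-magnitude estimate only.

```python
# Hypothetical conv stack: (in_channels, out_channels, kernel_size).
# Assumptions: stride 1, "same" padding, so spatial size stays constant.
layers = [(3, 64, 3), (64, 64, 3), (64, 128, 3)]
batch, height, width = 32, 224, 224  # illustrative values
BYTES_PER_FLOAT = 4  # float32


def param_bytes(layers):
    """Memory for conv weights plus one bias per output channel."""
    n_params = sum((cin * k * k + 1) * cout for cin, cout, k in layers)
    return n_params * BYTES_PER_FLOAT


def activation_bytes(layers, batch, height, width):
    """Memory for one stored output tensor per layer (simplifying assumption)."""
    n_values = sum(batch * cout * height * width for _, cout, _ in layers)
    return n_values * BYTES_PER_FLOAT


p = param_bytes(layers)
a = activation_bytes(layers, batch, height, width)
print(f"parameters:  {p / 2**20:.2f} MiB")
print(f"activations: {a / 2**20:.2f} MiB")
```

Under these assumptions the parameters come to well under 1 MiB while the stored activations are on the order of 1.5 GiB. The activation term also scales linearly with the batch size, which is why reducing the batch size is usually the first thing to try.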