Hi all,

I have a question about PyTorch GPU memory usage. I am experimenting with a few architectures for a fully connected network; each has n hidden units per hidden layer, but the number of hidden layers varies. I was expecting the memory footprint of each model to be roughly the same during inference (no backprop). The models themselves are small and don't take up much GPU memory, but the deeper networks seem to use much more VRAM. Since the hidden layers all have the same width, the matrix multiplications operate on matrices of the same size, so I expected all the models to use roughly the same amount of memory. What am I missing here? Why do the deeper networks need more memory during inference?
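For concreteness, here is a minimal sketch of the kind of comparison I'm running (the layer widths, batch size, and `make_mlp` helper are made up for illustration; my actual models differ only in depth):

```python
import torch
import torch.nn as nn

def make_mlp(depth, n_in=1024, n_hidden=1024, n_out=10):
    # Fully connected net with `depth` hidden layers, each n_hidden units wide.
    layers = [nn.Linear(n_in, n_hidden), nn.ReLU()]
    for _ in range(depth - 1):
        layers += [nn.Linear(n_hidden, n_hidden), nn.ReLU()]
    layers.append(nn.Linear(n_hidden, n_out))
    return nn.Sequential(*layers)

shallow = make_mlp(depth=2)
deep = make_mlp(depth=8)

x = torch.randn(256, 1024)
with torch.no_grad():  # inference only, no backprop
    y_shallow = shallow(x)
    y_deep = deep(x)

# On the GPU I move each model to the device and check
# torch.cuda.memory_allocated() after a forward pass; the deeper
# model reports a noticeably larger number.
```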

Thanks in advance!