Let’s say we have 10 neural network models. In order to decide how to partition these models amongst a set of GPUs, we need to know the size of each model. How do we calculate that?
You can estimate the memory footprint of the model itself by summing the number of elements in its parameters and buffers (and any other tensors, if needed) and multiplying by the dtype size in bytes (e.g. 4 for float32, 2 for float16).
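A minimal sketch of that estimate (the helper name `model_size_bytes` is mine, not from the thread):

```python
import torch.nn as nn

def model_size_bytes(model: nn.Module) -> int:
    # element_size() returns the per-element byte count of the dtype
    # (4 for float32, 2 for float16, etc.)
    param_bytes = sum(p.numel() * p.element_size() for p in model.parameters())
    buffer_bytes = sum(b.numel() * b.element_size() for b in model.buffers())
    return param_bytes + buffer_bytes

# Example: a float32 Linear layer with 100 weights + 10 biases
# -> 110 elements * 4 bytes = 440 bytes
print(model_size_bytes(nn.Linear(10, 10)))  # 440
```

Note this counts buffers (e.g. BatchNorm running stats) separately from parameters, since `model.parameters()` does not include them.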
However, this would not give you the “complete” memory usage, since the forward activations (intermediates) as well as the gradients also use memory. Additionally, if you are using cuDNN, note that different algorithms consume different amounts of memory for their workspace, especially if you are using
torch.backends.cudnn.benchmark = True.
I would thus recommend performing an example training step with the shapes you are planning to use and checking the memory usage, e.g. via torch.cuda.max_memory_allocated().
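One way to run such a check (a sketch; the helper name and the assumption that the model takes a single input tensor are mine):

```python
import torch
import torch.nn as nn

def peak_training_memory(model: nn.Module, input_shape, device="cuda") -> int:
    # Run one forward/backward pass with representative shapes and
    # report the peak memory allocated on the device, in bytes.
    model = model.to(device)
    torch.cuda.reset_peak_memory_stats(device)
    x = torch.randn(*input_shape, device=device)
    out = model(x)
    out.sum().backward()  # dummy loss so gradient memory is included
    return torch.cuda.max_memory_allocated(device)

if torch.cuda.is_available():
    print(peak_training_memory(nn.Linear(1024, 1024), (32, 1024)))
```

The peak will typically exceed the parameter-sum estimate, since it also captures activations, gradients, and any workspace the backend allocates.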
The problem is that the models I am referring to are usually too big to fit in memory, and the point of calculating the memory is to decide how to distribute each sub-model onto multiple GPUs. So I can’t load the entire model first to calculate the size.
Do you have any code for the first method of calculating memory?