I’m training a Transformer language model with a vocabulary of 5,000 tokens on a single M60 GPU (with about 7.5 GB of actually usable memory).
The number of tokens per batch is about 8,000, and the hidden dimension fed into the softmax layer is 512. In other words, the input to nn.Linear(256, 5000) has size [256, 32, 256]. So, if I understand correctly, this fully-connected layer should theoretically consume 5000 x 8000 x 512 x 4 = 81.92 GB of GPU memory in the forward pass (the 4 is the byte size of a float32). But the GPU ran the forward and backward passes without any problem, and the reported total GPU memory usage is less than 7 GB.
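A quick sanity check with plain arithmetic (no GPU needed) shows why the 81.92 GB figure doesn't apply: a linear layer never materializes a tokens x out_features x in_features tensor. It stores the weight matrix once, shared across all tokens, and the output activations once. The sketch below uses the shapes from the post (note the post mixes 512 and 256 for the hidden size; I take 256 from the stated input shape [256, 32, 256]):

```python
# Shapes from the post: input [256, 32, 256] into nn.Linear(256, 5000)
tokens = 256 * 32            # 8192 tokens in the batch
in_features = 256
out_features = 5000
bytes_per_float = 4          # float32

# What a forward pass actually allocates (roughly):
weight = in_features * out_features * bytes_per_float   # shared across all tokens
bias = out_features * bytes_per_float
output = tokens * out_features * bytes_per_float        # activations kept for backward

print(f"weight: {weight / 2**20:.1f} MiB")              # ~4.9 MiB
print(f"output: {output / 2**20:.1f} MiB")              # ~156 MiB

# The naive tokens * out_features * in_features product is never allocated:
naive = tokens * out_features * in_features * bytes_per_float
print(f"naive (never allocated): {naive / 2**30:.1f} GiB")   # ~39 GiB
```

The per-token multiply-accumulate work in the matmul is of that naive order, but it happens in registers and caches, not in allocated GPU tensors.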

Memory for network parameters:
(256*5000 + 5000) * 4 * 2 = 10 Mbytes, where the factor of 2 is because the network stores one tensor for the weights and one for their gradients, and the extra 5000 accounts for the biases.

Memory for data (the activations and their gradients, hence the same factor of 2):
8192 * 512 * 4 * 2 = 32 Mbytes

So by this rough calculation, the softmax layer consumes roughly 42 Mbytes.
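For what it's worth, the two figures above can be reproduced with exact byte counts (the answer rounds to megabytes):

```python
# Parameter memory: weights + biases, times 2 for values and gradients
param_bytes = (256 * 5000 + 5000) * 4 * 2
# Data memory: 8192 tokens x hidden size 512, times 2 as well
data_bytes = 8192 * 512 * 4 * 2

print(param_bytes / 2**20)   # ~9.8 MiB (the "10 Mbytes" above)
print(data_bytes / 2**20)    # exactly 32.0 MiB
```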

Hi, what I understood from your answer is that the parameters (weights and biases) are stored twice in PyTorch, as in your "Memory for network parameters" calculation. However, I didn’t quite get the "Memory for data" part. Shouldn’t the total calculation for a generic network in PyTorch be something like the following, so that it covers both the output features/activations at each layer and the network parameters, each of which is stored twice?

total_gpu_usage = 2 * batch_size * (input_data_size +
                  sum(feature_size_at_each_network_layer)) + 2 * parameter_size
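As a sketch, that formula could be coded up for a plain stack of fully connected layers (estimate_gpu_bytes is a hypothetical helper, not part of PyTorch; it ignores optimizer state, temporary buffers, and the CUDA context itself, which is part of why real usage is higher):

```python
def estimate_gpu_bytes(batch_size, input_size, layer_out_sizes, dtype_bytes=4):
    # 2 * batch * (input + sum of per-layer output features): activations,
    # stored once for values and once for gradients
    feature_bytes = batch_size * (input_size + sum(layer_out_sizes)) * dtype_bytes
    # Weights (in*out) and biases (out) for each fully connected layer
    sizes = [input_size] + list(layer_out_sizes)
    param_count = sum(i * o + o for i, o in zip(sizes, sizes[1:]))
    return 2 * feature_bytes + 2 * param_count * dtype_bytes

# Example with the softmax layer discussed above (8192 tokens, 512 -> 5000)
print(estimate_gpu_bytes(8192, 512, [5000]) / 2**20)  # a few hundred MiB
```

In practice a measurement with torch.cuda.memory_allocated() before and after the forward pass is a more reliable check than any formula like this.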