Params stored in GPU during training

During training on GPUs, what exactly takes up GPU memory? From my understanding, these are the main consumers:

  1. The model weights (and bias terms) which need to be loaded into memory.
  2. The input batch itself, of shape (batch size x input_seq_length).
  3. The activations stored during the forward pass, which are kept for the backward pass.
  4. The gradients computed during the backward pass.
  5. The optimizer states, assuming Adam (a first moment and a second moment for every weight param).

Did I miss anything? Please let me know if something here is wrong or missing. I would like to understand what is consuming GPU memory.

Your summary looks correct. Besides that, some operations also create intermediate tensors on the device in the backend, which (temporarily) use more memory. Some libraries also use workspaces (e.g. cuDNN, cuBLAS), and others need to initialize buffers (e.g. NCCL), so a small overhead can come from these as well.
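
As a rough illustration of items 1, 4 and 5 from the list, here is a minimal back-of-the-envelope sketch in plain Python. The function name `estimate_training_memory_bytes` and its arguments are made up for this example. It assumes that every parameter is trainable, that weights, gradients and both Adam moments are all stored in the same dtype (fp32, 4 bytes, by default), and it deliberately leaves out inputs, activations, temporary tensors and library workspaces, since those depend on the model and batch size.

```python
def estimate_training_memory_bytes(num_params, bytes_per_param=4, adam=True):
    """Rough estimate of the parameter-related memory during training.

    Assumptions (not measured values): all parameters are trainable, and
    weights, gradients and optimizer states share the same dtype.
    Inputs, activations and library workspaces are NOT included.
    """
    weights = num_params * bytes_per_param
    gradients = num_params * bytes_per_param  # one gradient per weight
    # Adam keeps a first and a second moment per weight.
    optimizer_states = 2 * num_params * bytes_per_param if adam else 0
    return {
        "weights": weights,
        "gradients": gradients,
        "optimizer_states": optimizer_states,
        "total": weights + gradients + optimizer_states,
    }


# Hypothetical model with 1 million fp32 parameters.
estimate = estimate_training_memory_bytes(1_000_000)
for name, num_bytes in estimate.items():
    print(f"{name}: {num_bytes / 1024**2:.2f} MiB")
```

Under these assumptions the parameter-related part comes to about 4x the size of the weights alone. To see what is actually used, including activations and temporaries, you could compare this against `torch.cuda.memory_allocated()` and `torch.cuda.max_memory_allocated()` if you are using PyTorch.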

