Very high forward/backward pass size

Besides the parameters, the forward activations are also stored during training so that the gradients can be computed in the backward pass.
This post explains activation memory in more detail with an example.
Based on the posted summary, I wouldn’t expect to see ~15GB of memory usage, but I also don’t know how the summary computes its numbers, in particular the forward-activation memory usage.
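As a rough sanity check, you could estimate the activation memory yourself with forward hooks that sum the size of each module's output (these are the tensors kept alive for the backward pass). A minimal sketch with a made-up model (the layer sizes and batch size are illustrative, not taken from your setup):

```python
import torch
import torch.nn as nn

# Illustrative model; replace with your own.
model = nn.Sequential(
    nn.Linear(1024, 4096),
    nn.ReLU(),
    nn.Linear(4096, 4096),
    nn.ReLU(),
    nn.Linear(4096, 10),
)

activation_bytes = 0

def hook(module, inputs, output):
    global activation_bytes
    # Each module output is stored for the backward pass.
    activation_bytes += output.numel() * output.element_size()

handles = [m.register_forward_hook(hook) for m in model]

x = torch.randn(64, 1024)  # batch size 64
out = model(x)

param_bytes = sum(p.numel() * p.element_size() for p in model.parameters())
print(f"parameters : {param_bytes / 1024**2:.2f} MiB")
print(f"activations: {activation_bytes / 1024**2:.2f} MiB")

for h in handles:
    h.remove()
```

Note that the activation memory scales with the batch size while the parameter memory does not, which is often where unexpectedly large numbers come from. This is only an approximation: intermediate tensors created inside a module, `autograd` bookkeeping, and the CUDA caching allocator are not captured by it.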