The OOM might be expected, since the forward activations, which have to be kept in order to compute the gradients, could take up the majority of the memory, as described here.
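
To get a feel for why the activations can dominate, here is a rough back-of-the-envelope sketch. It assumes a plain fully connected network in float32 where each layer's output is kept for the backward pass; the helper names and the example sizes are hypothetical and not taken from your model, and real frameworks add further overhead (optimizer state, workspace buffers, fragmentation).

```python
def activation_bytes(batch_size, layer_output_sizes, bytes_per_element=4):
    """Rough bytes needed to keep every layer's output for the backward pass."""
    elements_per_sample = sum(layer_output_sizes)
    return batch_size * elements_per_sample * bytes_per_element


def parameter_bytes(input_size, layer_output_sizes, bytes_per_element=4):
    """Rough bytes for the weights and biases of a stack of linear layers."""
    total = 0
    fan_in = input_size
    for fan_out in layer_output_sizes:
        total += fan_in * fan_out + fan_out  # weight matrix plus bias vector
        fan_in = fan_out
    return total * bytes_per_element


if __name__ == "__main__":
    # Hypothetical sizes, chosen only for illustration.
    layers = [1024, 1024, 1024]
    acts = activation_bytes(batch_size=4096, layer_output_sizes=layers)
    params = parameter_bytes(input_size=1024, layer_output_sizes=layers)
    print(f"activations: {acts / 2**20:.1f} MiB")
    print(f"parameters:  {params / 2**20:.1f} MiB")
```

The activation term scales with the batch size while the parameter term does not, so lowering the batch size is usually the first thing to try when the forward pass runs out of memory.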