In the PyTorch tutorials, I found this article.
It says that if you have a dataset with variable-length inputs, pre-allocating memory for the maximum input length can help avoid OOM errors.
Memory can be pre-allocated with the following steps (my attempt at putting them into code is sketched after the list):
1. Generate a (usually random) batch of inputs with the maximum sequence length (either corresponding to the max length in the training dataset or to some predefined threshold).
2. Execute a forward and a backward pass with the generated batch, but do not step the optimizer or the learning rate scheduler. This pre-allocates buffers of maximum size, which can be reused in subsequent training iterations.
3. Zero out the gradients.
4. Proceed to regular training.
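For reference, here is how I understood steps 1-3 in code. This is just my own minimal sketch; the model, the loss function, and all the shape constants are placeholders I made up, not anything from the tutorial:

```python
import torch
import torch.nn as nn

def preallocate(model: nn.Module, loss_fn, device: torch.device,
                batch_size: int = 32, max_seq_len: int = 512,
                num_classes: int = 10) -> None:
    # Step 1: a random batch shaped for the maximum sequence length.
    dummy_input = torch.randn(batch_size, max_seq_len, device=device)
    dummy_target = torch.randint(0, num_classes, (batch_size,), device=device)

    # Step 2: forward + backward so the caching allocator reserves
    # maximum-size buffers; deliberately no optimizer.step() or
    # scheduler.step() here.
    loss = loss_fn(model(dummy_input), dummy_target)
    loss.backward()

    # Step 3: zero out the gradients left by the dummy pass
    # (set_to_none=False keeps the gradient buffers allocated).
    model.zero_grad(set_to_none=False)

# Hypothetical usage before the training loop:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = nn.Linear(512, 10).to(device)  # stand-in for the real model
preallocate(model, nn.CrossEntropyLoss(), device)
# ... regular training loop would start here ...
```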
My question is: should I do steps 1-3 at the beginning of every iteration, or only once at the beginning of the training code?
Also, is it related to this fragmentation problem?