I get the following warning when using an LSTM with nn.DataParallel:
RuntimeWarning: RNN module weights are not part of single contiguous chunk of memory.
This means they need to be compacted at every call, possibly greatly increasing memory usage.
To compact weights again call flatten_parameters().
I found that the warning goes away when I put self.lstm.flatten_parameters() at the top of the forward function, but I wonder why it is needed.
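For reference, here is a minimal sketch of that fix (the wrapper module and sizes are illustrative, not taken from my actual code):

```python
import torch
import torch.nn as nn

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.lstm = nn.LSTM(input_size=8, hidden_size=16,
                            num_layers=2, batch_first=True)

    def forward(self, x):
        # Re-compact the (possibly fragmented) weights before the cuDNN call.
        # On CPU the call returns early, so it is harmless there.
        self.lstm.flatten_parameters()
        out, _ = self.lstm(x)
        return out

net = Net()
y = net(torch.randn(4, 10, 8))
print(tuple(y.shape))  # (4, 10, 16)
```

When the module is wrapped in nn.DataParallel, this runs once per replica on each forward call, which is exactly when the warning would otherwise appear.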
Why are the RNN weights non-contiguous in memory when we use nn.DataParallel?
I also found that the warning goes away if we replace DataParallel with DistributedDataParallel, so why aren't the weights non-contiguous in the latter case?
I found some similar questions, but none of them had an answer.
After reading some of the related code, I think I almost get it, but I still have a few questions.
So what I understand is:

1. Every time we create a new RNN module instance, it allocates new w_ih, w_hh, b_ih, b_hh tensors and registers them as Parameters, one set per layer and direction.
2. Since those tensors are allocated independently, they are not contiguous in memory, so flatten_parameters() aggregates them: it allocates one big contiguous weight_buf and copies each weight's values into it.
3. It makes each weight tensor's internal data pointer point to weight_buf + offset.

(The real execution order in the code is 1 -> 3 -> 2.)
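The aggregation steps above can be modeled in plain Python. This is only a toy sketch of the idea, not the real ATen code; the name weight_buf mirrors the one used in the question:

```python
# Each per-layer weight starts as its own, independent allocation.
weights = {"w_ih": [1.0, 2.0], "w_hh": [3.0, 4.0], "b_ih": [5.0], "b_hh": [6.0]}

# Step 2a: allocate one contiguous buffer big enough for every weight.
total = sum(len(w) for w in weights.values())
weight_buf = [0.0] * total

# Step 3 (runs before the copy in the real code): record, for each weight,
# the offset at which it will live inside weight_buf.
offsets, cumsum = {}, 0
for name, w in weights.items():
    offsets[name] = cumsum
    cumsum += len(w)

# Step 2b: copy each weight's values into its slot; afterwards every
# "parameter" is just a view of weight_buf at (weight_buf + offset).
for name, w in weights.items():
    weight_buf[offsets[name]:offsets[name] + len(w)] = w

print(weight_buf)  # [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
print(offsets)     # {'w_ih': 0, 'w_hh': 2, 'b_ih': 4, 'b_hh': 5}
```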
But when we use nn.DataParallel, it replicates the original module (which is allocated on a single GPU) onto every GPU it uses, and the weight tensors become fragmented again, since there is no guarantee that the replicated tensors are still contiguous in memory.
Therefore we should call flatten_parameters() again every time the module is replicated to another GPU, and the best place for the call is the top of the forward function (of nn.Module), because the forward function of nn.Module on each GPU is called exactly once per call to the forward of nn.DataParallel.
Although I have never used nn.DistributedDataParallel, my guess is that it doesn't need the flatten_parameters() call because flatten_parameters() is called automatically when the new instance of the RNN module is allocated; unlike nn.DataParallel, it doesn't move the internal data around in memory afterwards, it only copies values into it.
And my questions are:

Do I understand this correctly? Is there any point I have misunderstood?
When we do step 3 of the aggregation (= make each weight tensor's internal data pointer point to weight_buf + offset), we call the get_parameters function, which:

1. calls cudnnGetRNNLinLayerMatrixParams so that matrix_pointer points to the GPU memory position of the original, un-aggregated weight tensor,
2. sets offset to the difference between matrix_pointer and the start of weight_buf,
3. makes the internal data pointer of the weight tensor point to weight_buf + offset.
Then isn't it pointing at matrix_pointer again? Why don't we replace
Tensor param = at::empty({0}, weight_buf.options()).set_(weight_buf.storage(), offset, size);
with
Tensor param = at::empty({0}, weight_buf.options()).set_(weight_buf.storage(), cumsum, size); cumsum += size;
?
Or does that function calculate the expected position of the given component relative to the given (start) data pointer?
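To make the arithmetic in the question concrete, here is a toy sketch with made-up addresses (these are illustrative integers, not real device pointers or real cuDNN values):

```python
# Hypothetical addresses, for illustration only.
weight_buf_start = 0x7F0000000000          # pretend start address of weight_buf
matrix_pointer = weight_buf_start + 512    # pretend address reported by
                                           # cudnnGetRNNLinLayerMatrixParams

# Step 2 of get_parameters: offset = matrix_pointer - start of weight_buf.
offset = matrix_pointer - weight_buf_start

# Step 3: the param view is created at weight_buf + offset ...
param_data = weight_buf_start + offset

# ... which, by construction, is matrix_pointer again -- the point
# the question is asking about.
assert param_data == matrix_pointer
print(offset)  # 512
```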
That's the conclusion I came to as well, except that I actually observe higher VRAM usage and longer loss-computation time when I put flatten_parameters() in the forward pass (and get no warning) vs. putting it in the __init__ function of the model (where I get the warning, but only when using DataParallel).