Why do we need "flatten_parameters" when using RNN with DataParallel

I get the following warning message when I use an LSTM with nn.DataParallel.

RuntimeWarning: RNN module weights are not part of single contiguous chunk of memory.
This means they need to be compacted at every call, possibly greatly increasing memory usage.
To compact weights again call flatten_parameters().

I found that the warning goes away when I put self.lstm.flatten_parameters() at the top of the forward function, but I wonder why we need it.

Why are the RNN weights non-contiguous in memory when we use nn.DataParallel?

I also found that the warning disappears if we replace DataParallel with DistributedDataParallel — so why aren't the weights non-contiguous in the latter case?

I found some similar questions, but none of them had an answer.

After reading some related code, I think I almost get it, but I still have a few questions.
What I understand is:

Every time we create a new RNN module instance, it allocates new w_ih, w_hh, b_ih, b_hh tensors and registers them as Parameters for each layer and direction.

But the new tensors are not guaranteed to be contiguous in GPU memory, and performance can drop due to the fragmentation. So the flatten_parameters function is called at the end of the constructor to aggregate all the weight tensors into one contiguous region of GPU memory.

This is done as follows:

  1. Allocate one big buffer tensor called weight_buf
  2. Copy the values of each weight tensor into weight_buf
  3. Point each weight tensor’s internal data pointer at weight_buf + offset

(In the real code, the execution order is 1 → 3 → 2.)
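The three steps above can be sketched in plain NumPy, as a CPU stand-in for the real CUDA-side code (the shapes and the 1 → 2 → 3 order here are illustrative only):

```python
import numpy as np

# Stand-ins for w_ih, w_hh, b_ih, b_hh of one layer (shapes are made up).
weights = [np.random.randn(4, 3) for _ in range(4)]
sizes = [w.size for w in weights]

# Step 1: allocate one big buffer.
weight_buf = np.empty(sum(sizes))

# Steps 2 and 3: copy each weight in and re-point the "parameter" at
# a view into weight_buf + offset.
offset = 0
views = []
for w, size in zip(weights, sizes):
    view = weight_buf[offset:offset + size].reshape(w.shape)  # buf + offset
    view[...] = w                                             # copy values in
    views.append(view)
    offset += size

# Each view now aliases weight_buf: mutating the buffer mutates the "parameter".
weight_buf[0] = 42.0
assert views[0].flat[0] == 42.0
```

After this, the four "parameters" are just offsets into one contiguous allocation, which is exactly what cuDNN wants to see.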

But when we use nn.DataParallel, it replicates the original module (which is allocated on a single GPU device) to every GPU it uses, and the weight tensors become fragmented again, since there is no guarantee that the replicated tensors are still contiguous in memory.

Therefore we should call flatten_parameters again every time the module is replicated to another GPU, and the best place to put the call is at the head of the forward function (of the nn.Module), because the forward function of the nn.Module on each GPU is called exactly once per call to the forward of nn.DataParallel.
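A minimal sketch of that placement (the Model class and its sizes are made up for illustration; on CPU, flatten_parameters is effectively a no-op, and the warning itself only fires on CUDA):

```python
import torch
import torch.nn as nn

class Model(nn.Module):
    def __init__(self, in_dim=8, hidden=16):
        super().__init__()
        self.lstm = nn.LSTM(in_dim, hidden, batch_first=True)

    def forward(self, x):
        # Re-compact the weights on whichever replica is running this call,
        # so the cuDNN kernels see one contiguous chunk.
        self.lstm.flatten_parameters()
        out, _ = self.lstm(x)
        return out

model = Model()               # would be wrapped as nn.DataParallel(model) on multi-GPU
y = model(torch.randn(2, 5, 8))
```

Because DataParallel re-replicates the module on every forward call, a one-time call in __init__ is not enough: only a call inside forward runs on each fresh replica.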

Although I have never used nn.DistributedDataParallel, my guess as to why it doesn’t need the flatten_parameters call is that when it allocates a new instance of the RNN module, flatten_parameters is called automatically, and afterwards — unlike nn.DataParallel — it never moves the internal data around in memory, but only copies values into it.

My questions are:

  1. Is my understanding correct? Have I misunderstood anything?

  2. When we do step 3 of the aggregation (= point each weight tensor’s internal data pointer at weight_buf + offset), we call the get_parameters function, which

    • calls cudnnGetRNNLinLayerMatrixParams so that matrix_pointer points at the GPU memory position of the original, un-aggregated weight tensor,
    • sets offset to the difference between matrix_pointer and the start of weight_buf,
    • points the internal data pointer of the weight tensor at weight_buf + offset.

    Then isn’t it pointing at matrix_pointer again? Why don’t we replace

    Tensor param = at::empty({0}, weight_buf.options()).set_(weight_buf.storage(), offset, size);

    with

    Tensor param = at::empty({0}, weight_buf.options()).set_(weight_buf.storage(), cumsum, size); cumsum += size;

    Or does that function calculate the expected position of the given component with respect to the given (start) data pointer?
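For what it’s worth, the pointer arithmetic in question 2 can be mimicked on the CPU. This is a rough NumPy sketch (not the real cuDNN code), under the assumption that the per-matrix pointer is queried against weight_buf itself after the copy, so it already lies inside the buffer and the subtraction yields the matrix’s offset from the buffer start:

```python
import numpy as np

# One contiguous weight buffer and a view standing in for the matrix
# that cudnnGetRNNLinLayerMatrixParams would report a pointer for.
weight_buf = np.arange(24, dtype=np.float32)
matrix = weight_buf[6:18]   # hypothetical layer matrix living inside the buffer

# Raw data addresses, playing the role of weight_buf's base and matrix_pointer.
base_ptr = weight_buf.__array_interface__['data'][0]
matrix_pointer = matrix.__array_interface__['data'][0]

# offset = (matrix_pointer - start of weight_buf), in elements.
offset = (matrix_pointer - base_ptr) // weight_buf.itemsize
assert offset == 6
```

Under that assumption the two pointers differ only by the matrix’s position inside weight_buf, so storing offset and re-deriving weight_buf + offset is consistent rather than circular — but whether the real code matches this reading is exactly what the question asks.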

That’s the conclusion I came to as well, except that I actually observe higher VRAM usage and a longer loss-computation time when I put flatten_parameters in the forward pass (and I get no warning) vs. putting it in the __init__ function of the model (in which case I get the warning only when using DataParallel).