Why do we need "flatten_parameters" when using RNN with DataParallel

After reading some of the related code, I think I almost get it, but I still have a few questions.
So here is what I understand:


Every time we create a new RNN module instance, it allocates new w_ih, w_hh, b_ih, b_hh tensors and registers them as Parameters, one set per layer and direction.
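For example, listing the registered parameters of an LSTM (a quick sketch; the sizes are chosen arbitrarily):

    import torch.nn as nn

    rnn = nn.LSTM(input_size=8, hidden_size=16, num_layers=2, bidirectional=True)
    # One (w_ih, w_hh, b_ih, b_hh) group is registered per layer and direction:
    for name, p in rnn.named_parameters():
        print(name, tuple(p.shape))
    # weight_ih_l0 (64, 8)     <- 4 gates x hidden_size rows
    # weight_hh_l0 (64, 16)
    # bias_ih_l0 (64,)
    # bias_hh_l0 (64,)
    # weight_ih_l0_reverse (64, 8)
    # ...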

But it is not guaranteed that these new tensors are contiguous in GPU memory, and performance can drop due to the fragmentation. So flatten_parameters is called at the end of the constructor to aggregate all the weight tensors into one contiguous block of GPU memory.
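A quick way to check this (a sketch, assuming a CUDA build with cuDNN; untyped_storage() requires a recent PyTorch version):

    import torch
    import torch.nn as nn

    rnn = nn.LSTM(input_size=8, hidden_size=16).cuda()
    rnn.flatten_parameters()  # normally already triggered by the constructor / .cuda()

    # After flattening, every parameter is a view into one flat buffer,
    # so they all share the same underlying storage.
    base_ptrs = {p.untyped_storage().data_ptr() for p in rnn.parameters()}
    assert len(base_ptrs) == 1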

This is done as follows:

  1. Allocate one big buffer tensor called weight_buf
  2. Copy the values of each weight tensor into weight_buf
  3. Re-point each weight tensor’s internal data pointer at weight_buf + offset

(In the actual code the execution order is 1 -> 3 -> 2.)
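Here is a minimal CPU-only sketch of those three steps with plain tensors (illustrative only; the real cuDNN path uses cuDNN's own layout and, as noted, runs the steps in 1 -> 3 -> 2 order):

    import torch

    w_ih = torch.randn(4, 3)
    w_hh = torch.randn(4, 4)
    params = [w_ih, w_hh]

    # 1. Allocate one big buffer tensor, weight_buf.
    weight_buf = torch.empty(sum(p.numel() for p in params))

    offset = 0
    for p in params:
        n = p.numel()
        # 2. Copy the values of the weight tensor into weight_buf.
        weight_buf[offset:offset + n].copy_(p.reshape(-1))
        # 3. Re-point the tensor's data at weight_buf + offset.
        p.set_(weight_buf.storage(), offset, p.size())
        offset += n

    # Both weights now live contiguously inside weight_buf.
    assert w_ih.data_ptr() == weight_buf.data_ptr()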

But when we use nn.DataParallel, it replicates the original module (which lives on only one GPU) onto every GPU it uses, and then the weight tensors are fragmented again, since there is no guarantee that the replicated tensors are still contiguous in memory.

Therefore we should call flatten_parameters again every time the module is replicated to another GPU, and the best place for the call is at the top of the module's forward function, because each replica's forward is called exactly once per call to nn.DataParallel's forward.
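In code, the pattern looks like this (a sketch; the module and sizes are made up):

    import torch
    import torch.nn as nn

    class Encoder(nn.Module):
        def __init__(self):
            super().__init__()
            self.rnn = nn.LSTM(input_size=32, hidden_size=64, batch_first=True)

        def forward(self, x):
            # Each replica created by nn.DataParallel has non-contiguous
            # weights, so re-flatten before the cuDNN kernel runs.
            self.rnn.flatten_parameters()
            return self.rnn(x)

    # Hypothetical usage, assuming at least two CUDA devices:
    # model = nn.DataParallel(Encoder().cuda())
    # out, (h, c) = model(torch.randn(8, 10, 32, device="cuda"))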

Although I have never used nn.DistributedDataParallel, my guess as to why it doesn't need the flatten_parameters call is this: when it allocates a new instance of the RNN module, flatten_parameters is called automatically, and afterwards, unlike nn.DataParallel, it never moves the underlying data around in memory; it only copies new values into it.


And my questions are:

  1. Do I understand this right? Is there anything I have misunderstood?

  2. When we do step 3 of the aggregation (re-pointing each weight tensor’s internal data pointer at weight_buf + offset), we call the get_parameters function, which

    • calls cudnnGetRNNLinLayerMatrixParams so that matrix_pointer indicates the GPU memory position of the original, un-aggregated weight tensor,
    • sets offset to the difference between matrix_pointer and the start of weight_buf,
    • re-points the weight tensor’s internal data pointer at weight_buf + offset.

    Then isn’t the new data pointer just matrix_pointer again? Why don’t we replace

    Tensor param = at::empty({0}, weight_buf.options()).set_(weight_buf.storage(), offset, size);
    with

    Tensor param = at::empty({0}, weight_buf.options()).set_(weight_buf.storage(), cumsum, size); cumsum += size;
    ?
    Or does that function calculate the expected position of a given component relative to the given (start) data pointer?
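    To restate the question with plain tensors: the pointer-difference offset (what the code does) and a running cumulative sum (the proposed replacement) coincide only if the components are packed back-to-back, in order, with no padding. A hypothetical sketch, not the actual ATen code:

        import torch

        weight_buf = torch.arange(12.)
        sizes = [(2, 3), (3, 2)]
        elem = weight_buf.element_size()

        # Pretend these are the positions cuDNN reports for each component.
        matrix_pointers = [weight_buf.data_ptr() + 0 * elem,
                           weight_buf.data_ptr() + 6 * elem]

        # Variant A: offset computed as a pointer difference (current code).
        params_a = []
        for ptr, size in zip(matrix_pointers, sizes):
            offset = (ptr - weight_buf.data_ptr()) // elem
            params_a.append(torch.empty(0).set_(weight_buf.storage(), offset, size))

        # Variant B: offset as a running cumulative sum (proposed replacement).
        params_b, cumsum = [], 0
        for size in sizes:
            params_b.append(torch.empty(0).set_(weight_buf.storage(), cumsum, size))
            cumsum += torch.Size(size).numel()

        # Equal exactly when cuDNN packs components densely and in order,
        # which is what the question hinges on.
        assert all(torch.equal(a, b) for a, b in zip(params_a, params_b))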
