After reading some of the related code, I think I almost get it, but I still have a few questions.

So what I understand is:

Every time we create a new RNN module instance, it allocates new `w_ih`, `w_hh`, `b_ih`, `b_hh` tensors and registers them as Parameters, one set per layer and direction.
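This per-layer, per-direction registration is easy to see by listing the parameter names (a quick sketch with hypothetical sizes):

```python
import torch.nn as nn

# A 2-layer bidirectional RNN registers one w_ih/w_hh/b_ih/b_hh set
# per (layer, direction) pair: 2 layers x 2 directions x 4 tensors = 16.
rnn = nn.RNN(input_size=4, hidden_size=8, num_layers=2, bidirectional=True)

for name, p in rnn.named_parameters():
    print(name, tuple(p.shape))
# weight_ih_l0, weight_hh_l0, bias_ih_l0, bias_hh_l0,
# weight_ih_l0_reverse, ..., bias_hh_l1_reverse
```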
But it is not guaranteed that these new tensors are contiguous in GPU memory, and performance can drop due to the fragmentation. So we call the `flatten_parameters` function at the end of the constructor to aggregate all the weight tensors into one contiguous region of GPU memory.
This task is done as:

1. Allocate one big buffer tensor called `weight_buf`.
2. Copy the values of each weight tensor into `weight_buf`.
3. Make each weight tensor's internal data pointer point to `weight_buf + offset`.

(The real execution order in the code is 1 → 3 → 2.)
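The three steps above can be sketched in plain torch on CPU (a simplified illustration with made-up shapes; the real code works on GPU storage and uses cuDNN-reported offsets):

```python
import torch

# Hypothetical un-aggregated weight tensors (step 0).
weights = [torch.randn(3, 4), torch.randn(3, 3), torch.randn(3), torch.randn(3)]
total = sum(w.numel() for w in weights)

# Step 1: allocate one big flat buffer ("weight_buf").
weight_buf = torch.empty(total)

offset = 0
for w in weights:
    n, shape = w.numel(), w.shape
    # Step 2: copy the tensor's values into the buffer.
    weight_buf[offset:offset + n].copy_(w.reshape(-1))
    # Step 3: re-point the tensor's storage at weight_buf + offset.
    w.set_(weight_buf.storage(), offset, shape)
    offset += n

# Every weight now shares the single contiguous buffer.
assert all(w.storage().data_ptr() == weight_buf.storage().data_ptr()
           for w in weights)
```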
But when we use `nn.DataParallel`, it replicates the original module (which is allocated on only one GPU device) to every GPU it uses, and then the weight tensors become fragmented again, since there is no guarantee that the replicated tensors are still contiguous in memory. Therefore we should call `flatten_parameters` again every time the module is replicated to another GPU, and the best place to put the call is at the head of the `forward` function (of the `nn.Module`), because `forward` of the `nn.Module` on each GPU is called exactly once whenever `forward` of `nn.DataParallel` is called.
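Concretely, the pattern looks like this (a hypothetical wrapper module; `flatten_parameters` is a no-op on CPU, so this runs anywhere):

```python
import torch
import torch.nn as nn

class Model(nn.Module):
    def __init__(self):
        super().__init__()
        self.rnn = nn.LSTM(input_size=16, hidden_size=32, num_layers=2,
                           batch_first=True)

    def forward(self, x):
        # Re-compact the (possibly replicated) weights before every run;
        # under nn.DataParallel this executes once per replica per step.
        self.rnn.flatten_parameters()
        out, _ = self.rnn(x)
        return out

# model = nn.DataParallel(Model().cuda())  # each replica re-flattens in forward
```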
Although I have never used `nn.DistributedDataParallel`, my guess as to why it doesn't need the `flatten_parameters` call is that when it allocates a new instance of the RNN module, `flatten_parameters` is called automatically, and after that it never moves the internal data around in memory the way `nn.DataParallel` does, but only copies values into it.
And my questions are:

- Do I understand this correctly? Is there any point I have misunderstood?
- When we do step 3 of the aggregation (making each weight tensor's internal data pointer point to `weight_buf + offset`), we call the `get_parameters` function, which:
  - calls `cudnnGetRNNLinLayerMatrixParams`, so that `matrix_pointer` points to the GPU memory position of the original, un-aggregated weight tensor,
  - sets `offset` as the difference between `matrix_pointer` and the start of `weight_buf`,
  - makes the internal data pointer of the weight tensor point to `weight_buf + offset`.

  Then isn't it pointing at `matrix_pointer` again? Why don't we replace

  `Tensor param = at::empty({0}, weight_buf.options()).set_(weight_buf.storage(), offset, size);`

  with

  `Tensor param = at::empty({0}, weight_buf.options()).set_(weight_buf.storage(), cumsum, size); cumsum += size;`?

  Or does that function calculate the expected position of the given component relative to the given (start) data pointer?
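For reference, here is what that ATen `set_` line does, as a CPU analogue in Python (the `offset` and `size` values are made up for illustration): it builds a view into `weight_buf`'s storage starting at element `offset`, without copying anything.

```python
import torch

weight_buf = torch.arange(12, dtype=torch.float32)

# Python analogue of
#   at::empty({0}, weight_buf.options()).set_(weight_buf.storage(), offset, size);
# - a zero-copy view into weight_buf starting at element `offset`.
offset, size = 4, (2, 3)
param = torch.empty(0).set_(weight_buf.storage(), offset, size)

print(param)  # elements 4..9 of weight_buf, viewed as a 2x3 matrix
```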