I get the following warning when using an LSTM with nn.DataParallel:
RuntimeWarning: RNN module weights are not part of single contiguous chunk of memory.
This means they need to be compacted at every call, possibly greatly increasing memory usage.
To compact weights again call flatten_parameters().
I found that the warning goes away when I put self.lstm.flatten_parameters() at the top of the forward function, but I wonder why it is needed.
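For reference, here is a minimal sketch of that fix (the wrapper module and sizes are illustrative, not taken from my actual code):

```python
import torch
import torch.nn as nn

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.lstm = nn.LSTM(input_size=8, hidden_size=16,
                            num_layers=2, batch_first=True)

    def forward(self, x):
        # Re-compact the (possibly fragmented) weights before the cuDNN call.
        # On CPU the call returns early, so it is harmless there.
        self.lstm.flatten_parameters()
        out, _ = self.lstm(x)
        return out

net = Net()
y = net(torch.randn(4, 10, 8))
print(tuple(y.shape))  # (4, 10, 16)
```

When the module is wrapped in nn.DataParallel, this runs once per replica on each forward call, which is exactly when the warning would otherwise appear.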
Why are the RNN weights non-contiguous in memory when we use nn.DataParallel?
I also found that the warning goes away if we replace DataParallel with DistributedDataParallel, so why aren't the weights non-contiguous in the latter case?
I found some similar questions, but none of them had an answer.
After reading some of the related code, I think I almost get it, but I still have a few questions.
So what I understand is:

1. Every time we create a new RNN module instance, it allocates new w_ih, w_hh, b_ih, b_hh tensors and registers them as Parameters, one set per layer and direction.
2. Since those tensors are allocated independently, they are not contiguous in memory, so flatten_parameters() aggregates them: it allocates one big contiguous weight_buf and copies each weight's values into it.
3. It makes each weight tensor's internal data pointer point to weight_buf + offset.

(The real execution order in the code is 1 -> 3 -> 2.)
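The aggregation steps above can be modeled in plain Python. This is only a toy sketch of the idea, not the real ATen code; the name weight_buf mirrors the one used in the question:

```python
# Each per-layer weight starts as its own, independent allocation.
weights = {"w_ih": [1.0, 2.0], "w_hh": [3.0, 4.0], "b_ih": [5.0], "b_hh": [6.0]}

# Step 2a: allocate one contiguous buffer big enough for every weight.
total = sum(len(w) for w in weights.values())
weight_buf = [0.0] * total

# Step 3 (runs before the copy in the real code): record, for each weight,
# the offset at which it will live inside weight_buf.
offsets, cumsum = {}, 0
for name, w in weights.items():
    offsets[name] = cumsum
    cumsum += len(w)

# Step 2b: copy each weight's values into its slot; afterwards every
# "parameter" is just a view of weight_buf at (weight_buf + offset).
for name, w in weights.items():
    weight_buf[offsets[name]:offsets[name] + len(w)] = w

print(weight_buf)  # [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
print(offsets)     # {'w_ih': 0, 'w_hh': 2, 'b_ih': 4, 'b_hh': 5}
```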
But when we use nn.DataParallel, it replicates the original module (which is allocated on a single GPU) onto every GPU it uses, and the weight tensors become fragmented again, since there is no guarantee that the replicated tensors are still contiguous in memory.
Therefore we should call flatten_parameters() again every time the module is replicated to another GPU, and the best place for the call is the top of the forward function (of nn.Module), because the forward function of nn.Module on each GPU is called exactly once per call to the forward of nn.DataParallel.
Although I have never used nn.DistributedDataParallel, my guess is that it doesn't need the flatten_parameters() call because flatten_parameters() is called automatically when the new instance of the RNN module is allocated; unlike nn.DataParallel, it doesn't move the internal data around in memory afterwards, it only copies values into it.
And my questions are:

Do I understand this correctly? Is there any point I have misunderstood?
When we do step 3 of the aggregation (= make each weight tensor's internal data pointer point to weight_buf + offset), we call the get_parameters function, which:

1. calls cudnnGetRNNLinLayerMatrixParams so that matrix_pointer points to the GPU memory position of the original, un-aggregated weight tensor,
2. sets offset to the difference between matrix_pointer and the start of weight_buf,
3. makes the internal data pointer of the weight tensor point to weight_buf + offset.
Then isn't it pointing at matrix_pointer again? Why don't we replace
Tensor param = at::empty({0}, weight_buf.options()).set_(weight_buf.storage(), offset, size);
with
Tensor param = at::empty({0}, weight_buf.options()).set_(weight_buf.storage(), cumsum, size); cumsum += size;
?
Or does that function calculate the expected position of the given component relative to the given (start) data pointer?
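To make the arithmetic in the question concrete, here is a toy sketch with made-up addresses (these are illustrative integers, not real device pointers or real cuDNN values):

```python
# Hypothetical addresses, for illustration only.
weight_buf_start = 0x7F0000000000          # pretend start address of weight_buf
matrix_pointer = weight_buf_start + 512    # pretend address reported by
                                           # cudnnGetRNNLinLayerMatrixParams

# Step 2 of get_parameters: offset = matrix_pointer - start of weight_buf.
offset = matrix_pointer - weight_buf_start

# Step 3: the param view is created at weight_buf + offset ...
param_data = weight_buf_start + offset

# ... which, by construction, is matrix_pointer again -- the point
# the question is asking about.
assert param_data == matrix_pointer
print(offset)  # 512
```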
That's the conclusion I came to as well, except that I actually observe higher VRAM usage and longer loss-computation time when I put flatten_parameters() in the forward pass (and get no warning) vs. putting it in the __init__ function of the model (where I get the warning, but only when using DataParallel).