Failing to replicate model on multiple GPUs

Previously I raised issue #34941. After debugging it, I found a bug in the function `take_tensors`: https://github.com/pytorch/pytorch/blob/master/torch/csrc/utils/tensor_flatten.cpp#L10
The function loses track of a buffer in the following scenario.

Function input:
-> a list of 356 tensors, a size_limit, and fine_grained = false

Function steps:
At line https://github.com/pytorch/pytorch/blob/master/torch/csrc/utils/tensor_flatten.cpp#L55
we have 3 groups:
group 1 -> 1 element, size 0
group 2 -> 238 elements
group 3 -> 117 elements

Because group 1 has size 0, the condition at https://github.com/pytorch/pytorch/blob/master/torch/csrc/utils/tensor_flatten.cpp#L57 evaluates to true and group 1 is skipped.

At line https://github.com/pytorch/pytorch/blob/master/torch/csrc/utils/tensor_flatten.cpp#L62
the result contains only:
group 2 -> 238 elements
group 3 -> 117 elements

so only 238 + 117 = 355 tensors are carried forward, instead of 356.

This makes the internal assert fail, because 356 != 355.
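
To make the lost buffer concrete, here is a small pure-Python sketch of that grouping behaviour as I understand it (only an illustration, not the real C++ code): a group whose total byte size is 0 can still contain one real tensor, and skipping the whole group silently drops that tensor.

```python
# Simplified, pure-Python illustration of the behaviour I observed in
# take_tensors (not the actual C++ implementation). Each group is a list
# of per-tensor byte sizes; group 1 holds a single zero-element buffer,
# so its total size is 0 bytes.

groups = [
    [0],        # group 1: 1 tensor, 0 bytes total
    [4] * 238,  # group 2: 238 tensors (byte sizes here are arbitrary)
    [4] * 117,  # group 3: 117 tensors
]

results = []
for group in groups:
    # Mirrors the check at tensor_flatten.cpp#L57: a group whose total
    # size is 0 is skipped, even though it still contains a real tensor.
    if sum(group) == 0:
        continue
    results.append(group)

tensors_in = sum(len(g) for g in groups)     # 356
tensors_out = sum(len(g) for g in results)   # 355
print(tensors_in, tensors_out)  # 356 355 -> tensors.size() != order.size()
```

With this check, 356 tensors go in but only 355 come back out, which is exactly the mismatch the assert at #L74 reports.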

Here is the traceback of the bug:

"/opt/conda/envs/m2release/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__ result = self.forward(*input, **kwargs) File "/opt/conda/envs/m2release/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 154, in forward replicas = self.replicate(self.module, self.device_ids[:len(inputs)]) File "/opt/conda/envs/m2release/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 159, in replicate return replicate(module, device_ids, not torch.is_grad_enabled()) File "/opt/conda/envs/m2release/lib/python3.6/site-packages/torch/nn/parallel/replicate.py", line 102, in replicate buffer_copies_not_rg = _broadcast_coalesced_reshape(buffers_not_rg, devices, detach=True) File "/opt/conda/envs/m2release/lib/python3.6/site-packages/torch/nn/parallel/replicate.py", line 66, in _broadcast_coalesced_reshape return comm.broadcast_coalesced(tensors, devices) File "/opt/conda/envs/m2release/lib/python3.6/site-packages/torch/cuda/comm.py", line 39, in broadcast_coalesced return torch._C._broadcast_coalesced(tensors, devices, buffer_size) RuntimeError: tensors.size() == order.size() INTERNAL ASSERT FAILED at ../torch/csrc/utils/tensor_flatten.cpp:74, please report a bug to PyTorch.