torch.bool + nn.DataParallel + >1 GPU = RuntimeError: Unconvertible NCCL type

Python 3.6.9
pytorch 1.2.0
I’m trying to implement a modified Conv2d (long story), so I subclassed it. I also want to save some additional values (discrete, rarely changing) in the state_dict, so I added extra nn.Parameter(…, requires_grad=False) attributes. Code stub:

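import torch
import torch.nn as nn
import torch.nn.functional as F

class test(nn.Conv2d):
    def __init__(self, **kwargs):
        super(test, self).__init__(**kwargs)
        # The attribute name was lost from the original post; `flag` is a placeholder.
        self.flag = nn.Parameter(torch.tensor(False), requires_grad=False)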
    def forward(self, x):
        return F.conv2d(x, self.weight, self.bias, self.stride, self.padding, self.dilation, self.groups)

net = nn.Sequential(test(in_channels=3, out_channels=8, kernel_size=224), nn.Flatten())
net = nn.DataParallel(net)

The forward pass works fine. On the backward pass, however, with more than one GPU I get:

  File "/workspace/pytorch_mdk/", line 86, in train
  File "/opt/conda/lib/python3.6/site-packages/torch/", line 118, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/opt/conda/lib/python3.6/site-packages/torch/autograd/", line 93, in backward
    allow_unreachable=True)  # allow_unreachable flag
  File "/opt/conda/lib/python3.6/site-packages/torch/autograd/", line 77, in apply
    return self._forward_cls.backward(self, *args)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/parallel/", line 32, in backward
    return (None,) + ReduceAddCoalesced.apply(ctx.input_device, ctx.num_inputs, *grad_outputs)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/parallel/", line 43, in forward
    return comm.reduce_add_coalesced(grads, destination)
  File "/opt/conda/lib/python3.6/site-packages/torch/cuda/", line 121, in reduce_add_coalesced
    flat_result = reduce_add(flat_tensors, destination)
  File "/opt/conda/lib/python3.6/site-packages/torch/cuda/", line 77, in reduce_add
    nccl.reduce(inputs, outputs, root=nccl_root)
  File "/opt/conda/lib/python3.6/site-packages/torch/cuda/", line 60, in reduce
    torch._C._nccl_reduce(inputs, outputs, root, op, streams, comms)
RuntimeError: Unconvertible NCCL type

After adding some print statements to torch/cuda/, I can see that it is given a list of tensors such as tensor([False], device='cuda:0'), tensor([False], device='cuda:1'), etc. The C code it calls apparently does not support torch.bool, so it raises the RuntimeError above.

Is this a bug? Am I doing it wrong?
In the short term, I’ll probably have to change bool parameters to torch.uint8 or something.
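
A rough sketch of what that short-term workaround might look like. The helper names `make_flag_param` and `read_flag` are made up for illustration, and I have not verified on multiple GPUs that a uint8 parameter actually gets past the NCCL reduce, so treat that part as an assumption:

```python
import torch
import torch.nn as nn


def make_flag_param(value: bool) -> nn.Parameter:
    # Hypothetical helper: store the boolean as a 0/1 uint8 scalar.
    # requires_grad=False is required, since integer tensors cannot carry gradients.
    return nn.Parameter(torch.tensor(int(value), dtype=torch.uint8),
                        requires_grad=False)


def read_flag(param: nn.Parameter) -> bool:
    # Hypothetical helper: convert the stored 0/1 value back to a Python bool.
    return bool(param.item())


flag = make_flag_param(False)
print(flag.dtype, read_flag(flag))
```

The downside is that every read of the flag needs a conversion back to bool, which is easy to forget.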

For my use case, I think the correct answer is to use self.register_buffer instead of nn.Parameter.
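
A minimal sketch of the buffer version, assuming a hypothetical class name `BufferedConv2d` and buffer name `flag`. Buffers are saved in the state_dict but are not returned by parameters(), so they should not take part in the gradient reduction; I have only checked the single-device behaviour shown here:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class BufferedConv2d(nn.Conv2d):
    """Conv2d that keeps a discrete flag as a buffer rather than a parameter."""

    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        # `flag` is a hypothetical name. A registered buffer is persisted in
        # state_dict but is not listed among the module's parameters.
        self.register_buffer("flag", torch.tensor(False))

    def forward(self, x):
        return F.conv2d(x, self.weight, self.bias, self.stride,
                        self.padding, self.dilation, self.groups)


layer = BufferedConv2d(in_channels=3, out_channels=8, kernel_size=3)
print("flag" in layer.state_dict())              # buffer is saved with the model
print("flag" in dict(layer.named_parameters()))  # but it is not a parameter
```

The flag can still be updated in place (for example with `layer.flag.fill_(True)`), and it moves with the module on `.to(device)` or `.cuda()`.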

As for the title, I don’t know whether that combination simply has no use cases (and hence does not need to be supported) or whether it is a bug. IMO, it should at least be documented with a warning of some sort.