Torch.scatter_add seems not to backprop gradients

Dear pytorch gurus,
I’m building a neural paraphrasing model and
trying to implement copy mechanism, which uses some tokens in the input sequence as they are when producing output tokens.

Easily put, copy mechanism produces a probability distribution over input tokens. After getting the probability, I try to add this to the probability distribution generated by GRU decoder. In the process, I need to scatter copy distribution into a vector of vocab size, where I used torch.scatter_add.

After that I compute mixture of two probability distribution and compute the loss afterward. The problem is, that loss doesn’t decrease after a few epochs. I doubt the gradient doesn’t flow over this operation. Can anyone suggest an alternative process for this implementation?

Thanks deeply in advance.

scatter_add_ should correctly propagate the gradients as seen in this example:

x = torch.rand(2, 5, requires_grad=True)
out = torch.ones(3, 5).scatter_add_(0, torch.tensor([[0, 1, 2, 0, 0], [2, 0, 0, 1, 2]]), x)
> tensor([[1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1.]])

Could you check the .grad attribute after calling .backward() for some parameters you are concerned about and make sure they have valid gradients?