index_put_ really slow on GPU

@ptrblck Same issue here. In my case, index_put_ on CUDA is much slower than on the CPU when computing a confusion matrix.

Here is the code snippet:

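    def add_batch(self, x, y):
        # Flatten both index tensors (x -> row indices, y -> column indices).
        x_row = x.reshape(-1)
        y_row = y.reshape(-1)

        # Shape (2, N): one (row, col) pair per element.
        idxs = torch.stack([x_row, y_row], dim=0)

        # Cache the tensor of ones and rebuild it only when the batch size changes.
        if self.ones is None or self.last_scan_size != idxs.shape[-1]:
            self.ones = torch.ones(idxs.shape[-1], device=x.device, dtype=torch.long)
            self.last_scan_size = idxs.shape[-1]

        # index_put_ works in place; accumulate=True sums repeated (row, col) pairs.
        self.conf_matrix.index_put_(tuple(idxs), self.ones, accumulate=True)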
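
To make the update reproducible outside the class, here is a minimal standalone sketch of the same accumulation. The function name `confusion_index_put` and the `num_classes` parameter are placeholders invented for illustration; the real class keeps `conf_matrix`, `ones`, and `last_scan_size` as attributes.

```python
import torch


def confusion_index_put(x, y, num_classes):
    """Count (x, y) pairs into a num_classes x num_classes matrix.

    Hypothetical helper: mirrors the add_batch logic without the class state.
    """
    conf_matrix = torch.zeros(
        (num_classes, num_classes), dtype=torch.long, device=x.device
    )
    # Shape (2, N): row indices from x, column indices from y.
    idxs = torch.stack([x.reshape(-1), y.reshape(-1)], dim=0)
    ones = torch.ones(idxs.shape[-1], dtype=torch.long, device=x.device)
    # accumulate=True adds the contributions of repeated (row, col) pairs
    # instead of overwriting them.
    conf_matrix.index_put_(tuple(idxs), ones, accumulate=True)
    return conf_matrix


x = torch.tensor([0, 1, 1, 2, 2, 2])
y = torch.tensor([0, 1, 2, 2, 2, 0])
print(confusion_index_put(x, y, num_classes=3))
# rows are x values, columns are y values:
# [[1, 0, 0],
#  [0, 1, 1],
#  [1, 0, 2]]
```

The pair (2, 2) appears twice in the toy input, which is why that cell ends up as 2; without `accumulate=True` the duplicates would not be summed.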

On the CPU, it takes around 5 minutes [02:14<03:05, 18.62it/s], while on CUDA it is about 12x slower [00:16<1:05:09, 1.53it/s].

Since transferring the tensors from CUDA to the CPU adds extra cost, any ideas for solving this problem would be very helpful for my project.
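
One possible direction that keeps everything on the device is to flatten each (row, col) pair into a single index and count occurrences with `torch.bincount`. The sketch below assumes all indices are non-negative and smaller than `num_classes`; the name `confusion_bincount` and the `num_classes` parameter are hypothetical, and whether this is actually faster on CUDA for a given workload would need to be measured.

```python
import torch


def confusion_bincount(x, y, num_classes):
    """Count (x, y) pairs via bincount on flattened indices.

    Hypothetical helper; assumes 0 <= x, y < num_classes and integer dtype.
    """
    x_row = x.reshape(-1)
    y_row = y.reshape(-1)
    # Map each (row, col) pair to a unique position in a flat matrix.
    flat = x_row * num_classes + y_row
    # minlength pads the result so that it always has num_classes**2 bins.
    counts = torch.bincount(flat, minlength=num_classes * num_classes)
    return counts.reshape(num_classes, num_classes)


x = torch.tensor([0, 1, 1, 2, 2, 2])
y = torch.tensor([0, 1, 2, 2, 2, 0])
batch_counts = confusion_bincount(x, y, num_classes=3)
print(batch_counts)
```

The per-batch result could then be added to a running matrix on the same device (for example `conf_matrix += batch_counts`), so no tensor has to be moved to the CPU.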

Thanks.