@ptrblck Same issue here. In my case, `index_put_` on CUDA is much slower than on the CPU when calculating the confusion matrix.
Here is the code snippet:
```python
def add_batch(self, x, y):
    x_row = x.reshape(-1)
    y_row = y.reshape(-1)
    idxs = torch.stack([x_row, y_row], dim=0)
    if self.ones is None or self.last_scan_size != idxs.shape[-1]:
        self.ones = torch.ones((idxs.shape[-1]), device=x.device, dtype=torch.long)
        self.last_scan_size = idxs.shape[-1]
    self.conf_matrix = self.conf_matrix.index_put_(tuple(idxs), self.ones, accumulate=True)
```
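One workaround I have been experimenting with (not yet validated on my full pipeline, and `n_classes` here is a hypothetical parameter for the confusion matrix size) is to encode each `(prediction, label)` pair as a single linear index and accumulate with `torch.bincount`, which may avoid the scattered atomic writes that `index_put_(..., accumulate=True)` does on CUDA:

```python
import torch

def add_batch_bincount(x, y, n_classes):
    # Flatten predictions and labels to 1-D.
    x_row = x.reshape(-1)
    y_row = y.reshape(-1)
    # Encode each (pred, label) pair as one linear index in [0, n_classes**2).
    flat = x_row * n_classes + y_row
    # bincount accumulates all counts in a single kernel; reshape back
    # to an (n_classes, n_classes) confusion matrix.
    counts = torch.bincount(flat, minlength=n_classes * n_classes)
    return counts.reshape(n_classes, n_classes)
```

The result can then be added into `self.conf_matrix` in one vectorized operation per batch.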
With the CPU version, it takes around 5 minutes [02:14<03:05, 18.62it/s], while on CUDA it becomes roughly 12x slower [00:16<1:05:09, 1.53it/s].
Since moving tensors from CUDA to the CPU adds extra overhead, any idea for solving this problem would be very helpful for my project.
Thanks.