Is there any non-blocking version of torch.nonzero / aten::nonzero?

I’m creating my own torch.autograd.Function with the torch API. One of the APIs I use is torch.nonzero, and I found it becoming the speed bottleneck since it causes a host-device synchronization that blocks my CPU stream, as noted in the torch.nonzero() docs. I tried using torch.masked_select or indexing with foo[bar] instead, yet I found that all of them call aten::nonzero and cause a sync. The source code of aten::nonzero is here, with line 73 causing a sync when copying the number of nonzero items from device to host via at::cuda::memcpy_and_sync. I’m confused about the necessity of this sync and am wondering whether there is any non-blocking version of nonzero.
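For context, here is a minimal sketch of what I mean (the tensor and threshold are just placeholders for my actual setup); all three variants end up dispatching to aten::nonzero and block the CPU until the GPU has produced the mask:

```python
import torch

x = torch.randn(1_000_000, device="cuda")
mask = x > 0

idx = torch.nonzero(mask)             # syncs: result size is data-dependent
vals = torch.masked_select(x, mask)   # also goes through aten::nonzero
vals2 = x[mask]                       # boolean indexing, same sync
```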

Thanks! : )

No: the output size (needed on the CPU) depends on the input data (on the GPU), and this forces the sync. Depending on your application and the sizes involved, it might be better to work with a mask or similar, as in the sketch below.
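A minimal sketch of the mask-based idea, assuming your downstream computation can tolerate fixed-size tensors (the tensor `x` and the threshold are made-up placeholders): by never materializing a variable-length result, the CPU never needs to know the count, so everything stays asynchronous.

```python
import torch

x = torch.randn(1_000_000, device="cuda")
mask = x > 0

# Variable-size extraction (forces a sync):
# selected = x[mask]

# Fixed-size alternatives that stay on the GPU, no sync:
zeroed = torch.where(mask, x, torch.zeros_like(x))  # unselected entries set to 0
total = (x * mask).sum()                            # reduce over selected entries only
count = mask.sum()                                  # count kept as a GPU tensor
```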

Best regards

Thomas
