I’m implementing my own torch.autograd.Function using the torch API. One of the ops I use is torch.nonzero, and I found it has become the speed bottleneck: it causes a host-device synchronization that blocks my CPU stream, as noted in the torch.nonzero() documentation.
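Here is a minimal sketch of how I’m observing the sync, using torch.cuda.set_sync_debug_mode to flag implicit synchronizations (assuming a CUDA device is available; this is a toy repro, not my actual Function):

```python
import torch

# Flag any operation that implicitly synchronizes the host with the device.
torch.cuda.set_sync_debug_mode("warn")

x = torch.randn(1_000_000, device="cuda")
mask = x > 0

# This line emits a warning about a synchronizing CUDA operation, because
# the number of nonzero elements has to be copied back to the host.
idx = torch.nonzero(mask)
```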
I tried torch.masked_select and indexing with foo[bar] instead, but I found that all of them call aten::nonzero and trigger the same sync, as the profiler run below shows.
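For reference, this is the kind of profiler run I used to check where the time goes (again a minimal sketch, assuming a CUDA device):

```python
import torch
from torch.profiler import profile, ProfilerActivity

x = torch.randn(1_000_000, device="cuda")
mask = x > 0

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    torch.nonzero(mask)           # direct call
    torch.masked_select(x, mask)  # alternative 1
    x[mask]                       # alternative 2: boolean indexing

# In my runs, aten::nonzero shows up in the trace for every variant above.
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))
```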
The source code of aten::nonzero is here; line 73 causes the sync when it copies the number of nonzero items from device to host by calling at::cuda::memcpy_and_sync. I’m confused about why this sync is necessary, and I’m wondering: is there any non-blocking version of nonzero?
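For comparison, a purely elementwise, mask-based formulation like the sketch below stays fully asynchronous, but it keeps the original shape instead of compacting the selected elements, so it doesn’t replace nonzero for my use case:

```python
import torch

x = torch.randn(1_000_000, device="cuda")
mask = x > 0

# Elementwise select: runs fully on-device with no host sync, but the output
# has the same shape as x rather than containing only the selected elements.
y = torch.where(mask, x, torch.zeros_like(x))
```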
Thanks! : )