Synchronize() didn't wait for topk()


#1

Hi~
I start several threads, each with its own CUDA stream, to execute topk() on different tensors that were created on the default stream. The code snippet inside each thread is:

y = x[::stride]  # strided view of x (x was created on the default stream)
values, _ = torch.topk(y, k, 0, largest=True, sorted=False)
i = torch.ge(x, values.min()).nonzero()  # indices of elements >= the k-th largest

However, values.min() frequently returns a garbage value, and i then comes back as an empty tensor, even if I add torch.cuda.current_stream().synchronize().
I had to patch the code as follows:

stream = torch.cuda.current_stream()
y = x[::stride]
values, _ = torch.topk(y, k, 0, largest=True, sorted=False)
num_i, loop_count = 0, 0
# Retry until the result looks sane: i should be non-empty but
# smaller than the whole tensor.
while num_i <= 0 or num_i >= x.numel():
    stream.synchronize()
    loop_count += 1
    if loop_count % 10 == 0:
        # Still failing after 10 tries: recompute topk and start over.
        values, _ = torch.topk(y, k, 0, largest=True, sorted=False)
        num_i = 0
        continue
    i = torch.ge(x, values.min()).nonzero()
    num_i = i.numel()

This works for a while, but then the whole process freezes; there seems to be a deadlock somewhere.
I would like to reproduce the problem with a minimal code snippet, but I cannot reproduce it with simpler code.
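For context, my understanding is that synchronizing the thread's own stream does not order it against the default stream where x was produced, so the topk kernel can read x before it is ready. A sketch of the usual cross-stream ordering pattern (using torch.cuda.Stream.wait_stream and Tensor.record_stream; the helper name and arguments are mine, not from the original code):

```python
import torch

def topk_on_side_stream(x, k, stride):
    """Run topk on a strided slice of x in a side stream, ordered
    against the default stream that produced x."""
    side = torch.cuda.Stream()
    # Make the side stream wait until the default stream has finished
    # producing x before any kernel on `side` reads it.
    side.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(side):
        y = x[::stride]
        values, _ = torch.topk(y, k, 0, largest=True, sorted=False)
        # Tell the caching allocator x is in use on `side`, so its
        # memory is not reused while the kernel is still in flight.
        x.record_stream(side)
    # Before the default stream consumes `values`, make it wait for
    # the side stream's work.
    torch.cuda.current_stream().wait_stream(side)
    return values

if torch.cuda.is_available():
    x = torch.randn(1000, device="cuda")
    values = topk_on_side_stream(x, 5, 2)
    i = torch.ge(x, values.min()).nonzero()
```

I am not certain this matches my real setup, but without the wait_stream calls the behavior above would be exactly a race on x.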


#2

Can anyone help me please?