Asynchronous execution on multiple GPUs

Dear friends, I am using PyTorch for a linear algebra task to accelerate some calculations with GPUs. I have a function that does some calculations with two given tensors, for example A and B. I have created two instances of this function with two pairs of tensors allocated on two different GPUs:

some_fun(Tensor_A1_GPU0, Tensor_B1_GPU0, GPU_0)  # first instance
some_fun(Tensor_A2_GPU1, Tensor_B2_GPU1, GPU_1)  # second instance

As I understand it, PyTorch by default executes CUDA commands asynchronously, but in my case it for some reason waits for the first instance to finish and only then runs the second one, even though the two are completely independent. What could be the cause of this behavior?
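
For reference, here is a minimal self-contained sketch of my setup (some_fun here is just a hypothetical stand-in built from matmuls; my real function is more involved):

import torch

def some_fun(a, b, device):
    # Placeholder workload: a chain of matmuls, all staying on `device`.
    with torch.cuda.device(device):
        c = a @ b
        for _ in range(100):
            c = c @ b
        return c

a0 = torch.randn(4096, 4096, device="cuda:0")
b0 = torch.randn(4096, 4096, device="cuda:0")
a1 = torch.randn(4096, 4096, device="cuda:1")
b1 = torch.randn(4096, 4096, device="cuda:1")

r0 = some_fun(a0, b0, "cuda:0")  # first instance
r1 = some_fun(a1, b1, "cuda:1")  # second instance
torch.cuda.synchronize("cuda:0")  # wait for GPU 0
torch.cuda.synchronize("cuda:1")  # wait for GPU 1

My expectation is that both calls only enqueue kernels, the CPU returns from them almost immediately, and the two GPUs compute concurrently.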

some_fun might be synchronizing with the CPU and thus blocking it from advancing and scheduling the second some_fun call.
You could use torch.cuda.set_sync_debug_mode to check whether any synchronizing calls are made inside some_fun.
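For example, a minimal sketch of how you could use it (the .item() call here is just an illustrative synchronizing op, not a claim about what your some_fun does):

import torch

torch.cuda.set_sync_debug_mode("warn")  # "error" would raise instead of warning

x = torch.randn(1024, 1024, device="cuda:0")
y = x @ x
# .item() copies a scalar back to the CPU and forces a device sync,
# so in "warn" mode this line emits a warning:
val = y.sum().item()

torch.cuda.set_sync_debug_mode("default")  # restore normal behavior

Common implicit synchronization points to look for are .item(), .cpu(), printing a CUDA tensor, and torch.nonzero.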

Thank you, I will try that.