Importing spconv.pytorch slows down torch.multiprocessing.Pool().apply_async()

Hi, when I use spconv-cu116 (2.3.6) together with torch.multiprocessing.Pool().apply_async() on PyTorch 1.13, the multiprocess task takes much longer once spconv is imported: 4.728647470474243 seconds without the import versus 11.85951018333435 seconds with it, while the single-process version takes 6.01546311378479 seconds. These timings come from running the code below on a machine with an NVIDIA GeForce 4090 GPU. Do you have any ideas how to solve this problem? Thanks!

import spconv.pytorch as spconv
import time
import torch  # needed for torch.device and torch.ones below
import torch.multiprocessing as mp

def multipreocess_func(x, y):
    time.sleep(1.5)
    return x, y, x + y

if __name__ == '__main__':
    mp.set_start_method('spawn')
    with mp.Pool(processes=4) as pool:
        pytorch_device = torch.device('cuda:0')
        num = torch.ones(1)
        
        a = time.time()
        res = [pool.apply_async(multipreocess_func, (num, i)) for i in range(4)]
        for x in res:
            # get() blocks until the worker has returned its result
            pam_x, pam_y, pam_xy = x.get()
            print(f"{pam_x},{pam_y},{pam_xy}")
        b = time.time()

        c = time.time()
        for i in range(4):
            pam_x, pam_y, pam_xy = multipreocess_func(num, i)
            print(f"{pam_x},{pam_y},{pam_xy}")
        d = time.time()
        print(f"multiprocess time:{b - a}, singleprocess time:{d - c}")

CUDA operations are executed asynchronously, so could you describe what exactly you are trying to profile in your code? Do you want to profile only the kernel dispatch and launch, or the actual execution time? In the latter case, you would need to synchronize the code before starting and stopping the timers.

Thanks for your reply. Actually, I want to use multiprocessing to calculate the variance in parallel in order to reduce the execution time (see "Algorithms for calculating variance" on Wikipedia). However, when I tried this, it was slower with multiprocessing.
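For reference, the chunk-combining idea from that Wikipedia article can be sketched roughly as follows. This is only a minimal illustration in plain Python, not the code from my project: the names `partial_stats`, `combine`, and `chunked_variance` are made up for this example, and it assumes the data is a non-empty list with at least two values.

```python
from functools import reduce
import statistics


def partial_stats(chunk):
    """Return (count, mean, M2) for one chunk; M2 is the sum of squared deviations."""
    n = len(chunk)
    mean = sum(chunk) / n
    m2 = sum((v - mean) ** 2 for v in chunk)
    return n, mean, m2


def combine(a, b):
    """Merge the statistics of two chunks without revisiting their elements."""
    n_a, mean_a, m2_a = a
    n_b, mean_b, m2_b = b
    n = n_a + n_b
    delta = mean_b - mean_a
    mean = mean_a + delta * n_b / n
    m2 = m2_a + m2_b + delta ** 2 * n_a * n_b / n
    return n, mean, m2


def chunked_variance(data, n_chunks=4, map_fn=map):
    """Sample variance computed from per-chunk statistics.

    map_fn is hypothetical glue: the built-in map runs in-process, and a
    pool's map method could be passed instead to spread chunks over workers.
    """
    size = -(-len(data) // n_chunks)  # ceiling division, so no chunk is empty
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    n, _, m2 = reduce(combine, map_fn(partial_stats, chunks))
    return m2 / (n - 1)


if __name__ == "__main__":
    data = [float(i) for i in range(10)]
    print(chunked_variance(data), statistics.variance(data))
```

Passing `pool.map` from an `mp.Pool` as `map_fn` would hand each chunk to a worker. Whether that beats the single-process version presumably depends on the chunks being large enough to outweigh the cost of starting workers and pickling the data.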

The code above is a tiny demo that I put together after I noticed the timing difference before and after importing spconv.pytorch. Sorry, I’m not quite clear on how to synchronize the code before starting and stopping the timers.
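For anyone reading along, synchronizing around the timers usually means calling `torch.cuda.synchronize()` right before starting and right before stopping the clock. A minimal sketch, assuming a hypothetical helper named `timed` (it is not part of PyTorch) and falling back to the CPU when no GPU is present:

```python
import time

import torch


def timed(fn, *args, **kwargs):
    """Run fn and return (elapsed_seconds, result), synchronizing CUDA around the timers."""
    if torch.cuda.is_available():
        torch.cuda.synchronize()  # finish pending GPU work before starting the clock
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    if torch.cuda.is_available():
        torch.cuda.synchronize()  # wait for the kernels launched by fn to complete
    return time.perf_counter() - start, result


if __name__ == "__main__":
    device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
    x = torch.ones(1000, 1000, device=device)
    elapsed, y = timed(torch.matmul, x, x)
    print(f"matmul took {elapsed:.6f} s on {device}")
```

Note that in the demo above the tensor `num` lives on the CPU and the worker function only sleeps, so synchronization by itself would probably not change those particular timings.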