Loading a CUDA and C++ extension is slow

I am using the load function from torch.utils.cpp_extension to build my extension. Compilation takes a very long time, even after small changes to my .cu files, and when I check the running processes, the build appears to use only one thread.

Is there a way to speed it up? This is my current call:

from torch.utils.cpp_extension import load

mymodule = load(name='mymodule',
                sources=[f'{SOURCE_PATH}/torch_extension/mymodule.cpp', f'{SOURCE_PATH}/torch_extension/mymodule_kernel.cu'])
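To see what the build is actually doing, the same call can be made with verbose=True, which prints the build steps as they run. The sketch below is only an illustration: the helper names extension_sources and load_verbose are hypothetical, and it assumes the ninja backend reads the MAX_JOBS environment variable to decide how many compile jobs to run. With only two source files, a single changed .cu file may still compile in one job whatever that value is.

```python
import os
from pathlib import Path


def extension_sources(source_path):
    """Return the two source files used in the call above (hypothetical helper)."""
    base = Path(source_path) / "torch_extension"
    return [str(base / "mymodule.cpp"), str(base / "mymodule_kernel.cu")]


def load_verbose(source_path, max_jobs=None):
    """Build the extension with build output printed (hypothetical helper)."""
    # Imported lazily so extension_sources() works without PyTorch installed.
    from torch.utils.cpp_extension import load

    if max_jobs is not None:
        # Assumption: the ninja backend honors MAX_JOBS for its worker count.
        os.environ["MAX_JOBS"] = str(max_jobs)

    return load(
        name="mymodule",
        sources=extension_sources(source_path),
        verbose=True,  # print the compile and link steps as they run
    )
```

Calling load_verbose(SOURCE_PATH, max_jobs=8) would then show whether the time goes into compiling the .cu file itself or into something else, such as both files being rebuilt every time.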