Is there a danger when doing DDP that different processes compiling kernels will overwrite each other?

I’m in a situation with dynamic shapes (a limited number, but more than one), so I’m wondering what happens when training on many GPUs. If different processes write out kernels to the same location, could that cause issues?

Would it be better for each process to use a different local disk location for its kernels (rather than setting PYTORCH_KERNEL_CACHE_PATH to the same path for all processes)? Something like the sketch below is what I have in mind.
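A minimal sketch of what I mean, assuming PYTORCH_KERNEL_CACHE_PATH is picked up before the first kernel compilation (I set it before importing torch to be safe). The base directory is just an example, and LOCAL_RANK is the per-node rank set by torchrun:

```python
import os

# Give each DDP process its own kernel cache directory so concurrent
# compilations can't write to the same files. LOCAL_RANK is set by
# torchrun; the base path here is arbitrary.
local_rank = int(os.environ.get("LOCAL_RANK", "0"))
cache_dir = f"/tmp/torch_kernel_cache/rank_{local_rank}"
os.makedirs(cache_dir, exist_ok=True)
os.environ["PYTORCH_KERNEL_CACHE_PATH"] = cache_dir

import torch  # imported after the env var is set so the cache path applies
```

The downside I see is that every rank would recompile the same kernels instead of sharing one cache, so I'd only want to do this if concurrent writes to a shared cache are actually unsafe.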